Last Updated Aug 01, 2023

Web Scraping with PHP Libraries

Shyam Purkayastha


In this blog post, we show you how to scrape a web page using PHP. There are several ways to achieve web scraping with PHP: you can use built-in PHP facilities, leverage a PHP web scraping library, or rely on a web scraping service. Each approach has its pros and cons.

Let’s explore the various considerations for web scraping in PHP in detail.


Considerations for Web Scraping using PHP

For scraping a single web page, PHP offers the built-in cURL extension. It is one of the most popular HTTP clients out there and is enabled by default in most PHP installations.

However, most web scraping tasks are not limited to a single page. A typical web scraping project entails scraping an entire website, which requires crawling all of its pages and spawning additional scraping requests to fetch each one.

This brings in one more consideration: the load on the server. Performing web scraping directly within PHP code incurs a noticeable performance overhead, since crawling a large website and scraping every page is an involved activity. Therefore, it is sometimes better to offload the web scraping chores to a third-party service.

Based on the above concerns, there are three approaches to achieving web scraping using PHP:

  1. Using a Built-in Library: For simple, one-off webpage scraping tasks, it is best to use the built-in PHP HTTP client module. This is the simplest option and is ideally suited when crawling the website is not a requirement.
  2. Using a Scraping Library: A web scraping library is specifically designed for crawling websites. It helps identify the internal links within web pages and streamlines filtering the HTML contents of crawled URLs. For web scraping tasks involving moderate to large websites, this is a good option.
  3. Using a Web Scraping API: If your PHP application has to undertake large-scale web scraping projects, continuously extracting web pages from within the PHP code creates significant performance overhead and degrades server performance when the application must also handle user requests. In this case, it is better to leverage an external service for scraping the web pages.

We will show you how to perform web scraping using each of these approaches, along with sample code in the form of PHP scripts. If you want to follow along, ensure that you have the PHP 8 runtime available in your development environment.

Approach 1: Scraping a Web Page using cURL

cURL is a widely used PHP extension for performing web-based operations, including calling APIs and fetching web pages. It ships as a PHP module and is enabled by default in most PHP runtimes.

You can create a simple PHP script to scrape a web page using cURL. Create a new PHP file named php_curl.php and add the following code:



<?php

$GLOBALS['env'] = "production"; // set to "development" for verbose output

if( $GLOBALS['env'] !== "development" )
{
	// production: report only serious errors and keep them out of the output
	error_reporting(E_ERROR | E_WARNING | E_PARSE);
	@ini_set("display_errors", 0);
}
else
{
	// development: report and display all errors
	error_reporting(E_ALL);
	@ini_set("display_errors", 1);
}

function get_env()
{
	return $GLOBALS['env'] == 'development';
}

function _is_curl_installed()
{
	return in_array('curl', get_loaded_extensions());
}

function init_curl($url)
{
	$response = new stdClass();
	$response->status = 0;
	$response->error_message = '';
	$response->output = '';

	// check curl extension is enabled or not
	if( _is_curl_installed() )
	{
		$url_valid = filter_var($url, FILTER_VALIDATE_URL);
		if( $url_valid )
		{
			if( get_env() )
			{
				echo "\n URL ".$url." is valid \n";
			}
		}
		else
		{
			$response->status = 0;
			$response->error_message = $url. " is not a valid url.";
			$response->output = '';
			return $response;
		}

		// Initialize curl
		$ch = curl_init();

		// set curl options, including the URL to scrape
		curl_setopt_array($ch, array(
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,   // return the response as a string
            CURLOPT_ENCODING => '',           // accept any supported content encoding
            CURLOPT_MAXREDIRS => 10,
            CURLOPT_TIMEOUT => 0,             // no timeout
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
        ));
		 
		$output = curl_exec($ch);

		// get server status code
		$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

		if($output === false)
		{
			if( get_env() )
			{
				echo "\nCurl error no: " . curl_errno($ch) . "\n";
				echo "Curl error: " . curl_error($ch) . "\n";
			    echo "Curl error description: " . curl_strerror(curl_errno($ch)) . "\n";
			}

		    $response->status = $httpcode;
			$response->error_message = curl_strerror(curl_errno($ch));
			$response->output = '';
		}
		else
		{
		    $response->status = $httpcode;
			$response->error_message = '';
			$response->output = $output;
		} 
		// Closing cURL
		curl_close($ch);

		return $response;
	}
	else
	{
		if( get_env() )
		{
			echo "\nCurl is not enabled. Please enable the curl extension in php.ini.\n";
			echo "Change ;extension=curl to extension=curl in your php.ini and try again.\n";
		}
		$response->status = 0;
		$response->error_message = "Curl is not enabled";
		$response->output = '';
		return $response;
	}
}


// Start

// check arguments
if( $argc == 2 )
{
	$url = $argv[1];
}
else
{
	echo "Could not get URL from command line option\n";
	echo "Usage: php php_curl.php http://www.example.com\n";
	die();
}

// call curl request
$response = init_curl($url);

// check response for errors.
if( $response->status == 200 )
{
	// output the response
	echo $response->output;
}
else
{
	if( $response->status != 0 )
	{
		echo "Server HTTP Status : ".$response->status." \n";
	}

	if( $response->output != "" )
	{
		echo $response->output;	
	}
	if( $response->error_message != "" )
	{
		echo "Server Response : ".$response->error_message;
	}
}

die();
?>

This script takes the URL to be scraped as an argument and calls the init_curl() function, which handles the scraping task. Inside it, the two library functions curl_init() and curl_exec() are called with appropriate parameters to send an HTTP request for fetching the content of the web page pointed to by the URL.

Additional error handling is included to validate the URL format and check the response status. Checks are also added at the beginning to make sure the cURL PHP module is installed. If your PHP environment does not have it enabled, you can enable it from the php.ini config under your PHP installation directory.
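One quick way to confirm that the module is loaded is to list PHP's modules from the command line (on Unix-like shells with grep available):

php -m | grep -i curl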

Here is how you can invoke the script from a terminal, passing the target URL as the only argument (this mirrors the usage message printed by the script):
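
php php_curl.php http://www.example.com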

And the scraped web page is displayed in the terminal.

Approach 2: Scraping Web Pages using Goutte

Goutte is a popular, open-source PHP web scraping library. Using Goutte, you can crawl an entire website and define filters to scan and extract specific web page content.

Check out the official GitHub repository for Goutte to install the library under your PHP environment. 
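If you manage dependencies with Composer, the installation is typically a single command (fabpot/goutte is the package name published on Packagist):

composer require fabpot/goutte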

Create a PHP file named php_goutte.php and add the following code:



<?php
require 'vendor/autoload.php';
use Goutte\Client;
use Symfony\Component\HttpClient\HttpClient;

$GLOBALS['env'] = "production"; // set to "development" for verbose output

if( $GLOBALS['env'] !== "development" )
{
	// production: report only serious errors and keep them out of the output
	error_reporting(E_ERROR | E_WARNING | E_PARSE);
	@ini_set("display_errors", 0);
}
else
{
	// development: report and display all errors
	error_reporting(E_ALL);
	@ini_set("display_errors", 1);
}

function get_env()
{
	return $GLOBALS['env'] == 'development';
}

function _is_goutte_installed()
{
	try{
		$client = new Client();
	}
	catch(Exception $e)
	{
		if( get_env() )
		{
			echo "\n Goutte Library is not installed.\n";
		}
		return false;
	}
	return true;
}

function goutte_request($url)
{
	$response = new stdClass();
	$response->status = 400; // Set Status to Client Error.
	$response->error_message = '';
	$response->output = '';

	// check that the Goutte client library is available
	if( _is_goutte_installed() )
	{
		$url_valid = filter_var($url, FILTER_VALIDATE_URL);
		if( $url_valid )
		{
			if( get_env() )
			{
				echo "\n URL ".$url." is valid \n";
			}
		}
		else
		{
			$response->error_message = $url. " is not a valid url.";
			$response->output = '';
			return $response;
		}

		// Initialize Client
		$client = new Client(HttpClient::create(['timeout' => 60]));
		 
		try{
			$client->request('GET', $url);
			$responseObject = $client->getResponse();
			$http_status = $responseObject->getStatusCode();

			if( is_object($responseObject) )
			{
				$output = $responseObject->getContent();
			}
			else
			{
				$output = '';
			}
			$response->status = $http_status;
			$response->error_message = '';
			$response->output = $output;

			return $response;
		}
		catch(Exception $e)
		{
			$response->error_message = $e->getMessage();
			$response->output = '';
		}
		return $response;
	}
	else
	{
		if( get_env() )
		{
			echo "\nGoutte\Client Library not found.\n";
		}
		$response->error_message = "Goutte library not found.";
		$response->output = '';
		return $response;
	}
}


// Start

// check arguments
if( $argc == 2 )
{
	$url = $argv[1];
}
else
{
	echo "Error Occured\n";
	echo "Could not get URL from command line option\n";
	echo "Usage: php php_goutte.php http://www.example.com\n";
	die();
}

// call goutte request
$response = goutte_request($url);

// check response for errors.
if( $response->status == 200 )
{
	// output the response
	echo $response->output;
}
else
{
	if( $response->status != 0 )
	{
		echo "Server HTTP Status : ".$response->status." \n";
	}

	if( $response->output != "" )
	{
		echo $response->output;	
	}
	if( $response->error_message != "" )
	{
		echo "Server Response : ".$response->error_message;
	}
}

die();
?>

This code imports the Goutte Client and initializes it with a Symfony HttpClient to issue a GET request to the URL. The URL to be scraped is passed as a command-line argument.

Save the file and invoke the PHP script the same way, passing the URL to scrape as the only argument:
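
php php_goutte.php http://www.example.com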

And you get a response containing the URL content.

The above script is a very simple demonstration of web scraping using Goutte. However, Goutte can do much more: it has intelligent screen-scraping features for navigating to specific links on a web page, and it can scrape data by filtering HTML DOM elements and attributes. It also supports form submission, so it is possible to use it on websites where the content sits behind an authentication wall. A short sketch of the filtering capability follows.
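As a brief illustration, the following sketch (assuming Goutte is installed via Composer as above) fetches a page and extracts the text of every h1 heading and the href of every link. The target URL and CSS selectors are placeholders to adapt to your own site; filter() and each() come from the Symfony DomCrawler component that Goutte's request() returns:

<?php
require 'vendor/autoload.php';
use Goutte\Client;

$client = new Client();

// request() returns a Symfony DomCrawler instance for the fetched page
$crawler = $client->request('GET', 'http://www.example.com');

// extract the text of every <h1> element on the page
$headings = $crawler->filter('h1')->each(function ($node) {
	return $node->text();
});

// extract the href attribute of every link on the page
$links = $crawler->filter('a')->each(function ($node) {
	return $node->attr('href');
});

print_r($headings);
print_r($links);
?>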


Approach 3: Scraping Web Pages using Abstract Web Scraping API

For large-scale scraping projects, it is better to leverage a service that offers a proxy to distribute the scraping requests globally. Abstract Web Scraping API is one of the most reliable options for this purpose. It supports millions of proxies and IP addresses from across the globe and offers customizable extraction options.

To access this API, sign up for a free Abstract account to get access to all the APIs. Once logged in, you can access the Web Scraping API from the dashboard.

Once you access the Web Scraping API console, you can see your primary API key.

This is a unique key generated by Abstract for your account. Make a note of this key. You can try the live test to see how the API responds after extracting data.
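For a quick sanity check outside of PHP, you can also call the endpoint from a terminal with cURL. This targets the same endpoint that the script below uses; replace the placeholder with your own key:

curl "https://scrape.abstractapi.com/v1/?api_key=<YOUR_ABSTRACTAPI_KEY>&url=http://www.example.com"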

Let's use this API inside a PHP script. For this, create a new PHP file named php_abstract.php and add the following code:



<?php

$GLOBALS['env'] = "production"; // set to "development" for verbose output

if( $GLOBALS['env'] !== "development" )
{
	// production: report only serious errors and keep them out of the output
	error_reporting(E_ERROR | E_WARNING | E_PARSE);
	@ini_set("display_errors", 0);
}
else
{
	// development: report and display all errors
	error_reporting(E_ALL);
	@ini_set("display_errors", 1);
}

function get_env()
{
	return $GLOBALS['env'] == 'development';
}

function _is_curl_installed()
{
	return in_array('curl', get_loaded_extensions());
}

function init_curl($url)
{
	$response = new stdClass();
	$response->status = 0;
	$response->error_message = '';
	$response->output = '';

	// check curl extension is enabled or not
	if( _is_curl_installed() )
	{
		$url_valid = filter_var($url, FILTER_VALIDATE_URL);
		if( $url_valid )
		{
			if( get_env() )
			{
				echo "\n URL ".$url." is valid \n";
			}
		}
		else
		{
			$response->status = 0;
			$response->error_message = $url. " is not a valid url.";
			$response->output = '';
			return $response;
		}

		// Initialize curl
		$ch = curl_init();

		// Build the Abstract API URL for the scraping request;
		// URL-encode the target so its own query string survives intact
		$url = "https://scrape.abstractapi.com/v1/?api_key=<YOUR_ABSTRACTAPI_KEY>&url=".urlencode($url);

		// set curl options, including the API URL to call
		curl_setopt_array($ch, array(
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,   // return the response as a string
            CURLOPT_ENCODING => '',           // accept any supported content encoding
            CURLOPT_MAXREDIRS => 10,
            CURLOPT_TIMEOUT => 0,             // no timeout
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
        ));
		 
		$output = curl_exec($ch);

		// get server status code
		$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

		if($output === false)
		{
			if( get_env() )
			{
				echo "\nCurl error no: " . curl_errno($ch) . "\n";
				echo "Curl error: " . curl_error($ch) . "\n";
			    echo "Curl error description: " . curl_strerror(curl_errno($ch)) . "\n";
			}

		    $response->status = $httpcode;
			$response->error_message = curl_strerror(curl_errno($ch));
			$response->output = '';
		}
		else
		{
		    $response->status = $httpcode;
			$response->error_message = '';
			$response->output = $output;
		} 
		// Closing cURL
		curl_close($ch);

		return $response;
	}
	else
	{
		if( get_env() )
		{
			echo "\nCurl is not enabled. Please enable the curl extension in php.ini.\n";
			echo "Change ;extension=curl to extension=curl in your php.ini and try again.\n";
		}
		$response->status = 0;
		$response->error_message = "Curl is not enabled";
		$response->output = '';
		return $response;
	}
}


// Start

// check arguments
if( $argc == 2 )
{
	$url = $argv[1];
}
else
{
	echo "Could not get URL from command line option\n";
	echo "Usage: php php_curl.php http://www.example.com\n";
	die();
}

// call curl request
$response = init_curl($url);

// check response for errors.
if( $response->status == 200 )
{
	// output the response
	echo $response->output;
}
else
{
	if( $response->status != 0 )
	{
		echo "Server HTTP Status : ".$response->status." \n";
	}

	if( $response->output != "" )
	{
		echo $response->output;	
	}
	if( $response->error_message != "" )
	{
		echo "Server Response : ".$response->error_message;
	}
}

die();
?>

Before saving this file, be sure to replace <YOUR_ABSTRACTAPI_KEY> with the primary API key shown in your Abstract Web Scraping API console.

This code is very similar to the earlier approach of using cURL. However, the key difference here is that cURL is used to make HTTP requests to the Abstract Web Scraping API instead of directly fetching the content from the URL. 

This is an important consideration for a real-world scraping project, because this approach ensures that the scraping requests are processed by proxies maintained by Abstract API instead of exposing the IP address of the server from which the PHP cURL request is sent.

You can run this script in the same way as the earlier cURL approach, passing the target URL as the only argument:
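
php php_abstract.php http://www.example.com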

Now, the Abstract Web Scraping API does the heavy lifting of scraping the URL, and the cURL library captures the API response, which is finally displayed by the script.

FAQs

Can PHP be used for web scraping?

Yes, PHP can be used for web scraping. As a programming language for the web, PHP offers a few options for scraping a web page. For small scraping tasks, developers can use the built-in cURL library. PHP also has a rich ecosystem of web scraping libraries, such as Goutte, which offer intelligent options for crawling the internal links of a website to scrape selected content. You can also use these libraries in conjunction with an external scraping service, such as the Abstract Web Scraping API, to distribute web scraping requests across millions of proxies.

How to scrape a website using PHP?

PHP offers built-in support for web scraping. Using the cURL module, developers can pass URL arguments and extract the contents of a web page. However, note that this is a very basic form of web scraping that does not scale well. For scraping entire websites, there are better options than cURL, such as Goutte, a web scraping library designed to crawl an entire website. Additionally, you can use an external service, like the Abstract Web Scraping API, to distribute the scraping requests across millions of IP addresses.

How to get data from another website in PHP?

You can extract data from another website using one of the PHP web scraping libraries. For extracting the entire content of a web page, a cURL request does an excellent job. If the requirement is to filter the data based on certain conditions, such as HTML tags, you can use one of the PHP libraries specifically designed for web scraping; Goutte is a popular choice. Additionally, you can use an external web scraping service, such as the Abstract Web Scraping API. With this API, you can spread scraping requests across many proxies and IP addresses to prevent the other website from blocking your PHP server's IP address due to too many scraping requests.


Shyam Purkayastha
Shyam Purkayastha is a proficient web developer and PHP maestro, renowned for his expertise in API creation and integration. His deep knowledge of PHP scripting and backend development enables him to build scalable server-side applications. Shyam is particularly passionate about RESTful API design, significantly enhancing web service functionality and data interoperability.