Getting web page contents using PHP

1. Getting web page contents.

Let $url be the URL of the web page we want to retrieve. Then the natural way of getting the web page contents is:

$fd = fopen($url, "rb");
$buffer = ""; // initialise the buffer before appending to it
while (!feof($fd))
    $buffer .= fgets($fd, 4096);
fclose($fd);
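
The loop above assumes that fopen succeeds. A minimal sketch of the same read loop with a failure check might look like the following. The function name FetchContents is hypothetical, and reading an HTTP address this way assumes that allow_url_fopen is enabled in the PHP configuration.

```php
<?php
// Hypothetical helper (not part of the original article):
// returns the contents found at $url, or false if it cannot be opened.
// fopen() also accepts local paths, so the loop can be tried offline.
function FetchContents($url)
{
    $fd = @fopen($url, "rb");
    if ($fd === false)
        return false;           // could not open the resource
    $buffer = "";
    while (!feof($fd))
    {
        $chunk = fgets($fd, 4096);
        if ($chunk === false)
            break;              // read error or end of file
        $buffer .= $chunk;
    }
    fclose($fd);
    return $buffer;
}
```

Returning false rather than an empty string lets the caller tell a failed request apart from an empty page.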

However, versions of PHP prior to 4.0.5 do not handle HTTP redirects. Because of this, URLs that point to directories must include a trailing slash. There are two choices. Either write

if (substr($url, -1) != '/') $url .= "/";

or use the fsockopen function to connect to the web server, send the request, check for a "Location …" line in the response header, and repeat. See the sample in the article "Retrieving web page contents handling HTTP redirects".
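
The "check for a Location line" step can be sketched without a network connection. The function below, whose name FindRedirectLocation is hypothetical, takes the raw response header block as it would be read from an fsockopen socket and returns the redirect target, or false if the response is not a redirect. It is a simplified sketch: it assumes HTTP/1.x header lines separated by \r\n and does not resolve relative Location values.

```php
<?php
// Hypothetical helper (not part of the original article).
function FindRedirectLocation($strHeaders)
{
    $aLines = explode("\r\n", $strHeaders);
    // The status line looks like "HTTP/1.1 301 Moved Permanently".
    if (!preg_match('/^HTTP\/\d\.\d\s+(\d{3})/', $aLines[0], $aMatch))
        return false;
    $nStatus = (int)$aMatch[1];
    if ($nStatus < 300 || $nStatus >= 400)
        return false;           // not a redirect status
    foreach ($aLines as $strLine)
    {
        // Header names are case-insensitive.
        if (preg_match('/^Location:\s*(\S+)/i', $strLine, $aMatch))
            return $aMatch[1];
    }
    return false;               // 3xx response without a Location header
}
```

A caller would repeat the request against the returned URL until the function returns false, ideally with a cap on the number of hops to avoid redirect loops.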

2. Building the list of all links at the web page.

Here we have to use the power of regular expressions.

preg_match_all(
    "/<a[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>/isU",
    $strPageText,
    $aUrls);

Now the array $aUrls[1] contains all the links found in $strPageText.
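
As an illustration, here is the expression applied to a small, made-up HTML fragment. The sample markup is an assumption chosen to show two things: attribute matching is case-insensitive, and the fragment part after "#" is left out of the captured URL.

```php
<?php
// Made-up sample page; the URLs are illustrative only.
$strPageText = '<p><a href="http://example.com/a.html">A</a> and '
             . "<A HREF='/b.html#top' class=\"x\">B</A></p>";
preg_match_all(
    "/<a[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>/isU",
    $strPageText,
    $aUrls);
// $aUrls[0] holds the full tag matches, $aUrls[1] the captured URLs.
foreach ($aUrls[1] as $strUrl)
    echo $strUrl, "\n";
```

This prints http://example.com/a.html and /b.html, one per line. The second link is relative, so it still has to be combined with the URL of the page it was found on.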

3. Creating a class to collect all the links in a given web site, starting from a given URL.

class CLinkScanner
{
    var $aUrlsToProcess = array();
    /* $aUrlsToProcess is an associative array of URLs not yet scanned for links.
       If $page is still to be processed, $aUrlsToProcess[$page] = true */
    var $aProcessedUrls = array();
    /* $aProcessedUrls is an associative array of URLs already scanned for links.
       If $url is already processed, $aProcessedUrls[$url] = true */
    var $strSiteBaseUrl;
    /* The algorithm won't process URLs which don't begin with $strSiteBaseUrl. */

    /*
    Function RetrieveLinks scans $strPageText for links.
    If new links are found, they are added to $aUrlsToProcess.
    */
    function RetrieveLinks($strPageText, $strBaseUrl)
    {
        preg_match_all(
            "/<a[^>]*HREF[^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[ \"'>\r\n\t#>][^>]*>/isU",
            $strPageText,
            $aUrls);
        foreach ($aUrls[1] as $strUrl)
        {
            $strUrl = trim($strUrl);
            // skipping email addresses
            if (substr($strUrl, 0, 7) == "mailto:") continue;
            // skipping javascript code
            if (substr($strUrl, 0, 11) == "javascript:") continue;
            // if $strUrl is not an absolute URL, prepending the current web page URL
            if (substr($strUrl, 0, 7) != "http://" &&
                substr($strUrl, 0, 8) != "https://")
            {
                if ($strBaseUrl[strlen($strBaseUrl)-1] != '/' && $strUrl[0] != '/')
                    $strUrl = $strBaseUrl.'/'.$strUrl;
                else
                    $strUrl = $strBaseUrl.$strUrl;
            }
            /* If $strUrl points outside of the web site, skip it. */
            if (strlen($strUrl) < strlen($this->strSiteBaseUrl) ||
                substr($strUrl, 0, strlen($this->strSiteBaseUrl)) !=
                $this->strSiteBaseUrl) continue;

            /* If web page $strUrl has not been scanned for links yet, adding
               it to the list of not yet processed URLs. */
            if (isset($this->aProcessedUrls[$strUrl]) == false)
                $this->aUrlsToProcess[$strUrl] = true;
        }
    }

    /* Now, creating a function which will repeatedly call
       RetrieveLinks until the list of URLs to be processed is empty. */
    function Start()
    {
        do
        {
            // getting the first URL from the list of URLs to be processed
            reset($this->aUrlsToProcess);
            $strUrl = key($this->aUrlsToProcess);
            // removing that URL from the list of URLs to be processed
            unset($this->aUrlsToProcess[$strUrl]);
            // adding that URL to the list of already processed URLs
            $this->aProcessedUrls[$strUrl] = true;

            /* Here using the CDWHttpFile class to retrieve the web page with URL $strUrl.
               You can see the CDWHttpFile source code in the article
               "Retrieving web page contents handling HTTP redirects". */
            $httpFile = new CDWHttpFile($strUrl);
            if ($httpFile->bResult == true) // if the web page is retrieved
            {
                /* In case we got to another URL because of an HTTP redirect,
                   adding the new URL to the list of processed URLs, and removing it
                   (if it exists there) from the list of URLs to be processed. */
                $strUrl = $httpFile->strLocation;
                $this->aProcessedUrls[$strUrl] = true;
                unset($this->aUrlsToProcess[$strUrl]);
                // Finally, retrieving links
                $this->RetrieveLinks($httpFile->strFile, $httpFile->strLocation);
            }
        // Repeating until the list of URLs to be processed is empty.
        } while (count($this->aUrlsToProcess) != 0);
    }

    /* Finishing up, writing a function which will start the whole process. */
    function Process($strBaseUrl, $strEntryUrl) // starting from $strEntryUrl
    {
        $this->strSiteBaseUrl = $strBaseUrl;
        // Adding the entry point to the list of URLs to be processed.
        $this->aUrlsToProcess[$strEntryUrl] = true;
        $this->Start(); // Starting the link retrieval process.
    }
}
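
To see how the two arrays evolve without touching the network, the same worklist loop can be run against an in-memory link graph. In this sketch the function name CrawlGraph and the array $aGraph (page URL => list of links found on that page) are hypothetical stand-ins for the page retrieval and link extraction steps of the class. They are not part of the original code.

```php
<?php
// Hypothetical stand-in for the crawl loop: pages are looked up in $aGraph
// instead of being downloaded.
function CrawlGraph($aGraph, $strEntryUrl)
{
    $aUrlsToProcess = array($strEntryUrl => true);
    $aProcessedUrls = array();
    do
    {
        // take the first pending URL and mark it as processed
        reset($aUrlsToProcess);
        $strUrl = key($aUrlsToProcess);
        unset($aUrlsToProcess[$strUrl]);
        $aProcessedUrls[$strUrl] = true;

        // "retrieve" the page: look its links up in the graph
        $aLinks = isset($aGraph[$strUrl]) ? $aGraph[$strUrl] : array();
        foreach ($aLinks as $strLink)
        {
            // queue only pages that have not been scanned yet
            if (!isset($aProcessedUrls[$strLink]))
                $aUrlsToProcess[$strLink] = true;
        }
    } while (count($aUrlsToProcess) != 0);
    return array_keys($aProcessedUrls);
}

// Made-up site with cycles between its pages.
$aGraph = array(
    "/"       => array("/about", "/news"),
    "/about"  => array("/", "/news"),
    "/news"   => array("/news/1", "/about"),
    "/news/1" => array("/"),
);
print_r(CrawlGraph($aGraph, "/"));
```

Each page is visited exactly once, in the order "/", "/about", "/news", "/news/1", even though the pages link back to each other. Using the URL as the array key is what keeps a page from being queued twice.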

4. Usage.

Now all we have to do is to execute the following code:

$pageScanner = new CLinkScanner();
$pageScanner->Process("http://www.domain-name.com", "http://www.domain-name.com/sub-domain/");
foreach ($pageScanner->aProcessedUrls as $strUrl => $bTrue)
    echo "$strUrl\n";

This retrieves all the links from the web site http://www.domain-name.com, starting from http://www.domain-name.com/sub-domain/, and then prints them.

All material is copyrighted by chrisranjana.com. If you want to link to this article you are welcome to do so. Unauthorized publication is strictly prohibited.