TCT: Crawl a Website

Code
Feb 18, 2010
Tags: PHP, Thursday Code Tip, and Tip

DISCLAIMER: I do not condone doing this. There are better, more legal ways to get content from someone. But sometimes your boss asks this of you. DO NOT STEAL CONTENT.

For this week's Thursday Code Tip, I will show how to use PHP to crawl a website to gather content. First, we start by selecting the URL to crawl:

$sURL = "http://www.defvayne23.com/";

Next we get the content of the page:

$sContent = file_get_contents($sURL);
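One thing to watch: file_get_contents() returns false when the request fails, and fetching a URL this way needs allow_url_fopen enabled in php.ini. Below is a minimal sketch of checking for that. The helper name fetchPage, the 10 second timeout, and the user agent string are my own assumptions for illustration, not something the tip depends on.

```php
<?php
// Hypothetical helper: returns the page content, or null on failure.
function fetchPage($sURL)
{
    // HTTP context options; these only apply to http:// and https:// URLs.
    $rContext = stream_context_create(array(
        'http' => array(
            'timeout'    => 10,                   // assumed: give up after 10 seconds
            'user_agent' => 'ExampleCrawler/1.0', // assumed: placeholder user agent
        ),
    ));

    // @ hides the PHP warning; we check the return value instead.
    $sContent = @file_get_contents($sURL, false, $rContext);

    return ($sContent === false) ? null : $sContent;
}
```

You would call it as $sContent = fetchPage($sURL); and stop if it comes back null.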

Now we use a regular expression (regex) to get what we want. You can learn patterns here. Below we search for the text within an H1 tag.

$sPattern = '/<h1>([a-z0-9\s]+)<\/h1>/i';
preg_match($sPattern, $sContent, $aMatches);
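It can help to look at what preg_match() hands back before trusting $aMatches. Here is a rough sketch using a made-up HTML string in place of a downloaded page (the sample markup is an assumption, not the real site):

```php
<?php
// Assumed sample markup, standing in for a downloaded page.
$sContent = '<html><body><h1>Hello World</h1></body></html>';

$sPattern = '/<h1>([a-z0-9\s]+)<\/h1>/i';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
$iFound = preg_match($sPattern, $sContent, $aMatches);

if ($iFound === 1) {
    $sHeader = $aMatches[1]; // the text captured by the parentheses
} else {
    $sHeader = null;         // nothing matched, or the pattern was invalid
}
```

Note that the character class only allows letters, digits, and whitespace, so a heading containing punctuation would not match this pattern.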

The above won’t return anything because I link all my H1s. So let’s modify it so it accounts for the links but does not capture them.

$sPattern = '/<h1><a [^>]+>([a-z0-9\s]+)<\/a><\/h1>/i';

Now that we account for the anchor, running preg_match again with the new pattern should return:

Array
(
    [0] => <h1><a ...>Defvayne23</a></h1>
    [1] => Defvayne23
)

The first element of the array is the full HTML that matched, including the h1 and anchor tags. The second is just the text that we were looking for.
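If you dump the matches in a browser, the tags in the first element get rendered instead of shown. A small sketch of escaping the dump so the tags stay visible; the sample markup and its href value are assumptions for illustration:

```php
<?php
// Assumed sample markup with a linked H1.
$sContent = '<h1><a href="/">Defvayne23</a></h1>';
$sPattern = '/<h1><a [^>]+>([a-z0-9\s]+)<\/a><\/h1>/i';

preg_match($sPattern, $sContent, $aMatches);

// print_r() with true returns the dump as a string instead of printing it.
// htmlspecialchars() turns < and > into entities so a browser shows the tags.
$sDump = htmlspecialchars(print_r($aMatches, true));

echo '<pre>' . $sDump . '</pre>';
```

On the command line the escaping is not needed; print_r($aMatches) alone shows the raw tags.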

Here it is all together:

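// Page to crawl
$sURL = "http://www.defvayne23.com/";
// Download the page's HTML
$sContent = file_get_contents($sURL);
// Capture the text of a linked H1
$sPattern = '/<h1><a [^>]+>([a-z0-9\s]+)<\/a><\/h1>/i';
preg_match($sPattern, $sContent, $aMatches);
// $aMatches[1] holds the captured header text
$sHeader = $aMatches[1];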
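preg_match() stops at the first hit. If a page has several linked H1s and you want them all, preg_match_all() is the usual swap. A minimal sketch, assuming the same pattern; the helper name getLinkedHeaders and the sample markup are hypothetical:

```php
<?php
// Hypothetical helper: returns the text of every linked H1 in $sContent.
function getLinkedHeaders($sContent)
{
    $sPattern = '/<h1><a [^>]+>([a-z0-9\s]+)<\/a><\/h1>/i';

    // preg_match_all() returns the number of full matches, or false on error.
    if (!preg_match_all($sPattern, $sContent, $aMatches)) {
        return array();
    }

    // With the default PREG_PATTERN_ORDER, $aMatches[1] lists every capture of group 1.
    return $aMatches[1];
}

// Assumed sample markup with two linked headings.
$sSample  = '<h1><a href="/one">First Post</a></h1><p>text</p>'
          . '<h1><a href="/two">Second Post</a></h1>';
$aHeaders = getLinkedHeaders($sSample);
```

Here $aHeaders ends up holding the two heading texts in page order, and an empty array when nothing matches.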