PHP: Find and extract all links from an HTML string.

This is a PHP tutorial on how to extract all links and their anchor text from an HTML string. In this guide, I will show you how to fetch the HTML content of a web page and then extract the links from it. To do this, we will be using PHP’s DOMDocument class.

Let’s jump right in and take a look at a simple example:

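//Get the page's HTML source using file_get_contents.
$html = file_get_contents('https://en.wikipedia.org');

//file_get_contents returns false if the request fails.
if($html === false){
    exit('Could not fetch the page.');
}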

//Instantiate the DOMDocument class.
$htmlDom = new DOMDocument;

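//Parse the HTML of the page using DOMDocument::loadHTML.
//Real-world HTML is rarely valid, so we tell libxml to collect
//parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$htmlDom->loadHTML($html);
libxml_clear_errors();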

//Extract the links from the HTML.
$links = $htmlDom->getElementsByTagName('a');

//Array that will contain our extracted links.
$extractedLinks = array();

//Loop through the DOMNodeList.
//We can do this because the DOMNodeList object is traversable.
foreach($links as $link){

    //Get the link text.
    $linkText = $link->nodeValue;
    //Get the link in the href attribute and trim any surrounding whitespace.
    $linkHref = trim($link->getAttribute('href'));

    //If the link is empty, skip it and don't
    //add it to our $extractedLinks array
    if($linkHref === ''){
        continue;
    }

    //Skip if it is a fragment / anchor link, i.e. one that starts with #.
    if($linkHref[0] === '#'){
        continue;
    }

    //Add the link to our $extractedLinks array.
    $extractedLinks[] = array(
        'text' => $linkText,
        'href' => $linkHref
    );

}

//var_dump the array for example purposes
var_dump($extractedLinks);

In the code above:

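  1. We sent a GET request to a given web page using PHP’s file_get_contents function. This function returns the HTML source of the URL as a string, or false if the request fails. Note that fetching a URL this way requires the allow_url_fopen setting to be enabled.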
  2. We instantiated the DOMDocument class.
  3. To load the HTML string into our newly-created DOMDocument object, we used the DOMDocument::loadHTML function. Real-world HTML is often malformed, which causes loadHTML to emit warnings, so the script suppresses them.
  4. After that, we used the getElementsByTagName function to search our HTML for all “a” elements. As I’m sure you already know, the <a> tag is used to define a hyperlink. Note that this function will return a traversable DOMNodeList object.
  5. We created an empty array called $extractedLinks, which will be used to neatly package all our retrieved links.
  6. Because the DOMNodeList object is traversable, we are able to loop through each <a> tag using a foreach loop.
  7. Inside our foreach loop, we read the link text from the nodeValue property. To get the link itself, we used the getAttribute function to read the href attribute.
  8. If the link is blank or is a fragment (anchor) link that starts with #, we skip it by using the continue statement.
  9. Finally, we store the link’s details in our $extractedLinks array.
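
The steps above can also be wrapped in a function that takes an HTML string directly, which makes the logic easy to try without a network request. The following is only a minimal sketch: the function name extractLinks and the sample markup are my own assumptions for illustration, and it relies on PHP 7 or later with the DOM extension enabled.

```php
<?php
// Hypothetical helper (not part of the script above): applies the same
// steps to any HTML string and returns the extracted links.
function extractLinks(string $html): array
{
    // loadHTML rejects an empty string, so return early in that case.
    if ($html === '') {
        return [];
    }

    $htmlDom = new DOMDocument();

    // Collect libxml parse warnings internally instead of printing them,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $htmlDom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $extractedLinks = [];
    foreach ($htmlDom->getElementsByTagName('a') as $link) {
        $linkHref = trim($link->getAttribute('href'));

        // Skip empty hrefs and fragment-only links such as "#top".
        if ($linkHref === '' || $linkHref[0] === '#') {
            continue;
        }

        $extractedLinks[] = [
            'text' => $link->nodeValue,
            'href' => $linkHref,
        ];
    }

    return $extractedLinks;
}

// Made-up sample markup, used purely for demonstration.
$sample = '<p><a href="/about">About</a> <a href="#top">Top</a> '
        . '<a href="https://example.com/">Example</a> <a href="">Empty</a></p>';

print_r(extractLinks($sample));
```

With the sample string, only the "/about" and "https://example.com/" links should survive the filtering, since the other two are a fragment link and an empty href.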

If you run the PHP above, the script will dump out an array of all links that were found on the Wikipedia homepage. Note that these links can be relative or absolute.
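
If you need to tell the two kinds of link apart, one possible approach is to check whether the href contains a scheme such as https. The sketch below uses PHP's parse_url function for this. The function name isAbsoluteUrl is hypothetical, and the rule it applies (a scheme means absolute) is a simplifying assumption: protocol-relative links such as //example.com/page have no scheme, so they are treated as relative here.

```php
<?php
// Hypothetical helper: returns true if the href has a scheme
// (for example "https" or "mailto"), and false otherwise.
function isAbsoluteUrl(string $href): bool
{
    // parse_url returns the scheme as a string when one is present,
    // null when it is missing and false for seriously malformed URLs.
    $scheme = parse_url($href, PHP_URL_SCHEME);

    return is_string($scheme) && $scheme !== '';
}

// Example hrefs, made up for illustration.
$examples = [
    'https://en.wikipedia.org/wiki/PHP',
    '/wiki/PHP',
    'wiki/PHP',
];

foreach ($examples as $href) {
    echo $href, ' => ', isAbsoluteUrl($href) ? 'absolute' : 'relative', PHP_EOL;
}
```

Relative links would still need to be resolved against the URL of the page they came from before they can be requested. That step is not covered here.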