PHP: Find and extract all links from an HTML string.

This is a PHP tutorial on how to extract all links and their anchor text from an HTML string. In this guide, I will show you how to fetch the HTML content of a web page and then extract the links from it. To do this, we will be using PHP’s DOMDocument class.

Let’s jump right in and take a look at a simple example:
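Here is a minimal sketch of such a script, following the steps listed below. The URL and the 'text' / 'href' array keys are assumptions for illustration, and the libxml_use_internal_errors call is my own choice for silencing parser warnings on imperfect real-world HTML.

```php
<?php
// Assumed URL for illustration; any page with links will do.
// Fetching a URL this way requires allow_url_fopen (and OpenSSL for https).
$url = 'https://en.wikipedia.org/';

// Get the page's HTML source as a string.
$html = file_get_contents($url);
if ($html === false || trim($html) === '') {
    exit('Could not fetch ' . $url . PHP_EOL);
}

// Instantiate the DOMDocument class.
$htmlDom = new DOMDocument();

// Real-world HTML is rarely perfectly valid, so collect parser
// warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Parse the HTML string.
$htmlDom->loadHTML($html);
libxml_clear_errors();

// Find every <a> element. This returns a traversable DOMNodeList.
$links = $htmlDom->getElementsByTagName('a');

// Array that will hold the extracted links.
$extractedLinks = array();

foreach ($links as $link) {
    // The link text.
    $linkText = $link->nodeValue;

    // The value of the href attribute (an empty string if it is missing).
    $linkHref = trim($link->getAttribute('href'));

    // Skip blank links.
    if ($linkHref === '') {
        continue;
    }

    // Skip in-page anchor links such as "#top".
    if ($linkHref[0] === '#') {
        continue;
    }

    // Store the link's details.
    $extractedLinks[] = array(
        'text' => $linkText,
        'href' => $linkHref,
    );
}

// Dump the result for example purposes.
var_dump($extractedLinks);
```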

In the code above:

  1. We sent a GET request to a given web page using PHP’s file_get_contents function. This function will return the HTML source of the URL as a string.
  2. We instantiated the DOMDocument class.
  3. In order to load the HTML string into our newly-created DOMDocument object, we used the DOMDocument::loadHTML function.
  4. After that, we used the getElementsByTagName function to search our HTML for all “a” elements. As I’m sure you already know, the <a> tag is used to define a hyperlink. Note that this function will return a traversable DOMNodeList object.
  5. We created an empty array called $extractedLinks, which will be used to neatly package all our retrieved links.
  6. Because the DOMNodeList object is traversable, we are able to loop through each <a> tag using a foreach loop.
  7. Inside our foreach loop, we retrieved the link text using the nodeValue property. To get the link’s URL, we used the getAttribute function to read the href attribute.
  8. If the link is blank or starts with a hash character (#), which makes it an in-page anchor link, we skip it by using the continue statement.
  9. Finally, we store the link’s details in our $extractedLinks array.
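
The steps above can also be wrapped in a reusable function and tried on a fixed HTML string, which avoids a network request while experimenting. This is only a rough sketch: the function name extractLinks and the sample markup are hypothetical, not part of any library.

```php
<?php
/**
 * Hypothetical helper: returns the text and href of every usable <a> element.
 */
function extractLinks(string $html): array
{
    // DOMDocument::loadHTML rejects an empty string on PHP 8, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));

        // Skip blank links and in-page anchor links.
        if ($href === '' || $href[0] === '#') {
            continue;
        }

        $links[] = ['text' => trim($anchor->nodeValue), 'href' => $href];
    }

    return $links;
}

// Sample markup (made up for this example).
$sampleHtml = '<html><body>'
    . '<a href="https://example.com/">Example</a>'
    . '<a href="/about">About us</a>'
    . '<a href="#top">Back to top</a>'
    . '<a href="">Empty</a>'
    . '<a>No href</a>'
    . '</body></html>';

print_r(extractLinks($sampleHtml));
```

Only the first two links in the sample survive the filter; the anchor link, the empty href and the <a> tag without an href are skipped.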

If you run the PHP code above, the script will dump out an array of all links that were found on the Wikipedia homepage. Note that these links can be relative or absolute.
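
If you need to tell the two kinds of link apart, one simple approach is to check whether the href contains a URL scheme. The sketch below uses PHP’s parse_url function for that; the function name isAbsoluteUrl and the example hrefs are hypothetical, and protocol-relative links such as //host/path are treated as relative here because they carry no scheme.

```php
<?php
/**
 * Hypothetical helper: true if the href includes a scheme such as "https".
 */
function isAbsoluteUrl(string $href): bool
{
    // parse_url returns null when the scheme is absent
    // and false when the URL is seriously malformed.
    $scheme = parse_url($href, PHP_URL_SCHEME);

    return is_string($scheme) && $scheme !== '';
}

// Made-up hrefs for illustration.
$examples = [
    'https://en.wikipedia.org/wiki/PHP',
    '/wiki/PHP',
    'wiki/PHP',
    '//upload.wikimedia.org/logo.png',
];

foreach ($examples as $href) {
    echo $href, ' => ', isAbsoluteUrl($href) ? 'absolute' : 'relative', PHP_EOL;
}
```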
