PHP: Find all images in an HTML string.

This is a guide on how to find and extract all image elements from a string containing HTML. In this tutorial, we will fetch the HTML content of an external web page before extracting all of the images from it. Essentially, we will be scraping the web page for images.

Take a look at the following example:

//Send a GET request to the URL of the web page using file_get_contents.
//This will return the HTML source of the page as a string, or false on failure.
$htmlString = file_get_contents('https://en.wikipedia.org/wiki/Main_Page');
if($htmlString === false){
    die('Could not fetch the page.');
}

//Create a new DOMDocument object.
$htmlDom = new DOMDocument;

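//Load the HTML string into our DOMDocument object.
//Real-world HTML is rarely perfectly valid, so we tell libxml to collect
//its parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$htmlDom->loadHTML($htmlString);
libxml_clear_errors();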

//Extract all img elements / tags from the HTML.
$imageTags = $htmlDom->getElementsByTagName('img');

//Create an array to add extracted images to.
$extractedImages = array();

//Loop through the image tags that DOMDocument found.
foreach($imageTags as $imageTag){

    //Get the src attribute of the image.
    $imgSrc = $imageTag->getAttribute('src');

    //Get the alt text of the image.
    $altText = $imageTag->getAttribute('alt');

    //Get the title text of the image, if it exists.
    $titleText = $imageTag->getAttribute('title');

    //Add the image details to our $extractedImages array.
    $extractedImages[] = array(
        'src' => $imgSrc,
        'alt' => $altText,
        'title' => $titleText
    );
}

//var_dump our array of images.
var_dump($extractedImages);

In the example above, we scraped Wikipedia’s homepage using the file_get_contents function. This function returns the HTML of the page in a string format, which we can then load into the DOMDocument object.
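The same steps work on any HTML string, not only one fetched over HTTP. As a minimal sketch, they can be wrapped in a function and tried on a small literal string. The function name extractImages and the sample markup are made up for this illustration.

```php
<?php
// Hypothetical helper: returns the src, alt and title of every img element in an HTML string.
function extractImages(string $html): array
{
    $images = array();
    if (trim($html) === '') {
        return $images; // loadHTML rejects an empty string, so bail out early
    }

    $dom = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('img') as $img) {
        $images[] = array(
            'src' => $img->getAttribute('src'),
            'alt' => $img->getAttribute('alt'),
            'title' => $img->getAttribute('title'),
        );
    }
    return $images;
}

// Sample markup, invented for this example.
$sampleHtml = '<p><img src="/logo.png" alt="Logo"><img src="cat.jpg" title="A cat"></p>';
$sampleImages = extractImages($sampleHtml);
var_dump($sampleImages);
```

Keeping the parsing in one function means the fetching step can be swapped out (a file, a database field, a cURL response) without touching the extraction logic.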

The DOMDocument object allows us to find all img tags without having to resort to using regular expressions. By using its getElementsByTagName method, we can tell it to return a DOMNodeList of the elements that we want.
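A DOMNodeList is not a plain PHP array, which is worth keeping in mind when working with the result. The short sketch below, using invented sample markup, shows three common ways of working with it: the length property, the item() method and a conversion to a regular array.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
// Sample markup, invented for this example.
$dom->loadHTML('<div><img src="a.png"><img src="b.png"><img src="c.png"></div>');
libxml_clear_errors();

$nodeList = $dom->getElementsByTagName('img');

// DOMNodeList exposes the number of matches through its length property.
$total = $nodeList->length;

// Individual nodes are fetched by zero-based index with item().
$firstSrc = $nodeList->item(0)->getAttribute('src');

// iterator_to_array converts the list into a regular PHP array of nodes.
$asArray = iterator_to_array($nodeList);
$lastSrc = end($asArray)->getAttribute('src');

echo $total . ' images, first: ' . $firstSrc . ', last: ' . $lastSrc . PHP_EOL;
```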

In the Wikipedia example, we told the DOMDocument object to return all the img tags in the HTML source that we fetched with file_get_contents. We then looped through those img tags while fetching their src, title and alt attributes.
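Because getAttribute returns an empty string for an attribute that is not there, it cannot tell a missing attribute apart from an empty one. If that difference matters, hasAttribute can be checked first. Here is a small sketch with invented sample markup; the $altReport variable is only for illustration.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true);
// Sample markup: the first image has an empty alt, the second has none at all.
$dom->loadHTML('<p><img src="a.png" alt=""><img src="b.png"></p>');
libxml_clear_errors();

$altReport = array();
foreach ($dom->getElementsByTagName('img') as $img) {
    // hasAttribute tells a missing attribute apart from one that is present but empty.
    $altReport[$img->getAttribute('src')] = $img->hasAttribute('alt') ? 'present' : 'missing';
}

var_dump($altReport);
```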

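If you run the full example that fetches Wikipedia’s homepage, var_dump will print an array in which each element is an associative array containing the src, alt and title of one image.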

Note that in some cases:

  • The title attribute will not exist. In that case, getAttribute returns an empty string.
  • The alt attribute may also be blank or missing.
  • The URL found in the src attribute of the img tag may be relative, i.e. it might not include the domain name or the URL scheme (such as https). In those cases, you will have to “fix” the links yourself.
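One possible way of “fixing” relative links is sketched below. The helper name makeAbsoluteUrl and the example.com URLs are hypothetical, and the logic is deliberately simplified: it ignores ports, base tags and "../" segments, so treat it as a starting point rather than a complete URL resolver.

```php
<?php
// Hypothetical helper: turns a src value into an absolute URL,
// given the URL of the page that the image was found on.
function makeAbsoluteUrl(string $src, string $pageUrl): string
{
    // Already absolute (http:, https:, data: and so on): nothing to do.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src) === 1) {
        return $src;
    }

    $scheme = parse_url($pageUrl, PHP_URL_SCHEME);
    $host = parse_url($pageUrl, PHP_URL_HOST);

    // Protocol-relative, e.g. //cdn.example.org/a.png
    if (substr($src, 0, 2) === '//') {
        return $scheme . ':' . $src;
    }

    // Root-relative, e.g. /static/a.png
    if (substr($src, 0, 1) === '/') {
        return $scheme . '://' . $host . $src;
    }

    // Otherwise, resolve against the directory of the current page.
    $path = (string) parse_url($pageUrl, PHP_URL_PATH);
    $lastSlash = strrpos($path, '/');
    $dir = ($lastSlash === false) ? '/' : substr($path, 0, $lastSlash + 1);

    return $scheme . '://' . $host . $dir . $src;
}

// Example usage with made-up URLs.
$pageUrl = 'https://example.com/wiki/Main_Page';
foreach (array('//cdn.example.org/a.png', '/static/a.png', 'a.png') as $src) {
    echo makeAbsoluteUrl($src, $pageUrl) . PHP_EOL;
}
```

Inside the foreach loop of the main example, the value returned by getAttribute('src') could be passed through a helper like this before it is stored.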

Related: Scrape links with PHP.