The Facebook crawler DDoS.

Facebook’s crawler is an untamed beast, and if you are not prepared, it will cause you a lot of problems, especially if you serve a lot of story images.

The crawler that scrapes HTML page content is not too bad, as it actually pays attention to the “og:ttl” meta tag:
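
```html
<!-- 2,419,200 seconds = 28 days -->
<meta property="og:ttl" content="2419200" />
```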

The og:ttl tag above tells Facebook not to re-scrape the page for another 28 days (2,419,200 seconds). By doing this, you can effectively rate-limit the crawler. If you do not use this tag, the crawler falls back to the Expires header on the page response. If your server does not return an Expires header either, it defaults to re-scraping after 7 days (604,800 seconds).
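
For illustration, the Expires fallback is just a standard HTTP response header, along these lines (the date here is only an example):

```http
Expires: Thu, 01 Jan 2026 00:00:00 GMT
```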

Unfortunately, the Facebook crawler that fetches images specified in the og:image tag is an entirely different beast. It cannot be tamed or told to back off. It will scrape the same image multiple times in quick succession while ignoring any of your cache-related headers.

My initial assumption was that the Facebook crawler would request an image once and then cache it on their servers.

This assumption was dangerously wrong, as my server soon started to strain under the number of requests coming from Facebook IP addresses.

When the crawler sees an image resource in the og:image tag, it downloads it from multiple locations. This means that up to 20 different Facebook crawlers, in 20 different geographical locations, will each request and download the image. The crawlers then return every 30 days or so and do the exact same thing.

Essentially, this means that a single og:image could result in about 240 requests from the Facebook crawler per year (20 requests roughly every 30 days, or about 12 times a year). If you have 100 images, that number becomes 24,000 requests per year. If you have 1,000 images, you will be serving about 240,000 requests per year.

So be warned, especially if you plan on serving dynamically generated images. Put a good cache in place between your server and Facebook, or offload the images to a separate server entirely.
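
To give you a picture of the first option, here is a minimal caching-proxy sketch in Python. Everything in it is an assumption for illustration: the backend address and port are hypothetical, and it assumes each generated image is immutable so it can be cached forever. A real deployment would more likely put nginx, Varnish, or a CDN in front of the image endpoints.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical backend that generates the og:image files on demand.
ORIGIN = "http://localhost:8000"

# path -> (status, content type, body); images are assumed immutable,
# so entries are never evicted in this sketch.
cache = {}

class CachingImageProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in cache:
            # First hit: generate the image once on the origin server.
            # The next 19 crawler requests are then served from memory.
            with urllib.request.urlopen(ORIGIN + self.path) as resp:
                cache[self.path] = (
                    resp.status,
                    resp.headers.get("Content-Type", "application/octet-stream"),
                    resp.read(),
                )
        status, ctype, body = cache[self.path]
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), CachingImageProxy).serve_forever()
```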

Facebook Comments