The Facebook crawler DDoS.

Facebook’s crawler is an untamed beast, and if you are not prepared, it will cause you a lot of problems, especially if you serve a lot of story images.

The crawler that scrapes HTML page content is not too bad, as it actually pays attention to the “og:ttl” meta tag:
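
```html
<!-- 2,419,200 seconds = 28 days -->
<meta property="og:ttl" content="2419200" />
```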

The og:ttl tag above tells Facebook not to re-scrape the page for another 28 days (2,419,200 seconds). By doing this, you can effectively rate-limit the crawler. If you do not use this tag, the crawler falls back to the Expires header on the page response. If your server does not return an Expires header either, it defaults to re-scraping after 7 days (604,800 seconds).
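
For illustration, the Expires fallback is just a standard HTTP response header, along these lines (the date here is only an example):

```http
Expires: Thu, 01 Jan 2026 00:00:00 GMT
```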

Unfortunately, the Facebook crawler that fetches images specified in the og:image tag is an entirely different beast. It cannot be tamed or told to back off. It will scrape the same image multiple times in quick succession while ignoring any of your cache-related headers.

My initial assumption was that the Facebook crawler would request an image once and then cache it on their servers.

This assumption was dangerously wrong, as my server soon started to strain under the number of requests coming from Facebook IP addresses.

When the crawler sees an image resource in the og:image tag, it downloads it from multiple locations. This means that up to 20 different Facebook crawlers, in 20 different geographical locations, will each request and download the image. The crawlers then return every 30 days or so and do the exact same thing.

Essentially, this means that a single og:image could result in about 240 requests from the Facebook crawler per year (20 requests roughly every 30 days, or about 12 times a year). If you have 100 images, that number becomes 24,000 requests per year. If you have 1,000 images, you will be serving about 240,000 requests per year.

So be warned, especially if you plan on serving dynamically generated images. Put a good cache in place between your server and Facebook, or offload the images to a separate server entirely.
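
To give you a picture of the first option, here is a minimal caching-proxy sketch in Python. Everything in it is an assumption for illustration: the backend address and port are hypothetical, and it assumes each generated image is immutable so it can be cached forever. A real deployment would more likely put nginx, Varnish, or a CDN in front of the image endpoints.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical backend that generates the og:image files on demand.
ORIGIN = "http://localhost:8000"

# path -> (status, content type, body); images are assumed immutable,
# so entries are never evicted in this sketch.
cache = {}

class CachingImageProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in cache:
            # First hit: generate the image once on the origin server.
            # The next 19 crawler requests are then served from memory.
            with urllib.request.urlopen(ORIGIN + self.path) as resp:
                cache[self.path] = (
                    resp.status,
                    resp.headers.get("Content-Type", "application/octet-stream"),
                    resp.read(),
                )
        status, ctype, body = cache[self.path]
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), CachingImageProxy).serve_forever()
```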

Facebook Comments