Dan Birken


Who exactly is crawling my site?

17 Oct 2013

Background

I run a small, mainly SEO-oriented website. It attracts about 500 visitors per day, and over 80% of them come from search engines. The site has about 30,000 distinct pages.

For a site with these properties, you'd expect a high percentage of the traffic to come from web crawlers, given the number of pages relative to the daily traffic. So while watching the bot requests scroll past in my tail -f of the server logs, I decided to investigate who exactly is making all these requests. Log data covers Sept 1st to Oct 17th; search traffic data comes from Google Analytics over the same period.

Bots vs Humans

First, by looking at the logs it is clear that bots do make up the majority of the requests and bandwidth:

Is Robot    Requests              Bandwidth (MBs)
Yes         418,814  (77.97%)     6,596  (59.44%)
No          118,327  (22.03%)     4,501  (40.56%)
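As a rough sketch of how a breakdown like this can be produced from an Apache/nginx combined-format access log (the regex and the bot-marker list below are illustrative, not the exact ones I used):

```python
import re
from collections import Counter

# Illustrative markers only; real bot detection needs a longer list.
BOT_MARKERS = ("bot", "spider", "crawler", "slurp", "proximic", "admantx")

# Minimal matcher for the "combined" log format:
# ip ident user [date] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'\S+ \S+ \S+ \[.*?\] ".*?" \d+ (\d+|-) ".*?" "(.*?)"')

def classify(line):
    """Return ('bot'|'human', bytes_sent) for one log line, or None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    size = 0 if m.group(1) == "-" else int(m.group(1))
    ua = m.group(2).lower()
    kind = "bot" if any(marker in ua for marker in BOT_MARKERS) else "human"
    return kind, size

def summarize(lines):
    """Tally request counts and bandwidth for bots vs humans."""
    requests, bandwidth = Counter(), Counter()
    for line in lines:
        result = classify(line)
        if result:
            kind, size = result
            requests[kind] += 1
            bandwidth[kind] += size
    return requests, bandwidth
```

Run over the full log, the two counters give the request and bandwidth splits in the table above.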

Top 10 Bots

Robot             Requests            Bandwidth (MBs)    User Agent
bingbot           199,725 (37.18%)    1,560 (14.06%)     Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Googlebot          40,359  (7.51%)      264  (2.38%)     Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Googlebot Mobile   36,340  (6.77%)      196  (1.77%)     Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
AhrefsBot          33,893  (6.31%)      122  (1.10%)     Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)
Baidu              26,325  (4.90%)      510  (4.60%)     Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
MJ12bot            13,458  (2.51%)      181  (1.64%)     Mozilla/5.0 (compatible; MJ12bot/v1.4.4; http://www.majestic12.co.uk/bot.php?+)
proximic           10,657  (1.98%)    1,382 (12.46%)     Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)
A6                  6,109  (1.14%)       40  (0.36%)     A6-Indexer/1.0 (http://www.a6corp.com/a6-web-scraping-policy/)
ADmantX             5,993  (1.12%)    1,130 (10.18%)     ADmantX Platform Semantic Analyzer - ADmantX Inc. - www.admantx.com - support@admantx.com
msnbot/2.0b         4,959  (0.92%)      106  (0.96%)     msnbot/2.0b (+http://search.msn.com/msnbot.htm)

All requests claiming to be from Googlebot or bingbot were validated. This made almost no difference, though: the logs contained only 6 spoofed bingbot user agents and 10 spoofed Googlebot user agents.
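Validation can be done with the forward-confirmed reverse DNS check that both Google and Bing document: reverse-resolve the requesting IP, confirm the hostname falls under the engine's domain, then resolve that hostname forward and require it to map back to the original IP. A sketch (the suffix list is an assumption covering Googlebot and bingbot/msnbot):

```python
import socket

# Domains the PTR record must fall under; extend per engine as needed.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_search_engine(host, suffixes=ALLOWED_SUFFIXES):
    """Pure check: does the reverse-DNS hostname belong to a known engine?"""
    return host.rstrip(".").endswith(suffixes)

def verify_search_bot(ip, suffixes=ALLOWED_SUFFIXES):
    """Forward-confirmed reverse DNS validation of a claimed crawler IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]       # reverse lookup (PTR)
    except OSError:
        return False
    if not hostname_is_search_engine(host, suffixes):
        return False
    try:
        # Forward lookup must return the original IP, or the PTR is spoofed.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

A user agent string alone proves nothing, which is why the spoof counts above required this check to find.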

Observations

Crawl to search traffic efficiency

For the bots that actually drive search traffic, I compared how much they crawled against the search traffic they actually sent to the site.

Search Engine          Bots                          Requests   Bandwidth (MBs)   Search Hits Driven   Crawl Requests / Search Hit   Crawl MB / Search Hit
Google (Google, AOL)   Googlebot, Googlebot-Mobile     76,699               460               19,881                          3.86                    0.02
Bing (Bing, Yahoo)     bingbot, msnbot                204,684             1,667                3,366                         60.81                    0.50
Baidu                  Baiduspider                     26,325               510                    5                      5,265.00                  102.17
Yandex                 YandexBot                        2,013                25                    1                      2,013.00                   25.00
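The two efficiency columns are just ratios of the raw figures; as a quick worked example:

```python
def efficiency(requests, mb, search_hits):
    """Crawl requests and crawl MB consumed per search hit driven."""
    return round(requests / search_hits, 2), round(mb / search_hits, 2)

# Google row: 76,699 requests and 460 MB of crawl for 19,881 search hits.
```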


Final Thoughts

When I first started going through the logs, I didn't expect this exercise to change anything. After looking at the data, however, I've decided to block all crawlers other than Google's and Bing's. The new robots.txt looks like this:

User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: / 

User-Agent: *
Disallow: /

There is little value in allowing other search engines to crawl the site, and I don’t see a point in allowing the other random services to use the data on the site for purposes that have no benefit to me. And in the case of proximic and ADmantX, their crawling is incredibly inefficient and wasteful of bandwidth.

It turns out I'm not alone in adding these types of restrictions. Yelp blocks everybody but Google, Bing, ia_archiver (archive.org), ScoutJet (Blekko), and Yandex. LinkedIn has a similar opt-in robots.txt, though they whitelist a larger number of bots than Yelp.

Update - Oct 18

There were some good comments on Hacker News pointing out that blocking all crawlers other than Google's and Bing's is shortsighted and unfriendly to new search engines. This is a good point. So while I still believe a whitelist approach makes sense, I should whitelist any crawler that behaves well and attempts to drive traffic to my site (i.e., search engines). Here is the new robots.txt:

User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: / 

# Blekko
User-Agent: ScoutJet
Allow: / 

User-Agent: Yandex
Allow: / 

User-agent: baiduspider
Allow: / 

User-agent: DuckDuckBot
Allow: / 

# CommonCrawl
User-agent: ccbot
Allow: / 

User-Agent: *
Disallow: /

This approach will require me to keep an eye out for new search engines and whitelist their crawlers as they appear, but it saves me the effort of manually blacklisting every new crawler that tries to crawl the site.
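A whitelist robots.txt is easy to get subtly wrong, so it's worth sanity-checking which agents actually get through. Python's urllib.robotparser can do this against the rules directly (trimmed here to a subset of the entries above):

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the robots.txt above, inlined for testing.
ROBOTS_TXT = """\
User-Agent: Googlebot
Allow: /

User-Agent: bingbot
Allow: /

User-Agent: *
Disallow: /
"""

def allowed(user_agent, path="/"):
    """Would this user agent be permitted to fetch the given path?"""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

With these rules, allowed("Googlebot") and allowed("bingbot") return True, while anything not on the whitelist falls through to the catch-all Disallow.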

Discuss on Hacker News