Who exactly is crawling my site?

Background

I run a small website that is mainly SEO-oriented. It attracts about 500 visitors per day, and over 80% of them arrive via search engines. The site has about 30,000 different pages.

For a site with these properties, you’d expect a high percentage of the traffic to come from web crawlers, given the number of pages compared to the daily traffic. So while watching the server logs and seeing all the bot requests scroll by in my tail -f, I decided to investigate who exactly is making all these requests. The log data is from Sept 1st to Oct 17th. Search traffic data comes from Google Analytics for the same period.

Bots vs Humans

First, by looking at the logs it is clear that bots do make up the majority of the requests and bandwidth:

Is Robot	Requests	% of Requests	Bandwidth (MB)	% of Bandwidth
Yes	418,814	77.97%	6,596	59.44%
No	118,327	22.03%	4,501	40.56%

Top 10 Bots

Robot	User Agent	Requests	% of Requests	Bandwidth (MB)	% of Bandwidth
bingbot	Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)	199,725	37.18%	1,560	14.06%
Googlebot	Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)	40,359	7.51%	264	2.38%
Googlebot Mobile	Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)	36,340	6.77%	196	1.77%
AhrefsBot	Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)	33,893	6.31%	122	1.10%
Baidu	Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)	26,325	4.90%	510	4.60%
MJ12bot	Mozilla/5.0 (compatible; MJ12bot/v1.4.4; http://www.majestic12.co.uk/bot.php?+)	13,458	2.51%	181	1.64%
proximic	Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)	10,657	1.98%	1,382	12.46%
A6	A6-Indexer/1.0 (http://www.a6corp.com/a6-web-scraping-policy/)	6,109	1.14%	40	0.36%
ADmantX	ADmantX Platform Semantic Analyzer - ADmantX Inc. - www.admantx.com - support@admantx.com	5,993	1.12%	1,130	10.18%
msnbot/2.0b	msnbot/2.0b (+http://search.msn.com/msnbot.htm)	4,959	0.92%	106	0.96%

All requests from Googlebot and bingbot were validated. This made almost no difference, though: the logs contained only 6 spoofed bingbot user agents and 10 spoofed Googlebot user agents.
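
How the validation was done isn't described here. One common approach, which both Google and Bing document, is a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm the hostname maps back to that IP. Below is a minimal sketch of that idea, not necessarily what was used for this analysis. The function name, the `CRAWLER_DOMAINS` table, and the injectable `reverse`/`forward` parameters are all assumptions made for illustration.

```python
import socket

# Hostname suffixes that the search engines document for their crawlers.
# The dictionary keys are labels chosen for this sketch.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def is_genuine_crawler(ip, bot, reverse=None, forward=None):
    """Return True if `ip` reverse-resolves into the bot's domain and the
    resulting hostname resolves forward to the same IP."""
    # Default to real DNS lookups; callers may inject fakes for testing.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        if not host.lower().endswith(CRAWLER_DOMAINS[bot]):
            return False  # PTR record points somewhere else: spoofed UA
        return ip in forward(host)  # forward-confirm the PTR record
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

Checking only the user agent string is not enough, since anyone can send a Googlebot user agent; the forward confirmation step guards against an attacker who controls the reverse DNS for their own IP range.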

Observations

4 out of the top 10 bots crawling the site have no pretense of crawling to drive traffic back to it:
- Selling SEO services: AhrefsBot
- Advertising targeting: proximic, A6, ADmantX

Both proximic and ADmantX account for a small percentage of requests but a much larger percentage of bandwidth. This is because neither of their robots supports gzip compression of HTML, resulting in much larger responses from this heavily compressible, text-heavy site.

This wastes my site’s bandwidth, but more importantly this wastes their bandwidth. Tip: If you are writing a web crawler, support compression. For a text heavy site like this one, bots which support compression are getting 10-15x as many pages for the same amount of bandwidth (and they are getting them a lot faster as well).
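
For crawler authors, supporting compression mostly comes down to sending an `Accept-Encoding: gzip` request header and decompressing the body when the server answers with `Content-Encoding: gzip`. Here is a minimal sketch using only the Python standard library; the helper names `build_request`, `decode_body`, and `fetch` are made up for this example.

```python
import gzip
import urllib.request


def build_request(url):
    """Build a request that advertises gzip support to the server."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Decompress the body only if the server chose to gzip it."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url):
    """Fetch a page, transparently handling a gzip-compressed response."""
    with urllib.request.urlopen(build_request(url)) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

The server is free to ignore the header and send an uncompressed body, which is why `decode_body` checks the response header rather than assuming compression.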

Bingbot does a lot more crawling than all other bots, most notably more than Googlebot. Here is a direct comparison of their crawling of various types of pages on the site:

Type of page	Bingbot requests	Googlebot requests	Bingbot-Googlebot Ratio
Homepage	146	139	1.05
Newer long tail pages	47,944	28,685	1.67
Older long tail pages	134,206	45,118	2.97

While bingbot does crawl more than Googlebot across the board, the biggest difference is in the long-tail pages, and even more so in the older long-tail pages. My guess is that Googlebot slows down its crawl rate more than bingbot does on older pages that don’t change very often.

Crawl to search traffic efficiency

For the bots that actually do drive search traffic, I compared the number of pages they crawled to the search traffic they drove to the site.

Search Engine	Bots	Page Requests	Bandwidth (MB)	Search Hits Driven	Crawl Requests / Search Hit	Crawl MB / Search Hit
Google (Google, AOL)	Googlebot, Googlebot-Mobile	76,699	460	19,881	3.86	0.02
Bing (Bing, Yahoo)	bingbot, msnbot	204,684	1,667	3,366	60.81	0.50
Baidu	Baiduspider	26,325	510	5	5,265.00	102.17
Yandex	YandexBot	2,013	25	1	2,013.00	25.00
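
The "Crawl Requests / Search Hit" column is just page requests divided by search hits. As a small sketch, here is that arithmetic in Python using the request and hit counts copied from the table; the variable and function names are only illustrative.

```python
# (page requests, search hits driven), copied from the table above
crawl_stats = {
    "Google": (76_699, 19_881),
    "Bing": (204_684, 3_366),
    "Baidu": (26_325, 5),
    "Yandex": (2_013, 1),
}


def requests_per_hit(page_requests, search_hits):
    """Crawl requests made for each search visit the engine sent back."""
    return page_requests / search_hits


for engine, (reqs, hits) in crawl_stats.items():
    print(f"{engine}: {requests_per_hit(reqs, hits):.2f} crawl requests per search hit")
```

A lower number means the engine sends more traffic back for each page it crawls.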

Observations

Because Google drives the majority of search traffic and crawls more efficiently, it outpaces all other engines when comparing crawled pages to traffic driven.
Baidu and Yandex drive negligible traffic for the number of pages they crawl (granted, this site is targeted towards people in the US, so this isn’t entirely surprising).

Final Thoughts

When I first started going through the logs, I didn’t expect to actually change anything because of this exercise. However, after looking at the data, I’ve decided to block all crawlers other than Google and Bing from the site. The new robots.txt looks like this:

User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: / 

User-Agent: *
Disallow: /

There is little value in allowing other search engines to crawl the site, and I don’t see a point in allowing the other random services to use the data on the site for purposes that have no benefit to me. And in the case of proximic and ADmantX, their crawling is incredibly inefficient and wasteful of bandwidth.
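
A whitelist like this is easy to get subtly wrong, so it can be worth sanity-checking before deploying. The sketch below feeds the rules above to Python's standard `urllib.robotparser` and asks which crawlers may fetch a page. The `allowed` helper and the `/some/page` path are placeholders for illustration, and real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: /

User-Agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, path="/some/page"):
    """Ask the parser whether this crawler token may fetch the path.

    Pass the crawler's product token (e.g. "bingbot"), not the full
    "Mozilla/5.0 (compatible; ...)" user agent string.
    """
    return parser.can_fetch(user_agent, path)


for bot in ("Googlebot", "bingbot", "AhrefsBot", "Baiduspider"):
    print(bot, "allowed" if allowed(bot) else "blocked")
```

Keep in mind that robots.txt is advisory: it only stops crawlers that choose to honor it.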

It turns out I’m not alone in adding these types of restrictions. Yelp blocks everybody but Google, Bing, ia_archiver (archive.org), ScoutJet (Blekko) and Yandex. LinkedIn has a similar opt-in robots.txt, though they have whitelisted a larger number of bots than Yelp.

Update - Oct 18

There were some good comments on Hacker News saying that blocking all crawlers other than Google and Bing is shortsighted and unfriendly to new search engines. This is a good point. So while I still believe a whitelist approach makes sense, I should make sure to whitelist any crawlers that behave well and are attempting to drive traffic to my site (i.e., search engines). So here is the new robots.txt:

User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: / 

# Blekko
User-Agent: ScoutJet
Allow: / 

User-Agent: Yandex
Allow: / 

User-agent: baiduspider
Allow: / 

User-agent: DuckDuckBot
Allow: / 

# CommonCrawl
User-agent: ccbot
Allow: / 

User-Agent: *
Disallow: /

This approach will require me to pay attention to new search engines and whitelist their crawlers as they come up, but it saves me the effort of having to manually blacklist every new crawler that tries to crawl the site.

Dan Birken
