I run a small website that is mainly SEO-oriented. It attracts about 500 visitors per day, and over 80% of them come from search. The site has about 30,000 different pages.
For a site with these properties, you'd expect a high percentage of the traffic to come from web crawlers, given the number of pages compared to the daily traffic. So while looking at the server logs and watching all the bot accesses scroll by in my tail -f, I decided to investigate who exactly is making all these requests. Log data is from Sept 1st to Oct 17th. Search traffic data comes from Google Analytics for the same period.
Bots vs Humans
First, by looking at the logs it is clear that bots do make up the majority of the requests and bandwidth:
|Is Robot||Requests||Bandwidth (MBs)|
Top 10 Bots
|Robot||User Agent||Requests||% of Requests||Bandwidth (MBs)||% of Bandwidth|
|bingbot||Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)||199,725||37.18%||1,560||14.06%|
|Googlebot||Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)||40,359||7.51%||264||2.38%|
|Googlebot Mobile||Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)||36,340||6.77%||196||1.77%|
|AhrefsBot||Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)||33,893||6.31%||122||1.10%|
|Baidu||Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)||26,325||4.90%||510||4.60%|
|MJ12bot||Mozilla/5.0 (compatible; MJ12bot/v1.4.4; http://www.majestic12.co.uk/bot.php?+)||13,458||2.51%||181||1.64%|
|proximic||Mozilla/5.0 (compatible; proximic; +http://www.proximic.com/info/spider.php)||10,657||1.98%||1,382||12.46%|
|ADmantX||ADmantX Platform Semantic Analyzer - ADmantX Inc. - www.admantx.com - firstname.lastname@example.org||5,993||1.12%||1,130||10.18%|
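Per-bot totals like the ones in this table can be tallied straight from the access log. The sketch below is not the script used for this post; it assumes the server writes the common "combined" log format, and the function name `tally_by_bot` and the `BOT_TOKENS` list are made up for illustration (the tokens are substrings of the user agents shown above).

```python
import re
from collections import defaultdict

# Assumed combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings of the user agents in the table. The more specific
# Googlebot-Mobile token comes before Googlebot so it matches first.
BOT_TOKENS = (
    "bingbot", "Googlebot-Mobile", "Googlebot", "AhrefsBot",
    "Baiduspider", "MJ12bot", "proximic", "ADmantX",
)


def tally_by_bot(lines):
    """Return {bot name: (request count, total bytes)} for the given log lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        name = next((t for t in BOT_TOKENS if t.lower() in agent), "other")
        size = match.group("size")
        totals[name][0] += 1
        totals[name][1] += 0 if size == "-" else int(size)
    return {name: tuple(counts) for name, counts in totals.items()}
```

Passing an open log file to `tally_by_bot` gives the raw counts; percentages follow by dividing each entry by the overall totals. Anything that matches none of the tokens lands in "other", which lumps humans together with unlisted bots.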
All requests from Googlebot and bingbot were validated. This made almost no difference, though: the logs contained only 6 spoofed bingbot user agents and 10 spoofed Googlebot user agents.
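The post does not spell out how the validation was done. One commonly documented approach is a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned hostname. The sketch below assumes that approach; `is_genuine_crawler` and its injectable lookup parameters are hypothetical names, and the hostname suffixes are the ones Google and Bing publish for their crawlers.

```python
import socket

# Hostname suffixes the search engines document for their crawlers.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}


def is_genuine_crawler(ip, bot, reverse_lookup=None, forward_lookup=None):
    """Check a claimed crawler IP with a reverse lookup, then a forward lookup.

    The lookup functions can be swapped out (for example in tests);
    by default they use the system resolver.
    """
    suffixes = CRAWLER_DOMAINS.get(bot)
    if suffixes is None:
        return False
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip)
        # The reverse record must sit under one of the engine's domains...
        if not host.lower().rstrip(".").endswith(suffixes):
            return False
        # ...and the hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
```

The forward lookup matters because anyone who controls the reverse zone for their own IP range can make it return a name ending in googlebot.com; only the real owner of that domain can make the name resolve back to the IP.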
4 out of the top 10 bots crawling the site have no pretense of crawling to drive traffic back to it:
- Selling SEO services: AhrefsBot
- Advertising targeting: proximic, A6, ADmantX
Both proximic and ADmantX account for a small percentage of requests but a much larger percentage of bandwidth. This is because neither of their robots supports gzip compression of HTML, resulting in a much larger response size for this heavily compressible, text-heavy site.
This wastes my site's bandwidth, but more importantly this wastes their bandwidth. Tip: If you are writing a web crawler, support compression. For a text heavy site like this one, bots which support compression are getting 10-15x as many pages for the same amount of bandwidth (and they are getting them a lot faster as well).
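As a rough illustration of the tip, here is a minimal sketch of the crawler-side pieces using only the Python standard library: advertise gzip support in the request, then decompress the body when the server says it used it. The helper names are invented for this example, and `compression_ratio` is only there to let you measure the savings on your own pages; the actual ratio depends entirely on the content.

```python
import gzip
import urllib.request


def build_request(url):
    """Build (but do not send) a request that advertises gzip support."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding):
    """Decompress a response body if the Content-Encoding header says gzip."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(body)
    return body


def compression_ratio(html):
    """Return uncompressed size divided by gzip-compressed size."""
    raw = html.encode("utf-8")
    return len(raw) / len(gzip.compress(raw))
```

With a real fetch, the response from `urllib.request.urlopen(build_request(url))` would be passed through `decode_body(response.read(), response.headers.get("Content-Encoding"))`.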
Bingbot does a lot more crawling than all other bots, most notably more than Googlebot. Here is a direct comparison of their crawling of various types of pages on the site:
|Type of page||Bingbot requests||Googlebot requests||Bingbot-Googlebot Ratio|
|Homepage||146||139||1.05|
|Newer long tail pages||47,944||28,685||1.67|
|Older long tail pages||134,206||45,118||2.97|
While bingbot does crawl more than Googlebot across the board, the biggest difference is in the long-tail pages, and even more so in the older long-tail pages. My guess is that Googlebot slows down its crawl rate more than bingbot on older pages that don't change very often.
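The ratio column is just bingbot requests divided by Googlebot requests for each page type. A small sketch that recomputes it from the request counts in the comparison (the variable and function names are made up for this example):

```python
# (bingbot requests, Googlebot requests) per page type, from the comparison above.
CRAWL_COUNTS = {
    "Homepage": (146, 139),
    "Newer long tail pages": (47944, 28685),
    "Older long tail pages": (134206, 45118),
}


def bing_to_google_ratios(counts):
    """Return bingbot/Googlebot request ratios rounded to two decimals."""
    return {page: round(bing / google, 2) for page, (bing, google) in counts.items()}


print(bing_to_google_ratios(CRAWL_COUNTS))
# {'Homepage': 1.05, 'Newer long tail pages': 1.67, 'Older long tail pages': 2.97}
```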
Crawl to search traffic efficiency
For the bots that actually do drive search traffic, I looked at how the number of pages they crawled compared with the search traffic they drove to the site.
|Search Engine||Bots||Page Requests||Bandwidth (MBs)||Search Hits Driven||Crawl Requests / Search Hit||Crawl MB / Search Hit|
- Google drives the majority of search traffic and crawls more efficiently, so it outpaces all other engines when comparing crawled pages to traffic driven.
- Baidu and Yandex drive negligible traffic for the number of pages they crawl (granted, this site is targeted at people in the US, so this isn't entirely surprising).
When I first was going through the logs, I didn't expect to actually change anything because of this exercise. However after looking at the data, I've decided to block all crawlers to the site other than Google or Bing. The new robots.txt looks like this:
User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: /

User-Agent: *
Disallow: /
There is little value in allowing other search engines to crawl the site, and I don't see a point in allowing the other random services to use the data on the site for purposes that have no benefit to me. And in the case of proximic and ADmantX, their crawling is incredibly inefficient and wasteful of bandwidth.
It turns out I'm not alone in adding these types of restrictions. Yelp blocks everybody but Google, Bing, ia_archiver (archive.org), ScoutJet (Blekko) and Yandex. LinkedIn has a similar opt-in robots.txt, though it whitelists a larger number of bots than Yelp does.
Update - Oct 18
There were some good comments on Hacker News saying that blocking all crawlers other than Google and Bing is shortsighted and unfriendly to new search engines. This is a good point. So while I still believe a whitelist approach makes sense, I should make sure to whitelist any crawlers which behave and are attempting to drive traffic to my site (i.e., search engines). So here is the new robots.txt:
User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: /

# Blekko
User-Agent: ScoutJet
Allow: /

User-Agent: Yandex
Allow: /

User-agent: baiduspider
Allow: /

User-agent: DuckDuckBot
Allow: /

# CommonCrawl
User-agent: ccbot
Allow: /

User-Agent: *
Disallow: /
This approach will require me to pay attention to new search engines and whitelist their crawlers as they come up, but it saves me the effort of having to manually blacklist every new crawler that tries to crawl the site.
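A whitelist like this is easy to get subtly wrong, so it can be worth checking before deploying. The sketch below feeds the updated rules to Python's standard `urllib.robotparser` and asks which crawlers may fetch a path. The helper name `allowed_crawlers` is made up for this example, and Python's parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: Googlebot
Allow: /

User-Agent: Googlebot-Mobile
Allow: /

User-Agent: msnbot
Allow: /

User-Agent: bingbot
Allow: /

# Adsense
User-Agent: Mediapartners-Google
Allow: /

# Blekko
User-Agent: ScoutJet
Allow: /

User-Agent: Yandex
Allow: /

User-agent: baiduspider
Allow: /

User-agent: DuckDuckBot
Allow: /

# CommonCrawl
User-agent: ccbot
Allow: /

User-Agent: *
Disallow: /
"""


def allowed_crawlers(robots_txt, agents, path="/"):
    """Return {agent: True/False} for whether each agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}
```

Calling `allowed_crawlers(ROBOTS_TXT, ["bingbot", "DuckDuckBot", "AhrefsBot", "proximic"])` should report the first two as allowed and the last two as blocked. Note that the agents are passed as short product tokens rather than full user agent strings, since that is what this parser matches against.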