BingBots or MSNBots Over-crawling?
There are instances when bots from Bing or MSN over-crawl a website, making it extremely slow or even causing it to go down.
What generally happens?
Microsoft Bing web crawlers sometimes get into an “out of control” crawling mode, causing websites to go down. (Why? Ask Microsoft. I have not heard a good explanation of why this happens; however, I’ve witnessed it on pretty large websites.)
How to fix BingBot OverCrawling by controlling Crawl Rate?
Typically, web admins or database admins will notice that an application is generating lots of SELECT queries taking 600 to 1,500 seconds each, which are logged in the database query logs. Upon further investigation, the source IPs of those queries can be identified as web servers, such as 188 188.8.131.52 or 143 184.108.40.206. There can be hundreds (possibly even thousands) of such Microsoft IPs, which are mainly bots.
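As a rough illustration of spotting the heaviest sources, the sketch below tallies requests per client IP. It assumes a web-server access log whose first whitespace-separated field is the client IP (as in the common log format), which may not match your setup. The function name `top_client_ips` and the sample lines are hypothetical, and the sample addresses come from documentation ranges.

```python
from collections import Counter


def top_client_ips(log_lines, n=10):
    """Return the n most frequent client IPs as (ip, count) pairs.

    Assumption: the first whitespace-separated field of each line
    is the client IP address.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = [
        '203.0.113.7 - - "GET /page-1 HTTP/1.1" 200',
        '203.0.113.7 - - "GET /page-2 HTTP/1.1" 200',
        '198.51.100.9 - - "GET /page-1 HTTP/1.1" 200',
    ]
    for ip, hits in top_client_ips(sample, n=2):
        print(ip, hits)
```

The IPs at the top of this list are the candidates to check against Bing Webmaster Tools before blocking anything permanently.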
- Kill those resource-intensive processes. It is not unusual to see 500-1,000 queries or connections from such IPs. This rogue behavior wreaks havoc on the web server and the database server.
- Next, using a firewall (there are alternative ways to do this), block those known IPs. What can be even more annoying is that there could be more such IPs that are not initially visible. Note that Microsoft uses multiple bots to crawl a website. This is a temporary step, and we’ll come back to it later when the site is running again.
- Once the resource-eating processes are stopped and the rogue bots are blocked, the website should come back after a server reboot.
- Now that the site is back up, it’s time to log in to Bing Webmaster Tools.
- First off, we need to verify that the rogue IPs are really from Microsoft.
- Go to “Diagnostics & Tools” > “Verify Bingbot”, and enter the offending IP addresses one by one.
- e.g. 188 220.127.116.11 – Microsoft Corporation (MSFT) or 143 18.104.22.168 or 139 22.214.171.124
- Once these are verified as Bingbot IP addresses, we can confirm that they are registered under a bot name such as “msnbot-207-46-13-130.search.msn.com”, and further steps can be taken.
- Now, we’re ready to control the Crawl-Rate of BingBots
- Next, navigate to “Configure My Site” > “Crawl Control” within the Bing Webmaster Tools.
- The default view is “All Day” as shown below. Figure: Default View
- We need to update it to Custom view, where we can control and define the crawl rate. Figure: Custom view
- Additional steps can be taken in robots.txt by adding a Crawl-delay directive.
- The 2 lines below tell the bot to crawl a page only once every 5 seconds. This helps if the website has bandwidth constraints. However, it also reduces the total crawl and may result in less content being indexed.
- User-agent: msnbot
- Crawl-delay: 5
- Note that this is an advanced feature.
- Finally, it’s time to go back to the firewall and whitelist the IPs; otherwise Bing will receive HTTP 400-499 errors, as it would be unable to crawl the website.
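The verification step above can also be approximated from the command line with a reverse DNS lookup followed by a forward confirmation. The sketch below is one possible way to do that, not Bing's own tool. It assumes Bingbot hostnames end in `.search.msn.com` (taken from the bot name quoted in the steps), and the helper names `is_bing_hostname` and `verify_bingbot_ip` are hypothetical. The lookup functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Assumption: suffix taken from the bot name quoted in the steps above;
# check Bing's current documentation before relying on it.
BING_SUFFIX = ".search.msn.com"


def is_bing_hostname(hostname):
    """Return True if the hostname sits under the search.msn.com domain."""
    return hostname.lower().rstrip(".").endswith(BING_SUFFIX)


def verify_bingbot_ip(ip,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Reverse-resolve ip, check the domain, then forward-confirm it.

    Returns False on any lookup failure. Note that gethostbyname_ex
    only handles IPv4 addresses.
    """
    try:
        hostname, _aliases, _addrs = reverse_lookup(ip)
        if not is_bing_hostname(hostname):
            return False
        _name, _aliases, addresses = forward_lookup(hostname)
    except OSError:  # socket.herror and socket.gaierror are subclasses
        return False
    # The forward lookup must map back to the same IP, otherwise the
    # reverse record could have been spoofed.
    return ip in addresses
```

Calling `verify_bingbot_ip("207.46.13.130")` would perform live DNS lookups; that address is only inferred from the hostname quoted above, so the result depends on current DNS records. The “Verify Bingbot” tool remains the authoritative check.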
Figures: Before (default crawl) vs After (Custom Crawl)
No one knows for sure why Bing bots sometimes act in this rogue way, that is, crawling sites so excessively that they bring them down. Microsoft engineers need to answer this, as well as why they use multiple IPs to crawl the same website at the same time. Anyhow, for webmasters and web admins, BingBots over-crawling causes serious headaches, from bandwidth consumption to traffic loss and revenue loss, as the site may go down. The solution presented here is to control the BingBots’ behavior via Bing Webmaster Tools and a robots.txt directive.
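Before deploying the robots.txt directive, it can be worth checking that the file parses the way you expect. The sketch below uses Python's standard `urllib.robotparser` for that; the `ROBOTS_TXT` content mirrors the 5-second delay for `msnbot` described in the steps, and the helper name `crawl_delay_for` is hypothetical. It only shows how one standards-following parser reads the file, not how Bing itself interprets it.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content, mirroring the 5-second delay for msnbot.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 5
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for(ROBOTS_TXT, "msnbot"))
    print(crawl_delay_for(ROBOTS_TXT, "someotherbot"))
```

A `None` result for a given user agent means no delay rule matched it, which is a quick way to catch a misspelled bot name in the file.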