Late last month, I noticed my posts weren’t getting picked up by Google. Usually, I can post an article and Google will have it indexed within an hour (although not cached). After it happened a second time, I logged into my Google Webmaster Tools account and immediately found the problem: the word ERRORS appeared next to my sitemap listing.
I started digging into my Webmaster Tools account and located the error in the sitemap section:
URL timeout: robots.txt timeout. We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.
Evidently, Google polls your robots.txt file before grabbing your sitemap. If the robots.txt request times out or returns a server error, Google gives up on pulling the sitemap or any other pages on your site and stops indexing it completely. I resubmitted the sitemap and waited for Google to try again. The next morning, I noticed in my Webmaster Tools that it had tried and failed, again. (See the picture below for the effects on Google crawling your site – can you guess where the timeouts were occurring in the crawl activity chart?)
robots.txt timeout effects
I double-checked my robots.txt file and could get to it with no problem. I also accessed it through a proxy, from a different network. Again, no problem. A timeout error can be tricky to diagnose, since it depends on the route and network between Google and your website.
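The check I was doing by hand can be sketched in a few lines: request a robots.txt URL with an explicit timeout and see what comes back. This is a minimal illustration, not the exact commands I ran; it spins up a throwaway local `http.server` instance to stand in for a real site so the example is self-contained – in practice you would point `check_robots` at your own domain.

```python
import http.server
import socketserver
import threading
import urllib.request

ROBOTS_BODY = b"User-agent: *\nAllow: /\n"

class RobotsHandler(http.server.BaseHTTPRequestHandler):
    """Tiny stand-in web server that only knows how to serve /robots.txt."""
    def do_GET(self):
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(ROBOTS_BODY)
        else:
            self.send_error(404)

    def log_message(self, *args):
        pass  # keep the demo quiet

def check_robots(url, timeout=5.0):
    """Fetch a robots.txt URL; raises URLError/timeout if unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read()

# Start the stand-in server on a random free port, then probe it.
server = socketserver.TCPServer(("127.0.0.1", 0), RobotsHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

status, body = check_robots(f"http://127.0.0.1:{port}/robots.txt")
print(status)  # 200
server.shutdown()
```

A timeout here would surface as a `socket.timeout` / `URLError` rather than a status code – which is exactly the failure mode Google was reporting, and one you can’t reproduce from your own network if the block is specific to Google’s IPs.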
Since I could get to it from several spots, I didn’t think it was network related, so I started inspecting the logfiles for the site. Oddly enough, there were NO requests from Google listed in the visitors log. That pretty much confirmed the problem: the requests were being blocked by my web host, somewhere inside their network. I quickly jumped into a support chat with them and they did some quick checks.
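The log check amounts to grepping for Googlebot’s user agent in the access log. Here is a small sketch of that, using a couple of made-up combined-format log lines for illustration; in my case the telling result was zero matching lines, meaning Google’s requests were dying before they ever reached the server.

```python
import re

# Fabricated sample lines in Apache combined log format (IPs are
# documentation/example addresses, not real visitor data).
SAMPLE_LOG = """\
66.249.66.1 - - [12/Mar/2009:10:15:01 -0500] "GET /robots.txt HTTP/1.1" 200 24 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.10 - - [12/Mar/2009:10:15:07 -0500] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"
"""

def googlebot_hits(log_text):
    """Return the log lines whose user-agent field mentions Googlebot."""
    return [line for line in log_text.splitlines()
            if re.search(r"Googlebot", line)]

hits = googlebot_hits(SAMPLE_LOG)
print(len(hits))  # 1 in this sample; an upstream block shows up as 0
```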
“No, nothing’s blocked,” they said. I re-explained the problem in a trouble ticket and found another example of the problem in one of their forums. The techs found the problem this time. Evidently, one of Google’s IP addresses had been auto-flagged by their network firewall as a denial-of-service attack, due to the many sequential page requests Google made in a short time as it followed my sitemap.
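I don’t know exactly what rule my host’s firewall used, but the behavior they described – flag an IP that makes too many requests in a short window – can be sketched as a simple sliding-window counter. The thresholds below are made-up illustration values, not the host’s real configuration; the point is that a well-behaved crawler walking a sitemap looks identical to this kind of rule.

```python
from collections import deque

class RateFlagger:
    """Flag an IP once it exceeds max_requests within window_seconds."""
    def __init__(self, max_requests=20, window_seconds=5.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self.seen = {}  # ip -> deque of recent request timestamps

    def request(self, ip, now):
        times = self.seen.setdefault(ip, deque())
        times.append(now)
        # Drop timestamps that have aged out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) > self.max_requests  # True => "looks like a DoS"

flagger = RateFlagger(max_requests=20, window_seconds=5.0)
# A crawler walking a sitemap: 30 requests spaced 0.1 s apart.
flags = [flagger.request("66.249.66.1", i * 0.1) for i in range(30)]
print(flags.index(True))  # index of the first request that trips the rule
```

Under these assumed thresholds the crawler trips the rule on its 21st request, even though each individual request is perfectly legitimate – which is presumably what happened to Googlebot.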
They unblocked the IP address and, like flipping a light switch, Google was indexing my site again. Fortunately, I think the problem is finally resolved. This is something else to watch for if your site suddenly stops being indexed by Google.