
google sitemap timeout errors


Late last month, I noticed my posts weren’t getting picked up by google. Usually, I can post an article and within an hour, Google will have it indexed (although not cached). After it happened the second time, I logged into my google webmaster tools account. I immediately found the problem. The word ERRORS appeared next to my sitemap listing.

I started digging into my webmaster tools account and located the error in the sitemap section:

URL timeout: robots.txt timeout. We encountered an error while trying to access your Sitemap. Please ensure your Sitemap follows our guidelines and can be accessed at the location you provided and then resubmit.

Evidently, google polls your robots.txt file before grabbing your sitemap. If it can’t get the robots.txt file at all (a timeout or similar failure, as opposed to a clean 404 Not Found), it gives up on pulling the sitemap or any other pages on your site and stops indexing it completely. I resubmitted the sitemap and waited for google to try again. The next morning, I noticed in my webmaster tools that it had tried and failed again. (See the picture below for the effect on google crawling your site – can you guess where the timeouts were occurring in the crawl activity chart?)

robots.txt timeout effects

I double-checked my robots.txt file and could get to it with no problem. I also tried accessing it through a proxy from another location; again, no problem. A timeout error can be tricky to diagnose, since it depends on the route and network between Google and your website.
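
If you want to run the same quick check from your own machine, here’s a minimal sketch in Python (using a placeholder domain, so substitute your own site) that fetches robots.txt with a short timeout and, only if that succeeds, tries the sitemap, roughly the order google follows:

    import socket
    import urllib.error
    import urllib.request

    # Placeholder URLs; substitute your own domain here.
    ROBOTS_URL = "http://example.com/robots.txt"
    SITEMAP_URL = "http://example.com/sitemap.xml"

    def fetch(url, timeout=10):
        """Fetch a URL with a short timeout and report what happened."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                print(f"{url}: HTTP {resp.status}, {len(resp.read())} bytes")
                return True
        except urllib.error.HTTPError as err:
            print(f"{url}: HTTP error {err.code}")
        except (urllib.error.URLError, socket.timeout) as err:
            print(f"{url}: request failed ({err})")
        return False

    # robots.txt gets checked first; if it can't be fetched, don't expect
    # the sitemap (or anything else) to be crawled.
    if fetch(ROBOTS_URL):
        fetch(SITEMAP_URL)

Of course, a check like this only proves the files are reachable from wherever you run it; it can’t rule out a block sitting between google’s network and yours.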

Since I could get to it from several spots, I didn’t think it was network related, so I started inspecting the site’s logfiles. Oddly enough, there were NO requests from google listed in the visitor log. That pretty much confirmed the problem: the requests were being blocked by my webhost, somewhere inside their network.
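
If you want to run the same check against your own logs, here’s a rough sketch, assuming an Apache/nginx-style combined access log at a hypothetical path, that pulls out any requests identifying themselves as Googlebot:

    import re

    # Hypothetical log location; adjust for your server. Assumes a standard
    # Apache/nginx "combined" access log format.
    LOG_PATH = "/var/log/apache2/access.log"

    googlebot_hits = []
    with open(LOG_PATH) as log:
        for line in log:
            # Googlebot identifies itself in the User-Agent string.
            if "Googlebot" not in line:
                continue
            ip = line.split()[0]  # the client IP is the first field
            match = re.search(r'"([A-Z]+ [^"]*)"', line)
            request = match.group(1) if match else "?"
            googlebot_hits.append((ip, request))

    if googlebot_hits:
        print(f"{len(googlebot_hits)} Googlebot requests, most recent:")
        for ip, request in googlebot_hits[-10:]:
            print(f"  {ip}  {request}")
    else:
        print("No Googlebot requests found in this log.")

If that turns up nothing over a stretch when google should have been crawling, the requests are probably being dropped before they ever reach your server.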

I quickly jumped into a support chat with my webhost and they ran some quick checks. “No, nothing’s blocked,” they said. I re-explained the problem in a trouble ticket and found another example of the same issue in one of their forums. The techs found the problem this time. Evidently, one of google’s IP addresses had been auto-flagged by their network firewall as a denial-of-service attack, thanks to the multiple sequential page requests google makes in a short time as it follows a sitemap.

They unblocked the IP address, and like flipping a light switch, google was indexing my site again. Fortunately, I think the problem is finally resolved. This is something else to watch for if your site suddenly stops being indexed by google.
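
One more tip if you end up in the same spot: you can verify whether a flagged address really belongs to google before asking your host to unblock it. Google’s recommended check is a reverse DNS lookup (the hostname should end in googlebot.com or google.com) followed by a forward lookup to confirm the name maps back to the same IP. A rough Python sketch of that check:

    import socket

    def is_googlebot(ip):
        """Reverse DNS check plus forward confirmation for a crawler IP."""
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        # Genuine Googlebot addresses reverse-resolve to googlebot.com
        # (or google.com for some of google's other crawlers).
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            # Confirm the hostname maps back to the same address.
            return socket.gethostbyname(hostname) == ip
        except socket.gaierror:
            return False

    # Example: feed it an address pulled from your firewall or access log.
    print(is_googlebot("66.249.66.1"))

Anything that passes that test is a legitimate google crawler and safe to let through.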


5 Responses

  • [...] This is a technical problem. If you want more info, it’s better to read this article: http://beecherbowers.com/2009/02/03/google-sitemap-timeout-errors/ [...]

  • [...] editing the robots.txt file and creating a sitemap, I think their site traffic will be picking up within a week or so. Just another gotcha to watch [...]

  • My webhost came back to me and said they didn’t block the IPs (they provided the IPs in my support ticket reply). I’ve been having this problem for a few days now, after I changed my post title structures... dang

  • The googlebot IPs probably change over time; it might be best to check with your host and let them know what’s going on. My host checked their firewall on request and was able to find the blocked address. Their firewall may auto-resolve the IPs to hostnames, which would be how they could identify it.

  • I’ve just been hit by the exact same issue – my site hasn’t been crawled for the whole of April. Any more info on the IP addresses used by the Googlebots?

    Thanks
    Dan
