Crawl Errors: Increasing Your Site's Googlebot Crawl Budget, Part 4
12/09/2014
Gateway timeout... File not found. Authorization Required. Forbidden! INTERNAL SERVER ERROR!
If search bots are spending time on your site just running into walls and spinning around in circles, that's time they won't be spending crawling your carefully-optimized content. And as we know from Part 1 of this series, the more often Googlebot crawls a page, the better that page will rank.
But persistent HTTP errors and other bot-confounding behaviors just squander your crawl budget, limiting the potential for improvement elsewhere. In this installment, we'll look at the different types of crawl errors, what they mean, and how to discover and monitor them on your site.
Broken links just happen. Writers mangle links while editing content. Developers mangle links when the logic to auto-generate those links fails to account for some potential condition (or if they're just ignorant of basic on-page technical SEO). With multi-user CMS environments, people fat-finger the delete button or change settings that can cause important pages to suddenly go missing and return "404 Not Found" errors. The bots will follow these links repeatedly.
404s are okay in and of themselves; they're not going to hurt your rankings unless more of your indexed pages are returning 404 than not (e.g., high-turnover content like individual URLs of for-sale listings, or large-catalog e-commerce with fluctuating availability). Outside of such extreme cases, Google knows that accidents happen and will re-crawl previously-good URLs that have gone 404 just in case they've come back.
That's a great service, but your site's crawl budget is still eroded by every request that Googlebot spends handling an error response instead of discovering valuable content. So it's important to monitor 400-level errors and resolve whatever you can, whether that means restoring content on that URL so it returns 200, or 301-redirecting it to a functioning URL with similar content in order to preserve the page authority that the now-404ing URL had accumulated.
Google does a decent job of detecting and reporting "soft 404s", where the site responds to a request for a non-existent URL with a temporary redirect (302 instead of 301) to a page that reads like a 404 page but returns 200 instead. Many CMSes do this automatically, and it's especially common on IIS web servers and .NET-based web applications, but it's an insane, zero-sum ritual to put the bots through, and utterly squanders your crawl budget.
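As a rough illustration of that pattern, here is a minimal sketch of a soft-404 check you could run against the response chain for a URL you know doesn't exist. The `Hop` structure, the function name, and the phrase list are all hypothetical, and this is a simple heuristic, not a description of how Google detects soft 404s.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical phrases that suggest an error page; tune these for your own site.
NOT_FOUND_PHRASES = ("page not found", "404", "no longer available")


@dataclass
class Hop:
    """One response in a redirect chain (an illustrative structure, not a library type)."""
    status: int
    body: str = ""


def is_soft_404(chain: List[Hop]) -> bool:
    """Return True if the chain ends in a 200 page that reads like an error page.

    Intended for URLs that are known not to exist, so a correct server
    would have answered with a real 404 (or 410) instead.
    """
    if not chain:
        return False
    final = chain[-1]
    if final.status != 200:
        return False  # a real error status is the behavior we want
    text = final.body.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)
```

For example, a chain of `[Hop(302), Hop(200, "<h1>Page Not Found</h1>")]` is flagged, while a single `Hop(404, "Page not found")` is not, because the server reported the error honestly.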
If Googlebot is reporting 401s or 403s, it's probably on a type of URL it shouldn't be trying to crawl in the first place (such as a login page or images directory), which at the very least means there are some links on the site to nofollow, but the best bet is to block them in robots.txt and, if possible, on-page via a meta robots tag.
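Before relying on a robots.txt change, it's worth checking that the rules block what you think they block. The sketch below uses Python's standard `urllib.robotparser` to test a few URLs against a set of rules. The rules and the example.com URLs are made up for illustration, and the standard-library parser may not match Googlebot's own interpretation in every edge case.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; the paths are assumptions, not taken from a real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /login
Disallow: /images/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/login", "/images/logo.png", "/blog/crawl-budget"):
        url = "https://www.example.com" + path
        print(path, is_crawlable(ROBOTS_TXT, "Googlebot", url))
```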
It's not uncommon to see random, transient 500-level errors, especially on higher-traffic sites when the crawler visits during a traffic spike. They're generally nothing to worry about if they're irregular and uncommon, but if you're seeing repeat occurrences of the same 500-level error (particularly "500 Internal Server Error"), it's important to figure out why.
If the URL returning 500 is malformed, figure out where Googlebot is finding that link and fix it there. If properly-formed URLs are resulting in 500 errors and you can repeat this yourself in a browser, then there's something wrong with either the application logic (e.g., the CMS) or the configuration of the web server software (e.g., Apache, IIS, nginx) that's causing the web server to completely choke on the request.
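To separate the one-off blips from the repeat offenders, you can count 500-level responses per URL and status code. This is a minimal sketch under assumed inputs: the `(url, status)` pairs would come from your own logs or crawl reports, and the `min_repeats` threshold of 3 is an arbitrary starting point, not a recommended value.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def repeated_server_errors(
    hits: Iterable[Tuple[str, int]], min_repeats: int = 3
) -> List[Tuple[str, int, int]]:
    """Return (url, status, count) for URLs that returned the same 5xx status repeatedly.

    Results are sorted by count (highest first), then by URL.
    """
    counts = Counter(
        (url, status) for url, status in hits if 500 <= status <= 599
    )
    flagged = [
        (url, status, n) for (url, status), n in counts.items() if n >= min_repeats
    ]
    return sorted(flagged, key=lambda row: (-row[2], row[0]))
```

A URL that shows up here is a candidate for the "figure out why" treatment; a URL with a single stray 503 is not.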
From the horse's mouth
The easiest way to discover crawl errors on your site is by checking your Google Webmaster Tools and Bing Webmaster Tools. That's both, not either. This blog series is specifically about Googlebot crawl budget due to limited industry knowledge about Bingbot crawl prioritization relative to what we know from Google, but Bing Webmaster Tools can give you more information than you'd get from GWMT alone, and it's safe to assume for the purposes of keeping your site in shape that anything Bingbot detects is detectable by Googlebot as well, regardless of whether they report it.
One nice feature of Google's crawl errors report is the ability to mark a listed error as having been fixed. It doesn't cause an immediate re-crawl of the page, but it removes it from the list until the next time Googlebot sees the same error. Use this feature as you fix the issues, with the goal of keeping the report empty. That way you're more likely to notice if the issue is reported again on a subsequent crawl, in which case it's time to go back to the drawing board.
Crawl it yourself
There are dozens of tools out there that will crawl your entire site and report any errors they encounter, though it still takes a keen technical eye to interpret the reports accurately. As mentioned in previous installments of this series, Screaming Frog SEO Spider remains a top choice, not least because you can run it locally and exert fine-grained control over how it crawls your site, and it generates reports well suited to fixing crawl errors. The free version stops crawling after 500 URLs, which for many sites is more than enough.
Logfiles. Yes, logfiles.
It's a common mistake to assume that the webmaster tools reports are telling you everything you need to know about potential crawl issues with your site. In fact, the only way to get a truly comprehensive view of actual crawl errors encountered on your site is by analyzing your server logs for search bot activity.
The most common method is to filter the logs down to just the bot traffic, then run it through your favorite logfile analysis tool to generate reports of activity by status code and by bot. For Dirigo's clients, we use a combination of off-the-shelf and custom in-house tools to analyze and monitor search bot activity.
The technically savvy can perform ad-hoc logfile analysis using the command-line tool grep, but if you're averse to regular expressions, there are some commercial bot log analysis tools available that will do the trick.
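If grep isn't your thing but a little scripting is, the filter-then-report approach can be sketched in a few lines of Python. This assumes your server writes the common "combined" access log format; the regular expression, the `BOTS` list, and the function name are illustrative and will need adjusting for your own setup. Note that matching on the user-agent string alone doesn't verify that a request really came from the bot it claims to be.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the widely used "combined" access log format; adjust if yours differs.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# User-agent substrings for the bots of interest (an assumed, non-exhaustive list).
BOTS = {"Googlebot": "googlebot", "Bingbot": "bingbot"}


def bot_status_report(lines: Iterable[str]) -> Counter:
    """Count requests per (bot, status code), ignoring non-bot and unparseable lines."""
    report: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that don't fit the expected format
        agent = match.group("agent").lower()
        for bot, token in BOTS.items():
            if token in agent:
                report[(bot, int(match.group("status")))] += 1
                break
    return report
```

To run it against a real log, pass an open file object, e.g. `bot_status_report(open("access.log"))` (the filename here is a placeholder), and look first at the counts for 302s and anything in the 400 or 500 range.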
I would love to recommend the one I co-developed with Vanessa Fox while at Nine By Blue as part of the Blueprint analytics offering, but it has since been acquired by Rimm-Kaufman Group along with the rest of Nine By Blue, and rather than open up Blueprint for self-service, they've apparently decided to keep it for themselves as a competitive advantage. Not that I can blame them; it did kick ass.
The closest commercial analogue to the enterprise-class search bot activity analysis component of Blueprint would be Botify, a SaaS platform that collects and parses your log data nightly to generate a dizzying array of pretty reports about your site's bot activity.
For large datasets (higher-traffic sites), though, it may be more cost-effective to configure your own bot reports in a general-purpose log analysis engine such as Splunk or Sawmill or, if you prefer free software, Piwik or logfilt.
But how much log data do you need? For small sites or sites with light traffic, analyze a whole month of data. For sites with a moderate traffic volume, you want at least a solid week's worth for accurate bot activity reporting. If the logs run into the gigabytes per day, you should be able to get a fair representative sample out of a single 24-hour period, as long as the size of the log data falls within the daily median range (i.e., a typical day).
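One way to sanity-check that a given day is "typical" is to compare its log size against the median across a longer stretch. The sketch below does that; the 20% tolerance is an assumption picked for illustration, not a figure from any tool or guideline, and the function name and day labels are hypothetical.

```python
from statistics import median
from typing import Dict


def is_typical_day(
    sizes_by_day: Dict[str, int], day: str, tolerance: float = 0.2
) -> bool:
    """Return True if the log size for `day` is within `tolerance` of the median size."""
    mid = median(sizes_by_day.values())
    return abs(sizes_by_day[day] - mid) <= tolerance * mid
```

With daily sizes of 100, 110, 300, 95, and 105 (in whatever unit you measure), the median is 105, so the 110 day passes and the 300 day, likely a traffic spike, does not.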
- Crawl errors kill crawl budgets.
- Broken links happen, but keep fixing them.
- Soft 404 errors give Googlebot these Kafkaesque nightmare visits to your site that can only end in madness (THANKS MICROSOFT!)
- Irregular 500-level errors are NBD, but if they get more frequent and stay random, you probably need more server bandwidth.
- Repeated errors that aren't just 404s are usually from links that shouldn't be crawlable, but if they definitely should, then the developers screwed something up.
- Watch the crawl error reports in both Google and Bing Webmaster Tools, but don't rely on them alone.
- Use site-crawling tools to discover errors, hopefully before the bots do.
- Analyze your web server access logs for 302s and 400- and 500-level errors, filtering for bots if possible.
In the next installment, we'll cover page speed issues from the perspective of crawl budget, so check out Part 5: Page Speed.