Feeding the Bots: Increasing Your Googlebot Crawl Budget, Part 3
12/04/2014
This part of the series was supposed to be about sitemap feeds (hence the title), but no discussion of how and what to feed the bots to increase your crawl budget would be complete without revisiting the single most important factor: freshness.
Freshness? I'm talkin' sushi-grade, son!
The more often you blog on your site, add new products, get social shares (particularly on Google Plus), and organically gain backlinks from relevant sites to these new pieces of content, the more often Googlebot will visit.
In terms of organic search traffic, the difference between occasional blog posting at unpredictable intervals and a routine editorial schedule that pumps out weekly or at least bi-weekly posts is usually dramatic.
It's typical to see a tripling or quadrupling of organic referrals and crawl frequency on sites that go from blog-as-quarterly-newsletter to really finding a voice and posting about the site's topics on a routine schedule.
The Map Is Not the Territory, But It Sure Does Help You Get There Faster
The need for XML sitemap feeds arose from the realization that, no matter how advanced the crawling logic, Googlebot was simply never going to reach all the web content hiding behind CMSes that weren't built with crawlability in mind. So Google provided a method for webmasters to supply an explicit list of the URLs we want crawled.
Far from treating these webmaster-provided lists as a prioritization mandate, Google accepts them as, shall we say, helpful suggestions about how to prioritize the crawl of your site. Just as they don't necessarily index every URL they crawl on their own, they probably aren't going to index every URL specified in a sitemap feed either.
The selection of URLs to crawl from a sitemap feed is still prioritized by the same crawl-budget factors as a fully self-directed crawl, but Googlebot is still more likely to crawl low-PR pages frequently when they're listed in your sitemap than when they aren't. So ensure that your XML sitemap feeds are complete, meaning they include every distinct canonical URL you would like to have indexed.
Many CMSes generate them for you, and a few even do it correctly, such as Joost de Valk's WordPress SEO plugin. But if you aren't getting what you need from the CMS-generated sitemaps, you can always find a script to generate your own, or let a tool like Screaming Frog do it for you. Just be sure to exclude anything you don't want crawled, such as PPC landing pages, members-only content, alternate blog listings by date, category or author, etc.
OMG, XML or TXT!?
Sitemap feeds don't even have to be in XML format. You can submit the URL of a plain text file (UTF-8 encoded, if your URLs contain non-ASCII characters) that simply lists one URL per line and nothing else. It works great, and for whatever this is worth to the Google-fearing among us, in my entire career I've never seen evidence that supplying this one-column CSV format vs. the more complex and potentially error-prone XML format makes any difference to crawl rate, indexation rate, organic traffic or organic rankings.
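To make that concrete, here's a sketch of an entire valid plain-text sitemap for a hypothetical site (the example.com URLs are placeholders, obviously):

```text
https://www.example.com/
https://www.example.com/products/blue-widget
https://www.example.com/blog/2014/04/fresh-sushi-post
```

Per the sitemaps.org protocol, the file must contain nothing but fully-qualified URLs, one per line, and must be UTF-8 encoded.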
The sole benefit of using the XML format is that it gives you an opportunity to supply additional metadata about each URL, which the bots may take under consideration:
- Relative priority, in the form of a single-decimal value between 0.0 and 1.0, that's supposed to act like a proposed crawl-rate heatmap of your content hierarchy.
- Change frequency, as in "daily", "weekly", "monthly", "yearly"; great for turning the crawl rate way down on the truly static content, freeing up more budget for the fresh stuff and for the buried gems you want them to find.
- Last-updated timestamp, which is an excellent way to shift crawler attention to just-refreshed content they'd pretty much left for dead at this point.
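Put together, a minimal XML sitemap carrying all three tags looks like this (the URLs and dates are made-up placeholders; the schema is the standard one from sitemaps.org):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/2014/04/fresh-sushi-post</loc>
    <lastmod>2014-04-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <!-- static page: no meaningful lastmod is known, so that tag is omitted rather than guessed -->
    <loc>https://www.example.com/about</loc>
    <changefreq>yearly</changefreq>
    <priority>0.3</priority>
  </url>
</urlset>
```

All three metadata tags are optional per the protocol; only `<loc>` is required for each entry.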
If the bots do take these supplied values into consideration, and if you're smart about how you distribute those signals in the XML sitemaps, you can influence the reallocation of crawl budget and encourage more frequent crawling of the areas of your site that need attention.
Conversely, if you don't know or can't choose a reasonable value for a particular tag for some URLs, don't just make up a value; drop the tag altogether for those URLs.
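If you're scripting your own sitemap generator, that "omit rather than guess" rule is easy to bake in. Here's a minimal sketch in Python (the `url_entry` helper is hypothetical, not part of any library) that emits a `<url>` entry containing only the tags you actually supplied:

```python
from xml.sax.saxutils import escape


def url_entry(loc, lastmod=None, changefreq=None, priority=None):
    """Build one sitemap <url> entry, emitting only the optional
    tags for which a meaningful value was supplied."""
    parts = [f"  <loc>{escape(loc)}</loc>"]
    if lastmod:
        parts.append(f"  <lastmod>{lastmod}</lastmod>")
    if changefreq:
        parts.append(f"  <changefreq>{changefreq}</changefreq>")
    if priority is not None:
        parts.append(f"  <priority>{priority:.1f}</priority>")
    return "<url>\n" + "\n".join(parts) + "\n</url>"


# A fresh blog post gets the full treatment...
print(url_entry("https://www.example.com/blog/2014/04/fresh-sushi-post",
                lastmod="2014-04-10", changefreq="weekly", priority=0.8))

# ...while a page we know nothing extra about gets a bare <loc> only.
print(url_entry("https://www.example.com/about"))
```

Wrap the concatenated entries in a `<urlset>` element and you have a feed that sends only real signals, no noise.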
- Fresh content is like yummy sushi just caught this morning, and Googlebot loooooooooves it some good sushi.
- Sitemap feeds can encourage more frequent crawling of important pages.
- A sitemap can just be a plain list of URLs in a text file, but come on, what's the fun in that?
- XML sitemaps can give Googlebot a map of the relative priority and update frequency of each page on your site.
- May improve crawl frequency, which generally improves rankings.
- Don't just guess your way through the quiz; maximize the signal-to-noise ratio by entirely omitting any XML tags for which you have no meaningful value.
In the next installment we'll look at all the errors Googlebot is getting that are chewing up your crawl budget, so check out Part 4: Crawl Errors.