Increasing Your Site's Googlebot Crawl Budget, Part 1
11/18/2014
Since Google's Caffeine update in 2010, which enabled more frequent updates to the index after pages have been crawled, the average organic Google ranking for any given page (and thus its organic traffic volume) has been shown to be strongly correlated with how recently it was crawled, within a window of roughly two weeks.
This means that pages crawled within the past two weeks will, generally speaking, rank better than pages that haven't been crawled for at least two weeks.
...but, doesn't Googlebot crawl my site every day?
Probably, yes. In fact, it might even be visiting multiple times per day. However, it's a common misconception that Google crawls your entire site every time it visits. Or even the first time it visits a new site. Or ever at all, necessarily.
The reality is that if an entire site does get crawled, it's not because of any imperative for Googlebot to be comprehensive. It's because each of those pages has earned, either directly or indirectly, the limited attention of the crawler. It's kind of a popularity contest crossed with an endless lap race.
According to Google's Matt Cutts in an interview with Eric Enge, every site that Googlebot crawls is assigned a crawl budget which is based on two major areas of concern:
1. PageRank
Google's original secret sauce for search result ranking, which grades page authority on a scale of 0-10 based on the number and quality of links pointing to it. Although Google has since added at least 200 other ranking signals to improve the quality and relevance of their search results, the term "PageRank" is often misused as conversational shorthand for Google's ranking algorithm as a whole.
Since adding all those other ranking factors, Google has been discouraging site owners from obsessing about their PageRank for the purposes of improving their rankings (so much so that they removed the PR indicator from the Google Toolbar, though you can still look it up using third-party tools), primarily because optimizing for those other factors is more likely to improve a site's value to actual searchers rather than just search engines.
Googlebot uses PageRank to prioritize the pages to crawl on a website. In this series on crawl budget, we're going to refer to PageRank only in the abstract. That is to say, we're not concerned with the PR value itself, only with whether a given webpage's PR is "high" or "low" relative to the rest of the site, and even then only as convenient shorthand for describing the condition of a page's authority signals (more on that below).
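To make the prioritization idea concrete, here's a toy sketch in Python. It is not how Googlebot actually works (Google hasn't published that); the authority scores, the URLs, and the pages_to_crawl function are all made up for illustration. It simply shows what happens when a fixed budget is spent on the highest-authority pages first.

```python
import heapq
from typing import Dict, List


def pages_to_crawl(authority: Dict[str, float], budget: int) -> List[str]:
    """Pick the `budget` highest-authority URLs, highest first.

    `authority` maps URL -> a relative authority score (hypothetical values,
    standing in for "high" vs. "low" PageRank within one site).
    """
    return heapq.nlargest(budget, authority, key=authority.get)


# Hypothetical site with four crawlable URLs and a budget of two fetches.
site = {"/": 0.9, "/products": 0.6, "/blog": 0.4, "/old-promo": 0.1}
crawled = pages_to_crawl(site, budget=2)
```

With a budget of 2 and four pages, the two lowest-authority URLs never get visited on that pass, which is exactly the situation a limited crawl budget creates.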
2. Host load
Google wants to crawl your site quickly because, well, they have a lot of sites to crawl. The fastest way to crawl a site is by making as many concurrent requests as you can. However, if a site receives too many simultaneous requests it can become very slow to respond, so Googlebot's crawls are analyzed like a load test and the crawl rate for each site is dynamically adjusted based on recent performance. Sites that have demonstrated the capacity to handle higher concurrency rates get a larger crawl budget since Google can "go deeper" in the same allotted time.
There are many factors with the potential to impact host load, ranging from the web server itself to the way a page is coded. We'll cover these factors in detail in a later installment.
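Here's a back-of-the-envelope sketch of why host load matters so much. Google hasn't published its actual adjustment rules, so the back-off rule, the thresholds, and both function names below are assumptions chosen purely for illustration: halve concurrency when the site is slow, otherwise ramp up by one.

```python
def next_concurrency(current: int, avg_response_ms: float,
                     slow_threshold_ms: float = 1000.0,
                     max_concurrency: int = 16) -> int:
    """Toy rule: back off sharply when the host is slow, else add one connection."""
    if avg_response_ms > slow_threshold_ms:
        return max(1, current // 2)
    return min(max_concurrency, current + 1)


def pages_in_window(concurrency: int, avg_response_ms: float,
                    window_s: float) -> int:
    """Rough number of pages fetched in `window_s` seconds at a steady rate."""
    return int(concurrency * window_s * 1000 / avg_response_ms)


# Same 60-second window, two hypothetical sites.
fast_site = pages_in_window(concurrency=4, avg_response_ms=500, window_s=60)
slow_site = pages_in_window(concurrency=2, avg_response_ms=2000, window_s=60)
```

In this toy model the faster site gets eight times as many pages fetched in the same minute, which is the "go deeper in the same allotted time" effect in miniature.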
First Step: Consolidate
What's the easiest way to get a higher percentage of your site crawled more often?
Answer: Make your site smaller!
Over time, most sites end up having some extraneous and semi-duplicated content, and most dynamic web sites have more than one valid URL for at least some if not all of the content on the site. In either case, your crawl budget is being wasted on them, and their page authority is diluted because of them.
Crawl your site with a tool such as Screaming Frog SEO Spider and look for duplicate or outdated pages that are crawlable but have been superseded by newer content on other pages. Often they remain crawlable due to links in content that were not updated or redirected.
You also want to look for similar thin-content pages that could be merged into a single page with multiple sections. The goal here is twofold:
- to concentrate the authority signals spread among those pages into a single page (or fewer pages, anyway) that would benefit from the combined authority of each individual page, and
- to improve crawl efficiency by reducing the number of distinct crawlable URLs on the site to as few as possible, thereby increasing each remaining URL's slice of the crawl budget, ensuring these URLs will be crawled more often.
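If your crawler can export each URL's HTML, a rough first pass at spotting exact duplicates is to fingerprint the visible text and group URLs that share a fingerprint. The sketch below assumes you've already loaded the crawl into a dict of URL to HTML; the function names and the crude regex tag stripping are illustrative assumptions, and it only catches identical text, not near-duplicates or thin content.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def content_fingerprint(html_text: str) -> str:
    """Hash the page text with tags stripped, whitespace collapsed, lowercased."""
    text = re.sub(r"<[^>]+>", " ", html_text)          # crude tag stripping
    text = re.sub(r"\s+", " ", text).strip().lower()   # normalize whitespace/case
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_duplicate_groups(pages: Dict[str, str]) -> List[List[str]]:
    """Return groups of URLs whose normalized text is identical."""
    groups = defaultdict(list)
    for url, html_text in pages.items():
        groups[content_fingerprint(html_text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl export: the same page reachable at two URLs.
crawl = {
    "/shoes": "<p>Hello  World</p>",
    "/shoes?sort=asc": "<p>hello world</p>",
    "/hats": "<p>Something else</p>",
}
duplicates = find_duplicate_groups(crawl)
```

Each group it returns is a candidate for consolidation: pick one URL to keep and redirect or merge the rest.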
Aside from obvious content duplication or overlap, most sites are wasting a ton of page authority on the proliferation of multiple valid URLs for a given page. It's critical to your crawl efficiency to ensure that all crawlable URLs are properly canonicalized, meaning:
- the canonical (read "preferred") URL for each page has been determined, including scheme (HTTP or HTTPS), hostname (w's or no?), and any URL parameters that affect the content of the page (page number, product ID, etc.)
- internal links to each page uniformly specify the canonical URL, and do so in either absolute or root-relative URL format
- the canonical URL for each page is enforced by 301 redirect
- the canonical URL for each page is specified in a rel=canonical tag in absolute format (not necessary if canonical URL enforcement 301 redirects are comprehensive)
- all optional URL parameters (those that don't change the content of the page: sort order, search terms, breadcrumb paths, always-true flags, etc.) are filtered explicitly in Webmaster Tools.
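As a rough illustration of the first and third points in the list above, here's a minimal Python sketch of a canonicalizer and the 301 decision that goes with it. The HTTPS scheme, the www.example.com hostname, and the CONTENT_PARAMS set are assumptions standing in for whatever you decide is canonical on your own site, and both function names are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CANONICAL_SCHEME = "https"               # assumption: HTTPS is preferred
CANONICAL_HOST = "www.example.com"       # assumption: "www" hostname is preferred
CONTENT_PARAMS = {"page", "product_id"}  # assumption: params that change content


def canonical_url(url: str) -> str:
    """Force scheme/host, drop optional params, sort the ones that remain."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in CONTENT_PARAMS
    )
    path = parts.path or "/"
    # Fragments are dropped: they never reach the server anyway.
    return urlunsplit((CANONICAL_SCHEME, CANONICAL_HOST, path, urlencode(kept), ""))


def redirect_target(url: str) -> Optional[str]:
    """Return the URL to 301 to, or None if the request is already canonical."""
    target = canonical_url(url)
    return None if target == url else target


target = redirect_target("http://example.com/shoes?sort=price&page=2")
```

Here the request for the non-www, HTTP, sorted listing would be redirected to https://www.example.com/shoes?page=2, while a request that already matches its canonical form gets no redirect at all. The same canonical_url output is what you'd put in the rel=canonical tag.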
We typically see a dramatic improvement in crawl budget and organic traffic (sometimes 2-3x), particularly for large sites on SEO-afterthought CMSes, just from cleaning up the URL canonicalization issues alone.
Summary
- Google has no imperative to crawl your entire website.
- Googlebot is given a limited but dynamic crawl budget based largely on PageRank and page speed.
- URLs are prioritized for crawling based on PageRank.
- Highly similar or identical page content should be consolidated to concentrate its PageRank.
- Pick a canonical URL for each page and make sure Googlebot can figure it out.
- Eliminating duplicate content and/or reducing the number of URLs competing for crawl budget will improve the crawl frequency for the remaining pages and will likely improve organic search traffic volume.
In the next installment we'll dive into specific PageRank factors that impact crawl budget, so check out Part 2: Page Authority Signals.