Page Authority Signals: Increasing Your Site's Googlebot Crawl Budget, Part 2
11/25/2014
This is the second installment of a five-part series (read Part 1).
As stated in the previous installment, the specific pages Googlebot chooses to crawl are prioritized based on their relative PageRank. For the purposes of crawl budget, PageRank is impacted by the following factors:
- Page authority ‐ how many links (including internal) point to the page
- Link structure ‐ how many clicks it takes to get to the page in order to crawl it
- Document age ‐ how long the page has existed in the index
- Freshness ‐ how recently the page content was updated
Let's take a look at each of these factors in detail:
If you look at an Internal Links report in Google Webmaster Tools, the target pages pre-sorted to the top of the list are those with the highest number of internal links pointing to them (that Googlebot can see). Compare this with the list of pages in a Landing Pages report for organic Google traffic in Google Analytics, and you're likely to find them to be quite similar.
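As a rough illustration of what that Internal Links report is counting, you can tally inbound internal links from a site's link graph yourself. This is a minimal sketch using a hypothetical set of pages (the URLs are made up for the example):

```python
from collections import Counter

# Hypothetical internal link graph: page -> pages it links to
outlinks = {
    "/": ["/products/", "/blog/", "/about/"],
    "/products/": ["/", "/products/widget/", "/products/gadget/"],
    "/blog/": ["/", "/blog/post-1/", "/products/widget/"],
    "/blog/post-1/": ["/products/widget/"],
}

# Count how many internal links point at each page, which mirrors
# the pre-sorted order of the Internal Links report.
inbound = Counter(target for targets in outlinks.values() for target in targets)

for page, count in inbound.most_common():
    print(page, count)
```

Pages that collect the most inbound links (here, the widget page with three) float to the top, exactly the pattern you'd compare against your organic landing pages.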
There are two ways to shift page authority around on your site, and they're not even mutually exclusive:
- Increase the number of internal links pointing to pages that you want to rank well
- Decrease the number of internal links pointing to the other pages
You might say, "well, just nofollow them and be done with it," and that's definitely a step in the right direction. But even nofollow'd links have to be processed during the crawl, and they may still count for something in unknowable ways that are subject to change, which makes a nofollow'd link a risk that wouldn't exist if the link itself didn't. Better to control for that, and for the possibility of it impacting the crawl budget, by actually paring down the links themselves wherever possible, and save rel="nofollow" as a tactic of last resort.
If your site employs horizontal pull-down menus, are they hierarchical such that they provide direct access to any page at any level from any other page (also known as mega-menus), like an HTML sitemap? Are those links crawlable? Then your site's internal link graph is basically uniform in all dimensions aside from the tiny differences formed by links found in content, which isn't super helpful.
Move the complete dump of all internal links to an HTML sitemap, and pare this navigation down to provide trailheads for common visitor tasks, then leave it up to each section to provide sub-page navigation.
If your site is organized into sections with multiple sub-pages, are you providing contextual navigation in each section? Look for opportunities here to add links to pages that need a boost: sub-pages within the section, cousin pages in other sections, or the index pages of other sections.
If your site offers sorted listings of content (e.g., products in a category, blog entries, etc.) that are separated into multiple pages, chances are good that the higher page numbers aren't being crawled very frequently and have a very low PageRank (if any).
First, check your assumptions about the need for pagination in each case. Are there enough entries to justify it, and if not, how fast do you expect the list to grow? The best solution may actually be to de-paginate that listing so that all entries appear on page 1, and not by way of infinite scroll, as Google still doesn't have a handle on crawling infinitely-scrolling pages.
Second, assuming pagination is appropriate, have you implemented Google's guidelines for paginated content? The best practice is to not only implement the rel=next/prev tags to define a page series, but also a rel=canonical tag on each member page pointing to that page's own URL (not to page 1). Savvy technical SEOs would also recommend appending "Page X" to the title and H1 tags for the listing, just for good measure.
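To make the pagination markup concrete, here's a sketch that generates those head tags for one page of a series. The URL pattern, listing name, and function are hypothetical, just to show how the self-referencing canonical and the prev/next pair fit together:

```python
def pagination_head(base_url, page, last_page):
    """Build the <head> tags and title for page `page` of a paginated listing."""
    url = base_url if page == 1 else f"{base_url}?page={page}"
    tags = [
        # Self-referencing canonical: each member page canonicalizes
        # to its own URL, not to page 1.
        f'<link rel="canonical" href="{url}">',
    ]
    if page > 1:
        prev_url = base_url if page == 2 else f"{base_url}?page={page - 1}"
        tags.append(f'<link rel="prev" href="{prev_url}">')
    if page < last_page:
        tags.append(f'<link rel="next" href="{base_url}?page={page + 1}">')
    # Append "Page X" to the title (and H1) for every page past the first.
    title = "Widgets" if page == 1 else f"Widgets - Page {page}"
    return tags, title
```

Calling `pagination_head("https://example.com/widgets/", 2, 5)` yields a canonical for page 2's own URL, a rel=prev back to page 1, and a rel=next on to page 3.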
These signals not only give the search engine the opportunity to roll up the value signals from each member of the series into a single entity representing the whole series (improving both the crawl rate and organic visibility for the entire series), but also provide more context for deciding which page in the series is most relevant to a particular search query (because it just might not be page 1).
Google would prefer sites to employ a flat link structure (not to be confused with URL structure): one that minimizes the number of clicks (well, crawl steps) it takes to get from the high-PageRank pages to the low-PageRank pages you want crawled more often.
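Click distance (crawl steps) from the home page can be measured with a simple breadth-first search over the internal link graph. A minimal sketch, again using a hypothetical graph:

```python
from collections import deque

def crawl_depths(outlinks, start="/"):
    """Return the minimum number of crawl steps from `start` to each page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in outlinks.get(page, []):
            if target not in depths:  # first time reached = shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical graph: the deep pages are the ones a flatter
# link structure (or an HTML sitemap) should lift closer to home.
outlinks = {
    "/": ["/blog/"],
    "/blog/": ["/blog/page-2/"],
    "/blog/page-2/": ["/blog/old-post/"],
}
print(crawl_depths(outlinks))  # the old post sits three steps from home
```

Adding a sitemap or cross-links that point directly at the deep pages shortens these paths, which is the whole point of flattening.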
As mentioned above, having an HTML sitemap (not to be confused with XML sitemap feeds) is a great start. Make sure your HTML sitemap stays up-to-date, including adds and drops.
Also consider any opportunity for cross-linking related content, including both manual additions and automated approaches such as YARPP for WordPress.
Basically, the longer a URL has been in the index, the more often it will be crawled. The date Google first crawled a particular URL is stored forever and, unlike PageRank, does not transfer via 301 redirect if you change the URL, so there's not much you can do about this factor.
It's interesting to note that if you bring back a page that's gone 404, it will hit the ground running (in terms of PageRank) from where it left off the last time it returned 200, so it's worth checking your broken links for anything you could conceivably restore at the currently-broken URL that would add value to the site.
Since the Caffeine update, Google has been able to update their index almost as fast as they can crawl the web, giving them a whole new dimension of detail on which sites (and which pages on a given site) are getting updated regularly, and which ones are responsible for that weird smell in the refrigerator.
A page with no recent updates is less likely to be crawled than a page that was updated more recently. To make sure your important pages continue to pass the sniff test, don't just push the peas around on the plate, but make routine enhancements and temporary/time-sensitive additions where possible. The idea is to increase the real value the page is providing to the web, and if you're doing that, you're already on the right track.
- The most important PageRank signals you can optimize are those related to page authority.
- Page authority can be optimized by:
  - acquiring more high-quality backlinks
  - rethinking global navigation links
  - either implementing proper pagination markup or just depaginating altogether
  - consolidating thin or duplicate content into fewer URLs
  - implementing a proper URL canonicalization scheme
- Internal link structure can be optimized by:
  - minimizing click distance (crawl steps) between high-PR and low-PR pages
  - improving cross-linking between related pages
  - providing a comprehensive HTML sitemap
- Google never forgets the first time it crawled a URL, and there's not a damn thing you can do about it.
- Stale content smells like rotting leftovers.
- Stale content is less likely to be crawled than recently-added content, or even recently-updated stale content.
In the next installment of our series on crawl budget, we'll take a look at sitemap feeds, so check out Part 3: Feeding the Bots.