Page Speed: Increasing Your Site's Googlebot Crawl Budget, Part 5
05/01/2015
Large-scale studies have shown that conversions and revenue decrease as page load time increases. Google also knows that its users are less likely to bounce back to the SERPs from sites that load faster than their ranking competitors, so it makes perfect sense that Google would give faster websites a bump.
But what's Google's definition of "fast"? Their general guideline for average page load time is to keep it under two seconds.
Good web development work alone won't get you anywhere near that number; there are several technical layers of potential inefficiencies that each require careful consideration and fine-tuning to achieve maximum performance from the entire web stack, and these will vary from site to site.
While I couldn't possibly hope to provide a comprehensive reference on website performance tuning in the scope of a blog about increasing your Googlebot crawl budget, what follows is a high-level overview of the broad scope of factors that have the potential to improve page speed in ways meaningful to Googlebot and pals.
Server Response Time: Savoring the Last Byte
Though not directly a ranking factor (anymore), Google considers your web server's average response time to Googlebot's requests to be a crawl-rate-limiting factor that can impact your site's crawl budget.
It's easy to imagine that search engine crawlers like Googlebot would be crawling websites as fast as they possibly can, but in just about any case, "as fast as Google possibly can" would kill your server, so instead they determine a crawl rate that is minimally-disruptive to your site's ability to continue serving visitor traffic during a crawl.
Google refers to your site's apparent capacity for concurrent requests, and to the scale of Googlebot's activity on your site in general, as host load. Google Webmaster Tools reports basic host load statistics in the Crawl Stats report, where you can see both daily and running averages for three different host-load-related metrics: crawl rate in pages/day, bandwidth usage in KB/day, and response time in ms/request.
Average response time depends on the combined performance of:
- your web host's datacenter's internal network & backbone connectivity
- the physical server hardware configuration
- the web server software configuration
- the web application (CMS, e-commerce platform, etc.) that generates the site's pages
Any improvements to your site's responsiveness will increase its capacity for serving concurrent visitors. When Googlebot detects this, it will try to crawl faster next time. Faster crawling means efficient crawling, freeing up more of your crawl budget for deeper and more frequent crawling of the low-PageRank content.
But where do we begin? Let's start at the very bottom of the stack and work our way up:
Domain Name Servers
Before a session can even begin on your site, the visitor has to either type in your domain name or click on a link (such as in a Google SERP), then wait for the domain name lookup. If it's an extremely popular site, the visitor's ISP may have the answer cached for quick turnaround, but the rest of the time the visitor's experience is briefly at the mercy of your Domain Name Service provider. The slower their nameserver is to respond, the longer the visitor waits before your page has even been requested from your web host.
There also may be configuration issues with the domain. You can run a pretty comprehensive diagnostic suite via the free DNS Report tool at DNSStuff.com.
If you're relying on either your hosting provider or your domain registrar for nameservers, consider a specialized DNS hosting solution such as CloudFlare, Google Cloud DNS or EasyDNS to minimize potential name-resolution delays.
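If you'd like a rough feel for how long name resolution takes from wherever you're sitting, here's a minimal Python sketch. It simply times the system resolver, so it measures your local resolver chain rather than your DNS provider in isolation, and the function name and the `resolver` parameter are my own inventions for illustration.

```python
import socket
import time


def time_dns_lookups(hostname, resolver=socket.getaddrinfo, attempts=3):
    """Time repeated lookups of hostname; returns milliseconds per attempt.

    The first timing is the closest thing to a cold lookup; later attempts
    are often answered from a cache somewhere between you and the nameserver.
    """
    timings_ms = []
    for _ in range(attempts):
        start = time.perf_counter()
        resolver(hostname, 80)  # socket.gaierror is raised if the name fails to resolve
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms
```

Calling `time_dns_lookups("www.example.com")` (a placeholder hostname; substitute your own domain) returns one timing per attempt. A big gap between the first number and the rest is a hint that uncached visitors are waiting on the nameserver.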
Datacenter / Hosting provider
Not all datacenters are created equal. Even datacenters with triple-redundant backbone connections and $100,000 switches can have awkward peering routes that lead to ping times that are 3-5x worse than competitors' due to an excessive number of hops (more than 5 or 6) to get from browser to server, so run some traceroute tests to your current host vs. others (and from diverse ISP types) to find a datacenter with superior routing performance.
The best hosting plans offer unlimited and unmetered bandwidth or a flat-rate wire-speed network connection. If your plan traffic-shapes the server's connection to a bitrate that's considerably less than the wire speed of the server's network interface (10 Mbps, 100 Mbps or 1 Gbps), then your site will suffer from a low concurrency ceiling. That ceiling can wreck your crawl budget if you aren't closely monitoring traffic volume and network utilization, or if you don't have a non-disruptive bandwidth upgrade path to follow as your traffic volume continues to grow.
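To put a number on that concurrency ceiling, here's a back-of-the-envelope sketch in Python. The 1,500 KB page weight is a hypothetical figure I picked for illustration, and the calculation ignores protocol overhead, so treat the result as an optimistic upper bound.

```python
def max_pages_per_second(link_mbps, page_weight_kb):
    """Upper bound on full page loads per second that a capped link can carry."""
    link_kilobytes_per_second = link_mbps * 1000 / 8  # megabits/s -> kilobytes/s
    return link_kilobytes_per_second / page_weight_kb


# Hypothetical page weight (HTML + CSS + JS + images), in kilobytes.
PAGE_WEIGHT_KB = 1500

for mbps in (10, 100, 1000):
    ceiling = max_pages_per_second(mbps, PAGE_WEIGHT_KB)
    print(f"{mbps:>4} Mbps link: at most {ceiling:.1f} page loads per second")
```

Whatever the ceiling works out to, Googlebot and your human visitors are sharing it.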
Of course the best option for ensuring server responsiveness is always going to be a dedicated physical server hosting a single website. If your site is hosted on a multi-tenant server, meaning there are other web sites on it (yours or those of other hosting customers), then your site is competing with these other sites for the server's resources. If a dedicated server is truly not an option, consider at least moving to a Virtual Private Server where the competition is decreased by evenly dividing the server's resources among a smaller number of tenants.
If your CMS or web server software offers a page-caching mechanism to speed up responses to common requests, use it! Try experimenting with the cache expiration interval to find the best trade-off between high performance under heavy traffic loads (i.e., longer intervals between cache refreshes) and bearable performance under low traffic loads (i.e., more frequent cache refreshes).
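If you've never looked under the hood of a page cache, the idea is simple enough to sketch in a few lines of Python. This is an illustration of the expiration-interval trade-off, not how any particular CMS implements it; `PageCache`, `render`, and the injectable `clock` are all hypothetical names.

```python
import time


class PageCache:
    """Keep rendered pages in memory until their expiration interval passes."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # path -> (expires_at, html)

    def get(self, path, render):
        """Return cached HTML for path, calling render(path) only when stale."""
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive render
        html = render(path)
        self._entries[path] = (now + self.ttl_seconds, html)
        return html
```

A longer `ttl_seconds` means fewer expensive renders under heavy traffic, at the cost of visitors (and Googlebot) occasionally seeing a page that's up to that many seconds out of date.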
Check to see if GZIP is enabled on your server. If not, you're wasting 60-80% of your (and your visitors') bandwidth by delivering your site content to visitors at full uncompressed size. Enabling this will not contribute meaningfully to your server's CPU overhead, but it will contribute meaningfully to your page speed as perceived by Googlebot and by humans, especially on mobile devices.
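You can get a feel for the savings without touching your server. The Python sketch below compresses a chunk of markup with the standard library's gzip module and reports the fraction saved. The sample markup is hypothetical and far more repetitive than a real page, so it will compress better than the typical figures above.

```python
import gzip


def gzip_savings(payload: bytes) -> float:
    """Return the fraction of bytes saved by gzip-compressing payload."""
    if not payload:
        return 0.0
    compressed = gzip.compress(payload)
    return 1.0 - len(compressed) / len(payload)


# Hypothetical, highly repetitive markup; real pages will compress less well.
sample_html = b"<li class='item'><a href='/products/widget'>Widget</a></li>\n" * 200
print(f"gzip saved {gzip_savings(sample_html):.0%} of {len(sample_html)} bytes")
```

Try feeding it the saved HTML source of one of your own pages for a more honest number.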
Perception of Time: The event horizon of window.onLoad()
Unfortunately for the most inspired and talented web developers of the world, server-side response time as measured by Googlebot is not the only page speed factor that Google will reward with competitive advantages; they also consider a page's real-world page load time (as in the time it takes for the page's onLoad() event to fire) to be a site-quality signal that, for some, has led to as much as a 40% growth in organic traffic.
Front-end optimizations seek to minimize both the number of HTTP requests and the amount of data that needs to be downloaded in order to render the page. Here's a brief overview of the factors to explore:
This is Cache Control to Major DOM…
Holding off the inevitable
Defer as many scripts as possible, especially the externally-hosted ones, to load after the window.onLoad() event fires, as this event is what stops Google's page speed timer.
Fewer and smaller
Rather than building a web UI from three dozen little individual image files, graphical UI elements should be combined into as few sprite images as necessary. The SpriteMe bookmarklet and SpritePad web app are just two examples of several great little tools for auto-generating sprites.
There are a variety of plugins available for major CMSes that can take care of this for you, and for the higher-end web dev crowd there are also some terrific front-end tooling workflows out there now, such as the combination of Sass & Gulp. For C# .NET developers like our merry band of wizards here at Dirigo, the SquishIt library has also proven indispensable.
Toss your cookies, far and wide
Instead of serving static resources from a path on the website's primary domain, move them to their own subdomain (e.g., static.yourdomain.com) under a separate site configuration on your web server so you can serve them without the cookies from your web application. Serving from a cookieless domain avoids a ton of network overhead from otherwise having to transfer half a dozen cookies, some with very large values, with each and every image file, CSS file, JS file, etc.
The next level of sophistication above employing a cookieless subdomain for static resources is to employ a CDN (Content Distribution Network) to push those resources out to the edges of the cloud, closer to wherever your visitors are than your web host. You can pay a number of places like Rackspace or Google, or you may qualify for CloudFlare's generous free tier.
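Here's a quick Python sketch of how that cookie overhead adds up. The cookie names, value sizes, and request count are all hypothetical, and it assumes plain uncompressed request headers, so read the output as a rough estimate.

```python
def cookie_overhead_bytes(cookies, static_requests):
    """Bytes of Cookie header sent in total across static_requests requests."""
    pairs = "; ".join(f"{name}={value}" for name, value in cookies.items())
    header_line = f"Cookie: {pairs}\r\n"
    return len(header_line.encode("utf-8")) * static_requests


# Hypothetical cookies and request count, purely for illustration.
example_cookies = {
    "session_id": "x" * 32,
    "tracking": "y" * 200,
    "prefs": "z" * 64,
}
overhead = cookie_overhead_bytes(example_cookies, static_requests=40)
print(f"{overhead} bytes of cookie headers uploaded for one page view")
```

Keep in mind that those bytes travel upstream, which is usually the slower direction on a visitor's connection.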
- Google wants your page to load in under two seconds.
- Google not only cares about network and server latency (average response time), but they also clock the time until onLoad() fires, and judge you for it.
- Above-average server response times are going to limit your crawl budget until they're resolved.
- Not all DNS providers are created equal.
- It pays to shop around for a particularly well-connected web host that doesn't limit your throughput.
- Dedicated hosting beats multi-tenant hosting in speed due to reduced competition for server resources, and has become too inexpensive an option to ignore for even the smallest of web-based SMBs.
- Turn on GZip compression already, you wasteful, wasteful shrew!
- The web server should be caching the results of dynamic server-side code (e.g., PHP, Python, ASP.NET, etc.) whenever possible, and sending >= 1-week cache-control headers for everything else.
- No more individual CSS, JS and UI-element image files than necessary. Combine!
- Consider employing a modern front-end tooling workflow to keep the page dependencies manageable and minified.
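As a footnote to the cache-control point in the list above, here's a minimal Python sketch of choosing a header value by file type. The extension list and the "no-cache" fallback for dynamic pages are my own assumptions for illustration; your web server or CMS will have its own way of configuring this.

```python
import posixpath

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # 604800

# Assumed set of static file types; adjust to match what your site serves.
STATIC_EXTENSIONS = {".css", ".js", ".png", ".jpg", ".gif", ".svg", ".woff"}


def cache_control_for(path, max_age_seconds=ONE_WEEK_SECONDS):
    """Pick a Cache-Control value: long-lived for static files, revalidate otherwise."""
    extension = posixpath.splitext(path)[1].lower()
    if extension in STATIC_EXTENSIONS:
        return f"public, max-age={max_age_seconds}"
    return "no-cache"  # assumption: dynamic pages lean on server-side caching instead
```

For example, `cache_control_for("/css/site.css")` yields a one-week `max-age`, while a dynamic path such as `/products/widget` falls through to the fallback.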