Canonicalization and Duplicate Content Issues
01/10/2014
Every webpage has a URL that identifies it, and though a URL was originally intended to be unique, most webpages can in practice be reached from more than one URL. This is simply how web server software behaves, and preventing search visibility issues requires careful attention to the assumptions search engines are forced to make.
For a variety of technical and business-practice reasons, most websites end up having more indexable URLs than they have distinct pages of useful content, and most have pages which deliver little value to searchers. Much of technical SEO is concerned with reducing the set of indexable URLs to the minimum needed to cover the site's highest-quality content. This makes efficient use of the limited crawl budget search engine bots allot to the site, ensuring that fresh content and refreshes of old content make it into the search index sooner rather than later.
Crawl budget concerns aside, the goal of reducing the number of duplicate URLs indexed by search engines is to consolidate all the diluted authority or, put more broadly, the value signals (link equity, content topicality & freshness, social engagement, site quality, bounce rate, etc.) assigned to each URL for the purpose of ranking them in search results.
Search engines are smart, but they still can't understand which URL, among all the possible variations, was intended to be the official URL for a particular page of content. Canonicalization is the process of signaling this intent to the search engines through a variety of officially-supported methods. If you don't signal it, they're forced to guess, and they don't always guess correctly.
Canonicalization refers to methods by which all possible variations of a webpage's URL are collapsed for the search engines into the preferred (i.e., canonical) version of that webpage's URL, such that the search engines treat any non-canonical URL variants they encounter as equivalent to the canonical version, rather than as distinct URLs with duplicate content.
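To make the idea concrete, here is a minimal sketch of URL normalization in Python. The URL, the hypothetical `source=` tracking parameter, and the normalization rules are all illustrative assumptions, not a complete policy; real canonicalization decisions are site-specific.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Simplified, hypothetical normalization: lowercase the host,
    drop the fragment, and strip a tracking parameter."""
    parts = urlsplit(url)
    # Keep only query parameters that aren't the (hypothetical) tracker.
    query = "&".join(
        p for p in parts.query.split("&")
        if p and not p.startswith("source=")
    )
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )

# Three surface variants collapse to one canonical form:
variants = [
    "HTTP://Example.com/widgets?source=blog",
    "http://example.com/widgets#reviews",
    "http://example.com/widgets",
]
print({normalize(u) for u in variants})  # one entry, not three
```

In production this collapsing is signaled to search engines (e.g., via redirects or canonical link elements) rather than computed on their behalf, but the mapping from many variants to one preferred URL is the same idea.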
Duplicate content can refer to pages with:
- 100% word-for-word copies of the entire content of another page
- near-duplicated pages, where an insignificant percentage of the content is original
- subsets of original content from other pages
Since canonicalization is primarily concerned with the URL itself, it helps to review the basic structure of a webpage URL: the scheme (protocol), the domain (host), the path, the optional query string, and the optional fragment identifier.
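These components can be pulled apart programmatically. The sketch below uses Python's standard `urllib.parse` module on a hypothetical example URL:

```python
from urllib.parse import urlsplit

# A hypothetical URL, chosen purely for illustration:
url = "https://www.example.com/products/widgets?color=blue&sort=price#reviews"
parts = urlsplit(url)

print(parts.scheme)    # https                 (the protocol)
print(parts.netloc)    # www.example.com       (the host/domain)
print(parts.path)      # /products/widgets     (path to the resource)
print(parts.query)     # color=blue&sort=price (query parameters)
print(parts.fragment)  # reviews               (fragment identifier)
```

The fragment is the one piece browsers resolve locally and never send to the server, which is why search engines can safely ignore it when comparing URLs.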
If any of these components of a newly-discovered URL (with the exception of the fragment identifier on the right end) differ from any previously-crawled URL, search engines are forced to assume the URL is distinct, and that this was intentional on the part of the website publisher, regardless of how unique its content is relative to any other page.
Though the domain portion of a URL is case-insensitive (browsers handle it that way), the rest of the URL is case-sensitive by design, which means mixed-case URLs can be treated by search engines as unique pages despite identical content. Google may attempt to defer to the all-lowercase version of a URL if the content is identical, but if it encounters a link to that URL which also includes query parameters (such as "?source=blog" or "?sort=reverse"), this is again assumed to be a distinct URL.
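A short sketch makes the distinction explicit: only the host compares case-insensitively, while a path-case difference or an added query parameter yields what a crawler must treat as a separate URL. The example URLs are hypothetical.

```python
from urllib.parse import urlsplit

a = urlsplit("https://Example.com/Widgets")
b = urlsplit("https://example.com/widgets")

# The hosts are equivalent once lowercased (case-insensitive by spec)...
assert a.netloc.lower() == b.netloc.lower()
# ...but the paths differ in case, so these count as two distinct URLs.
assert a.path != b.path

# Appending a query parameter produces yet another distinct URL:
c = urlsplit("https://example.com/widgets?sort=reverse")
assert (c.path, c.query) != (b.path, b.query)
print("three crawlable URL variants, one page of content")
```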
In most cases where fully-duplicate content is detected, search engines will attempt to guess which URL is the original, based on the date they first encountered each URL, which URL was first linked to from another URL already in their index, or other comparative factors between sites featuring the same page of content. Their assumptions aren't always correct, though, so it's best to send explicit signals to the search bots whenever possible.
It is not uncommon to work with clients who have literally thousands of canonicalization issues with their site. It depends on the server platform (multiple severe issues tend to be more common with IIS than Apache for instance, due largely to easily-overlooked default settings) as well as the CMS or e-commerce app that's driving the site.
Simply cleaning up a mess like that tends to vastly improve search engine rankings all by itself, since it concentrates so many value signals and even improves some of them directly. You really can't unlock a website's full organic search traffic potential without getting your technical SEO house in order, and that starts with best-practices canonicalization.
Leveraging decades of search industry experience, Dirigo offers world-class comprehensive technical SEO audits right in-house, including explanations of each issue encountered, relative prioritization of each issue, and detailed recommendations for resolving it safely and efficiently. We'll even deliver our findings in-person, walking through the audit at your own speed and technical comfort level.
If you have the internal resources to implement our recommendations, we can work with your technical team to ensure that each issue is resolved accurately and completely, or if you don't have the resources available, Dirigo can provide full technical implementation directly in-house as well.