Canonicalization Defined (part 1 of 3)
Ian Lurie Sep 29 2009
This is part one of a 3-part series on canonicalization. Why 3 parts? Because it ended up being too dang long. Parts 2 and 3 will go live tomorrow and Thursday, respectively.
It’s a long word, I know, but canonicalization (not to be confused with ‘canonization’) is at the heart of SEO. Get it right and the world is your oyster. Get it wrong and you’re slogging uphill forever.
According to Wikipedia, canonicalization is “a process for converting data that has more than one possible representation into a ‘standard’ canonical representation”.
For our purposes, canonicalization means ‘having one address and only one address for one page of my web site’.
BTW, it’s spelled with one ‘N’. There’s no such thing as ‘cannonical’.
You’re still puzzled. I can practically hear your eyebrows knitting together. So here’s an example of a canonicalization problem:
I run a blog called Cocoa Heaven, all about chocolate. On it, let’s say there’s a page about dark chocolate bacon cupcakes (there really is). That page exists at an address on my domain without ‘www’ in front.
OK, no problem. That address represents the location of my article.
But maybe I link to it from another page on my site using the ‘www’ version of that same address.
These are two different canonical addresses. We’re representing the same information at two different virtual locations.
D’oh. That’s a canonicalization problem. Even though you and I know perfectly well they’re the same thing, search engines don’t.
A search engine comes along, crawls the link from the home page to the non-‘www’ address, then crawls the ‘www’ link. It sees two unique web addresses with duplicate content.
Now the search engine has to decide which page to index. Search engines try to filter out duplicates. This is not a penalty – it’s their effort to provide unique, relevant results.
This filtering will create 3 problems for your search engine optimization efforts:
- Diluted link authority. Say two bloggers visit the bacon cupcake article. One finds it at the non-‘www’ address, by clicking the home page article link. The other happens upon it from the post where I mistakenly used the ‘www’ address. They each link back to it, but they’ve used different canonical addresses. Since every link is a vote, your vote’s been split. Instead of a single address having 2 votes, each canonical address has 1.
- The content flip. If both canonical addresses have the same authority, you may find they ‘flip flop’ in the search results, with one address showing up one day and the other showing up the next. I can’t prove it, but I strongly suspect this hurts your SEO efforts, as your content doesn’t ‘age’.
- The maintenance nightmare. Your marketing team dutifully interlinks blog posts on your site. But they use both versions of the address. 3 years later, you log in to move some pages around, and find you have to chase around to find both canonical addresses. Annoying.
In most cases, the first problem (diluted link authority) is the real crisis. If the canonicalization problem is minor (such as mixing ‘www’ and non-‘www’ addresses), then it’s all about loss of link authority.
However, other forms of canonicalization issues can throw an entire site structure into flux, and make the second problem (the content flip) the biggest one. When that happens, large portions of your site may drop out of the index. I’m talking cats-and-dogs-living-together, find-your-rosary-beads kind of issues.
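To make the split-vote arithmetic from the first problem concrete, here is a minimal Python sketch. The example.com URLs and the `canonical` helper are made up for illustration, and stripping ‘www’ is just an assumed policy. Going the other way works equally well, as long as you pick one.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def canonical(url):
    """Hypothetical policy: lowercase the host and drop a leading 'www.'."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# Two bloggers link to the same article, one per address (illustrative URLs).
inbound_links = [
    "http://example.com/bacon-cupcakes",
    "http://www.example.com/bacon-cupcakes",
]

# Counted as-is, each address gets one vote.
raw_votes = Counter(inbound_links)

# Counted after collapsing to one address, the article gets both votes.
merged_votes = Counter(canonical(link) for link in inbound_links)

print(raw_votes)
print(merged_votes)
```

Nothing here reflects how a search engine actually scores links. It only shows that two addresses for one page means two separate tallies.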
Types of SEO canonicalization problems
From most to least serious, here are the types of canonical issues I’ve run into:
- Session IDs. For whatever reason, your site tacks a unique session ID onto every page, so www.mysite.com becomes www.mysite.com?jsessionid=asdf230498q234. This is unique for every visit, so there are infinite canonical addresses for every page on the site. There’s Trouble in River City.
- Inconsistent URLs. You have a dynamic site that generates URLs like www.mysite.com?catid=1&subcatid=234&prodid=33. Not a problem, SEO-wise. Unfortunately, you can reach the same product at www.mysite.com?prodid=33 and www.mysite.com?catid=1&prodid=33, and all 3 links are used at random. Not good.
- The blown rewrite. You’ve just set up a nice, clean URL structure on your site, so that you now have URLs like www.mysite.com/shoes/running. But many pages on the site still use www.mysite.com?catid=1&subcatid=234, and that doesn’t redirect to the new, friendlier URL. Yikes.
- The tracking code. You use a special tracking code like www.mysite.com/?source=a0923 for every link on your site, so that you know what folks click. Or you use those codes for banners you place on ad networks. Either way, those links are now in the wild, and create huge canonical tangles. Get your conditioner out.
- Default page confusion. On your site, www.mysite.com/shoes/ and www.mysite.com/shoes/index.php go to the same page. Sadly, you and/or a few dozen partner web sites use these two links interchangeably. Another yikes, but easy to fix (learn how in Part 3).
- WWW mixups. I covered this one above. Both ‘www’ and non-‘www’ addresses work on your site. There’s no redirection. And you’ve used both versions interchangeably. Sigh. Don’t worry, though – it’s fixable.
- Case issues. To a computer, ‘a’ and ‘A’ are different characters. If you capitalize part of your URL one time, and don’t capitalize it the next, you may cause all sorts of duplication problems. Hard to detect once it’s done, so I suggest keeping everything lower case.
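Several of the issues in the list above come down to the same idea: many spellings, one page. As a rough sketch (not a fix, and not anything a search engine is known to run), here is how you might collapse some of those spellings to a single address in Python. The `normalize` function, the `http://` scheme, the decision to drop ‘www’, and the parameter names in `IGNORED_PARAMS` are all assumptions based on the examples in the list. The inconsistent-URL case (`catid`, `subcatid`, `prodid`) is deliberately left out, because only your application knows which parameters really identify a page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed names, borrowed from the session ID and tracking code examples.
IGNORED_PARAMS = {"jsessionid", "source"}
# Assumed default page file name, borrowed from the default page example.
DEFAULT_PAGES = ("index.php",)


def normalize(url):
    """Collapse common variants of a URL into one spelling (illustrative only)."""
    parts = urlsplit(url)

    # WWW mixups: lowercase the host and drop a leading 'www.' (assumed policy).
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Case issues: keep the path lower case.
    path = parts.path.lower()

    # Default page confusion: /shoes/index.php becomes /shoes/.
    for page in DEFAULT_PAGES:
        if path.endswith("/" + page):
            path = path[: -len(page)]
    if not path:
        path = "/"

    # Session IDs and tracking codes: drop them, then sort what is left.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    query = urlencode(sorted(params))

    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, host, path, query, ""))


variants = [
    "http://www.mysite.com/shoes/index.php",
    "http://MySite.com/Shoes/",
    "http://www.mysite.com/shoes/?source=a0923",
]
for variant in variants:
    print(normalize(variant))  # each prints http://mysite.com/shoes/
```

Lowercasing the path is only safe if your server treats paths as case-insensitive or you already keep everything lower case, which is why it is flagged as a policy choice and not a rule.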
There are more. If you think of some, post ‘em below.
On to part 2: How to detect canonical problems on your site. It’s easy-peasy!
Ian Lurie is CEO and founder of Portent Inc. He is co-author of the 2nd edition of the Web Marketing All-In-One for Dummies and wrote the sections on SEO, blogging, social media and web analytics. He's recorded training for Lynda.com, writes regularly for the Portent Blog and has been published on AllThingsD, Forbes.com and TechCrunch. And, Ian speaks at conferences around the world, including SearchLove, MozCon, SIC and ad:Tech.