Fix canonicalization problems (part 3 of 3)


Ian Lurie Oct 1 2009

This is part 3 in a series about canonicalization issues. Part 1 defined canonicalization. Part 2 gave advice for tracking down canonical problems on your site. This article deals with fixing the problems you just found.

Now that you’ve found your canonicalization problems, you need to fix them.
I’ve got 5 solutions for ya:

1: Just fix it

The best way to fix canonicalization problems is to fix them.
If you link to your home page 4 different ways, pick one and make your links consistent.
If you added query strings like ?link=1234 all over your site so that you could track clicks, get rid of them. Use something like event tracking in Google Analytics, instead.
Got session IDs all over the place? Get rid of them, and use cookie variables.
Repair whatever it is that’s creating multiple URLs for one page of content.
This is hard work. Doing most things right involves hard work. The payoff, though, is that you don’t have to depend on weird, semi-supported tags like rel=canonical or huge webs of complex 301 redirects.
And, if you really fix the problem, then the fix scales: New pages and content will behave themselves, and you’ll have less work in the long run. Anything else is duct tape. Which, contrary to popular myth, won’t fix everything.
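If you’re not sure which links need fixing, a small script can help you find them. Here’s a minimal Python sketch, under some loud assumptions: the host names, the list of tracking parameters, and the function names `canonical_url` and `find_inconsistent_links` are all made up for illustration. It computes one preferred form for a URL and flags every link that doesn’t already match it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical settings for illustration; swap in your own site's values.
CANONICAL_HOST = "www.mysite.com"
BARE_HOST = "mysite.com"
TRACKING_PARAMS = {"link", "sessionid"}
DEFAULT_DOCUMENT = "index.html"


def canonical_url(url):
    """Return the one preferred form of a URL on this (assumed) site."""
    parts = urlsplit(url)

    # Use one host name everywhere.
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST

    # Treat an empty path and a trailing default document as the directory itself.
    path = parts.path or "/"
    if path.endswith("/" + DEFAULT_DOCUMENT):
        path = path[: -len(DEFAULT_DOCUMENT)]

    # Drop tracking parameters and keep everything else in order.
    # Note: urlencode re-encodes the remaining values, so unusual escaping
    # in the original query string may come back in a different form.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, host, path, urlencode(kept), parts.fragment))


def find_inconsistent_links(links):
    """Return the links that are not already in their preferred form."""
    return [link for link in links if canonical_url(link) != link]
```

Feed `find_inconsistent_links` the list of internal links from your crawl, and whatever comes back is your to-do list.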
Good: Works forever. Makes your site well-coded. Builds good karma. Won’t fail when the search engines buy each other or change their minds about standards or whatever.
Bad: Requires higher thought.

2: Robots META tag

You can use the robots META tag to hide all but one version of the guilty pages.
Say you’ve got a canonicalization problem that looks like this:

http://www.mysite.com/products/

http://www.mysite.com/products/?referrer=homepage

http://www.mysite.com/products/?referrer=catpage

…where all of those URLs go to the exact same page.
You can fix the problem by telling search engines to ignore the page at all but the first URL. Add this in the <head> element:

<meta name="robots" content="noindex,nofollow">

Important: You need to use some kind of conditional logic to only show that robots tag when there’s a ‘referrer’ parameter in the URL. Here’s what it’d look like in plain English:
IF there’s a parameter called “referrer” in the URL, then insert <meta name="robots" content="noindex,nofollow"> in the page.
And in PHP:

if (isset($_GET['referrer'])) {
    // Only URLs that carry the 'referrer' parameter get hidden from search engines.
    echo '<meta name="robots" content="noindex,nofollow">';
}

I’m at best a rookie PHP developer, so let me know if I screwed this up.

Without the conditional logic, you’ll hide every instance of the page, including the nice short one.
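If your site doesn’t run PHP, the same check is easy to write elsewhere. Here’s a rough Python sketch of the idea. The function name `robots_meta_for` and the constant `NOINDEX_TAG` are hypothetical, and a real site would call something like this from its page template rather than pass in a full URL.

```python
from urllib.parse import parse_qs, urlsplit

NOINDEX_TAG = '<meta name="robots" content="noindex,nofollow">'


def robots_meta_for(url, param="referrer"):
    """Return the robots tag only when the URL carries the given parameter."""
    # keep_blank_values=True so that "?referrer=" (empty value) still counts.
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return NOINDEX_TAG if param in query else ""
```

The clean URL gets an empty string, so it stays indexable. Only the versions with the tracking parameter get the tag.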
Good: Easy. Appeals to the spaghetti programmer in me.
Bad: Somehow, there’s always one case you miss. Next developer down the line will probably delete it, laughing at you the entire time. Only works on dynamic sites.

3: Use robots.txt

Continuing the example from above, you could use wildcard pattern matching (robots.txt doesn’t support full regular expressions) to block search engine crawlers from all URLs that include “referrer”.
Something as simple as:

User-agent: *
Disallow: /*?referrer=

might do the trick.
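Before you ship a rule like that, it’s worth checking what it actually matches. Here’s a small Python sketch of the wildcard matching that the major crawlers document: `*` matches any run of characters, a trailing `$` anchors the end, and everything else is a literal prefix match. The helper names `robots_pattern_to_regex` and `is_disallowed` are my own, and real crawlers may differ in the details, so treat this as a sanity check rather than the final word.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    # A trailing "$" means the pattern must match through the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each "*" becomes ".*"; every other character is matched literally.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path_and_query, pattern):
    """Return True if the pattern matches from the start of the path and query."""
    return robots_pattern_to_regex(pattern).match(path_and_query) is not None


rule = "/*?referrer="
print(is_disallowed("/products/?referrer=homepage", rule))         # True
print(is_disallowed("/products/", rule))                           # False
print(is_disallowed("/products/?page=2&referrer=homepage", rule))  # False
```

Note the last case: the rule only catches “referrer” when it’s the first parameter in the query string. If it can show up later, you’d need a second rule such as `Disallow: /*&referrer=`.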
Good: It’s so easy. One little line in the robots.txt file and you’re all set. Sweet!
Bad: If done wrong, may cause your site to fall into a black hole. Also, different search engines support robots.txt differently.

4: Use 301 redirects

If you have a case where the problem stems from inconsistent linking practices like:

http://www.mysite.com/

http://www.mysite.com/index.html

http://mysite.com/index.html

http://mysite.com

…where all four URLs point at your home page, you can use a 301 redirect to fix it.
Set up a 301 redirect from each of the 3 URLs you don’t want indexed to the one that you do. When search engines visit your site, they’ll scoot over to the correct page and index that one.
They’ll even apply most of the link authority from the incorrect URLs to the correct one.
This is also your best bet if external sites are linking to the wrong home page URL.
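Most people set this up in the web server’s configuration, but the logic is the same wherever it lives. Here’s a minimal Python sketch of that logic for the four home page URLs above. The function name `redirect_for` is hypothetical, and I’m assuming the www version with a trailing slash is the one you want indexed.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for illustration: keep the "www" host, drop index.html.
CANONICAL_HOST = "www.mysite.com"
BARE_HOST = "mysite.com"


def redirect_for(url):
    """Return (301, target) if the URL should redirect, or None if it is already canonical."""
    parts = urlsplit(url)

    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST

    path = parts.path or "/"
    if path == "/index.html":
        path = "/"

    target = urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

    # Never redirect a URL to itself; that is how endless loops start.
    if target == url:
        return None
    return (301, target)
```

The important design choice is that the target never redirects again: `redirect_for("http://www.mysite.com/")` returns `None`, so every wrong URL reaches the right one in a single hop.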
Good: Easy (if you have server access). Approved by all search engines. Can also be done using a scripting language like PHP. Works for external links to your site, too.
Bad: Tedious. Requires server access (or a programmer). Done wrong, may create endless loops that turn your data center into a mushroom cloud.

5: Webmaster tools

Google Webmaster Tools will let you exclude parameters in the toolset. Log into Google Webmaster Tools, then go to Settings and click ‘Adjust parameter settings’. Using the ‘referrer=’ example from #2, you’d do this:
[Screenshot: google-parameter-handling.gif]
Voila. Googlebot will strip out those URL parameters.
You can also set your preferred domain on the same screen:
[Screenshot: google-preferred-domain.gif]
Good: It’s just forms and stuff.
Bad: May not prevent canonicalization problems. Depends on Google and whatever’s going on in its pointy little head. Not supported by Yahoo! or Bing.

What now?

Now you know what canonicalization is, how to find problems with it, and 5 possible solutions. Start by checking your site. If you find a problem, sit down with your development team (if you have one) and work out a solution. Start at #1 as the best solution and work your way down.
Oh, and a piece of advice: Don’t tell anyone there’s a 2-5. #1 is what you want. Use 2-5 when all hope is lost.
Back to part 2: How to detect canonical problems on your site. It’s easy-peasy!
