Using fuzzy logic to redirect broken links: Geekery

Ian Lurie

One of the easiest ways to improve your link popularity is to fix old, busted links from other sites.
For example: Say I have an online store. One product page is www.mysite.com/bike-tire/tubeless. A nice webmaster decides to link to that page, but uses www.mysite.com/bike/tubless. Because that link is incorrect, anyone going to the ‘tubless’ page gets a 404 error. And, I don’t get any link authority from the other site. Bummer.
I contact the linking site’s webmaster and ask them to fix it, but it turns out their webmaster is on a 10-month sabbatical and no one knows how to edit the site. Bummer, again.
I can get at least some of the authority back, though, by setting up a 301 redirect from the ‘tubless’ URL to the correct, ‘tubeless’ one. That’s building links by looking between the cushions, aka out-executing your competition.
It’s a great solution, but it doesn’t scale. If you have a reasonably large database-driven site – say an online store with 1500 products – you could have 5000+ pages on your site. And you may have 1000+ broken incoming links. You’ve got to go through those lists, matching up broken links with relevant redirection targets. Egads.
I could just redirect every broken link to my home page. But that’s not ideal:

  • First, anyone clicking on a link that reads “Tubeless bike tires” expects to land on a page about tubeless bike tires. If they instead land on my home page, they’ll be instantly confused.
  • Second, I’m unlikely to keep a lot of link authority if the redirect ignores the reasonable surfer model. I have no proof, but an illogical or irrelevant redirect can’t possibly hold link authority as well as a relevant, sensible one.

This process screams for automation: Why not compare each broken link to a list of good ones, and then pick the best match, all using, I dunno, a search tool?
I tried to come up with some great name, like Linkinator, or Link Baby Link, but so far I’m at a loss, so…

Presenting, The Automated Fuzzy Logic 301 Redirect Finder Thingy Gadget

aka TAFLRFTG. Sigh.
In plain language, this soon-to-be-renamed tool works like this:

  1. I upload a file that lists all good URLs on the website.
  2. Use Apache Lucene to build a searchable index of those URLs.
  3. Then I upload a file that lists all bad URLs I find in, say, Google Webmaster Tools.
  4. Search for the closest ‘fuzzy’ match between the bad URL and the good URLs.
  5. If the match is close enough, save the good URL as the redirection target.
  6. Repeat until done.

In this case, ‘fuzzy’ has nothing to do with ‘cuddly’. It means ‘a computerized guess’. I don’t make these terms up, so don’t ask me where this came from.
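For the curious, one common way to put a number on that guess is edit distance: count how many single-character changes it takes to turn one string into another. Lucene's fuzzy queries are built on edit distance, but the snippet below is only a plain Levenshtein sketch in Python for illustration, not Lucene's actual code. The function names `edit_distance` and `similarity` are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical strings.
    This scaling is an assumption chosen for the example."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance("tubless", "tubeless"))  # 1: one missing 'e'
```

The smaller the distance, the better the guess. A typo like 'tubless' sits one edit away from 'tubeless', which is why a fuzzy search can still find it.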

If you’re a bit nerdier, try this:

Upload good urls text file;
Upload bad urls text file;
Build Lucene index of good urls;
Begin loop for i from 1 to length(bad urls text file) {
Grab bad url[i];
Search Lucene index for closest fuzzy match;
IF find match THEN store match;
ELSE store home page URL;
} End loop;
Write list of redirect URLs to a text file;
Take deep breath;
Smile.

The result is a nice text file that lists each bad link, plus the URL to which I should redirect.
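If you want to play with the idea without setting up Lucene, here is a minimal sketch of the same loop in Python. It swaps in the standard library's `difflib.get_close_matches` for the Lucene index, so it is a stand-in for the approach, not the thingy itself, and it will not scale the way a real index does. The 0.6 cutoff, the `/` home page fallback, the tab-separated output, and the function names are all assumptions made for this example.

```python
import difflib

HOME_PAGE = "/"  # hypothetical fallback when nothing is close enough
CUTOFF = 0.6     # assumed 'close enough' threshold (0..1); tune to taste


def find_redirects(bad_urls, good_urls, cutoff=CUTOFF, fallback=HOME_PAGE):
    """Map each bad URL to its closest good URL, or to the fallback."""
    redirects = {}
    for bad in bad_urls:
        # Ask for the single best candidate scoring at or above the cutoff.
        matches = difflib.get_close_matches(bad, good_urls, n=1, cutoff=cutoff)
        redirects[bad] = matches[0] if matches else fallback
    return redirects


def format_redirects(redirects):
    """One line per redirect: bad URL, a tab, then the target URL."""
    return "".join(f"{bad}\t{good}\n" for bad, good in redirects.items())


if __name__ == "__main__":
    good = ["/bike-tire/tubeless", "/bike-tire/tubed", "/helmets/road"]
    bad = ["/bike/tubless", "/zzzzqqqq"]
    # Write the list of redirects to a text file, as in the last step above.
    with open("redirects.txt", "w", encoding="utf-8") as out:
        out.write(format_redirects(find_redirects(bad, good)))
```

With these sample lists, the 'tubless' URL maps to the tubeless page and the gibberish URL falls back to the home page. Raising the cutoff sends more links to the fallback, and lowering it accepts looser guesses.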

Results, in real life

Lucene works amazingly well, matching up stuff like “/2009/09/seo-101-canonicalization-1.htm)” with “/2009/09/seo-101-canonicalization-1.htm”.
It also finds more subtle stuff. Someone linked to: “/2010/03/get-geeky-grep-Search%20Engine%20Optimisation-tool.htm” and the thingy found “/2010/03/get-geeky-grep-seo-tool.htm” as the best 301 redirect target.
It did this with no human intervention: I fed it a list of bad urls, and a list of good ones, and clicked ‘go’.
With Lucene in place, the thingy runs an analysis of 1000 broken links against 20000 good ones in under 20 seconds.

HUGE props to Raymond Camden, whose example of using Lucene on ColdFusion got me started, and to my CTO, Branden Root, whose Java knowledge kept me from going even more insane.


Comments

  1. Brilliant!
    If a site is using a CMS, you should be able to implement this in real time (cue a horde of TAFLRFTG plugins…). The CMS maintains an index of its good pages. Any time there is a page request that doesn’t match a good url or an existing redirect, you could fuzzy match the request with a good one and store the redirect.
