Using fuzzy logic to redirect broken links: Geekery


Ian Lurie May 20 2010

One of the easiest ways to improve your link popularity is to fix old, busted links from other sites.
For example: Say I have an online store. One product page is www.mysite.com/bike-tire/tubeless. A nice webmaster decides to link to that page, but uses www.mysite.com/bike/tubless. Because that link is incorrect, anyone going to the ‘tubless’ page gets a 404 error. And, I don’t get any link authority from the other site. Bummer.
I contact the linking site’s webmaster and ask them to fix it, but it turns out their webmaster is on a 10-month sabbatical and no one knows how to edit the site. Bummer, again.
I can get at least some of the authority back, though, by setting up a 301 redirect from the ‘tubless’ URL to the correct, ‘tubeless’ one. That’s building links by looking between the cushions, aka out-executing your competition.
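For the curious, here is roughly what a 301 redirect like that looks like in code. This is a minimal sketch in Python using a bare WSGI app, not how my store actually handles it. The REDIRECTS table and the app function are made-up names for illustration, and on a real site you would more likely do this in your web server or CMS configuration.

```python
# Hypothetical lookup table: misspelled incoming path -> correct path.
REDIRECTS = {
    "/bike/tubless": "/bike-tire/tubeless",
}


def app(environ, start_response):
    """Tiny WSGI app: answer known bad paths with a permanent redirect."""
    path = environ.get("PATH_INFO", "")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells browsers and crawlers the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    # Anything else falls through to a plain 404 in this sketch.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found"]
```

To try it locally, wsgiref.simple_server.make_server("", 8000, app).serve_forever() from the standard library will serve it, and a request for /bike/tubless should come back as a 301 pointing at the tubeless page.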
It’s a great solution, but it doesn’t scale. If you have a reasonably-large database-driven site – say an online store with 1500 products – you could have 5000+ pages on your site. And you may have 1000+ broken incoming links. You’ve got to go through those lists, matching up broken links with relevant redirection targets. Egads.
I could just redirect every broken link to my home page. But that’s not ideal:

  • First, anyone clicking on a link that reads “Tubeless bike tires” expects to land on a page about tubeless bike tires. If they instead land on my home page, they’ll be instantly confused.
  • Second, I’m unlikely to keep a lot of link authority if the redirect ignores the reasonable surfer model. I have no proof, but an illogical or irrelevant redirect can’t possibly hold link authority as well as a relevant, sensible one.

This process screams for automation: Why not compare each broken link to a list of good ones, and then pick the best match, all using, I dunno, a search tool?
I tried to come up with some great name, like Linkinator, or Link Baby Link, but so far I’m at a loss, so…

Presenting, The Automated Fuzzy Logic 301 Redirect Finder Thingy Gadget

aka TAFLRFTG. Sigh.
In plain language, this soon-to-be-renamed tool works like this:

  1. I upload a file that lists all good URLs on the web site.
  2. The tool uses Apache Lucene to build a searchable index of those URLs.
  3. Then I upload a file that lists all bad URLs I find in, say, Google Webmaster Tools.
  4. For each bad URL, the tool searches the index for the closest ‘fuzzy’ match among the good URLs.
  5. If the match is close enough, it saves the good URL as the redirection target. If not, it falls back to the home page.
  6. Repeat until done.

In this case, ‘fuzzy’ has nothing to do with ‘cuddly’. It means ‘a computerized guess’. I don’t make these terms up, so don’t ask me where this came from.

If you’re a bit nerdier, try this:

Upload good urls text file;
Upload bad urls text file;
Build Lucene index of good urls;
Loop i from 1 to length(bad urls text file) {
  Grab bad url[i];
  Search Lucene index for closest fuzzy match;
  IF match is close enough THEN store match;
  ELSE store home page URL;
}
Write list of redirect URLs to a text file;
Take deep breath;
Smile.
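
If you want to play with the idea without installing Lucene, the loop above can be sketched in a few lines of Python. Big caveat: this is not my ColdFusion code and it does not use Lucene. It swaps in difflib from the Python standard library as a stand-in fuzzy matcher, so the scores will not match what Lucene returns. The function name find_redirects, the "/" home page, and the 0.6 cutoff are all assumptions for illustration.

```python
from difflib import get_close_matches


def find_redirects(bad_urls, good_urls, home_url="/", cutoff=0.6):
    """Map each bad URL to its closest good URL, or to home_url.

    difflib is a stand-in for the Lucene fuzzy search described in the
    article; cutoff (0 to 1) is an assumed "close enough" threshold.
    """
    redirects = {}
    for bad in bad_urls:
        # Returns at most one candidate scoring >= cutoff, best first.
        matches = get_close_matches(bad, good_urls, n=1, cutoff=cutoff)
        redirects[bad] = matches[0] if matches else home_url
    return redirects


if __name__ == "__main__":
    good = ["/bike-tire/tubeless", "/bike-tire/tubed", "/helmets/road"]
    bad = ["/bike/tubless", "/zzz-qqq"]
    for bad_url, target in find_redirects(bad, good).items():
        print(bad_url, "->", target)
```

Comparing every bad URL against every good one like this is fine for a small list, but it gets slow as the lists grow, which is a big part of why a real search index is the better tool for the job.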

The result is a nice text file that lists each bad link, plus the URL to which I should redirect.
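
Writing that file is the easy part. Here is a small Python sketch, assuming the matches are already sitting in a dictionary of bad URL to redirect target. The tab-separated layout and the write_redirect_file name are my guesses for illustration; the sample ColdFusion code may format its output differently.

```python
import csv


def write_redirect_file(redirects, path):
    """Write one 'bad URL <TAB> redirect target' row per broken link."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for bad_url, target in redirects.items():
            writer.writerow([bad_url, target])


if __name__ == "__main__":
    write_redirect_file(
        {"/bike/tubless": "/bike-tire/tubeless"},
        "redirects.txt",  # hypothetical output file name
    )
```

From there it is a short hop to turning each row into whatever redirect rule your server or CMS expects.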

Results, in real life

Lucene works amazingly well, matching up stuff like “/2009/09/seo-101-canonicalization-1.htm)” with “/2009/09/seo-101-canonicalization-1.htm”.
It also finds more subtle stuff. Someone linked to: “/2010/03/get-geeky-grep-Search%20Engine%20Optimisation-tool.htm” and the thingy found “/2010/03/get-geeky-grep-seo-tool.htm” as the best 301 redirect target.
It did this with no human intervention: I fed it a list of bad urls, and a list of good ones, and clicked ‘go’.
And it’s fast: with Lucene in place, the thingy analyzes 1000 broken links against 20000 good ones in under 20 seconds.

HUGE props to Raymond Camden, whose example of using Lucene on ColdFusion got me started, and to my CTO, Branden Root, whose Java knowledge kept me from going even more insane.

Go ahead, try it

I’ll definitely be putting this tool out for public use, once I make it prettier, and figure out how to keep it from melting servers to slag. In the meantime, if you’re a ColdFusion developer, you can download my sample code. All of the necessary plumbing is in this one file:
sample-analyzer.zip
Since ColdFusion is designed so that even programming morons like me can use it, you should be able to interpret the code and convert it to your own language, too.
Enjoy, and let me know if you have suggestions.


1 Comment

  1. Brilliant!
    If a site is using a CMS, you should be able to implement this in real time (cue a horde of TAFLRFTG plugins…). The CMS maintains an index of its good pages. Any time there is a page request that doesn’t match a good url or an existing redirect, you could fuzzy match the request with a good one and store the redirect.
