Ian Lurie // Aug 10 2012
What I’ve never done, though, is shown folks how they can quickly find those busted external links using basic tools. So, here goes:
I’ve broken this out into lots of steps. You could do it all in one or two steps with a shell script or other geekery. I wrote this to keep each step simple, and get you into Excel as quickly as possible, instead.
With a log file, you can find broken external links that Google hasn’t found. Google Webmaster Tools only shows you broken links found by Google. GWT ignores everything else.
Don’t you want all those links from Twitter? How about all the old .edu links you used to have, but lost when you took down the target pages?
Hell yes. Here’s how you can find them using your log files:
If you’re using OS X or Linux, you have everything you need except, possibly, a spreadsheet program. Google Docs will work, or OpenOffice for big files, or Excel for the coolest stuff (like pivot tables).
If you’re on Windows, you’ll want to install CYGWIN—that gives you all of the command-line tools I talk about in this post.
If you run your own site, you can download the log files yourself. Otherwise, though, you’re going to have to ask someone else to get ‘em for you, and that’s rarely popular. Here’s how you can make the process less painful:
The key here: make this an easy process. The first questions from whoever you ask for the files will be “Is this a lot of work for me?” and “Is this a security issue?” Answer those concerns before they’re raised.
I’ve spent weeks, literally, trying to get log file access from a client. Usually, that’s because no one knows what I’m talking about. If you run into this, try these steps, in this order:
If the site’s located with a hosting company:
If the site’s self-hosted or managed by an internal team:
Got the files? Great! Time to get to work.
Now, you can go download the log files. You probably have a bunch of compressed files up on a server somewhere. They’ll look like the right-hand side of my FTP window:
Download them to your machine. Decompress them using whatever utility makes sense. If these are .gz files, you can extract them using the GUNZIP command:
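The command itself is a one-liner. Here’s a sketch with a fabricated sample file, so it runs anywhere (the file name and scratch folder are made up for the demo):

```shell
cd "$(mktemp -d)"                  # scratch folder so the demo is self-contained
printf 'GET /a.htm 404\n' > access_log.1
gzip access_log.1                  # stands in for a downloaded .gz log file
gunzip *.gz                        # extract every .gz file in the folder
```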
That will extract every file in this folder with a .gz on the end, and leave you with something like this:
Log files may be compressed using ZIP, or something else. You can find the right extraction tool using, I dunno, Google?
Ideally, you want a single log file. To combine the log files, use the CAT command:
cat access_log* > biglog.txt
The above command concatenates every file whose name starts with access_log into one big file called biglog.txt.
If the files are really huge you may have to keep them separate. But that’ll only be an issue if, once combined, the final file is multiple gigabytes in size. GREP is really good at processing huge files.
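To see the combine step work end to end, here’s a tiny sketch with made-up file names:

```shell
cd "$(mktemp -d)"                     # scratch folder for the demo
printf 'request one\n' > access_log.1
printf 'request two\n' > access_log.2
cat access_log* > biglog.txt          # concatenate the logs, in name order
wc -l biglog.txt                      # quick sanity check on the total line count
```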
You need to find all of the broken external links. So, you’ll need four pieces of data: the requested URI, the response code (404, in this case), the user agent (so you can weed out Googlebot), and the referrer.
With those four items, you can find all of the external broken links visited by browsers other than Googlebot.
Now to the good stuff. You’ve got one gigantic log file. You can use the GREP command to search through that file at super speed.
Use this command, changing the .htm and file names as relevant:
grep "\.htm.*[[:space:]]404[[:space:]]" biglog.txt > errors.txt
This command will find every line containing an .htm-style URI and a 404 response code, and write those lines to a new file called errors.txt.
This can take a minute or two.
You may need to change the .htm. We’re using it to exclude all of the requests for .gif, .png and other non-HTML files. We only care about pages this time around. If your site uses PHP, and all of the URIs end with .php, you’ll have to change .htm to .php.
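Here’s the filter run against a couple of fabricated combined-format log lines, so you can see what survives:

```shell
cd "$(mktemp -d)"                     # scratch folder for the demo
cat > biglog.txt <<'EOF'
1.2.3.4 - - [10/Aug/2012:00:00:01 -0700] "GET /old-page.htm HTTP/1.1" 404 512 "http://twitter.com/somebody" "Mozilla/5.0"
1.2.3.4 - - [10/Aug/2012:00:00:02 -0700] "GET /live-page.htm HTTP/1.1" 200 2048 "-" "Mozilla/5.0"
1.2.3.4 - - [10/Aug/2012:00:00:03 -0700] "GET /logo.png HTTP/1.1" 404 0 "-" "Mozilla/5.0"
EOF
grep "\.htm.*[[:space:]]404[[:space:]]" biglog.txt > errors.txt
cat errors.txt                        # only the broken .htm request makes the cut
```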
We need to remove all 404 errors generated by Googlebot. GREP can do the job, again. Use this command:
grep -v "Googlebot" errors.txt > errors-no-google.txt
This command will drop every line containing “Googlebot” and save the remaining lines to errors-no-google.txt.
Notice how fast GREP ran that command? Pretty nifty, huh?
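A quick sketch of what the Googlebot filter leaves behind (the log lines are fabricated):

```shell
cd "$(mktemp -d)"                     # scratch folder for the demo
cat > errors.txt <<'EOF'
1.2.3.4 "GET /gone.htm HTTP/1.1" 404 0 "http://example.org/post" "Mozilla/5.0"
66.249.66.1 "GET /gone.htm HTTP/1.1" 404 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
EOF
grep -v "Googlebot" errors.txt > errors-no-google.txt
wc -l errors-no-google.txt            # one line left: the human visitor
```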
When I ran through this exercise on my laptop, I took a .5 gigabyte biglog.txt file and trimmed it down to a 904kb file that just contained the errors I needed. It took a total of 5 minutes, start to finish. Try this in Excel and you’ll see smoke rising from your computer. GREP is so cool that I’ve written about it before.
Using whatever spreadsheet software you prefer, import the errors-no-google.txt file as a space-delimited text file:
You won’t need most of the columns. Only three really matter: the requested URI, the response code, and the referrer.
You can delete the rest of the columns. Then insert a new row at the top of the page and label the columns:
That’ll let you indulge in some data processing niftiness later on.
Oh, and save the damned spreadsheet. Nothing sadder than losing all your data because your cat strolled across the keyboard.
Put your cursor in the heading row you created in step 6 and click the filter button:
Now you can sort and/or filter out stuff you don’t need. For example, I may not want to see all of those ‘-’ referrers:
And I probably only want to see external broken links, so I can filter out all referrers that include this site’s domain name:
Note that I used ‘does not contain’ for the second filter. Read up on Excel’s filter tool. It’s your friend.
Phew. Finally. We can find some external links. Take a look at the result:
It’s a link goldmine!!! Every row represents a broken link from another site.
Now you can use a pivot table or other spreadsheet awesomeness to find the biggest problems:
Or, you can just browse through the raw data. Either way, you’ll find great, easy incoming links.
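If you’d rather stay on the command line, here’s a rough pivot-table stand-in using awk. The field position ($3) is an assumption about the sample layout below; count the fields in your own log format first:

```shell
cd "$(mktemp -d)"                     # scratch folder with fabricated data
cat > errors-no-google.txt <<'EOF'
1.1.1.1 "GET /gone.htm HTTP/1.1" 404 0 "http://a.example/" "Mozilla/5.0"
2.2.2.2 "GET /gone.htm HTTP/1.1" 404 0 "http://b.example/" "Mozilla/5.0"
3.3.3.3 "GET /other.htm HTTP/1.1" 404 0 "http://a.example/" "Mozilla/5.0"
EOF
# Count 404s per requested URI, biggest offenders first:
awk '{print $3}' errors-no-google.txt | sort | uniq -c | sort -rn
```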
Prioritize broken links like this:
None of this work means a thing if no one fixes the links! Here are the ways to fix them, from best to worst:
Always use options 1-3 before 4. A permanent redirect is a very imperfect solution, and best applied when you have no other options. 301 redirects will reroute authority for a while, but eventually the authority ‘decays’. Plus, a high number of 301 redirects on a site can wreak havoc with Google and Bing. Both search engines’ crawlers will give up if they see too many redirect ‘hops’.
This post has over 1800 words. At this point you’re probably ready to stab me. Please don’t. I like my insides in.
And, this isn’t nearly as hard as it seems. With practice, you’ll be zipping through all these steps in under an hour. It’s by far the quickest, easiest way to improve site authority.
Ian Lurie is founder and CEO of Portent Inc., an internet marketing agency that has provided internet marketing, including PPC, SEO, social and analytics services, since 1995.