Ian Lurie // Mar 10 2011
Log files may not seem sexy. But if you want to do better on- and offsite optimization, in less time, then you need to learn to use them. But first, a little history.
In college and law school, I raced bikes. That’s bicycles. I sucked. And I had to wear lycra shorts, which apparently were supposed to be sexy, but make me feel like a dark, shiny sausage to this day. It was fun, though:
At that point, most racers used tires called sew-ups or tubulars. These tires have the tube sewn inside the tire. They’re literally glued onto your wheels, and the wheels are super-light and super-responsive because they don’t need any of the hardware that holds a tire and tube in place. Racing on tubulars transformed me into an almost-acceptable road racer.
But, tubulars are for advanced riders. Gurus. Experts, even. And, if you’re gonna use ‘em, you’d better know how to glue them on.
The glue you use to attach a tire to a rim generally sticks to everything except the rim—your fingers, the furniture, your hair, your clothes. Just not the damned wheel.
It may seem like a ridiculous task. Why not have someone else do it?! Or just throw the tire on there with a few dabs of glue. But if you improperly glue a tubular on to a wheel, this is what happens:
I don’t know the source for this image. If you know it, please let me know so I can give credit.
Not so advanced.
Now, as you gracefully flip over your handlebars and watch the road/velodrome rise up to greet you with a wave and a smile, your brain is saying something I can’t print. But the loose translation, from the part of your brain that’s still functioning, is something like:
But I’m advaaaannnnnnnccceeeddddddd
In SEO, log files are like glue. If you really want to get ahead and stay there, you have to learn the basic stuff that holds everything together. So, if you want to avoid those ass-on-asphalt moments, learn to look through the log files.
Wait right there. I can see you’re about to snort and head back to Google Analytics, or Coremetrics, or whatever. Don’t.
Tools like Sawmill and Urchin are nifty, too. But they don’t let you drill down and analyze the data with the level of precision you want and need.
All those tools miss stuff. So learn the log files.
So, finally, here’s the process:
First off, you need a list of bots: Data you can use to identify the search engine crawlers based on their IP address, and on the user-agent identifier in your log file.
My favorite is Fantomaster’s SpiderSpy database. I’ve got no special affiliation with them. It’s just a rockin’ database that has yet to let me down.
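Here’s a rough sketch of what matching on both user-agent and IP address can look like in Python. It is not tied to any particular bot database: `BOT_UA_TOKENS`, `BOT_NETWORKS` and `identify_bot` are names made up for this example, and the single IP range is a placeholder you should check against your own bot list before trusting it.

```python
import ipaddress

# Hypothetical lookup tables. In practice you would load these from your
# bot database; the entries below are examples only.
BOT_UA_TOKENS = [
    ("googlebot-image", "Google Images"),  # more specific tokens come first
    ("googlebot", "Google"),
    ("msnbot-media", "MSN Images"),
    ("msnbot", "MSN"),
    ("bingbot", "Bing"),
]
BOT_NETWORKS = {
    # Placeholder range: verify against your bot list before relying on it.
    "Google": [ipaddress.ip_network("66.249.64.0/19")],
}

def identify_bot(ip, user_agent):
    """Return (bot_name, ip_verified), or (None, False) for non-bot traffic."""
    ua = user_agent.lower()
    for token, name in BOT_UA_TOKENS:
        if token in ua:
            family = name.split()[0]  # "Google Images" -> "Google"
            try:
                addr = ipaddress.ip_address(ip)
            except ValueError:
                return name, False  # the log held a hostname, not an IP
            nets = BOT_NETWORKS.get(family, [])
            return name, any(addr in net for net in nets)
    return None, False
```

A hit with `ip_verified` set to `False` is either a bot on a range you haven’t listed or something faking the user-agent. Either way, it’s worth a second look.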
Now, you need to get your web server log files. Depending on how busy your site is, these files might be a couple million lines for a month, or a couple million for an hour’s worth of data.
The files can be quite large, so get them ZIPped before you download them.
Don’t tell me you can’t get at your logs!!!! You want to be advanced? Find a way. If your web host, IT person or webmaster won’t hand ’em over, torture can work:
But truthfully, bribery works better. I’ve gotten log files with as little as a ‘please’ and as much as a case of Mt. Dew. Whatever works. You need this data.
Try to get at least 1 week’s worth of data. If you have a huge, busy site and the log files are huge, the good news is that you can probably get enough data with just a day’s worth of logs.
For both of these datasets, you need to retrieve, at a minimum:
If you don’t know how to read a log file, check this article I wrote a while back.
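If you’d rather let Python do the reading, here’s a minimal sketch of a line parser. It assumes your server writes the common Apache-style “combined” log format (IP, timestamp, request, status, bytes, referrer, user-agent). Your format may differ, so treat `LOG_PATTERN` and `parse_line` as illustrative names and adjust the pattern to match your logs.

```python
import re

# Assumes Apache-style combined log format. The referrer and user-agent
# fields are optional so plain "common" format lines still parse.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Turn one log line into a dict, or return None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    row["status"] = int(row["status"])
    # Servers write "-" when no body was sent; treat that as zero bytes.
    row["bytes"] = 0 if row["bytes"] == "-" else int(row["bytes"])
    return row

sample = ('66.249.66.1 - - [10/Mar/2011:06:25:14 -0800] '
          '"GET /images/logo.gif HTTP/1.1" 404 512 "-" "Googlebot-Image/1.0"')
row = parse_line(sample)
print(row["url"], row["status"], row["agent"])
```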
I wrote a Python script that does all of this for me. If you know how to use the command line, you can download the script and try it for yourself. I make no guarantees, but it’s saved me a lot of time.
You can also use Adam Audette’s most excellent log filter script.
Or you can try importing the whole log file into Excel, space-delimited. But Excel 2003 and earlier stop at 65,536 rows per sheet, and Excel 2007 and later stop at 1,048,576, which a busy site can blow past easily. You’ll likely need to learn one of the following:
Regardless, generate the two datasets.
Then pull those into whatever tool you want to use. Excel usually works for me at this point, because the filtered dataset is smaller. On to step 4.
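To get a big log down to something Excel can open, one option is to stream it and keep only the crawler lines. This is not the script mentioned above, just a sketch of the idea. It assumes the log was compressed with gzip (a `.gz` file; for a true `.zip` you would use Python’s `zipfile` module instead), and `BOT_TOKENS` and `filter_bot_lines` are made-up names.

```python
import gzip

# Example user-agent fragments only; extend from your own bot list.
BOT_TOKENS = ("googlebot", "msnbot", "bingbot")

def filter_bot_lines(src_path, dest_path, tokens=BOT_TOKENS):
    """Copy lines mentioning any token from a gzipped log to a text file.

    Returns the number of lines kept.
    """
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8", errors="replace") as src, \
         open(dest_path, "w", encoding="utf-8") as dest:
        for line in src:  # reads one line at a time, so memory use stays small
            lowered = line.lower()
            if any(token in lowered for token in tokens):
                dest.write(line)
                kept += 1
    return kept
```

Something like `filter_bot_lines("access.log.gz", "bots.log")` (both file names hypothetical) leaves you with a file that is usually a small fraction of the original.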
I usually start my analysis by looking for problems with image indexation. Filter your list for Google or MSN’s imagebots. Then look for 404 or other errors:
Bam. Easy, quick fixes for image indexation.
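If you’d rather not do the filtering by hand, the same check is a few lines of Python. This sketch assumes each log line has already been parsed into a dict with `agent`, `status` and `url` keys. Those key names, `IMAGE_BOT_TOKENS` and `image_bot_errors` are all my own labels for the example.

```python
from collections import Counter

# Example user-agent fragments for image crawlers; extend from your bot list.
IMAGE_BOT_TOKENS = ("googlebot-image", "msnbot-media")

def image_bot_errors(rows):
    """Count error responses (status 400 and up) served to image crawlers.

    Returns [((url, status), hits), ...] sorted by hits, highest first.
    """
    errors = Counter()
    for row in rows:
        agent = row["agent"].lower()
        is_image_bot = any(token in agent for token in IMAGE_BOT_TOKENS)
        if is_image_bot and row["status"] >= 400:
            errors[(row["url"], row["status"])] += 1
    return errors.most_common()
```

The URLs at the top of that list are the ones the imagebots keep hitting and failing on, so they’re the first to fix.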
You can also look for repeated transfers of smaller images:
Those are likely imagebots wasting time fetching navigation buttons and such. Find those, and make sure your server answers the bots’ repeat requests for them with a ‘304’ status, meaning ‘not modified’. Or find another way to tell bots to skip them. Step 5 is where we get to the really nerdgasmic stuff, though.
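Spotting those repeated small transfers can be sketched like this. It assumes you pass in rows already filtered down to imagebot requests, each a dict with `url`, `status` and `bytes` keys. The 5,000-byte and three-hit thresholds are arbitrary guesses, and `repeated_small_transfers` is a made-up name.

```python
from collections import Counter

def repeated_small_transfers(rows, max_bytes=5000, min_hits=3):
    """Find small files that were fully re-sent (status 200) again and again.

    The thresholds are arbitrary starting points; tune them for your site.
    Returns [(url, hits), ...] sorted by hits, highest first.
    """
    hits = Counter(
        row["url"]
        for row in rows
        if row["status"] == 200 and 0 < row["bytes"] <= max_bytes
    )
    return [(url, count) for url, count in hits.most_common() if count >= min_hits]
```

Anything this returns was sent in full every time, even though it almost certainly hadn’t changed between visits.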
I want to see if Google’s wasting crawl budget on crappy URLs, bad redirects and such. So, I go back to my same Excel sheet and try a quick look, filtering just for Googlebot. Then I filter out clicks from things like PPC ads, so I get a good look at important URLs.
OK, first thing is we’ve got session IDs getting crawled. That’s duplicate content and wasted budget. Plus, they’re 302-redirecting somewhere else. That’s probably a well-intended fix that’s causing even more problems:
Next, UTM codes are in links, somewhere, and they’re getting crawled, which is undoubtedly creating duplicate content:
And here’s the thing: None of this stuff shows up in Google Webmaster Tools. What the hell?!!! Googlebot is crawling them, so we know it’s wasting time, and that we’re spending crawl budget on these links.
You never would’ve found this stuff without going into the log file. So nyah.
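Here’s one way to flag both problems, session IDs and UTM codes, in the URLs Googlebot requested. It’s a sketch that only looks at query-string parameters. The session parameter names in `SESSION_PARAMS` are guesses, so swap in whatever your site actually uses, and `flag_url` is a made-up name.

```python
from urllib.parse import urlsplit, parse_qsl

# Assumed session parameter names; check your own URLs and adjust.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def flag_url(url):
    """Return 'session', 'utm', or None for one crawled URL."""
    query = urlsplit(url).query
    keys = {key.lower() for key, _ in parse_qsl(query, keep_blank_values=True)}
    if keys & SESSION_PARAMS:
        return "session"
    if any(key.startswith("utm_") for key in keys):
        return "utm"
    return None

for url in ("/products?sessionid=abc123",
            "/blog/post?utm_source=twitter&utm_medium=social",
            "/about"):
    print(url, flag_url(url))
```

Run that over every URL Googlebot fetched and count the flagged ones to see how much of the crawl is going to junk.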
OK, now I’ll jump to links. Easiest way to build links? Find the ones you already have that are broken.
Google Webmaster Tools helps a little. Bing helps a little. But not much.
So, I take my other script, which runs a report based on busted links, collecting the referrers and such. I pull that into Excel and again set up a filter, checking for broken links from edu sites:
I’ve got a great .edu link that’s broken?!! I’ll build some content at the target URL and get a great link in a few minutes.
Oh, and by the way – that link didn’t show up in Google or Bing webmaster tools.
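The broken-link check is easy to approximate in Python, too. This isn’t my report script, just a minimal sketch of the same idea. It assumes each log line is already a dict with `status`, `url` and `referrer` keys, and `broken_edu_links` is a made-up name.

```python
from urllib.parse import urlsplit

def broken_edu_links(rows):
    """List (referrer, missing_url) pairs where a .edu page links to a 404."""
    found = set()
    for row in rows:
        # A missing referrer is usually logged as "-", which has no hostname.
        host = urlsplit(row["referrer"]).hostname or ""
        if row["status"] == 404 and host.endswith(".edu"):
            found.add((row["referrer"], row["url"]))
    return sorted(found)
```

Each pair tells you which page is linking and which URL it expects to find, which is exactly what you need to put content back at that address.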
None of this stuff is sexy. No one’s going to call you an SEO ninja for doing it. But SEO is about lots of little details. You win at SEO with a tool belt, not a black belt. So put in the time to figure this stuff out. Or, end up on the asphalt.
Up to you.
Ian Lurie is founder and CEO of Portent Inc., an internet marketing agency that has provided internet marketing, including PPC, SEO, social and analytics services, since 1995.