How to: Read a web site log file
Sooner or later, if you’re a real internet marketing nerd, you’re going to have to learn to read a web server log file. Sorry, but it’s true. There’s data in there that you simply cannot get anywhere else.
Earlier this week, I wrote about using grep to survey web server log files for SEO-related data. That was a little premature, so today I’m going to explain the whys and hows of reading a basic log file.
Why do I have to learn to read log files?
Because log files are to internet marketing what watts are to cycling: The only truly accurate measure of what’s going on when you try to propel yourself.
Google Analytics is great, but it has its shortcomings:
- It can slow down or stop working for brief periods that always seem to occur when you’re showing something to a client.
- It doesn’t track search engine spider or ‘bot’ traffic.
- It doesn’t store server response codes, such as 404 (page not found) or 304 (not modified).
- You can’t fully attribute each visit in a conversion funnel, because it uses first/last click attribution.
- Some idiot can delete your account. Poof.
The basics: What are log files?
Web servers are kind of obsessive. They record everything that happens: every image download, error, page load, etc. And they put it all in a text file – a plain old, simple text file – called a log.
If you visit mysite.com/mypage.html, there’s a conversation between your web browser and my web server. It goes like this:
Browser: Yo. How’s it going? You there?
Server (scribbles note that this particular browser contacted it): Yep, I’m here. What do you need?
Browser: Lessee, I need mypage.html, plus all of the images and other stuff included in the page.
Server (scribbles a list of what’s being requested): OK, one sec.
Server: OK, here you go.
Server (scribbles a note that the files were delivered without any problems, when the delivery happened, and other details)
Every time I wrote ‘scribbles’, the server recorded something in the log file. So this one little visit to one page on my web site can easily create four or more lines in my server log: one for the .html page, one for each image, and one for every other file the page uses, such as a scripts.js file.
And that, Virginia, is where log files come from.
Getting your log files
If you’re hosted through a service like Media Temple, you can download your log files from your control panel. Look for an item called ‘raw log files’ or ‘download raw log files’.
If you’re at a big company, or have someone else managing your site, you’ll need to ask them for the log files.
Make sure you have your server configured to save log files! Also, make sure your server is logging as much information as possible. Server software may come pre-configured to store only some of the data you need.
The best way to learn is to dive in. So, go grab your web site logs, or download the sample I’ve got: samplelog.txt. I’ve scrambled a lot of the file to protect the innocent. If you see something odd, that’s why.
Anatomy of a log file
Go ahead and open up the log file. If you’re using your own, make sure the file isn’t enormous. A busy server can generate log files in excess of 1 gigabyte in a single day – if you try to open that with a text editor, your computer may decide to end it all.
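If your log is too big for a text editor, one option is to peek at only the first few lines. Here's a minimal Python sketch of that idea. The function name head is my own invention, and the sample lines are made up for illustration.

```python
from itertools import islice


def head(lines, count=10):
    """Return the first `count` lines from any iterable of lines."""
    # islice stops after `count` items, so a huge file is never read in full.
    return [line.rstrip("\n") for line in islice(lines, count)]


# Made-up stand-in for a log file. With a real file you would write:
#     with open("samplelog.txt", encoding="utf-8", errors="replace") as log:
#         print("\n".join(head(log, 5)))
sample = ["#Software: example server\n", "#Version: 1.0\n", "first request\n"]
print(head(sample, 2))
```

Because an open file hands over its lines one at a time, this should work on a multi-gigabyte log without loading the whole thing into memory.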
Here’s a look at the log file I’m using:
The first few lines describe the server and tell me when this file was generated. Then comes the good stuff: Line 4 lists the fields in the log file. Think of fields as columns in a spreadsheet. The fields are the headers of each column. This information is important, because different servers may record logs in slightly different ways.
After that, you get line after line of what looks, at first glance, like a cat ran around on a keyboard and then sent you the result. But it does mean something: Every line is one request by a visiting web browser. A request occurs when a visiting browser says “Lessee, I need this file and that file, please.” Put all those requests together and you’ve got a moment-by-moment history of every move someone/something makes on your site.
If you want to know the history, though, you have to speak the language. Keep going…
Hint: In your text editor, make sure you have ‘soft wrap’ or ‘wrap lines’ turned off. That way, every line will continue, straight out across the screen.
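The fields line and the request lines described above can also be read by a short script. This is a rough Python sketch, assuming the header line starts with #Fields: (as it does in the W3C extended log format these field names come from) and that values are separated by spaces. The function name parse_log and the sample lines are made up for illustration.

```python
def parse_log(lines):
    """Yield one dict per request line, keyed by the names in the #Fields line."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The header line names the columns for the request lines that follow.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            # Each request line is a space-separated list of values.
            yield dict(zip(fields, line.split()))


# Made-up sample lines for illustration; they are not taken from a real log.
sample = [
    "#Software: example server",
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2010-03-24 07:00:01 GET / 304",
    "2010-03-24 07:00:05 GET /missing.html 404",
]
for request in parse_log(sample):
    print(request["cs-uri-stem"], request["sc-status"])
```

If your server uses different field names or a different separator, the same approach should still apply with small changes.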
Log file field descriptions
I’m not going to go through every log file field. If you want the full list let me know. Here are the ones you want to know about. The names of your header fields may be a little different, but they should be easy to match up with these:
date: The date of this request.
time: You guessed it – the time this request began.
cs-method: The request method used for this request. Usually it’ll be GET, POST, or HEAD. GET is the most common – it’s a standard “Gimme that file” request. I won’t try to explain the others here. This post is long enough as it is.
cs-uri-stem: The address of the page requested, up to but not including any query attributes like ?this=that. The log typically records just the path, without the domain, so the stem of http://www.mysite.com/stuff/index.html?this=that is /stuff/index.html
cs-uri-query: All of the stuff after the ? in the URL, if there is anything.
c-ip: The IP address of the browser making the request.
cs(User-Agent): Basically, the browser make and model.
cs(Cookie): Any cookie data the browser sent along with the request. These are the cookies the server placed on that browser earlier.
cs(Referer): The page from which the browser came. If you clicked a link on theirsite.com and landed on mysite.com, then the referer is the address of that page on theirsite.com. (The misspelling of ‘referrer’ goes all the way back to the original HTTP specification.)
sc-status: The response code the server delivered for the request. This is very important! To learn what the response codes mean, look up a list of HTTP status codes.
time-taken: How long it took for the server to answer this request. Handy if you’re checking for a problem.
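If you'd rather not memorize response codes, Python's standard library can translate the common ones for you. A small sketch, where the helper name describe_status is my own:

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase for an HTTP status code, if known."""
    try:
        return HTTPStatus(int(code)).phrase
    except ValueError:
        # Raised for non-numeric values and for codes the library doesn't know.
        return "Unknown status"


for code in ("200", "304", "404"):
    print(code, describe_status(code))
```

Log files store the code as text, which is why the helper converts it to a number first.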
Applying your newfound knowledge
OK, enough messing around in text editors. Open the same log file in a spreadsheet program. In Excel, I use File > New and then File > Import. Then I select Delimited File, with space as the delimiter.
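If the import wizard gives you trouble, another route is to convert the log to CSV first, so the spreadsheet opens it with proper column headers. A rough Python sketch, again assuming a #Fields: header line and space-separated values; log_to_csv is a made-up name and the sample lines are invented.

```python
import csv
import io


def log_to_csv(lines):
    """Convert space-delimited log lines to CSV text, using #Fields as the header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            writer.writerow(line.split()[1:])  # column names, minus the "#Fields:" label
        elif line and not line.startswith("#"):
            writer.writerow(line.split())      # one request per row
    return out.getvalue()


# Made-up sample lines for illustration only.
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2010-03-24 07:00:01 GET / 304",
]
print(log_to_csv(sample))
```

Save the returned text to a .csv file and most spreadsheet programs should open it directly.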
Here’s what the first line of my log file tells me:
- At 7 AM on 3/24/10, Baidu’s search spider visited my site’s home page (that’s what the ‘/’ means).
- It issued a GET request.
- My server responded with a 304 code, meaning nothing’s changed on the home page.
Reading one line isn’t very useful. But, if you start looking at entire sessions – all requests by a single browser after it arrives at your site – you can learn a lot. If I sort by user agent, I can see what MSNBot (that’s Bing) found the last time it crawled my site.
I can also see any errors that occurred during MSNBot’s crawl. It’s a great way to quickly diagnose issues.
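Looking at one spider's crawl doesn't have to happen in a spreadsheet. Here's a minimal Python sketch that pulls out everything one user agent requested, then picks out the errors. It assumes each request is already a dict keyed by field name. The function name crawl_report and the rows, including the user agent strings, are made up for illustration.

```python
def crawl_report(rows, agent_keyword):
    """Return (url, status) pairs for requests whose user agent contains the keyword."""
    keyword = agent_keyword.lower()
    return [
        (row["cs-uri-stem"], row["sc-status"])
        for row in rows
        if keyword in row["cs(User-Agent)"].lower()
    ]


# Made-up rows for illustration; real rows would come from your own log.
rows = [
    {"cs(User-Agent)": "msnbot/example", "cs-uri-stem": "/", "sc-status": "200"},
    {"cs(User-Agent)": "Baiduspider/example", "cs-uri-stem": "/", "sc-status": "304"},
    {"cs(User-Agent)": "msnbot/example", "cs-uri-stem": "/old-page.html", "sc-status": "404"},
]

visited = crawl_report(rows, "msnbot")
# Status codes in the 400s and 500s are errors.
errors = [(url, status) for url, status in visited if status.startswith(("4", "5"))]
print(visited)
print(errors)
```

Matching on a keyword rather than the exact string is a design choice: user agent strings tend to carry version numbers and other details that change over time.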
That’s it. You now know how to read a log file. What you choose to do with this knowledge is, of course, your own business. If you want to use your powers for good, I suggest:
- Regularly checking log files for any 404 codes. ‘404’ means ‘page not found’. You’ll want to fix those.
- Measuring how often search engines visit your site, and how deeply they crawl. These are good indicators of how important the search engines think your site is.
- Finding the files that account for the longest downloads, and then making ‘em smaller.
Those are just a few ideas.
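The three checks in that list can be roughed out in a few lines of Python. This is only a sketch, assuming each request is a dict keyed by field name and that time-taken holds a whole number in whatever unit your server records. The function names, the spider keywords, and the sample rows are all made up for illustration.

```python
from collections import Counter


def not_found_urls(rows):
    """Count how often each URL returned a 404."""
    return Counter(r["cs-uri-stem"] for r in rows if r["sc-status"] == "404")


def bot_hits(rows, keywords=("msnbot", "baiduspider")):
    """Count requests per spider, matching keywords against the user agent."""
    hits = Counter()
    for r in rows:
        agent = r["cs(User-Agent)"].lower()
        for keyword in keywords:
            if keyword in agent:
                hits[keyword] += 1
    return hits


def slowest_urls(rows, top=3):
    """Return the `top` URLs ranked by their largest single time-taken value."""
    worst = {}
    for r in rows:
        url = r["cs-uri-stem"]
        worst[url] = max(int(r["time-taken"]), worst.get(url, 0))
    return sorted(worst.items(), key=lambda item: item[1], reverse=True)[:top]


# Made-up rows for illustration only.
rows = [
    {"cs-uri-stem": "/", "sc-status": "200", "time-taken": "120", "cs(User-Agent)": "msnbot/example"},
    {"cs-uri-stem": "/gone.html", "sc-status": "404", "time-taken": "15", "cs(User-Agent)": "msnbot/example"},
    {"cs-uri-stem": "/big.jpg", "sc-status": "200", "time-taken": "4300", "cs(User-Agent)": "Mozilla/example"},
    {"cs-uri-stem": "/gone.html", "sc-status": "404", "time-taken": "12", "cs(User-Agent)": "Baiduspider/example"},
]

print(not_found_urls(rows))
print(bot_hits(rows))
print(slowest_urls(rows, top=2))
```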
As I said above, log files can be humungous. If that’s the case, you can use grep to search them, or import them into a database tool instead of Excel.
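For the database route, here's one possible sketch using SQLite through Python's built-in sqlite3 module. SQLite is my pick for the example, not the only option. The table layout, the function name load_into_sqlite, and the sample rows are made up for illustration.

```python
import sqlite3


def load_into_sqlite(rows):
    """Load request rows into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requests (url TEXT, status INTEGER, agent TEXT)")
    conn.executemany(
        "INSERT INTO requests VALUES (?, ?, ?)",
        [(r["cs-uri-stem"], int(r["sc-status"]), r["cs(User-Agent)"]) for r in rows],
    )
    return conn


# Made-up rows for illustration only.
rows = [
    {"cs-uri-stem": "/", "sc-status": "200", "cs(User-Agent)": "msnbot/example"},
    {"cs-uri-stem": "/gone.html", "sc-status": "404", "cs(User-Agent)": "msnbot/example"},
    {"cs-uri-stem": "/gone.html", "sc-status": "404", "cs(User-Agent)": "Baiduspider/example"},
]

conn = load_into_sqlite(rows)
# Which URLs returned a 404, and how many times?
query = (
    "SELECT url, COUNT(*) FROM requests "
    "WHERE status = 404 GROUP BY url ORDER BY COUNT(*) DESC"
)
for url, hits in conn.execute(query):
    print(url, hits)
```

For a real log you would probably connect to a file on disk instead of ":memory:", so the data sticks around between sessions.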