Get geeky with GREP as an SEO tool


Ian Lurie Mar 23 2010

This is a preview of one of the advanced topics in the soon-to-be-published Fat Free Guide to Internet Marketing. Read it and see what you’ll be getting.

I’ve occasionally written about log files, and why I love them. It’s not an entirely healthy relationship – they sit there like, well, logs, and I slave away, grooming them, combing out data and finding snarls where search engines got stuck. Do they appreciate it? Noooo. They’re the aloof cats of the online world.
But, now and then, they give something back. This is one of those times. In this post, you’ll learn to use grep to crunch through web server log files and do some old-school analytics, such as finding all Googlebot hits on your site.

I’m a Linux/Mac OS X nerd. All tools I use in this post are built into both. If you’re on Windows, you’ll need to install a tool like wingrep. If you want the full Linux command line in Windows, go for Cygwin. Not for the faint of heart, though.

What you’ll need

A computer with grep installed, one way or another.
A server log file.
A willingness to embrace the command line, if only for a few minutes.

Using grep as the ultimate search tool

Now, before we start, understand that grep is a command-line tool. This means going away from the nice, clickable interface we’ve all grown to love. It’s OK! Command line is your friend. It lets you do all sorts of magical stuff. It also won’t do unexpected things, like crash every other program on your computer, unless you really work at it. And it makes you seem more attractive to nearby geeks of the opposite (or same) sex. All important stuff.
Grep also happens to be the ultimate search tool if you’re trying to grab lines from a huge log file. That’s what we’re going to use it for.
Using grep is easy:

  1. On OS X, click Go > Utilities > Terminal. If you’re on Linux I’ll assume you know this part. The terminal window will appear, all deserted and sparse looking:
    terminal window all scary and threatening
  2. Type in ‘grep’ and press the Enter key. Whups. You get something like this:
    grep options
  3. That’s OK. It’s just grep saying “Hey, you didn’t type in enough information, here’s what you need to tell me.” For our purposes, grep needs at a minimum two inputs: the pattern you’re searching for, and the filename.

So, if you had a log file named ‘accesslog.log’, then
grep "Googlebot" accesslog.log
would sift through the file, grabbing every line that had ‘Googlebot’ in it. Note the quotes around the pattern. That’s how you type it: plain, straight double quotes, not the curly kind a word processor produces.
Add one extra piece and grep will put the results in a nice neat file for you:
grep "Googlebot" accesslog.log > googlebothits.log
The > googlebothits.log tells your computer “write the result of my grep search to a file called googlebothits.log”. If that file already exists, it gets overwritten.

If you’re using a Windows tool, it may let you just click a button that says ‘write output to file’. Showoffs.

Also important: Grep is case sensitive! See how I’m capitalizing the ‘G’ in ‘Googlebot’? That’s why.
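
Here is a small sketch of the case-sensitivity point, for anyone who wants to see it happen. The folder grep_demo_case, the file sample.log and its two lines are all made up for illustration; -i (ignore case) and -c (count matching lines) are standard grep options.

```shell
# Scratch folder so nothing real gets touched (hypothetical name).
mkdir -p grep_demo_case

# Two made-up lines standing in for a log file. Not real log data.
printf '%s\n' \
  'hit from Googlebot/2.1' \
  'hit from googlebot-lowercase' > grep_demo_case/sample.log

# Default behavior is case-sensitive: only the capital-G line matches.
grep "Googlebot" grep_demo_case/sample.log

# Add -i to ignore case: both lines match.
grep -i "googlebot" grep_demo_case/sample.log

# -c prints how many lines matched instead of the lines themselves.
grep -c "Googlebot" grep_demo_case/sample.log      # prints 1
grep -i -c "googlebot" grep_demo_case/sample.log   # prints 2
```

If a search comes back empty when you know the bot was there, a mismatch in capitalization is a likely first suspect.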

See the potential? Grep turns a log file into an instant database. You can search for:

  • All visits that started with a search on the keyword ‘nubwit’.
  • All visits that generated a 404 page not found error.
  • All visits by a bot.

And so on. Good stuff, in one little line of text.
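
To make those three searches concrete, here is a rough sketch. It rests on assumptions: the three sample lines are invented, they follow the Apache "combined" log layout (your server may log differently), and the keyword search assumes the search term shows up in the referrer as q=nubwit. Adjust the patterns to match what your own log lines look like.

```shell
# Scratch folder and a made-up three-line log (illustrative data only,
# written in the Apache "combined" layout as an assumption).
mkdir -p grep_demo_search
cat > grep_demo_search/sample_access.log <<'EOF'
192.0.2.10 - - [23/Mar/2010:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 5120 "http://www.google.com/search?q=nubwit" "Mozilla/5.0 (Windows; U)"
192.0.2.11 - - [23/Mar/2010:10:00:05 -0700] "GET /missing.html HTTP/1.1" 404 312 "-" "Mozilla/5.0 (Macintosh; U)"
198.51.100.7 - - [23/Mar/2010:10:00:09 -0700] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# Visits that arrived from a search on 'nubwit' (term appears in the referrer).
grep "q=nubwit" grep_demo_search/sample_access.log

# 404s: in this layout the status code sits right after the closing quote
# of the request. Single quotes are used because the pattern itself
# contains a double quote.
grep '" 404 ' grep_demo_search/sample_access.log

# Visits by one particular bot.
grep "Googlebot" grep_demo_search/sample_access.log
```

Searching for the bare number 404 would also catch lines where 404 happens to appear in a URL or byte count, which is why the sketch anchors it to the quote and spaces around the status field.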

Putting it to good use

Time to use grep.
Reminder: Grep is case-sensitive unless you add a -i modifier to it. I won’t go into that now. Just make sure you use capitals where necessary.

  1. In your terminal window, navigate to the folder that holds your log file. If you can’t figure it out, the command you want is cd. If you’re not sure where you are, use pwd first. Once you get there, you can use ls to list the directory contents:
    terminal-pwd-cd.gif
  2. OK! We’re in the right place. But we’re stuck with a bunch of different files. We could learn to grep them all together. But it’s easier to just combine them using another set of commands. Moving along…
  3. Type touch mylog and press Enter, where mylog is the name of the file you want to create. Then type cat *.log > mylog and press Enter. If you have large files (more than 500 megabytes total) it could take a few minutes. It’s hard work. Then voila – you’ll have one big log file in mylog:
    using the cat command to combine your files
  4. Now, say you want to grab every instance where Googlebot – Google’s search crawler – grabbed content from your site. This could tell you whether Google’s having trouble crawling your whole site. In that case, you’ll type grep "Googlebot" mylog > googlebot_visits.txt. Press Enter.
  5. Your computer will creak a little, then sift through the whole file, grabbing any lines from the log that include a mention of ‘Googlebot’, and then copying them to a file called googlebot_visits.txt:
    terminal_grep_googlebot.gif

See? No pain. You didn’t end the universe. Nor did you reduce your computer to a smoking heap.
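
The steps above can be sketched end to end like this. The folder grep_demo_combine and the two tiny log files are made up so the commands have something to chew on; with your own logs you would run the cat and grep lines from inside your log folder instead.

```shell
# Scratch folder with two tiny made-up log files (illustrative data only).
mkdir -p grep_demo_combine
printf '%s\n' 'page1 Googlebot/2.1' 'page2 SomeBrowser' > grep_demo_combine/monday.log
printf '%s\n' 'page3 Googlebot/2.1' > grep_demo_combine/tuesday.log

# Combine every .log file into one. The output name has no .log extension,
# so cat never tries to read the file it is writing to. The > redirect
# creates mylog if it does not exist yet.
cat grep_demo_combine/*.log > grep_demo_combine/mylog

# Grab every Googlebot line and save the result.
grep "Googlebot" grep_demo_combine/mylog > grep_demo_combine/googlebot_visits.txt

# Sanity check: how many Googlebot lines landed in the output file?
grep -c "Googlebot" grep_demo_combine/googlebot_visits.txt   # prints 2
```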
Open googlebot_visits.txt in the tool of your choice. If it’s a small file, I use Excel. If it’s huge, I may use more grep commands, or import the whole thing into a database tool, so I can do the analysis. Here’s how my Googlebot report looks in Excel:
final-log-excel.gif
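
As for the "more grep commands" route, here is one possible sketch of what that can look like. The three lines in googlebot_visits.txt are invented and assume the Apache "combined" layout; the folder and output file names are hypothetical too.

```shell
# Made-up Googlebot hits standing in for a real googlebot_visits.txt
# (illustrative data, Apache "combined" layout assumed).
mkdir -p grep_demo_more
cat > grep_demo_more/googlebot_visits.txt <<'EOF'
198.51.100.7 - - [23/Mar/2010:10:00:09 -0700] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
198.51.100.7 - - [23/Mar/2010:10:01:12 -0700] "GET /old-page.html HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
198.51.100.7 - - [23/Mar/2010:10:02:30 -0700] "GET /blog/ HTTP/1.1" 200 8192 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
EOF

# How many Googlebot hits are there in total?
grep -c "Googlebot" grep_demo_more/googlebot_visits.txt   # prints 3

# Narrow it down: pages where Googlebot hit a 404. Saved to its own file.
grep '" 404 ' grep_demo_more/googlebot_visits.txt > grep_demo_more/googlebot_404s.txt

# Or chain two searches with a pipe: Googlebot lines, then only the 404s.
grep "Googlebot" grep_demo_more/googlebot_visits.txt | grep '" 404 '

# -v flips the match: everything that was NOT a 200.
grep -v '" 200 ' grep_demo_more/googlebot_visits.txt
```

Each pass leaves a smaller file, so by the time you open one in a spreadsheet it is usually a manageable size.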

If this is gibberish, wait 24 hours. Tomorrow I’m going to write about how to read a log file.

But, why?

You could use a commercial log analyzer, like Sawmill or Splunk. But those cost real money, and grep is free. Plus, all log analyzers make certain assumptions. If you’re a hardcore internet marketing nerd like me, you don’t like anyone making assumptions about anything. I want to see the raw log files and know I’m working with the real thing.
There’s another reason: Once you work with something like grep, you get more comfortable with using your computer to sift through mountains of data. That can open a lot of options that you may not have known about. This post barely scratches the surface.

Alternatives

If this was too painful, a few alternatives are:

Other reading

 


2 Comments

  1. Thanks Ian for sharing the details about how GREP can be used to find google bot visits to our site! Surely it is a time saver for us! Especially because neither Google Webmaster tools nor Google analytics are showing google bot visits with the same level of detail as provided by our web server log files!

  2. Ron

    Thanks for the crash course in log file analysis with GREP. I could see many circumstances where this could be useful.
    I am currently using Awstats and comparing the output with Webalizer and a JavaScript tagging solution from Yahoo. So far the results have been very encouraging. I also have noticed I get a ton more spider hits than I would have thought.
    However this is great information. I will have to try this out with my other log file tools. Thanks again!
