Get geeky with GREP as an SEO tool
Ian Lurie Mar 23 2010
This is a preview of one of the advanced topics in the soon-to-be-published Fat Free Guide to Internet Marketing. Read it and see what you’ll be getting.
I’ve occasionally written about log files, and why I love them. It’s not an entirely healthy relationship – they sit there like, well, logs, and I slave away, grooming them, combing out data and finding snarls where search engines got stuck. Do they appreciate it? Noooo. They’re the aloof cats of the online world.
But, now and then, they give something back. This is one of those times. In this post, you’ll learn to use grep to crunch through web server log files and do some old-school analytics, such as finding all Googlebot hits on your site.
I’m a Linux/Mac OS X nerd. All the tools I use in this post are built into both. If you’re on Windows, you’ll need to install a tool like wingrep. If you want the full Linux command line on Windows, go for Cygwin. Not for the faint of heart, though.
What you’ll need
A computer with grep installed, one way or another.
A server log file.
A willingness to embrace the command line, if only for a few minutes.
Using grep as the ultimate search tool
Now, before we start, understand that grep is a command-line tool. This means going away from the nice, clickable interface we’ve all grown to love. It’s OK! Command line is your friend. It lets you do all sorts of magical stuff. It also won’t do unexpected things, like crash every other program on your computer, unless you really work at it. And it makes you seem more attractive to nearby geeks of the opposite (or same) sex. All important stuff.
Grep also happens to be the ultimate search tool if you’re trying to grab lines from a huge log file. That’s what we’re going to use it for.
Using grep is easy:
- On OS X, click Go > Utilities > Terminal. If you’re on Linux I’ll assume you know this part. The terminal window will appear, all deserted and sparse looking:
- Type in ‘grep’ and press the Enter key. Whups. You get something like this:
- That’s OK. It’s just grep saying “Hey, you didn’t type in enough information; here’s what you need to tell me.” For our purposes grep needs, at a minimum, two inputs: the pattern you’re searching for, and the filename.
So, if you had a log file named ‘accesslog.log’, then
grep "Googlebot" accesslog.log
would sift through the file, grabbing every line that had ‘Googlebot’ in it. Note the quotes around the pattern. That’s how you type it.
Add one extra command and grep will put the results in a nice neat file for you:
grep "Googlebot" accesslog.log > googlebothits.log
The > googlebothits.log tells your computer “write the result of my grep search to a file called googlebothits.log”.
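If you want to try this without touching a real log, you can fake one. Everything below – the file names and the log lines – is invented for practice:

```shell
# Make a three-line pretend access log (the entries are made up)
printf '%s\n' \
  '66.249.66.1 - - [23/Mar/2010:10:00:00 -0800] "GET / HTTP/1.1" 200 512 "-" "Googlebot/2.1"' \
  '10.0.0.5 - - [23/Mar/2010:10:01:00 -0800] "GET /about HTTP/1.1" 200 1024 "-" "Mozilla/5.0"' \
  '66.249.66.1 - - [23/Mar/2010:10:02:00 -0800] "GET /blog HTTP/1.1" 200 2048 "-" "Googlebot/2.1"' \
  > accesslog.log

# Pull out the Googlebot lines and write them to their own file
grep "Googlebot" accesslog.log > googlebothits.log

# -c counts matching lines instead of printing them
grep -c "Googlebot" accesslog.log   # prints 2
```

Note that the > never touches accesslog.log itself; it only writes the matches to the new file.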
If you’re using a Windows tool, it may let you just click a button that says ‘write output to file’. Showoffs.
Also important: Grep is case sensitive! See how I’m capitalizing the ‘G’ in ‘Googlebot’? That’s why.
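Here’s a two-line demo of what that capital G changes (the file and its contents are made up; -i is grep’s switch for ignoring case):

```shell
# Two lines that differ only in capitalization
printf '%s\n' 'hit from Googlebot' 'hit from googlebot' > sample.log

grep -c "Googlebot" sample.log    # prints 1 -- capital G only
grep -ci "googlebot" sample.log   # prints 2 -- -i ignores case
```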
See the potential? Grep turns a log file into an instant database. You can search for:
- All visits that started with a search on the keyword ‘nubwit’.
- All visits that generated a 404 page not found error.
- All visits by a bot.
And so on. Good stuff, in one little line of text.
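Here’s roughly what those three searches look like. The patterns assume the common ‘combined’ log format, and the file names are placeholders – adjust both to match your own server. The sample lines at the top are invented so you can run it as-is:

```shell
# A few invented log lines to search through
printf '%s\n' \
  '1.2.3.4 - - [23/Mar/2010:10:00:00 -0800] "GET /gone HTTP/1.1" 404 0 "-" "Mozilla/5.0"' \
  '5.6.7.8 - - [23/Mar/2010:10:01:00 -0800] "GET / HTTP/1.1" 200 99 "http://www.google.com/search?q=nubwit" "Mozilla/5.0"' \
  '66.249.66.1 - - [23/Mar/2010:10:02:00 -0800] "GET /blog HTTP/1.1" 200 99 "-" "Googlebot/2.1"' \
  > mylog

# 404s: the status code sits right after the closing quote of the request
grep '" 404 ' mylog > notfound.log

# Visits referred by a Google search for 'nubwit' (q= is the query parameter)
grep 'q=nubwit' mylog > nubwit_visits.log

# Bot hits: -E turns on extended patterns so | means OR
grep -E 'Googlebot|bingbot|Slurp' mylog > bot_hits.log
```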
Putting it to good use
Time to use grep.
Reminder: Grep is case-sensitive unless you add a -i modifier to it. I won’t go into that now. Just make sure you use capitals where necessary.
- In your terminal window, navigate to the same folder as your log file. If you can’t figure it out, the command you want is cd. If you’re not sure where you are, use pwd first. Once you get there, you can use ls to list the directory contents:
- OK! We’re in the right place. But we’re stuck with a bunch of different files. We could learn to grep them all together. But it’s easier to just combine them using another set of commands. Moving along…
- Type touch mylog and press Enter, where mylog is the name of the file you want to create. Then type cat *.log > mylog and press Enter. If you have large files (more than 500 megabytes total) it could take a few minutes. It’s hard work. Then voila – you’ll have one big log file in mylog:
- Now, say you want to grab every instance where Googlebot – Google’s search crawler – grabbed content from your site. This could tell you whether Google’s having trouble crawling your whole site. In that case, you’ll type grep "Googlebot" mylog > googlebot_visits.txt. Press Enter.
- Your computer will creak a little, then sift through the whole file, grabbing any lines from the log that include a mention of ‘Googlebot’, and then copying them to a file called googlebot_visits.txt:
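The whole walkthrough, condensed into one runnable sketch. The two day*.log files are invented stand-ins for your real logs:

```shell
# Two small made-up daily logs stand in for your real files
printf '%s\n' \
  '66.249.66.1 - - [22/Mar/2010:09:00:00 -0800] "GET / HTTP/1.1" 200 10 "-" "Googlebot/2.1"' \
  > day1.log
printf '%s\n' \
  '10.0.0.5 - - [23/Mar/2010:10:00:00 -0800] "GET /about HTTP/1.1" 200 20 "-" "Mozilla/5.0"' \
  '66.249.66.1 - - [23/Mar/2010:11:00:00 -0800] "GET /blog HTTP/1.1" 200 30 "-" "Googlebot/2.1"' \
  > day2.log

# Combine every .log file into one big file
cat *.log > mylog

# Pull the Googlebot hits into their own file
grep "Googlebot" mylog > googlebot_visits.txt

# Sanity check: how many hits did we capture?
grep -c "Googlebot" googlebot_visits.txt   # prints 2
```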
See? No pain. You didn’t end the universe. Nor did you reduce your computer to a smoking heap.
Open googlebot_visits.txt in the tool of your choice. If it’s a small file, I use Excel. If it’s huge, I may use more grep commands, or import the whole thing into a database tool, so I can do the analysis. Here’s how my Googlebot report looks in Excel:
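When the file’s too big for Excel, the command line can build the report itself. This sketch assumes the combined log format, where the requested path is the seventh whitespace-separated field – count the fields in your own log and adjust $7 if needed. The sample hits are invented:

```shell
# Invented Googlebot hits so the pipeline has something to chew on
printf '%s\n' \
  '66.249.66.1 - - [23/Mar/2010:10:00:00 -0800] "GET /blog HTTP/1.1" 200 10 "-" "Googlebot/2.1"' \
  '66.249.66.1 - - [23/Mar/2010:10:05:00 -0800] "GET /blog HTTP/1.1" 200 10 "-" "Googlebot/2.1"' \
  '66.249.66.1 - - [23/Mar/2010:10:10:00 -0800] "GET /about HTTP/1.1" 200 10 "-" "Googlebot/2.1"' \
  > googlebot_visits.txt

# Most-crawled URLs: grab the path, count duplicates,
# sort by the count, show the ten busiest
awk '{print $7}' googlebot_visits.txt | sort | uniq -c | sort -rn | head
```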
If this is gibberish, wait 24 hours. Tomorrow I’m going to write about how to read a log file.
You could use a commercial tool like Sawmill or Splunk instead. But those cost money, and grep is free. Plus, all log analyzers make certain assumptions. If you’re a hardcore internet marketing nerd like me, you don’t want anyone making assumptions for you. I want to see the raw log files and know I’m working with the real thing.
There’s another reason: Once you work with something like grep, you get more comfortable with using your computer to sift through mountains of data. That can open a lot of options that you may not have known about. This post barely scrapes the surface.
If this was too painful, a few alternatives are:
- Adam Audette’s very cool command line logfile analyzer.
- Sawmill, of course. It costs money but it’s a good deal.
- Splunk, which is free up to a point, then will cost you your firstborn child. It’s also arguably harder to use than grep. But it’s immensely powerful.
- Read the full grep man (help) page.
- Even better, read the Panix grep tutorial, which won’t cause as much of a headache.
- Read why log files help with attribution tracking.
- Or my general presentation on analytics-driven SEO.
- Even better, just hire me and my company. Then you won’t have to deal with any of this.
Ian Lurie is CEO and founder of Portent Inc. He is co-author of the 2nd edition of the Web Marketing All-In-One for Dummies and wrote the sections on SEO, blogging, social media and web analytics. He's recorded training for Lynda.com, writes regularly for the Portent Blog and has been published on AllThingsD, Forbes.com and TechCrunch. And, Ian speaks at conferences around the world, including SearchLove, MozCon, SIC and ad:Tech.