Writing a Python site map generator: Part 1

This is part 1 in a short series about my attempt to learn a new programming language. ’Cause, you know, it’s what nerds do.

The journey begins: My site map generator

I wanted to build a decent site map generator. To me, ‘decent’ means:

It can crawl and parse pages, grabbing links, without generating an index. Some tools purport to do this, Wget for one, but bugs get in the way.
It can crawl and parse pages for images and video, too.
You can pause, stop and restart crawls, and generate a map from a partial crawl.
All URLs are stored in a database, for faster recrawls later on.
It’s portable, running on many different platforms.
It’s easily customized.
It crawls images and content stored on content distribution networks, as well as on the target site.
It doesn’t make the computer on which it’s running, or the target web server, melt into a pool of slagged microprocessors and the tears of the IT team.

I know, I’m not too demanding. ColdFusion clearly couldn’t do the job. It’s far too slow at string manipulation, plus its HTTP crawling technology isn’t that robust. Wait, ColdFusion people! Before you burn my office down, take note: I am an unabashed ColdFusion fanboy. It is still my favorite language by far. This is just not what it’s made for. I come in peace.

I try a few languages

I started with PHP, which let me build PriusMileage.com and do a little work in WordPress. But it didn’t quite fit the way ColdFusion always has for me.

Then I moved on to Java. I was slightly less successful wrapping my head around object-oriented programming than I was learning the Rule against Perpetuities in law school.

I tried Ruby on Rails, but it felt backwards, as I didn’t yet know Ruby, and I don’t like the level of abstraction frameworks impose.

So, finally, I settled on Python. So far, I love it, for several reasons:

There are lots of libraries you can install to do nifty stuff like build a web crawler.
It has a big community around it. That’s one of the things I love about PHP, too, so it was great to see lots of folks blogging about their own learning experiences.
It’s good at string manipulation. Not as good as Perl, and I’m sure someone will tell me Ruby kicks its butt, but it’s good.
It’s easy to write quick little scripts. After about an hour of study, I was able to throw together basic scripts to help me do my work.
It’s already installed in OS X. OK, this was actually not as big an advantage as I thought at first. But see ‘upgrading’, below, for the whole story on that.
It does have some solid frameworks, like Django. Once you know the lingo, you can use these frameworks to speed up your work.

Love at first site turns to bitter resentment: The upgrade

I do all my development on OS X. OS X comes with Python 2.6 pre-installed. In my hopeless naivete, I decided to upgrade to Python 2.7. That, in itself, wouldn’t have been a big deal. Except that I cheerfully downloaded, built and installed 2.7 without checking to see where 2.6 was on my machine.

Oh, the humanity.

It all seemed fine at first. Then I tried to add a Python library. Python has these things called libraries that are basically add-on tool kits. Some ship with Python; others you add to your Python install to handle special features. For example, I use one called Beautiful Soup, which I had to install, to chop HTML code into little pieces and pull out the links. I use another called urllib2, which comes with Python, to fetch each page as I crawl from link to link.
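To make "chop HTML into little pieces and pull out the links" concrete, here is a minimal sketch. It is not the Beautiful Soup code from my crawler: it assumes a current Python 3 and swaps in the standard library's html.parser as a stand-in, so it runs without installing anything. The names LinkCollector and extract_links, and the example.com URLs, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us lowercased tag and attribute names.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links like "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = ('<p><a href="/about">About</a> '
          '<a href="http://example.com/blog/">Blog</a> '
          '<a name="top">no href here</a></p>')
print(extract_links(sample, "http://example.com/index.html"))
```

Beautiful Soup does the same job with far less ceremony and copes better with badly broken HTML, which is why I use it. The idea is the same either way: walk the tags, keep the ones with an href, and resolve each against the page it came from.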

My boneheaded install of 2.7 meant I had two versions of Python running on my trusty MacBook. So, whenever I tried to install a library, it installed on Python 2.6. But all my scripts were running under 2.7. I got one error after another.

I put my laptop down, took a deep breath and went for a ride. Once the urge to fling the MacBook into Puget Sound subsided, I took another look. Turned out I needed to change the Python path in my .bash_login file. If you’re a half-competent nerd like me, and are ready to rip out your hair in great, bloody fistfuls, do a search for ‘Python install path’ before you do. It’ll save you some heartache.
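If you suspect you have the same two-Pythons problem, a quick check is to ask each interpreter about itself. This is a rough sketch, written for a current Python 3 but using only the sys module; the function name describe_interpreter is my own invention. Run it once with each Python you think you have and compare the output.

```python
import sys


def describe_interpreter():
    """Return the facts that matter when two Python installs are fighting."""
    return {
        # The binary that is actually running this script.
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # The root directory of this particular install.
        "prefix": sys.prefix,
        # Import-path entries where installed libraries usually land.
        "library_dirs": [p for p in sys.path if p.endswith("site-packages")],
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key + ": " + str(value))
```

If the library you just installed lives under one prefix and your scripts run under another, you have found your problem.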

I learn to steal

With Python behaving itself, I set out to find the fastest way to build a Python crawler. Generally, the easiest way is to find stuff some smartypants already wrote. After a few dead ends, I found several great pages that helped me along: a really short script someone was using to screen scrape for links, a great introduction to BeautifulSoup (the HTML parser I’m using), and the BeautifulSoup documentation itself.

My first crawler – isn’t it cute?

The result: After a couple of hours, I had a working, Python-based crawler and sitemap builder. Here’s the code – use at your own risk:

#!/usr/bin/env python
# Python 2 script: needs urllib2/urlparse and BeautifulSoup 3.
import sys
import urllib2
import urlparse
import string
from BeautifulSoup import BeautifulSoup
from time import gmtime, strftime

print "start time ", strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime()), "\n\n\n"

try:
    root = sys.argv[1]
except IndexError:
    print " Usage: ./crawler.py link"
    print " Example: ./crawler.py http://blah.com/"
    sys.exit()

linkz = []    # pages found so far; also serves as the crawl queue
crawled = []
errorz = []
imgz = []     # images found so far

# Strip an explicit ":80" so http://blah.com:80/ and http://blah.com/ match.
parsedRoot = urlparse.urlparse(root)
if parsedRoot.port == 80:
    hostRoot = parsedRoot.netloc[:-3]
else:
    hostRoot = parsedRoot.netloc

linkz.append(root)
print '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'

# linkz grows while we loop over it, so every new page gets crawled in turn.
for l in linkz:
    try:
        src = urllib2.urlopen(l).read()
        bs = BeautifulSoup(src)

        # Links
        for j in bs.findAll('a', {'href': True}):
            try:
                absUrl = urlparse.urljoin(l, j['href'])
                parsedUrl = urlparse.urlparse(absUrl)
                if parsedUrl.port == 80:
                    hostUrl = parsedUrl.netloc[:-3]
                else:
                    hostUrl = parsedUrl.netloc
                absUrl = urlparse.urlunparse((parsedUrl.scheme, hostUrl, parsedUrl.path,
                                              parsedUrl.params, parsedUrl.query, parsedUrl.fragment))
                # Only http pages on the root host or its subdomains, and only once each.
                if (parsedUrl.scheme == 'http'
                        and (parsedUrl.netloc.endswith('.' + hostRoot) or parsedUrl.netloc == hostRoot)
                        and absUrl not in linkz):
                    tester = absUrl.find('#')
                    if tester == -1:
                        cleanUrl = absUrl.strip()
                        cleanUrl = cleanUrl.replace('&', '&amp;')
                        print "\t<url>\n\t\t<loc>" + cleanUrl + "</loc>\n\t</url>"
                        linkz.append(absUrl)
            except:
                pass

        # Images
        for i in bs.findAll('img', {'src': True}):
            absUrl = urlparse.urljoin(l, i['src'])
            parsedUrl = urlparse.urlparse(absUrl)
            if parsedUrl.port == 80:
                hostUrl = parsedUrl.netloc[:-3]
            else:
                hostUrl = parsedUrl.netloc
            absUrl = urlparse.urlunparse((parsedUrl.scheme, hostUrl, parsedUrl.path,
                                          parsedUrl.params, parsedUrl.query, parsedUrl.fragment))
            if (parsedUrl.scheme == 'http'
                    and (parsedUrl.netloc.endswith('.' + hostRoot) or parsedUrl.netloc == hostRoot)
                    and absUrl not in imgz):
                print "\t<url>\n\t\t<loc>" + absUrl + "</loc>\n\t</url>"
                imgz.append(absUrl)
    except:
        pass

print "</urlset>"
print "Completed at ", strftime("%a, %d %b %Y %H:%M:%S +0000", gmtime()), "\n\n\n"

It ain’t pretty, but it got me started. Feel free to steal it if you like. Note that I have a lot of debug stuff stuck in there that you can strip out if desired.

Also, I’m sure I did all sorts of stupid stuff. Please point it out – I’m learning, and I appreciate the help.
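For anyone who wants to poke at the ideas in the crawler without actually fetching pages, here is a minimal sketch of its three moving parts: cleaning up URLs, deciding whether a URL belongs to the site, and writing the sitemap XML. It assumes a current Python 3 rather than the Python 2 I used, and the function names (normalize_url, in_scope, collect, render_sitemap) are hypothetical. It also differs from my script in two ways: it strips #fragments instead of skipping those links, and it uses a set for the "have I seen this?" check and xml.sax.saxutils.escape for the ampersands.

```python
from urllib.parse import urlsplit, urlunsplit
from xml.sax.saxutils import escape


def normalize_url(url):
    """Drop an explicit :80 on http URLs and any #fragment."""
    parts = urlsplit(url.strip())
    netloc = parts.netloc
    if parts.scheme == "http" and parts.port == 80:
        netloc = netloc.rsplit(":", 1)[0]
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, ""))


def in_scope(url, root_host):
    """True for http URLs on root_host or one of its subdomains."""
    parts = urlsplit(url)
    host = parts.netloc
    return parts.scheme == "http" and (host == root_host or host.endswith("." + root_host))


def collect(root, candidates):
    """Return the in-scope URLs, normalized and de-duplicated, in order."""
    root_host = urlsplit(normalize_url(root)).netloc
    seen = set()      # set membership stays fast as the crawl grows
    ordered = []
    for candidate in [root] + list(candidates):
        url = normalize_url(candidate)
        if in_scope(url, root_host) and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


def render_sitemap(urls):
    """Build a sitemaps.org urlset document from a list of URLs."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url in urls:
        # escape() handles &, < and > so the XML stays well formed.
        lines.append("\t<url>\n\t\t<loc>" + escape(url) + "</loc>\n\t</url>")
    lines.append("</urlset>")
    return "\n".join(lines)


found = collect("http://example.com:80/", [
    "http://example.com/a#top",
    "http://example.com/a",
    "http://blog.example.com/b?x=1&y=2",
    "https://example.com/c",
    "http://other.com/",
])
print(render_sitemap(found))
```

Splitting things up this way means each piece can be tested on its own, and the duplicated URL cleanup for links and images collapses into one function.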

Next time: Hooking it up to a database, and how I nearly crashed the interwebs.