Writing a Python site map generator: Part 1
This is part 1 in a short series about my attempt to learn a new programming language. 'Cause, you know, it's what nerds do.
The journey begins: My site map generator
I wanted to build a decent site map generator. To me, 'decent' means:
- It can crawl and parse pages, grabbing links, without generating an index. Some tools purport to do this - wget, for one - but bugs get in the way.
- It can crawl and parse pages for images and video, too.
- You can pause, stop and restart crawls, and generate a map from a partial crawl.
- All URLs are stored in a database, for faster recrawls later on.
- It's portable, running on many different platforms.
- It's easily customized.
- It crawls images and content stored on content distribution networks, as well as on the target site.
- It doesn't make the computer on which it's running, or the target web server, melt into a pool of slagged microprocessors and the tears of the IT team.
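Two of those wishes, pausing and restarting a crawl and keeping every URL in a database, fit together nicely. Here's a minimal sketch of how that might look, assuming SQLite via Python's built-in sqlite3 module. The table layout and the function names (open_frontier, add_url, next_url, mark_crawled, crawled_urls) are made up for illustration, not code from my crawler:

```python
import sqlite3


def open_frontier(path=":memory:"):
    """Open (or create) the URL table. Pass a file path to persist between runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " url TEXT PRIMARY KEY,"
        " crawled INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_url(conn, url):
    # INSERT OR IGNORE keeps the existing row, so re-adding a known URL is a no-op.
    conn.execute("INSERT OR IGNORE INTO urls (url) VALUES (?)", (url,))
    conn.commit()


def next_url(conn):
    """Return one uncrawled URL (oldest first), or None when nothing is pending."""
    row = conn.execute(
        "SELECT url FROM urls WHERE crawled = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_crawled(conn, url):
    conn.execute("UPDATE urls SET crawled = 1 WHERE url = ?", (url,))
    conn.commit()


def crawled_urls(conn):
    """URLs finished so far: enough to build a map from a partial crawl."""
    rows = conn.execute("SELECT url FROM urls WHERE crawled = 1 ORDER BY rowid")
    return [row[0] for row in rows]
```

Because the pending URLs live in the table rather than in memory, stopping the script and starting it again against the same file should pick up where the last run left off.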
I try a few languages
I started with PHP, which let me build PriusMileage.com and do a little work in WordPress. But it didn't quite fit the way ColdFusion always has for me.
Then I moved on to Java. I was slightly less successful wrapping my head around object-oriented programming than I was learning the Rule against Perpetuities in law school.
I tried Ruby on Rails, but it felt backwards, as I didn't yet know Ruby, and I don't like the level of abstraction frameworks impose.
So, finally, I settled on Python. So far, I love it, for several reasons:
- There are lots of libraries you can install to do nifty stuff like build a web crawler.
- It has a big community around it. That's one of the things I love about PHP, too, so it was great to see lots of folks blogging about their own learning experiences.
- It's good at string manipulation. Not as good as Perl, and I'm sure someone will tell me Ruby kicks its butt, but it's good.
- It's easy to write quick little scripts. After about an hour of study, I was able to throw together basic scripts to help me do my work.
- It's already installed in OS X. OK, this was actually not as big an advantage as I thought at first. But see 'The upgrade', below, for the whole story on that.
- It does have some solid frameworks, like Django. Once you know the lingo, you can use these frameworks to speed up your work.
Love at first site turns to bitter resentment: The upgrade
I do all my development on OS X. OS X comes with Python 2.6 pre-installed. In my hopeless naivete, I decided to upgrade to Python 2.7. That, in itself, wouldn't have been a big deal. Except that I cheerfully downloaded, built and installed 2.7 without checking to see where 2.6 was on my machine.
Oh, the humanity.
It all seemed fine at first. Then I tried to add a Python library. Python has these things called libraries that are basically upgraded tool kits. You add them to your Python install to handle special features. For example, I use one called Beautiful Soup to chop HTML code into little pieces and pull out the links. I use another called urllib to fetch pages as I crawl from link to link.
My boneheaded install of 2.7 meant I had two versions of Python on my trusty MacBook. So, whenever I tried to install a library, it installed on Python 2.6. But all my scripts were running under 2.7. I got one error after another.
I put my laptop down, took a deep breath and went for a ride. Once the urge to fling the MacBook into Puget Sound subsided, I took another look. Turned out I needed to change the Python path in my .bash_login file. If you're a half-competent nerd like me, and are ready to rip out your hair in great, bloody fistfuls, do a search for 'Python install path' before you do. It'll save you some heartache.
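If you suspect you're in the same two-Pythons mess, it may help to ask the interpreter itself what's going on. This is a small sketch using the standard sys module. The describe_interpreter name is just something I made up for the example:

```python
import sys


def describe_interpreter():
    """Report which Python is running and where it looks for libraries."""
    return {
        "executable": sys.executable,               # path of the running interpreter
        "version": "%d.%d" % sys.version_info[:2],  # major.minor, e.g. 2.7
        "search_path": list(sys.path),              # directories searched on import
    }


if __name__ == "__main__":
    info = describe_interpreter()
    print("Running: " + str(info["executable"]))
    print("Version: " + info["version"])
    for directory in info["search_path"]:
        print("  looks in: " + directory)
```

Run it with the same command you use for your scripts. If the version or the directories aren't the ones your installer put the library into, that mismatch is the likely culprit.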
I learn to steal
With Python behaving itself, I set out to find the fastest way to build a Python crawler. Generally, the easiest way is to find stuff some smartypants already wrote. After a few dead ends, I found several great pages that helped me along: A really short script someone was using to screen scrape for links, a great introduction to BeautifulSoup (the HTML parser I'm using), and the BeautifulSoup documentation itself.
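To give a feel for what scraping a page for links involves, here's a rough sketch. Big caveat: it swaps in Python 3's built-in html.parser for BeautifulSoup so it runs with nothing extra installed, so it is not the script I found or the code I ended up with. LinkCollector, extract_links and build_sitemap are names invented for this example:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Make the link absolute and drop any #fragment.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return same-host http(s) links, de-duplicated, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for link in collector.links:
        parts = urlparse(link)
        same_site = parts.scheme in ("http", "https") and parts.netloc == host
        if same_site and link not in seen:
            seen.add(link)
            result.append(link)
    return result


def build_sitemap(urls):
    """Wrap a list of URLs in bare-bones sitemap XML (loc elements only)."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

A real crawler would fetch each page, feed the HTML to extract_links, and queue whatever comes back. The same-host filter is one simple way to keep the crawl from wandering off across the whole web.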
My first crawler - isn't it cute?
The result: After a couple of hours, I had a working, Python-based crawler and sitemap builder. Here's the code - use at your own risk:
It ain't pretty, but it got me started. Feel free to steal it if you like. Note that I have a lot of debug stuff stuck in there that you can strip out if desired.
Also, I'm sure I did all sorts of stupid stuff. Please point it out - I'm learning, and I appreciate the help.
Next time: Hooking it up to a database, and how I nearly crashed the interwebs.