Python web crawler code – use at your own risk


Ian Lurie Nov 5 2010

Update 12.13.10

Big changes to the crawler code:

  1. Switched from urllib, which left sockets open and caused memory leaks, crashes and other computer higgledy-piggledy, to httplib.
  2. Now fetching the mime-type and using it to separate images from text pages.
  3. Better URL handling.
  4. Cleaner output: removes the domain name from the output for smaller, easier-to-handle files.
  5. Added a switch to crawl only the current site, or to check other sites, too.
  6. Checks for URLs previously crawled and marks them as such, but still notes them. That gives you a complete list of all links on every page without slowing the crawl.
  7. Better connection management for faster crawls.
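For the curious, items 1 and 2 boil down to something like the sketch below. This is not the actual CMCrawler code, just an illustration with made-up function names; note also that in Python 3, httplib was renamed http.client:

```python
from http.client import HTTPConnection

def classify(content_type):
    """Split responses into images vs. text pages by mime-type."""
    if content_type is None:
        return "unknown"
    mime = content_type.split(";")[0].strip().lower()
    if mime.startswith("image/"):
        return "image"
    if mime in ("text/html", "application/xhtml+xml"):
        return "page"
    return "other"

def fetch_type(host, path="/"):
    """HEAD request: grab the Content-Type header without downloading
    the body. Closing the connection explicitly avoids the leaked
    sockets that plagued the old urllib version."""
    conn = HTTPConnection(host, timeout=10)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().getheader("Content-Type")
    finally:
        conn.close()
```

The point of the mime-type check is that you can skip parsing anything that isn't an HTML page, which saves both time and memory.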

Things that still bug me:

  1. The script will time out if the target site times out. Need a way to have it stop gracefully.
  2. Still not multithreaded.
  3. Not storing in a database. That’s to keep the script simple and portable, but at some point it’ll have to change.
  4. Needs a pretty interface. Working on that next.
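For what it's worth, item 1 (graceful timeouts) can be handled by wrapping each request in a try/except so a stalled server just skips that URL instead of killing the whole crawl. A rough sketch, again using Python 3's http.client and invented function names rather than the script's actual code:

```python
import socket
from http.client import HTTPConnection

def fetch(host, path="/", timeout=10):
    """Fetch a page, but skip it gracefully if the server stalls or
    refuses the connection, instead of letting the exception end the
    whole crawl."""
    conn = HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    except (socket.timeout, OSError):
        return None, b""  # note the failure and move on
    finally:
        conn.close()
```

A status of None in the output would then flag pages the crawler had to give up on.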

Download the code (and contribute to the project by improving the code!) here:
[ CMCrawler - an open source Python web crawler ]

Really, really basic docs

This is a command-line Python script. It doesn’t get much uglier, just so ya know. But it’s fast, lightweight and the output is easy to mash for generating XML sitemaps, checking for 404 errors on your site, or just getting a sense of a site’s layout.
As a speed reference: it averages 90 seconds to crawl 700 or so pages. It is single-threaded (at the moment).

Stuff you’ll need to use it

You must have Python installed. If you don’t, or don’t know how to install it, frankly I don’t suggest you mess with this just now. It’s not a mature-enough program yet.
You also need one library that doesn’t come standard with Python: The fantastic BeautifulSoup library. It’s worth the effort, and without it, writing this crawler would have reduced me to a damp, gibbering lump of flesh under my desk.
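For context, the job BeautifulSoup does for the crawler is pulling links out of each page. A stripped-down standard-library version looks like this; BeautifulSoup earns its keep by also surviving the broken markup that chokes a naive parser like this one:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every href from the <a> tags on a page -- a miniature
    version of what the crawler uses BeautifulSoup for."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

p = LinkExtractor()
p.feed('<html><body><a href="/about">About</a> <a href="/blog">Blog</a></body></html>')
```

After feeding the page in, `p.links` holds the discovered URLs, ready to be queued for crawling.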
Finally, you need to know how to use the command line on your computer, just a little bit.

Running the crawler

  1. Download the code.
  2. Extract the compressed archive to your hard drive. Put it wherever you want – just make sure you remember the location.
  3. Start up your command-line client. On my Mac I use the trusty BASH shell.
  4. Navigate to the folder where you put the script.
  5. Type python cmcrawler.py [domain to crawl] [stay within domain]. Domain to crawl is your site’s domain, without the leading ‘http://’. Stay within domain is a ‘0’ for ‘stay within this domain’ or a ‘1’ for ‘crawl everything’. For God’s sake, stick with 0 for now, OK?
  6. The script will spit out the results of the crawl, as they happen. The results are tab-delimited, so you can easily cut-and-paste them into a text editor or Excel.
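As one example of mashing that output: here's a sketch of turning the tab-delimited results into a bare-bones XML sitemap. The column layout is an assumption on my part; check it against your own crawl output first:

```python
def to_sitemap(tsv_lines, base="http://example.com"):
    """Turn the crawler's tab-delimited output into a minimal XML
    sitemap. Assumes the first column of each line is the URL path;
    adjust the index to match what cmcrawler.py actually emits."""
    seen = set()
    entries = []
    for line in tsv_lines:
        path = line.rstrip("\n").split("\t")[0]
        if path and path not in seen:
            seen.add(path)
            entries.append("  <url><loc>%s%s</loc></url>" % (base, path))
    return "<urlset>\n" + "\n".join(entries) + "\n</urlset>"
```

Because the crawler notes already-seen URLs rather than dropping them, deduplicating here keeps each page in the sitemap exactly once.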

Original post

A few folks at #seochat last night asked for the code from a Python-driven web crawler I’m working on, so here it is, in a Github repository.
I’m just warning you: there is some ugly stuff in that code. This was the very first Python code I ever wrote. It does all the horrible things developers do when learning a new platform.
I’ll update it as often as I can. The code is totally, 100% free for everyone to use. There are a few conditions though:

  1. You can’t use this for a commercial project without talking to me first.
  2. Please improve it! Check out the issues page on Github and see what you can do. Send me feedback.
  3. You are not permitted to laugh at my lack of coding-fu.

Enjoy.
PS: This crawler is really me hacking together great libraries other people wrote. I get no credit for anything that works.
[ CMCrawler - an open source Python web crawler ]


4 Comments

  1. Hey Ian,
    Where do you plan on taking this? Is it just for your learning pleasure, or do you intend on eventually turning it into a production tool?
    Good luck!

  2. avas

    Hello Ian,
    I was looking for a web crawler, and luckily I found yours. It is just what I was looking for. But I am having a problem:
    the crawler SKIPS subdomains of the website it is crawling, treating them as external URLs.
    Can you please help and let me know how to make this crawler not skip subdomains?

    For example: if it is crawling http://www.diabetes.org, this crawler will SKIP community.diabetes.org as an external URL.

    Thank you very much for this project.
    Good luck.

    • Hi Avas,

      I actually built the crawler this way. Search engines generally treat subdomains as separate sites, so the crawler does, too.

      Ian
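If, unlike search engines, you would rather treat subdomains as part of the same site, the internal/external check can be loosened along these lines (a sketch, not the actual CMCrawler code):

```python
from urllib.parse import urlparse

def same_site(url, base_domain):
    """Treat subdomains as internal: community.diabetes.org counts as
    part of diabetes.org. (CMCrawler itself deliberately does the
    opposite, mirroring how search engines treat subdomains.)"""
    host = urlparse(url).netloc.lower().split(":")[0]
    base = base_domain.lower()
    return host == base or host.endswith("." + base)
```

The trailing-dot check matters: without it, notdiabetes.org would wrongly match diabetes.org.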
