December 8, 2005

implementing a simple net-spider

Posted in Python at 1:02 am by Frank

I first heard about net-spiders years ago. A spider downloads the HTML source code of a URL, extracts some useful information from it, and then recursively processes the URLs linked from that page in the same way. Google uses a spider to download pages from many sites into its databases, so that it can search web content from those databases for us. Junk-mail programs use spiders to harvest e-mail addresses from the internet and then send junk mail to those addresses. So I thought it would be very interesting to code a spider of my own, and I implemented one in Python. It’s a very simple one, but it works.

After running it excitedly for some time, I found a very big problem with my spider. Of course it’s not as efficient as a real-life spider, but that is not the problem here. The problem is that it uses a lot of memory after running for some time: more than 300 MB. Since this blog’s CSS makes source code unreadable, I am posting my spider.py as an attachment. It can be downloaded and opened inside a web browser, where it’s very readable. Below I give some brief comments about how spider.py works.

I use the spider to find out how many words there are on msdn2.microsoft.com, which hosts the online documentation for .NET 2.0. There are 5 classes in total, as shown below:
spider.png

The SynchronizedObject class just has lock() and unlock() methods, which can be used to make access to shared data thread-safe.
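As a rough illustration (this is not the code from spider.py), a lock()/unlock() base class might look like the following in modern Python 3. The Counter subclass and the run_demo() helper are hypothetical names added only to show the pattern, and the choice of threading.RLock is an assumption.

```python
import threading


class SynchronizedObject:
    """Owns a lock; callers wrap access to shared state in lock()/unlock()."""

    def __init__(self):
        # RLock (an assumption) lets the same thread call lock() twice safely.
        self._lock = threading.RLock()

    def lock(self):
        self._lock.acquire()

    def unlock(self):
        self._lock.release()


class Counter(SynchronizedObject):
    """Hypothetical subclass, used only to demonstrate the locking pattern."""

    def __init__(self):
        super().__init__()
        self.value = 0

    def increment(self):
        self.lock()
        try:
            self.value += 1
        finally:
            # Always release, even if the update raises.
            self.unlock()


def run_demo(thread_count=8, increments=1000):
    """Increment one shared counter from several threads and return the total."""
    counter = Counter()

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

The try/finally around the update is the important part: without it, an exception inside the critical section would leave the lock held forever.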

The FetchThread class does the following things, in order, in a separate thread:
1. Downloads the HTML of its URL
2. Extracts the text content of the HTML and adds it to WordCounter
3. Extracts the sub-URLs from the HTML and adds them to NewLinks
All the steps can be seen in the source code of FetchThread.run().
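The three steps above can be sketched roughly as follows. This is a Python 3 approximation, not the original FetchThread: the PageParser helper, the download() function, and the on_text/on_links callbacks (standing in for WordCounter and NewLinks) are all assumptions made for the example.

```python
import threading
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Hypothetical helper: collects visible text and link targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def download(url):
    """Default fetcher (an assumption): read the page and decode it as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class FetchThread(threading.Thread):
    def __init__(self, url, on_text, on_links, fetch=download):
        super().__init__()
        self.url = url
        self.on_text = on_text    # stand-in for adding text to WordCounter
        self.on_links = on_links  # stand-in for adding URLs to NewLinks
        self.fetch = fetch

    def run(self):
        html = self.fetch(self.url)                # step 1: download
        parser = PageParser(self.url)
        parser.feed(html)
        parser.close()
        self.on_text(" ".join(parser.text_parts))  # step 2: page text
        self.on_links(parser.links)                # step 3: sub-URLs
```

Passing the fetch function in as a parameter makes it possible to try the thread on a fixed HTML string without touching the network.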

The ThreadList class contains a list of threads. The maximum number of threads is defined by __maxcount, which is set through the constructor; in my spider.py I set max_thread_count to 20. The GetCount() method returns the current number of threads in __threads. The GetThread() method checks whether any new URLs are available in NewLinks and, if there are, spawns a thread for each of them. removeThread() removes a thread from the __threads list when that thread has finished its job.
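A capped thread list along these lines could be sketched as below. It is only an approximation of the idea, assuming Python 3: the method names spawn_for() and get_count(), and the worker callback, are hypothetical and do not match the original class exactly.

```python
import threading


class ThreadList:
    """Keeps at most max_count worker threads alive at once."""

    def __init__(self, max_count=20):
        self._max_count = max_count
        self._threads = []
        self._lock = threading.Lock()

    def get_count(self):
        with self._lock:
            return len(self._threads)

    def spawn_for(self, urls, worker):
        """Start worker(url) for as many URLs as the cap allows.

        Returns the list of URLs for which a thread was started.
        """
        started = []
        for url in urls:
            with self._lock:
                if len(self._threads) >= self._max_count:
                    break
                thread = threading.Thread(target=self._run, args=(worker, url))
                # Register before starting so the cap is never exceeded.
                self._threads.append(thread)
            thread.start()
            started.append(url)
        return started

    def _run(self, worker, url):
        try:
            worker(url)
        finally:
            # Each thread removes itself from the list when its job is done.
            self._remove(threading.current_thread())

    def _remove(self, thread):
        with self._lock:
            self._threads.remove(thread)
```

URLs that do not fit under the cap are simply left for a later call, which matches the idea of asking the list for more threads periodically.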

WordCounter receives some text, splits it into words, and saves each word together with its number of occurrences.
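A minimal sketch of such a counter in Python 3 might look like this. The method names, the lowercasing, and the regular expression that decides what counts as a word are assumptions, not details taken from spider.py.

```python
import re
import threading
from collections import Counter


class WordCounter:
    """Thread-safe tally of how often each word has been seen."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def add_text(self, text):
        # Treat runs of letters, digits and apostrophes as words (an assumption).
        words = re.findall(r"[A-Za-z0-9']+", text.lower())
        with self._lock:
            self._counts.update(words)

    def count(self, word):
        with self._lock:
            return self._counts[word.lower()]

    def total_words(self):
        with self._lock:
            return sum(self._counts.values())

    def distinct_words(self):
        with self._lock:
            return len(self._counts)
```

For example, after add_text("The cat and the hat.") and add_text("The end"), count("the") is 3, total_words() is 7 and distinct_words() is 5.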

NewLinks stores all the URLs. For each URL I save its address, level, and status. Status 0 means a new URL, 1 means in process, and 2 means processed. The getCount() method returns the number of unprocessed URLs.
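The status bookkeeping described above could be sketched like this in Python 3. The method names add(), take_new() and mark_processed() are hypothetical, and get_count() here counts only URLs with status 0, which is one possible reading of "unprocessed".

```python
import threading

NEW, IN_PROCESS, PROCESSED = 0, 1, 2


class NewLinks:
    """Remembers every URL seen, with its level and processing status."""

    def __init__(self):
        self._links = {}  # url -> [level, status]
        self._lock = threading.Lock()

    def add(self, url, level):
        """Record a URL as NEW; return False if it was already known."""
        with self._lock:
            if url in self._links:
                return False
            self._links[url] = [level, NEW]
            return True

    def take_new(self):
        """Return (url, level) for one NEW link and mark it IN_PROCESS."""
        with self._lock:
            for url, entry in self._links.items():
                if entry[1] == NEW:
                    entry[1] = IN_PROCESS
                    return url, entry[0]
        return None

    def mark_processed(self, url):
        with self._lock:
            self._links[url][1] = PROCESSED

    def get_count(self):
        """Number of URLs still waiting with status NEW (an assumption)."""
        with self._lock:
            return sum(1 for entry in self._links.values() if entry[1] == NEW)
```

Keeping processed URLs in the dictionary is what prevents the spider from fetching the same page twice, but it also means the structure only ever grows.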

I use a timer to check whether there are threads still running and whether there are unprocessed URLs left. If there are, I use thread_list.getThread() to spawn some threads to spider the URLs.
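One way to sketch such a timer loop in Python 3 is with threading.Timer, as below. The Scheduler class, its add_url() method and its finished event are hypothetical; the sketch keeps its own pending list instead of using NewLinks and ThreadList so that it stays self-contained.

```python
import threading


class Scheduler:
    """Periodically starts workers for pending URLs until nothing is left."""

    def __init__(self, pending_urls, worker, max_threads=20, interval=0.05):
        self._pending = list(pending_urls)
        self._worker = worker
        self._max_threads = max_threads
        self._interval = interval
        self._running = []
        self._lock = threading.Lock()
        self.finished = threading.Event()

    def add_url(self, url):
        """Workers may call this to queue newly discovered URLs."""
        with self._lock:
            self._pending.append(url)

    def start(self):
        self._tick()

    def _tick(self):
        with self._lock:
            # Forget threads that have finished since the last tick.
            self._running = [t for t in self._running if t.is_alive()]
            # Start new workers while there is work and room under the cap.
            while self._pending and len(self._running) < self._max_threads:
                url = self._pending.pop(0)
                thread = threading.Thread(target=self._worker, args=(url,))
                thread.start()
                self._running.append(thread)
            idle = not self._running and not self._pending
        if idle:
            self.finished.set()
        else:
            # Re-arm the timer for the next check.
            threading.Timer(self._interval, self._tick).start()
```

A caller would construct it with a list of start URLs and a worker function, call start(), and then wait on finished.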

I suspect it’s the huge amount of string garbage that makes it use so much memory, but I’m not sure. There may be other reasons; I haven’t found any clues yet.
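One way to test a suspicion like this in current Python 3 is the standard tracemalloc module, which did not exist when this spider was written. The sketch below is only an illustration of the approach: measure_allocations() and build_strings() are hypothetical names, and build_strings() is a stand-in for the spider’s real work.

```python
import tracemalloc


def measure_allocations(func, limit=3):
    """Run func() and return (result, biggest allocation diffs by source line)."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    result = func()  # keep the result alive so its memory is still counted
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Entries are sorted with the largest change in allocated size first.
    return result, after.compare_to(before, "lineno")[:limit]


def build_strings():
    # Stand-in workload: many small, distinct strings held in a list.
    return ["word%d" % i for i in range(50000)]


if __name__ == "__main__":
    _, top = measure_allocations(build_strings)
    for stat in top:
        print(stat)
```

Each printed entry names a file and line together with how much its allocated memory grew, which shows whether strings or some other structure account for the growth.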

The original script file is here. Download it and rename it to “spider.py”.
