Been quite a while since the last post. Well, I've been busy. Among other things, I've been working on a Spider for my new company. After a rather long process, I've landed on a threaded controller/producer design. Threads with Python can be a hassle, but it's definitely easier than async operations.
The Spider implements a threaded XMLRPC server in addition to the controller/producer threads. Running with 15 threads I've managed to make 13,970 HTTP requests per hour. The operation is quite simple. The XMLRPC server receives a URL, which is added to a queue. The queue is read by a controller, which pops the item and inserts it into the producer queue. The controller then updates the caller, via XMLRPC, with status 0 if all threads are busy or 1 if there are threads available. The producer queue is read by the thread pool. Each thread downloads the URL and feeds it through an HTML parser. The parser dissolves frames and collects .js and .css links. The resulting data is then added to a MySQL database.
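The flow above boils down to two queues and a pool of workers. Here's a minimal sketch of that hand-off; the XMLRPC layer, downloading, parsing, and MySQL insert are stubbed out (`process` is a hypothetical stand-in), so this only shows the controller-to-pool plumbing:

```python
import queue
import threading

NUM_WORKERS = 4  # the real Spider runs 15


def process(url):
    # Stand-in for the real work: download the URL, parse the HTML for
    # frames and .js/.css links, insert the result into MySQL.
    return url.lower()


def run_spider(urls):
    """Feed URLs through a controller queue into a worker pool."""
    incoming = queue.Queue()   # filled by the XMLRPC server in the real Spider
    work = queue.Queue()       # the producer queue, read by the thread pool
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            url = work.get()
            if url is None:            # sentinel: shut this worker down
                break
            out = process(url)
            with lock:
                results.append(out)

    def controller():
        while True:
            url = incoming.get()
            if url is None:            # sentinel: shut the controller down
                break
            work.put(url)              # hand the item to the pool

    workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in workers:
        t.start()
    ctrl = threading.Thread(target=controller)
    ctrl.start()

    for u in urls:
        incoming.put(u)
    incoming.put(None)                 # stop the controller
    ctrl.join()
    for _ in workers:
        work.put(None)                 # stop each worker
    for t in workers:
        t.join()
    return results
```

In the real Spider the controller also answers the XMLRPC caller with 0 or 1 depending on whether the pool is saturated; with a bounded `queue.Queue(maxsize=...)` that check is just `work.full()`.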
As mentioned, 13,970 unique URLs were processed on a single-CPU AMD XP1800+ running Fedora Core 1 with Python 2.3.3. I think that's quite impressive. Of course, what delights me the most is that my Spider replaced a Java-based Spider. The Python Spider, being amazingly faster and using a heck of a lot less CPU, not to mention memory, simply squished the competition.
Now, my only concern is scalability. Hardware is cheap, and dual processors are cheaper still. But with the GIL, actually exploiting both CPUs may prove difficult. I guess I could spawn a child Spider when all 15 threads are occupied. It may, or may not, end up on the second processor. In theory it should, but I've got no way of knowing until I can have a test run on a dual machine.
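The child-Spider idea could look something like the sketch below. It's hedged: the names are hypothetical, and it uses the `multiprocessing` module (which didn't exist at the time of writing) as a convenient stand-in for forking a second Spider process. The point is just that a separate process gets its own interpreter and its own GIL, so the OS is free to schedule it on the second CPU:

```python
import multiprocessing

MAX_THREADS = 15  # capacity of one Spider's thread pool


def child_spider(urls):
    """Stand-in for a full child Spider crawling its own batch of URLs."""
    return [u.lower() for u in urls]


def dispatch(urls):
    """Handle up to MAX_THREADS URLs here; hand the overflow to a child process."""
    local, overflow = urls[:MAX_THREADS], urls[MAX_THREADS:]
    results = [u.lower() for u in local]  # the parent Spider's own work
    if overflow:
        # A separate process = a separate GIL; on a dual machine the kernel
        # can run it on the other CPU.
        with multiprocessing.Pool(processes=1) as pool:
            results.extend(pool.apply(child_spider, (overflow,)))
    return results
```

Whether the scheduler actually puts the child on the second processor is up to the OS, which is exactly the uncertainty mentioned above.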
In other news - Arsenal meet ManUnited in the FA Cup semi-final. Arsenal top the Premier League, 9 points ahead of Chelsea (thank you ManCity!!). Arsenal are also well on their way to meeting Real Madrid in the Champions League semi. Now there's an interesting match...