Speno's Pythonic Avocado 3.6.2004

2004-06-03

Baby Steps

On a recent Saturday morning, I was called upon to extract data from my family's web site so that a new site could be constructed using fancy web standards of which I know nothing. The old HTML files were in good order, so it was easy to get the data we needed from them.

The pages make up an online photo album with captions. As the captions and images were in the same order, I was able to write this data-scraping class fairly quickly to get all the unique bits out of the pages:

# Python 2 standard library modules
import sgmllib
from cgi import escape


class MSParser(sgmllib.SGMLParser):
    def __init__(self):
        self.dates = {}          # date -> {'captions': [...], 'images': [...]}
        self.datelist = []       # dates in the order they appear
        self.curr_date = ''
        self.in_caption = False
        self.in_image = False
        self.in_datebox = False
        self.data = []           # pieces of the caption being collected
        sgmllib.SGMLParser.__init__(self)

    def unknown_starttag(self, tag, attrs):
        """We want anything in captions, including other tags."""
        if self.in_caption:
            s = '<%s' % tag
            for a, v in attrs:
                s = '%s %s=%r' % (s, a, v)
            self.data.append(s)
            self.data.append('>')

    def start_p(self, attrs):
        """Ignore any p tags in captions"""
        pass

    def end_p(self):
        """Ignore any p tags in captions"""
        pass

    def start_td(self, attrs):
        for attr, value in attrs:
            if value == 'DateBox':
                self.in_datebox = True
            elif value == 'CaptionBox':
                self.in_caption = True

    def start_img(self, attrs):
        # Default to None so a missing attribute does not raise NameError.
        img = width = height = None
        for attr, value in attrs:
            if attr == 'src':
                img = value
            elif attr == 'width':
                width = value
            elif attr == 'height':
                height = value
        img_data = (img, width, height)
        self.dates[self.curr_date]['images'].append(img_data)

    def end_td(self):
        if self.in_caption:
            caption = ''.join(self.data)
            self.dates[self.curr_date]['captions'].append(caption)
            self.data = []
            self.in_caption = False

    def unknown_endtag(self, tag):
        if self.in_caption:
            self.data.append('</%s>' % tag)

    def handle_data(self, text):
        if self.in_caption:
            self.data.append(escape(text, quote=True))
        if self.in_datebox:
            date = text.strip()
            if date:
                # The first non-blank text in a DateBox cell starts a new date.
                self.curr_date = date
                self.datelist.append(date)
                self.dates[date] = {}
                self.dates[date]['captions'] = []
                self.dates[date]['images'] = []
                self.in_datebox = False

    def error(self, message):
        pass
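For anyone trying this on Python 3, where sgmllib no longer exists, here is a rough sketch of the same idea built on html.parser.HTMLParser instead. It is not the class I used, just an approximation: the name AlbumParser and the sample markup are made up for illustration, and it assumes the date and caption cells are td elements carrying the values DateBox and CaptionBox, as above.

```python
from html import escape
from html.parser import HTMLParser


class AlbumParser(HTMLParser):
    """Hypothetical Python 3 take on the scraper, using html.parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.dates = {}        # date -> {'captions': [...], 'images': [...]}
        self.datelist = []     # dates in page order
        self.curr_date = ''
        self.in_caption = False
        self.in_datebox = False
        self.data = []         # pieces of the caption being collected

    def handle_starttag(self, tag, attrs):
        values = [value for _, value in attrs]
        if tag == 'td':
            if 'DateBox' in values:
                self.in_datebox = True
            elif 'CaptionBox' in values:
                self.in_caption = True
        elif tag == 'img':
            if self.curr_date:
                found = dict(attrs)
                image = (found.get('src'), found.get('width'), found.get('height'))
                self.dates[self.curr_date]['images'].append(image)
        elif self.in_caption and tag != 'p':
            # Keep inline markup inside captions, but drop <p> tags.
            pieces = ['<' + tag]
            pieces.extend(' %s="%s"' % (k, escape(v or '')) for k, v in attrs)
            self.data.append(''.join(pieces) + '>')

    def handle_endtag(self, tag):
        if tag == 'td':
            if self.in_caption:
                if self.curr_date:
                    caption = ''.join(self.data)
                    self.dates[self.curr_date]['captions'].append(caption)
                self.data = []
                self.in_caption = False
        elif self.in_caption and tag not in ('p', 'img'):
            self.data.append('</%s>' % tag)

    def handle_data(self, text):
        if self.in_caption:
            self.data.append(escape(text))
        if self.in_datebox:
            date = text.strip()
            if date:
                self.curr_date = date
                self.datelist.append(date)
                self.dates[date] = {'captions': [], 'images': []}
                self.in_datebox = False


# Made-up sample page in the assumed layout.
sample = """
<table>
<tr><td class="DateBox">June 3, 2004</td></tr>
<tr><td><img src="steps.jpg" width="320" height="240"></td></tr>
<tr><td class="CaptionBox"><p>First <b>steps</b> &amp; smiles</p></td></tr>
</table>
"""

parser = AlbumParser()
parser.feed(sample)
parser.close()
print(parser.datelist)
print(parser.dates)
```

As with the original, the parser is just a little state machine: two flags say which kind of cell we are in, and the text handler files whatever it sees under the current date.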

After that, much more work went into automatically generating the new set of pages based on the dates, captions and image links I had obtained using that MSParser class. I think the resulting pages are fantastic looking. I may be a tad biased, however.
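The generator itself isn't shown here, but the general shape is easy to sketch. The following is only an illustration written for Python 3, not the real thing: render_page, the markup it emits, and the sample data are all hypothetical. It assumes the dates and datelist structures built by MSParser, with images and captions lining up one-to-one.

```python
from html import escape


def render_page(date, entry):
    """Return a simple HTML page for one date's images and captions."""
    rows = []
    # Assumption: captions and images are in the same order, one per photo.
    for (src, width, height), caption in zip(entry['images'], entry['captions']):
        rows.append(
            '<div class="photo">\n'
            '  <img src="%s" width="%s" height="%s" alt="">\n'
            '  <p class="caption">%s</p>\n'
            '</div>' % (escape(src), escape(width), escape(height), caption)
        )
    # Captions are inserted as-is because the parser already escaped them.
    return (
        '<!DOCTYPE html>\n'
        '<html>\n<head><title>%s</title></head>\n'
        '<body>\n<h1>%s</h1>\n%s\n</body>\n</html>\n'
        % (escape(date), escape(date), '\n'.join(rows))
    )


# Made-up data in the same shape the parser produces.
dates = {
    'June 3, 2004': {
        'images': [('steps.jpg', '320', '240')],
        'captions': ['First <b>steps</b>'],
    },
}
datelist = ['June 3, 2004']

# One page per date, in the order the dates appeared.
pages = {date: render_page(date, dates[date]) for date in datelist}
print(pages['June 3, 2004'])
```

Walking datelist rather than the dictionary keeps the pages in the same order as the old album.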

Now we have to think about a new way of keeping that site up to date. It doesn't make any sense to write new pages by hand when you can add new content to a database and have a program generate the resulting pages automatically. Baby steps. Literally. :-)

This post references topics: python
posted at 23:48:16
 

The snake of good Omen

I've had good luck with all the (two) contract Python programmers that we've hired in the past year. In both cases, they were the only people with any Python experience available immediately, and in both cases we hired them after one short interview. This says good things about both of them, but I also think it may say something good about Python. <wink>

And how is it for me going from a solo programmer to a team leader? Wonderful and difficult. It's wonderful to share ideas and solve problems with another programmer. Our products are better as a result. Also, as a team we're several times more productive than I am when working by myself.

It's difficult because I can sometimes fall behind dealing with all of the other issues I'm responsible for, while the other programmer only has to worry about one project. I'm trying to work well, and not fast, but I often have days where I feel lucky if I make any progress at all on Project Albatross.

posted at 19:23:44

One python programmer's search for understanding and avocados. This isn't personal, only pythonic.


© 2004-2005, John P. Speno