from __future__ import * 24.8.2004

2004-08-24

Forget Spreadsheet::ParseExcel!

I've been working on some automation scripts to take data out of excel and do useful things with it, and I hit a big stumbling block with Spreadsheet::ParseExcel. Unicode SUCKS in Perl, and Spreadsheet::ParseExcel does nothing at all to help you with that. Each cell gets its own encoding ('ucs2', '_native_' which I haven't seen, or it's simply undef.. which seems to be latin-1). Anyway, it's completely bogus, so I started shopping around for another implementation.

Andy Khan's JExcelApi does the trick and is light-years more correct and faster than the alternatives I have tried (other than the time it takes a JVM to start). Not only that, but by default the jar does exactly what I want it to do. It gets the unicode right, and everything worked perfectly the first time. My dealings with Excel files have been reduced to the following:

java -jar -Djxl.encoding=latin1 jxl.jar -xml EXCELFILE.xls

And the Python code to parse the workbook xml document from jxl looks roughly like this:

from xml.dom import minidom

def parseDocRows(doc):
    for row in doc.getElementsByTagName(u'row'):
        rowdata = [
            u''.join([x.nodeValue for x in col.childNodes])
            for col in row.getElementsByTagName(u'col')]
        if rowdata:
            yield rowdata

if __name__ == '__main__':
    import sys
    for row in parseDocRows(minidom.parse(file(sys.argv[1]))):
        print row

Thanks!

posted at 20:32:00    #    comment []    trackback []
August
MoTuWeThFrSaSu
       1
2 3 4 5 6 7 8
9101112131415
16171819202122
23242526272829
3031     
Jul Sep

Bob's Rants

XML-Image Letterimage

© 2004, Bob Ippolito