Python Rocks! and other rants 6.12.2005
Weblog of Kent S Johnson

2005-12-06

Simple itertools.groupby() example

Suppose you have a (sorted) list of dicts containing the names of cities and states, and you want to print them out with headings by state:

>>> cities = [
...     { 'city' : 'Harford', 'state' : 'Connecticut' },
...     { 'city' : 'Boston', 'state' : 'Massachusetts' },
...     { 'city' : 'Worcester', 'state' : 'Massachusetts' },
...     { 'city' : 'Albany', 'state' : 'New York' },
...     { 'city' : 'New York City', 'state' : 'New York' },
...     { 'city' : 'Yonkers', 'state' : 'New York' },
... ]

First let me explain operator.itemgetter(). This function is a factory for new functions. It creates functions that access items using a key. In this case I will use it to create a function to access the 'state' item of each record:

>>> from operator import itemgetter
>>> getState = itemgetter('state')
>>> getState
<operator.itemgetter object at 0x00A31D90>
>>> getState(cities[0])
'Connecticut'
>>> [ getState(record) for record in cities ]
['Connecticut', 'Massachusetts', 'Massachusetts', 'New York', 'New York', 'New York']

So the value returned by itemgetter('state') is a function that accepts a dict as an argument and returns the 'state' item of the dict. Calling getState(d) is the same as writing d['state'].
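To make that equivalence concrete, here is a small sketch comparing itemgetter() with a hand-written lambda. It uses Python 3 syntax (print as a function) rather than the Python 2 of the sessions in this post, and the record and the names get_state, same_thing and get_both are made up for illustration. It also shows that itemgetter() given more than one key returns a tuple:

```python
from operator import itemgetter

# Illustrative record, in the same shape as the entries in the cities list.
record = {'city': 'Boston', 'state': 'Massachusetts'}

get_state = itemgetter('state')
same_thing = lambda d: d['state']   # hand-written equivalent

print(get_state(record))    # Massachusetts
print(same_thing(record))   # Massachusetts

# With several keys, itemgetter returns a tuple of the items, in key order.
get_both = itemgetter('state', 'city')
print(get_both(record))     # ('Massachusetts', 'Boston')
```

The tuple form can be handy as a sort key when you want records ordered by state and then by city.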

What does this have to do with itertools.groupby()?

>>> from itertools import groupby
>>> help(groupby)
Help on class groupby in module itertools:

class groupby(__builtin__.object)
|  groupby(iterable[, keyfunc]) -> create an iterator which returns
|  (key, sub-iterator) grouped by each value of key(value).

groupby() takes an optional second argument which is a function to extract keys from the data. getState() is just the function we need.

>>> groups = groupby(cities, getState)
>>> groups
<itertools.groupby object at 0x00A88300>

Hmm. That's a bit opaque. groupby() returns an iterator. Each item in the iterator is a pair of (key, group). Let's take a look:

>>> for key, group in groups:
...   print key, group
...
Connecticut <itertools._grouper object at 0x0089D0F0>
Massachusetts <itertools._grouper object at 0x0089D0C0>
New York <itertools._grouper object at 0x0089D0F0>

Hmm. Still a bit opaque :-) The key part is clear - that's the state, extracted with getState - but group is another iterator. One way to look at its contents is to use a nested loop. Note that I have to call groupby() again because the old iterator was consumed by the last loop:

>>> for key, group in groupby(cities, getState):
...   print key
...   for record in group:
...     print record
...
Connecticut
{'city': 'Harford', 'state': 'Connecticut'}
Massachusetts
{'city': 'Boston', 'state': 'Massachusetts'}
{'city': 'Worcester', 'state': 'Massachusetts'}
New York
{'city': 'Albany', 'state': 'New York'}
{'city': 'New York City', 'state': 'New York'}
{'city': 'Yonkers', 'state': 'New York'}

Well, that makes more sense! And it's not too far from the original requirement; we just need to pretty up the output a bit. How about this:

>>> for key, group in groupby(cities, getState):
...   print 'State:', key
...   for record in group:
...     print '   ', record['city']
...
State: Connecticut
    Harford
State: Massachusetts
    Boston
    Worcester
State: New York
    Albany
    New York City
    Yonkers

Other than misspelling Hartford (sheesh, and I grew up in Connecticut!) that's not too bad!
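One caveat is worth a sketch: groupby() only groups consecutive records that share a key, which is why the list needed to be sorted in the first place. In the illustration below (Python 3 syntax; the shuffled list and the helper grouped() are made up for this example) the unsorted data yields 'New York' twice, and sorting with the same key function fixes it. Each group is turned into a list right away, since the group iterators are only good for one pass:

```python
from itertools import groupby
from operator import itemgetter

get_state = itemgetter('state')

# Hypothetical records in arbitrary order (not the list used above).
shuffled = [
    {'city': 'Albany', 'state': 'New York'},
    {'city': 'Boston', 'state': 'Massachusetts'},
    {'city': 'Yonkers', 'state': 'New York'},
]

def grouped(records):
    """Return a list of (state, [city, ...]) pairs in groupby() order."""
    return [(state, [r['city'] for r in group])
            for state, group in groupby(records, get_state)]

# Unsorted input: 'New York' shows up as two separate groups.
unsorted_result = grouped(shuffled)
print(unsorted_result)

# Sorting by the same key first gives one group per state.
sorted_result = grouped(sorted(shuffled, key=get_state))
print(sorted_result)
```

Sorting and grouping with the same key function is the usual pairing. If the data can't be sorted, collecting cities into a dict keyed by state is an alternative that doesn't depend on order.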

posted at 22:37:36
Comments about life, the universe and Python, from the imagination of Kent S Johnson.


© 2005, Kent Johnson