Bill Bumgarner

2005-3-16

httpflow; tcpflow parser to help debug http

I ran across this article mentioning something called httpflow and was momentarily confused as I co-authored something called httpflow at one point.

It is still useful, so a quick rehash (to update the Google cache):

A really long time ago, Ben Holt and I wrote a little Python script that parses the output of tcpflow and reconstructs HTTP conversations from the raw data. It can also be configured to display or exclude specific headers and, optionally, to squelch the bodies of the HTTP requests and responses.
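To give a feel for the header filtering, here is a minimal sketch. It is not the code from httpflow.py; the function name `filter_headers` and its `include`/`exclude` parameters are made up for illustration, and it assumes the header lines have already been separated from the request or status line.

```python
def filter_headers(header_lines, include=None, exclude=()):
    """Keep or drop header lines by name (case-insensitive).

    Hypothetical helper, not part of httpflow.py.
    include: if given, only headers with these names are kept.
    exclude: headers with these names are always dropped.
    """
    include_set = None if include is None else {name.lower() for name in include}
    exclude_set = {name.lower() for name in exclude}

    kept = []
    for line in header_lines:
        name, sep, _value = line.partition(":")
        if not sep:
            # Not a "Name: value" line; pass it through untouched.
            kept.append(line)
            continue
        key = name.strip().lower()
        if key in exclude_set:
            continue
        if include_set is not None and key not in include_set:
            continue
        kept.append(line)
    return kept
```

Calling `filter_headers(lines, include=["Cookie"])` would leave only the cookie headers, which is the kind of view that helps when chasing a dropped cookie.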

It was written because we were faced with a really nasty bug where certain cookies were being arbitrarily dropped. It turned out to be a bug in Apache's mod_track.

The script is still available as part of the SourceForge tcpflow project. I have also added it to my repository of hacques.

The original article:

Quite some time ago, Ben Holt and I wrote a useful little Python script that can parse HTTP sessions captured by tcpflow (available in Fink as well as via just about every Linux package manager around) and reconstitute them into actual requests and responses. We were motivated by the need to track down a handful of bugs related to cookie handling in some WebObjects applications we were working on (it turned out to be a bug in an Apache module), and the best place to do that is on the wire.

Of course, as anyone who has captured every byte of a raw HTTP conversation knows, reading it is only slightly less painful than wading through the reams of raw data produced by simply capturing every bloody packet that goes across the pipe. Since most browsers are threaded, there was also the challenge of putting everything back into sequence.

So, we gradually added options to the script to filter for particular headers, turn dumping of the bodies on or off, and indicate in the output which request caused a particular response to be generated.
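One plausible way to tie a response back to its request is to keep a queue of pending request lines per connection. The sketch below is an assumption about how that could work, not a description of httpflow.py itself; the class `RequestTracker` and its method names are hypothetical. It relies on HTTP/1.1 answering requests on a single connection in the order they were sent.

```python
from collections import defaultdict, deque


class RequestTracker:
    """Remember request lines per connection so responses can be labeled.

    Hypothetical helper. A connection is identified by the pair
    (client, server), where each side is an (address, port) tuple.
    """

    def __init__(self):
        self._pending = defaultdict(deque)

    def saw_request(self, client, server, request_line):
        # Queue the request line until its response shows up.
        self._pending[(client, server)].append(request_line)

    def saw_response(self, server, client):
        # A response travels server -> client, so flip the pair to find
        # the queue filled by the matching requests.
        queue = self._pending.get((client, server))
        if queue:
            return queue.popleft()
        return None  # response with no known request (capture started late)
```

The value returned by `saw_response` is what would be printed on a "From request:" line ahead of the response headers.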

In the end, this proved to be far easier to do than reconfiguring the machine to use some random logging/analysis proxy (which sucked at the time anyway) or otherwise disrupting our development environment.

It is a total hack and completely dependent on the format of the output of tcpflow. It has proven to be extremely useful in many situations over the years.
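To illustrate that dependence, here is a rough sketch of pulling the endpoints out of one line of console output. It assumes that `tcpflow -c` prefixes every line with a flow label of zero-padded addresses and ports, such as `192.168.001.070.55937-216.239.039.099.00080: `. The regular expression and the function name `split_flow_line` are illustrative only and are not taken from httpflow.py.

```python
import re

# Assumed line shape: "AAA.BBB.CCC.DDD.PPPPP-AAA.BBB.CCC.DDD.PPPPP: payload"
FLOW_LINE = re.compile(
    r"^(\d{3}\.\d{3}\.\d{3}\.\d{3})\.(\d{5})-"
    r"(\d{3}\.\d{3}\.\d{3}\.\d{3})\.(\d{5}): (.*)$"
)


def split_flow_line(line):
    """Split one console line into its flow label and payload.

    Returns (source, source_port, destination, destination_port, payload),
    or None when the line does not carry the assumed prefix.
    """
    match = FLOW_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    source, source_port, destination, destination_port, payload = match.groups()
    return source, int(source_port), destination, int(destination_port), payload
```

If tcpflow ever changes that prefix, a parser like this silently stops matching, which is exactly why the approach is fragile.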

Recently, I wanted to capture the sequence of URLs passed from a web services client to a server.

So, this version of httpflow.py adds the '--terse' option, which causes httpflow to emit only the GET/POST/HEAD/*method* line from the request/response. Very useful for seeing the sequence of URLs fired against a [known] server.
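The idea behind a terse mode can be sketched in a few lines. This is a guess at the approach rather than the actual implementation: the pattern `REQUEST_LINE` and the generator `terse` are hypothetical, the input is assumed to be payload lines with any tcpflow prefix already removed, and only request lines are matched (status lines such as `HTTP/1.1 200 OK` are skipped).

```python
import re

# A request line looks like "METHOD target HTTP/x.y"; any upper-case
# method name is accepted, not just GET, POST, and HEAD.
REQUEST_LINE = re.compile(r"^[A-Z]+ \S+ HTTP/\d\.\d$")


def terse(lines):
    """Yield only the HTTP request lines found in an iterable of lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if REQUEST_LINE.match(line):
            yield line
```

Feeding it the headers of a captured request yields a single line such as `GET / HTTP/1.1`, so a whole session collapses into the list of URLs that were requested.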

Example captures of searching for "fried catfish" via Google (including loading www.google.com) follow.

Terse:

 % sudo tcpflow -c -p -i en0 | ./httpflow.py --terse
HTTPFlow Running (--help for usage/help)...
tcpflow[3160]: listening on en0
GET / HTTP/1.1
GET /images/logo.gif HTTP/1.1
GET /search?hl=en&ie=ISO-8859-1&q=fried+catfish&btnG=Google+Search HTTP/1.1
^Ctcpflow[3160]: terminating

Default:

 % sudo tcpflow -c -p -i en0 | ./httpflow.py
HTTPFlow Running (--help for usage/help)...
tcpflow[3162]: listening on en0
--- begin header ---
Source: 192.168.001.070 : 55937 (-unknown-)
Destination: 216.239.039.099 : 80 (-unknown-)

GET / HTTP/1.1
Host: www.google.com
Connection: keep-alive
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/73 (KHTML, like Gecko) Safari/73
If-Modified-Since: Wed, 30 Apr 2003 04:44:52 GMT
Accept: */*
Accept-Language: en-us, ja;q=0.33, en;q=0.67
Cookie: PREF=***secret stuff deleted***
---- end header ----

--- begin header ---
From request: GET / HTTP/1.1
Source: 216.239.039.099 : 80 (-unknown-)
Destination: 192.168.001.070 : 55937 (-unknown-)

HTTP/1.1 304 Not Modified
Date: Wed, 30 Apr 2003 04:45:12 GMT
Content-Type: text/html
Server: GWS/2.0
Content-Length: 0
---- end header ----

--- begin header ---
Source: 192.168.001.070 : 55937 (-unknown-)
Destination: 216.239.039.099 : 80 (-unknown-)

GET /search?hl=en&ie=ISO-8859-1&q=fried+catfish&btnG=Google+Search HTTP/1.1
Host: www.google.com
Connection: keep-alive
Referer: http://www.google.com/
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/73 (KHTML, like Gecko) Safari/73
If-Modified-Since: Wed, 30 Apr 2003 04:44:59 GMT
Accept: */*
Accept-Language: en-us, ja;q=0.33, en;q=0.67
Cookie: PREF=
---- end header ----

--- begin header ---
From request: GET /search?hl=en&ie=ISO-8859-1&q=fried+catfish&btnG=Google+Search HTTP/1.1
Source: 216.239.039.099 : 80 (-unknown-)
Destination: 192.168.001.070 : 55937 (-unknown-)

HTTP/1.1 200 OK
Date: Wed, 30 Apr 2003 04:45:20 GMT
Cache-control: private
Content-Type:
Transfer-Encoding: chunked
Server: GWS/2.0
---- end header ----

Not pretty, but works well enough....
