evangelizing pcap2har

Steve Souders

unread,

Oct 26, 2010, 3:53:21 AM10/26/10

to pcap2har

pcap2har is an awesome piece of work that helps bridge the gap between
tcpdump and WireShark and the HAR-compatible tools that many
developers are accustomed to such as Firebug, Fiddler, and HttpWatch.

I'd like to start evangelizing pcap2har. When do you think it's the
right time to drive users to this project?

-Steve

Jake Holland

unread,

Oct 26, 2010, 2:08:57 PM10/26/10

to pcap...@googlegroups.com

I hadn't gotten around to reviewing it since Andrew finished his
internship, but since you were interested, I took a look at it this
morning and tried it in a har viewer
(http://www.softwareishard.com/har/viewer/) with one of the random pcaps
I had.

There were a few loose ends that made it come out looking crappy (they
were: page_refs and pages, sorting by time, and some verbose debug
messages), so I patched those. It might still be raw in some places, but
at least for some pcaps, I think it's more or less ok on the har viewer.
(I only looked at the master branch.)

I pushed those into http://github.com/GrumpyOldTroll/pcap2har and sent a
pull request to Andrew. I'm sure it could still benefit from some
cleanup and a test suite, but as far as I saw in my brief look, that
build or similar has a chance at not turning off new users.

Ryan Witt

unread,

Oct 26, 2010, 2:11:29 PM10/26/10

to pcap...@googlegroups.com

Thanks Jake! Andrew, can you pull?

At some point we need to write a little tutorial and write a setup.py file and get this into the python package index.

--Ryan

Steve Souders

unread,

Oct 26, 2010, 2:13:56 PM10/26/10

to pcap...@googlegroups.com, Ryan Witt

I found that Honza's HAR Viewer is less strict about syntax than Fiddler
and HttpWatch. Can you confirm the changes work in these other viewers?
Feel free to attach a HAR file if you want me to test in HttpWatch.

-Steve

Libo Song

unread,

Oct 26, 2010, 2:21:42 PM10/26/10

to pcap...@googlegroups.com, Ryan Witt

I also generated a patch on the master branch, with some cleanup code.
I have not tested on either Fiddler or HttpWatch. But I looked at the har
file, and verified than "page_ref" is not "pageref", and the status code is
200 instead of "200".

-libo

0001-Fix-har-format-issues.patch

Jake Holland

unread,

Oct 26, 2010, 4:15:47 PM10/26/10

to pcap...@googlegroups.com, Ryan Witt

Hi Libo,

Thanks for fixing the format problems, I hadn't noticed those, and the
viewer I was using didn't complain. I agree those changes for format
are correct according to the har spec (c.f.
http://groups.google.com/group/http-archive-specification/web/har-1-2-sp
ec?hl=en).

I also agree with the changes to include the connecting time in the
first request, and that the logging changes are beneficial, so thanks
for those as well.

However, while reviewing your patch I had a few comments and questions
about some changes I didn't fully agree with. Here are the questions
(they refer to specific patches which I've included below all the
questions):

1. I wasn't sure I understood patch0. Can you send a pcap where this is
meaningful?

I think it's saying if there's a repeated header, then instead of making
a list of values it concatenates the values--is that correct?

If so, I'm not sure this represents correct behavior. I'm not surprised
if it causes problems the original way, but I think I just read in the
http spec that a comma should be inserted if you want to merge repeated
http headers without changing the semantics:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2
(The last paragraph in the section, specifically.)
Based on this, I've made it "d[k] += ','+v" instead of "d[k] += v" and
took out the commented part.

2. In patch1, I think you use req and then afterwards check that it's
usable. If the req was bad, I think you get an exception in the first
part. Based on this reasoning, I've put the "if not req: continue"
clause first.

3. In patch2, I think the name change was not complete, so I changed "as
e" to "as error" and the .append to use error instead of e.

4. In patch3, on "self.entries.sort(lambda x,y: x.ts_start -
y.ts_start)" I got a "TypeError: comparison function must return int,
not long", so I changed it to "int(x.ts_start - y.ts_start)".

With the changes I've described, I've applied your patch (fixing
conflicts where I had previously changed the logging that you removed)
and tried it on the same pcap and viewer as before, and unless you have
objections or can suggest a better integration path, I'll push them to
my branch with attribution ("applied patches from Libo Song mostly
fixing format errors") and request another pull from Andrew. Thanks.

(patch fragments below)

<patch0>
diff --git a/dpkt_http_replacement.py b/dpkt_http_replacement.py
index 4556979..bc0d69b 100644
--- a/dpkt_http_replacement.py
+++ b/dpkt_http_replacement.py
@@ -24,9 +24,10 @@ def parse_headers(f):
k = l[0][:-1].lower()
v = len(l) != 1 and l[1] or ''
if k in d:
- if not type(d[k]) is list:
- d[k] = [d[k]]
- d[k].append(v)
+# if not type(d[k]) is list:
+# d[k] = [d[k]]
+# d[k].append(v)
+ d[k] += v
else:
d[k] = v
return d
</patch0>

<patch1>
diff --git a/http/flow.py b/http/flow.py
index c7f0806..6e2e17d 100644
--- a/http/flow.py
+++ b/http/flow.py
@@ -38,12 +39,21 @@ class Flow:
pairable_responses.extend( [None for i in
range(len(requests) - len(pairable_responses))] )
# if there are more responses, we would just ignore them
anyway, which zip does for us
# create MessagePair's
+ connected = False
for req, resp in zip(requests, responses):
+ if not connected and tcpflow.handshake:
+ req.ts_connect = tcpflow.handshake[0].ts
+ connected = True
+ else:
+ req.ts_connect = req.ts_start
+ if not req:
+ logging.warning("Request is missing.")
+ continue
self.pairs.append(MessagePair(req, resp))
except LookupError:
# there were no responses after the first request
# there's nothing we can do
- pass
+ logging.warning("Request has no reponse.")

class MessagePair:
'''
</patch1>

<patch2>
diff --git a/pcap.py b/pcap.py
index fc2065c..d601318 100644
--- a/pcap.py
+++ b/pcap.py
@@ -48,17 +48,14 @@ class TCPFlowAccumulator:
<skipping>...</skipping>
except dpkt.dpkt.NeedData as e:
+ log.warning(error)
log.warning('A packet in the pcap file was too short, '
'debug_pkt_count=%d' % debug_pkt_count)
self.errors.append((None, e))
</patch2>

<patch3>
diff --git a/httpsession.py b/httpsession.py
index 9dd366b..3f109cb 100644
--- a/httpsession.py
+++ b/httpsession.py
@@ -133,8 +134,8 @@ class HTTPSession(object):
# parse basic data in the pair, add it to the list
self.entries.append(Entry(msg.request, msg.response))

- # LSONG sort the entries on start
- self.entries.sort(lambda x,y: x.ts_start < y.ts_start) # LSONG
+ # Sort the entries on start
+ self.entries.sort(lambda x,y: x.ts_start - y.ts_start)
self.user_agent = self.user_agents.dominant_user_agent()
def json_repr(self):
'''
</patch3>

Jake Holland

unread,

Oct 26, 2010, 4:18:21 PM10/26/10

to pcap...@googlegroups.com, Ryan Witt

The pcap I was using had some private data (friends' Facebook info and
ids), so I don't think I should publish it. But if you send a pcap, I'll
send a har.

-----Original Message-----
From: pcap...@googlegroups.com [mailto:pcap...@googlegroups.com] On
Behalf Of Steve Souders

Sent: Tuesday, October 26, 2010 11:14 AM
To: pcap...@googlegroups.com
Cc: Ryan Witt
Subject: Re: evangelizing pcap2har

Libo Song

unread,

Oct 26, 2010, 4:37:24 PM10/26/10

to pcap...@googlegroups.com, Ryan Witt

There are some repeated headers, especially the "Set-Cookie" header.
You fix is correct to insert "," between values, my fix was a quick hack
to make the harviewer happy.

>
> 2. In patch1, I think you use req and then afterwards check that it's
> usable. If the req was bad, I think you get an exception in the first
> part. Based on this reasoning, I've put the "if not req: continue"
> clause first.
>

Here I am trying to get the connection timing from the handshake packets.
Yes, it the req was bad, we should ignore it.

>
> 3. In patch2, I think the name change was not complete, so I changed "as
> e" to "as error" and the .append to use error instead of e.

Thanks for catching this.

>
>
> 4. In patch3, on "self.entries.sort(lambda x,y: x.ts_start -
> y.ts_start)" I got a "TypeError: comparison function must return int,
> not long", so I changed it to "int(x.ts_start - y.ts_start)".

I had seen this problem before, but I thought is was solved by convert
ts_connection to int value ts_start in millisecond. And I do not get that
error message any more.

Andrew

unread,

Oct 27, 2010, 11:54:07 PM10/27/10

to pcap2har

I'll try to pull them in tomorrow.

Jake Holland

unread,

Oct 28, 2010, 12:29:40 PM10/28/10

to pcap...@googlegroups.com

Honza fixed his viewer to accept pageref instead of page_ref, so I
pushed up the pageref change also. (And thanks, Honza!)

Plus, his verifier now gives some output, and it complained about an
empty page title. I'm getting the page title from the "referer" header,
but when it wasn't present I had it blank, so I wrote that as "top"
instead.

There's one remaining problem reported Honza's verifier: an empty
mimeType (or a missing, non-optional mimeType).

This happens whenever the response has no content-type header, which
seems to happen when the response has 0 bytes of data (which is typical
on a 301, and sometimes even on a 200, for instance on the first
response on the stream with "tcp.port == 2214").

I don't have a good answer for that offhand. I think the best answer
would have been if the spec had listed mimeType as optional in the first
place.

Failing that, I'm torn between leaving it alone with validation errors
and faking a default. I considered "text/plain" but didn't put it in. I
guess a command-line switch would maybe be ok, but I'm not a fan of
handling corner cases that way. Any other suggestions or votes?

Fiddler still imports it fine and has a red caution-looking symbol that
I don't know what it means on every line. It looks the same as before,
near as I can tell.

I attached the output of the latest version of my fork (from the pcap I
sent out earlier). Further discussion and broader testing is invited.
Thanks.

sample_good.har

Jake Holland

unread,

Oct 28, 2010, 2:20:34 PM10/28/10

to pcap...@googlegroups.com

Update: it appears that what firebug with the har exporter actually will
use "application/x-unknown-content-type" as the har mimeType when
"content-type" is not present as an http header.

At least, that's what I saw on this page for 4TFKVFYDRS3V.jpg:

http://blogs.msdn.com/b/fiddler/archive/2010/06/30/import-and-export-htt
p-archives-from-fiddler.aspx

I still don't like the idea of injecting fake data, but I guess that's a
precedent, at least.

-----Original Message-----
From: pcap...@googlegroups.com [mailto:pcap...@googlegroups.com] On

Behalf Of Jake Holland
Sent: Thursday, October 28, 2010 9:30 AM
To: pcap...@googlegroups.com
Subject: RE: evangelizing pcap2har

...

Failing that, I'm torn between leaving it alone with validation errors
and faking a default. I considered "text/plain" but didn't put it in. I
guess a command-line switch would maybe be ok, but I'm not a fan of
handling corner cases that way. Any other suggestions or votes?

...

Andrew

unread,

Oct 28, 2010, 5:26:49 PM10/28/10

to pcap2har

Ok, I think I pulled/pushed all that.

Re page title: It makes more sense to me to use the URL as the default
page title than "top".

Re media type: Should I just make MediaType default to "application/x-
unknown-content-type" until we come up with something more sensible?
It would be easy to fix later, and in the meantime people looking at
it will know to take it with a grain of salt. Maybe later, try to
guess based on the resource name.

I'm also writing docstrings for PageTracker.

Jake Holland

unread,

Oct 28, 2010, 5:58:15 PM10/28/10

to pcap...@googlegroups.com

Excellent. Those all sound like good ideas to me. Thanks.

-----Original Message-----
From: pcap...@googlegroups.com [mailto:pcap...@googlegroups.com] On

Steve Souders

unread,

Oct 28, 2010, 6:00:47 PM10/28/10

to pcap...@googlegroups.com, Jake Holland

When I loaded sample_good.har into HttpWatch it generated this error:
The value 'onContentLoaded' was not found in 'pages[2010-10-26T
14:46:41.836524Z]/pageTimings'

-Steve

Andrew Fleenor

unread,

Oct 28, 2010, 6:15:25 PM10/28/10

to pcap...@googlegroups.com

That's probably because pageTimings is empty. I'll put in a quick hack for that while I'm doing the other changes, make all the values be -1. Though if that's the case, the value is called onContentLoad, not onContentLoaded, and it's supposed to be optional.

Andrew Fleenor

unread,

Oct 28, 2010, 6:23:55 PM10/28/10

to pcap...@googlegroups.com

Pushed to messy. While I've only made a couple minor tweaks, I'm a little queasy about putting them in master until someone tests them.

Jake Holland

unread,

Oct 28, 2010, 6:32:04 PM10/28/10

to pcap...@googlegroups.com

I still haven’t gotten httpwatch, but my guess is that -1 is worse than taking the latest end time of all the responses refered by the same page or something. -1 will probably be an illegal value.

Of course, we don’t actually have this information from a pcap, but “after everything in the page” is a bad but reasonable guess, where “1ms before the page starts” is not, if we need this as a workaround.

Libo Song

unread,

Oct 28, 2010, 6:38:45 PM10/28/10

to pcap...@googlegroups.com

From the HAR spec, -1 means that the timing does not apply to this field.

Jake Holland

unread,

Oct 28, 2010, 6:41:52 PM10/28/10

to pcap...@googlegroups.com

Oh yeah, of course. Thanks, please disregard my last post, -1 sounds right to me :)

Reply all

Reply to author

Forward