There were a few loose ends that made it come out looking crappy (they
were: page_refs and pages, sorting by time, and some verbose debug
messages), so I patched those. It might still be raw in some places, but
at least for some pcaps, I think it's more or less ok on the har viewer.
(I only looked at the master branch.)
I pushed those into http://github.com/GrumpyOldTroll/pcap2har and sent a
pull request to Andrew. I'm sure it could still benefit from some
cleanup and a test suite, but as far as I saw in my brief look, that
build or similar has a chance at not turning off new users.
At some point we need to write a little tutorial and write a setup.py file and get this into the python package index.
--Ryan
-Steve
-libo
Thanks for fixing the format problems, I hadn't noticed those, and the
viewer I was using didn't complain. I agree those changes for format
are correct according to the har spec (c.f.
http://groups.google.com/group/http-archive-specification/web/har-1-2-sp
ec?hl=en).
I also agree with the changes to include the connecting time in the
first request, and that the logging changes are beneficial, so thanks
for those as well.
However, while reviewing your patch I had a few comments and questions
about some changes I didn't fully agree with. Here are the questions
(they refer to specific patches which I've included below all the
questions):
1. I wasn't sure I understood patch0. Can you send a pcap where this is
meaningful?
I think it's saying if there's a repeated header, then instead of making
a list of values it concatenates the values--is that correct?
If so, I'm not sure this represents correct behavior. I'm not surprised
if it causes problems the original way, but I think I just read in the
http spec that a comma should be inserted if you want to merge repeated
http headers without changing the semantics:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2
(The last paragraph in the section, specifically.)
Based on this, I've made it "d[k] += ','+v" instead of "d[k] += v" and
took out the commented part.
2. In patch1, I think you use req and then afterwards check that it's
usable. If the req was bad, I think you get an exception in the first
part. Based on this reasoning, I've put the "if not req: continue"
clause first.
3. In patch2, I think the name change was not complete, so I changed "as
e" to "as error" and the .append to use error instead of e.
4. In patch3, on "self.entries.sort(lambda x,y: x.ts_start -
y.ts_start)" I got a "TypeError: comparison function must return int,
not long", so I changed it to "int(x.ts_start - y.ts_start)".
With the changes I've described, I've applied your patch (fixing
conflicts where I had previously changed the logging that you removed)
and tried it on the same pcap and viewer as before, and unless you have
objections or can suggest a better integration path, I'll push them to
my branch with attribution ("applied patches from Libo Song mostly
fixing format errors") and request another pull from Andrew. Thanks.
(patch fragments below)
<patch0>
diff --git a/dpkt_http_replacement.py b/dpkt_http_replacement.py
index 4556979..bc0d69b 100644
--- a/dpkt_http_replacement.py
+++ b/dpkt_http_replacement.py
@@ -24,9 +24,10 @@ def parse_headers(f):
k = l[0][:-1].lower()
v = len(l) != 1 and l[1] or ''
if k in d:
- if not type(d[k]) is list:
- d[k] = [d[k]]
- d[k].append(v)
+# if not type(d[k]) is list:
+# d[k] = [d[k]]
+# d[k].append(v)
+ d[k] += v
else:
d[k] = v
return d
</patch0>
<patch1>
diff --git a/http/flow.py b/http/flow.py
index c7f0806..6e2e17d 100644
--- a/http/flow.py
+++ b/http/flow.py
@@ -38,12 +39,21 @@ class Flow:
pairable_responses.extend( [None for i in
range(len(requests) - len(pairable_responses))] )
# if there are more responses, we would just ignore them
anyway, which zip does for us
# create MessagePair's
+ connected = False
for req, resp in zip(requests, responses):
+ if not connected and tcpflow.handshake:
+ req.ts_connect = tcpflow.handshake[0].ts
+ connected = True
+ else:
+ req.ts_connect = req.ts_start
+ if not req:
+ logging.warning("Request is missing.")
+ continue
self.pairs.append(MessagePair(req, resp))
except LookupError:
# there were no responses after the first request
# there's nothing we can do
- pass
+ logging.warning("Request has no reponse.")
class MessagePair:
'''
</patch1>
<patch2>
diff --git a/pcap.py b/pcap.py
index fc2065c..d601318 100644
--- a/pcap.py
+++ b/pcap.py
@@ -48,17 +48,14 @@ class TCPFlowAccumulator:
<skipping>...</skipping>
except dpkt.dpkt.NeedData as e:
+ log.warning(error)
log.warning('A packet in the pcap file was too short, '
'debug_pkt_count=%d' % debug_pkt_count)
self.errors.append((None, e))
</patch2>
<patch3>
diff --git a/httpsession.py b/httpsession.py
index 9dd366b..3f109cb 100644
--- a/httpsession.py
+++ b/httpsession.py
@@ -133,8 +134,8 @@ class HTTPSession(object):
# parse basic data in the pair, add it to the list
self.entries.append(Entry(msg.request, msg.response))
- # LSONG sort the entries on start
- self.entries.sort(lambda x,y: x.ts_start < y.ts_start) # LSONG
+ # Sort the entries on start
+ self.entries.sort(lambda x,y: x.ts_start - y.ts_start)
self.user_agent = self.user_agents.dominant_user_agent()
def json_repr(self):
'''
</patch3>
-----Original Message-----
From: pcap...@googlegroups.com [mailto:pcap...@googlegroups.com] On
Behalf Of Steve Souders
Sent: Tuesday, October 26, 2010 11:14 AM
To: pcap...@googlegroups.com
Cc: Ryan Witt
Subject: Re: evangelizing pcap2har
There are some repeated headers, especially the "Set-Cookie" header.
You fix is correct to insert "," between values, my fix was a quick hack
to make the harviewer happy.
>
> 2. In patch1, I think you use req and then afterwards check that it's
> usable. If the req was bad, I think you get an exception in the first
> part. Based on this reasoning, I've put the "if not req: continue"
> clause first.
>
Here I am trying to get the connection timing from the handshake packets.
Yes, it the req was bad, we should ignore it.
>
> 3. In patch2, I think the name change was not complete, so I changed "as
> e" to "as error" and the .append to use error instead of e.
Thanks for catching this.
>
>
> 4. In patch3, on "self.entries.sort(lambda x,y: x.ts_start -
> y.ts_start)" I got a "TypeError: comparison function must return int,
> not long", so I changed it to "int(x.ts_start - y.ts_start)".
I had seen this problem before, but I thought is was solved by convert
ts_connection to int value ts_start in millisecond. And I do not get that
error message any more.
Plus, his verifier now gives some output, and it complained about an
empty page title. I'm getting the page title from the "referer" header,
but when it wasn't present I had it blank, so I wrote that as "top"
instead.
There's one remaining problem reported Honza's verifier: an empty
mimeType (or a missing, non-optional mimeType).
This happens whenever the response has no content-type header, which
seems to happen when the response has 0 bytes of data (which is typical
on a 301, and sometimes even on a 200, for instance on the first
response on the stream with "tcp.port == 2214").
I don't have a good answer for that offhand. I think the best answer
would have been if the spec had listed mimeType as optional in the first
place.
Failing that, I'm torn between leaving it alone with validation errors
and faking a default. I considered "text/plain" but didn't put it in. I
guess a command-line switch would maybe be ok, but I'm not a fan of
handling corner cases that way. Any other suggestions or votes?
Fiddler still imports it fine and has a red caution-looking symbol that
I don't know what it means on every line. It looks the same as before,
near as I can tell.
I attached the output of the latest version of my fork (from the pcap I
sent out earlier). Further discussion and broader testing is invited.
Thanks.
At least, that's what I saw on this page for 4TFKVFYDRS3V.jpg:
http://blogs.msdn.com/b/fiddler/archive/2010/06/30/import-and-export-htt
p-archives-from-fiddler.aspx
I still don't like the idea of injecting fake data, but I guess that's a
precedent, at least.
-----Original Message-----
From: pcap...@googlegroups.com [mailto:pcap...@googlegroups.com] On
Behalf Of Jake Holland
Sent: Thursday, October 28, 2010 9:30 AM
To: pcap...@googlegroups.com
Subject: RE: evangelizing pcap2har
...
Failing that, I'm torn between leaving it alone with validation errors
and faking a default. I considered "text/plain" but didn't put it in. I
guess a command-line switch would maybe be ok, but I'm not a fan of
handling corner cases that way. Any other suggestions or votes?
...
-----Original Message-----
From: pcap...@googlegroups.com [mailto:pcap...@googlegroups.com] On
-Steve
I still haven’t gotten httpwatch, but my guess is that -1 is worse than taking the latest end time of all the responses refered by the same page or something. -1 will probably be an illegal value.
Of course, we don’t actually have this information from a pcap, but “after everything in the page” is a bad but reasonable guess, where “1ms before the page starts” is not, if we need this as a workaround.