Bulk access to the HAR files?


Brian Pane

Jul 1, 2011, 1:38:29 PM
to HTTP Archive
Are the HAR files for the collected test results available as a
single, large download anywhere? I know I can get detailed request
timings for a single page by fetching a mysqldump download from
httparchive.org and using the wptid and wptrun values in the Pages
table to formulate an HTTP request to httparchive.webpagetest.org's
export.php script; but if there's a bulk HAR download I should be
using instead, I'll gladly switch to that.

Thanks,
-Brian
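
The per-page workaround described above could be scripted roughly as follows. This is only a sketch: the host and script name come from the message, but the query parameter names (`test`, `run`), the helper names, and the one-second delay are assumptions and would need checking against the actual export.php script.

```python
import time
import urllib.parse
import urllib.request

# Host and script name are taken from the thread. The query parameter
# names ("test", "run") are ASSUMPTIONS, not confirmed by the thread.
EXPORT_URL = "http://httparchive.webpagetest.org/export.php"


def build_export_url(wptid, wptrun):
    """Build the export.php URL for one row of the Pages table."""
    query = urllib.parse.urlencode({"test": wptid, "run": wptrun})
    return f"{EXPORT_URL}?{query}"


def fetch_hars(rows, delay_seconds=1.0):
    """Yield (wptid, raw HAR bytes) for each (wptid, wptrun) pair.

    Sleeps between requests so the crawl stays gentle on the server.
    """
    for wptid, wptrun in rows:
        with urllib.request.urlopen(build_export_url(wptid, wptrun)) as resp:
            yield wptid, resp.read()
        time.sleep(delay_seconds)
```

The `rows` argument would be the (wptid, wptrun) pairs selected from the Pages table in the mysqldump; the delay is there because every HAR costs the server one request.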

Annie Sullivan

Jul 13, 2011, 12:22:59 PM
to httpa...@googlegroups.com

I would love to be able to bulk-download the HAR files as well.

Steve Souders

Jul 13, 2011, 1:53:10 PM
to HTTP Archive
How would you package those up? Downloading all HAR files for all runs
would be too huge. Even the HAR files for one run might be too big.

What would you do with them?

-Steve

Brian Pane

Jul 13, 2011, 2:37:31 PM
to HTTP Archive
On Jul 13, 10:53 am, Steve Souders <stevesouders...@gmail.com> wrote:
> How would you package those up? Downloading all HAR files for all runs
> would be too huge. Even the HAR files for one run might be too big.

I'd be happy with one enormous tar file. HAR files are bulky but
compress quite well. At, say, 10KB compressed for an average site's
HAR file, the download size for the entire archive would be 150MB.
That's on the same order of magnitude as the currently available MySQL
dump files.

> What would you do with them?

I'm attempting to model the effect that HTTP pipelining (or a large
increase in the client's number of concurrent requests) would have on
nontrivial websites. My methodology is basically:

1. Given a waterfall, apply some heuristics to identify requests that
could be issued earlier if more connections were available or if the
client were willing to pipeline its requests.
2. Adjust the waterfall accordingly.
3. Determine whether the adjusted waterfall has a shorter total
duration than the original.
4. Repeat steps 1-3 for a huge number of websites.

Thanks,
-Brian
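
The four steps Brian lists above can be illustrated with a toy model. This is not his heuristic, which the thread does not describe. It assumes each request already carries a hypothetical `ready_ms` value (the earliest moment the client could have issued it) and simply replays the waterfall with a different connection limit; the function name and the sample numbers are made up for illustration.

```python
import heapq


def total_duration(requests, max_connections):
    """Replay a waterfall under a connection limit.

    requests: list of (ready_ms, duration_ms) tuples, where ready_ms is an
    ASSUMED earliest issue time (a HAR file does not record it directly).
    Returns the time at which the last request finishes.
    """
    # Min-heap of the times at which each connection becomes free.
    free_at = [0.0] * max_connections
    heapq.heapify(free_at)
    finish = 0.0
    for ready, duration in sorted(requests):
        # A request starts once it is ready AND a connection is free.
        start = max(ready, heapq.heappop(free_at))
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Steps 1-3 for one made-up page: one HTML fetch, then three resources.
page = [(0, 100), (100, 200), (100, 200), (100, 200)]
original = total_duration(page, max_connections=2)
adjusted = total_duration(page, max_connections=4)
benefits = adjusted < original  # step 4 would repeat this over many sites
```

For this made-up page the two-connection replay finishes at 500 ms and the four-connection replay at 300 ms, so it would count as a site that benefits. A real analysis would have to derive the ready times from the HAR data, which is the hard part.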

Steve Souders

Jul 13, 2011, 3:11:03 PM
to HTTP Archive
It's ~150MB per run. The entire archive as of now would be 3GB. Even
if you only wanted one run, we're about to go from 17K URLs to 1M, so
the size for a single run would be 9GB.

Perhaps there's a way to package them so the downloads aren't so big.

-Steve
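
The size estimates traded in this exchange follow from simple arithmetic, sketched below. The function name is hypothetical, decimal units (1 MB = 1000 kB) are assumed, and the per-HAR size in the second calculation is derived from Steve's figures rather than measured.

```python
def estimate_run_size_mb(num_urls, avg_har_kb):
    """Estimated size of one run in MB, assuming 1 MB = 1000 kB."""
    return num_urls * avg_har_kb / 1000


# Brian's estimate: roughly 15K pages at about 10 kB compressed each.
brian_mb = estimate_run_size_mb(15_000, 10)

# Steve's projection: ~150 MB for 17K URLs implies about 8.8 kB per HAR;
# scale that average up to 1M URLs.
per_har_kb = 150 * 1000 / 17_000
projected_gb = estimate_run_size_mb(1_000_000, per_har_kb) / 1000
```

Here `brian_mb` comes to 150 MB and `projected_gb` to about 8.8, consistent with the "~150MB per run" and "9GB" figures in the thread.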

Brian Pane

Jul 20, 2011, 12:33:49 AM
to HTTP Archive
For now, I'm getting the HARs for my research via lots of HTTP
requests to the export.php script. For 15K pages, that's proved to be
a reasonable workaround. By the way, the HAR files have yielded some
very useful data already: from the July 1st run, 66% of the sites have
sequences of requests that could benefit from pipelining.

-Brian

Steve Souders

Jul 20, 2011, 6:18:51 PM
to HTTP Archive
Hi, Brian.

I saw your blog post and passed it to the SPDY folks. Very cool.
http://www.brianp.net/2011/07/19/will-http-pipelining-help-a-study-based-on-the-httparchive-org-data-set/

I'm open to suggestions on how to make the HAR files available in an
efficient way so it's less painful for you and for our servers.

-Steve

gup...@gmail.com

Jun 21, 2012, 1:14:02 PM
to httpa...@googlegroups.com
Are the HAR files available for bulk download now? If so, could you please post the link?

Thanks,
Vikas

Charlie Clark

Jun 22, 2012, 5:00:36 PM
to httpa...@googlegroups.com
On 21.06.2012 at 19:14, <gup...@gmail.com> wrote:

> Are the HAR files available for bulk download now? If yes, will you
> please post the link.

Given that each HAR is about 100 kB (10 kB compressed) and there are
currently over 200,000 sites in the index, that's around 2 GB per run. I'm
not sure there's a lot to be gained from pumping the whole archive over
the network each time instead of working out which metrics you want to
apply on a batch and having that done centrally. It's easy enough to set
up a batch job for a subset of URLs if you want to process them regularly.
Personally, I think batching them by URL would make more sense for more
efficient comparisons over time. How would you like them grouped and what
analyses would you be performing?

Charlie
--
Charlie Clark
Managing Director
Clark Consulting & Research
German Office
Kronenstr. 27a
Düsseldorf
D- 40217
Tel: +49-211-600-3657
Mobile: +49-178-782-6226