Introducing pywb and a demo

Ilya Kreymer

Mar 7, 2014, 4:28:06 PM
to openway...@googlegroups.com
Hi,

In the interest of openness and transparency and moving web archiving forward, I thought I should share some of the latest work I have been doing with this group.
Some of you may have heard of this already, so I thought I should officially announce it myself.

I have been working on a new Python-based implementation of the Wayback Machine, called pywb.

It can be found here with a lot more info: https://github.com/ikreymer/pywb

The project was created with basic best practices (continuous integration, test suite, etc.) in mind, and hopefully will continue to evolve in this direction.

While I have been leading the development of the project, there have been a few outside contributions, both from inside and outside IA.

Disclaimer: This is not an official IA project at this time, and no support or adoption from IA should be expected or guaranteed.

The project was started to address multiple limitations of the existing wayback architecture, specifically the lack of domain-specific rules, which are necessary to display difficult and dynamic content.

Much of the focus of the project has been on the core replay component, which is the most challenging.

A Demo

Based on the recent survey results (thanks to everyone who contributed!), social media and dynamic content continue to be both important and among the most challenging areas for web archival playback.
I've set up an independent demo of some work in these areas: Facebook, Twitter, and experimental work with Flickr.

I've hosted a version of pywb on the app hosting site Heroku to demonstrate that it is self-contained.


This features a couple of examples (captured using warcprox):

* Facebook page with dynamic scrolldown

* Twitter page with dynamic scrolldown

* Experimental Flickr work, with partial replay of dynamic content. Flickr replay continues to be a challenge but often works much better in proxy mode.


This exact deployment of pywb running on Heroku is available in this repo:


(I've created a separate repo from the main pywb repo so that the main repo does not include large WARC files to download.)

Unfortunately, Heroku does not support proxy mode, but you may try it locally.
I would encourage anyone interested to try these samples locally as well.

Ed Summers suggested that we have a centralized place for test content and I think that's a great idea.
I'd be happy to move these sample WARCs to a more centralized place (perhaps not on GitHub?) if there is overall interest.

I would like to invite others interested in web archiving replay to take a look at this project, and I appreciate any feedback!

Thanks,
Ilya
