Retrieve number of sn

Anne Helmond

Feb 7, 2019, 7:02:28 AM
to Memento Development
Dear Memento,

First of all, thanks for the great service and tools. I am a researcher at the University of Amsterdam who uses web archives a lot and Memento is really valuable in doing so.

I am currently trying to get the following information out of Memento:
For a URL, how many mementos are available in each web archive, and between which dates:
Andy J already helpfully pointed me to TimeMap for access to version history, which then returns a list of TimeMap URIs that may be exposed by archives.

So if I want to know which web archives have archived "developers.facebook.com", how many Mementos they have available, and what the dates of the first and last Memento are, I currently do the following:
  1. http://timetravel.mementoweb.org/timemap/json/http://developers.facebook.com
  2. this info then contains for example for each archive "uri":"https://arquivo.pt/wayback/timemap/*/http://developers.facebook.com"
  3. then this info contains a list of URLs (Mementos) for Arquivo that I can count, and the dates for their first and last Memento.
However, it feels like I am overlooking something and that this could be done much more easily. I hope you understand my question :)
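
Step 3 above can be sketched in Python as follows. This is only a minimal sketch, assuming the per-archive TimeMap is returned in the standard Link format (RFC 7089); the function name `summarize_timemap` and the `SAMPLE` text are illustrative and not taken from any archive's actual response.

```python
import re
from email.utils import parsedate_to_datetime

# One link-value looks like: <URI>; rel="..."; datetime="..."
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')

# Illustrative Link-format TimeMap (hypothetical content, not a real response).
SAMPLE = '''<http://developers.facebook.com/>; rel="original",
<https://arquivo.pt/wayback/timemap/*/http://developers.facebook.com>; rel="self"; type="application/link-format",
<https://arquivo.pt/wayback/20091223050754/http://developers.facebook.com/>; rel="first memento"; datetime="Wed, 23 Dec 2009 05:07:54 GMT",
<https://arquivo.pt/wayback/20161123182230/http://developers.facebook.com/>; rel="last memento"; datetime="Wed, 23 Nov 2016 18:22:30 GMT"'''


def summarize_timemap(link_text):
    """Return (count, first, last) for all entries whose rel includes 'memento'."""
    datetimes = []
    for _uri, attr_text in LINK_RE.findall(link_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        # rel may hold several values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split() and "datetime" in attrs:
            datetimes.append(parsedate_to_datetime(attrs["datetime"]))
    if not datetimes:
        return 0, None, None
    return len(datetimes), min(datetimes), max(datetimes)


count, first, last = summarize_timemap(SAMPLE)
print(count, first, last)
```

Running the same function over each TimeMap URI listed in the index from step 1 would give one (count, first, last) triple per archive.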

Thanks! Anne

Herbert Van de Sompel

Feb 7, 2019, 7:28:32 AM
to memen...@googlegroups.com, Herbert Van de Sompel
The question is easier for whom? For you or for the Memento infrastructure? ;-)

The approach you describe is the "'Do It Yourself' TimeMap for access to version history" described at http://timetravel.mementoweb.org/guide/api/#timemap-diy . That one is easy on the Memento infrastructure and harder on you. In essence, the added value this approach provides over polling _all_ web archives yourself is that it tells you which archives are worth polling (the ones that have mementos) and which are not (the ones that don't). The infrastructure makes that distinction using a machine-learning approach.

The other approach is the "'We Do It' TimeMap for access to version history" described at http://timetravel.mementoweb.org/guide/api/#timemap-wdi . This one is easy for you and harder on the Memento infrastructure. It involves the infrastructure effectively polling all web archives on your behalf. Response times can be slow if the required information is not cached. Note that when there are many mementos, an Index TimeMap rather than an actual TimeMap will be provided. Details are at the URL listed earlier in this paragraph.

I hope this helps.

Cheers

Herbert Van de Sompel


 
--
==================
Herbert Van de Sompel
Chief Innovation Officer 
DANS
+31 6 22 83 93 15

Sawood Alam

Feb 7, 2019, 10:04:37 AM
to memento-dev
Hey Anne,

I just tried doing it using MemGator, a Memento aggregator one can run locally (https://github.com/oduwsdl/MemGator).

$ memgator -f cdxj developers.facebook.com | grep -v "^@" | awk -F'[/ ]' '{if(!($5 in a)){a[$5]["first"]=$1};a[$5]["last"]=$1;a[$5]["count"]++} END {for (k in a){print a[k]["count"]" - "k" - "a[k]["first"]" / "a[k]["last"]}}'
107 - arquivo.pt - 20091223050754 / 20161123182230
21459 - web.archive.org - 20060820131607 / 20190206193533
16639 - wayback.archive-it.org - 20081121234147 / 20190204192631
273 - swap.stanford.edu - 20081209074335 / 20150516040436
857 - wayback.vefsafn.is - 20090422113107 / 20181009100252
375 - webarchive.loc.gov - 20080214023504 / 20180104004903
29 - archive.md - 20060820131607 / 20170601173243

However, for research purposes I would first dump TimeMaps of various URIs in separate files and process them later. This will allow me to revisit those dumps to extract more information later without hitting the network again.

# Download TimeMaps
$ memgator -f cdxj developers.facebook.com > /PATH/TO/TIMEMAPS/URI-1.cdxj

# Extract per-archive summary
$ grep -v "^@" /PATH/TO/TIMEMAPS/URI-1.cdxj | awk -F'[/ ]' '{if(!($5 in a)){a[$5]["first"]=$1};a[$5]["last"]=$1;a[$5]["count"]++} END {for (k in a){print a[k]["count"]" - "k" - "a[k]["first"]" / "a[k]["last"]}}'

# Yearly memento counts
$ grep -v "^@" /PATH/TO/TIMEMAPS/URI-1.cdxj | cut -c-4 | uniq -c

I used the CDXJ format for TimeMaps because it is easier to parse, but you can do the same with the standard Link format. Note that MemGator does not aggregate from as many sources by default as LANL's TimeTravel service, but you can customize the list of archives to aggregate from using the "-a" ("--arcs") flag. Also, MemGator provides the complete TimeMap in one go, sorted by datetime, without any pagination or nested index of TimeMaps, which makes the response easier to process but may be slower.
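
For anyone who prefers a script over the one-liners, the per-archive summary and the yearly counts could be sketched in Python roughly as below. This is a minimal sketch, assuming each data line of the dumped CDXJ file is a 14-digit datetime followed by a JSON object with a "uri" key; the function name `summarize_cdxj` and the `SAMPLE` lines are illustrative only.

```python
import json
from collections import Counter
from urllib.parse import urlsplit

# Illustrative CDXJ lines (hypothetical content modeled on the output above).
SAMPLE = [
    '@meta {"note": "illustrative header line"}',
    '20060820131607 {"uri": "http://web.archive.org/web/20060820131607/http://developers.facebook.com/"}',
    '20091223050754 {"uri": "http://arquivo.pt/wayback/20091223050754/http://developers.facebook.com/"}',
    '20190206193533 {"uri": "http://web.archive.org/web/20190206193533/http://developers.facebook.com/"}',
]


def summarize_cdxj(lines):
    """Return (per_archive, per_year) summaries for an iterable of CDXJ lines."""
    per_archive = {}
    per_year = Counter()
    for line in lines:
        key, _, payload = line.strip().partition(" ")
        if not key.isdigit():
            continue  # skips blank lines and "@"-prefixed metadata lines
        host = urlsplit(json.loads(payload)["uri"]).netloc
        entry = per_archive.setdefault(host, {"count": 0, "first": key, "last": key})
        entry["count"] += 1
        # Fixed-width datetimes compare correctly as strings.
        entry["first"] = min(entry["first"], key)
        entry["last"] = max(entry["last"], key)
        per_year[key[:4]] += 1
    return per_archive, per_year


per_archive, per_year = summarize_cdxj(SAMPLE)
for host, info in per_archive.items():
    print(info["count"], "-", host, "-", info["first"], "/", info["last"])
for year in sorted(per_year):
    print(per_year[year], year)
```

To run it on a dump, pass an open file object instead of `SAMPLE`, for example `summarize_cdxj(open("/PATH/TO/TIMEMAPS/URI-1.cdxj"))`. Unlike the awk version, it does not rely on the input being sorted by datetime.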

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529



Anne Helmond

Feb 13, 2019, 7:41:29 AM
to Memento Development
Hi, thanks for the response and for outlining the differences between the two. This really helped.

Anne

m...@matkelly.com

Feb 13, 2019, 7:41:29 AM
to Memento Development
Anne,
In a previous study we did something similar:

Mat Kelly, Lulwah M. Alkwai, Sawood Alam, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel, “Impact of URI Canonicalization on Memento Count,” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Toronto, Canada, June 2017, pp. 303-304.

and the much more extensive tech report at https://arxiv.org/abs/1703.03302

One of the takeaways from the study was that even if a TimeMap/aggregator reports X URI-Ms, the number of non-redirecting (i.e., non-HTTP 3XX) representations you get when the URI-Ms are dereferenced is likely substantially lower. Many URI-Ms in a TimeMap simply redirect to another URI-M in the same TimeMap when dereferenced.

Keep that in mind when counting the quantity of holdings for a URI-R.
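
One way to check this for a list of URI-Ms is sketched below in Python. It is only a sketch under stated assumptions: the helper names `head_status` and `count_non_redirecting` are hypothetical, the check uses HEAD requests (which some archives may not answer the same way as GET), and it is not the method used in the study.

```python
import http.client
from urllib.parse import urlsplit


def head_status(urim, timeout=30):
    """Return the HTTP status of a HEAD request, without following redirects."""
    parts = urlsplit(urim)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def count_non_redirecting(urims, get_status=head_status):
    """Count URI-Ms whose status is not in the 3XX range."""
    return sum(1 for urim in urims if not 300 <= get_status(urim) < 400)
```

Passing a different `get_status` callable makes it possible to reuse cached status codes instead of hitting the archives again, and pausing between requests would be considerate when checking thousands of URI-Ms.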

-Mat

Anne Helmond

Feb 13, 2019, 7:41:29 AM
to Memento Development
Hi Sawood,

This is extremely useful! We downloaded MemGator and proceeded along similar lines to what you suggested. This was incredibly helpful. Thanks!

Anne

Anne Helmond

Feb 13, 2019, 7:56:38 AM
to memen...@googlegroups.com
Great, thanks for the heads-up. I will dive further into the previous study. Thanks, Anne