Retrieve number of sn

Anne Helmond

Feb 7, 2019, 7:02:28 AM

to Memento Development

Dear Memento,

First of all, thanks for the great service and tools. I am a researcher at the University of Amsterdam who uses web archives a lot and Memento is really valuable in doing so.

I am currently trying to get the following information out of Memento:

For a URL, how many mementos are available in each web archive, and between which dates:

URL (e.g. http://developers.facebook.com) – 23 mementos – archive_id – startdate / enddate

Andy J already helpfully pointed me to TimeMap for access to version history, which then returns a list of TimeMap URIs that may be exposed by archives.

So if I want to know which web archives have archived "developers.facebook.com", how many Mementos they have available, and what the dates of the first and last Memento are:

http://timetravel.mementoweb.org/timemap/json/http://developers.facebook.com
The response lists a TimeMap URI for each archive, for example "uri":"https://arquivo.pt/wayback/timemap/*/http://developers.facebook.com".
That TimeMap in turn contains the list of URLs (Mementos) held by Arquivo, which I can count, along with the dates of their first and last Memento.

However, it feels like I am overlooking something and that this could be done much more easily. I hope you understand my question :)

Thanks! Anne

Herbert Van de Sompel

Feb 7, 2019, 7:28:32 AM

to memen...@googlegroups.com, Herbert Van de Sompel

The question is easier for whom? For you or for the Memento infrastructure? ;-)

The approach you describe is the "Do It Yourself" TimeMap for access to version history, described at http://timetravel.mementoweb.org/guide/api/#timemap-diy . That one is easy on the Memento infrastructure and harder on you. In essence, the added value this approach provides over polling _all_ web archives yourself is that it tells you which archives are worth polling (the ones that have mementos) and which are not (the ones that don't). The infrastructure makes that distinction based on an ML approach.

The other approach is the "We Do It" TimeMap for access to version history, described at http://timetravel.mementoweb.org/guide/api/#timemap-wdi . This one is easy for you and harder on the Memento infrastructure. It involves the infrastructure effectively polling all web archives on your behalf. Response times can be slow if the required information is not cached. Note that when there are many mementos, an Index TimeMap rather than an actual TimeMap will be provided. Details are at the URL listed earlier in this paragraph.

I hope this helps.

Cheers

Herbert Van de Sompel

--

==================

Herbert Van de Sompel

Chief Innovation Officer

DANS

herbert.va...@dans.knaw.nl

+31 6 22 83 93 15

https://hvdsomp.info

https://orcid.org/0000-0002-0715-6126

Sawood Alam

Feb 7, 2019, 10:04:37 AM

to memento-dev

Hey Anne,

I just tried doing it using MemGator, a Memento aggregator one can run locally (https://github.com/oduwsdl/MemGator).

$ memgator -f cdxj developers.facebook.com | grep -v "^@" | awk -F'[/ ]' '{if(!($5 in a)){a[$5]["first"]=$1};a[$5]["last"]=$1;a[$5]["count"]++} END {for (k in a){print a[k]["count"]" - "k" - "a[k]["first"]" / "a[k]["last"]}}'

107 - arquivo.pt - 20091223050754 / 20161123182230

21459 - web.archive.org - 20060820131607 / 20190206193533

16639 - wayback.archive-it.org - 20081121234147 / 20190204192631

273 - swap.stanford.edu - 20081209074335 / 20150516040436

857 - wayback.vefsafn.is - 20090422113107 / 20181009100252

375 - webarchive.loc.gov - 20080214023504 / 20180104004903

29 - archive.md - 20060820131607 / 20170601173243

However, for research purposes I would first dump TimeMaps of various URIs in separate files and process them later. This will allow me to revisit those dumps to extract more information later without hitting the network again.

# Download TimeMaps

$ memgator -f cdxj developers.facebook.com > /PATH/TO/TIMEMAPS/URI-1.cdxj

# Extract per-archive summary

$ grep -v "^@" /PATH/TO/TIMEMAPS/URI-1.cdxj | awk -F'[/ ]' '{if(!($5 in a)){a[$5]["first"]=$1};a[$5]["last"]=$1;a[$5]["count"]++} END {for (k in a){print a[k]["count"]" - "k" - "a[k]["first"]" / "a[k]["last"]}}'

# Yearly memento counts

$ grep -v "^@" /PATH/TO/TIMEMAPS/URI-1.cdxj | cut -c-4 | uniq -c
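
A note on portability: the `a[$5]["first"]` syntax in the per-archive summary above is an array of arrays, which is a GNU awk (gawk 4 and later) extension, so mawk and BSD awk will reject it. Below is a minimal sketch of the same summary using three plain arrays instead. It assumes the same MemGator CDXJ layout (archive host in field 5 when splitting on "/" and space) and datetime-sorted input; the function name `summarize_cdxj` is made up for illustration.

```shell
# Per-archive summary of a CDXJ TimeMap read from stdin.
# Assumes lines shaped like:
#   20060820131607 {"uri": "http://web.archive.org/web/.../", ...}
# and that the lines are already sorted by datetime (MemGator sorts them).
summarize_cdxj() {
  grep -v '^@' |
    awk -F'[/ ]' '
      # First time this host shows up: remember its earliest datetime.
      !($5 in count) { first[$5] = $1 }
      # Every line: update the latest datetime and bump the counter.
      { last[$5] = $1; count[$5]++ }
      END {
        for (k in count)
          print count[k] " - " k " - " first[k] " / " last[k]
      }'
}
```

Usage would be `summarize_cdxj < /PATH/TO/TIMEMAPS/URI-1.cdxj`. The order of the output lines is unspecified, so pipe through `sort` if a stable order matters.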

I used CDXJ format for TimeMaps because it is easier to parse, but you can do it using standard Link format as well. I would note here that MemGator does not aggregate from as many sources by default as LANL's TimeTravel service, but you can customize the list of archives to aggregate from using the "-A" flag. Also, MemGator provides the complete TimeMap in one go, sorted by Datetime, without any pagination or nested index of TimeMaps, which makes it easier to process the response, but may be slower.
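
For the Link format route, a rough sketch of the same per-archive summary might look like the following. It assumes one link per line (RFC 7089 does not require this), entries sorted by datetime, and that every memento entry carries a `rel` value containing "memento" plus a `datetime` attribute. The function name `summarize_link` is hypothetical, and the dates are printed as they appear in the TimeMap rather than as 14-digit timestamps.

```shell
# Per-archive summary of a Link-format TimeMap read from stdin.
# Assumes one link per line, sorted by datetime, e.g.:
#   <http://web.archive.org/web/2006.../http://example.com/>; rel="memento"; datetime="Sun, 20 Aug 2006 13:16:07 GMT",
summarize_link() {
  awk '
    # Only memento entries: rel contains "memento" and a datetime is present.
    /rel="[^"]*memento[^"]*"/ && /datetime="/ {
      host = $0
      sub(/^[ \t]*<[a-z]+:\/\//, "", host)  # drop the leading "<scheme://"
      sub(/\/.*/, "", host)                 # drop everything after the host
      dt = $0
      sub(/.*datetime="/, "", dt)           # drop everything up to the value
      sub(/".*/, "", dt)                    # drop the closing quote and the rest
      if (!(host in count)) first[host] = dt
      last[host] = dt
      count[host]++
    }
    END {
      for (h in count)
        print count[h] " - " h " - " first[h] " / " last[h]
    }'
}
```

With MemGator this could be fed as `memgator -f link developers.facebook.com | summarize_link`; entries such as rel="original", rel="self", or rel="timegate" are skipped because they do not match the memento pattern.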

Best,

--

Sawood Alam

Department of Computer Science

Old Dominion University

Norfolk VA 23529

Anne Helmond

Feb 13, 2019, 7:41:29 AM

to Memento Development

Hi, thanks for the response and for outlining the differences between the two. This really helped.

Anne

m...@matkelly.com

Feb 13, 2019, 7:41:29 AM

to Memento Development

Anne,

In a previous study we did something similar:

Mat Kelly, Lulwah M. Alkwai, Sawood Alam, Michael L. Nelson, Michele C. Weigle, and Herbert Van de Sompel, “Impact of URI Canonicalization on Memento Count,” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Toronto, Canada, June 2017, pp. 303-304.

https://www.cs.odu.edu/~mkelly/papers/2017_jcdl_countingMementos.pdf

and the much more extensive tech report at https://arxiv.org/abs/1703.03302

One of the takeaways from the study was that when a TimeMap/aggregator reports X URI-Ms, the number of non-redirecting (i.e., non-HTTP 3XX) representations obtained when those URI-Ms are dereferenced is likely substantially lower. Many URI-Ms in a TimeMap simply redirect to another URI-M in the TimeMap when dereferenced.

Keep that in mind when counting the quantity of holdings for a URI-R.
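
As a rough illustration of how one might check this on a saved TimeMap, the sketch below pulls the URI-Ms out of a MemGator CDXJ dump and records the HTTP status of each without following redirects. It is not the method used in the study. The function names are made up, and it assumes that "uri" is the first key on each CDXJ line. It issues one HEAD request per memento, so try it on a small sample before running it against a large TimeMap.

```shell
# Print the URI-M of every memento line in a CDXJ TimeMap read from stdin.
# Assumes "uri" is the first JSON key on each line, so the URI is the
# fourth field when splitting on double quotes.
list_urims() {
  grep -v '^@' | awk -F'"' '{ print $4 }'
}

# For each URI-M on stdin, print "<status> <URI-M>".
# curl is called without -L, so a 3xx response is reported rather than followed.
check_urims() {
  while IFS= read -r urim; do
    status=$(curl -s -o /dev/null -I -w '%{http_code}' "$urim")
    printf '%s %s\n' "$status" "$urim"
  done
}
```

For example, `list_urims < /PATH/TO/TIMEMAPS/URI-1.cdxj | check_urims | cut -c1 | sort | uniq -c` would tally responses by status class, which shows how many URI-Ms answer with 3xx versus 2xx.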

-Mat

Anne Helmond

Feb 13, 2019, 7:41:29 AM

to Memento Development

Hi Sawood,

This is extremely useful! We downloaded MemGator and proceeded along similar lines, as you suggested. This was incredibly helpful. Thanks!

Anne

Anne Helmond

Feb 13, 2019, 7:56:38 AM

to memen...@googlegroups.com

Great, thanks for the heads up. I will dive further into the previous study. Thanks, Anne
