discrepancies in retrievability of individual mementos vs. their timemaps


Nicholas Taylor

May 24, 2018, 3:12:25 PM
to openwayback-dev
Hello All,

I've noticed that for some resources indexed in our OpenWayback instance, individual mementos can be accessed, e.g., https://swap.stanford.edu/20170628171729/https://www.waseda.jp/fcom/soc/ but trying to access the timemap for all versions of the same resource gives "Resource not in archive": https://swap.stanford.edu/*/https://www.waseda.jp/fcom/soc/.

Any suggestions on where to troubleshoot would be welcome.

Thanks!

~Nicholas

Sawood Alam

May 24, 2018, 3:30:26 PM
to openway...@googlegroups.com
Here are my quick findings:

Memento (works):



TimeGate (works):



TimeMap (fails):



Calendar (fails):



Based on this experiment, I suspect an issue in how the binary search is performed on the loaded CDX files when searching for a specific line vs. a range of lines. My guess is that the problem may lie in how the CDX files are sorted. It might be worth sorting them again with the "LC_ALL=C" environment variable set and checking whether any of the CDX files differ from what they should be. Alternatively, I would use the CDX file(s) to manually locate the WARC file(s) that contain a record of this URI, index those WARC files again, and run a test instance of the replay system on just that subset (with as many of the production settings replicated as possible) to see how it behaves.
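
As a rough illustration of the sort check, here is a minimal Python sketch that reports lines of a CDX file that are out of raw byte order, which is the order "LC_ALL=C sort" produces. The function names and the sample index keys are made up for illustration; this is not part of OpenWayback, and it only detects ordering problems, not missing records.

```python
from typing import Iterable, Iterator, List, Tuple


def find_unsorted_lines(
    lines: Iterable[bytes],
) -> Iterator[Tuple[int, bytes, bytes]]:
    """Yield (line_number, previous_line, line) for every line that sorts
    before its predecessor when compared byte by byte."""
    previous = None
    for number, raw in enumerate(lines, start=1):
        # Compare the line content only, without the trailing newline.
        line = raw.rstrip(b"\r\n")
        if previous is not None and line < previous:
            yield number, previous, line
        previous = line


def check_cdx_file(path: str) -> List[Tuple[int, bytes, bytes]]:
    """Read a CDX file in binary mode and return its out-of-order lines."""
    with open(path, "rb") as handle:
        return list(find_unsorted_lines(handle))


if __name__ == "__main__":
    # Hypothetical index keys: in byte order a space (0x20) sorts before "/" (0x2f).
    sample = [
        b"jp,waseda)/fcom/soc/abc 20170628171729\n",
        b"jp,waseda)/fcom/soc 20170628171729\n",
    ]
    for number, before, current in find_unsorted_lines(sample):
        print(f"line {number}: {current!r} sorts before {before!r}")
```

An empty result from check_cdx_file would suggest the file is already in byte order, in which case the sort is probably not the culprit.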

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529



--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openwayback-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/openwayback-dev.
To view this discussion on the web visit https://groups.google.com/d/msgid/openwayback-dev/39c0633c-ed6e-43b9-96c6-b20a12202284%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nicholas Taylor

May 29, 2018, 12:57:27 PM
to openway...@googlegroups.com
Thanks for the troubleshooting tips, Sawood. It does seem like a
content indexing issue.

I double-checked our indexing workflows for the environment variable
being set for the sort to work properly, and it's there. Will try your
other strategies and report back.

~Nicholas

Andy Jackson

May 29, 2018, 5:13:29 PM
to openway...@googlegroups.com
Weirdly this works:


Are you accessing OWB via a proxy server? Is it mucking about with the URLs? If you go directly to the back-end service does that work okay?

E.g., http://OWB:8080/*/https...
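
One way to run that comparison, as a minimal sketch using only the Python standard library: request the same query through the public front end and directly against the back end, then compare the status codes. The helper names are hypothetical, and "http://OWB:8080" is just the placeholder host from above, so substitute the real back-end address.

```python
import urllib.error
import urllib.request
from typing import Dict


def query_url(base: str, target: str, timestamp: str = "*") -> str:
    """Build a Wayback-style URL of the form <base>/<timestamp>/<target>."""
    return f"{base.rstrip('/')}/{timestamp}/{target}"


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code for url, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on error statuses; the code is still what we want.
        return error.code


def compare(proxy_base: str, backend_base: str, target: str) -> Dict[str, int]:
    """Fetch the same capture query via the proxy and via the back end."""
    return {
        "proxy": fetch_status(query_url(proxy_base, target)),
        "backend": fetch_status(query_url(backend_base, target)),
    }


if __name__ == "__main__":
    # "http://OWB:8080" is a placeholder for the internal back-end address.
    print(compare(
        "https://swap.stanford.edu",
        "http://OWB:8080",
        "https://www.waseda.jp/fcom/soc/",
    ))
```

If the two status codes differ, the proxy layer is the more likely suspect; if both fail, the index or the query handling is.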

Cheers,
Andy

--
Sent via a tiny keyboard, so apologies for any tipos.





Andy Jackson

May 29, 2018, 5:36:17 PM
to openway...@googlegroups.com
Oh no this is much weirder….

Note that this works:


And this doesn’t:


i.e., target dates later than the first capture date are not working!

Hm, actually this reminds me of an oddity we came across. Are you using RemoteResourceIndex? If so:


If not, it’s possible that the logic in your BubbleCalendar.jsp has clipped things to the most recent year. You could try comparing it with the current version.

HTH,
Andy
