Server-relative links not being re-written on BBC sites

andrew.jackson

May 17, 2016, 8:10:12 AM
to openwayback-dev
Hi All,

Can anyone help me understand why the server-relative links on BBC sites (and possibly others) are not being re-written? This is an example (running under an older OpenWayback):

https://www.webarchive.org.uk/act/wayback/20160324120924/http://www.bbc.co.uk/news

Putting aside the css/mobile site issues, many of the links don't work, and they are just simple server-relative ones.

<li class="links-list__item"><a href="/news/world-europe-35888427" class="links-list__link">  

This should surely work fine (and indeed all is fine on the IA Wayback instance), so I assume we've broken server-relative URL re-writing somehow. We do override bits of OpenWayback, so perhaps it's our changes rather than anything else, but I'd appreciate knowing if these links play back okay under 'vanilla' OW 2.3.1.
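
For reference, here is a minimal sketch of the rewrite I would expect for that link. It is not OpenWayback's code: the function name `rewrite_link` and its parameters are made up for illustration, and it assumes the replay URL is simply the replay prefix, the 14-digit timestamp, and the absolute original URL, as in the example URL above.

```python
from urllib.parse import urljoin


def rewrite_link(href, page_url, timestamp, replay_prefix):
    """Resolve href against the captured page's URL, then prefix it for replay.

    Hypothetical helper: the names and the URL layout are assumptions,
    not taken from the OpenWayback source.
    """
    # A server-relative href ("/path") keeps the scheme and host of the page.
    absolute = urljoin(page_url, href)
    return f"{replay_prefix}{timestamp}/{absolute}"


rewritten = rewrite_link(
    "/news/world-europe-35888427",
    "http://www.bbc.co.uk/news",
    "20160324120924",
    "https://www.webarchive.org.uk/act/wayback/",
)
print(rewritten)
```

With the values from the example page, the link should end up pointing at the archived copy of http://www.bbc.co.uk/news/world-europe-35888427 rather than at /news/... on the replay host.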

Does anyone have any pointers on where I should start looking? What configuration properties might affect this?

Thanks for your time.

Best wishes,
Andy Jackson

Lauren Ko

May 17, 2016, 7:32:26 PM
to openway...@googlegroups.com
Hi Andy,
I can confirm that with vanilla OW 2.3.1 the server-relative URLs are not being rewritten in the HTML of http://www.bbc.co.uk/news, nor in the other websites I have just looked at. The pages still replay for me, though, because this block in the default wayback.xml redirects the server-relative links:
  <!--
    Last ditch attempt to resolve server-relative URLs (/page1.htm) that were 
    not successfully rewritten, resolving them against the referring URL to
    get them back on track.
  -->
  <bean name="+" class="org.archive.wayback.webapp.ServerRelativeArchivalRedirect">
    <property name="matchPort" value="${wayback.url.port}" />
    <property name="useCollection" value="true" />
  </bean>

First the non-rewritten links are requested at http://localhost:8080/news/uk-36259746, then ServerRelativeArchivalRedirect redirects to http://localhost:8080/wayback/20160517202225/http://www.bbc.co.uk/news/uk-36259747.
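
To make that fallback concrete, here is a rough sketch of the idea in Python. This is not the actual ServerRelativeArchivalRedirect code: `last_ditch_redirect` and `ARCHIVAL_PATH` are hypothetical names, and the sketch assumes the Referer header carries an archival URL of the form /<context>/<14-digit timestamp>/<original URL>.

```python
import re
from urllib.parse import urljoin, urlsplit

# Assumed archival path layout: optional context path, 14-digit timestamp,
# then the original absolute URL.
ARCHIVAL_PATH = re.compile(
    r"^(?P<prefix>/(?:[^/]+/)*?)(?P<ts>\d{14})/(?P<url>https?://.+)$"
)


def last_ditch_redirect(request_path, referer):
    """Return a replay URL for a stray server-relative request, or None.

    Hypothetical helper illustrating the approach, not OpenWayback's logic.
    """
    parts = urlsplit(referer)
    path = parts.path + ("?" + parts.query if parts.query else "")
    match = ARCHIVAL_PATH.match(path)
    if match is None:
        return None  # The referrer was not a replayed page; nothing to resolve against.
    # Resolve the stray path against the original URL of the referring page.
    target = urljoin(match.group("url"), request_path)
    return (
        f"{parts.scheme}://{parts.netloc}"
        f"{match.group('prefix')}{match.group('ts')}/{target}"
    )


redirect = last_ditch_redirect(
    "/news/uk-36259746",
    "http://localhost:8080/wayback/20160517202225/http://www.bbc.co.uk/news",
)
print(redirect)
```

The point is only that the timestamp and original host are recovered from the referring page, so the browser lands back inside the archive after one extra round trip.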

I have not yet looked into when or how the server-relative URLs stopped being properly rewritten. Thanks for bringing this up.

Lauren Ko

--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openwayback-d...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lauren Ko

May 18, 2016, 2:44:23 PM
to openway...@googlegroups.com
Just noting that server-relative URL rewriting seems to be working OK for other sites in OW 2.3.1. I looked at http://webarchive.library.unt.edu/unt/fall2015/20151202194235/https://www.library.unt.edu/ (OW 2.3.1) and compared it to the raw HTML in the WARC file to confirm that the server-relative URLs are being rewritten. That instance is deployed in a non-ROOT context. I looked at the same content in a locally deployed OW 2.3.1 in the ROOT context, and it is rewritten there too.

andrew.jackson

May 18, 2016, 5:58:17 PM
to openwayback-dev
Thanks again. At first, I thought this was similar to https://github.com/iipc/openwayback/issues/306. But looking more closely, I notice that the Wayback insertion is happening in the midst of a rather confusing pair of conditional comments:

<!--[if lt IE 10]><body class="lt-ie10 nojs"><![endif]-->
<!--[if gt IE 9]><!--><body class="nojs">
<!-- BEGIN WAYBACK TOOLBAR INSERT -->
...
<!-- END WAYBACK TOOLBAR INSERT -->
<!--<![endif]-->

Which leaves me wondering if this is related to the handling of conditional comments in the rewriter. Perhaps adding '!--' and '!--<![endif]--' to the list of allowed header 'tags' would help?! (https://github.com/iipc/openwayback/blob/master/wayback-core/src/main/java/org/archive/wayback/archivalurl/FastArchivalUrlReplayParseEventHandler.java#L82)
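
As a sanity check on how that markup reads to a parser that knows nothing about IE, here is a small sketch using Python's standard html.parser. This is not the OpenWayback parser, which may well behave differently, and the class name `TagLogger` is made up for illustration.

```python
from html.parser import HTMLParser

# The markup from the BBC page, with the toolbar body left out.
SNIPPET = """<!--[if lt IE 10]><body class="lt-ie10 nojs"><![endif]-->
<!--[if gt IE 9]><!--><body class="nojs">
<!-- BEGIN WAYBACK TOOLBAR INSERT -->
<!-- END WAYBACK TOOLBAR INSERT -->
<!--<![endif]-->
"""


class TagLogger(HTMLParser):
    """Record every start tag and comment the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))

    def handle_comment(self, data):
        self.comments.append(data)


logger = TagLogger()
logger.feed(SNIPPET)
logger.close()

print("start tags:", logger.tags)
for comment in logger.comments:
    print("comment:", repr(comment))
```

To this parser the first <body> is hidden inside an ordinary comment and only the second one is a real tag, so everything else in the snippet is just a run of comments. Whether the OpenWayback rewriter sees it the same way is the open question.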

Andy

andrew.jackson

May 18, 2016, 6:30:08 PM
to openwayback-dev
Okay, I managed to reproduce this in a unit test. It seems using <script type="text/html"> is confusing the rewriter. See https://github.com/iipc/openwayback/pull/315

Given this doesn't happen on web.archive.org, I guess they are using a different rewriter?

Andy