That made no difference to me. As a final check I simply used a quoted URL in the GREL expression box, first trying my own webpage and then one of the URLs from Wade's project. The first worked without a problem; the second failed as before.
Doing some more digging, it looks like the part of the retrieval that fails includes an attempt to get the character encoding of the web page being retrieved. Wade also mentioned getting the error "utf-8". Retrieving one of the pages with curl gives:
HTTP/1.0 200 OK
Cache-Control: private
Content-Language: en-US
Content-Type: text/html; charset="utf-8"
Date: Wed, 13 Aug 2014 07:00:27 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Server: nginx/1.4.7
Set-Cookie: PHPSESSID=gvql4gtdt71qthmril5tmsfk90; path=/
Set-Cookie: wsid=1407913227; expires=Tue, 13-Aug-2019 07:00:27 GMT
X-Powered-By: EquineNow
X-Cache: MISS from IMP-cache
X-Cache-Lookup: MISS from IMP-cache:3128
Via: 1.0 IMP-cache (squid/3.1.20)
Connection: keep-alive
The relevant line here is:
Content-Type: text/html; charset="utf-8"
If the content encoding isn't set explicitly in the header, Refine tries to extract it from the Content-Type line. My reading of the code is that in this case it would extract "utf-8" including the inverted commas, which shouldn't be there. The line should really look like:
Content-Type: text/html; charset=UTF-8
My guess is that, having got an encoding of "utf-8" (quotes included), Refine then fails when it tries to use that string as a character encoding. The line of code that fails is:
ParsingUtilities.inputStreamToString(is, encoding != null ? encoding : "UTF-8")
If my analysis here is correct, I can't see any easy way of fixing this: it would require a change to the OpenRefine code. To be honest, while I think OpenRefine could fail with a better error message here, I'm not sure I'd advocate changing the code to cope with random errors from web servers.
The easiest way to work around this may be to use an intermediary process to mediate the request: a local script which can retrieve web pages, or possibly an online web proxy. You can then point the GREL fetch at the intermediary rather than at the original URL.
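As a sketch of what such an intermediary might look like (entirely my own illustration; the port, the ?url= parameter and the helper names are arbitrary choices), a few lines of Python can fetch the page and re-serve it with the quotes stripped out of the charset:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
from urllib.request import urlopen

def clean_content_type(value):
    # Strip quotes from a charset parameter, e.g.
    # 'text/html; charset="utf-8"' -> 'text/html; charset=utf-8'
    cleaned = []
    for part in value.split(";"):
        part = part.strip()
        if part.lower().startswith("charset="):
            part = "charset=" + part[len("charset="):].strip('"\'')
        cleaned.append(part)
    return "; ".join(cleaned)

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect requests of the form /?url=http%3A%2F%2Fexample.com%2Fpage
        target = parse_qs(urlparse(self.path).query).get("url", [None])[0]
        if target is None:
            self.send_error(400, "expected a ?url= parameter")
            return
        with urlopen(target) as resp:
            body = resp.read()
            ctype = resp.headers.get("Content-Type", "text/html")
        # Re-serve the page with a cleaned-up Content-Type header
        self.send_response(200)
        self.send_header("Content-Type", clean_content_type(ctype))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ProxyHandler).serve_forever()
```

With something like this running, the Refine fetch could use a GREL expression along the lines of "http://localhost:8000/?url=" + escape(value, "url"), so that Refine only ever sees the cleaned-up header.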
Personally I'd probably just run something locally to retrieve the content. In theory an open web proxy would work, but I don't have any recommendations in this area; maybe others can suggest a service? Anything that prevents the original Content-Type header getting through to Refine would work, I think.
Owen