"fetching urls" results in blank column


wade

Aug 12, 2014, 10:01:49 AM
to openr...@googlegroups.com
Hoping someone can help, since I'm new to Refine...

I try to add a column by fetching URLs, the column gets added, but all the cells are null.  I've tried quitting OpenRefine and starting it up, pulling in the file from Excel, and adding the column, but get the same results.  The URLs are valid, since I can right click and "open in a new tab", and I have successfully done this operation on other files in the past few days.

Any suggestions would be much appreciated!

Martin Magdinier

Aug 12, 2014, 10:18:56 AM
to openrefine
Hello,

Can you please provide the URL you are trying to fetch and a screenshot of the "Add column by fetching URLs" dialog, so we can double-check the settings you are using?

Martin




--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

wade

Aug 12, 2014, 2:30:31 PM
to openr...@googlegroups.com
Sure -  the screen shot is attached

I also added one of the urls you see in the screen shot to another project in which I had successfully accomplished the  Add column by fetching url  procedure, and all urls were successfully fetched EXCEPT the one I added.  Can a website page do something to cause nothing to be fetched by Refine, but still be opened ok by a browser??
EquineNow Horses for Sale 1 xlsx - Google Refine.htm

Owen Stephens

Aug 12, 2014, 3:46:06 PM
to openr...@googlegroups.com
I can recreate the behaviour locally, and if I use the option 'On error - store error' I get the error:

"HTTP error 200 : OK |"

Which is strange, as an HTTP 200 should mean the request was successful.
That's as far as I've got, but I thought it would be good to confirm that I can recreate the problem.

wade

Aug 12, 2014, 3:49:05 PM
to openr...@googlegroups.com
Is it normal for OpenRefine to have an "error" that indicates an operation was "ok"?

wade

Aug 12, 2014, 3:55:58 PM
to openr...@googlegroups.com
When I did the same (i.e., used the option 'On error - store error') I got "utf-8". I have no idea what this means, and it's interesting that it is different from what Owen got.



Thad Guidry

Aug 12, 2014, 8:44:10 PM
to openr...@googlegroups.com
The fetch operation needs to work with URLs as strings, if I recall.

You need to create a STRING representation of the URL...

Can you try adding .toString() at the end of the GREL expression that you're using? Or, if you're just hand-editing the cell itself, simply wrapping the URL in double quotes in the GREL expression dialog should do it:


instead of





Owen Stephens

Aug 13, 2014, 3:45:27 AM
to openr...@googlegroups.com
That made no difference to me. To do a final check I simply used a quoted URL in the GREL expression box first trying my own webpage, and then trying with one of the URLs from Wade's project. The first worked without a problem, the second failed as previously.

Doing some more digging. It looks like the part of the retrieval that fails includes an attempt to get the character encoding for the web page being retrieved. Wade also mentioned getting the error "utf-8". Retrieving one of the pages with curl gives:

HTTP/1.0 200 OK
Cache-Control: private
Content-Language: en-US
Content-Type: text/html; charset="utf-8"
Date: Wed, 13 Aug 2014 07:00:27 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Server: nginx/1.4.7
Set-Cookie: PHPSESSID=gvql4gtdt71qthmril5tmsfk90; path=/
Set-Cookie: wsid=1407913227; expires=Tue, 13-Aug-2019 07:00:27 GMT
X-Powered-By: EquineNow
X-Cache: MISS from IMP-cache
X-Cache-Lookup: MISS from IMP-cache:3128
Via: 1.0 IMP-cache (squid/3.1.20)
Connection: keep-alive

The relevant line here is:
Content-Type: text/html; charset="utf-8"

If the content encoding isn't set explicitly in the header, Refine tries to extract it from the Content-Type line. My reading of the code is that in this case it would extract "utf-8", including the inverted commas, which shouldn't be there. The line should really look like:

Content-Type: text/html; charset=UTF-8

My guess is that, having got an encoding of "utf-8" (quotes included), the line of code that fails is:
ParsingUtilities.inputStreamToString(is, encoding != null ? encoding : "UTF-8"),null));

If my analysis here is correct, I can't see any easy way of fixing this - this would require a change to OpenRefine code, and to be honest while I think OpenRefine could fail with a better error message here, I'm not sure I'd advocate changing the code to cope with random errors from web servers. 
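
The guess above can be checked outside OpenRefine. What follows is a minimal Java sketch, not OpenRefine's actual code: the class and method names (CharsetCheck, charsetParam, isUsable) are invented for illustration, and the header parsing is a deliberately naive assumption about how a charset might be pulled out of a Content-Type value. It relies only on the standard java.nio.charset.Charset API, which does not accept names containing quote characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.Locale;

final class CharsetCheck {
    // Naive extraction (an assumption, not OpenRefine's code): return whatever
    // follows "charset=" verbatim, quotes included, or null if it is absent.
    static String charsetParam(String contentType) {
        int i = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        if (i < 0) {
            return null;
        }
        String rest = contentType.substring(i + "charset=".length());
        int semi = rest.indexOf(';');
        return (semi >= 0 ? rest.substring(0, semi) : rest).trim();
    }

    // True only if the JVM knows a charset under exactly this name.
    static boolean isUsable(String name) {
        if (name == null) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Raised for names with illegal characters, such as a double quote.
            return false;
        }
    }

    public static void main(String[] args) {
        String quoted = charsetParam("text/html; charset=\"utf-8\"");
        String plain = charsetParam("text/html; charset=utf-8");
        System.out.println(quoted + " usable: " + isUsable(quoted));
        System.out.println(plain + " usable: " + isUsable(plain));
    }
}
```

Run as is, this should report the quoted name as unusable and the bare name as usable, which is consistent with the failure being in the encoding lookup rather than in the HTTP request itself.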

The easiest way to work around this may be to use an intermediary process to mediate the request. I'm thinking of a local script which can retrieve web pages, or possibly an online web proxy to do this. You can then use a GREL expression like:

"http://some_proxy_or_script.com"+value.escape("url")

Personally I'd probably just run something locally to retrieve the content. In theory an open web proxy would work, but I don't have any recommendations in this area - maybe others can suggest a service? Anything that prevents the original Content-Type header getting through would work, I think.
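
One way to "run something locally" is a tiny relay built on the JDK's bundled HTTP server. This is only a sketch under several assumptions: the class name FetchRelay, the port 8080, the /fetch path and the url query parameter are all made up for illustration; it needs Java 11 or later; and it assumes the page bytes really are UTF-8, since it replaces the upstream Content-Type with an unquoted one. It binds to 127.0.0.1 only, because a relay that fetches arbitrary URLs should not be exposed to a network.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class FetchRelay {
    // Pull the target out of a raw query string of the form "url=<percent-encoded URL>".
    static String targetFrom(String rawQuery) {
        if (rawQuery == null || !rawQuery.startsWith("url=")) {
            return null;
        }
        return URLDecoder.decode(rawQuery.substring("url=".length()), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical address and path; change as needed.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 8080), 0);
        server.createContext("/fetch", exchange -> {
            try {
                String target = targetFrom(exchange.getRequestURI().getRawQuery());
                if (target == null) {
                    exchange.sendResponseHeaders(400, -1);
                    return;
                }
                byte[] body;
                try (InputStream in = URI.create(target).toURL().openStream()) {
                    body = in.readAllBytes();
                }
                // Do not pass the upstream header through; send an unquoted charset instead.
                exchange.getResponseHeaders().set("Content-Type", "text/html; charset=UTF-8");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            } catch (Exception e) {
                // Upstream fetch failed or the URL was malformed.
                exchange.sendResponseHeaders(502, -1);
            } finally {
                exchange.close();
            }
        });
        server.start();
    }
}
```

With the relay running, the GREL expression in "Add column by fetching URLs" would become something like "http://127.0.0.1:8080/fetch?url=" + value.escape("url").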

Owen

Thad Guidry

Aug 13, 2014, 11:20:03 AM
to openr...@googlegroups.com
Owen,

That line in the code looks like it sets the encoding to "UTF-8" whenever the encoding is null... and in this case, "utf-8" is not null, so Refine should be explicitly setting the input stream to UTF-8. Correct? But you're not seeing that happen?

Thad Guidry

Aug 13, 2014, 11:20:44 AM
to openr...@googlegroups.com
(whenever the encoding is NOT null) ... sorry, typo.

Owen Stephens

Aug 13, 2014, 11:27:19 AM
to openr...@googlegroups.com
My reading of the code is that it only sets the encoding to UTF-8 if the encoding is null at this point in the code.
As you point out, in this case the encoding won't be null - my reading of the header and code is that the code would have already set the encoding at this point to the value:
"utf-8"

- that is including the " characters. This is not a valid encoding - suggesting that the next step would fail.

But I'm not basing this analysis on logged output; it's just my best guess based on the code, header and behaviour I'm seeing :)

Thad Guidry

Aug 13, 2014, 12:14:32 PM
to openr...@googlegroups.com
Owen,

Try changing line 294 in the file ColumnAdditionByFetchingURLsOperation.java to the following:

is, (encoding == null) || ( encoding.equalsIgnoreCase("\"UTF-8\"")) ? "UTF-8" : encoding),

then do 'ant clean' and then 'ant build' and test it out again.

I think that should solve the issue easily for now.
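
The one-line change above special-cases a quoted UTF-8. HTTP allows media-type parameter values to be sent either bare or as a quoted string, so other servers could send other charsets in quotes. A more general option would be to strip surrounding quotes from whatever name arrives. The sketch below is not the change that went into OpenRefine; EncodingNormalizer and normalize are hypothetical names, and falling back to UTF-8 for unknown names is an assumption made here for simplicity.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingNormalizer {
    // Turn a raw charset parameter into a usable Charset, defaulting to UTF-8.
    static Charset normalize(String raw) {
        if (raw == null) {
            return StandardCharsets.UTF_8;
        }
        String name = raw.trim();
        // Drop one pair of surrounding double quotes, if present.
        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
            name = name.substring(1, name.length() - 1).trim();
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("\"utf-8\""));
        System.out.println(normalize("ISO-8859-1"));
        System.out.println(normalize(null));
    }
}
```

Returning a Charset rather than a String means any later reader construction cannot fail on the name, at the cost of silently treating unknown encodings as UTF-8.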

wade

Aug 13, 2014, 12:15:55 PM
to openr...@googlegroups.com
Another URL (www.equilume.com) has the same charset setting, yet Refine can retrieve the page fine.

Also, when creating a new project in Refine by "using web addresses (URLs)", the equinenow URL gets fetched just fine.



Owen Stephens

Aug 13, 2014, 1:38:26 PM
to openr...@googlegroups.com
On Wednesday, August 13, 2014 5:15:55 PM UTC+1, wade wrote:
Another URL (www.equilume.com) has the same charset setting, yet Refine can retrieve the page fine

I see a different charset setting on http://www.equilume.com:

HTTP/1.1 200 OK
Server: Apache
Cache-Control: no-cache
Content-Type: text/html; charset=utf-8
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Date: Wed, 13 Aug 2014 17:05:37 GMT
Pragma: no-cache
Transfer-Encoding: chunked
Connection: Keep-Alive
Set-Cookie: X-Mapping-kojlelff=1A7AD49B49CA2AC68289AC4F87F39108; path=/
Set-Cookie: 783ffef31462e1e241c10fdf4b973fda=l3ur6hb2dpasrolsv2oddak9g4; path=/

Here the charset is not enclosed in inverted commas and so works. With http://www.equinenow.com/ the charset is enclosed in inverted commas which I believe is what is causing it to fail:

HTTP/1.0 200 OK
Cache-Control: private
Content-Language: en-US
Content-Type: text/html; charset="utf-8"
Date: Wed, 13 Aug 2014 17:38:12 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Server: nginx/1.4.3
Set-Cookie: PHPSESSID=k9b58d7distnkac4h9iln299p7; path=/
Set-Cookie: wsid=1407951492; expires=Tue, 13-Aug-2019 17:38:12 GMT

Owen Stephens

Aug 13, 2014, 2:10:48 PM
to openr...@googlegroups.com
On Wednesday, August 13, 2014 5:14:32 PM UTC+1, Thad Guidry wrote:
Owen,

Try changing line 294 in the file ColumnAdditionByFetchingURLsOperation.java to the following:

is, (encoding == null) || ( encoding.equalsIgnoreCase("\"UTF-8\"")) ? "UTF-8" : encoding),

then do 'ant clean' and then 'ant build' and test it out again.

I think that should solve the issue easily for now.


Unfortunately I don't have a build environment on my machine (and I don't particularly need to solve this problem - just trying to help Wade :)
Java isn't my strong point but that looks right to me as a fix (possibly needs additional set of brackets around the condition?)
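
On the brackets question: in Java, || binds more tightly than the conditional operator ?:, so the condition in the proposed line is already grouped as intended. The small check below compares the expression as written with a fully bracketed version; the class and method names are invented for illustration.

```java
final class TernaryPrecedence {
    // The expression as proposed, with no extra brackets around the condition.
    static String pick(String encoding) {
        return (encoding == null) || (encoding.equalsIgnoreCase("\"UTF-8\"")) ? "UTF-8" : encoding;
    }

    // The same expression with the whole condition bracketed explicitly.
    static String pickBracketed(String encoding) {
        return ((encoding == null) || (encoding.equalsIgnoreCase("\"UTF-8\""))) ? "UTF-8" : encoding;
    }

    public static void main(String[] args) {
        String[] samples = { null, "\"utf-8\"", "ISO-8859-1" };
        for (String s : samples) {
            // Both columns should always agree.
            System.out.println(pick(s) + " / " + pickBracketed(s));
        }
    }
}
```

Extra brackets are therefore optional, though they may make the intent easier to read.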



Owen Stephens

Aug 13, 2014, 3:00:24 PM
to openr...@googlegroups.com
On Wednesday, August 13, 2014 5:15:55 PM UTC+1, wade wrote:
Also, when creating a new project in Refine by "using web addresses (URLs)", the equinenow url gets fetched just fine
 
The code that creates a new project is different from the code used when adding a new column by fetching URLs.
Specifically when creating a new project from a URL, as far as I can see there is no attempt to extract content encoding from the Content-Type. Instead the code relies exclusively on the Content-Encoding field in the HTTP header. This isn't populated for equinenow, so it just ends up not detecting any encoding.
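
The difference between the two code paths can be sketched as two lookup strategies over the same response headers. This is an illustration of the description above, not OpenRefine's source: EncodingLookup and both method names are hypothetical, and the charset parsing is naive (it takes everything after "charset=").

```java
import java.util.Locale;
import java.util.Map;

final class EncodingLookup {
    // Project-import style, as described: trust only the Content-Encoding header.
    static String fromContentEncodingOnly(Map<String, String> headers) {
        return headers.get("Content-Encoding");
    }

    // Fetch-column style, as described: fall back to the charset in Content-Type.
    static String withContentTypeFallback(Map<String, String> headers) {
        String encoding = headers.get("Content-Encoding");
        if (encoding != null) {
            return encoding;
        }
        String contentType = headers.get("Content-Type");
        if (contentType == null) {
            return null;
        }
        int i = contentType.toLowerCase(Locale.ROOT).indexOf("charset=");
        return i < 0 ? null : contentType.substring(i + "charset=".length()).trim();
    }

    public static void main(String[] args) {
        // The relevant header from the equinenow response quoted earlier in the thread.
        Map<String, String> headers = Map.of("Content-Type", "text/html; charset=\"utf-8\"");
        System.out.println(fromContentEncodingOnly(headers));
        System.out.println(withContentTypeFallback(headers));
    }
}
```

With the equinenow header, the first strategy finds nothing and can fall back to a default, while the second picks up the quoted value, which matches the behaviour reported in this thread.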

Thad Guidry

Aug 13, 2014, 10:47:00 PM
to openr...@googlegroups.com
I pushed the FIX to master branch and verified that it works.

Wade, if you install Ant for Windows, you can just type "ant build" to get the latest version of Refine, which should work for scraping your problem website.

Here's the step by step, if you get lost or confused on what to do.  It is really easy however.  Let me know if I can help you further with building or whatever.

Regards,

Thad Guidry

Aug 13, 2014, 10:50:47 PM
to openr...@googlegroups.com

wade

Aug 14, 2014, 10:51:51 AM
to openr...@googlegroups.com
Thanks Thad.  Your help is very much appreciated.



wade

Aug 14, 2014, 3:08:41 PM
to openr...@googlegroups.com
Thad,

I believe I did everything necessary to rebuild the version you modified, and when I try to add a column by fetching urls, I get the following error msg and no column gets added:

Error contacting recon service: timeout: timeout - http://standard-reconcile.freebaseapps.com/reconcile

Any suggestions??



wade

Aug 14, 2014, 3:14:09 PM
to openr...@googlegroups.com
Correction... I get the error message even if I do nothing other than "Create Project" based on the URLs whose pages I'd like to get.



Thad Guidry

Aug 14, 2014, 6:15:59 PM
to openr...@googlegroups.com
That recon service endpoint is dead and never coming back. (It was maintained by Google, which has now deprecated it.)

You can ignore that error, or remove that standard endpoint in the Recon dialog box, which has an option to remove a service as well as add a new one.

Try to remove that old Freebase Recon service in the dialog box.

Reconcile > Start reconciling




Thad Guidry

Aug 14, 2014, 6:17:50 PM
to openr...@googlegroups.com
More info on the Freebase Reconcile deprecation here: https://groups.google.com/forum/#!topic/openrefine/N-O2Wn12S-g