OpenRefine: Add column by fetching urls returns blank cells

197 views
Skip to first unread message

cul...@couriertimes.com

unread,
Dec 31, 2017, 3:55:24 PM12/31/17
to OpenRefine
I recently purchased the book "Scraping for Journalists" by Paul Bradshaw, the second edition. I was following the chapter on using OpenRefine to extract data and I came into an error. I've attached pictures with my results following along and included an image from the book of what I should see compared to what I actually get. 

Summary: The first thing the book had us to was create a google spreadsheet of a series of URLs and then we created the project in OpenRefine by importing that file. We basically put together two values to create a single URL, and those urls did go to working pages. For instance: http://www.nhs.uk/Services/hospitals/Overview/DefaultView.aspx?id=2205 is the first URL in the series and it loads fine. The next step was to edit the new url column using the "add column by fetching URLs" section. The book instructs to name the column (in this case "HTML") but leave the expression box as just "value" (which is the default setting). We are supposed to click OK, and the result should be a new column of the raw HTML of each link. Instead, I get a column of blank cells. I searched some online discussions and other tutorials to see if there was any answer, but I didn't find anything. It either wasn't the same situation, or I wasn't understanding it properly (fairly new to coding and all of this). I decided to try the next step, which was edit the new HTML column with "add column based on this column" (the same process we used to make the column of URLs in the first place). The image "What the book says I should have.jpg" shows I should already see columns filled with the html code already there and the formula is meant to narrow that down to what we're looking for in the book's example. However, what I have is a column of still empty cells. 

What am I doing wrong?

This message may contain confidential and/or privileged information. If you are not the intended recipient or authorized to receive this for the intended recipient, you must not use, copy, disclose or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by sending a reply e-mail and delete this message. Thank you for your cooperation.
step 1.jpg
step 2.jpg
step 3.jpg
step 4.jpg
step 5.jpg
what I actually have.jpg
what the book says I should have.jpg

Ettore Rizza

unread,
Dec 31, 2017, 5:10:16 PM12/31/17
to OpenRefine
Hi,

The problem is probably caused by a permanent redirection of the URLs. When you click on http://www.nhs.uk/Services/hospitals/Overview/DefaultView.aspx?id=2205, notice that the URL in your browser becomes https://www.nhs.uk/Services/hospitals/Overview/DefaultView.aspx?id=2205 (https with a s). Looks like OpenRefine can not handle this redirection.

The simplest is to replace "http" in your urls columns by "https", either by rebuilding the urls, or by using "transform" on the said column with the Grel formula:

value.replace('http', 'https')

Hope this helps (and happy new year !)

Ettore

cul...@couriertimes.com

unread,
Dec 31, 2017, 8:15:01 PM12/31/17
to OpenRefine
That seems to have done the trick. At least, I'm able to pull up the raw html text that should have happened  the first time. Thank you so much for your help! 
Reply all
Reply to author
Forward
0 new messages