Fetch from URL not working

108 views
Skip to first unread message

ThomasVee

unread,
Sep 3, 2018, 5:41:22 AM9/3/18
to OpenRefine
Hi,
I had setup a workflow in openrefine, in which i paste the title and url of articles in a journal and fetch the html, and extract details such as abstract, attribution etc from there. upI was then exporting this for further processing. I have been using this for several months now but it apprears to be failing now at the "Create column FetchShowPage at index 4 by fetching URLs based on column Column 4 using expression grel:value" . Below is the code I was using. Can someone please point me to what might be wrong? Thanks in advance.
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Column 3 using expression grel:value.replace(\"-- \",\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "Column 3",
    "expression": "grel:value.replace(\"-- \",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column Column 4 using expression grel:value.replace(\"DOI:\",\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "Column 4",
    "expression": "grel:value.replace(\"DOI:\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-addition-by-fetching-urls",
    "description": "Create column FetchShowPage at index 4 by fetching URLs based on column Column 4 using expression grel:value",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "FetchShowPage",
    "columnInsertIndex": 4,
    "baseColumnName": "Column 4",
    "urlExpression": "grel:value",
    "onError": "set-to-blank",
    "delay": 100,
    "cacheResponses": false,
    "httpHeadersJson": null
  },
  {
    "op": "core/column-addition",
    "description": "Create column keywords at index 5 based on column FetchShowPage using expression grel:value.parseHtml().select(\"div#articleSubject\")[0].toString()",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "keywords",
    "columnInsertIndex": 5,
    "baseColumnName": "FetchShowPage",
    "expression": "grel:value.parseHtml().select(\"div#articleSubject\")[0].toString()",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column keywords using expression grel:replace(value,/<\\/?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)\\/?>/,'')",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "keywords",
    "expression": "grel:replace(value,/<\\/?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)\\/?>/,'')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column keywords using expression grel:value.replace(\"Keywords\",\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "keywords",
    "expression": "grel:value.replace(\"Keywords\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column keywords using expression grel:value.replace(\";\",\",\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "keywords",
    "expression": "grel:value.replace(\";\",\",\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-addition",
    "description": "Create column abstract at index 5 based on column FetchShowPage using expression grel:value.parseHtml().select(\"div#articleAbstract\")[0].toString()",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "abstract",
    "columnInsertIndex": 5,
    "baseColumnName": "FetchShowPage",
    "expression": "grel:value.parseHtml().select(\"div#articleAbstract\")[0].toString()",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column abstract using expression grel:replace(value,/<\\/?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)\\/?>/,'')",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "abstract",
    "expression": "grel:replace(value,/<\\/?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)\\/?>/,'')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column abstract using expression grel:value.replace(\"Abstract\",\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "abstract",
    "expression": "grel:value.replace(\"Abstract\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column abstract using expression grel:value.replace(\"&nbsp;\",\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "abstract",
    "expression": "grel:value.replace(\"&nbsp;\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-addition",
    "description": "Create column PDFlink at index 5 based on column FetchShowPage using expression grel:value.parseHtml().select(\"div#articleFullText\")[0].select(\"a[href]\")[0].toString()",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "PDFlink",
    "columnInsertIndex": 5,
    "baseColumnName": "FetchShowPage",
    "expression": "grel:value.parseHtml().select(\"div#articleFullText\")[0].select(\"a[href]\")[0].toString()",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PDFlink using expression grel:value.replace(\"<a href=\",\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "PDFlink",
    "expression": "grel:value.replace(\"<a href=\",\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PDFlink using expression grel:value.replace('\"',\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "PDFlink",
    "expression": "grel:value.replace('\"',\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PDFlink using expression grel:value.replace('class=file target=_parent>PDF/A</a>',\"\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "PDFlink",
    "expression": "grel:value.replace('class=file target=_parent>PDF/A</a>',\"\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column PDFlink using expression grel:value.replace(\"view\",\"viewFile\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "PDFlink",
    "expression": "grel:value.replace(\"view\",\"viewFile\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-addition",
    "description": "Create column vol issue no at index 5 based on column FetchShowPage using expression grel:value.parseHtml().select(\"div#breadcrumb\")[0].select(\"a[href]\")[1].toString()",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "vol issue no",
    "columnInsertIndex": 5,
    "baseColumnName": "FetchShowPage",
    "expression": "grel:value.parseHtml().select(\"div#breadcrumb\")[0].select(\"a[href]\")[1].toString()",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column vol issue no using expression grel:replace(value,/<\\/?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)\\/?>/,'')",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "vol issue no",
    "expression": "grel:replace(value,/<\\/?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)\\/?>/,'')",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-split",
    "description": "Split column Column 3 by separator",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "Column 3",
    "guessCellType": true,
    "removeOriginalColumn": true,
    "mode": "separator",
    "separator": "Pp.",
    "regex": false,
    "maxColumns": 0
  },
  {
    "op": "core/column-rename",
    "description": "Rename column Column 3 2 to page nos",
    "oldColumnName": "Column 3 2",
    "newColumnName": "page nos"
  },
  {
    "op": "core/column-addition",
    "description": "Create column attribution at index 3 based on column Column 3 1 using expression grel:value + \"Journal of Threatened Taxa, \" + cells[\"vol issue no\"].value+\"; pp. \" +cells[\"page nos\"].value+ cells[\"Column 4\"].value",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "newColumnName": "attribution",
    "columnInsertIndex": 3,
    "baseColumnName": "Column 3 1",
    "expression": "grel:value + \"Journal of Threatened Taxa, \" + cells[\"vol issue no\"].value+\"; pp. \" +cells[\"page nos\"].value+ cells[\"Column 4\"].value",
    "onError": "set-to-blank"
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column abstract using expression grel:value.replace(\"&amp;\",\"&\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "abstract",
    "expression": "grel:value.replace(\"&amp;\",\"&\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column keywords using expression grel:value.replace(\"&amp;\",\"&\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "keywords",
    "expression": "grel:value.replace(\"&amp;\",\"&\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column keywords using expression grel:value.replace(\"&amp,\",\"&\")",
    "engineConfig": {
      "mode": "row-based",
      "facets": []
    },
    "columnName": "keywords",
    "expression": "grel:value.replace(\"&amp,\",\"&\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]

Antonin Delpeuch (lists)

unread,
Sep 3, 2018, 6:16:30 AM9/3/18
to openr...@googlegroups.com
If your workflow used to work but now fails, here are a few ideas:
- have you updated any software when the change happened?
- if you are doing a large number of requests, you might have been
blocked on that website. A delay of 100 ms (which you have set) means
that you will do around 10 requests per second, which is probably too
fast for the source. You might have been blocked on the basis of your
user-agent, so the block might not be visible if you use your browser to
access the source.

Antonin
> --
> You received this message because you are subscribed to the Google
> Groups "OpenRefine" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to openrefine+...@googlegroups.com
> <mailto:openrefine+...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Owen Stephens

unread,
Sep 3, 2018, 6:21:25 AM9/3/18
to OpenRefine
Try doing the workflow up to (not including) "Create column FetchShowPage at index 4 by fetching URLs based on column Column 4 using expression grel:value"

Then create the fetch step manually with the same parameters but set the "On error" option to "store error"

This should show any error found during the fetch process - which should help uncover what is going wrong at this step

Owen

ThomasVee

unread,
Sep 3, 2018, 6:21:46 AM9/3/18
to OpenRefine
Thanks for the reply Antonin . I increased the delay to 2000 but it still doesnt work.
Is there a way to change the user-agent in openrefine?

ThomasVee

unread,
Sep 3, 2018, 6:26:30 AM9/3/18
to OpenRefine
Thanks Owen. tried that but I see only blank cells.

Owen Stephens

unread,
Sep 3, 2018, 6:27:17 AM9/3/18
to openr...@googlegroups.com
Can you share an example of the URLs you are requesting?


Owen Stephens
Owen Stephens Consulting

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

ThomasVee

unread,
Sep 3, 2018, 7:14:32 AM9/3/18
to OpenRefine

On Monday, 3 September 2018 15:57:17 UTC+5:30, Owen Stephens wrote:
Can you share an example of the URLs you are requesting?

Owen Stephens
Owen Stephens Consulting

Ettore Rizza

unread,
Sep 3, 2018, 8:30:32 AM9/3/18
to OpenRefine

Owen Stephens

unread,
Sep 3, 2018, 8:57:46 AM9/3/18
to OpenRefine
My guess is the target service (i.e. threatenedtaxa.org) has made some changes - although it's hard to be sure.

I think rather than the number of redirects the problem is the redirect from http -> https - the code that OpenRefine uses (not written by us, but a third-party library) will not follow redirects that go http->https or https->http. See https://github.com/OpenRefine/OpenRefine/issues/1410 for more details - but essentially we need to make some changes to support redirect http->https (which should always be safe) while not support https->http (which would take you from a secure to insecure connection which would be a 'bad thing' tm )

I've not managed to do a full check, but if you request https://doi.org/10.11609/jott.3948.10.8.11999-12007 rather than http://doi.org/10.11609/jott.3948.10.8.11999-12007 you can get back an HTML snippet which contains the URL of the redirect (I'm not entirely sure why this works, but it seems to!)

You can extract this and convert to an https version with 

value.parseHtml().select(\"a\")[0].htmlAttr("href").replace("http","https")

And then do a further 'fetch' which will get you the content of the webpage I think

Owen

ThomasVee

unread,
Sep 3, 2018, 9:29:27 AM9/3/18
to OpenRefine
It would be great is automatic redirect to https happens . For now I`m trying your suggestion but t says Parsing error at offset 25: Missing number, string, identifier, regex, or parenthesized expression. 

Owen Stephens

unread,
Sep 3, 2018, 11:53:57 AM9/3/18
to OpenRefine
Can you share some more information?

When you get that error - What stage of the process is that at, what is the starting point for the data etc. etc.

Also I wonder if all you need is the article metadata you might want to try using the Crossref API instead of scraping it from the journal page.

E.g. a request like


Should retrieve the metadata as JSON

Owen

ThomasVee

unread,
Sep 4, 2018, 2:11:36 AM9/4/18
to OpenRefine

Thanks owen. I tried fetching via the crossref API but it returns this error: HTTP error 401 : Unauthorized | <html><body><h1>401 Unauthorized</h1> You need a valid user and password to access this content. </body></html>

Owen Stephens

unread,
Sep 4, 2018, 4:00:28 AM9/4/18
to OpenRefine
Hi Thomas,

Sorry I've not been able to help so far. It looks like that CrossRef API might need registration + authorisation to work.

I've just tested the following with https://doi.org/10.11609/jott.3948.10.8.11999-12007 and it works for me. You will need to use OpenRefine v3 rc1 (download from http://openrefine.org/download.html) to use this approach.

Use "Add column by fetching URLs"
Where it says "HTTP headers to be used when fetching URLs" click the 'Show' link
In the box labelled "Accept:" replace the existing text (*/*) with 
application/unixref+xml

This using the DOI URL I quote above, this fetches me a nice XML representation of the metadata about the article which I can parse

Owen

ThomasVee

unread,
Sep 4, 2018, 4:47:57 AM9/4/18
to OpenRefine
Hi Owen,
Thanks again. I tried that but still all I get is 
<html><head><title>Handle Redirect</title></head>
<body><a href="http://threatenedtaxa.org/index.php/JoTT/article/view/3948">http://threatenedtaxa.org/index.php/JoTT/article/view/3948</a></body></html>
However, when I manually click on the url  and go there on my browser - https://threatenedtaxa.org/index.php/JoTT/article/view/4039 - it shows 

DB Error: Got error 28 from storage engine


Perhaps there is some issue with the server?

ThomasVee

unread,
Sep 5, 2018, 5:24:17 AM9/5/18
to OpenRefine
Hi Owen,
I tried that again today and it worked. Thank you so much for your help.

Regards,

Thomas

Owen Stephens

unread,
Sep 5, 2018, 9:26:52 AM9/5/18
to OpenRefine
Great news! Glad it is working

Best wishes

Owen
Reply all
Reply to author
Forward
0 new messages