check HTTP status code for URLs pointing to PDF files


Felix Lohmeier

Jan 15, 2021, 7:21:19 AM
to OpenRefine
In one of my projects, I am processing metadata from an open access repository. In one column I have URLs that link to PDF files. I know that some of the links are broken, so I want to check the HTTP status code for these URLs.

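The "Add column by fetching URLs" function always seems to download the entire content, so it is unsuitable for PDF files, right?

I found a solution with Jython, which I also documented in the wiki under Recipes / Get HTTP Status code (e.g. link checker) (https://github.com/OpenRefine/OpenRefine/wiki/Recipes#get-http-status-code-eg-link-checker):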

import httplib
import urlparse
url = urlparse.urlparse(value)
conn = httplib.HTTPConnection(url[1])
conn.request("HEAD", url[2])
res = conn.getresponse()
return res.status

But this solution needs several minutes for 1,000 rows. Does anyone have an idea how to solve this better?
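
A side note on the snippet above: it always opens a plain HTTP connection and sends only the path, so https:// links are requested over plain HTTP and any query string is dropped. Below is a minimal sketch of a variant that handles both. The helper names request_parts and head_status and the 10-second timeout are illustrative assumptions, not part of the wiki recipe, and the import fallback is there only so the sketch can also be tried on Python 3 outside OpenRefine.

```python
try:  # Python 2 / Jython, as used by OpenRefine
    import httplib
    from urlparse import urlsplit
except ImportError:  # Python 3, only so the sketch can be tried elsewhere
    import http.client as httplib
    from urllib.parse import urlsplit


def request_parts(url):
    """Split a URL into (scheme, host, target) for a HEAD request."""
    parts = urlsplit(url)
    target = parts.path or "/"  # an empty path is not a valid request target
    if parts.query:
        target += "?" + parts.query
    return parts.scheme.lower(), parts.netloc, target


def head_status(url, timeout=10):
    """Return the status code of a HEAD request, or None on any error."""
    scheme, host, target = request_parts(url)
    if scheme == "https":
        conn_class = httplib.HTTPSConnection
    else:
        conn_class = httplib.HTTPConnection
    conn = None
    try:
        conn = conn_class(host, timeout=timeout)
        conn.request("HEAD", target)
        return conn.getresponse().status
    except Exception:
        # Hypothetical choice: report unreachable hosts as None
        return None
    finally:
        if conn is not None:
            conn.close()
```

In an OpenRefine Jython expression, the definitions should be able to sit at the top of the expression, with `return head_status(value)` as the last line. The timeout mainly keeps a single dead host from stalling the whole column.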

Otherwise I would probably switch to wget or curl as described here (https://stackoverflow.com/questions/6136022/script-to-get-the-http-status-code-of-a-list-of-urls), but it would be very handy to do this within OpenRefine.

Thanks in advance for your help,
Felix

Thad Guidry

Jan 15, 2021, 2:30:36 PM
to openr...@googlegroups.com
Clojure (think Lisp-inspired) is really great for this, because you can directly access any Java classes and methods you might need, in a fairly straightforward manner, much like GREL.

If you want the header fields...

(.getHeaderFields (.openConnection (java.net.URL. value)))

And if you want it as lightweight as possible, just the HTTP status code (a 302 indicates a redirect, most commonly HTTP -> HTTPS):

(.getResponseCode (.openConnection (java.net.URL. value)))

You can replace the first method in those examples (e.g. .getResponseCode) with whatever method you need from these classes:
https://docs.oracle.com/javase/8/docs/api/index.html?java/net/URLConnection.html
https://docs.oracle.com/javase/8/docs/api/index.html?java/net/HttpURLConnection.html

Lots of Java interop is directly available to you in the Expression editor when you choose Clojure as your expression language:
https://clojure.org/reference/java_interop



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/4e299820-e27d-4a0d-9f1e-eb3c9fed590dn%40googlegroups.com.

Antonin Delpeuch (lists)

Jan 16, 2021, 8:57:08 AM
to openr...@googlegroups.com
Hi Felix,

I would be surprised if OpenRefine added any significant overhead
there. I am not sure where it would come from. Perhaps it is
inefficient to send HTTP requests via Jython? Otherwise I don't think
there should be any particular delay between the computation of cell values.

Antonin

On 15/01/2021 13:21, Felix Lohmeier wrote:
> In one of my projects, I am processing metadata from an open access
> repository. In one column I have URLs that link to PDF files. I know
> that some of the links are broken, so I want to check the HTTP status
> code for these URLs.
>
> The Add column by fetching URLs function always seems to download the
> entire content, so it is unsuitable for PDF files, right?
>
> I found a solution with Jython, which I also documented in the wiki:
> Recipes / Get HTTP Status code (e.g. link checker)
> <https://github.com/OpenRefine/OpenRefine/wiki/Recipes#get-http-status-code-eg-link-checker>
>
> import httplib
> import urlparse
> url = urlparse.urlparse(value)
> conn = httplib.HTTPConnection(url[1])
> conn.request("HEAD", url[2])
> res = conn.getresponse()
> return res.status
>
> But this solution needs several minutes for 1,000 rows. Does anyone have
> an idea how to solve this better?
>
> Otherwise I would probably switch to wget or curl as described here
> <https://stackoverflow.com/questions/6136022/script-to-get-the-http-status-code-of-a-list-of-urls>,
> but it would be very handy to do this within OpenRefine.
>
> Thanks in advance for your help,
> Felix

Tom Morris

Jan 16, 2021, 10:35:42 AM
to openr...@googlegroups.com
I'm going to interpret "several minutes for 1,000 rows" as ~200 msec average for your requests, which doesn't sound unreasonable for average latency. The primary way that you can reduce the amount of time for a large batch is by increasing parallelism (assuming that the files are all on different hosts). If you have a bunch of files on the same host, you'd want to organize them so that you check them together, reusing your connection, to save the connection setup time. Particularly for HTTPS connections, that will likely be where the bulk of your overhead comes from.

I haven't used Python's multiprocess/multithread capabilities in Jython, but I don't know of any reason that it wouldn't work.
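
As a rough sketch of the two ideas above (reuse one connection per host, run different hosts in parallel), something along these lines might work. It assumes the URLs are available as a plain list, which is not how a per-cell OpenRefine expression sees the data, so it would more likely run as a standalone script. The names group_by_host, check_host and check_all, the thread count and the timeout are all hypothetical, and the import fallback only lets the sketch run on Python 3 as well.

```python
import threading

try:  # Python 2 / Jython
    import httplib
    from urlparse import urlsplit
except ImportError:  # Python 3
    import http.client as httplib
    from urllib.parse import urlsplit


def group_by_host(urls):
    """Map (scheme, host) -> list of URLs, so each host is visited once."""
    groups = {}
    for url in urls:
        parts = urlsplit(url)
        key = (parts.scheme.lower(), parts.netloc.lower())
        groups.setdefault(key, []).append(url)
    return groups


def check_host(scheme, host, urls, timeout=10):
    """HEAD every URL on one host over a single, reused connection."""
    if scheme == "https":
        conn_class = httplib.HTTPSConnection
    else:
        conn_class = httplib.HTTPConnection
    conn = conn_class(host, timeout=timeout)
    statuses = {}
    for url in urls:
        parts = urlsplit(url)
        target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        try:
            conn.request("HEAD", target)
            response = conn.getresponse()
            response.read()  # finish the response so the connection can be reused
            statuses[url] = response.status
        except Exception:
            statuses[url] = None
            conn.close()  # the next request() reconnects automatically
    conn.close()
    return statuses


def check_all(urls, check=check_host, max_threads=8):
    """Check host groups in parallel threads; returns {url: status or None}."""
    groups = list(group_by_host(urls).items())
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            with lock:  # take the next host group, if any is left
                if not groups:
                    return
                (scheme, host), host_urls = groups.pop()
            try:
                statuses = check(scheme, host, host_urls)
            except Exception:
                statuses = dict((u, None) for u in host_urls)
            with lock:
                results.update(statuses)

    count = min(max_threads, len(groups))
    threads = [threading.Thread(target=worker) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling check_all(list_of_urls) would return a dictionary from URL to status code. Grouping by host means the connection setup cost is paid once per host rather than once per URL, and the threads only help when the URLs are spread over several hosts.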

Tom


Felix Lohmeier

Jan 18, 2021, 6:09:34 PM
to OpenRefine
Tom, Antonin and Thad,

Thanks for your helpful advice! Clojure is actually a bit faster here (188ms) than Jython (277ms) and I really like the one-liner!

I'll have another look to see if I can get it parallelized with Jython. For that I would probably have to create temporary records, which quickly becomes confusing. For such a case it would be handy if there were a variable column.cells[columnName].value similar to row.record.cells[columnName].value.