In one of my projects, I am processing metadata from an open access repository. In one column I have URLs that link to PDF files. I know that some of the links are broken, so I want to check the HTTP status code for these URLs.
The "Add column by fetching URLs" function always seems to download the full response body, which makes it unsuitable for PDF files, right?
I found a solution with Jython, which I also documented in the wiki:
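import httplib
import urlparse
url = urlparse.urlparse(value)
# Pick the connection class by scheme, so https links are not requested over plain HTTP.
Conn = httplib.HTTPSConnection if url.scheme == "https" else httplib.HTTPConnection
# The timeout keeps an unresponsive host from stalling the row indefinitely.
conn = Conn(url.netloc, timeout=10)
# HEAD transfers only the headers; keep the query string and send "/" for an empty path.
path = (url.path or "/") + ("?" + url.query if url.query else "")
conn.request("HEAD", path)
status = conn.getresponse().status
conn.close()
return status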
However, this solution takes several minutes for 1,000 rows. Does anyone know a faster way to do this?
Otherwise I would probably switch to wget or curl as described here, but it would be very handy to do this within OpenRefine.
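For comparison, a fallback outside OpenRefine would not have to be wget or curl. Here is a minimal sketch in standalone Python 3 (not Jython, so it cannot be pasted into an OpenRefine expression) that sends the HEAD requests in parallel. The function names head_status and check_all, the 10-second timeout and the pool of 20 workers are my own assumptions, not anything OpenRefine provides:

```python
import http.client
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if the request fails."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_class = http.client.HTTPSConnection
    else:
        conn_class = http.client.HTTPConnection
    # Keep the query string and send "/" when the URL has no path.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = None
    try:
        conn = conn_class(parts.netloc, timeout=timeout)
        conn.request("HEAD", path)
        return conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # DNS failures, refused connections, timeouts and malformed responses.
        return None
    finally:
        if conn is not None:
            conn.close()


def check_all(urls, workers=20):
    """Check many URLs concurrently; returns a dict mapping URL to status."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(head_status, urls)))
```

The URL column would have to be exported first and the resulting statuses imported back as a new column, which is exactly the round trip I would like to avoid.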
Thanks in advance for your help,
Felix