Using zotero's translation server with pdf urls?


seraphinatarrant

Dec 16, 2019, 3:32:48 PM
to zotero-dev
Hi,

I have been in a conversation about PDF metadata retrieval options on the Zotero forum and was advised to come here.
Essentially, Zotero extracts metadata from articles successfully more often than the Wikimedia API does, so I was going to try using Zotero's translation server, which adamsmith advised me to do.
But I'm not sure how to do it with PDF URLs (which is what I have, as a result of scraping). Do any of you know how to do this? Or is there an enumeration of the API endpoint options available to me?

The readme (https://github.com/zotero/translation-server) describes querying a web page via http://127.0.0.1:1969/web or searching by DOI, arXiv ID, etc. via http://127.0.0.1:1969/search. But neither of these accepts a PDF URL. Searching through the code hasn't made it any clearer to me (and I'm not entirely sure what to search for, since grepping for something like "PDF" is too broad).
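
For reference, here is a minimal sketch of calling those two endpoints. It assumes a translation-server instance is running locally on port 1969 and that both endpoints take the URL or identifier as a plain-text POST body, as the curl examples in the readme do. The helper names are made up for illustration.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:1969"  # assumed local translation-server instance


def build_request(endpoint: str, payload: str) -> urllib.request.Request:
    """Build a plain-text POST for /web (page URL) or /search (DOI, arXiv ID, ...)."""
    if endpoint not in ("web", "search"):
        raise ValueError(f"unsupported endpoint: {endpoint}")
    return urllib.request.Request(
        f"{BASE}/{endpoint}",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def translate(endpoint: str, payload: str):
    """Send the request and return the parsed JSON response body."""
    with urllib.request.urlopen(build_request(endpoint, payload), timeout=30) as resp:
        return json.load(resp)
```

Calling `translate("web", page_url)` would then return whatever JSON the server produces for that page; error responses surface as `urllib.error.HTTPError`, so a caller scraping many URLs would probably want to catch that and log the failures.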

For some URLs I can just truncate to the path before the PDF, but this fails 3/4 of the time.

The existing discussion was here (for those interested):

Thanks so much!

Seraphina

Dan Stillman

Dec 16, 2019, 4:01:12 PM
to zoter...@googlegroups.com
On 12/16/19 6:30 AM, seraphinatarrant wrote:
> I have been in a conversation about from pdf metadata retrieval
> options on the zotero forum and was advised to go here.
> Essentially, Zotero has a higher rate of being able to extract
> metadata from articles than the Wikimedia API does, so I was going to
> try to implement using zotero's translation server, which adamsmith
> advised me to do.

No, that's a misunderstanding — we never said translation-server
provides PDF recognition. Zotero's PDF recognition functionality isn't
publicly available, for the reasons I explained [1].

Wikimedia's API is just based on Zotero's translation-server. You should
run your own instance for your own large-scale retrieval, which is why
we don't provide a public API ourselves, but functionally they're
basically the same thing. (Theirs might not be updated as frequently,
and they might resolve things with their own code for historical
reasons.) Neither provides PDF recognition.

Note too that the Zotero Connector can run translation (not recognition)
on some PDF URLs, when the translator is able to detect the URL scheme,
but if I recall correctly that won't work in translation-server either.
(Browsers provide a fake page for PDFs, but that doesn't currently
happen in translation-server. This could probably be implemented.)
translation-server can detect identifiers such as DOIs in URLs and use
those, though.
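
To illustrate the identifier route, here is a rough sketch of pulling a DOI out of a PDF URL on the client side so it can be sent to the /search endpoint. This is not translation-server's own detection logic; the regular expression and the suffix clean-up are assumptions that will miss or mangle some real-world DOIs.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Loose pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# that stops at whitespace or common URL delimiters.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#&\"<>]+")


def doi_from_url(url: str) -> Optional[str]:
    """Return the first DOI-looking substring in a URL, or None."""
    match = DOI_RE.search(unquote(url))
    if match is None:
        return None
    doi = match.group(0)
    # Drop a trailing extension or path segment that PDF links often append.
    doi = re.sub(r"(\.pdf|/pdf|/full|/epdf)$", "", doi, flags=re.IGNORECASE)
    return doi.rstrip("./")
```

A URL with no DOI in it returns `None`, which is the case where some other fallback is needed.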


[1] https://forums.zotero.org/discussion/comment/343399/#Comment_343399

Companjen, B.A.

Dec 17, 2019, 3:29:51 AM
to zoter...@googlegroups.com

Hi,

Going outside of Zotero, you might like https://github.com/kermitt2/grobid, a command-line programme and web service that is quite good at extracting information from (journal article) PDFs. I have been thinking that it would be a great experiment to combine it with Zotero as a first step in retrieving PDF metadata, but I haven't tried any integration yet.

Regards,

Ben
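
A sketch of what sending a PDF to GROBID's web service might look like, using only the Python standard library. It assumes a GROBID instance on localhost:8070 and the `processHeaderDocument` endpoint with an `input` form field, as described in GROBID's service documentation; check both against the version you run. The function names are illustrative.

```python
import uuid
import urllib.request

GROBID = "http://localhost:8070"  # assumed local GROBID service


def header_request(pdf_bytes: bytes, filename: str = "paper.pdf") -> urllib.request.Request:
    """Build a multipart POST carrying the PDF in the 'input' form field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        f"{GROBID}/api/processHeaderDocument",
        data=body,
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "Accept": "application/xml",  # ask for TEI XML
        },
        method="POST",
    )


def extract_header(pdf_bytes: bytes) -> str:
    """Return the service's response body as text."""
    with urllib.request.urlopen(header_request(pdf_bytes), timeout=120) as resp:
        return resp.read().decode("utf-8")
```

The title, authors, or DOI found in the response could then be passed to a search endpoint to look up a full record.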

--
You received this message because you are subscribed to the Google Groups "zotero-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to zotero-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/zotero-dev/eb472e2a-d137-4256-ae9e-0e2abf7a9d5a%40googlegroups.com.

Seraphina Goldfarb-Tarrant

Dec 17, 2019, 5:08:57 AM
to zoter...@googlegroups.com
@Ben, thanks, I'll check that out; it may be better than my heuristics for getting at the actual content! I'm basically doing everything outside of Zotero anyway: I need to get all the PDF data before I can even upload files, since I can't programmatically trigger metadata retrieval.

@Dan apologies for any lack of clarity. To explain:
1) I never intended to ask for PDF recognition (I'm just asking about querying databases for information, which the Zotero UI calls "metadata retrieval", and thus that is the term I used).
2) In the thread we both linked above, adam appears to suggest that I use Zotero's translation server (which is not the same as the recognizer server that we were previously talking about, and which has a public-facing readme etc. and can be spun up quite easily, though not with the target functionality). I am perfectly fine not using it; I was just looking for an easy method if one existed. It sounds like I misunderstood and there is not one.
For the record though, there was a very large difference in success rate between manually triggered metadata retrieval and the Wikimedia API, which seems quite surprising if it is based on Zotero's.
But it sounds like I'm in for making up the difference with some custom code for the 50% that fail.

Seraphina


Marielle Volz

Dec 17, 2019, 6:03:20 AM
to zoter...@googlegroups.com
*Theoretically* anything translation-server can do, the wikimedia API can do, because we're running translation-server in the background. If you don't like the wikimedia format you can request the zotero one directly: https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation
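
As a sketch of requesting the Zotero format, the snippet below builds a request URL for the citation endpoint. The path template is an assumption read off the REST API page linked above, and the User-Agent string is a placeholder to replace with your own contact details.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed path template for the getCitation operation.
CITATION_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/data/citation/{fmt}/{query}"


def citation_url(query: str, fmt: str = "zotero") -> str:
    """Build the request URL; the query is percent-encoded as a single path segment."""
    return CITATION_ENDPOINT.format(fmt=fmt, query=quote(query, safe=""))


def fetch_citation(query: str, fmt: str = "zotero"):
    """Fetch and parse the citation JSON for a URL or identifier."""
    req = urllib.request.Request(
        citation_url(query, fmt),
        headers={"User-Agent": "metadata-test/0.1 (your contact address here)"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Encoding the whole query with `safe=""` matters because the query is itself a URL: any unescaped `/` or `?` would otherwise be read as part of the API path.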

Unfortunately both the wikimedia api and translation-server won't give as good results as you can get with Zotero in the browser. This is because some websites have captchas, log-ins, and other bot restricting measures, which require a human and/or real web browser to bypass. 

Occasionally translation-server will do better locally than our publicly accessible API because of rate limiting and the like: we're high traffic and sometimes get blocked. We try to get unblocked whenever that happens though, so if you find any discrepancies between Zotero and the Wikimedia API, you can report them here and we can try to look into it: https://phabricator.wikimedia.org/project/view/62/. Our goal is to do as well as possible!

It would be great if translation-server could do pdf links for us too! But it's not an easy problem. The corresponding bug for the wikimedia api is here: https://phabricator.wikimedia.org/T136722.  You either have to process the pdf directly which involves totally different infrastructure than HTML processing, or try to figure out an HTML source of metadata working backwards from the URL, which as you've already discovered, is fallible. Dan, maybe there's a way in translation-server to have a translator for a website recognise pdf links and then be able to map backwards to fetch the url that has the metadata in it? If the website follows a predictable pattern, that'd be really cool (but seems like it might involve a lot of work!).
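
To make the "working backwards from the URL" idea concrete, here is a small sketch of a rule table that rewrites a PDF URL to a page more likely to carry metadata. Only the arXiv rule follows a URL scheme that site actually uses; the generic rule simply drops a trailing `.pdf` and is exactly the kind of guess that fails on many sites. The names are hypothetical and nothing here is part of translation-server.

```python
import re
from typing import Optional

# (pattern, replacement) pairs, tried in order.
RULES = [
    # arXiv: /pdf/<id>[.pdf] -> /abs/<id>
    (re.compile(r"^https?://arxiv\.org/pdf/(.+?)(\.pdf)?$"), r"https://arxiv.org/abs/\1"),
    # Blind guess: strip a trailing ".pdf" and hope a landing page lives there.
    (re.compile(r"^(https?://.+)\.pdf$", re.IGNORECASE), r"\1"),
]


def landing_page_for(pdf_url: str) -> Optional[str]:
    """Return a candidate landing-page URL, or None if no rule applies."""
    for pattern, replacement in RULES:
        if pattern.match(pdf_url):
            return pattern.sub(replacement, pdf_url)
    return None
```

Opaque or CDN-style URLs match no rule and return `None`, which is where direct PDF processing would have to take over.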

There are some libraries that can extract metadata from PDFs too, like https://github.com/CeON/CERMINE which I hear is pretty good. They do also run it as a publicly accessible service here: http://cermine.ceon.pl/about.html

If you do work up some custom code, and you'd like to share it, we'd definitely be interested in looking at it / possibly incorporating it! 

Cheers,
Marielle


Dan Stillman

Dec 17, 2019, 2:01:29 PM
to zoter...@googlegroups.com
On 12/17/19 5:08 AM, Seraphina Goldfarb-Tarrant wrote:
> @Dan apologies for any lack of clarity. I was:
> 1) never intending to ask for PDF recognition (I'm just asking for
> querying databases for information, which the Zotero UI calls
> "metadata retrieval" and thus that is the term I used)

What specific UI element are you referring to? Because "Retrieve
Metadata for PDF" is PDF recognition — taking a PDF file, extracting its
content, and retrieving metadata for it. That's the non-public,
multi-step process involving logic on both the client and server.

Saving from the Zotero Connector with a translator (that is, when it
doesn't say "Save to Zotero (PDF)" in the tooltip, which simply kicks off
"Retrieve Metadata for PDF") is essentially equivalent to
translation-server, with some additional restrictions on translation-server:

1) The current inability to run translation on a PDF URL, which I
mentioned in my previous message. (Marielle, I think this is the
"recognise pdf links" functionality you're looking for.) As I say, this
should be fixable in translation-server, and is sometimes offset by its
ability to detect DOIs in the URL. This also only sometimes works in
Zotero — basically, when the URL scheme makes it possible for
translators to figure out where they are and get back to the article
page. It's not possible on CDNs or opaque URL schemes.

2) Captchas and such, as Marielle mentioned

3) JavaScript-rendered pages. This is a big one. When you save from the
browser, the page is already rendered via JS, but translation-server
just gets the HTML that comes over the wire. Fixing this would require
pretty major, security-sensitive changes to translation-server —
basically running a browser. We actually used to do that (as Marielle
remembers) and moved away from it, so we're not eager to try to do it
again, but it does account for the biggest difference between Zotero and
translation-server.

Marielle Volz

Dec 18, 2019, 4:55:26 AM
to zoter...@googlegroups.com
Oh yes, we had apparently discussed this a year ago or so here: 

https://github.com/zotero/translation-server/issues/70

And then I apparently totally forgot about it :).
