How to access text content of a PDF?

Ben Jaeger

Jul 28, 2024, 3:00:41 AM
to zotero-dev
I'm currently working on a text-to-speech plugin and am trying to access the text content of a PDF, both to speak the text without requiring the user to select it and to highlight words as they are spoken (both are highly requested features).

Using this code seems like it should work; adapting it to Zotero gives:
Zotero.Reader.getByTabID(Zotero_Tabs.selectedID)
    ._internalReader
    ._primaryView
    ._iframeWindow
    .PDFViewerApplication
    .pdfDocument
    .getPage(1).then(pdfPage => {
        pdfPage.getTextContent().then(data => {
            console.log(data);
        });
    });

However, the pdfPage.getTextContent() step returns a rejected promise with the reason "Restricted".

Just wondering a) whether it's possible to access this data and b) how I might do so.

Thanks in advance!

volatile static

Jul 28, 2024, 5:51:52 AM
to zotero-dev

The attachment items have a property `attachmentText`.
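
A minimal sketch of reading that property for the PDF in the current reader tab. It assumes that the reader instance exposes `itemID`, that `Zotero.Items.get()` returns the attachment item, and that `attachmentText` resolves asynchronously. The function name `getOpenAttachmentText` is hypothetical, and `zotero`/`tabs` are passed in as parameters only so the sketch can be run outside Zotero.

```javascript
// Hypothetical helper: `zotero` and `tabs` stand in for the Zotero and
// Zotero_Tabs globals that are available inside the Zotero client.
async function getOpenAttachmentText(zotero, tabs) {
    const reader = zotero.Reader.getByTabID(tabs.selectedID);
    if (!reader) {
        return null; // the selected tab is not a reader tab
    }
    const item = zotero.Items.get(reader.itemID);
    // Assumption: attachmentText is an asynchronous getter. Awaiting a
    // plain string would also work, so the await is safe either way.
    return await item.attachmentText;
}
```

Inside Zotero this would be called as `await getOpenAttachmentText(Zotero, Zotero_Tabs)`. The result is plain text for the whole attachment, so it carries no page positions and is not enough on its own for highlighting.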

Martynas Bagdonas

Jul 29, 2024, 2:58:11 AM
to zotero-dev
There are various limitations when calling functions or moving variables between the reader iframe and the main Zotero client code. Please let us know what exactly you need, and we might add special APIs for plugins.

Note that we use a custom text layer and custom logic to process page text into words, lines, and paragraphs.

The Zotero reader supports PDFs, EPUBs, and snapshots, so any new APIs we introduce will cover these formats as well, if possible.

Can you tell us more about the word highlighting part you mentioned?

Ben Jaeger

Aug 2, 2024, 12:52:14 PM
to zotero-dev
Thanks for your reply! Yeah, I very quickly realised there would be a lot of unforeseen complexity when I started digging into this.

For the text highlighting feature: it's very common for text-to-speech programs to highlight the text as they read it aloud. This helps the user know where the program is in the paragraph and also acts as a general focus/accessibility aid.

Here's a screenshot a user sent me showing how text highlighting works in Edge: https://drive.google.com/file/d/1KQnLCG8TrC9DKEyuGA84an5RFwxXU3TP/view?usp=drivesdk

You can see the whole text being spoken is highlighted in blue, with the specific word the program is on in a darker shade too.

For PDFs at least, I believe something similar should be possible, since pdf.js stores each word (or sometimes a bundle of words) as a separate element in its text layer: https://github.com/mozilla/pdf.js/blob/master/src%2Fdisplay%2Ftext_layer.js#L232 (For EPUBs and snapshots, I have no idea whether equivalent functionality exists, sorry.)

The idea is that whenever a user starts a TTS utterance, the plugin would find the corresponding text and apply some sort of highlighting to help them read along.

Ideally you could tell Zotero to highlight some words and later remove that highlight using some kind of ID it returns.

(The TTS engine I'm using fires an event whenever it moves to a new word, so this would enable highlighting both the whole text and the individual word being spoken.)
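
To illustrate how such word events could drive the highlight, here is a rough sketch that turns a character offset into the span of the word around it. It assumes the engine reports a character index into the spoken string, as the `boundary` event of the Web Speech API does with `charIndex`; the helper name `wordSpanAt` is hypothetical.

```javascript
// Hypothetical helper: return the [start, end) character span of the word
// that contains charIndex, or null if charIndex is out of range or falls
// on whitespace.
function wordSpanAt(text, charIndex) {
    const isSpace = (ch) => /\s/.test(ch);
    if (charIndex < 0 || charIndex >= text.length || isSpace(text[charIndex])) {
        return null;
    }
    let start = charIndex;
    while (start > 0 && !isSpace(text[start - 1])) {
        start--; // walk left to the beginning of the word
    }
    let end = charIndex;
    while (end < text.length && !isSpace(text[end])) {
        end++; // walk right to one past the end of the word
    }
    return { start, end };
}
```

With the Web Speech API, a handler along the lines of `utterance.onboundary = (e) => wordSpanAt(spokenText, e.charIndex)` would give the span to highlight for each word; other engines may report positions differently.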

I haven't looked at how Zotero does it, but I _imagine_ this is fairly similar to how it currently handles annotations (the main difference being that these highlights wouldn't show up in the annotations sidebar). Again, I haven't looked into that at all, so I could be wrong!

Hope that clarifies what I'm looking for! Let me know if you need any more details/mock code/whatever.

Thanks in advance for any help you can give!

Abe Jellinek

Aug 2, 2024, 9:37:11 PM
to zoter...@googlegroups.com
Completely untested, but take a look at how we highlight search results in Settings:


Replace window.docShell in that method with Zotero.Reader.getByTabID(Zotero_Tabs.selectedID)._iframe.docShell and try adding some ranges. You can get the current text selection range as a test case using Zotero.Reader.getByTabID(Zotero_Tabs.selectedID)._iframeWindow.getSelection().getRangeAt(0).

That’s the same system that Firefox uses to highlight find-in-page results, and it might work for your use case too.
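
As a rough, equally untested sketch of that approach: the chain below is modelled on how Firefox's find-in-page code obtains a selection controller from a docShell and adds ranges to its find selection. Whether it matches the Settings code exactly is an assumption. The helper name `addFindHighlight` is hypothetical, and `docShell` and `Ci` (`Components.interfaces` in privileged code) are passed in as parameters.

```javascript
// Hypothetical helper: add `range` to the find-in-page selection of the
// given docShell and return a function that removes it again.
function addFindHighlight(docShell, Ci, range) {
    const controller = docShell
        .QueryInterface(Ci.nsIInterfaceRequestor)
        .getInterface(Ci.nsISelectionDisplay)
        .QueryInterface(Ci.nsISelectionController);
    // SELECTION_FIND is the selection type used for find-in-page results
    const findSelection = controller.getSelection(
        Ci.nsISelectionController.SELECTION_FIND
    );
    findSelection.addRange(range);
    return () => findSelection.removeRange(range);
}
```

Returning a remover function is one simple way to get the "highlight now, clear later" behaviour asked for earlier in the thread without inventing an ID scheme.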

Ben Jaeger

Aug 19, 2024, 8:37:07 AM
to zotero-dev
Thanks for your solution, Abe (and sorry for disappearing; you replied just as I was going on holiday lol!)

I just tried out your suggestion and have had mixed results...

The good news: the selection/range code you gave works! I had to change
Zotero.Reader.getByTabID(Zotero_Tabs.selectedID)._iframeWindow
to
Zotero.Reader.getByTabID(Zotero_Tabs.selectedID)._internalReader._primaryView._iframe.contentWindow
(the first version finds selected text within the UI around the article, e.g. selected comments on an annotation, but not within the article itself).

The mixed/bad news: the adapted selection-highlighting code works perfectly, but only in the first version, where it highlights text in the UI around the article. In the ... ._iframe.contentWindow version, docShell comes up as undefined. I tested this on a PDF and, out of curiosity, on an EPUB, and neither worked. Additionally, using the docShell from the working version together with the updated selection-finding code doesn't work (although, interestingly, it doesn't throw an error either).

(Going back to the original "accessing the text nodes of a PDF" topic: in all this fiddling I realised I can create a TreeWalker on the document the PDF/EPUB/snapshot is in and use that to collect the text nodes I need.)
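
A minimal sketch of that TreeWalker idea, assuming `doc` is the document inside the reader's inner iframe. The helper name `collectTextNodes` is hypothetical, and skipping whitespace-only nodes is a choice made here, not something from the thread.

```javascript
const SHOW_TEXT = 4; // numeric value of NodeFilter.SHOW_TEXT

// Hypothetical helper: collect all non-empty text nodes under `root`
// in document order.
function collectTextNodes(doc, root = doc.body) {
    const walker = doc.createTreeWalker(root, SHOW_TEXT);
    const nodes = [];
    let node;
    while ((node = walker.nextNode())) {
        // Skip nodes that contain only whitespace
        if (node.nodeValue.trim() !== "") {
            nodes.push(node);
        }
    }
    return nodes;
}
```

Joining the `nodeValue` of the returned nodes gives a string to pass to the TTS engine, while the nodes themselves remain available for building ranges later.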

So for both speaking a selection and speaking the whole article, I can find the nodes to be spoken and make ranges from them, but without the docShell I don't know that I can highlight them... I'm not quite sure how to proceed from here...
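
For the "make ranges from them" step, a sketch of mapping character offsets in the spoken string back to a DOM Range might look like this. It assumes the spoken string is the plain concatenation of the text nodes' `nodeValue` with no separators added; the helper name `rangeFromOffsets` is hypothetical.

```javascript
// Hypothetical helper: build a Range covering characters [start, end) of
// the concatenated text of `nodes`. Returns null if the offsets fall
// outside the text.
function rangeFromOffsets(doc, nodes, start, end) {
    const range = doc.createRange();
    let pos = 0; // offset of the current node within the concatenated text
    let startSet = false;
    for (const node of nodes) {
        const len = node.nodeValue.length;
        if (!startSet && start < pos + len) {
            range.setStart(node, start - pos);
            startSet = true;
        }
        if (startSet && end <= pos + len) {
            range.setEnd(node, end - pos);
            return range;
        }
        pos += len;
    }
    return null;
}
```

A range produced this way can span several text nodes, which matters when a word is split across elements in the text layer.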

Thanks again in advance for any help!

Kyana Burhite

Sep 6, 2024, 3:18:22 PM
to zotero-dev
Hi!
I just wanted to chime in and say that I would also like to have this functionality. I would like to mirror the read-aloud feature from Microsoft Edge, where it highlights the words as it reads the PDF to you and does not require you to select the text you want it to read.

In reference to the highlighted text part - is it at all possible to put the box behind or in front of the rendering of the PDF?

Ben Jaeger

Sep 8, 2024, 7:44:12 PM
to zotero-dev
Hi Kyana,

Firstly, happy to announce that ZoTTS can speak the full text of a paper at last! (Sorry this took so long; I've been just slightly too busy to get round to implementing and releasing it for a while.)

As for the highlighting, as far as I see it, the issue is two-fold. First, I'm not really sure _what_ exactly I would need implemented in the APIs; I have some ideas but haven't got round to thinking them through yet. Second, since I come from a world of Python and R, web dev (and therefore Zotero dev) is all very new to me, so as well as not knowing what I need, I'm also not sure what's possible in the first place.

As a solution to both of these, I'm planning to set up a local build of Zotero so I can have a go at rough-drafting what to add and then present it to the professionals for thorough scrutiny. That will help me pin down what I need and save everyone else going back and forth with me while I figure it out lol.

All that is to say, please continue to be patient. You and many others have very clearly expressed how much of a desired feature this is at this point lol; it'll come when it comes :)