Get text selection from PDFTron WebViewer HTML5 Web Client


Support

Aug 27, 2013, 6:07:15 PM
to pdfnet-w...@googlegroups.com

Q:

Hi, I am testing the WebViewer 1.5.0 client (HTML5 version, no Silverlight, no Flash) inside a website to display PDF files (after they have been converted to .xod format). Everything works fine. I would just like to ask whether it is possible, from JavaScript, to get the text that the user selects inside the viewed PDF. I have already successfully tested getting the selection of normal text inside a web page via JavaScript, but since the contents of the PDF are displayed within a <canvas> element, I would like to know if and how that is also possible here.

----

A: It is possible to get the selected text from JavaScript. Assuming that you have a reference to a WebViewer object, you could use code like the following: myWebViewer.getInstance().docViewer.GetSelectedText()


If you're interested in learning about other capabilities of the HTML5 viewer you can browse through the API documentation at http://www.pdftron.com/webviewer/demo/html5/doc/

Mirko Lugano

Aug 30, 2013, 4:55:41 AM
to pdfnet-w...@googlegroups.com
Hello,
sorry for the wrongly placed previous post, I hope this is the appropriate section of the forum. I'll repeat my question.
The text selection using the API you pointed me to works OK, but it returns plain text with no formatting information. I would like to ask whether it is possible to keep formatting and/or text structure information (e.g. bold, italic, paragraphs, etc.) when getting the selected text via JavaScript. I have not found any overloaded GetSelectedText() method or other similar methods in the API. Is what I am looking for possible?
Thanks in advance.
Mirko

Matt Parizeau

Aug 30, 2013, 6:24:59 PM
to pdfnet-w...@googlegroups.com
Hi Mirko,

Yes this is the correct location for WebViewer questions :)
There is no formatting information stored, as different parts of the text are just displayed with different fonts. There is a bit of structural information, but currently there is no API exposing it. Can you let us know what you're trying to achieve so we can figure out whether that information is available, or what information you need?

Matt Parizeau
Software Developer
PDFTron Systems Inc.

Mirko Lugano

Sep 2, 2013, 3:25:51 AM
to pdfnet-w...@googlegroups.com
Hi Matt
thank you for the feedback. I have attached a screenshot of a simple test of mine to show what I mean. As you can see, I have selected some text, a paragraph title, and some other text. At the bottom of the image is the selected text as displayed after I click on the 'select' button. Ideally I would like to keep the structural and formatting information of the selected text, which in this case would mean the bold on the title, the line spacing, and the indentation at the start of the next paragraph. I have also implemented a simple JavaScript string-replace utility method to rejoin words that were split by a line break. In this example, as you can see, the word 'Finanzie- rung' correctly becomes 'Finanzierung'. If there were an API for that too, it would be more than excellent.
Thank you
Mirko
text_selection.png
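
The string-replace utility described in the post above is not shown in the thread. A minimal sketch of what such a helper might look like follows; the function name `joinHyphenatedWords` and the exact rule (a letter, a hyphen, whitespace, then a lowercase letter) are assumptions made for illustration, not the poster's actual code.

```javascript
// Hypothetical helper: rejoins words that were split by a line-break hyphen
// in extracted text, e.g. 'Finanzie- rung' becomes 'Finanzierung'.
function joinHyphenatedWords(text) {
  // Match: a letter, a hyphen, one or more whitespace characters, a lowercase letter.
  // The hyphen and the whitespace are dropped; both letters are kept.
  return text.replace(/(\p{L})-\s+(\p{Ll})/gu, '$1$2');
}

console.log(joinHyphenatedWords('Die Finanzie- rung des Projekts'));
// prints: Die Finanzierung des Projekts
```

Note that this rule is only a heuristic. It leaves ordinary hyphenated words such as 'well-known' alone because no whitespace follows the hyphen, but it would wrongly join German suspended hyphens such as 'Ein- und Ausgang', so real text may need a dictionary check or an exception list.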

Matt Parizeau

Sep 4, 2013, 5:31:47 PM
to pdfnet-w...@googlegroups.com
Hi Mirko,

I've looked into this some more and found that the information required is currently not preserved in the conversion to XOD.  We may add this at some point, but if you're very interested in this feature you can submit a custom engineering request (http://www.pdftron.com/support/professionalservices.html) and we will have someone get back to you regarding the project.

Matt Parizeau
Software Developer
PDFTron Systems Inc.

Mirko Lugano

Sep 5, 2013, 4:08:57 AM
to pdfnet-w...@googlegroups.com
Hi Matt
thanks for your investigation and your advice. We will evaluate how important this is to us and, if it is, do as you say.
Best regards
Mirko

Devanshi Bhagat

May 31, 2017, 12:09:42 PM
to PDFTron WebViewer
Hi,

Using the full text search option in the WebViewer, if I search for the string "DIGITAL TEXT", it finds "DIGITAL TEXT" and selects it in the document.
But when I then call "readerControl.getDocumentViewer().getSelectedText()", it returns "Digital Text" when it should return "DIGITAL TEXT".

Is there any solution for this?

Justin Jung

May 31, 2017, 7:29:12 PM
to PDFTron WebViewer on behalf of Devanshi Bhagat
Hello,

We are having trouble reproducing the issue on our end. Would you be able to provide us with the document that shows this problem?

Justin Jung

Mirko Lugano

Aug 2, 2019, 1:12:32 PM
to PDFTron WebViewer
Hi, I know this is an old post, but I wanted to ask whether there has been any progress or news on the issue that I described above (https://groups.google.com/d/msg/pdfnet-webviewer/SuGVYlYQ9Aw/3WJ9nTvL4UwJ, selecting text while maintaining formatting information such as bold, paragraphs, etc.) in any of the newer releases of WebViewer. I didn't find anything relevant in the API. I am now loading and displaying PDFs directly without converting them to XOD.
Best regards
Mirko

Andy Huang

Aug 6, 2019, 7:39:47 PM
to PDFTron WebViewer
Hi Mirko,

I believe there have been advancements since the original question. The classes you may be interested in are the TextExtractor classes in the PDFNet API. For the latest WebViewer, this means you will need to enable the full API and take a slight performance hit. I have provided some sample code that runs in the main script (you can use a config file as well) to help you get started:

  const iframe = document.querySelector('iframe').contentWindow;
  const PDFNet = iframe.PDFNet;

  instance.docViewer.on('textSelected', async (e, quads, text, pageIndex) => {
    // TODO: Consider using setTimeout and clearTimeout to process last event only
    if (!text || !quads || quads.length === 0) {
      return;
    }

    const doc = instance.docViewer.getDocument();
    const pdfDoc = await doc.getPDFDoc();
    const page = await pdfDoc.getPage(pageIndex + 1);
    const extractor = await PDFNet.TextExtractor.create();

    quads.forEach(async quad => {
      const points = quad.getPoints();
      const topLeft = doc.getPDFCoordinates(pageIndex, points.x4, points.y4);
      const bottomRight = doc.getPDFCoordinates(pageIndex, points.x2, points.y2);

      await extractor.begin(page, new PDFNet.Rect(topLeft.x, topLeft.y - 3, bottomRight.x, bottomRight.y), PDFNet.TextExtractor.ProcessingFlags.e_remove_hidden_text);

      const xml = await extractor.getAsXML(PDFNet.TextExtractor.XMLOutputFlags.e_words_as_elements | PDFNet.TextExtractor.XMLOutputFlags.e_output_style_info | PDFNet.TextExtractor.XMLOutputFlags.e_output_bbox);

      // Process XML OR...

      const line = await extractor.getFirstLine();

      const numWords = await line.getNumWords();

      for (let i = 0; i < numWords; i++) {
        const word = await line.getWord(i);
        const style = await word.getStyle();
        const font = await style.getFontName();

        // Process words
      }
    });
  });


The code provided can be optimized a bit (for example, by using one large quad instead of multiple quads), but it should give you the gist of what is going on. You can also find a few resources on our site that go over text extraction:


I am not entirely sure if you are still trying to achieve the same thing but the TextExtractor should be able to provide the information related to text and style. With regards to the position of the text, the quads already provide that from the event.

Let me know if this helps!

Andy Huang
Software Developer
PDFTron Systems Inc.

Mirko Lugano

Aug 7, 2019, 12:21:07 PM
to PDFTron WebViewer
Hi Andy, thank you for your answer. I have implemented it but I think there are still some quirks or things I don't understand.
First of all I applied a debounce to the textSelected event in order to have it called only once at the end of the selection (so far so good):

readerControl.docViewer.on('textSelected', $.debounce(500, async (e, quads, text, pageIndex) => {

If I select 2 lines (a title and one line, see attachment "selection.png"), I get the correct text (all of it), but if I try to print the XML in the developer tools console, only the XML of the last block gets printed (see attachment "xml.png"). I debugged your code and, even though the number of quads is correct (in this case 2), when the debugger hits this line

await extractor.begin(page, new PDFNet.Rect(topLeft.x, topLeft.y - 3, bottomRight.x, bottomRight.y), PDFNet.TextExtractor.ProcessingFlags.e_remove_hidden_text);

for every quad which is not the last one, it does not go on to the next lines, nor is an error thrown. Therefore only the xml of the last quad is being extracted and printed out.

Apart from this problem, I have another question: I have checked in the debugger the following variables from your code


const word = await line.getWord(i);
const style = await word.getStyle();
const font = await style.getFontName();

but they look like JS objects, and I don't know how to extract HTML-like information from them. Is it possible to do something like that, such as some sort of XML-to-HTML conversion?
I hope I have explained myself well enough :)
Best regards
Mirko
selection.PNG
xml.PNG
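
The debounce mentioned in this post relies on a jQuery plugin (`$.debounce`), while the TODO in the earlier snippet suggests `setTimeout` and `clearTimeout`. As a minimal sketch, assuming no library is available, a plain JavaScript debounce could look like this. The names `debounce`, `onTextSelected` and `calls` are illustrative only, and the handler is a stand-in rather than a real WebViewer callback.

```javascript
// Minimal debounce: runs fn once, waitMs after the most recent call.
function debounce(fn, waitMs) {
  let handle; // timer for the pending invocation, if any
  return function debounced(...args) {
    clearTimeout(handle); // cancel the previously scheduled invocation
    handle = setTimeout(() => fn.apply(this, args), waitMs);
  };
}

// Usage sketch with a stand-in handler that records what it receives.
const calls = [];
const onTextSelected = debounce((text) => calls.push(text), 20);

onTextSelected('Fin');
onTextSelected('Finanz');
onTextSelected('Finanzierung');
// At this point nothing has run yet. Once 20 ms pass without another call,
// the handler runs a single time with the arguments of the last call.
```

A wrapper like this could be passed as the 'textSelected' handler in place of `$.debounce(500, ...)`, so that only the final state of a drag selection is processed.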

Andy Huang

Aug 7, 2019, 4:26:31 PM
to pdfnet-w...@googlegroups.com
Hi Mirko,

Ah, I looked into it and it appears there is a slight misunderstanding with the extractor. The extractor will have to be created on each iteration to capture the text you want. Here is a revised code snippet:

// Timer handle shared across events so that only the last selection is processed.
let previousHandle;

instance.docViewer.on('textSelected', (e, quads, text, pageIndex) => {
  // Restart the timer on every event; the work runs once the selection settles.
  clearTimeout(previousHandle);
  previousHandle = setTimeout(async () => {
    if (!text || !quads || quads.length === 0) {
      return;
    }

    const doc = instance.docViewer.getDocument();
    const pdfDoc = await doc.getPDFDoc();
    const page = await pdfDoc.getPage(pageIndex + 1); // PDFNet page numbers start at 1

    // Handle the quads one at a time so the output follows the selection order.
    for (const quad of quads) {
      const points = quad.getPoints();

      const topLeft = doc.getPDFCoordinates(pageIndex, points.x1, points.y1);
      const bottomRight = doc.getPDFCoordinates(pageIndex, points.x3, points.y3);

      const rect = await PDFNet.Rect.init(topLeft.x, bottomRight.y - 3, bottomRight.x, topLeft.y + 3);

      // Create a fresh extractor for each quad.
      const extractor = await PDFNet.TextExtractor.create();
      await extractor.begin(page, rect, PDFNet.TextExtractor.ProcessingFlags.e_remove_hidden_text);

      const xml = await extractor.getAsXML(PDFNet.TextExtractor.XMLOutputFlags.e_words_as_elements | PDFNet.TextExtractor.XMLOutputFlags.e_output_style_info | PDFNet.TextExtractor.XMLOutputFlags.e_output_bbox);

      // Process XML OR...
      console.log(xml);

      const line = await extractor.getFirstLine();
      const numWords = await line.getNumWords();

      for (let i = 0; i < numWords; i++) {
        const word = await line.getWord(i);
        const style = await word.getStyle();
        const font = await style.getFontName();

        // Process words
        console.log(await word.getString());
      }
    }
  }, 1000);
});


I have also adjusted and reduced the final quad passed to begin(), as some of the text quads intersected each other, causing lines above and below to be included.
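
On why the first snippet only produced output for the last quad: one plausible reading, not confirmed in this thread, is that `quads.forEach(async ...)` starts every callback before any of them finishes, so a single shared extractor would be handed a new region while earlier callbacks are still waiting on it. The sketch below illustrates only the JavaScript scheduling behaviour, without PDFNet; the function name and the `await null` stand-in are assumptions for illustration.

```javascript
// Shows that Array.prototype.forEach does not wait for async callbacks.
function traceForEachWithAsyncCallback(items) {
  const log = [];
  items.forEach(async (item) => {
    log.push(`start ${item}`);
    await null; // stands in for an awaited call such as extractor.begin(...)
    log.push(`end ${item}`);
  });
  log.push('forEach returned');
  // Snapshot taken before any callback has resumed past its await.
  return log.slice();
}

console.log(traceForEachWithAsyncCallback(['quad 1', 'quad 2']));
// logs: [ 'start quad 1', 'start quad 2', 'forEach returned' ]
```

Every callback reaches its first `await` before `forEach` returns, and none has completed by then. Giving each iteration its own extractor, as in the revised snippet, avoids sharing state between those overlapping callbacks.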

With regards to the objects you mentioned, they are the TextExtractor classes: TextExtractorWord and TextExtractorStyle. Technically, the line object is of the TextExtractorLine type. The line will get you the word and word will get you the style of that word. You can read more about them in our API: https://www.pdftron.com/api/web/PDFNet.TextExtractor.html.

I hope that clears things up for you!

EDIT: I forgot to address your question regarding HTML. Those classes will get you values for things of interest such as font name and size. However, the conversion to HTML will have to be done by the developer. The info retrieved should allow you to set them on your own elements.
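
As a rough sketch of that developer-side conversion, the function below turns a list of plain word records into HTML spans. Everything about the record shape is an assumption: `text` and `fontName` are meant to be filled from `word.getString()` and `style.getFontName()` as in the snippet above, `fontSize` is assumed to come from a font-size getter on the style object, and detecting bold or italic from the font name is only a heuristic that will not hold for every font.

```javascript
// Escapes text for safe inclusion in HTML markup.
function escapeHtml(text) {
  const entities = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return text.replace(/[&<>"']/g, (ch) => entities[ch]);
}

// words: hypothetical plain records of the form { text, fontName, fontSize },
// collected beforehand from the extractor's word and style objects.
function wordsToHtml(words) {
  return words
    .map((w) => {
      // Heuristic (assumption): infer weight and slant from the font name.
      const bold = /bold/i.test(w.fontName);
      const italic = /italic|oblique/i.test(w.fontName);
      const style = [
        `font-size:${w.fontSize}pt`,
        bold ? 'font-weight:bold' : '',
        italic ? 'font-style:italic' : '',
      ].filter(Boolean).join(';');
      return `<span style="${style}">${escapeHtml(w.text)}</span>`;
    })
    .join(' ');
}

console.log(wordsToHtml([
  { text: 'Finanzierung', fontName: 'Helvetica-Bold', fontSize: 12 },
  { text: 'A&B', fontName: 'Helvetica', fontSize: 10 },
]));
```

Collecting the values into plain records first keeps the awaited PDFNet calls separate from the string building, which makes the conversion step easy to test on its own.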

Andy Huang
Software Developer
PDFTron Systems Inc.

Mirko Lugano

Aug 13, 2019, 12:42:30 PM
to PDFTron WebViewer
Hi Andy, thanks, it worked and I can see the retrieved info, that's cool. I'll see what we can do with it :)
Best regards
Mirko