Mapping coordinates to word/page index and vice versa


Matt Parizeau

Aug 5, 2014, 12:50:21 PM
Q:

I'm hoping to get a bit more access to the parsed text representation of the PDF so that I can highlight arbitrary text sentence by sentence. It seems like there must be some ordered representation of the text available, for instance to the search function. It would be great to be able to search through the text, get, say, the third sentence, and calculate the coordinates of the bounding boxes for that text. Basically, I would like to be able to easily map coordinates to a word/page index (for instance page 4, words 20-30) and get that text content as well, and vice versa.

A:

We do have some internal information that is used for searching and selecting that isn't currently exposed.  We may expose functionality in the future similar to what you want, but unfortunately it isn't completely trivial to do so and even with that information you would still have to do some work on your side.

That said, there is a workaround that may work for your situation using the functionality we currently expose. We expose a function that loads all of the text for a page, and you can use it like this:

docViewer.GetDocument().LoadPageText(pageIndex, function(text) {
    console.log("The text: " + text);
});

With the text you would have to parse the sentences and words yourself, but you could then use text search to get the coordinates for a specific word. The tricky part is that you need to know how many times that word (or sentence) occurs on the page before the instance that you want. Once you've searched forward to the correct instance, the result gives you the quads. The code for searching could look something like:

// In ReaderControl.js the viewer is referenced as me.docViewer.
var mode = docViewer.SearchMode.e_page_stop | docViewer.SearchMode.e_highlight;
docViewer.TextSearchInit("myword", mode, false, function(result) {
    if (result.resultCode === Text.ResultCode.e_found) {
        // The quads outline where the matched text sits on the page.
        var quads = result.quads;
        for (var i = 0; i < quads.length; ++i) {
            var myQuad = quads[i].GetPoints();
        }
    }
});
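
The "how many times does the word occur before the instance I want" step can be worked out from the page text alone. Here is a minimal sketch of that bookkeeping. The function names splitSentences and countOccurrencesBefore are made up for illustration and are not part of the viewer API. It assumes sentences end at ".", "!" or "?" (so abbreviations and decimals will be split wrongly) and that the viewer's search matches the same plain, case-sensitive substrings that indexOf does, which you should verify against the search mode you use.

```javascript
// Hypothetical helper: naive sentence splitter. Returns each sentence together
// with its starting character offset in the original page text.
function splitSentences(text) {
    var sentences = [];
    var re = /[^.!?]+[.!?]*\s*/g;
    var match;
    while ((match = re.exec(text)) !== null) {
        var trimmed = match[0].trim();
        if (trimmed.length > 0) {
            // Skip any leading whitespace so start points at the first real character.
            var lead = match[0].length - match[0].replace(/^\s+/, "").length;
            sentences.push({ text: trimmed, start: match.index + lead });
        }
    }
    return sentences;
}

// Hypothetical helper: counts non-overlapping occurrences of pattern that
// start before the given character offset.
function countOccurrencesBefore(text, pattern, offset) {
    if (!pattern) {
        return 0;
    }
    var count = 0;
    var pos = text.indexOf(pattern);
    while (pos !== -1 && pos < offset) {
        count++;
        pos = text.indexOf(pattern, pos + pattern.length);
    }
    return count;
}

// Example: locate "cat" inside the third sentence of some page text.
var pageText = "The cat sat. The cat ran. A dog saw the cat.";
var third = splitSentences(pageText)[2];
var wordOffset = third.start + third.text.indexOf("cat");
var skip = countOccurrencesBefore(pageText, "cat", wordOffset);
```

Here skip is the number of search results to step past before the result whose quads you want.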

To go from coordinates to words you could use the text selection tool to simulate a selection and get the selected text.  Once you have the text you could search through the page and at each result compare the coordinates of the result to the original coordinates you had to see if they overlap.  To simulate the text select tool you could use this code:

var pt1 = { x: 100, y: 100, pageIndex: 0 };
var pt2 = { x: 200, y: 200, pageIndex: 0 };
var textSelectTool = new docViewer.ToolModes.TextSelect(docViewer);
textSelectTool.select(pt1, pt2);

When this completes it will call the callback that is set for text selection (in the upcoming version this will be an event).  In ReaderControl.js you can see me.docViewer.SetTextSelectedCallback which will give you the text that was selected.
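
For the "compare the coordinates to see if they overlap" step, a rough sketch follows. It assumes the object returned by GetPoints() carries the four corners as x1, y1 through x4, y4, and that the quad and your original points are in the same coordinate space; please check both against the version you are running. The helper names are hypothetical.

```javascript
// Hypothetical helper: axis-aligned bounding box of a quad given as
// { x1, y1, x2, y2, x3, y3, x4, y4 } (assumed shape, see note above).
function quadToBox(p) {
    var xs = [p.x1, p.x2, p.x3, p.x4];
    var ys = [p.y1, p.y2, p.y3, p.y4];
    return {
        left: Math.min.apply(null, xs),
        right: Math.max.apply(null, xs),
        top: Math.min.apply(null, ys),
        bottom: Math.max.apply(null, ys)
    };
}

// Hypothetical helper: box spanned by two corner points such as pt1 and pt2.
function boxFromPoints(a, b) {
    return {
        left: Math.min(a.x, b.x),
        right: Math.max(a.x, b.x),
        top: Math.min(a.y, b.y),
        bottom: Math.max(a.y, b.y)
    };
}

// True when the two boxes share some area (touching edges do not count).
function boxesOverlap(a, b) {
    return a.left < b.right && b.left < a.right &&
           a.top < b.bottom && b.top < a.bottom;
}
```

Inside the search callback you would call boxesOverlap(quadToBox(quads[i].GetPoints()), boxFromPoints(pt1, pt2)) and also check that the result is on the same page as the points.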

Overall there are some things to note about searching. If you search for the same pattern several times in a row, each search continues from where the previous one left off. To reset it you need to either change pages or search for a different pattern. A search begins from the current page in the viewer, so make sure that the page you want to search on is the current page in the viewer.
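
Because repeated searches for the same pattern continue from the last hit, stepping to the nth instance is just a matter of searching again until you have skipped enough results. A sketch of that loop is below. To keep it independent of the viewer, it takes a search function that you supply: search(pattern, callback) is a hypothetical wrapper you would write around docViewer.TextSearchInit, and it is assumed to hand the callback an object with a found flag (for example, set from result.resultCode === Text.ResultCode.e_found). Neither the wrapper nor the flag exists in the API.

```javascript
// Hypothetical helper: calls done(result) with the nth hit (0-based) for
// pattern, or done(null) if the search runs out of hits first.
function findNthResult(search, pattern, n, done) {
    var seen = 0;
    function next() {
        search(pattern, function(result) {
            if (!result.found) {
                done(null);
                return;
            }
            if (seen === n) {
                done(result);
                return;
            }
            // Not there yet: searching the same pattern again continues
            // from where the previous search stopped.
            seen++;
            next();
        });
    }
    next();
}
```

Remember that the search state has to be fresh (different pattern or page change) and the target page has to be the current page before you start counting.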

Obviously this is not completely straightforward! You could also do something fancier: on the initial load, search through all the text and construct a map (or some other structure) of words to coordinates, then use that for your lookups instead of searching each time.
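
One possible shape for such a structure is sketched here, assuming you have already collected the text of every page (for example via LoadPageText) into an array. buildWordIndex is a made-up name, the word pattern is deliberately simple, and the quads field starts out empty so you can fill it in from search results later. Note that counting whole words only lines up with the search results if the search itself matches whole words.

```javascript
// Hypothetical helper: maps each lowercased word to the list of places it
// occurs. occurrence is the 0-based count of that word on its page, which is
// the number of search hits to skip to reach it.
function buildWordIndex(pageTexts) {
    var index = new Map();
    pageTexts.forEach(function(text, pageIndex) {
        var countsOnPage = new Map();
        var words = text.match(/[A-Za-z0-9']+/g) || [];
        words.forEach(function(word, wordIndex) {
            var key = word.toLowerCase();
            var occurrence = countsOnPage.get(key) || 0;
            countsOnPage.set(key, occurrence + 1);
            if (!index.has(key)) {
                index.set(key, []);
            }
            index.get(key).push({
                pageIndex: pageIndex,
                wordIndex: wordIndex,
                occurrence: occurrence,
                quads: null // to be filled in from a search result
            });
        });
    });
    return index;
}
```

A lookup such as index.get("cat") then tells you on which pages the word appears and which occurrence on each page you need to search to.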