Why are TextExtractor.Word bounding boxes different from Elements?


Lee Gillie

Jan 30, 2015, 5:28:45 PM
to pdfne...@googlegroups.com
I am doing some conversion work and studying the information I can get from page text retrieval, both by looking at e_text Elements and by using TextExtractor.Words. What I find is that the bounding boxes differ vertically: Text (e_text) Element bounding boxes sit just a bit higher on the page. The X1 (left) sides of the bounding boxes, however, agree very closely.

When I find conditions that are most easily detected via Text Elements, I want to capture the line, and the words within that line, that contain those text (e_text) Elements. To do that, I create a clipping Rect from Y1 and Y2 of the Text Element bounding box and pass it to the TextExtractor to get the line and words. But because of the vertical shift, this is not really working. If I could understand why the boxes differ, I might be able to calculate a better clipping Rect for the TextExtractor from the e_text Element(s).

Can you please explain?


Ryan

Feb 2, 2015, 4:18:13 PM
to pdfne...@googlegroups.com
The BBoxes returned by the Element class are based on the information in the font file metadata, which might not actually be correct.

TextExtractor does more work and rebuilds more accurate bounding boxes.
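
Since the two sets of boxes come from different sources, an exact Y1/Y2 clip taken from the Element is unlikely to line up with the TextExtractor words. One possible workaround, sketched below, is to pad the clip vertically and then keep only the words whose boxes overlap the Element's vertical band by some fraction. This is only an illustration and does not call PDFNet: the `Box` tuple, the helper names, `pad_ratio`, and `min_overlap` are all hypothetical, and the coordinates are assumed to have been copied out of the Element and Word bounding boxes beforehand, normalized so that y1 < y2.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in page space; assumes y1 < y2 (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def padded_clip(element_box: Box, page_left: float, page_right: float,
                pad_ratio: float = 0.5) -> Box:
    """Build a full-width clip around the element, padded vertically.

    pad_ratio is an assumed tuning knob: the fraction of the element's
    height added above and below to absorb the vertical shift.
    """
    pad = (element_box.y2 - element_box.y1) * pad_ratio
    return Box(page_left, element_box.y1 - pad, page_right, element_box.y2 + pad)


def vertical_overlap(a: Box, b: Box) -> float:
    """Fraction of the shorter box's height that the two boxes share vertically."""
    shorter = min(a.y2 - a.y1, b.y2 - b.y1)
    if shorter <= 0:
        return 0.0
    shared = min(a.y2, b.y2) - max(a.y1, b.y1)
    return max(0.0, shared) / shorter


def words_on_same_line(element_box: Box, word_boxes: List[Box],
                       min_overlap: float = 0.5) -> List[Box]:
    """Keep the word boxes that sit on the same line as the element box."""
    return [w for w in word_boxes
            if vertical_overlap(element_box, w) >= min_overlap]


if __name__ == "__main__":
    element = Box(72, 702, 200, 714)      # e_text box, sitting slightly high
    words = [
        Box(72, 700, 100, 711),           # same line, shifted down a little
        Box(105, 700, 150, 711),          # same line
        Box(72, 686, 120, 697),           # next line down
    ]
    print(padded_clip(element, 0, 612))
    print(words_on_same_line(element, words))
```

The padded clip can pull in a sliver of a neighbouring line, which is why the overlap filter runs afterwards. Both thresholds would need tuning against real documents.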
