Extract "lines" from a PDF


john sharp

May 6, 2022, 8:23:25 PM
to pdfium
Sorry if this has been answered elsewhere but if it has it's well hidden!

Basically, I've got a PDF document. On it there is a bunch of text that's surrounded by "lines". I think they're paths: I've done some playing around with the document, and when I delete everything that's marked as a path, the lines disappear. So far, so good.

What I need to do though is actually understand where they are. I've read a bunch of documents, posts, articles and the like and have done something like this:
var d = FPDF_LoadDocument(...)
var p = FPDF_LoadPage(d, 0)
var o = FPDFPage_GetObject(p, 4)   // takes the page handle, not the document
var s = FPDFPath_GetPathSegment(o, 0)
FPDFPathSegment_GetPoint(s, out x, out y)

In the document itself, the path is roughly in the middle of the page, but x and y are close to 0, 0. In fact, many of the results of calls to FPDFPathSegment_GetPoint are in and around 0, 0.

What am I missing?

john sharp

May 7, 2022, 3:10:38 PM
to pdfium
To expand on the comment above: I've got a bunch of paths and can extract data like this:
  0     4  Path  2 -       0 -     27.3/     595 x    584.7/     593 - 8 |       -0.5 x          0
  0     4  Path  0 -       1 -     27.3/     595 x    584.7/     593 - 8 |      554.9 x          0
  0     5  Path  2 -       0 -    582.2/   595.5 x    584.2/   575.6 - 8 |      554.4 x       -0.5
  0     5  Path  0 -       1 -    582.2/   595.5 x    584.2/   575.6 - 8 |      554.4 x       17.4
  0     6  Path  2 -       0 -     27.8/   595.5 x     29.8/   575.6 - 8 |          0 x       17.4
  0     6  Path  0 -       1 -     27.8/   595.5 x     29.8/   575.6 - 8 |          0 x       -0.5
  0     7  Path  2 -       0 -    582.2/   577.8 x    584.2/   561.3 - 8 |      554.4 x       -0.1
  0     7  Path  0 -       1 -    582.2/   577.8 x    584.2/   561.3 - 8 |      554.4 x       14.4
  0     8  Path  2 -       0 -     27.3/   563.4 x    584.7/   561.4 - 8 |      554.9 x       14.3
  0     8  Path  0 -       1 -     27.3/   563.4 x    584.7/   561.4 - 8 |       -0.5 x       14.3
  0     9  Path  2 -       0 -     27.8/   577.8 x     29.8/   561.3 - 8 |          0 x       14.4
  0     9  Path  0 -       1 -     27.8/   577.8 x     29.8/   561.3 - 8 |          0 x       -0.1

The columns are: page number, object number, object type, segment type, segment index, bounding box (left/top x right/bottom), and segment point (x, y).

On the rendered page the above represents a box in about the middle of the page. I can't seem to get my head around how the above numbers relate to what's on the page. 

Miklos Vajna

May 10, 2022, 3:37:25 AM
to john sharp, pdfium
Hi,

On Fri, May 06, 2022 at 05:03:08PM -0700, 'john sharp' via pdfium <pdf...@googlegroups.com> wrote:
> In the document itself, the path is roughly in the middle of the page but x
> and y are close to 0, 0. In fact many of the results of calls
> to FPDFPathSegment_GetPoint are in and around 0, 0.

Perhaps one of the parent objects has a transform? It would be easier
to say if you could attach a short, self-contained sample document.
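
As a rough illustration of what a transform does to the segment points, here is a minimal sketch in C. It assumes the transform is a standard PDF affine matrix with the six fields a to f, which is what FPDFPageObj_GetMatrix() fills into an FS_MATRIX. The Matrix and Point types and apply_matrix() are hypothetical helpers, declared locally so the snippet builds without the PDFium headers.

```c
/* Same field layout as PDFium's FS_MATRIX; redeclared here (hypothetical
   local type) so the sketch compiles without the PDFium headers. */
typedef struct { float a, b, c, d, e, f; } Matrix;

typedef struct { float x, y; } Point;

/* Map a point from the object's own coordinate space to the parent space:
     x' = a*x + c*y + e
     y' = b*x + d*y + f
   In real code the matrix would come from FPDFPageObj_GetMatrix(obj, &m)
   and the point from FPDFPathSegment_GetPoint(segment, &x, &y). */
static Point apply_matrix(Matrix m, Point p) {
    Point out;
    out.x = m.a * p.x + m.c * p.y + m.e;
    out.y = m.b * p.x + m.d * p.y + m.f;
    return out;
}
```

For example, with a made-up matrix {1, 0, 0, -1, 28.8, 594} (a translation plus a vertical flip; the real values have to be read from the document), the point (554.4, -0.5) reported for object 5 maps to about (583.2, 594.5), which lies inside the bounding box listed for that object. Points near 0, 0 can therefore still end up in the middle of the page once the matrix is applied.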

Regards,

Miklos

Miklos Vajna

Jun 19, 2022, 7:55:31 AM
to pdf...@googlegroups.com
Hi John,

On Thu, May 12, 2022 at 08:38:34AM +0100, John Sharp <johnsh...@googlemail.com> wrote:
> In terms of a parent object having modifications that can alter the child
> element location, how do I find the parent object from a child object?

Assuming you visit the objects of a page using FPDFPage_GetObject(), and
then optionally e.g. FPDFFormObj_GetObject(), you can keep track of the
parent of an object in your own code as you visit the object tree; you
don't need any PDFium API for that.
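
A minimal sketch of that bookkeeping, in C: a depth-first walk that carries the accumulated transform down from parent to child. The Node tree is a hypothetical stand-in for the real objects, so the snippet builds without PDFium. In real code the children would come from FPDFPage_CountObjects() / FPDFPage_GetObject() at the top level and FPDFFormObj_CountObjects() / FPDFFormObj_GetObject() inside form objects, with each matrix read via FPDFPageObj_GetMatrix(). It also assumes that a child's matrix is expressed relative to its parent form, which is worth checking against an actual document.

```c
#include <stddef.h>

/* Same field layout as PDFium's FS_MATRIX (hypothetical local copy). */
typedef struct { float a, b, c, d, e, f; } Matrix;

/* Hypothetical stand-in for a page object: a path or a form with children. */
typedef struct Node {
    Matrix matrix;               /* this object's own matrix */
    const struct Node *children; /* child objects, if this is a form */
    size_t child_count;
    int is_path;                 /* non-zero for a path object */
} Node;

/* Combine two transforms: apply `child` first, then `parent`. */
static Matrix concat(Matrix child, Matrix parent) {
    Matrix r;
    r.a = child.a * parent.a + child.b * parent.c;
    r.b = child.a * parent.b + child.b * parent.d;
    r.c = child.c * parent.a + child.d * parent.c;
    r.d = child.c * parent.b + child.d * parent.d;
    r.e = child.e * parent.a + child.f * parent.c + parent.e;
    r.f = child.e * parent.b + child.f * parent.d + parent.f;
    return r;
}

/* Depth-first walk. Stores the object-to-page matrix of each path in out[]
   (up to cap entries) and returns the running count of paths found. */
static size_t collect_paths(const Node *node, Matrix parent_to_page,
                            Matrix *out, size_t cap, size_t count) {
    Matrix to_page = concat(node->matrix, parent_to_page);
    if (node->is_path) {
        if (count < cap) out[count] = to_page;
        return count + 1;
    }
    for (size_t i = 0; i < node->child_count; ++i)
        count = collect_paths(&node->children[i], to_page, out, cap, count);
    return count;
}
```

Start the walk with the identity matrix {1, 0, 0, 1, 0, 0} as parent_to_page and a count of 0. Because the parent's transform is passed down as an argument, no lookup from child to parent is needed.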

Regards,

Miklos

john sharp

Jun 24, 2022, 4:18:18 AM
to pdfium
Hi Miklos, I'm not sure if I said thank you for the help and for sharing your knowledge.

John
