Usable position data for text elements in a PDF

Xavier Hocquet

Mar 9, 2021, 5:28:25 PM
to PDF::Reader

Hey there, love the library!

I'm exploring options for getting the position of specific text in a PDF file. For example, given a person's name on a bank statement, I would like the coordinates where that name appears.

By printing the page and page objects in my console, I can see what looks like the right data. For example, structures like this one show up in what I think is the cache:

```
tokens-3d487a0f0011d31879cb978a46d4c268=>["BT", :F3, 9, "Tf", 9, "TL", 122.4, 782.64, "Td", "Ending Balance", "Tj", "ET"]
```

If I'm not mistaken, that 122.4 and 782.64 look like X/Y coordinates for the start of the text.
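To illustrate that reading of the tokens, here is a minimal sketch in plain Ruby that walks an array shaped like the one above and pairs each `Tj` string with the position accumulated from `Td`. The method name `text_positions` is hypothetical and not part of pdf-reader. The sketch assumes the text matrix starts at identity at each `BT`, and it ignores `Tm`, `TD`, `T*`, `TJ`, and the page's transformation matrix, so the numbers are only meaningful for simple content streams like this one.

```ruby
# Hypothetical helper (not part of pdf-reader): walks a flat token array and
# returns one hash per "Tj" operator with the position set by preceding "Td"s.
# Caveat: in this printed form operators and string operands are both plain
# Strings, so a text string that is literally "Td" or "Tj" would be misread.
def text_positions(tokens)
  operands = []   # operands collected since the last operator
  x = y = 0.0     # start of the current text line
  results = []

  tokens.each do |token|
    case token
    when "BT"
      # A new text object resets the position to the origin.
      x = y = 0.0
      operands.clear
    when "Td"
      # Td moves the line start by (tx, ty) relative to the current line start.
      tx, ty = operands.last(2)
      x += tx
      y += ty
      operands.clear
    when "Tj"
      # Tj shows one string at the current position.
      results << { value: operands.last, x: x, y: y }
      operands.clear
    when "Tf", "TL", "ET"
      # Font, leading, and end-of-text carry no position data for this sketch.
      operands.clear
    else
      operands << token
    end
  end

  results
end

tokens = ["BT", :F3, 9, "Tf", 9, "TL", 122.4, 782.64, "Td", "Ending Balance", "Tj", "ET"]
p text_positions(tokens)
```

For the array above this yields a single entry for "Ending Balance" at x 122.4, y 782.64.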

I'm looking for a small example of how to access these structures, given that I am only interested in text objects. Ideally, I would like an array of objects like this:

```
{
  value: "TEXT HERE",
  x: 123,
  y: 789
}
```

Could you please provide a small example of how you would go about this? I have poked around the source for a while but I'll admit I'm a bit stumped due to the abstract nature of it all!

Thank you!

James Healy

Mar 11, 2021, 6:47:36 AM
to pdf-r...@googlegroups.com

Hi Xavier,

Unfortunately there isn't a public API in pdf-reader that can output
text with position annotations. I'd be very open to including it,
however I'm fairly short on time personally.

The core data is all there, and here's a hack that exposes it:
https://gist.github.com/yob/2b1dfbefdafd8833704dfadf24d131e9

To use it you'd have to fork pdf-reader and apply the patch. If you do
so and find it useful, I'd be happy to accept a pull request with a
tidied up version of the change. In particular, it'd be good to avoid
using instance_variable_get and to have some integration specs in
spec/integration_spec.rb.

James
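
For readers who would rather not fork the library, a rougher alternative is a receiver object passed to `PDF::Reader::Page#walk`, which invokes callbacks named after content-stream operators. The sketch below is an assumption-laden illustration and not the approach in the gist: the class name `TextPositionReceiver` and the file name `statement.pdf` are hypothetical, the callback names (`begin_text_object`, `move_text_position`, `show_text`) should be checked against the pdf-reader version in use, and the receiver sees only raw operands. It does not apply `Tm` or the page transformation matrix, and the strings it collects are not decoded through the font's encoding.

```ruby
# Hypothetical receiver: records each shown string with the position
# accumulated from Td operands inside the current text object.
class TextPositionReceiver
  attr_reader :runs

  def initialize
    @runs = []
    @x = @y = 0.0
  end

  # Called for BT: a new text object starts at the origin.
  def begin_text_object
    @x = @y = 0.0
  end

  # Called for Td: offsets are relative to the start of the current line.
  def move_text_position(tx, ty)
    @x += tx
    @y += ty
  end

  # Called for Tj: the string is raw, not decoded via the font's encoding.
  def show_text(string)
    @runs << { value: string, x: @x, y: @y }
  end
end

# Intended usage with pdf-reader (not executed here; "statement.pdf" is a
# placeholder file name):
#
#   require "pdf-reader"
#   receiver = TextPositionReceiver.new
#   PDF::Reader.new("statement.pdf").pages.first.walk(receiver)
#   receiver.runs
```

Because the callbacks are ordinary methods, the receiver can be exercised by hand to check the bookkeeping before pointing it at a real page.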