Text extracted seems to be encoded

Elliot Tison

Nov 9, 2020, 5:12:35 PM
to PDF::Reader
Hello,

I often use this gem with no issues. However, for one client, I cannot extract text from several PDF files.

The text extracted is like this:
\n \u0004 \n \u0005 \t \u0004 \b \a \u0003 \u0001 \u0006 \u0006 \u0005 \u0004 \u0003 \u0002 \u0001\n\n\n.......

Is this a known limitation, as mentioned in your documentation ("due to the way it has been stored, or the use of invalid bytes")?

Many thanks.

Best

elliot

James Healy

Nov 9, 2020, 5:21:13 PM
to pdf-r...@googlegroups.com
Hi Elliot,

It's hard to know without seeing a sample file. There are cases where
pdf-reader can be improved to handle rare approaches to encoding, and
other cases where there's no way to extract the text.
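
In the meantime, a rough way to triage which files are affected is to
measure how much of the extracted text consists of control characters,
as in the sample output quoted above. This is only a sketch: the helper
names and the 0.5 threshold are assumptions made for illustration and
are not part of pdf-reader.

```ruby
# Hypothetical helper (not part of pdf-reader): the fraction of
# non-whitespace characters in `text` that are control characters.
def control_char_ratio(text)
  # Drop ordinary whitespace so layout spacing does not skew the ratio.
  visible = text.gsub(/[ \t\r\n]/, "")
  return 0.0 if visible.empty?

  control = visible.scan(/[[:cntrl:]]/).length
  control.to_f / visible.length
end

# Hypothetical predicate: the threshold is an arbitrary assumption and
# may need tuning for real documents.
def looks_undecodable?(text, threshold: 0.5)
  control_char_ratio(text) > threshold
end

# With pdf-reader installed, each page could be checked roughly like this:
#
#   require "pdf-reader"
#   PDF::Reader.new("sample.pdf").pages.each do |page|
#     puts "page #{page.number}: suspect" if looks_undecodable?(page.text)
#   end
```

A high ratio only tells you that the page text did not decode into
readable characters. It does not say why, so a sample file is still
needed to tell which of the two cases applies.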

If you're able to share a file directly to my address, I'm happy to
take a quick look.

James