Convert raw text to utf-8


Damian Martinelli

May 18, 2017, 7:07:34 AM
to PDF::Reader
I'm getting the raw_content from a PDF page, and I want to convert part of it to UTF-8 (the way I see it when I use the text method on the page object).

I looked at the code to see what the text method on the Page class does.

I tried the code from the Parser class that uses MAPPING, but I'm not getting the desired text.

For example, from the raw_content I get "Luj\\341n"

str = "Luj\\341n"

str.gsub!(/\\([nrtbf()\\\n]|\d{1,3})?|\r\n?|\n\r/m) do |match|
  MAPPING[match] || ""
end

And I get "Luj\xE1n"

\xE1 should be an 'á'.

0x00E1 is 'á' in UTF-16.

What am I missing to get str = "Luján"?

Can anyone help me?

Thanks!!

James Healy

Jun 24, 2017, 11:01:10 AM
to pdf-r...@googlegroups.com
Unfortunately using the raw_content method is unlikely to help with
the vast majority of PDFs.
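
For the specific string in the question, the byte \xE1 is 'á' in single-byte
encodings such as Windows-1252 and ISO-8859-1, so the result only needs to be
relabelled and transcoded. The following is a minimal sketch, assuming the
font behind that text uses a WinAnsi-like single-byte encoding. That
assumption fails for fonts with custom encodings or ToUnicode maps, which is
why this approach does not generalise. The names `unescape_pdf_literal` and
`pdf_literal_to_utf8` are hypothetical helpers, not part of pdf-reader.

```ruby
# Simple backslash escapes allowed inside a PDF literal string.
ESCAPES = {
  "n" => "\n", "r" => "\r", "t" => "\t", "b" => "\b", "f" => "\f",
  "(" => "(", ")" => ")", "\\" => "\\"
}.freeze

# Hypothetical helper: turn escapes such as \341 (octal) into raw bytes.
# Returns a binary (ASCII-8BIT) string.
def unescape_pdf_literal(raw)
  raw.b.gsub(/\\([0-7]{1,3}|[nrtbf()\\])/n) do
    code = Regexp.last_match(1)
    if code.match?(/\A[0-7]/)
      (code.to_i(8) & 0xFF).chr   # octal escape -> single byte
    else
      ESCAPES[code]
    end
  end
end

# Hypothetical helper: ASSUMES the bytes are Windows-1252 unless told otherwise.
def pdf_literal_to_utf8(raw, source_encoding = "Windows-1252")
  unescape_pdf_literal(raw).force_encoding(source_encoding).encode("UTF-8")
end

# Single quotes keep the backslash literal, as it appears in raw_content.
puts pdf_literal_to_utf8('Luj\341n')
```

The key step is `force_encoding` followed by `encode`: the first only labels
the bytes with the encoding they are assumed to be in, and the second
performs the actual conversion to UTF-8.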

Are you trying to extract text from specific sections of a page,
rather than the entire page?

pdf-reader doesn't support that out of the box. If you want to explore
writing custom code to do it, the `PDF::Reader::PageTextReceiver`
class is 80% of the way there. The missing piece is providing a way to
specify the bounding box you're interested in and ignoring any
characters outside it.
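
The missing filtering step could look roughly like the sketch below. It does
not use pdf-reader at all: `PositionedChar` and `BoundingBox` are hypothetical
stand-ins for whatever positioned text objects a receiver collects, and the
coordinates are assumed to be PDF points with the origin at the bottom-left of
the page.

```ruby
# Hypothetical stand-in for a character with a page position.
# This is NOT a pdf-reader class.
PositionedChar = Struct.new(:text, :x, :y)

# Hypothetical bounding box; edges are inclusive.
BoundingBox = Struct.new(:left, :bottom, :right, :top) do
  def cover?(char)
    char.x.between?(left, right) && char.y.between?(bottom, top)
  end
end

# Keep only characters inside the box, then order them for reading:
# higher y first (top of page), then increasing x (left to right).
def text_inside(chars, box)
  chars.select { |c| box.cover?(c) }
       .sort_by { |c| [-c.y, c.x] }
       .map(&:text)
       .join
end

chars = [
  PositionedChar.new("B", 20, 700),
  PositionedChar.new("A", 10, 700),
  PositionedChar.new("C", 300, 100) # outside the box below
]
box = BoundingBox.new(0, 650, 100, 750)
puts text_inside(chars, box)
```

A real implementation would also need to group characters into lines and
insert spaces, which this sketch leaves out.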

James