Problem with Superscripts


Dan

Apr 1, 2017, 8:31:33 AM
to PDF::Reader
I'm trying to scrape this PDF - http://www.equibase.com/static/chart/pdf/FG032517USA.pdf

Look at the columns 1/4, 1/2, Fin... Below them are numbers like 5 followed by a superscript 1/2. This gem reads that as "51/2". Is there some way I can handle this superscripted data and tell which characters are superscripted? Thanks!

James Healy

Apr 1, 2017, 8:37:12 AM
to pdf-r...@googlegroups.com
Hi Dan,

It looks like the superscript characters are rendered as regular
characters but smaller and in an offset position. An alternative for
the PDF author would be to use Unicode superscript characters, but
sadly it seems they haven't.

The standard text extraction in pdf-reader attempts to lay out the
characters as plain text, where unfortunately there's no way to
differentiate size, so I don't think you'll be able to detect these
superscript characters.

It would be possible to build an alternative text extraction algorithm
that examines the size and position of each character to identify
superscript. As a starting point, I'd suggest creating a custom
version of this class: lib/pdf/reader/page_text_receiver.rb
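
To illustrate the size-and-position idea, here is a rough sketch of the
classification step only. It assumes a custom receiver has already
collected, for one line of text, each character's string, x, y and font
size. The Glyph struct, mark_superscripts and superscript? are hypothetical
names for this example and are not part of pdf-reader, and the 0.8 size
ratio is a guess that would need tuning against the real PDF.

```ruby
# Hypothetical glyph record: these fields would be filled in by a custom
# receiver. Nothing here is part of the pdf-reader API.
Glyph = Struct.new(:text, :x, :y, :font_size)

# A glyph counts as superscript when it is noticeably smaller than the body
# text and sits above it. Assumes PDF user space, where y increases upward.
def superscript?(glyph, body, size_ratio)
  glyph.font_size < body.font_size * size_ratio && glyph.y > body.y
end

# Returns the text of one line with superscript runs wrapped as ^{...}.
def mark_superscripts(glyphs, size_ratio: 0.8)
  return "" if glyphs.empty?

  # Assumption: the largest glyph on the line is ordinary body text.
  body = glyphs.max_by(&:font_size)

  glyphs.sort_by(&:x)
        .chunk_while do |a, b|
          superscript?(a, body, size_ratio) == superscript?(b, body, size_ratio)
        end
        .map do |run|
          text = run.map(&:text).join
          superscript?(run.first, body, size_ratio) ? "^{#{text}}" : text
        end
        .join
end

# Made-up coordinates resembling "5" followed by a raised, smaller "1/2".
line = [
  Glyph.new("5", 10, 100, 8),
  Glyph.new("1", 15, 103, 5),
  Glyph.new("/", 18, 103, 5),
  Glyph.new("2", 20, 103, 5)
]
puts mark_superscripts(line)
```

With these made-up values the line comes out as 5^{1/2}, which can then be
told apart from a plain "51/2". Grouping characters into lines and reading
the real sizes and positions out of the PDF is the part that would need
the custom receiver.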

James