Problem with Superscripts

Apr 1, 2017, 8:31:33 AM
to PDF::Reader
I'm trying to scrape this PDF -

Look at the columns 1/4, 1/2, Fin... Below them are numbers like 5 followed by a superscript 1/2. This gem reads that as "51/2". Is there some way I can handle this superscripted data and tell which characters are superscripted? Thanks!

James Healy

Apr 1, 2017, 8:37:12 AM
Hi Dan,

It looks like the superscript characters are rendered as regular
characters but smaller and in an offset position. An alternative for
the PDF author would be to use Unicode superscript characters, but
sadly it seems they haven't.

The standard text extraction in pdf-reader attempts to lay out the
characters as plain text, where unfortunately there's no way to
differentiate size, so I don't think you'll be able to detect these
superscript characters.

It would be possible to build an alternative text extraction algorithm
that examines the size and position of each character to identify
superscripts. As a starting point, I'd suggest creating a custom
version of this class: lib/pdf/reader/page_text_receiver.rb
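
As a rough sketch of the size-and-position idea, assuming a custom
receiver has already collected each piece of text on a line together
with its x/y position and font size: the Glyph struct, the method
names and the two thresholds below are all hypothetical and are not
part of pdf-reader. Text that is clearly smaller than the line's
dominant font size and sits above the baseline is treated as
superscript and wrapped in ^( ).

```ruby
# Hypothetical record for one piece of text on a line, in PDF points.
# This is NOT a pdf-reader class; a custom receiver would have to fill it in.
Glyph = Struct.new(:text, :x, :y, :font_size)

# The font size that covers the most characters on the line.
def dominant_font_size(glyphs)
  glyphs.group_by(&:font_size)
        .max_by { |_, group| group.sum { |g| g.text.length } }
        .first
end

# Smaller than the body text AND raised above the baseline.
# Both thresholds are guesses and would need tuning per document.
def superscript?(glyph, body_size, baseline, size_ratio: 0.8, min_rise: 0.15)
  glyph.font_size < body_size * size_ratio &&
    (glyph.y - baseline) > body_size * min_rise
end

# Returns the line's text, left to right, with superscript runs marked as ^(...).
def mark_superscripts(glyphs)
  return "" if glyphs.empty?

  body_size = dominant_font_size(glyphs)
  # Assume the lowest y among body-sized glyphs is the baseline.
  baseline = glyphs.select { |g| g.font_size == body_size }.map(&:y).min

  glyphs.sort_by(&:x)
        .chunk_while do |a, b|
          superscript?(a, body_size, baseline) == superscript?(b, body_size, baseline)
        end
        .map do |run|
          text = run.map(&:text).join
          superscript?(run.first, body_size, baseline) ? "^(#{text})" : text
        end
        .join
end

row = [
  Glyph.new("4 5", 10.0, 100.0, 10.0),
  Glyph.new("1/2", 26.0, 104.0, 6.0), # smaller and raised
  Glyph.new(" 6",  35.0, 100.0, 10.0)
]
puts mark_superscripts(row)
```

With a marker like that in the output, "5 and a half" can be told
apart from "51/2" when parsing the columns. Grouping characters into
lines first is left out here and would need its own logic.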
