If you figure out how to do this, please post a code example here.
Also, I've been meaning to start up a dialog with the lead developer
of that project. It'd be interesting to have the two projects under
one roof, even if they don't share code or other resources for the
time being....
I'm the current active developer of PDF::Reader, and I'd be interested
in seeing code that does this.
I suspect it would be an interesting challenge without some further work
to Reader. In its current state it's pretty good at extracting vector
image commands, as well as text (provided a standard encoding is used -
advanced encoding support is coming). However, advanced information like
embedded raster images, fonts and metadata is currently inaccessible.
-- James Healy <jimmy-at-deefa-dot-com> Wed, 30 Jan 2008 13:30:45 +1100
2008/1/30, James Healy <ji...@deefa.com>:
That's exactly what I'm using it for. The shipped README includes a
simple rspec example, but I'd love to see/ship further examples if
people have them.
The current work on adding advanced encoding support is driven by my
desire to unit test the contents of Unicode encoded PDFs generated by
cairo.
-- James Healy <jimmy-at-deefa-dot-com> Wed, 30 Jan 2008 21:22:48 +1100
We'll definitely be looking into this for possible limited PDF::Writer specs.
-greg
This sounds like an interesting project, but you may run into a few
hurdles.
glennswest wrote:
> Right now just for text and position. Tomorrow I'll add font, and
> basic blocks and graphics.
This will work for 7-bit text (i.e. US-ASCII chars), but anything else
will have unpredictable results. PDF::Reader returns all text encoded
as UTF-8, and PDF::Writer (by default) assumes all input text is encoded
as cp-1252 (the Windows charset). For the first 128 characters (bytes
0-127) these will match, but above that all hell breaks loose.
The ideal solution is for PDF::Writer to support UTF-8 input, but for
the moment it's not available. You might be able to use iconv or
something to convert the UTF-8 back to cp-1252 before passing it into
PDF::Writer, but any characters that aren't representable in the
destination character set will be lost.
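As a rough sketch of that conversion: on current Ruby versions (1.9 and
later) String#encode covers the same ground as iconv. The helper name
to_cp1252 and the "?" placeholder are made up for illustration and are
not part of PDF::Reader or PDF::Writer. Characters with no cp-1252
equivalent are replaced rather than raising, so the loss described above
still applies.

```ruby
# Hypothetical helper: re-encode UTF-8 text (as returned by PDF::Reader)
# into cp-1252 (what PDF::Writer assumes by default).
def to_cp1252(utf8_text, placeholder = "?")
  utf8_text.encode(
    "Windows-1252", "UTF-8",
    invalid: :replace,   # malformed UTF-8 bytes become the placeholder
    undef:   :replace,   # so do characters cp-1252 cannot represent
    replace: placeholder
  )
end

plain = to_cp1252("Total: 10 \u20AC")   # the euro sign exists in cp-1252
lossy = to_cp1252("\u65E5\u672C")       # two CJK characters, not in cp-1252
```

Here plain keeps the euro sign as a single cp-1252 byte, while every
character in lossy collapses to the placeholder.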
At this stage, the majority of PDF::Reader is focussed on correct
extraction of text, so there isn't a great amount of detailed access to
information on metadata, fonts, embedded raster images etc.
I'm happy to start looking at that stuff in due course if there's a
need, but patches are always welcome if you need it sooner :)
> If all goes as planned, I'll release the .rb on my blog, and if
> the "maintainers" of pdfwriter/reader wish, I don't mind contributing
> it into the tree.
Regardless of what I've said above, I'm still keen to see your code if
you feel like releasing it and/or proving me wrong.
-- James Healy <jimmy-at-deefa-dot-com> Thu, 06 Mar 2008 00:42:07 +1100
Some common-ish non-alphanumeric characters will appear above byte 127 -
things like the euro symbol, Windows "smart quotes", some hyphens, etc.
How often these show up depends on what generated the PDF file and the
locale of the system at the time.
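One quick way to gauge how exposed a given document is would be to list
the characters in the extracted text that fall outside US-ASCII. This is
only a sketch: non_ascii_chars is a made-up helper name, and the sample
string stands in for text pulled out of a real PDF.

```ruby
# Hypothetical helper: return the distinct non-ASCII characters in a string,
# in order of first appearance.
def non_ascii_chars(utf8_text)
  utf8_text.each_char.reject(&:ascii_only?).uniq
end

sample = "\u201CPrice\u201D: 5 \u20AC"   # smart quotes and a euro sign
suspects = non_ascii_chars(sample)

# Print each suspect with its Unicode code point for easy lookup.
suspects.each { |ch| puts format("U+%04X %s", ch.ord, ch) }
```

An empty result means the text is plain 7-bit ASCII and should pass
through unchanged.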
> Let you know after I play a few days.
Sounds good.
> I thought pdf writer was getting closer to utf8 support.
I've been hacking at a patch, but it still needs a little work. The
text encoding is working fine, but mapping the character codes to font
glyphs is taking some time, as I'm not particularly familiar with how
fonts work.
-- James Healy <jimmy-at-deefa-dot-com> Thu, 06 Mar 2008 12:51:10 +1100