Content encoding

Sergey Kozlov

Feb 24, 2014, 3:58:17 AM

to pdfhummus-in...@googlegroups.com

Hello again.
I am trying to replace some text in the page content and have run into encoding problems.
I ran two tests.

1) Add text to an existing PDF with PDFEditor on Linux.
All font encodings in the resulting PDF are WinAnsiEncoding.
The test app runs on Windows.
After extracting the content stream, I do a replacement like this:

std::string contents = <PDF contents>;
replace(contents, "text_in_pdf", PDFTextString().FromUTF8(<utf8 text>).ToString());

and I use the same approach for the PDF info (otherwise it doesn't work):

InfoDictionary &info = m_pdf->GetDocumentContext().GetTrailerInformation().GetInfo();
info.Title = PDFTextString().FromUTF8(title).ToString();

2) A PDF created with xournal on Linux (https://dl.dropboxusercontent.com/u/4571566/Stuff/test1.pdf).
The font is TrueType (DejaVu Sans); the test app runs on Windows.
The content is:

BT 12.00 0 0 -12.00 67.61 131.19 Tm /F5 1 Tf (LIPPS) Tj ET Q

Here (LIPPS) is displayed as "hello".
I can't figure out how to encode and decode the content.

What is the encoding of the content stream, and how do I correctly put UTF-8 text into it (or replace part of it with UTF-8 text)?
Maybe the encoding depends on the font or something?
I know you provide UnicodeString, but I don't know how to use it in my situation.

Can you help with this?
Thanks!

Sergey Kozlov

Feb 25, 2014, 12:39:49 AM

to pdfhummus-in...@googlegroups.com

2) A PDF created with xournal on Linux (https://dl.dropboxusercontent.com/u/4571566/Stuff/test1.pdf).
The font is TrueType (DejaVu Sans); the test app runs on Windows.
The content is:

BT 12.00 0 0 -12.00 67.61 131.19 Tm /F5 1 Tf (LIPPS) Tj ET Q
Here (LIPPS) is displayed as "hello".
I can't figure out how to encode and decode the content.

I think I found an explanation: http://sourceforge.net/p/xournal/bugs/101/#0cc0.
As I understand it, LIPPS is "hello" encoded as PDF glyph codes. I tried copying the text from a PDF viewer (PDF-XChange), and it copies LIPPS instead of hello.
So this is probably not a PDFHummus issue. I don't know whether it is possible to extract and replace text in such a situation.
Still, it would be good to get answers to the other questions about encoding/decoding.
Thanks.

Gal Kahana

Feb 26, 2014, 1:59:36 AM

to pdfhummus-in...@googlegroups.com

First, the two items have different solutions.

When strings are used as text outside the context of page/form content, as in the title example that you provided in (1), PDFTextString is the way to go.
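
To make the idea of a PDF "text string" more concrete, here is a minimal standalone sketch of one encoding the PDF specification allows for such strings: UTF-16BE prefixed with the byte order mark FE FF. It is not PDFTextString's actual implementation, and the names nextCodePoint and utf8ToPdfTextStringUtf16 are made up for this illustration. In real code, PDFTextString().FromUTF8(...).ToString() as shown in (1) is the thing to use.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode one UTF-8 sequence starting at s[i] and advance i.
// Validation is minimal (overlong forms and lone surrogates are not rejected).
static std::uint32_t nextCodePoint(const std::string& s, std::size_t& i) {
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    std::uint32_t cp = 0;
    int extra = 0;
    if (lead < 0x80) {
        return lead;  // plain ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        cp = lead & 0x1F; extra = 1;
    } else if ((lead & 0xF0) == 0xE0) {
        cp = lead & 0x0F; extra = 2;
    } else if ((lead & 0xF8) == 0xF0) {
        cp = lead & 0x07; extra = 3;
    } else {
        throw std::runtime_error("invalid UTF-8 lead byte");
    }
    while (extra-- > 0) {
        if (i >= s.size()) throw std::runtime_error("truncated UTF-8 sequence");
        unsigned char cont = static_cast<unsigned char>(s[i++]);
        if ((cont & 0xC0) != 0x80) throw std::runtime_error("invalid UTF-8 continuation byte");
        cp = (cp << 6) | (cont & 0x3F);
    }
    return cp;
}

// Hypothetical helper: append one 16-bit unit in big-endian byte order.
static void appendBigEndian16(std::string& out, std::uint32_t unit) {
    out.push_back(static_cast<char>((unit >> 8) & 0xFF));
    out.push_back(static_cast<char>(unit & 0xFF));
}

// Hypothetical helper: UTF-8 in, BOM-prefixed UTF-16BE bytes out.
std::string utf8ToPdfTextStringUtf16(const std::string& utf8) {
    std::string out("\xFE\xFF");  // byte order mark that flags the string as UTF-16BE
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::uint32_t cp = nextCodePoint(utf8, i);
        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            appendBigEndian16(out, 0xD800 + (cp >> 10));
            appendBigEndian16(out, 0xDC00 + (cp & 0x3FF));
        } else {
            appendBigEndian16(out, cp);
        }
    }
    return out;
}
```

Calling utf8ToPdfTextStringUtf16(title) yields raw bytes, so the result still has to be written as a properly escaped or hex-encoded PDF string object. This form only applies to strings outside page/form content such as the Info dictionary, not to Tj operands.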

When strings are used as part of page/form content text, the encoding depends on the font definition placed later in the file, and things are a bit more complex. Since the library does most of the work for you, you are best advised to use the library methods for encoding text, which take care of both encoding the text and embedding the right glyphs. Read https://github.com/galkahana/PDF-Writer/wiki/Text-support

Essentially, the string that you see in Tj should be read as a series of pointers. "L", for example, is an unsigned char index (namely 0x4C) into the array of glyphs defined for the local font definition in the PDF. Most implementations take care to organize that array so that Latin characters are indexed according to their matching glyphs, which makes it appear as if the "Tj" operand is actual text. However, that is not really what is happening: the chars are mere indexes into encoding arrays, whose entries in turn are glyph indexes. All of this is taken care of by the library when you use its "Tj" implementation or the higher-level "ShowText".
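
The "bytes are indexes, not text" point can be sketched with a small standalone example. It assumes a hand-written code table for the font resource /F5, filled in only from the single sample in this thread ("LIPPS" displayed as "hello"). In a real file the table would have to come from the font's encoding or ToUnicode information, and none of the names below (CodeMap, sampleMapForF5, decodeShownString, encodeForFont) are part of the library.

```cpp
#include <map>
#include <string>

// Hypothetical table: character code used in the content stream -> displayed character.
using CodeMap = std::map<unsigned char, char>;

// Assumption: these four entries are all we know about /F5, taken from the
// sample "LIPPS" <-> "hello". A real font defines many more codes.
CodeMap sampleMapForF5() {
    return { {0x4C, 'h'}, {0x49, 'e'}, {0x50, 'l'}, {0x53, 'o'} };
}

// Translate the raw bytes of a Tj operand into readable text.
// Codes missing from the table are shown as '?'.
std::string decodeShownString(const std::string& tjBytes, const CodeMap& map) {
    std::string text;
    for (unsigned char code : tjBytes) {
        auto it = map.find(code);
        text.push_back(it != map.end() ? it->second : '?');
    }
    return text;
}

// Translate readable text back into bytes for this font.
// Returns false if some character has no code in the table, for example
// because the font defines no glyph for it.
bool encodeForFont(const std::string& text, const CodeMap& map, std::string& outBytes) {
    outBytes.clear();
    for (char wanted : text) {
        bool found = false;
        for (const auto& entry : map) {
            if (entry.second == wanted) {
                outBytes.push_back(static_cast<char>(entry.first));
                found = true;
                break;
            }
        }
        if (!found) return false;
    }
    return true;
}
```

With this table, decodeShownString("LIPPS", sampleMapForF5()) gives "hello", and encodeForFont succeeds for "hello" but fails for "help" because the table has no code for "p". Replacing text in an existing content stream hits the same limit: the replacement can only use characters the font already has codes for, which is one more reason to let the library encode text and embed glyphs.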