Looking for PDF with UTF-16 text in it

arne thormodsen

Jan 10, 2007, 2:28:12 AM
Folks,

Does anyone know if there are any examples of PDF files on the net with
Unicode UTF-16 encodings for "text string"-type data elements?

I'm looking for anything in UTF-16 that fits the description from the Adobe
PDF spec:

3.8.1 Text Strings
Certain strings contain information that is intended to be human-readable,
such as text annotations, bookmark names, article names, document
information, and so forth. Such strings are referred to as text strings.
Text strings are encoded in either PDFDocEncoding or Unicode character
encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
documented in Appendix D. Unicode is described in the Unicode Standard by
the Unicode Consortium (see the Bibliography).

I've got plenty of non-English files, but none contain any elements like
those described above. I'm doing some testing and would like to have a few
examples like this.

Thanks,

--arne


Aandi Inston

Jan 10, 2007, 4:13:18 AM
"arne thormodsen" <arneXX.th...@hpXX.com> wrote:

>Does anyone know if there are any examples of PDF files on the net with
>Unicode UTF-16 encodings for "text string"-type data elements.

You can readily make these in recent versions of Acrobat by pasting a
non-Latin1 string as the subject in document info.
----------------------------------------
Aandi Inston
Please support usenet! Post replies and follow-ups, don't e-mail them.

Bruno Lowagie

Jan 10, 2007, 3:59:17 AM
arne thormodsen wrote:
> Folks,
>
> Does anyone know if there are any examples of PDF files on the net with
> Unicode UTF-16 encodings for "text string"-type data elements.

Look at object 16 in this file:
http://itext.ugent.be/itext-in-action/examples/chapter13/results/outline_actions.pdf
It's a Unicode text string used in the Outline Panel.

iText generated documents use all the text string types
on many occasions, so you might want to buy the book iText in Action
http://itext.ugent.be/itext-in-action/ to create more samples for
yourself. iText in Action is a good companion for the PDF Reference
because it first explains what's in the reference, and then allows
you to make a small PDF that actually demonstrates the functionality.

best regards,
Bruno

Aandi Inston

Jan 10, 2007, 6:42:20 AM
"arne thormodsen" <arneXX.th...@hpXX.com> wrote:

>I'm looking for anything in UTF-16 that fits the description from the Adobe
>PDF spec:

By the way, if you are writing code be sure to accept UTF-16BE and
UTF-16LE, even if you only find examples of one.

So you need to recognise FFFE and FEFF as the opening bytes and
proceed accordingly.

arne thormodsen

Jan 10, 2007, 1:38:23 PM

"Bruno Lowagie" <bruno....@ugent.be> wrote in message
news:eo25u7$vit$1...@gaudi2.UGent.be...

Thanks. This is what I needed.

I've another question. The only way that the parsing can work here is to
assume that the byte values "0x28" and "0x29" (left and right parenthesis)
never occur as the high byte of a UTF-16 character. Conveniently enough
there is a "hole" in the UNICODE spec that accommodates this.

I'm assuming that this "hole" is there for a reason, and that Adobe got it
put there. I'm trying to find out more right now.

Thanks,

--arne

> best regards,
> Bruno


arne thormodsen

Jan 10, 2007, 5:53:31 PM

"arne thormodsen" <arneXX.th...@hpXX.com> wrote in message
news:zoaph.4165$NH6....@news.cpqcorp.net...

>
> I've another question. The only way that the parsing can work here is to
> assume that the byte values "0x28" and "0x29" (left and right parenthesis)
> never occur as the high byte of a UTF-16 character. Conveniently enough
> there is a "hole" in the UNICODE spec that accommodates this.
>
> I'm assuming that this "hole" is there for a reason, and that Adobe got it
> put there. I'm trying to find out more right now.
>

Hmmm, I was using an old version of the UNICODE spec. Looks like the "hole"
has been filled with stuff since it was printed:

Braille Patterns (2800-28FF)
Supplemental Arrows-B (2900-297F)

Now I am puzzled...

--arne

Paulo Soares

Jan 10, 2007, 5:34:32 PM

The PDF Reference is your friend. You only have bytes between the
parentheses; whether they are Unicode or not is only revealed after
unescaping the special characters. In other words, a char/byte
representation of:

)a

would be in the pdf string:

(\)a)

Paulo

Aandi Inston

Jan 10, 2007, 6:01:27 PM
"arne thormodsen" <arneXX.th...@hpXX.com> wrote:

>I've another question. The only way that the parsing can work here is to
>assume that the byte values "0x28" and "0x29" (left and right parenthesis)
>never occur as the high byte of a UTF-16 character.

No, that's not the case. There is no connection; you are mixing up
processing at different levels.

The parsing of string data in a PDF follows simple rules, continuing
to a matched parenthesis, processing \ escapes as they are seen. This
is used to build a list of byte values, with no interpretation at all
at this point.

Once you have the list of bytes, you can put an interpretation on those
values - for instance a string in PDFDocEncoding or UTF-16LE or
UTF-16BE.

It was the responsibility of the software that wrote the PDF to make a
string value that could be parsed. That might mean escaping the data
values for parentheses and backslash as they are seen. Equally, the
software might have written a hex string. The parsing software will
read them all equally.
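
A minimal sketch of the first level described above, in Python. It turns the bytes of a literal string into a plain byte sequence, handling balanced parentheses and backslash escapes, and applies no text encoding at all. The name parse_literal_string is made up for illustration and is not from any library, and the sketch leaves out some details of the PDF Reference (for example, normalising unescaped end-of-line sequences inside a string).

```python
# Illustrative sketch only; parse_literal_string is a hypothetical helper.
SIMPLE_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09,
    ord("b"): 0x08, ord("f"): 0x0C,
    ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}


def parse_literal_string(data: bytes, start: int = 0):
    """Return (raw_bytes, index_after_closing_paren) for a literal string.

    data[start] must be '('. The result is uninterpreted byte values.
    """
    if data[start:start + 1] != b"(":
        raise ValueError("literal string must start with '('")
    out = bytearray()
    depth = 1
    i = start + 1
    while i < len(data):
        c = data[i]
        if c == 0x5C:  # backslash introduces an escape
            i += 1
            if i >= len(data):
                break
            e = data[i]
            if e in SIMPLE_ESCAPES:
                out.append(SIMPLE_ESCAPES[e])
                i += 1
            elif 0x30 <= e <= 0x37:  # \ddd: one to three octal digits
                value = 0
                digits = 0
                while digits < 3 and i < len(data) and 0x30 <= data[i] <= 0x37:
                    value = value * 8 + (data[i] - 0x30)
                    i += 1
                    digits += 1
                out.append(value & 0xFF)  # high-order overflow is ignored
            elif e in (0x0D, 0x0A):  # backslash + end-of-line: continuation
                i += 1
                if e == 0x0D and i < len(data) and data[i] == 0x0A:
                    i += 1
            else:  # unknown escape: the backslash is dropped
                out.append(e)
                i += 1
        elif c == 0x28:  # unescaped '(' is allowed when balanced
            depth += 1
            out.append(c)
            i += 1
        elif c == 0x29:
            depth -= 1
            if depth == 0:
                return bytes(out), i + 1
            out.append(c)
            i += 1
        else:
            out.append(c)
            i += 1
    raise ValueError("unterminated literal string")
```

Only after this step would a caller look at the returned bytes and decide whether they are PDFDocEncoding or UTF-16, so a 0x28 or 0x29 inside a UTF-16 code unit is no different from any other escaped or balanced parenthesis.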

arne thormodsen

Jan 10, 2007, 7:02:02 PM

"Aandi Inston" <qu...@dial.pipex.con> wrote in message
news:45a56f61....@read.news.uk.uu.net...

> "arne thormodsen" <arneXX.th...@hpXX.com> wrote:
>
>
> It was the responsibility of the software that wrote the PDF to make a
> string value that could be parsed. That might mean escaping the data
> values for parentheses and backslash as they are seen. Equally, the
> software might have written a hex string. The parsing software will
> read them all equally.

So let me state this another way to make sure that I understand it:

1. It is the AUTHORING application's responsibility to produce parseable
PDF strings

2. IF a UTF-16 string can be represented in the "(\376\377..)" form it MAY
be written this way. This would imply it does not contain any 2 or 4-byte
sequences that contain bytes that correspond to the ASCII characters "(",
")", and others that the PDF parser considers special.

3. IF this CANNOT be done, then the authoring application MUST represent
the string in the "<FEFF...>" form

Is this correct? I've already edited some PDF files and changed UTF-16
strings from one form to the other, Acrobat seems equally accepting of
either.

Thanks much for your help here. This point does not seem to be explicitly
covered in the PDF spec, or if it is I've missed it.

--arne

Aandi Inston

Jan 11, 2007, 4:14:48 AM
"arne thormodsen" <arneXX.th...@hpXX.com> wrote:

>So let me state this another way to make sure that I understand it:
>
>1. It is the AUTHORING application's responsibility to produce parseable
>PDF strings

Yes.


>
>2. IF a UTF-16 string can be represented in the "(\376\377..)" form it MAY
>be written this way.

Yes.

>This would imply it does not contain any 2 or 4-byte
>sequences that contain bytes that correspond to the ASCII characters "(",
>")", and others that the PDF parser considers special.

No, I don't see why you are saying that. If it contains bytes that
correspond to the three (and only three) special characters then they
can be escaped. It doesn't imply anything about the contents of the
bytes which make up the string.

This seems to be a repeat of what I tried to cover in the last
question, so please take another look and let me know if I still
haven't made it clear.


>
>3. IF this CANNOT be done, then the authoring application MUST represent
>the string in the "<FEFF...>" form

No, this is NEVER necessary and ALWAYS permitted.

If the string would be largely written as octal escapes, a hex string
is more compact.
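
To make the size comparison concrete, here is a rough Python sketch of the writing side. The helper names to_literal and to_hex are invented for this example. The escaping policy (printable ASCII kept as-is, the three special characters backslash-escaped, everything else written as three-digit octal) is one conservative choice among several valid ones, not something the PDF Reference requires.

```python
# Illustrative sketch only; to_literal and to_hex are hypothetical helpers.

def to_literal(raw: bytes) -> bytes:
    """Write raw bytes as a PDF literal string, escaping where needed."""
    out = bytearray(b"(")
    for b in raw:
        if b in (0x28, 0x29, 0x5C):        # ( ) \ : the three special characters
            out += b"\\" + bytes([b])
        elif 0x20 <= b <= 0x7E:            # printable ASCII is written directly
            out.append(b)
        else:                              # everything else as \ddd (octal)
            out += ("\\%03o" % b).encode("ascii")
    out += b")"
    return bytes(out)


def to_hex(raw: bytes) -> bytes:
    """Write raw bytes as a PDF hex string."""
    return b"<" + raw.hex().upper().encode("ascii") + b">"


# U+2800 as a text string: byte order mark FE FF, then UTF-16BE code units.
raw = b"\xfe\xff" + "\u2800".encode("utf-16-be")
literal_form = to_literal(raw)   # 16 bytes
hex_form = to_hex(raw)           # 10 bytes
```

Both forms carry the same four bytes; the hex form wins here because three of the four bytes would otherwise need a four-character octal escape.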

arne thormodsen

Jan 11, 2007, 12:34:47 PM

"Aandi Inston" <qu...@dial.pipex.con> wrote in message
news:45a5ff84....@read.news.uk.uu.net...

Thanks. Now I hope I do understand. I had kind of a "mental block" imagining
a 2-byte Unicode character written with escaped "ASCII".

My understanding now is that all of the following are acceptable ways to
represent the Unicode character U+2800:

(\376\377\050\000)
(\376\377\(\000)
<FEFF2800>

There are also ways using regular characters, but I can't type them in here.
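
For the hex form, a small Python sketch of how a reader might turn the token into bytes. The name parse_hex_string is hypothetical. The handling of embedded white space and of an odd number of digits (a missing final digit is taken as 0) follows my reading of the PDF Reference, so treat those details as assumptions to check against the spec.

```python
# Illustrative sketch only; parse_hex_string is a hypothetical helper.
PDF_WHITESPACE = b" \t\r\n\f\x00"


def parse_hex_string(token: bytes) -> bytes:
    """Convert a PDF hex string such as b'<FEFF2800>' into raw bytes."""
    if not (token.startswith(b"<") and token.endswith(b">")):
        raise ValueError("hex string must be enclosed in < and >")
    # White space between digits is ignored.
    digits = bytes(c for c in token[1:-1] if c not in PDF_WHITESPACE)
    if len(digits) % 2:
        digits += b"0"  # odd count: the final digit is assumed to be 0
    # bytes.fromhex raises ValueError on anything that is not a hex digit.
    return bytes.fromhex(digits.decode("ascii"))
```

The result is the same uninterpreted byte sequence a literal string would produce, which is why a viewer can accept either form interchangeably.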

I'm sorry to have to ask the question so many ways. We are writing a parser
that has to partially understand PDF syntax and I've been stuck trying to
clearly understand the UNICODE representation issue.

--arne


Aandi Inston

Jan 11, 2007, 1:46:22 PM
"arne thormodsen" <arneXX.th...@hpXX.com> wrote:
>
>My understanding now is that all of the following are acceptable ways to
>represent the Unicode character U+2800:
>
>(\376\377\050\000)
>(\376\377\(\000)
><FEFF2800>
>
>There are also ways using regular characters, but I can't type them in here.

Yes, I think you've got it now. In case one of my other points was
lost on the way, don't overlook

<FFFE0028>

(UTF-16LE rather than UTF-16BE).

Thomas Merz

Jan 12, 2007, 5:10:07 PM
Aandi Inston wrote:
>>I'm looking for anything in UTF-16 that fits the description from the Adobe
>>PDF spec:
>
>
> By the way, if you are writing code be sure to accept UTF-16BE and
> UTF-16LE, even if you only find examples of one.

What makes you think so? Of course strings could hold any kind of
(private) data, including binary data or UTF-16LE. However, "text
strings" in PDF are defined as follows in the PDF Reference:

"The text string type is used for character strings that are
encoded in either PDFDocEncoding or the UTF-16BE Unicode character
encoding scheme."

> So you need to recognise FFFE and FEFF as the opening bytes and
> proceed accordingly.

I don't think UTF-16LE strings would qualify as proper text
strings in PDF. They could only be used in private data structures,
under private conventions - but not in standard PDF entries.
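
A hedged Python sketch of a reader that follows the definition quoted above: bytes starting with FE FF are UTF-16BE, anything else is PDFDocEncoding. The function name decode_text_string and the allow_le switch are invented for this example. Python has no built-in PDFDocEncoding codec, so the fallback uses Latin-1 as a stand-in; that is an approximation, since PDFDocEncoding differs from Latin-1 in a number of code positions.

```python
# Illustrative sketch only; decode_text_string and allow_le are hypothetical.

def decode_text_string(raw: bytes, allow_le: bool = False) -> str:
    """Interpret already-unescaped string bytes as a PDF text string."""
    if raw.startswith(b"\xfe\xff"):
        # Byte order mark FE FF: UTF-16BE, the only Unicode form in the spec.
        return raw[2:].decode("utf-16-be")
    if allow_le and raw.startswith(b"\xff\xfe"):
        # Non-conforming little-endian data, accepted only on request.
        return raw[2:].decode("utf-16-le")
    # ASSUMPTION: Latin-1 stands in for PDFDocEncoding here; a real reader
    # would need the mapping table from the PDF Reference.
    return raw.decode("latin-1")
```

Keeping the little-endian branch behind an explicit switch lets a strict parser stay strict while a lenient one can still cope with producers that got the byte order wrong.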

Thomas

_______________________________________________________________
Thomas Merz t...@pdflib.com http://www.pdflib.com
PDFlib 7: Create PDF/A for archiving, format tables, and more!
_______PDFlib - a library for generating PDF on the fly________

arne thormodsen

Jan 12, 2007, 6:55:23 PM

"Thomas Merz" <t...@pdflib.com> wrote in message
news:eo90vv$m4p$1...@svr7.m-online.net...

> Aandi Inston wrote:
>>>I'm looking for anything in UTF-16 that fits the description from the
>>>Adobe PDF spec:
>>
>>
>> By the way, if you are writing code be sure to accept UTF-16BE and
>> UTF-16LE, even if you only find examples of one.
>
> What makes you think so? Of course strings could hold any kind of
> (private) data, including binary data or UTF-16LE. However, "text
> strings" in PDF are defined as follows in the PDF Reference:
>
> "The text string type is used for character strings that are
> encoded in either PDFDocEncoding or the UTF-16BE Unicode character
> encoding scheme."
>
>> So you need to recognise FFFE and FEFF as the opening bytes and
>> proceed accordingly.
>
> I don't think UTF-16LE strings would qualify as proper text
> strings in PDF. They could only be used in private data structures,
> under private conventions - but not in standard PDF entries.
>

I did find that Acrobat 7 accepts either byte order. So regardless of
the spec, there may be producers out there writing little-endian strings,
which would explain why support for both was put in.

--arne

Aandi Inston

Jan 13, 2007, 3:56:54 AM
Thomas Merz <t...@pdflib.com> wrote:

>Aandi Inston wrote:

>> By the way, if you are writing code be sure to accept UTF-16BE and
>> UTF-16LE, even if you only find examples of one.
>
>What makes you think so? Of course strings could hold any kind of
>(private) data, including binary data or UTF-16LE. However, "text
>strings" in PDF are defined as follows in the PDF Reference:
>
>"The text string type is used for character strings that are
>encoded in either PDFDocEncoding or the UTF-16BE Unicode character
>encoding scheme."

You are right, of course. I was going from memory.

I would have said I am pretty sure that I have seen both forms in a
PDF and that I had coded accordingly.

However, reading my code I find only support for UTF-16BE, per the
specification. I would say it would do no harm to support UTF-16LE,
but it shouldn't be necessary. Sorry for any misleading advice.

One thing that may be worth pointing out for anyone browsing this
archived thread: both the Windows and Macintosh SDK documents have seemed,
to me, vague about what they consider to be "Unicode characters".
Windows programmers creating PDF should note that a Unicode string from
Windows will be little-endian. To create a conforming PDF, they should
either reverse the byte order or process each character as an integer to
be split into bytes.
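
Both approaches can be sketched in a few lines of Python. The function names are made up for illustration, and the input is assumed to be a buffer of little-endian UTF-16 code units with no byte order mark of its own.

```python
# Illustrative sketch only; all function names here are hypothetical.
import struct


def swap_byte_pairs(le_bytes: bytes) -> bytes:
    """Approach 1: reverse the byte order of every 16-bit unit."""
    if len(le_bytes) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    out = bytearray(len(le_bytes))
    out[0::2] = le_bytes[1::2]  # high bytes move to the front of each pair
    out[1::2] = le_bytes[0::2]  # low bytes follow
    return bytes(out)


def units_to_be(units) -> bytes:
    """Approach 2: split each code unit (an integer) into high, low bytes."""
    out = bytearray()
    for u in units:
        out.append((u >> 8) & 0xFF)
        out.append(u & 0xFF)
    return bytes(out)


def le_to_pdf_text_string(le_bytes: bytes) -> bytes:
    """Prefix the big-endian data with the FE FF byte order mark."""
    return b"\xfe\xff" + swap_byte_pairs(le_bytes)


# U+2800 as little-endian bytes, and the same data read as integers.
le = "\u2800".encode("utf-16-le")
units = struct.unpack("<%dH" % (len(le) // 2), le)
```

Surrogate pairs need no special care in either approach, because the swap works on individual 16-bit units and leaves their order unchanged.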
