PDF Writer UTF-8 Support

Brian Schröder

Mar 30, 2005, 5:57:23 PM

Hello,

I'm having a hard time getting PDF Writer to output my UTF-8 encoded
text correctly. Has anybody around here got some tips for me?

thanks a lot,

Brian

--
Brian Schröder
http://ruby.brian-schroeder.de/

Austin Ziegler

Mar 30, 2005, 11:48:16 PM

On Mar 30, 2005 5:57 PM, Brian Schröder <ruby....@gmail.com> wrote:
> I'm having a hard time getting PDF Writer to output my UTF-8 encoded
> text correctly. Has anybody around here got some tips for me?

Unfortunately, PDF::Writer needs "help" understanding UTF-8 input, and
I have been focussing on a number of basic feature changes before
making this "easy", as it also makes a difference to how each font is
handled.

I am hoping to have PDF::Writer 1.0 out -- with documentation on how
to do this at all -- in the next two weeks or so. I apologise for the
inconvenience.

-austin
--
Austin Ziegler * halos...@gmail.com
* Alternate: aus...@halostatue.ca

Brian Schröder

Mar 31, 2005, 2:13:31 AM

Thanks for your reply, Austin,

Is there any way to output UTF-8 encoded text right now? I
need no fancy fonts or formatting, just some plain text output at
specific x-y coordinates.

best regards and thanks for the great library,

brian

Austin Ziegler

Mar 31, 2005, 9:11:14 AM

On Mar 31, 2005 2:13 AM, Brian Schröder <ruby....@gmail.com> wrote:
> On Thu, 31 Mar 2005 13:48:16 +0900, Austin Ziegler
> <halos...@gmail.com > wrote:
>> On Mar 30, 2005 5:57 PM, Brian Schröder <ruby....@gmail.com >
>> wrote:
>>> I'm having a hard time getting PDF Writer to output my UTF-8
>>> encoded text correctly. Has anybody around here got some tips
>>> for me?
>> Unfortunately, PDF::Writer needs "help" understanding UTF-8 input
>> and I have been focussing on a number of basic feature changes
>> before making this "easy" as it also makes a difference as how
>> each font is handled.
>> I am hoping to have PDF::Writer 1.0 out -- with documentation on
>> how to do this at all -- in the next two weeks or so. I apologise
>> for the inconvenience.

> Thanks for your reply, austin,
>
> Is there any way to output UTF-8 encoded text right now?
> I need no fancy fonts or formatting, just some plain text output at
> specific x-y coordinates.
>
> best regards and thanks for the great library,

Yes -- but you have to wade through the font encoding mapping
information for PDF documents right now, and you have to be using a
Unicode-capable font. From the PDF 1.6 Reference:

Font management is primarily concerned with producing the
correct appearance of text—that is, the shape and placement of
glyphs. However, it is sometimes necessary for a PDF application
to extract the meaning of the text, represented in some standard
information encoding such as Unicode. In some cases, this
information can be deduced from the encoding used to represent
the text in the PDF file. Otherwise, the PDF producer
application should specify the mapping explicitly by including a
special object, the ToUnicode CMap.

I have not added support for the /ToUnicode CMap in PDF::Writer, but
it may be possible to add. However:

Certain strings contain information that is intended to be
human-readable, such as text annotations, bookmark names,
article names, document information, and so forth. Such strings
are referred to as text strings. Text strings are encoded in
either PDFDocEncoding or Unicode character encoding.
PDFDocEncoding is a superset of the ISO Latin 1 encoding and is
documented in Appendix D. Unicode is described in the Unicode
Standard by the Unicode Consortium (see the Bibliography).

For text strings encoded in Unicode, the first two bytes must be
254 followed by 255. These two bytes represent the Unicode byte
order marker, U+FEFF, indicating that the string is encoded in
the UTF-16BE (big-endian) encoding scheme specified in the
Unicode standard. (This mechanism precludes beginning a string
using PDFDocEncoding with the two characters thorn ydieresis,
which is unlikely to be a meaningful beginning of a word or
phrase). Note: Applications that process PDF files containing
Unicode text strings should be prepared to handle supplementary
characters; that is, characters requiring more than two bytes to
represent.

An escape sequence may appear anywhere in a Unicode text string
to indicate the language in which subsequent text is written,
which is useful when the language cannot be determined from the
character codes used in the text. The escape sequence consists
of the following elements, in order:

1. The Unicode value U+001B (that is, the byte sequence 0
followed by 27)
2. A 2-character ISO 639 language code—for example, en for
English or ja for Japanese
3. (Optional) A 2-character ISO 3166 country code—for example,
US for the United States or JP for Japan
4. The Unicode value U+001B

The complete list of codes defined by ISO 639 and ISO 3166 can
be obtained from the International Organization for
Standardization (see the Bibliography).
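
The byte order marker rule quoted above can be sketched in Ruby as follows. This is only an illustration, assuming a modern Ruby (1.9 or later) with built-in encoding support; the names `UTF16BE_BOM`, `unicode_text_string?` and `decode_unicode_text_string` are hypothetical and are not part of PDF::Writer. It also shows the supplementary-character case mentioned in the note, where one character takes four bytes.

```ruby
# Hypothetical helpers for illustration; not part of PDF::Writer.

# The two bytes 254 and 255 that mark a Unicode (UTF-16BE) text string.
UTF16BE_BOM = [254, 255].pack("C*")

# True if the raw bytes start with the UTF-16BE byte order marker.
def unicode_text_string?(bytes)
  bytes.b.start_with?(UTF16BE_BOM)
end

# Strip the marker and decode the remaining bytes to a UTF-8 string.
def decode_unicode_text_string(bytes)
  unless unicode_text_string?(bytes)
    raise ArgumentError, "missing UTF-16BE byte order marker"
  end
  bytes.b.byteslice(2..-1)
       .force_encoding(Encoding::UTF_16BE)
       .encode(Encoding::UTF_8)
end

# U+1D11E (musical G clef) lies outside the Basic Multilingual Plane,
# so UTF-16BE stores it as the surrogate pair D834 DD1E (four bytes).
sample = [254, 255, 0xD8, 0x34, 0xDD, 0x1E].pack("C*")
decoded = decode_unicode_text_string(sample)
puts decoded.codepoints.map { |cp| format("U+%04X", cp) }.inspect
```

A string without the marker would have to be treated as PDFDocEncoding instead, which this sketch does not attempt to decode.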

So you can't use UTF-8 directly, but you can use UTF-16BE if you
begin the string with the U+FEFF BOM (the bytes 0xFE 0xFF).
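
A minimal sketch of that conversion, assuming a modern Ruby (1.9 or later) where `String#encode` is available; at the time of this thread the Iconv library would have filled that role. The helper name `utf8_to_pdf_text_string` is hypothetical and not part of PDF::Writer, and the sketch does not show which PDF::Writer calls, if any, accept the resulting bytes.

```ruby
# Hypothetical helper for illustration; not part of PDF::Writer.
# Assumes the input is a valid UTF-8 string.
def utf8_to_pdf_text_string(utf8)
  bom = [0xFE, 0xFF].pack("C*")            # U+FEFF in big-endian byte order
  bom + utf8.encode(Encoding::UTF_16BE).b  # raw UTF-16BE bytes after the BOM
end

bytes = utf8_to_pdf_text_string("Schr\u00F6der")
puts bytes.unpack("C*").inspect
# => [254, 255, 0, 83, 0, 99, 0, 104, 0, 114, 0, 246, 0, 100, 0, 101, 0, 114]
```

The result is a binary string, so it should be written out as bytes rather than passed through any further character-set conversion.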