How to use wchar_t (UNICODE / UTF-8)?


Mirco

Jul 8, 2009, 4:51:40 AM
to libHaru
Hello,

I have a string in wchar_t* format (so UCS-2).
How can I output this string using HPDF?

I have searched this discussion group, but could not find a good
example / answer.

(If I have to supply the string as a UTF-8 byte string, that's no
problem.)

Regards,
Mirco

Mirco

Jul 8, 2009, 5:50:27 AM
to libHaru
(UPDATE)
Searching this discussion group again, I found out that a UCS-2 string
can't be supported because HPDF_StrLen is used everywhere. HPDF_StrLen
is byte oriented and stops at the first zero byte, so for a typical
wchar_t* string (where the high byte of each character is zero) it
returns a length of 1.
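
A minimal sketch of the problem described above, assuming a little-endian UCS-2 layout as used for wchar_t on Windows. The helper names byte_len and ucs2_len are hypothetical and are not part of libHaru; byte_len only stands in for any byte-oriented length routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length as a byte-oriented routine sees it: stop at the first 0x00 byte. */
static size_t byte_len(const unsigned char *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}

/* Length in 16-bit code units: stop at the first 0x0000 unit. */
static size_t ucs2_len(const unsigned char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[2 * n] != 0 || s[2 * n + 1] != 0)
        n++;
    return n;
}

/* "Hi" as little-endian UCS-2 bytes (assumed layout of a wchar_t* on Windows). */
static const unsigned char hi_ucs2le[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };
```

With this data, byte_len(hi_ucs2le) yields 1 and ucs2_len(hi_ucs2le) yields 2: the zero high byte of 'H' is taken for the string terminator.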

So I have to use UTF-8, but for some reason UTF-8 is not supported yet.

I'm starting to understand what should be done:
HPDF_SetCurrentEncoder(doc, "UTF-8");

I then have to implement an hpdf_encoder_utf8.c file that provides:
- HPDF_UseUTF8Encoding(HPDF_Doc pdf);

OK, I will try to write this file.

Regards,
Mirco

Vincent den Boer

Jul 9, 2009, 2:03:55 AM
to lib...@googlegroups.com
Hi,

If you find/implement the solution, please share it. People have asked about
UTF-8 on this mailing list but I never saw an answer.

Thanks in advance!

--
Kind regards,
Vincent den Boer

--
E: vin...@shishkabab.net
W: www.shishkabab.net - Open source with a good taste.

Mirco

Jul 9, 2009, 4:57:12 AM
to libHaru
(Update #1)

I have coded hpdf_encoder_utf8.c, but after testing it I must conclude
that this approach is not the right one. HPDF_Page_ShowText does not use
the document encoder; it writes the 0-terminated string it receives to
the PDF as raw bytes.

The encoder is used when you call:
HPDF_SetInfoAttr(pdf, HPDF_INFO_PRODUCER, "using LIBHARU, http://libharu.org");
(So the string "using..." is treated as UTF-8 input if you call
HPDF_SetCurrentEncoder(pdf, "UTF-8").)

This is not what I want at all. Also, the PDF spec does not support
UTF-8 encoding for text strings; it supports UTF-16BE. A UTF-16BE text
string is recognized by its first two bytes, which must be 0xFE 0xFF.
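
The text-string rule mentioned above (UTF-16BE, starting with the bytes 0xFE 0xFF) can be sketched as a small standalone conversion. This is an illustration only: utf8_to_utf16be is a hypothetical helper, not a libHaru function, and it skips some validation (overlong forms and encoded surrogates are not rejected).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Convert a NUL-terminated UTF-8 string to UTF-16BE bytes with a leading
 * FE FF byte order mark. Returns the number of bytes written, or 0 if the
 * input is malformed or the output buffer is too small.
 */
static size_t utf8_to_utf16be(const char *utf8, unsigned char *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t n = 0;

    assert(utf8 != NULL && out != NULL);
    if (cap < 2)
        return 0;
    out[n++] = 0xFE;                  /* byte order mark */
    out[n++] = 0xFF;

    while (*p) {
        unsigned long cp;
        int extra;

        /* The lead byte determines how many continuation bytes follow. */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;                /* stray continuation or invalid lead */
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return 0;             /* truncated sequence (also hits NUL) */
            cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);
        }
        if (cp > 0x10FFFF)
            return 0;

        if (cp >= 0x10000) {          /* outside the BMP: surrogate pair */
            unsigned long v  = cp - 0x10000;
            unsigned long hi = 0xD800 | (v >> 10);
            unsigned long lo = 0xDC00 | (v & 0x3FF);

            if (cap - n < 4)
                return 0;
            out[n++] = (unsigned char)(hi >> 8);
            out[n++] = (unsigned char)(hi & 0xFF);
            out[n++] = (unsigned char)(lo >> 8);
            out[n++] = (unsigned char)(lo & 0xFF);
        } else {
            if (cap - n < 2)
                return 0;
            out[n++] = (unsigned char)(cp >> 8);
            out[n++] = (unsigned char)(cp & 0xFF);
        }
    }
    return n;
}
```

For the two-byte UTF-8 input C3 A9 (é), the function writes FE FF 00 E9 and returns 4.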

*****
So now I'm going to investigate how I can adjust HPDF_Page_ShowText to
treat the text parameter as a UTF-8 string.

Diving deeper into the sources, I found the InternalWriteText()
function in hpdf_page_operator.c. It seems to be the central point for
inserting text into the PDF. This function writes the string to the PDF
as raw bytes, so it assumes the input is already in the correct
code page.

I'll keep you informed of my progress.

Regards,
Mirco

Mirco

Jul 9, 2009, 7:18:02 AM
to libHaru
(Update #2)
I can't get it to work. I do not understand why, but somehow UTF-16BE
strings are output as hexadecimal strings.
I have gotten quite far, so if someone can investigate what I'm doing
wrong in InternalWriteText() it would really help.
The file hpdf_encoder_utf8.c is uploaded to this discussion group.

To enable UTF-8 input strings (Delphi code):
  pdf_doc := HPDF_New(@hpdf_error_handler, self);
  if (pdf_doc = nil) then
    raise Exception.Create('HPDF ERROR: Error creating document');

  if (HPDF_UseUTF8Encoding(pdf_doc) <> HPDF_OK) then
    raise Exception.Create('HPDF ERROR: Error creating document (load utf-8 encoder)');

  if (HPDF_SetCurrentEncoder(pdf_doc, 'UTF-8') <> HPDF_OK) then
    raise Exception.Create('HPDF ERROR: Error creating document (set utf-8 encoder)');

  if (HPDF_IsCurrentEncoderUTF8(pdf_doc) <> HPDF_TRUE) then
    raise Exception.Create('HPDF ERROR: Error creating document (is current encoder utf-8)');

  // HPDF_Page_TextOut expects the page handle returned by HPDF_AddPage
  // (font selection and HPDF_Page_BeginText/EndText are omitted here)
  pdf_page := HPDF_AddPage(pdf_doc);
  HPDF_Page_TextOut(pdf_page, 40, 40, PChar(UTF8Encode('Text with special characters')));

Regards,
Mirco

*** Changes I made ***

The HPDF_PageAttr_Rec structure does not contain mmgr and encoder
fields. I need these fields to be able to adjust the
InternalWriteText() function, so I added those 2 members; otherwise
4) will never work:

1) File hpdf_pages.h
typedef struct _HPDF_PageAttr_Rec {
    HPDF_Pages parent;
    HPDF_MMgr mmgr; //ADDED
    HPDF_Encoder encoder; //ADDED
    HPDF_Dict fonts;
    HPDF_Dict xobjects;
    HPDF_Dict ext_gstates;
    HPDF_GState gstate;
    HPDF_Point str_pos;
    HPDF_Point cur_pos;
    HPDF_Point text_pos;
    HPDF_TransMatrix text_matrix;
    HPDF_UINT16 gmode;
    HPDF_Dict contents;
    HPDF_Stream stream;
    HPDF_Xref xref;
    HPDF_UINT compression_mode;
    HPDF_PDFVer *ver;
} HPDF_PageAttr_Rec;

HPDF_Page
HPDF_Page_New (HPDF_MMgr mmgr,
               HPDF_Xref xref,
               HPDF_Encoder encoder); //ADDED

2) File hpdf_pages.c
HPDF_Page
HPDF_Page_New (HPDF_MMgr mmgr,
               HPDF_Xref xref,
               HPDF_Encoder encoder)
{
    HPDF_STATUS ret;
    HPDF_PageAttr attr;
    HPDF_Page page;

    HPDF_PTRACE((" HPDF_Page_New\n"));

    page = HPDF_Dict_New (mmgr);
    if (!page)
        return NULL;

    page->header.obj_class |= HPDF_OSUBCLASS_PAGE;
    page->free_fn = Page_OnFree;
    page->before_write_fn = Page_BeforeWrite;

    attr = HPDF_GetMem (page->mmgr, sizeof(HPDF_PageAttr_Rec));
    if (!attr) {
        HPDF_Dict_Free (page);
        return NULL;
    }

    page->attr = attr;
    HPDF_MemSet (attr, 0, sizeof(HPDF_PageAttr_Rec));
    attr->gmode = HPDF_GMODE_PAGE_DESCRIPTION;
    attr->cur_pos = HPDF_ToPoint (0, 0);
    attr->text_pos = HPDF_ToPoint (0, 0);

    ret = HPDF_Xref_Add (xref, page);
    if (ret != HPDF_OK)
        return NULL;

    attr->gstate = HPDF_GState_New (page->mmgr, NULL);
    attr->contents = HPDF_DictStream_New (page->mmgr, xref);
    attr->encoder = encoder; //ADDED
    attr->mmgr = mmgr; //ADDED

    if (!attr->gstate || !attr->contents)
        return NULL;

    attr->stream = attr->contents->stream;
    attr->xref = xref;

    /* add required elements */
    ret += HPDF_Dict_AddName (page, "Type", "Page");
    ret += HPDF_Dict_Add (page, "MediaBox", HPDF_Box_Array_New (page->mmgr,
                HPDF_ToBox (0, 0, HPDF_DEF_PAGE_WIDTH,
                HPDF_DEF_PAGE_HEIGHT)));
    ret += HPDF_Dict_Add (page, "Contents", attr->contents);

    ret += AddResource (page);

    if (ret != HPDF_OK)
        return NULL;

    return page;
}

3) File hpdf_doc.c, functions HPDF_InsertPage and HPDF_AddPage
Changed the call to HPDF_Page_New:
    page = HPDF_Page_New (pdf->mmgr, pdf->xref, pdf->cur_encoder);

4) File hpdf_page_operator.c, function InternalWriteText(); here is
the new function:
static HPDF_STATUS
InternalWriteText (HPDF_PageAttr attr,
                   const char *text)
{
    HPDF_FontAttr font_attr = (HPDF_FontAttr)attr->gstate->font->attr;
    HPDF_STATUS ret;
    HPDF_String str;
    HPDF_Stream memstream;
    HPDF_BYTE *membytes;
    HPDF_UINT membytes_len;

    HPDF_PTRACE ((" InternalWriteText\n"));

    str = HPDF_String_New(attr->mmgr, text, attr->encoder);
    if (!str)
        return HPDF_FAILD_TO_ALLOC_MEM;

    if (font_attr->type == HPDF_FONT_TYPE0_TT ||
        font_attr->type == HPDF_FONT_TYPE0_CID) {
        ret = HPDF_String_Write(str, attr->stream, NULL);
        HPDF_String_Free(str);
        return ret;
    }

    memstream = HPDF_MemStream_New(attr->mmgr, HPDF_TEXT_DEFAULT_LEN);
    if (!memstream)
    {
        HPDF_String_Free(str);
        return HPDF_FAILD_TO_ALLOC_MEM;
    }

    ret = HPDF_String_Write(str, memstream, NULL);
    if (ret == HPDF_OK)
    {
        membytes = HPDF_MemStream_GetBufPtr(memstream, 0, &membytes_len);
        if (membytes)
            //Strip leading '<' and trailing '>'
            ret = HPDF_Stream_WriteEscapeText2(attr->stream,
                                               &membytes[1], membytes_len-2);
        else
            ret = HPDF_INVALID_STREAM;
    }

    HPDF_Stream_Free(memstream);
    HPDF_String_Free(str);
    return ret;
}



Mirco

Jul 9, 2009, 11:02:23 AM
to libHaru
(Update #3)
I now understand where that hexstring is coming from:
HPDF_String_Write produces it.
So I decoded the hexstring and called HPDF_Stream_WriteEscapeText2(),
but this is not the solution.

I don't know what a UTF-16BE string looks like in a PDF.
Currently the output is:
(\376\377\000\351\000\353\000\350\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337\000\337)
Tj

Does anybody know what this should be?
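
For comparison, the same bytes can be written in either of PDF's two string notations: a literal string in parentheses with backslash escapes, or a hexadecimal string in angle brackets. The octal escapes in the output above decode to FE FF 00 E9 00 EB 00 E8 00 DF ..., that is, a byte order mark followed by UTF-16BE code units. The sketch below uses two hypothetical helpers (they are not libHaru functions) to show both notations for a given byte sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write bytes as a PDF literal string "(...)". Parentheses and backslashes
 * are backslash-escaped; control and non-ASCII bytes become 3-digit octal.
 * Returns the string length, or 0 if `cap` is too small. */
static size_t pdf_literal_string(const unsigned char *bytes, size_t len,
                                 char *out, size_t cap)
{
    size_t i, n = 0;

    assert(bytes != NULL && out != NULL);
    if (cap < 4 * len + 3)        /* worst case: "\ooo" per byte + "()" + NUL */
        return 0;

    out[n++] = '(';
    for (i = 0; i < len; i++) {
        unsigned char c = bytes[i];

        if (c == '(' || c == ')' || c == '\\') {
            out[n++] = '\\';
            out[n++] = (char)c;
        } else if (c < 0x20 || c >= 0x7F) {
            n += (size_t)sprintf(out + n, "\\%03o", (unsigned)c);
        } else {
            out[n++] = (char)c;
        }
    }
    out[n++] = ')';
    out[n] = '\0';
    return n;
}

/* Write the same bytes as a PDF hex string "<...>", two digits per byte.
 * Returns the string length, or 0 if `cap` is too small. */
static size_t pdf_hex_string(const unsigned char *bytes, size_t len,
                             char *out, size_t cap)
{
    static const char digits[] = "0123456789ABCDEF";
    size_t i, n = 0;

    assert(bytes != NULL && out != NULL);
    if (cap < 2 * len + 3)        /* '<' + 2 digits per byte + '>' + NUL */
        return 0;

    out[n++] = '<';
    for (i = 0; i < len; i++) {
        out[n++] = digits[bytes[i] >> 4];
        out[n++] = digits[bytes[i] & 0x0F];
    }
    out[n++] = '>';
    out[n] = '\0';
    return n;
}
```

Fed the bytes FE FF 00 E9, the first helper produces (\376\377\000\351) and the second produces <FEFF00E9>. Both denote the same string; which glyphs a viewer shows for those bytes is decided by the font's encoding, not by the notation.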

Regards,
Mirco

*** Changed InternalWriteText() ***

static HPDF_STATUS
InternalWriteText (HPDF_PageAttr attr,
                   const char *text)
{
    HPDF_FontAttr font_attr = (HPDF_FontAttr)attr->gstate->font->attr;
    HPDF_STATUS ret;
    HPDF_String str;
    HPDF_Stream memstream;
    HPDF_BYTE *membytes;
    HPDF_UINT membytes_len;
    HPDF_UINT idx, idxout;
    HPDF_BYTE nibble;

    HPDF_PTRACE ((" InternalWriteText\n"));

    str = HPDF_String_New(attr->mmgr, text, attr->encoder);
    if (!str)
        return HPDF_FAILD_TO_ALLOC_MEM;

    if (font_attr->type == HPDF_FONT_TYPE0_TT ||
        font_attr->type == HPDF_FONT_TYPE0_CID) {
        ret = HPDF_String_Write(str, attr->stream, NULL);
        HPDF_String_Free(str);
        return ret;
    }

    memstream = HPDF_MemStream_New(attr->mmgr, HPDF_TEXT_DEFAULT_LEN);
    if (!memstream)
    {
        HPDF_String_Free(str);
        return HPDF_FAILD_TO_ALLOC_MEM;
    }

    ret = HPDF_String_Write(str, memstream, NULL);
    if (ret == HPDF_OK)
    {
        membytes = HPDF_MemStream_GetBufPtr(memstream, 0, &membytes_len);
        if (membytes)
        {
            //Strip leading '<' and trailing '>'
            membytes++;
            membytes_len-=2;

            //Decode MEMBYTES in place, this is an uppercase hexstring '0' .. 'F'
            idx = 0;
            idxout = 0;
            while (idx+1 < membytes_len)
            {
                nibble = membytes[idx] - 0x30;
                if (nibble > 0x09) nibble -= 0x07;
                membytes[idxout] = nibble<<4;
                idx++;

                nibble = membytes[idx] - 0x30;
                if (nibble > 0x09) nibble -= 0x07;
                membytes[idxout] += nibble;
                idx++;

                idxout++;
            }

            ret = HPDF_Stream_WriteEscapeText2(attr->stream,
                                               membytes, idxout);

r.chris...@gmail.com

Jul 9, 2009, 11:27:18 AM
to libHaru
Mirco,

I know that there is a lot of interest in a Unicode solution for
libharu. When I initially looked at this, it seemed that the consensus
was that UTF-8 was the way to go - but the big question is what does a
UTF-8 PDF really look like? There wasn't much help from the Adobe
Reference document, but my assumption is that UTF-8 is treated as a
variant of multi-byte encodings and fonts. If I'm correct then you
should be using the HPDF_FONT_TYPE0_CID type font, with appropriate
CMaps that are accessed by the multi-byte codes. Basically UTF-8
should be handled in a way similar to Shift-JIS, the Japanese
multi-byte standard.

If that's the case, then it seems to me that you should look inside an
UTF-8 PDF or a Japanese PDF that uses a font type that supports
HPDF_FONT_TYPE0_CID to see how the bytes are written.

Good luck and keep us posted on your progress please.

Chris Worsley

Mirco

Jul 10, 2009, 4:39:44 AM
to libHaru
(Update #4)
I found a bug in hpdf_encoder_utf8.c. I fixed it, and the new version is
uploaded to this discussion group.
I'm starting to understand what should be done in the PDF output:
<FEFF...> is the correct hexstring.
I adjusted InternalWriteText once again.

Now the input is UTF-8 and the output is UTF-16BE, but the glyphs are
not correct and the spacing between characters is also not correct.
I'm guessing that something should be done with the map table of
a font. This is where my knowledge ends.

Regards,
Mirco

********

Summarizing my changes to implement UTF-8 input strings:
1) Implemented hpdf_encoder_utf8.c

2) In hpdf_pages.h, added 2 fields to the struct _HPDF_PageAttr_Rec:
    HPDF_MMgr mmgr;
    HPDF_Encoder encoder;

3) Adjusted the HPDF_Page_New function (hpdf_pages.h and hpdf_pages.c):
HPDF_Page
HPDF_Page_New (HPDF_MMgr mmgr,
               HPDF_Xref xref,
               HPDF_Encoder encoder)
{
    /* ... existing body unchanged ... */

    //ADDED
    attr->encoder = encoder;
    attr->mmgr = mmgr;

    /* ... */
}

4) In hpdf_doc.c, adjusted the calls to HPDF_Page_New in the functions
HPDF_AddPage() and HPDF_InsertPage():
    page = HPDF_Page_New (pdf->mmgr, pdf->xref, pdf->cur_encoder);

5) Adjusted the function InternalWriteText() in hpdf_page_operator.c:
static HPDF_STATUS
InternalWriteText (HPDF_PageAttr attr,
                   const char *text)
{
    HPDF_FontAttr font_attr = (HPDF_FontAttr)attr->gstate->font->attr;
    HPDF_STATUS ret;
    HPDF_String str;

    HPDF_PTRACE ((" InternalWriteText\n"));

    if ((attr->encoder) &&
        (attr->encoder->type == HPDF_ENCODER_TYPE_DOUBLE_BYTE))
    {
        str = HPDF_String_New(attr->mmgr, text, attr->encoder);
        if (!str)
            return HPDF_FAILD_TO_ALLOC_MEM;

        ret = HPDF_String_Write(str, attr->stream, NULL);
        HPDF_String_Free(str);
        return ret;
    }
    else
    {
        if (font_attr->type == HPDF_FONT_TYPE0_TT ||
            font_attr->type == HPDF_FONT_TYPE0_CID) {
            if ((ret = HPDF_Stream_WriteStr (attr->stream, "<")) != HPDF_OK)
                return ret;

            if ((ret = HPDF_Stream_WriteBinary (attr->stream,
                        (HPDF_BYTE *)text,
                        HPDF_StrLen (text, HPDF_LIMIT_MAX_STRING_LEN),
                        NULL)) != HPDF_OK)
                return ret;

            return HPDF_Stream_WriteStr (attr->stream, ">");
        }

        return HPDF_Stream_WriteEscapeText (attr->stream, text);
    }
}

6) Testing:
    pdf = HPDF_New(error_handler, NULL);
    HPDF_UseUTF8Encoding(pdf);
    HPDF_SetCurrentEncoder(pdf, "UTF-8");
    ...
Add pages etc.
Strings passed to the text output functions (HPDF_Page_ShowText, etc.)
must be UTF-8.



Mirco

Jul 14, 2009, 1:45:22 AM
to libHaru
(Update #5)
Chris Yocum sent me an email stating that the PDF spec defines a
"/ToUnicode" CMap (PDF spec p. 292). This could be the cause of the
wrong glyphs and wrong character spacing.

In hpdf_encoder_utf8.c the function UTF8_Encoder_Write_Func is empty.
I think this is the place where such a CMap has to be written. When I
look at the other encoders, there is a huge array of numbers. I'm
assuming this is a "/ToUnicode" CMap conversion array.

In this discussion group there was a previous post about Unicode
output, and I found a file hpdf_encoder_uni.c somewhere. This file
contains a huge map table. I'm going to investigate this.

Regards,
Mirco

Mirco

Jul 15, 2009, 7:04:39 AM
to libHaru
(Update #6)
After searching the internet, it appears that "/Encoding /Identity-H"
should be used somewhere.
The /ToUnicode entry is meant for text searching, not for text display.
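
To make the distinction concrete: with /Identity-H the two-byte codes in the content stream are used directly as CIDs, while a ToUnicode CMap separately tells a viewer which Unicode value each code stands for (for searching and copying). Below is a minimal sketch of formatting one such mapping as a bfchar entry. The helper name format_bfchar and the sample code value are hypothetical; nothing here is libHaru API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format one ToUnicode "bfchar" entry: <code> <UTF-16BE of the codepoint>.
 * Returns the number of characters written, or -1 on invalid input or if
 * the buffer is too small.
 */
static int format_bfchar(unsigned code, unsigned long cp, char *out, size_t cap)
{
    int n;

    assert(out != NULL);
    if (code > 0xFFFF || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    if (cp >= 0x10000) {
        /* Outside the BMP: the destination is a surrogate pair. */
        unsigned long v = cp - 0x10000;

        n = snprintf(out, cap, "<%04X> <%04lX%04lX>",
                     code, 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF));
    } else {
        n = snprintf(out, cap, "<%04X> <%04lX>", code, cp);
    }
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}
```

For example, format_bfchar(0x0024, 0x0041, buf, sizeof buf) produces <0024> <0041>, meaning that code 0x0024 should be extracted as U+0041 ('A').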

At this point I'm not programming. I have a sample PDF generated with
the UTF-8 encoder, and I'm trying to fix this PDF using a text editor so
I know what should be done.

Mirco

Jul 15, 2009, 10:29:49 AM
to libHaru
(Update #7)
It's getting too complex for me, so I'm stopping here.
I'm looking at the PoDoFo open source project. This library has
support for UTF-8 input strings. It lacks a C API, but I can
work around that by providing my own. I don't need all the
functionality of PoDoFo; just some text and a picture and I'm done.

Looking inside the PoDoFo sources, they are really clean and well
written.

Maybe somebody else will take up the gauntlet here and implement UTF-8
input strings.