Improvements around Unicode and other


Furjuk Mityos

Aug 11, 2013, 10:55:16 PM
to lib...@googlegroups.com
Hi, all!
I'm a new member.

I have made some fairly large changes, as follows.

Improvements around Unicode:
* Unicode supplementary planes support.
* UTF-16/UTF-32 support.
* New UTF encodings.
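
For readers new to supplementary planes, here is a minimal sketch of how a code point above U+FFFF becomes a UTF-16 surrogate pair. It is illustrative only and is not taken from the patch; the name `encode_utf16` is hypothetical and not a libHaru function.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper, not part of libHaru: encode one Unicode code point
 * as UTF-16 code units. Returns the number of units written (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* out of range or a surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: a single code unit */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

For example, U+1F600 encodes as the pair 0xD83D 0xDE00, while any BMP character stays a single unit.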

Improvements around font:
* CID text and ToUnicode map not only for UTF-8 but for all CMapEncoders.
* TrueType font loading options to select CID text and ToUnicode map.
* Text converter for UTF encodings.
** UTF character encoding converter.
** User defined converter.
** Built-in converter "BiDi" for right-to-left languages such as Arabic, using GNU FriBiDi.
* Relief font for CID/TrueType font.

Improvements of HPDF_Page_TextRect:
* Hyphenation by SHY (soft hyphen, U+00AD) embedded in text.
* Line break by ZWSP (zero width space, U+200B) embedded in text.
* Kanji justification.
* HPDF_TALIGN_JUSTIFY adjusts not only char-space but also word-space and Arabic kashida.
* Kashida by ARABIC TATWEEL (U+0640) embedded in text, using "BiDi" converter.
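
As a rough illustration of the SHY and ZWSP items above, the sketch below scans UTF-8 text for the two break markers. It is only an assumption about how such a scan could look, not the code in the patch, and `next_break` and `enum break_kind` are hypothetical names.

```c
#include <stddef.h>

/* Kinds of break opportunity that can be embedded in the text. */
enum break_kind { BREAK_NONE = 0, BREAK_SHY, BREAK_ZWSP };

/* Hypothetical helper, not libHaru API: scan UTF-8 text starting at byte
 * offset `pos` and return the offset of the next SHY (U+00AD, bytes C2 AD)
 * or ZWSP (U+200B, bytes E2 80 8B). *kind reports which one was found.
 * If neither occurs, returns len and sets *kind to BREAK_NONE. */
size_t next_break(const unsigned char *s, size_t len, size_t pos,
                  enum break_kind *kind)
{
    for (size_t i = pos; i < len; i++) {
        if (i + 1 < len && s[i] == 0xC2 && s[i + 1] == 0xAD) {
            *kind = BREAK_SHY;
            return i;
        }
        if (i + 2 < len && s[i] == 0xE2 && s[i + 1] == 0x80 &&
            s[i + 2] == 0x8B) {
            *kind = BREAK_ZWSP;
            return i;
        }
    }
    *kind = BREAK_NONE;
    return len;
}
```

In valid UTF-8 the bytes 0xC2 and 0xE2 only ever appear as lead bytes, so a plain byte scan does not produce false matches. A layout loop would typically draw a visible hyphen when it breaks at a SHY and draw nothing at a ZWSP.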

The attachment is sample output.
HPDF_Page_ShowText, HPDF_Page_TextRect and HPDF_Page_CreateTextAnnot are each used once.
In case you want to look at the contents, the PDF was generated without HPDF_SetCompressionMode and then compressed with zip.

I hope these changes can be merged into the official libHaru.
How should I go about it?
Can I commit to GitHub?

converter_demo.zip

Koen Deforche

Aug 13, 2013, 10:20:12 AM
to lib...@googlegroups.com
Hey Furjuk,


On Monday, August 12, 2013 4:55:16 AM UTC+2, Furjuk Mityos wrote:
> I have made some fairly large changes, as follows.
>
> Improvements around Unicode:
> * Unicode supplementary planes support.
> * UTF-16/UTF-32 support.
> * New UTF encodings.

That's great to hear! You did start from the latest git version and not the last release? UTF-8 support (and a number of other things) has changed since then. But surely the UTF support could still use improvement.

> Improvements around font:
> * CID text and ToUnicode map not only for UTF-8 but for all CMapEncoders.
> * TrueType font loading options to select CID text and ToUnicode map.
> * Text converter for UTF encodings.
> ** UTF character encoding converter.
> ** User defined converter.
> ** Built-in converter "BiDi" for right-to-left languages such as Arabic, using GNU FriBiDi.
> * Relief font for CID/TrueType font.
>
> Improvements of HPDF_Page_TextRect:
> * Hyphenation by SHY (soft hyphen, U+00AD) embedded in text.
> * Line break by ZWSP (zero width space, U+200B) embedded in text.
> * Kanji justification.
> * HPDF_TALIGN_JUSTIFY adjusts not only char-space but also word-space and Arabic kashida.
> * Kashida by ARABIC TATWEEL (U+0640) embedded in text, using "BiDi" converter.
>
> The attachment is sample output.
> HPDF_Page_ShowText, HPDF_Page_TextRect and HPDF_Page_CreateTextAnnot are each used once.
> In case you want to look at the contents, the PDF was generated without HPDF_SetCompressionMode and then compressed with zip.
>
> I hope these changes can be merged into the official libHaru.
> How should I go about it?
> Can I commit to GitHub?

Please do. You can open a pull request to the libharu git version.

Btw, the PDF indeed looks good in Acrobat Reader, but it crashes Preview on Mac OS X. Any idea?
Do you treat GNU FriBiDi as an optional dependency?

Regards,
koen



 

Furjuk Mityos

Aug 13, 2013, 11:26:07 PM
to lib...@googlegroups.com
Hi,


> Btw, the PDF indeed looks good in Acrobat Reader, but it crashes Preview on Mac OS X. Any idea?

I have heard that some viewers can only display PDFs that use the "Identity" CMap.
The attached file was created without HPDF_FONTOPT_WITH_CID_MAP (a new TrueType font loading option), so its text is encoded as CIDs with the "Identity" CMap.
It seems word-space is applied only at the 1-byte code 0x20, not at the 2-byte 0x0020 or at a CID, so we cannot use word-space in CID text, which is slightly ugly.

> Do you treat GNU FriBiDi as an optional dependency?

It is optional: compile with -D LIBHPDF_ENABLE_BIDI to enable it.
The configure script should add LIBHPDF_ENABLE_BIDI to config.h, but that is not done yet.
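
To make the optional dependency concrete, here is a minimal sketch of how calling code could check at compile time whether the library was built with the flag. Only the macro name LIBHPDF_ENABLE_BIDI comes from this thread; `bidi_available` and `choose_converter` are hypothetical names and not part of libHaru.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if this translation unit was compiled
 * with -D LIBHPDF_ENABLE_BIDI, 0 otherwise. */
int bidi_available(void)
{
#ifdef LIBHPDF_ENABLE_BIDI
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical caller: only ask for the "BiDi" converter when it was
 * compiled in, and return NULL otherwise so the caller can skip it. */
const char *choose_converter(void)
{
    return bidi_available() ? "BiDi" : NULL;
}
```

With a guard like this, an application built against a library without FriBiDi can skip the converter instead of failing at run time.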

converter_demo.zip

Antony Dovgal

Aug 14, 2013, 5:37:10 AM
to lib...@googlegroups.com
On 2013-08-12 06:55, Furjuk Mityos wrote:
> Hi, all!
> I'm a new member.
>
> I have done slightly big changes as following.

Sounds great!
Are these changes available somewhere on Github?
Can you send a pull request?

--
Wbr,
Antony Dovgal
---
http://pinba.org - realtime profiling for PHP

Furjuk Mityos

Aug 26, 2013, 12:27:46 PM
to lib...@googlegroups.com
Hi,

Further improvements:
* Mix of HPDF_WMODE_HORIZONTAL and HPDF_WMODE_VERTICAL in starting and relief fonts.
* Vertical alignment options of HPDF_Page_TextRect (Improvement of Davyjones' ideas).
* Interlinear annotation by IAA (interlinear annotation anchor, U+FFF9), IAS (interlinear annotation separator, U+FFFA), IAT (interlinear annotation terminator, U+FFFB) embedded in text.

Please check pull request #43, mtmtysdfrjk/libharu.

cidtext.pdf
utf8text.pdf

Franco Marchesini

Aug 27, 2013, 4:12:02 AM
to lib...@googlegroups.com
Hi,

FYI, I tried to open the files with the Preview app on OS X 10.8.4 (Mountain Lion).

cidtext.pdf opens without problems.
utf8text.pdf crashes the Preview app.
With Acrobat Pro 8.0 I can repair and open utf8text.pdf.

Regards
Franco



2013/8/26 Furjuk Mityos <mtmty...@gmail.com>


Furjuk Mityos

Aug 27, 2013, 11:30:44 AM
to lib...@googlegroups.com
Hi,

Thank you for trying.

I knew that some viewers cannot display UTF-8 (or other normal) text with a custom CID map.
You can select CID text or normal text with the new TrueType loading option.
At first I thought CID text should be the default because many viewers can display it, but I found that we cannot use word-space in CID text, so normal text must be the default for backward compatibility.


On Tuesday, August 27, 2013 at 5:12:02 PM UTC+9, frankit60 wrote:

Kevin Xi

Aug 28, 2013, 5:40:42 AM
to lib...@googlegroups.com
Hello Furjuk,

First of all, thanks a lot for your contributions; I am now using your code in my project. I have exactly the same requirement in my project: I need to generate PDF files dynamically without knowing the language of the content.

1) All my input strings can be encoded in UTF-8, but they could be in Japanese, French, English, etc. Do I need to load all the fonts in your sample code to ensure they are displayed well in the output PDF file? (Let's assume that my Japanese customer has a Japanese font installed on the computer.)

2) I hit a strange issue while using your code: I have to comment out 'HPDF_Font_PushBuiltInConverter' (2 places), otherwise the process quits when execution reaches this line (not a crash, the process just quits). I am using VS 2008 on Windows. Although I didn't see any big problem after commenting out the 2 lines, I still want to confirm whether it is an issue in the code.

PS: I am new to this group, as well as to the PDF format.

Thank you!
Kevin

Koen Deforche

Aug 27, 2013, 3:02:31 PM
to libHaru
Hey Furjuk,

2013/8/27 Furjuk Mityos <mtmty...@gmail.com>:
> Hi,
>
> Thank you for trying.
>
> I knew some viewer cannot view utf8 (or other normal) text with custom CID
> map.
> You can select CID text or normal text by new TT loading option.
> At first I thought CID text should be default because many viewers can view,
> but I found we cannot use word-space in CID text, so normal text must be
> default for backward compatibility.

Perhaps it's important to clarify what is what (for the record and for
the ignorant).

The utf8 file defines a custom encoding that corresponds to UTF-8, and
encodes the Unicode text in the bytestream using UTF-8 encoding, is
that correct?

I did the previous (limited) implementation of unicode support in
libharu and would be happy to see a more mature solution in place (and
you seem to have made many other improvements too). But it was also my
experience that defining a UTF-8 encoding in the PDF file broke some
viewers (such as Preview) and that's why we settled with the 'identity
encoding' (i.e. UCS-2/UCS-4) in the byte stream which we got to work
on all viewers. I believe that crashing a PDF viewer (that is default
on MacOSX) kills the UTF-8 approach in practical terms?

Therefore I think the method used in cidtext.pdf file should be the
default, or what downside is there to notice? How does the
'word-space' issue manifest itself? Is the cidtext.pdf file using
Identity-H encoding similar to what the previous implementation did?
Backwards compatibility would rather point at CID map instead of UTF-8
encoding; is this what you mean?

Regards,
koen

Furjuk Mityos

Aug 30, 2013, 1:17:30 PM
to lib...@googlegroups.com
Hi,

Thank you for trying.

> 1) All my input strings can be encoded in UTF-8, but they could be in Japanese, French, English, etc. Do I need to load all the fonts in your sample code to ensure they are displayed well in the output PDF file? (Let's assume that my Japanese customer has a Japanese font installed on the computer.)

You must choose fonts that can display your text.
A TrueType glyph ID is 16 bits, whereas Unicode (counting only the characters already defined) needs about 17.5 bits, so no single TrueType font can display every Unicode character.
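
The arithmetic behind this can be sketched as follows. This is an illustration only: it uses the whole Unicode code space as an upper bound rather than the count of assigned characters mentioned above, and `min_fonts_for` is a hypothetical helper, not a libHaru function.

```c
#include <stdint.h>

/* Glyph IDs in a TrueType font are 16-bit values, so one font can
 * address at most 65,536 distinct glyph IDs. */
enum { MAX_GLYPH_IDS = 1 << 16 };

/* The Unicode code space runs from U+0000 to U+10FFFF, i.e. 17 planes
 * of 65,536 code points each. Many of these are unassigned. */
enum { UNICODE_CODE_POINTS = 0x10FFFF + 1 };

/* Hypothetical helper: the minimum number of fonts needed to give each
 * of n distinct characters its own glyph. */
uint32_t min_fonts_for(uint32_t n)
{
    return (n + MAX_GLYPH_IDS - 1) / MAX_GLYPH_IDS;   /* ceiling division */
}
```

Anything beyond 65,536 distinct characters therefore needs at least two fonts, which is why an application has to pick fonts per script or language.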
 
> 2) I hit a strange issue while using your code: I have to comment out 'HPDF_Font_PushBuiltInConverter' (2 places), otherwise the process quits when execution reaches this line (not a crash, the process just quits). I am using VS 2008 on Windows. Although I didn't see any big problem after commenting out the 2 lines, I still want to confirm whether it is an issue in the code.

Perhaps you compiled without the "BiDi" converter. If so, this is the expected result, because the "BiDi" converter does not exist in that build.
If you don't need to display right-to-left languages such as Arabic or Hebrew, you don't need the "BiDi" converter.
If you do need to display right-to-left languages, get GNU FriBiDi and compile it, then recompile libHaru with the preprocessor symbol LIBHPDF_ENABLE_BIDI defined.

Furjuk Mityos

Aug 30, 2013, 1:42:36 PM
to lib...@googlegroups.com
Hi,

The issue that some viewers cannot display PDFs with a custom CID map affects not only the UTF encodings but all CMap encodings, including legacy encodings such as GB-EUC-H. Therefore I added a font option to select CID text, so that it can also be selected with legacy encodings.
On the other hand, word-space works with legacy encodings but not with CID text. I think this incompatibility would be unacceptable if CID text were the default.

Kevin Xi

Sep 19, 2013, 11:07:02 AM
to lib...@googlegroups.com
Hello Furjuk,

First of all, thanks for your answers. I won't need BiDi, so I think it is fine that I commented out the 2 lines.

I have been using your patch in my Windows app since the end of last month and it works perfectly. I loaded 3~4 fonts for all the languages I need to support, and I can write all the different languages into one PDF file. Awesome.

Recently I hit some new issues while using the same code on Mac. I have a problem loading Japanese / Korean font files (like AppleMyungjo.ttf or STHeiti Light.ttc): they can't be loaded via HPDF_LoadTTFontFromFile or HPDF_LoadTTFontFromFile2, and the error code reported by the error handler is HPDF_TTF_INVALID_FOMAT.

I am really not familiar with the underlying theory, so I wonder if you can help and give me some advice. Thanks in advance. PS: the reason I chose these fonts is that they are the default fonts used by the Mac system (according to Wikipedia).

@Antony

I have a separate question regarding C vs C++. When I compile the source with CMake, I see lots of warnings about typedef redefinitions. The problem for me is that if I use libharu in an Obj-C class, these warnings are treated as compile errors. For now, as a workaround, I just use '.mm' instead of '.m'; it seems typedef redefinition is okay in Obj-C++. Do you have any insight on this issue? I didn't look into all the header files; maybe it is an easy change to remove the duplicated definitions across these header files?

Thank you
Kevin

Kevin Xi

Sep 19, 2013, 11:52:22 AM
to lib...@googlegroups.com
Hello Furjuk,

Just a follow-up: I used 'Arial Unicode.ttf' as the CJK font, and it should fit my requirement. I noticed a slight difference between these font files. 'Arial Unicode' and 'Simsun' are both OpenType & TrueType fonts, and it seems libharu can only open this type of font file. If the font is only a TrueType font, like 'AppleMyungjo.ttf', then it can't be loaded...

Thanks,
Kevin

Sergiu Oprean

Jan 21, 2016, 3:47:34 AM
to libHaru
Hi Kevin/Furjuk,

I have a similar issue with loading a ttf font which is only a TrueType font. Have you found a solution for it?
In hpdf_fontdef_tt.c, in ParseCMap(), I get platformID = 1, encodingID = 0, format = 0.
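
For anyone debugging the same thing, the sketch below shows how the encoding records of a raw 'cmap' table can be checked for a given (platformID, encodingID, format) triple. The table layout follows the TrueType specification. `cmap_has_subtable` is a hypothetical helper and not libHaru code, and which triples the loader actually accepts is not stated in this thread; a common expectation is a Windows Unicode subtable (platformID 3, encodingID 1, format 4), but treat that as an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Read big-endian integers, as used throughout TrueType tables. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper, not libHaru code: given the raw bytes of a 'cmap'
 * table, return 1 if it contains a subtable with the requested
 * platformID, encodingID and format, else 0. */
int cmap_has_subtable(const uint8_t *cmap, size_t len,
                      uint16_t platform, uint16_t encoding, uint16_t format)
{
    if (len < 4)
        return 0;
    uint16_t num_tables = be16(cmap + 2);      /* follows the uint16 version */
    for (uint16_t i = 0; i < num_tables; i++) {
        size_t rec = 4 + (size_t)i * 8;        /* each encoding record is 8 bytes */
        if (rec + 8 > len)
            return 0;                          /* truncated record list */
        uint32_t off = be32(cmap + rec + 4);   /* subtable offset from table start */
        if (off > len || len - off < 2)
            continue;                          /* subtable lies out of bounds */
        if (be16(cmap + rec) == platform &&
            be16(cmap + rec + 2) == encoding &&
            be16(cmap + off) == format)        /* subtable begins with its format */
            return 1;
    }
    return 0;
}
```

A font that only reports (1, 0, 0), as in the values above, carries just the Macintosh byte-encoding subtable, which maps at most 256 character codes.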

thank you,
Sergiu

chahat bhatia

Apr 21, 2016, 5:20:28 AM
to libHaru
Hi Furjuk,
Glad to know you did so much.
I started using libHaru a while ago and I'm facing a problem with TextRect():
the line break issue you mentioned. I saw it handled in the PDFs you attached.
I just wanted to ask how you did that. Could you upload the code or share it somehow?