Cannot generate PDF that includes Chinese characters

mrob

Jan 17, 2017, 10:05:10 PM
to DITA-OT Users
Several people have raised questions about this and I have a similar issue.

Using DITA-OT 1.8, I need to include some Chinese and special European characters in an English-language document, like this:

<p>For example: <codeph xml:lang = "fr_FR">€</codeph>, <codeph xml:lang = "zh_CN">中国</codeph>.</p>

and:

<codeblock> ... <codeph xml:lang = "zh_CN">中国</codeph> ... </codeblock>


However, it doesn't work. I get nothing for the euro symbol, and ## for the Chinese characters.

I've checked that the relevant fonts are installed on my workstation.

My font-mappings.xml file includes the following:

    <aliases>
      <alias name="SimSun">SimSun</alias>
    </aliases>

...

    <logical-font name="Sans">
    ...
      <physical-font char-set="Simplified Chinese">
        <font-face>SimSun</font-face>
      </physical-font>
...
    <logical-font name="Serif">
    ...
      <physical-font char-set="Simplified Chinese">
        <font-face>SimSun</font-face>
      </physical-font>

(Same for monospaced, tagline, and narrow, which I added).

As a test, I tried adding "xml:lang = "zh_CN"" to the main <topic> for the whole file, but it doesn't help and messes up a bunch of other things. The document is 99.99% English, so I don't want to use zh_CN as a default.

I've read a few discussions by other people about the same problem, but I can't figure out how to resolve the issue.

Any suggestions on how I can debug this?

Toshihiko Makita

Jan 18, 2017, 12:12:19 PM
to DITA-OT Users
Hi mrob,

If you author the following topic and publish it via PDF2, the following XSL-FO file will be generated by default.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_s3x_dfn_qy">
  <title>Testing xml:lang and font setting</title>
  <body>
    <p>For example: <codeph xml:lang = "fr-FR">€</codeph>, <codeph xml:lang = "zh-CN">中国</codeph>.</p>
  </body>
</topic>

[PDF2 XSL-FO: topic.fo]
<fo:block font-size="10pt" start-indent="25pt">
    <fo:block space-after="0.6em" space-before="0.6em" text-indent="0em">For example: <fo:inline font-size="10pt" line-height-shift-adjustment="disregard-shifts" font-family="Courier New, Courier">€</fo:inline>, <fo:inline font-size="10pt" line-height-shift-adjustment="disregard-shifts" font-family="Courier New, Courier">中国</fo:inline>.</fo:block>
  </fo:block>

As you can see, no proper font for Simplified Chinese is assigned. This is because the PDF2 plug-in was not originally designed to select fonts by the @xml:lang attribute at all.

In contrast, the following XSL-FO file is generated by the PDF5-ML plug-in.

<fo:block space-before="1.5122em - 3.8673mm">For example: <fo:inline font-size="0.9em" line-height="normal" font-family="monospace" color="black" hyphenate="false" axf:word-break="break-all" axf:word-wrap="break-word" xml:lang="fr-FR">€</fo:inline>, <fo:inline font-size="0.9em" line-height="normal" font-family="Courier New,SimHei" color="black" hyphenate="false" axf:word-break="break-all" axf:word-wrap="break-word" xml:lang="zh-CN">中国</fo:inline>.</fo:block>

The "SimHei" font is assigned according to @xml:lang="zh-CN". ("monospace" is the generic font family defined in the XSL-FO specification; the formatting software will assign an appropriate font for it.)
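
To check whether the formatter itself can find and embed the font, independently of any DITA-OT processing, a standalone XSL-FO file can be formatted directly. The following is only a minimal sketch: the page master name and page dimensions are arbitrary choices, and whether the formatter falls back to the second font per character depends on the formatting software in use.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal standalone test document; the master name "page" and the A4 size are arbitrary. -->
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page"
                           page-height="297mm" page-width="210mm" margin="20mm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <!-- Same font-family list as in the PDF5-ML output above -->
      <fo:block>For example:
        <fo:inline font-family="Courier New, SimHei" xml:lang="zh-CN">中国</fo:inline>.
      </fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```

If the characters still come out as ## when this file is formatted, the problem is likely in the formatter's font configuration rather than in the plug-in stylesheets.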

The resulting PDF looks as follows:

[Image: formatted PDF output, not preserved]

The PDF5-ML plug-in fully supports mapping @xml:lang to fonts (or other styles). If you are interested in PDF5-ML, please visit the following GitHub repository:

AntennaHouse/pdf5-ml

Regards,

-- 
/*--------------------------------------------------
 Toshihiko Makita
 Development Group. Antenna House, Inc. Ina Branch
 Web site:
 http://www.antenna.co.jp/
 http://www.antennahouse.com/
 --------------------------------------------------*/ 

mrob

Jan 20, 2017, 4:41:22 AM
to DITA-OT Users
Hi,

Thanks for your help. The PDF5-ML plug-in looks interesting.

However, I have already spent a lot of time creating a custom plug-in based on org.dita.pdf2. To use PDF5-ML, it seems I would need to basically re-build my custom plug-in from scratch. I might be able to do this in the future, but it will take more time than I can manage right now. (Well, I wish I had known about PDF5-ML earlier.)

As a short-term solution, I'm wondering if there is any other way to do this (embed Chinese in a PDF that is English by default) by customizing the org.dita.pdf2 plug-in.

Thanks,

mrob


mrob

Jan 25, 2017, 3:09:10 AM
to DITA-OT Users
I am still stuck on this issue, and now a bit confused about whether the DITA-OT is actually capable of handling more than one language.

Makita-san notes that "the PDF2 plug-in is originally not designed to select fonts by @xml:lang attribute", but the DITA-OT User's Guide says: "The DITA-OT uses the values for the @xml:lang, @translate, and @dir attributes that are set in the source content to provide globalization support."

So, which is it? Does this feature actually work, or not? If not, if the documentation lies, then what should I do?

Is it possible to hack my custom plug-in to do this conversion in a brute-force way?

Thanks,

mrob

Toshihiko Makita

Jan 25, 2017, 7:32:54 AM
to DITA-OT Users
Hi mrob,

I don't know the details of your situation. But if you do need to output Chinese text in an otherwise English document, the following method may be effective as a temporary solution.

  • Author the XSL-FO properties directly in a DITA attribute (typically @outputclass). Introducing a dedicated global attribute may be better.
  • Customize your plug-in by adding a module that copies the values of that attribute into XSL-FO properties in the output.
  • This method is effective when the i18n post-processing is disabled. (Confirmed in DITA-OT 2.4.2 by setting org.dita.pdf2.i18n.enabled=false. It does not work in DITA-OT 1.8.5.)
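
The third point can be sketched as an Ant build file that calls the DITA-OT build and passes the property. This is only a sketch: the project name, the target name, the dita.dir location, and the input map sample.ditamap are hypothetical placeholders. Only the org.dita.pdf2.i18n.enabled property comes from the list above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical wrapper build; adjust dita.dir and args.input to your installation. -->
<project name="build-pdf-no-i18n" default="pdf" basedir=".">
  <!-- Placeholder path to the DITA-OT installation directory -->
  <property name="dita.dir" location="/path/to/dita-ot"/>
  <target name="pdf">
    <ant antfile="${dita.dir}/build.xml">
      <!-- Placeholder input map -->
      <property name="args.input" location="sample.ditamap"/>
      <property name="transtype" value="pdf"/>
      <!-- Disable the PDF2 i18n font post-processing -->
      <property name="org.dita.pdf2.i18n.enabled" value="false"/>
    </ant>
  </target>
</project>
```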
[Authoring]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "topic.dtd">
<topic id="topic_s3x_dfn_qy">
  <title>Testing xml:lang and font setting</title>
  <body>
    <p>For example: <codeph xml:lang="fr-FR" outputclass="style:font-family:Arial;">€</codeph>,
      <codeph xml:lang="zh-CN" outputclass="style:font-family:SimHei;font-size:2em;color:red;">中国</codeph>.</p>
  </body>
</topic>


[PDF2 customization]
org.dita.pdf2\Customization\fo\xsl\custom.xsl
<?xml version='1.0'?>
<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

    <xsl:include href="dita2fo_getFoStyle.xsl"/>
    <xsl:include href="dita2fo_custom_pr_domain.xsl"/>
    
</xsl:stylesheet>

org.dita.pdf2\Customization\fo\xsl\dita2fo_getFoStyle.xsl

<?xml version='1.0'?>
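<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:axf="http://www.antennahouse.com/names/XSL/Extensions"
    xmlns:ahf="http://www.antennahouse.com/names/XSLT/Functions/Document"
    exclude-result-prefixes="xs ahf"
    version="2.0">

    <!-- Assumption: the ahf namespace URI above is illustrative; it only needs to be
         identical in every stylesheet that uses the ahf prefix.
         The axf prefix must be declared so that "axf:" attribute names can be generated. -->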

    <!-- 
         function: Expand FO style & property into attribute()*
         param: prmElem
         return: Attribute node
         note: XSL-FO attributes are authored in $prmElem/@outputclass using CSS notation prefixed with "style:".
    -->
    <xsl:template name="ahf:getFoProperty" as="attribute()*">
        <xsl:param name="prmElem" required="no" as="element()" select="."/>
        <xsl:sequence select="ahf:getFoProperty($prmElem)"/>
    </xsl:template>
    
    <!-- 
         function: Expand FO property into attribute()*
         param: prmElem
         return: Attribute node
         note: XSL-FO attributes are authored in $prmElem/@outputclass using CSS notation prefixed with "style:".
    -->
    <xsl:function name="ahf:getFoProperty" as="attribute()*">
        <xsl:param name="prmElem" as="element()"/>
        
        <xsl:choose>
            <xsl:when test="exists($prmElem/@outputclass) and starts-with(string($prmElem/@outputclass),'style:')">
                <xsl:variable name="foAttr" as="xs:string" select="normalize-space(substring(string($prmElem/@outputclass),7))"/>
                <xsl:for-each select="tokenize($foAttr, ';')">
                    <xsl:variable name="propDesc" select="normalize-space(string(.))"/>
                    <xsl:choose>
                        <xsl:when test="not(string($propDesc))"/>
                        <xsl:when test="contains($propDesc,':')">
                            <xsl:variable name="propName" as="xs:string">
                                <xsl:variable name="tempPropName" as="xs:string" select="normalize-space(substring-before($propDesc,':'))"/>
                                <xsl:variable name="axfExt" as="xs:string" select="'axf-'"/>
                                <xsl:choose>
                                    <xsl:when test="starts-with($tempPropName,$axfExt)">
                                        <xsl:sequence select="concat('axf:',substring-after($tempPropName,$axfExt))"/>
                                    </xsl:when>
                                    <xsl:otherwise>
                                        <xsl:sequence select="$tempPropName"/>
                                    </xsl:otherwise>
                                </xsl:choose>
                            </xsl:variable>                            
                            <xsl:variable name="propValue" as="xs:string" select="normalize-space(substring-after($propDesc,':'))"/>
                            <xsl:choose>
                                <xsl:when test="not(string($propName))"/>
                                <!--"castable as xs:NAME" can be used only in Saxon PE or EE.
                                    If $propName is not a valid name, the xsl:attribute instruction will fail!
                                 -->
                                <!--xsl:when test="$propName castable as xs:NAME"-->
                                <xsl:when test="true()">
                                    <xsl:attribute name="{$propName}" select="$propValue"/>
                                </xsl:when>
                            </xsl:choose>                            
                        </xsl:when>
                        <xsl:otherwise>
                            <xsl:message select="concat('[getFoProperty] Missing '':'' in style description. @outputclass=''',$foAttr,''' @xtrc=''',string($prmElem/@xtrc),''' @xtrf=''',string($prmElem/@xtrf),'''')"/>
                        </xsl:otherwise>
                    </xsl:choose>
                </xsl:for-each>
            </xsl:when>
            <xsl:otherwise>
                <xsl:sequence select="()"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:function>

</xsl:stylesheet>
org.dita.pdf2\Customization\fo\xsl\dita2fo_custom_pr_domain.xsl

<?xml version='1.0'?>
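<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ahf="http://www.antennahouse.com/names/XSLT/Functions/Document"
    exclude-result-prefixes="xs ahf"
    version="2.0">

    <!-- Assumption: the ahf namespace URI above is illustrative; it must match the one
         used in dita2fo_getFoStyle.xsl. -->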

    <xsl:template match="*[contains(@class,' pr-d/codeph ')]">
        <fo:inline xsl:use-attribute-sets="codeph">
            <xsl:call-template name="commonattributes"/>
            <xsl:call-template name="ahf:getFoProperty"/>
            <xsl:apply-templates/>
        </fo:inline>
    </xsl:template>

</xsl:stylesheet>

[The formatting result]



Several of my users who use the old PDF5 plug-in, which supports only one language per document, have adopted this method.
However, I am not sure whether this will solve your problem, or whether setting org.dita.pdf2.i18n.enabled=false will affect your plug-in.

mrob

Jan 26, 2017, 4:01:10 AM
to DITA-OT Users
Thanks for your help with this.

I tried what you've suggested, and it half-works because the red color and larger point size of the font appear in my PDF output. However, I'm still not seeing the euro or Chinese characters. Basically, same situation.

It's still unclear to me if DITA-OT can really support multiple languages as the documentation indicates, or if the documentation is wrong, or ... ?

If anybody else involved in the DITA-OT project is reading this, please don't hesitate to weigh in.

Thanks,

mrob