Helping you make UTF-8 RTL languages work


utf8rtlhelper

Jan 5, 2011, 6:40:06 PM
to dompdf
Hi,
I noticed you are planning to add UTF-8 character support in the next
release.

I think it's important for you to know that this feature will be almost
unique (to the best of my knowledge, no other product works with
these languages) for the hundreds of millions of people who write in
Arabic, Hebrew, Persian, Urdu, etc. It will certainly help them a lot,
so I decided to help you help us.

Since you probably don't use these languages, I thought I could help
a bit by testing your product and making sure it works. It will also
give developers who use your product for this purpose something to
follow.

I'm going to try my best to make it work, and once I'm done others will
have a sample to use.

Let's start with Persian (my mother tongue).

So I installed dompdf_0-6-0_beta1.

Here is the one setting I changed (my full code follows further down):
[code]
define("DOMPDF_UNICODE_ENABLED", true);
[/code]


I have downloaded farsifonts-0.4.zip from here:
http://www.farsiweb.ir/wiki/Persian_fonts

I decided to use roya.ttf because of its beauty.

I used this tool to generate the font files in the format your product
uses:
http://eclecticgeek.com/dompdf/load_font.php

I used royab.ttf to generate the bold font.
I copied all the following generated files in the lib/fonts folder:
dompdf_font_family_cache.sample
roya.afm
roya.ttf
roya.ufm
royab.afm
royab.ttf
royab.ufm


[code]
// HTML test document: the same Latin and Persian text in three font families.
$html = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\">
<html>
<head>
<meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-8\">
<META HTTP-EQUIV=\"CONTENT-LANGUAGE\" CONTENT=\"fa\">

<title>Tutorial: HelloWorld</title>
<style>
body { direction: rtl; } /* the CSS property is direction, not dir */
h1 { font-size: 70px; font-family: Courier; }
h2 { font-size: 70px; font-family: Roya; }
h3 { font-size: 70px; font-family: Arial; }
</style>
</head>
<body>
<h1>hello world</h1>
<h2>hello world</h2>
<h3>hello world</h3>
<h1>سلام دنیا</h1>
<h2>سلام دنیا</h2>
<h3>سلام دنیا</h3>
</body>
</html>";

// Put the dompdf library on the include path (folder name kept as posted).
set_include_path(APPLICATION_PATH . "../../library/dompdf-0.5.1"
    . PATH_SEPARATOR . get_include_path());
require_once 'dompdf_config.inc.php';

// Let the Zend autoloader fall back to dompdf's own autoload function.
$autoloader = Zend_Loader_Autoloader::getInstance();
$autoloader->pushAutoloader('DOMPDF_autoload');

// Render the HTML and send the PDF to the browser.
$dompdf = new DOMPDF();
$dompdf->set_paper("a4", "portrait");
$dompdf->load_html($html);
$dompdf->set_base_path($_SERVER['DOCUMENT_ROOT']);
$dompdf->render();
$dompdf->stream("document.pdf");
die();
[/code]

And when I run my code, the following file is generated:
PersianRTLDocument.pdf which I uploaded to the files section:
http://groups.google.com/group/dompdf/web/PersianRTLDocument.pdf

As you can see in the file, Persian characters are replaced with
question marks (?).

When you open the file with an editor, you don't see a "roya" (the
name of the font used) anywhere, instead there are F1, F2, and F3
which happen to be Times-Roman, Courier-Bold, and Times-Bold.

Also the encoding seems to be WinAnsiEncoding, which is obviously ANSI
and not Unicode, I think it should rather be AFMEncoding.

So what am I doing wrong here?
Thanks

BrianS

Jan 7, 2011, 1:55:58 PM
to dompdf
> I copied all the following generated files in the lib/fonts folder:
> ...
> When you open the file with an editor, you don't see a "roya" (the
> name of the font used) anywhere, instead there are F1, F2, and F3
> which happen to be Times-Roman, Courier-Bold, and Times-Bold.

You're on the right track for using the font, except for one missed
step. After you copy the contents of the archive you receive from the
font prep tool, rename "dompdf_font_family_cache.sample" to
"dompdf_font_family_cache". Otherwise dompdf will not know of your
font mapping.
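
If you would rather script the rename than do it by hand, a minimal sketch could look like the following. The function name `activate_font_cache` and the `$fontDir` parameter are hypothetical (they are not part of dompdf), and the type hints assume PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of dompdf: rename the sample font mapping
// produced by the font prep tool so that dompdf can pick it up.
// $fontDir is an assumption; point it at your own lib/fonts folder.
function activate_font_cache(string $fontDir): bool
{
    $sample = $fontDir . '/dompdf_font_family_cache.sample';
    $target = $fontDir . '/dompdf_font_family_cache';

    if (!is_file($sample)) {
        return false; // nothing to rename
    }
    if (is_file($target)) {
        return false; // never overwrite an existing font mapping
    }
    return rename($sample, $target);
}
```

The sketch deliberately refuses to overwrite an existing mapping file, so any fonts already registered there are not lost.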

> Also the encoding seems to be WinAnsiEncoding, which is obviously ANSI
> and not Unicode, I think it should rather be AFMEncoding.

This is the default encoding used by dompdf. The PDF spec is fairly
limited in the encodings it supports. Put simply, the spec only allows
for Latin1-style encodings or custom encodings. dompdf will support two
encoding types in the next release, WinAnsi and Unicode, which should
be enough for most people.

The reason you're not seeing Unicode in your file is that the roya
font has not been loaded. Unicode will be used if a supported font is
loaded with the document. Otherwise you'll get WinAnsi.
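
To see which strings in a document depend on the Unicode font being loaded, a quick pre-check along these lines may help. This is only a sketch: `contains_rtl_script` is a hypothetical name, not a dompdf function, and it relies on PCRE's Unicode script properties being available in your PHP build.

```php
<?php
// Hypothetical pre-check, not part of dompdf: report whether a UTF-8 string
// contains Arabic-script letters (Arabic, Persian, Urdu) or Hebrew letters.
// Such text cannot be represented in WinAnsi and needs a Unicode font.
function contains_rtl_script(string $text): bool
{
    // \p{Arabic} and \p{Hebrew} are Unicode script properties; /u enables UTF-8.
    return preg_match('/[\p{Arabic}\p{Hebrew}]/u', $text) === 1;
}
```

A string for which this returns true will come out as question marks whenever the font mapping is missing and dompdf falls back to WinAnsi.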

Here's a sample of the output when using the latest code:
http://eclecticgeek.com/dompdf/index.php?input_file=a1f459325917f0ea.htm


RTL support is not yet built in to dompdf. I don't expect it to be a
terribly difficult problem to solve, but we have other issues we're
trying to address. Plus, no one on the team actually uses an RTL
language (that I know). We'd be happy to involve you once we begin to
address this feature.

utf8rtlhelper

Jan 8, 2011, 11:48:51 PM
to dompdf
Thank you BrianS, that was very helpful.

I renamed "dompdf_font_family_cache.sample" to
"dompdf_font_family_cache", and as you said, the font
problem seems to be solved.

However, in my PDF and in the one you provided, the Persian "Hello world"
is not printed correctly. The correct output is "سلام دنیا", but in the
generated PDF the letters are separated and in reverse order.

Separated: "س ل ا م د ن ی ا"
Reversed: "ا ی ن د م ا ل س"

In the Arabic (العربية), Urdu (اردو) and Persian (فارسی) scripts, letters
connect to each other, and if they do not, people will not be able to
read the text. In Hebrew (עִבְרִית), letters are separate, as in
Latin-script languages.

I created a PDF file from the HTML code I provided using a browser,
in order to show what the correct output should look like:
http://groups.google.com/group/dompdf/web/PersianRTLDocument2.pdf

Is there anything I could do about that?


> RTL support is not yet built in to dompdf. I don't expect it to be a
> terribly difficult problem to solve, but we have other issues we're
> trying to address. Plus, no one on the team actually uses an RTL
> language (that I know). We'd be happy to involve you once we begin to
> address this feature.

I would love to help.

Thanks

BrianS

Jan 10, 2011, 3:04:37 PM
to dompdf
The problem with the reversed text is something we'll have to address
in the code. As I mentioned, I don't expect it to be a difficult
problem, but we'll know more when we add support for RTL.
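
For illustration only, here is a minimal sketch of the simplest possible reordering: reversing the code points of a string that is entirely RTL, so that a renderer which places glyphs left to right shows them in the right visual order. `reverse_code_points` is a hypothetical helper, not dompdf code. It does not handle mixed-direction text, numbers, or combining marks; those need the full Unicode Bidirectional Algorithm (UAX #9).

```php
<?php
// Hypothetical helper, not part of dompdf: reverse a UTF-8 string by code
// point. strrev() would not work here because it reverses bytes and would
// break multi-byte characters.
function reverse_code_points(string $text): string
{
    // Split into individual code points; the /u flag makes PCRE UTF-8 aware.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($chars));
}
```

Applying the function twice returns the original string, which makes it easy to check that no characters are damaged along the way.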

As for the separated text ... I'm not sure about a solution to that
problem. This isn't an area I've had a chance to research at all.
We'll take a closer look at the problem as we delve into the RTL
issue.
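
As background on the separated letters: in Arabic-script text each letter takes an isolated, initial, medial, or final form depending on its neighbours, and a renderer has to choose that form before drawing the glyph. The sketch below only works out which form each letter needs. It is not dompdf code: `joining_forms` is a hypothetical name, the joining table is an illustrative subset covering just the letters of the sample phrase, and the mandatory lam-alef ligature and the actual glyph substitution are left out.

```php
<?php
// Hypothetical sketch, not part of dompdf: decide which contextual form
// (isolated, initial, medial, final) each Arabic-script letter needs.
function joining_forms(string $text): array
{
    // Illustrative subset of joining types. 'D' joins on both sides;
    // 'R' joins only to the letter before it (the letter on its right).
    $types = [
        "\u{0633}" => 'D', // seen
        "\u{0644}" => 'D', // lam
        "\u{0627}" => 'R', // alef
        "\u{0645}" => 'D', // meem
        "\u{062F}" => 'R', // dal
        "\u{0646}" => 'D', // noon
        "\u{06CC}" => 'D', // Farsi yeh
        "\u{064A}" => 'D', // Arabic yeh
    ];

    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $last = count($chars) - 1;
    $forms = [];

    foreach ($chars as $i => $char) {
        if (!isset($types[$char])) {
            $forms[] = 'none'; // space, or a letter outside the table
            continue;
        }
        $prev = $i > 0 ? $chars[$i - 1] : '';
        $next = $i < $last ? $chars[$i + 1] : '';

        // The previous letter must be able to join forward (type D).
        $joinsPrev = isset($types[$prev]) && $types[$prev] === 'D';
        // This letter must be able to join forward, and a letter must follow.
        $joinsNext = $types[$char] === 'D' && isset($types[$next]);

        if ($joinsPrev && $joinsNext) {
            $forms[] = 'medial';
        } elseif ($joinsPrev) {
            $forms[] = 'final';
        } elseif ($joinsNext) {
            $forms[] = 'initial';
        } else {
            $forms[] = 'isolated';
        }
    }
    return $forms;
}
```

A complete solution would then map each letter and form to the matching glyph, for example through the Unicode Arabic Presentation Forms blocks or the font's own shaping tables.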