On Tue, Apr 14, 2015 at 12:05:04PM -0700, Glenn Linderman wrote:

6-7 weeks with no response, for a while I thought the list was dead, but now a flurry of messages.... I guess I didn't actually ask a question, but is this, like kerning, thought to be too slow to implement, or is it just that the market for ReportLab simply doesn't include languages that don't have precomposed glyphs, or something else?

When Vika and I originally implemented Unicode + TrueType support in ReportLab, we didn't implement support for combining characters. I don't remember whether the TTF/PDF specifications available at that time included such support. I guess nobody has stepped up to add the missing support since then.

Technical details (which might be wrong if the code has changed since 2003, which it probably has; I wasn't keeping track): ReportLab takes apart the TTF and builds multiple fonts, each containing a subset of the original glyphs (up to 256). These subsets discard any and all TTF tables not explicitly copied, which I guess includes the tables used for rendering combining characters in a nice way.
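For readers less familiar with the combining-character issue: a precomposed character is a single codepoint (and usually a single glyph), while the general case is a base letter followed by one or more combining marks that have to be positioned over it. A small illustrative sketch with Python's standard unicodedata module (the sample characters are arbitrary):

    import unicodedata

    # Precomposed form: one codepoint, typically one glyph in the font.
    precomposed = "\u00e3"                    # LATIN SMALL LETTER A WITH TILDE

    # Canonical decomposition: base letter plus combining mark.
    for ch in unicodedata.normalize("NFD", precomposed):
        print("U+%04X %s (combining class %d)"
              % (ord(ch), unicodedata.name(ch), unicodedata.combining(ch)))

    # Some combinations have no precomposed form at all, so normalization
    # cannot help; someone has to position the combining mark over the base.
    open_o_tilde = "\u0254\u0303"             # OPEN O + COMBINING TILDE
    print(unicodedata.normalize("NFC", open_o_tilde) == open_o_tilde)  # True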
I guess if anyone is to blame for this, it's me as ReportLab's founder.
The closest we got to even half-understanding the problem was about 6 years ago, when an Arabic-speaking employee with a little knowledge of Farsi took a look. Unfortunately we are not rendering raster graphics on screen. We are trying to work out the right font descriptors and sequences of bytes to put in the PDF file so that the right stuff magically happens on screen. When I did that with Japanese in about 2002-2003, with the advantages that (a) I can read and write the language and (b) there is no special layout at all, it still took a month of reverse-engineering other people's PDFs. Since we don't know any of these languages, it's probably a big job, and we have not had any volunteers from the open source community, nor any customers willing to pay for the R&D.
I don't think it's a performance issue like kerning. I would sincerely hope that one just has to put the right byte sequences into the PDF and that the font sorts it out for you.
Glenn, my apologies - I had assumed you were discussing "unusual languages" without re-reading the original email carefully. It might not be that bad. There are two things we could do in the short term, and I'm keen to keep the core library moving forwards: (1) We could potentially provide a special flowable for kerned titles and short phrases. This would of course have to render a glyph at a time in Python, doing the kerning lookups and calculations itself (a rough sketch of what that might involve follows after point (2)).
(2) If you can find another open source PDF generator in any language which gets it right, and let us know, we can study a "hello world" PDF out of that tool and see what it does. This would be a big time saver.
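Here is the rough sketch promised above for (1): a hedged, minimal illustration (not an existing ReportLab feature) of the per-glyph work such a flowable would have to do. It reads pair kerning from a font's legacy 'kern' table with fontTools and draws a short title one character at a time; fonts whose kerning lives only in GPOS would need a GPOS reader instead, and the font file name is just an assumption:

    # Hedged sketch: per-pair kerning for a short title. Assumes the TTF has
    # a legacy 'kern' table; kerning that lives only in GPOS is not read here.
    from fontTools.ttLib import TTFont as FTFont
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from reportlab.pdfgen import canvas

    FONT_FILE, FONT_NAME, SIZE = "DejaVuSans.ttf", "DejaVuSans", 36
    pdfmetrics.registerFont(TTFont(FONT_NAME, FONT_FILE))

    ft = FTFont(FONT_FILE)
    cmap = ft.getBestCmap()                       # codepoint -> glyph name
    upem = ft["head"].unitsPerEm
    pairs = {}
    if "kern" in ft:
        for sub in ft["kern"].kernTables:
            pairs.update(getattr(sub, "kernTable", {}))   # (L, R) -> font units

    def kern(a, b):
        """Kerning adjustment in points between two characters (0 if none)."""
        return pairs.get((cmap.get(ord(a)), cmap.get(ord(b))), 0) * SIZE / upem

    c = canvas.Canvas("kerned_title.pdf")
    c.setFont(FONT_NAME, SIZE)
    x, y, text = 72, 720, "AVATAR"                # AV/VA pairs usually kern
    for i, ch in enumerate(text):
        if i:
            x += kern(text[i - 1], ch)            # negative values close the gap
        c.drawString(x, y, ch)
        x += pdfmetrics.stringWidth(ch, FONT_NAME, SIZE)
    c.save()

Doing this one glyph at a time is obviously slower than emitting whole strings, which is why restricting it to headings and short phrases seems sensible.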
Wordaxe does support automatic hyphenation and kerning.
See the SVN trunk (current revision is 110) at
http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/
However, I failed to make it work with RL's ImageAndFlowables class.
That's why I did not release an official new version.
For an example with kerning support, see the file
http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/tests/test_truetype.py
I agree with Andy that kerning slows the paragraph-wrapping process down, so personally I would only use it for headings and titles, not for the main text content.
On 4/15/2015 2:02 AM, Andy Robinson wrote:
Glenn, my apologies - I had assumed you were discussing "unusual languages" without re-reading the original email carefully. It might not be that bad.
2. Composite glyph positioning
Regarding composite characters made from multiple glyphs, the only scheme I can now find to adjust the Y position is described at the very end of this link: https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html That shows the use of the Td operator to adjust both the X & Y position between glyphs, but doesn't show how to calculate X & Y from font metrics. It would seem that only linear kerning was a concern, and was optimized in operators, when the PDF format was designed (since it predates Unicode). The idea of composing glyphs on the fly probably hadn't crossed any English-speaking minds back then. The first couple of paragraphs at that link hint at that likelihood.
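To make that concrete, here is a minimal ReportLab sketch of the kind of hand positioning those Td/Tm adjustments imply. The horizontal offset below is a hand-tuned guess, not a value derived from font metrics; deriving it properly from the font's GPOS anchor data is exactly the missing piece being discussed. DejaVuSans and the file paths are just assumptions for the example:

    # Hedged sketch: hand-placing a combining tilde (U+0303) over an open-o
    # (U+0254). ReportLab's text object emits Tm/Td operators under the hood;
    # the point is that the *generator* computes the offsets, not the viewer.
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from reportlab.pdfgen import canvas

    pdfmetrics.registerFont(TTFont("DejaVuSans", "DejaVuSans.ttf"))  # assumed path
    SIZE = 36
    base, mark = "\u0254", "\u0303"

    c = canvas.Canvas("combining.pdf")
    t = c.beginText(72, 700)
    t.setFont("DejaVuSans", SIZE)
    w_base = pdfmetrics.stringWidth(base, "DejaVuSans", SIZE)

    t.textOut(base)
    # Naive, hand-tuned offset: pull the mark back over the base glyph.
    # A correct implementation would take this from GPOS mark-to-base anchors.
    t.setTextOrigin(72 + 0.25 * w_base, 700)
    t.textOut(mark)

    c.drawText(t)
    c.save()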
Speculation: Maybe there is some mechanism to create composite glyphs from the individual glyphs for the composite character codes, and embed that composite glyph in the PDF and use its internal code instead of positioning them in the stream via the Td operator... but I haven't found that... only a few things that seemed to hint at it. While Unicode didn't do that, because of the character code explosion that would result, any given PDF only needs to deal with the characters (individual or composite) actually used in any particular document. So there _might_ be a tradeoff between complexity of font embedding versus the complexity of font display.
Who is responsible for glyph positioning? I believe it is the font plus the renderer that are responsible.
This means that many things that developers working in other file formats take for granted, such as just putting down Unicode codepoints and letting the renderer do all the hard work, have to be done manually with PDF.
Well, I guess the way to go is:

1) try an experiment to see if PDF renderers will accept the GPOS information in a specific font and make good use of it. I guess we can use Illustrator or equivalent to make a sample document. Examining the DejaVuSans font shows it certainly has GPOS information.

2) If the answer to 1 is yes, then we'll need to parse the GPOS information and construct subsets that keep the required pairs together. From my understanding of the way PDF uses text I see little hope of constructing a single font that does this for all glyphs in a simple way (section 3.2.3 of the 1.7 PDF spec says "A string object consists of a series of bytes—unsigned integer values in the range 0 to 255"), so we're apparently limited to encodings of length 256 or less. Presumably we'll have to be really smart about constructing our encodings if many glyph+diacritic pairs are used.
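For anyone who wants to try step 1, the GPOS data mentioned above can be inspected directly with the fontTools library (a separate project, not part of ReportLab). The following is a rough sketch, assuming DejaVuSans.ttf is available locally, that dumps the mark-to-base anchors a renderer (or ReportLab itself) would need for positioning combining marks:

    # Hedged sketch: dump mark-to-base anchors (GPOS lookup type 4) from
    # DejaVuSans with fontTools. This only reads the data; rebuilding a
    # renumbered GPOS table for ReportLab's subsets would be a further step.
    from fontTools.ttLib import TTFont

    font = TTFont("DejaVuSans.ttf")          # assumption: font file is local
    gpos = font["GPOS"].table

    for lookup in gpos.LookupList.Lookup:
        lookup_type, subtables = lookup.LookupType, lookup.SubTable
        if lookup_type == 9:                 # extension lookups wrap the real one
            lookup_type = subtables[0].ExtensionLookupType
            subtables = [st.ExtSubTable for st in subtables]
        if lookup_type != 4:                 # 4 = MarkToBase attachment
            continue
        for st in subtables:
            for glyph, rec in zip(st.MarkCoverage.glyphs, st.MarkArray.MarkRecord):
                a = rec.MarkAnchor
                print("mark %-20s class %d anchor (%d, %d)"
                      % (glyph, rec.Class, a.XCoordinate, a.YCoordinate))
            for glyph, rec in zip(st.BaseCoverage.glyphs, st.BaseArray.BaseRecord):
                for cls, a in enumerate(rec.BaseAnchor):
                    if a is not None:
                        print("base %-20s class %d anchor (%d, %d)"
                              % (glyph, cls, a.XCoordinate, a.YCoordinate))

Each base glyph carries one anchor per mark class; positioning a mark is then a matter of translating its mark anchor onto the matching base anchor, scaled from font units to the text size.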
........
The problem is that ReportLab doesn't embed the font directly. Instead
it constructs multiple subsets (each with < 256 codepoints), and those
subsets constructed by ReportLab do not have GPOS information (check the
TTFontFile.makeSubset method to see what TTF tables are copied and how
they're transformed; my apologies about the terrible code you'll find
therein).
The GPOS table cannot be copied directly: subsetting changes glyph
numbering, so the GPOS table would have to be taken apart and
reconstructed with the renumbered glyphs.
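As an aside on the renumbering problem: the fontTools library already implements exactly this kind of glyph-renumbering subset, including rebuilding GPOS. This is not what ReportLab does today, but a sketch like the following could be used to produce a test subset and see whether viewers honour the retained layout tables (the font file and codepoints are just the ones from this thread's example):

    # Hedged sketch (not ReportLab's current behaviour): let fontTools build a
    # glyph subset while keeping and renumbering the GPOS/GSUB/GDEF tables,
    # instead of TTFontFile.makeSubset which drops them.
    from fontTools import subset
    from fontTools.ttLib import TTFont

    font = TTFont("DejaVuSans.ttf")          # assumption: font file is local

    options = subset.Options()
    options.layout_features = ["*"]          # keep all OpenType layout features
    options.glyph_names = True

    subsetter = subset.Subsetter(options)
    # keep open-o, epsilon, a-tilde and the combining tilde from the sample text
    subsetter.populate(unicodes=[0x0254, 0x025B, 0x00E3, 0x0303])
    subsetter.subset(font)                   # glyphs renumbered, GPOS rebuilt

    font.save("DejaVuSans-subset.ttf")
    print("GPOS" in font)                    # True if the layout table survived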
On 20/04/2015 11:54, Glenn Linderman wrote:
On 4/20/2015 2:20 AM, Robin Becker wrote:
..........
1) try an experiment to see if PDF renderers will accept the GPOS information
in a specific font and make good use of it. I guess we can use illustrator or
equivalent to make a sample document. Examining the dejaVuSans font shows it
certainly has GPOS information.
Maybe. The attempt will also be instructive regarding how Illustrator might
handle such combined characters... if it does (I don't have Illustrator to test
with, but since it is from Adobe, it well might)... and what the generated PDF
looks like... if it contains positioning instructions, or depends on the PDF
display tools to have a good renderer.
Well, unfortunately the Illustrator test produced exactly the wrong results; I copied the text from my sample DejaVuSans output into an Illustrator text box with the font set to DejaVuSans Book. Illustrator, or the copy and paste, did exactly the wrong thing and converted only those pairs that are in the font already.
I also tried hand-typing an A and then selecting a diacritic from Illustrator's text/glyph window. The characters were sort of composed in the input window, but they were not well displayed, and looked the same in a saved PDF. Our own output actually looked better for this case.
Result: indecisive. I will have to do further work to test the actual embedded font to see if it contains GPOS info.
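One way to do that further check is to pull the embedded font programs back out of the PDF and look at their table directories. A rough sketch, assuming the pikepdf and fontTools libraries and using the Acrobat sample file from this thread:

    # Hedged sketch: scan a PDF for embedded TrueType font programs (the
    # /FontFile2 streams in font descriptors) and report whether each one
    # still carries a GPOS table. pikepdf is an arbitrary choice here; any
    # PDF library that exposes raw streams would do.
    import io
    import pikepdf
    from fontTools.ttLib import TTFont

    with pikepdf.open("openo-Acrobat.pdf") as pdf:   # file name from the thread
        for obj in pdf.objects:
            if not isinstance(obj, pikepdf.Dictionary):
                continue
            if "/FontFile2" not in obj:              # font descriptors only
                continue
            data = obj.FontFile2.read_bytes()
            ttf = TTFont(io.BytesIO(data))
            print(obj.get("/FontName"),
                  "tables:", sorted(ttf.keys()),
                  "-> GPOS present" if "GPOS" in ttf else "-> no GPOS")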
Meantime, seeing your approach of looking at Illustrator output, I had a friend with Acrobat take my little test string and create a PDF from Acrobat. Results look good, and are at: http://nevcal.com/temporary/openo-Acrobat.pdf Maybe seeing what they do will help. The file is big enough that they must have embedded something or other, font-wise.

Glenn, could you ask your friend exactly what they did with Acrobat to create this? i.e. did they use Acrobat Distiller to convert a PostScript file, or create a Word document and export it to PDF using Acrobat? If we can observe another program "doing it right" it may help.
http://nevcal.com/temporary/openo-Nuance.pdf

He also tried the just-released Nitro 10, but it failed to create from the clipboard, failed to create from a UTF-8 text file, failed to create from a plain text file, and failed to create from a Word document... he has submitted a bug report, and is probably busy reinstalling the prior version of Nitro. If Nitro 9 will do the job, that might give another file in the near future. Actually, the reason he has so many PDF tools is that most of them have limitations: some are good for one sort of thing, but mess up on other things; another will do the other things, but not something else; etc. He mostly uses the editing features to fine-tune and work around limitations in the PDF creation from other programs, rather than using them to create raw PDF files from other formats.

So, counting the input characters for my sample, there are 8 base/precomposed characters and 4 combining diacriticals, for a total of 12.

The text stream from Acrobat is as follows:

15 0 obj << /Length 455 >> stream
BT /P <</MCID 0 >>BDC /CS0 cs 0.2 0 0.2 scn
/TT0 1 Tf 12 -0 0 12 72 709.2 Tm ( )Tj
/C2_0 1 Tf 36 -0 0 36 72 672.72 Tm <0727>Tj
/TT0 1 Tf 0.443 0 Td (\343)Tj
/C2_0 1 Tf 0.443 0 Td <0727>Tj
0.447 -0.17 Td <047A>Tj
/TT0 1 Tf -0.003 0.17 Td (\325)Tj
/C2_0 1 Tf 0.723 0 Td <0690>Tj
0.557 0.047 Td <047A>Tj
0.11 -0.047 Td <072D072D>Tj
0.853 -0.17 Td <047A>Tj
-0.013 0.17 Td <0699>Tj
0.473 0.047 Td <047A>Tj
/TT0 1 Tf 12 -0 0 12 218.16 672.72 Tm ( )Tj
EMC ET
endstream endobj

The text stream from Nuance is as follows:

7 0 obj << /Length 600 >> stream
0.1999 0 0.1999 rg []0 d 1 w 10 M 0 i 0 J 0 j
BT /F0 35.029 Tf 1 0 0 1 28.789 774.789 Tm 0 Tc 0 Tw 0 Tr 100 Tz 0 Ts ( ')Tj
1 0 0 1 44.268 774.789 Tm (\000m)Tj ET
BT /F0 35.029 Tf 1 0 0 1 59.868 774.789 Tm ( ')Tj
1 0 0 1 75.467 769.03 Tm ( z)Tj ET
BT /F1 35.029 Tf 0.9999 0 0 0.9999 75.778 774.782 Tm ( )Tj ET
BT /F0 35.029 Tf 1 0 0 1 100.787 774.789 Tm ( )Tj
1 0 0 1 120.106 776.469 Tm ( z)Tj
1 0 0 1 124.066 774.789 Tm ( -)Tj
1 0 0 1 138.825 774.789 Tm ( -)Tj
1 0 0 1 153.825 769.03 Tm ( z)Tj ET
BT /F0 35.029 Tf 1 0 0 1 153.585 774.789 Tm ( )Tj
1 0 0 1 170.145 776.469 Tm ( z)Tj ET
endstream endobj

I was rather surprised to see that Nuance had control characters in the Tj parameters. Acrobat has some too, though mostly hex quads. Not counting the leading and trailing space characters that got included, Acrobat emits 12 characters, which means that it doesn't compose them in the font creation. I'm really not sure how to count the control characters... most of the Nuance Tj strings have both a control character and a regular character. Maybe that is a form of CID mapping? It emits 23 characters using Tj, unless each control character + regular character pair should be counted as one, in which case it emits 12 characters... which sounds more correct.

What I notice particularly about this, compared to other PDF files whose internals I have looked at, is that both Acrobat and Nuance emit text movement operators between _each_ character (except one pair, in the case of Acrobat, which are the sequential ɛ characters, one with and one without a diacritical). Acrobat uses Td, and Nuance uses Tm.

So my "guessing about a lot of things I haven't figured out" conclusion, without knowing how to look at the embedded fonts, is that both Acrobat and Nuance are doing the kerning and character-composition positioning on the way into the PDF file, rather than expecting the PDF display tool's renderer to be smart.
This is consistent with my guessing after reading the quote from that Safari Books Online book. No clue how they figure out the numbers... no doubt it is either from the font files directly, using their own rendering code, or from some font rendering library, or from Windows somehow. The latter seems doubtful for Acrobat, since it also runs on Mac... although that is no guarantee it uses the same (recompiled) code on both platforms... it could get it from Windows on Windows and from OS X on Mac.
Glenn,
my reading of the control sequence(s) is that these glyphs are being individually positioned in PDF; I see 12 separate Tm operators.
Ideally we should see a single BT with a string containing 14 bytes, which would imply that Acrobat handles all the glyph positioning.
I believe that the text strings are actually using two bytes per glyph; the map looks like
6 beginbfchar
<006d> <00e3>
<047a> <0303>
<0690> <0186>
<0699> <0190>
<0727> <0254>
<072d> <025b>
endbfchar
so the byte strings required correspond to the first of each pair.
006d = 00 m = \000m
047a = 04 z = ^Dz the tilde
06?? = 06 ?? = ^f?
0727 = 07 ' = ^G'
072d = 07 - = ^G-
etc etc. My mailer can't actually cope with the odd characters in the 06 lines.
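Just to make that reading concrete, the hex strings from the Acrobat stream can be pushed through the quoted bfchar map with a few lines of Python (the mapping below is copied from the CMap above):

    # Decode the Acrobat <....> hex strings through the ToUnicode bfchar map
    # quoted above (two bytes per glyph).
    import unicodedata

    bfchar = {            # CID (two bytes) -> Unicode codepoint, from the CMap
        0x006D: 0x00E3,
        0x047A: 0x0303,
        0x0690: 0x0186,
        0x0699: 0x0190,
        0x0727: 0x0254,
        0x072D: 0x025B,
    }

    def decode(hex_string):
        """Split a Tj hex string like '072D072D' into CIDs and map to Unicode."""
        cids = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
        return [unicodedata.name(chr(bfchar[cid])) for cid in cids]

    for s in ("0727", "047A", "072D072D"):
        print(s, "->", decode(s))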
On 21/04/2015 11:50, Glenn Linderman wrote:
On 4/21/2015 2:51 AM, Robin Becker wrote:
Glenn,
my reading of the control sequence(s) is that these glyphs are being
individually positioned in PDF; I see 12 separate Tm operators.
I agree.
Ideally we should see a single BT with a string containing 14 bytes, which would imply that Acrobat handles all the glyph positioning.
I think we are on the same wavelength here, but I think you meant to say "Adobe Reader (or other PDF display tool)" where you said "Acrobat". I think it is the case that "Acrobat" (or other PDF generation tool) is doing all the positioning, and encoding it into the PDF file.
yes the positioning is not being done by the renderer (acrobat reader/evince etc etc).
If that is the case then positioning has to be done by the software that produces the PDF, i.e. Illustrator / Acrobat Pro / ReportLab. If this is true then there's no point in including the GPOS information in the embedded fonts.
If ReportLab has to do the positioning of glyphs, it should not affect the existing standard mechanisms. Probably we'll need a cumbersome, slow, and fairly complicated text output mechanism.
.........
The below seems to be referring to the Nuance-generated file; the Acrobat file used hex codes.
"Ideally", of course, refers to the way it should work if the PDF viewer's
renderer were responsible for combined glyph positioning. Of course, if it were, it should also be responsible for rendering the kerning too, and then you wouldn't be able to do right justification very well... it would have to be predicted in one place and matched in the other... so I think the PDF technique of having the viewer only convert curves to pixels, following instructions from the PDF creator as to where those curves should be placed, actually produces more consistent results across platforms and devices... as much as it hurts to have to do the calculations for the Td or Tm parameters when generating the PDF.
Well, I think kerning is a separate issue. Here we are talking about a standard Unicode approach to composite glyph construction. Pairs/groups of glyphs are supposed to be treated in a specific way; kerning is optional.
There are, I think, 4 issues, the first two of which I could definitely use if implemented, and which sound relatively easy, but likely have a performance impact. They would enable _higher quality typesetting_ of Latin-based text into PDF files. The others could be hard, but would be required to support a wider range of languages with non-Latin fonts.

I did read something recently about Micro$oft producing a font layout system (but they used a different word in the article that I cannot come up with right now) for all the various needs of different language systems... The closest thing I can find with Google right now is their DirectWrite, but whether it incorporates the technology I read about, I couldn't say; maybe it does or will. I don't recall if this was something they were making generally available to make the world's typography improve, or if it was a proprietary come-on to promote/improve Windows. It sounded pretty general, language-wise.
- kerning
- composite glyph positioning
- Languages with huge numbers of ligatures, where characters appear differently, even to the point of requiring different glyphs, at the beginning or end of words (Arabic) or adjacent to other letters (Thai).
- RTL languages.
1. kerning
My research into kerning is below, since it was somewhat productive. Most of it was on this list. I have not had time to research composite glyph positioning, which