On Tue, Apr 14, 2015 at 12:05:04PM -0700, Glenn Linderman wrote:

6-7 weeks with no response, for a while I thought the list was dead, but now a flurry of messages.... I guess I didn't actually ask a question, but is this, like kerning, thought to be too slow to implement, or is it just that the market for ReportLab simply doesn't include languages that don't have precomposed glyphs, or something else?

When Vika and I originally implemented Unicode + TrueType support in ReportLab, we didn't implement support for combining characters. I don't remember whether the TTF/PDF specifications available at that time included such support. I guess nobody has stepped up to add the missing support since then.

Technical details (which might be wrong if the code has changed since 2003, which it probably has; I wasn't keeping track): ReportLab takes apart the TTF and builds multiple fonts, each containing a subset of the original glyphs (up to 256). These subsets discard any and all TTF tables not explicitly copied, which I guess includes the tables used for rendering combining characters in a nice way.
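For readers less familiar with the combining-character issue: a precomposed character is a single codepoint (and usually a single glyph), while the general case is a base letter followed by one or more combining marks that have to be positioned over it. A small illustrative sketch with Python's standard unicodedata module (the sample characters are arbitrary):

    import unicodedata

    # Precomposed form: one codepoint, typically one glyph in the font.
    precomposed = "\u00e3"                    # LATIN SMALL LETTER A WITH TILDE

    # Canonical decomposition: base letter plus combining mark.
    for ch in unicodedata.normalize("NFD", precomposed):
        print("U+%04X %s (combining class %d)"
              % (ord(ch), unicodedata.name(ch), unicodedata.combining(ch)))

    # Some combinations have no precomposed form at all, so normalization
    # cannot help; someone has to position the combining mark over the base.
    open_o_tilde = "\u0254\u0303"             # OPEN O + COMBINING TILDE
    print(unicodedata.normalize("NFC", open_o_tilde) == open_o_tilde)  # True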
I guess if anyone is to blame for this, it's me as ReportLab's founder.
The closest we got to even half-understanding the problem was about 6 years ago, when an Arabic-speaking employee with a little knowledge of Farsi took a look. Unfortunately we are not rendering raster graphics on screen. We are trying to work out the right font descriptors and sequences of bytes to put in the PDF file so that the right stuff magically happens on screen. When I did that with Japanese in about 2002-2003, with the advantages that (a) I can read and write the language and (b) there is no special layout at all, it still took a month of reverse-engineering other people's PDFs. Since we don't know any of these languages, it's probably a big job, and we have not had any volunteers from the open source community, nor any customers willing to pay for the R&D.
I don't think it's a performance issue like kerning. I would sincerely hope that one just has to put the right byte sequences into the PDF and that the font sorts it out for you.
Glenn, my apologies - I had assumed you were discussing "unusual languages" without re-reading the original email carefully. It might not be that bad. There are two things we could do in the short term, and I'm keen to keep the core library moving forwards: (1) We could potentially provide a special flowable for kerned titles and short phrases. This would of course have to render a glyph at a time in Python, doing the kerning lookups and calculations itself (a rough sketch of what that might involve follows after point (2)).
(2) If you can find another open source PDF generator in any language which gets it right, and let us know, we can study a "hello world" PDF out of that tool and see what it does. This would be a big time saver.
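Here is the rough sketch promised above for (1): a hedged, minimal illustration (not an existing ReportLab feature) of the per-glyph work such a flowable would have to do. It reads pair kerning from a font's legacy 'kern' table with fontTools and draws a short title one character at a time; fonts whose kerning lives only in GPOS would need a GPOS reader instead, and the font file name is just an assumption:

    # Hedged sketch: per-pair kerning for a short title. Assumes the TTF has
    # a legacy 'kern' table; kerning that lives only in GPOS is not read here.
    from fontTools.ttLib import TTFont as FTFont
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from reportlab.pdfgen import canvas

    FONT_FILE, FONT_NAME, SIZE = "DejaVuSans.ttf", "DejaVuSans", 36
    pdfmetrics.registerFont(TTFont(FONT_NAME, FONT_FILE))

    ft = FTFont(FONT_FILE)
    cmap = ft.getBestCmap()                       # codepoint -> glyph name
    upem = ft["head"].unitsPerEm
    pairs = {}
    if "kern" in ft:
        for sub in ft["kern"].kernTables:
            pairs.update(getattr(sub, "kernTable", {}))   # (L, R) -> font units

    def kern(a, b):
        """Kerning adjustment in points between two characters (0 if none)."""
        return pairs.get((cmap.get(ord(a)), cmap.get(ord(b))), 0) * SIZE / upem

    c = canvas.Canvas("kerned_title.pdf")
    c.setFont(FONT_NAME, SIZE)
    x, y, text = 72, 720, "AVATAR"                # AV/VA pairs usually kern
    for i, ch in enumerate(text):
        if i:
            x += kern(text[i - 1], ch)            # negative values close the gap
        c.drawString(x, y, ch)
        x += pdfmetrics.stringWidth(ch, FONT_NAME, SIZE)
    c.save()

Doing this one glyph at a time is obviously slower than emitting whole strings, which is why restricting it to headings and short phrases seems sensible.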
Wordaxe does support automatic hyphenation and kerning.
See the SVN trunk (current revision is 110) at
http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/
However, I failed to make it work with RL's ImageAndFlowables class.
That's why I did not release an official new version.
For an example with kerning support, see the file
http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/tests/test_truetype.py
I agree with Andy that kerning slows the paragraph-wrapping process down, so personally I would only use it for headings and titles, not for the main text content.
On 4/15/2015 2:02 AM, Andy Robinson wrote:
Glenn, my apologies - I had assumed you were discussing "unusual languages" without re-reading the original email carefully. It might not be that bad.
2. Composite glyph positioning
Regarding composite characters made from multiple glyphs, the only scheme I can now find to adjust the Y position is described at the very end of this link: https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html That shows the use of the Td operator to adjust both the X & Y position between glyphs, but doesn't show how to calculate X & Y from font metrics. It would seem that only linear kerning was a concern, and was optimized in operators, when the PDF format was designed (since it predates Unicode). The idea of composing glyphs on the fly probably hadn't crossed any English-speaking minds back then. The first couple of paragraphs at that link hint at that likelihood.
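To make that concrete, here is a minimal ReportLab sketch of the kind of hand positioning those Td/Tm adjustments imply. The horizontal offset below is a hand-tuned guess, not a value derived from font metrics; deriving it properly from the font's GPOS anchor data is exactly the missing piece being discussed. DejaVuSans and the file paths are just assumptions for the example:

    # Hedged sketch: hand-placing a combining tilde (U+0303) over an open-o
    # (U+0254). ReportLab's text object emits Tm/Td operators under the hood;
    # the point is that the *generator* computes the offsets, not the viewer.
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont
    from reportlab.pdfgen import canvas

    pdfmetrics.registerFont(TTFont("DejaVuSans", "DejaVuSans.ttf"))  # assumed path
    SIZE = 36
    base, mark = "\u0254", "\u0303"

    c = canvas.Canvas("combining.pdf")
    t = c.beginText(72, 700)
    t.setFont("DejaVuSans", SIZE)
    w_base = pdfmetrics.stringWidth(base, "DejaVuSans", SIZE)

    t.textOut(base)
    # Naive, hand-tuned offset: pull the mark back over the base glyph.
    # A correct implementation would take this from GPOS mark-to-base anchors.
    t.setTextOrigin(72 + 0.25 * w_base, 700)
    t.textOut(mark)

    c.drawText(t)
    c.save()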
Speculation: Maybe there is some mechanism to create composite glyphs from the individual glyphs for the composite character codes, and embed that composite glyph in the PDF and use its internal code instead of positioning them in the stream via the Td operator... but I haven't found that... only a few things that seemed to hint at it. While Unicode didn't do that, because of the character code explosion that would result, any given PDF only needs to deal with the characters (individual or composite) actually used in any particular document. So there _might_ be a tradeoff between complexity of font embedding versus the complexity of font display.
Who is responsible for glyph positioning? I believe it is the font plus the renderer that are responsible.
This means that many things that developers working in other file formats take for granted, such as just putting down Unicode codepoints and letting the renderer do all the hard work, have to be done manually with PDF.
Well, I guess the way to go is:

1) try an experiment to see if PDF renderers will accept the GPOS information in a specific font and make good use of it. I guess we can use Illustrator or equivalent to make a sample document. Examining the DejaVuSans font shows it certainly has GPOS information.

2) If the answer to 1 is yes, then we'll need to parse the GPOS information and construct subsets that keep the required pairs together. From my understanding of the way PDF uses text I see little hope of constructing a single font that does this for all glyphs in a simple way (section 3.2.3 of the 1.7 PDF spec says "A string object consists of a series of bytes—unsigned integer values in the range 0 to 255"), so we're apparently limited to encodings of length 256 or less. Presumably we'll have to be really smart about constructing our encodings if many glyph+diacritic pairs are used.
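For anyone who wants to try step 1, the GPOS data mentioned above can be inspected directly with the fontTools library (a separate project, not part of ReportLab). The following is a rough sketch, assuming DejaVuSans.ttf is available locally, that dumps the mark-to-base anchors a renderer (or ReportLab itself) would need for positioning combining marks:

    # Hedged sketch: dump mark-to-base anchors (GPOS lookup type 4) from
    # DejaVuSans with fontTools. This only reads the data; rebuilding a
    # renumbered GPOS table for ReportLab's subsets would be a further step.
    from fontTools.ttLib import TTFont

    font = TTFont("DejaVuSans.ttf")          # assumption: font file is local
    gpos = font["GPOS"].table

    for lookup in gpos.LookupList.Lookup:
        lookup_type, subtables = lookup.LookupType, lookup.SubTable
        if lookup_type == 9:                 # extension lookups wrap the real one
            lookup_type = subtables[0].ExtensionLookupType
            subtables = [st.ExtSubTable for st in subtables]
        if lookup_type != 4:                 # 4 = MarkToBase attachment
            continue
        for st in subtables:
            for glyph, rec in zip(st.MarkCoverage.glyphs, st.MarkArray.MarkRecord):
                a = rec.MarkAnchor
                print("mark %-20s class %d anchor (%d, %d)"
                      % (glyph, rec.Class, a.XCoordinate, a.YCoordinate))
            for glyph, rec in zip(st.BaseCoverage.glyphs, st.BaseArray.BaseRecord):
                for cls, a in enumerate(rec.BaseAnchor):
                    if a is not None:
                        print("base %-20s class %d anchor (%d, %d)"
                              % (glyph, cls, a.XCoordinate, a.YCoordinate))

Each base glyph carries one anchor per mark class; positioning a mark is then a matter of translating its mark anchor onto the matching base anchor, scaled from font units to the text size.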
........
The problem is that ReportLab doesn't embed the font directly. Instead
it constructs multiple subsets (each with < 256 codepoints), and those
subsets constructed by ReportLab do not have GPOS information (check the
TTFontFile.makeSubset method to see what TTF tables are copied and how
they're transformed; my apologies about the terrible code you'll find
therein).
The GPOS table cannot be copied directly: subsetting changes glyph
numbering, so the GPOS table would have to be taken apart and
reconstructed with the renumbered glyphs.
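As an aside on the renumbering problem: the fontTools library already implements exactly this kind of glyph-renumbering subset, including rebuilding GPOS. This is not what ReportLab does today, but a sketch like the following could be used to produce a test subset and see whether viewers honour the retained layout tables (the font file and codepoints are just the ones from this thread's example):

    # Hedged sketch (not ReportLab's current behaviour): let fontTools build a
    # glyph subset while keeping and renumbering the GPOS/GSUB/GDEF tables,
    # instead of TTFontFile.makeSubset which drops them.
    from fontTools import subset
    from fontTools.ttLib import TTFont

    font = TTFont("DejaVuSans.ttf")          # assumption: font file is local

    options = subset.Options()
    options.layout_features = ["*"]          # keep all OpenType layout features
    options.glyph_names = True

    subsetter = subset.Subsetter(options)
    # keep open-o, epsilon, a-tilde and the combining tilde from the sample text
    subsetter.populate(unicodes=[0x0254, 0x025B, 0x00E3, 0x0303])
    subsetter.subset(font)                   # glyphs renumbered, GPOS rebuilt

    font.save("DejaVuSans-subset.ttf")
    print("GPOS" in font)                    # True if the layout table survived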
On 20/04/2015 11:54, Glenn Linderman wrote:
On 4/20/2015 2:20 AM, Robin Becker wrote:
..........
1) try an experiment to see if PDF renderers will accept the GPOS information
in a specific font and make good use of it. I guess we can use illustrator or
equivalent to make a sample document. Examining the dejaVuSans font shows it
certainly has GPOS information.
Maybe. The attempt will also be instructive regarding how Illustrator might
handle such combined characters... if it does (I don't have Illustrator to test
with, but since it is from Adobe, it well might)... and what the generated PDF
looks like... if it contains positioning instructions, or depends on the PDF
display tools to have a good renderer.
Well, unfortunately the Illustrator test produced exactly the wrong results; I copied the text from my sample DejaVuSans output into an Illustrator text box with the font set to DejaVuSans Book. Illustrator, or the copy and paste, did exactly the wrong thing and converted only those pairs that are in the font already.
I also tried hand-typing an A and then selecting a diacritic from Illustrator's text/glyph window. The characters were sort of composed in the input window, but they were not well displayed, and looked the same in a saved PDF. Our own output actually looked better for this case.
Result: indecisive. I will have to do further work to test the actual embedded font to see if it contains GPOS info.
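One way to do that further check is to pull the embedded font programs back out of the PDF and look at their table directories. A rough sketch, assuming the pikepdf and fontTools libraries and using the Acrobat sample file from this thread:

    # Hedged sketch: scan a PDF for embedded TrueType font programs (the
    # /FontFile2 streams in font descriptors) and report whether each one
    # still carries a GPOS table. pikepdf is an arbitrary choice here; any
    # PDF library that exposes raw streams would do.
    import io
    import pikepdf
    from fontTools.ttLib import TTFont

    with pikepdf.open("openo-Acrobat.pdf") as pdf:   # file name from the thread
        for obj in pdf.objects:
            if not isinstance(obj, pikepdf.Dictionary):
                continue
            if "/FontFile2" not in obj:              # font descriptors only
                continue
            data = obj.FontFile2.read_bytes()
            ttf = TTFont(io.BytesIO(data))
            print(obj.get("/FontName"),
                  "tables:", sorted(ttf.keys()),
                  "-> GPOS present" if "GPOS" in ttf else "-> no GPOS")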
Meantime, seeing your approach of looking at Illustrator output, I had a friend with Acrobat take my little test string and create a PDF from Acrobat. Results look good, and are at: http://nevcal.com/temporary/openo-Acrobat.pdf Maybe seeing what they do will help. The file is big enough that they must have embedded something or other, font-wise.

Glenn, could you ask your friend exactly what they did with Acrobat to create this? i.e. did they use Acrobat Distiller to convert a PostScript file, or create a Word document and export it to PDF using Acrobat? If we can observe another program "doing it right" it may help.
http://nevcal.com/temporary/openo-Nuance.pdf

He also tried the just-released Nitro 10, but it failed to create from the clipboard, failed to create from a UTF-8 text file, failed to create from a plain text file, and failed to create from a Word document... he has submitted a bug report, and is probably busy reinstalling the prior version of Nitro. If Nitro 9 will do the job, that might give another file in the near future. Actually, the reason he has so many PDF tools is that most of them have limitations: some are good for one sort of thing, but mess up on other things; another will do the other things, but not something else; etc. He mostly uses the editing features to fine-tune and work around limitations in the PDF creation from other programs, rather than using them to create raw PDF files from other formats.

So, counting the input characters for my sample, there are 8 base/precomposed characters and 4 combining diacriticals, for a total of 12.

The text stream from Acrobat is as follows:

15 0 obj << /Length 455 >> stream
BT /P <</MCID 0 >>BDC /CS0 cs 0.2 0 0.2 scn
/TT0 1 Tf 12 -0 0 12 72 709.2 Tm ( )Tj
/C2_0 1 Tf 36 -0 0 36 72 672.72 Tm <0727>Tj
/TT0 1 Tf 0.443 0 Td (\343)Tj
/C2_0 1 Tf 0.443 0 Td <0727>Tj
0.447 -0.17 Td <047A>Tj
/TT0 1 Tf -0.003 0.17 Td (\325)Tj
/C2_0 1 Tf 0.723 0 Td <0690>Tj
0.557 0.047 Td <047A>Tj
0.11 -0.047 Td <072D072D>Tj
0.853 -0.17 Td <047A>Tj
-0.013 0.17 Td <0699>Tj
0.473 0.047 Td <047A>Tj
/TT0 1 Tf 12 -0 0 12 218.16 672.72 Tm ( )Tj
EMC ET
endstream endobj

The text stream from Nuance is as follows:

7 0 obj << /Length 600 >> stream
0.1999 0 0.1999 rg []0 d 1 w 10 M 0 i 0 J 0 j
BT /F0 35.029 Tf 1 0 0 1 28.789 774.789 Tm 0 Tc 0 Tw 0 Tr 100 Tz 0 Ts ( ')Tj
1 0 0 1 44.268 774.789 Tm (\000m)Tj ET
BT /F0 35.029 Tf 1 0 0 1 59.868 774.789 Tm ( ')Tj
1 0 0 1 75.467 769.03 Tm ( z)Tj ET
BT /F1 35.029 Tf 0.9999 0 0 0.9999 75.778 774.782 Tm ( )Tj ET
BT /F0 35.029 Tf 1 0 0 1 100.787 774.789 Tm ( )Tj
1 0 0 1 120.106 776.469 Tm ( z)Tj
1 0 0 1 124.066 774.789 Tm ( -)Tj
1 0 0 1 138.825 774.789 Tm ( -)Tj
1 0 0 1 153.825 769.03 Tm ( z)Tj ET
BT /F0 35.029 Tf 1 0 0 1 153.585 774.789 Tm ( )Tj
1 0 0 1 170.145 776.469 Tm ( z)Tj ET
endstream endobj

I was rather surprised to see that Nuance had control characters in the Tj parameters. Acrobat has some too, though mostly hex quads. Not counting the leading and trailing space characters that got included, Acrobat emits 12 characters, which means that it doesn't compose them in the font creation. I'm really not sure how to count the control characters... most of the Nuance Tj strings have both a control character and a regular character. Maybe that is a form of CID mapping? It emits 23 characters using Tj, unless each control character + regular character pair should be counted as one, in which case it emits 12 characters... which sounds more correct.

What I notice particularly about this, compared to other PDF files whose internals I have looked at, is that both Acrobat and Nuance emit text movement operators between _each_ character (except one pair, in the case of Acrobat, which are the sequential ɛ characters, one with and one without a diacritical). Acrobat uses Td, and Nuance uses Tm.

So my "guessing about a lot of things I haven't figured out" conclusion, without knowing how to look at the embedded fonts, is that both Acrobat and Nuance are doing the kerning and character-composition positioning on the way into the PDF file, rather than expecting the PDF display tool's renderer to be smart.
This is consistent with my guessing after reading the quote from that Safari Books Online book. No clue how they figure out the numbers... no doubt it is either from the font files directly, using their own rendering code, or from some font rendering library, or from Windows somehow. The latter seems doubtful for Acrobat, since it also runs on Mac... although that is no guarantee it uses the same (recompiled) code on both platforms... it could get it from Windows on Windows and from OS X on Mac.
Glenn,
my reading of the control sequence(s) is that these glyphs are being individually positioned in PDF; I see 12 separate Tm operators.
Ideally we should see a single BT with a string containing 14 bytes, which would imply that Acrobat handles all the glyph positioning.
I believe that the text strings are actually using two bytes per glyph; the map looks like
6 beginbfchar
<006d> <00e3>
<047a> <0303>
<0690> <0186>
<0699> <0190>
<0727> <0254>
<072d> <025b>
endbfchar
so the byte strings required correspond to the first of each pair.
006d = 00 m = \000m
047a = 04 z = ^Dz the tilde
06?? = 06 ?? = ^f?
0727 = 07 ' = ^G'
072d = 07 - = ^G-
etc etc. My mailer can't actually cope with the odd characters in the 06 lines.
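Just to make that reading concrete, the hex strings from the Acrobat stream can be pushed through the quoted bfchar map with a few lines of Python (the mapping below is copied from the CMap above):

    # Decode the Acrobat <....> hex strings through the ToUnicode bfchar map
    # quoted above (two bytes per glyph).
    import unicodedata

    bfchar = {            # CID (two bytes) -> Unicode codepoint, from the CMap
        0x006D: 0x00E3,
        0x047A: 0x0303,
        0x0690: 0x0186,
        0x0699: 0x0190,
        0x0727: 0x0254,
        0x072D: 0x025B,
    }

    def decode(hex_string):
        """Split a Tj hex string like '072D072D' into CIDs and map to Unicode."""
        cids = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
        return [unicodedata.name(chr(bfchar[cid])) for cid in cids]

    for s in ("0727", "047A", "072D072D"):
        print(s, "->", decode(s))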
On 21/04/2015 11:50, Glenn Linderman wrote:
On 4/21/2015 2:51 AM, Robin Becker wrote:
Glenn,
my reading of the control sequence(s) is that these glyphs are being
individually positioned in PDF; I see 12 separate Tm operators.
I agree.
Ideally we should see a single BT with a string containing 14 bytes, which would imply that Acrobat handles all the glyph positioning.
I think we are on the same wavelength here, but I think you meant to say "Adobe Reader (or other PDF display tool)" where you said "Acrobat". I think it is the case that "Acrobat" (or other PDF generation tool) is doing all the positioning, and encoding it into the PDF file.
yes the positioning is not being done by the renderer (acrobat reader/evince etc etc).
If that is the case then positioning has to be done by the software that produces the PDF, i.e. Illustrator / Acrobat Pro / ReportLab. If this is true then there's no point in including the GPOS information in the embedded fonts.
If ReportLab has to do the positioning of glyphs, it should not affect the existing standard mechanisms. Probably we'll need a cumbersome, slow, and fairly complicated text output mechanism.
.........
The below seems to be referring to the Nuance-generated file; the Acrobat file used hex codes.
"Ideally", of course, refers to the way it should work if the PDF viewer's
renderer were responsible for combined glyph positioning. Of course, if it were, it should also be responsible for rendering the kerning too, and then you wouldn't be able to do right justification very well... it would have to be predicted in one place and matched in the other... so I think the PDF technique of having the viewer only convert curves to pixels, following instructions from the PDF creator as to where those curves should be placed, actually produces more consistent results across platforms and devices... as much as it hurts to have to do the calculations for the Td or Tm parameters when generating the PDF.
Well, I think kerning is a separate issue. Here we are talking about a standard Unicode approach to composite glyph construction. Pairs/groups of glyphs are supposed to be treated in a specific way; kerning is optional.
There are, I think, 4 issues, the first two of which I could definitely use if implemented, and which sound relatively easy, but likely have a performance impact. They would enable _higher quality typesetting_ of Latin-based text into PDF files. The others could be hard, but would be required to support a wider range of languages with non-Latin fonts.

I did read something recently about Micro$oft producing a font layout system (but they used a different word in the article that I cannot come up with right now) for all the various needs of different language systems... The closest thing I can find with Google right now is their DirectWrite, but whether it incorporates the technology I read about, I couldn't say; maybe it does or will. I don't recall if this was something they were making generally available to make the world's typography improve, or if it was a proprietary come-on to promote/improve Windows. It sounded pretty general, language-wise.
- kerning
- composite glyph positioning
- Languages with huge numbers of ligatures, where characters appear differently, even to the point of requiring different glyphs, at the beginning or end of words (Arabic) or adjacent to other letters (Thai).
- RTL languages.
1. kerning
My research into kerning is below, since it was somewhat productive. Most of it was on this list. I have not had time to research composite glyph positioning, which