[reportlab-users] Incorrect character composition


Glenn Linderman

Feb 21, 2015, 4:18:47 PM
to reportl...@lists2.reportlab.com
Hi,

I've suddenly discovered a need to use Unicode characters that do not fall into the category of "precomposed glyphs", instead being forced to use "combining characters" for certain diacritical marks.

However, the combined result from reportlab looks rather stupid compared to the results seen in other programs (browsers, text editors, word processors, etc.).  I even displayed the results in two different PDF viewers, Sumatra and Adobe, before concluding it must be a reportlab thing.

The problem is with the characters called  open o  (upper and lower case), and  open e (at least lower case, the upper case version looks better, but that may be more due to the open E being narrower than due to proper handling) when combined with the combining tilde, and other similar diacriticals.

In my sample at http://nevcal.com/temporary/openo.pdf I've also included a precomposed ã and Õ, for comparison of where the tilde should be placed.  Here are the same characters in email... I note that in my email client (Thunderbird) the precomposed tildes are slightly closer to the characters than the combining tilde, but in the reportlab-generated PDF the lower case combining tildes are far too high, and those over (wider) upper case characters are not centered.  Times New Roman font is used in both this email (unless your client or the mailing list strips the fonts) and the PDF.

Glenn

ɔãɔ̃ÕƆ̃ɛɛ̃Ɛ̃

Glenn Linderman

Apr 14, 2015, 3:05:21 PM
to reportl...@lists2.reportlab.com
6-7 weeks with no response, for a while I thought the list was dead, but now a flurry of messages....

I guess I didn't actually ask a question, but is this, like kerning, thought to be too slow to implement, or is it just that the market for reportlab simply doesn't include languages that don't have precomposed glyphs, or something else?

Marius Gedminas

Apr 15, 2015, 1:25:25 AM
to reportl...@lists2.reportlab.com
On Tue, Apr 14, 2015 at 12:05:04PM -0700, Glenn Linderman wrote:
> 6-7 weeks with no response, for a while I thought the list was dead, but now
> a flurry of messages....
>
> I guess I didn't actually ask a question, but is this, like kerning, thought
> to be too slow to implement, or is it just that the market for reportlab
> simply doesn't include languages that don't have precomposed glyphs, or
> something else?

When I and Vika originally implemented Unicode + TrueType support in
ReportLab, we didn't implement support for combining characters. I
don't remember if the TTF/PDF specifications available at that time included
such support.

I guess nobody stepped up to add the missing support since then.

Technical details (that might be wrong if the code changed since 2003,
which it probably did--I wasn't keeping track): ReportLab takes apart
the TTF and builds multiple fonts, each containing a subset of the
original glyphs (up to 256). These subsets discard any and all TTF
tables not explicitly copied, which I guess include the tables used for
rendering combining characters in a nice way.
Marius Gedminas
--
What goes up, must come down. Ask any system administrator.

Glenn Linderman

Apr 15, 2015, 4:01:56 AM
to reportlab-users
On 4/14/2015 10:25 PM, Marius Gedminas wrote:
On Tue, Apr 14, 2015 at 12:05:04PM -0700, Glenn Linderman wrote:
6-7 weeks with no response, for a while I thought the list was dead, but now
a flurry of messages....

I guess I didn't actually ask a question, but is this, like kerning, thought
to be too slow to implement, or is it just that the market for reportlab
simply doesn't include languages that don't have precomposed glyphs, or
something else?
When I and Vika originally implemented Unicode + TrueType support in
ReportLab, we didn't implement support for combining characters.  I
don't remember if the TTF/PDF specifications available at that time included
such support.

I guess nobody stepped up to add the missing support since then.

Technical details (that might be wrong if the code changed since 2003,
which it probably did--I wasn't keeping track): ReportLab takes apart
the TTF and builds multiple fonts, each containing a subset of the
original glyphs (up to 256).  These subsets discard any and all TTF
tables not explicitly copied, which I guess include the tables used for
rendering combining characters in a nice way.

Thanks for the response and explanation of the history... By the time I started using reportlab, TTF support already existed. Maybe no one else knew that combining character support didn't exist.

In looking at the details of PDF files, it seems that when kerning _is_ supported, it is not done by copying TTF kerning tables into Postscript Fonts, but rather by explicitly coding the distance to advance the caret when laying down text.

While I was waiting for a response, I was speculating about how combining characters might be coded in PDF files, but I haven't found any that contain combining characters that I have been able to take apart and examine (and I'm no expert at such examination).

I speculated that if kerning tables are not copied and used by the display engines, that probably combining character X & Y positions would not be either. The conclusion of that speculation was that it would be the responsibility of the PDF file creation tool, which has all the available font tables, to use the kerning feature to adjust the X position, and possibly just absolutely position the Y position.... I didn't find (but may have overlooked) any technique in the PDF specification for adjusting the Y position of the combining character short of simply making it its own character stream, with an adjusted Y baseline.
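To make that speculation concrete, here is a minimal sketch of the "adjust X, absolutely position Y" idea using only the existing canvas API. This is my own untested guesswork, not a ReportLab feature; DejaVuSans.ttf is assumed to be available, and the dx/dy nudges are placeholders for values that would really have to come from the font's tables.

#################################################################
# Sketch only: draw base + combining tilde with explicit positioning,
# on the assumption that the creation tool (not the viewer) must place
# the mark. dx/dy are ad-hoc tweaks, not values read from the font.
from reportlab.pdfbase.pdfmetrics import registerFont, stringWidth
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen.canvas import Canvas

registerFont(TTFont('DejaVuSans', 'DejaVuSans.ttf'))

def draw_with_tilde(canv, x, y, base, font='DejaVuSans', size=36, dx=0, dy=0):
    # Combining marks are zero-width and drawn to the left of the pen, so
    # drawing the mark straight after the base already lands it roughly
    # over the base; dx/dy let the caller nudge it (e.g. lift it over capitals).
    canv.setFont(font, size)
    canv.drawString(x, y, base)
    w = stringWidth(base, font, size)
    canv.drawString(x + w + dx, y + dy, u'\u0303')
    return x + w                      # caret advance for whatever follows

c = Canvas('combining-sketch.pdf')
x = 72
for ch in u'\u0254\u025b':            # open o, open e: no precomposed tilde forms
    x = draw_with_tilde(c, x, 700, ch)
for ch in u'\u0186\u0190':            # capitals: lift the tilde a little
    x = draw_with_tilde(c, x, 700, ch, dy=0.15 * 36)
c.save()
#################################################################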

Andy Robinson

Apr 15, 2015, 4:19:07 AM
to reportlab-users
I guess if anyone is to blame for this, it's me as ReportLab's founder.

The closest we got to even half-understanding the problem was about 6
years ago when an Arabic-speaking employee with a little knowledge of
Farsi took a look. Unfortunately we are not rendering raster graphics
on screen. We are trying to work out the right font descriptors and
sequences of bytes to put in the PDF file so that the right stuff
magically happens on screen. When I did that with Japanese in about
2002-2003, with the advantages that (a) I can read and write the
language and (b) there is no special layout at all, it still took a
month of reverse-engineering other peoples' PDFs. Not knowing any
of these languages, it's probably a big job, and we have not had any
volunteers from the open source community, nor any customers willing
to pay for the R&D.

I don't think it's a performance issue like kerning. I would
sincerely hope that one just has to put the right byte sequences into
the PDF and that the font sorts it out for you.

If anyone here is willing to help have a crack at it, stick their
hands up and I can suggest a general approach. We have done some
work in the past month updating the pyfribidi extension to compile on
Python 2.7, 3.3 and 3.4, whch is a prerequisite for anything here. I
think we probably need to crack Arabic and Hebrew as a first step.

- Andy

Glenn Linderman

Apr 15, 2015, 4:48:56 AM
to reportl...@lists2.reportlab.com
On 4/15/2015 1:19 AM, Andy Robinson wrote:
I guess if anyone is to blame for this, it's me as ReportLab's founder.

I'm not looking to blame anyone, just trying to understand if there is or will be a solution to a problem. 

The closest we got to even half-understanding the problem was about 6
years ago when an Arabic-speaking employee with a little knowledge of
Farsi took a look.  Unfortunately we are not rendering raster graphics
on screen.  We are trying to work out the right font descriptors and
sequences of bytes to put in the PDF file so that the right stuff
magically happens on screen.   When I did that with Japanese in about
2002-2003, with the advantages that (a) I can read and write the
language and (b) there is no special layout at all, it still took a
month of reverse-engineering other peoples' PDFs.    Not knowing any
of these languages, it's probably a big job, and we have not had any
volunteers from the open source community, nor any customers willing
to pay for the R&D.

While I ran into the issue with Latin-based languages with unusual (among Latin-based languages) diacritical marks, I suppose you are right that support for Arabic and Hebrew (ha, and I think Thai is one of the worst at using combining characters, so don't forget Thai) is required for a general solution. Obviously, supporting all those languages makes the problem much larger than support LTR Latin+diacritical marks with proper positions.  Vietnamese seems to be another of interest in the smaller subset, though, as it uses two diacriticals on many vowels whereas the languages I'm dealing with so far only have one... but on some uncommon "extended Latin" characters. Vietnamese may hit me someday, but it hasn't yet.


I don't think it's a performance issue like kerning.  I would
sincerely hope that one just has to put the right byte sequences into
the PDF and that the font sorts it out for you.

After raising the issue of kerning, I found there is some limited support for kerning from 3rd parties outside of platypus, which might be sufficient for the type of international typesetting I'm doing. But it seems to me that even if proper kerning is a performance issue, if it can be turned on & off, then people that care could get good results, and people that don't could have fast results. Right now, people that care don't have a solution at all (in Python, for generating PDFs).

Still, without the combining character support, and if there is no "quick fix" (which I wouldn't expect if the general problem is solved), I'll likely have to look at other solutions and even other languages, probably, for generating my PDFs. Which I regret, because (1) I'm already using reportlab for current languages (2) it has a nice design (3) I like coding in Python.

Andy Robinson

Apr 15, 2015, 5:02:26 AM
to reportlab-users
Glenn, my apologies - I had assumed you were discussing "unusual
languages" without re-reading the original email carefully. It might
not be that bad.

There are two things we could do in the short term, and I'm keen to
keep the core library moving forwards:

(1) We could potentially provide a special flowable for kerned titles
and short phrases. This would of course have to render a glyph at a
time in Python, doing the lookups and calculations

(2) If you can find another open source PDF generator in any language
which gets it right, and let us know, we can study a "hello world" PDF
out of that tool and see what it does. This would be a big time
saver.

Glenn Linderman

Apr 16, 2015, 1:47:16 AM
to reportl...@lists2.reportlab.com
On 4/15/2015 2:02 AM, Andy Robinson wrote:
Glenn,  my apologies - I had assumed you were discussing "unusual
languages" without re-reading the original email carefully.  It might
not be that bad.

There are two things we could do in the short term, and I'm keen to
keep the core library moving forwards:

(1) We could potentially provide a special flowable for kerned titles
and short phrases.  This would of course have to render a glyph at a
time in Python, doing the lookups and calculations

When writing to fixed-resolution devices, various fonts have hints for use at low resolution, and the character spacing varies when those hints are applied during rendering.  I don't know if PDF supports that directly, but I noticed when printing from a browser to a PDF printer that the character spacing was weird in the printed result. When I told the browser to scale everything up really high, and then the browser's printer driver to scale to fit the page, the weird character spacing went away.  So in producing PDF for typesetting, it is best to ignore the "hints" for low-resolution devices.  Of course screens are some of the lowest resolution devices, and that is what browsers mostly aim at; printing is sort of a side effect.

My data would be mostly a short word, or up to 3 lines of outdented text, without right justification.


(2) If you can find another open source PDF generator in any language
which gets it right, and let us know, we can study a "hello world" PDF
out of that tool and see what it does.   This would be a big time
saver.

There are, I think, 4 issues, the first two of which I could definitely use if implemented, and which sound relatively easy but likely have a performance impact. They would enable _higher quality typesetting_ of Latin-based text into PDF files. The others could be hard, but would be required to support a wider range of languages with non-Latin fonts.  I did read something recently about Micro$oft producing a font layout system (they used a different word in the article that I cannot come up with right now) for all the various needs of different language systems... The closest thing I can find with Google right now is their DirectWrite; whether it incorporates the technology I read about, I couldn't say, but maybe it does or will. I don't recall if this was something they were making generally available to improve the world's typography, or a proprietary come-on to promote/improve Windows. It sounded pretty general, language-wise.
  1. kerning
  2. composite glyph positioning
  3. Languages with huge numbers of ligatures, where characters appear differently, even to the point of requiring different glyphs, at the beginning or end of words (Arabic) or adjacent to other letters (Thai).
  4. RTL languages.

1. kerning

My research into kerning is below, since it was somewhat productive. Most of it was on this list. I have not had time to research composite glyph positioning as thoroughly; what I have found is under item 2 below.

Here's a reference to how to emit kerning into a PDF file: http://stackoverflow.com/questions/18304954/how-is-kerning-encoded-on-embedded-adobe-type-1-fonts-in-pdf-files
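For what creation-side kerning amounts to in practice, here is a rough sketch: look the pair adjustment up in the TTF 'kern' table and fold it into the caret advance yourself. This assumes fontTools is available for reading the table (it is not ReportLab API), only covers format-0 'kern' subtables, and misses fonts that keep their kerning solely in GPOS.

#################################################################
# Sketch: apply TTF 'kern' pair adjustments on the generation side by
# advancing the caret ourselves, one character at a time (slow, as the
# thread notes). fontTools is an assumed extra dependency.
from fontTools.ttLib import TTFont as FTFont
from reportlab.pdfbase.pdfmetrics import registerFont, stringWidth
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen.canvas import Canvas

registerFont(TTFont('DejaVuSans', 'DejaVuSans.ttf'))
ft = FTFont('DejaVuSans.ttf')
cmap = ft.getBestCmap()                      # codepoint -> glyph name
upem = ft['head'].unitsPerEm
pairs = {}
if 'kern' in ft:
    for sub in ft['kern'].kernTables:        # format-0 subtables only
        pairs.update(getattr(sub, 'kernTable', {}))

def draw_kerned(canv, x, y, text, font='DejaVuSans', size=24):
    canv.setFont(font, size)
    for left, right in zip(text, text[1:] + u' '):
        canv.drawString(x, y, left)
        adj = pairs.get((cmap.get(ord(left)), cmap.get(ord(right))), 0)
        x += stringWidth(left, font, size) + adj * size / float(upem)
    return x

c = Canvas('kerned-sketch.pdf')
draw_kerned(c, 72, 700, u'AVATAR To Ye')
c.save()
#################################################################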

On this mailing list, the following messages are about kerning, and the last two have sample PDF files that claim to have kerning. Seems like integrating Henning's Wordaxe kerning code into reportlab itself might make it easier to make it work with flowables. Anyway, it is a start.

From: Henning von Bargen <H.von...@t-p.com>
Date: Tue, 6 Jan 2015 07:16:15 +0000

Wordaxe does support automatic hyphenation and kerning.

See the SVN trunk (current revision is 110) at
http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/

However, I failed to make it work with RL's ImageAndFlowables class.
That's why I did not release an official new version.

For an example with kerning support, see the file
http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/tests/test_truetype.py

I agree with Andy that kerning slows the paragraph-wrapping process down,
so personally I would only use it for headings and title, not for the
main text content.

From: Dinu Gherman <ghe...@darwin.in-berlin.de>
Date: Tue, 6 Jan 2015 11:37:40 +0100

From: Dinu Gherman <ghe...@darwin.in-berlin.de>
Date: Tue, 6 Jan 2015 11:39:30 +0100

From: Dinu Gherman <ghe...@darwin.in-berlin.de>
Date: Tue, 6 Jan 2015 11:40:57 +0100


2. Composite glyph positioning

Regarding composite characters made from multiple glyphs, the only scheme I can now find to adjust Y position is described at the very end of this link:  https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html  That shows the use of Td operator to do both X & Y position between glyphs, but doesn't show how to calculate X & Y from font metrics. It would seem that only linear kerning was a concern and was optimized in operators when the PDF format was designed (since it predates Unicode). The idea of composing glyphs on the fly probably hadn't crossed any English-speaking minds, back then. The first couple paragraphs at that link hint at that likelihood.

Speculation: Maybe there is some mechanism to create composite glyphs from the individual glyphs for the composite character codes, and embed that composite glyph in the PDF and use its internal code instead of positioning them in the stream via the Td operator... but I haven't found that... only a few things that seemed to hint at it. While Unicode didn't do that, because of the character code explosion that would result, any given PDF only needs to deal with the characters (individual or composite) actually used in any particular document. So there _might_ be a tradeoff between complexity of font embedding versus the complexity of font display.
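One cheap partial measure, independent of the speculation above: normalize text to NFC before drawing, so that any sequence for which Unicode defines a precomposed codepoint uses the font's ready-made glyph, and only the genuinely missing combinations (such as open o + tilde) remain to be solved. A minimal sketch:

#################################################################
# Prefer precomposed glyphs where Unicode defines them; only the leftovers
# (e.g. U+0254 U+0303, which has no precomposed form) still need manual
# positioning.
import unicodedata

def precompose(text):
    return unicodedata.normalize('NFC', text)

print(precompose(u'a\u0303'))        # -> u'\xe3', a single precomposed a-tilde
print(precompose(u'\u0254\u0303'))   # unchanged: no precomposed open-o-tilde exists
#################################################################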



Maybe somewhat unrelated to the above issues, but interesting:

I also just found http://www.linuxfoundation.org/images/8/80/Textextraction_slides_small.pdf which is rather interesting... a bit short on details of how, but looks like it would be appropriate and useful when generating PDF files to use the "ToUnicode" feature, whatever it is... I seem to have found it in section 5.9.2 of the 1.7 version of the PDF reference, although I haven't absorbed it yet.

Glenn Linderman

Apr 16, 2015, 2:37:54 AM
to reportl...@lists2.reportlab.com
On 4/15/2015 10:46 PM, Glenn Linderman wrote:
On 4/15/2015 2:02 AM, Andy Robinson wrote:
Glenn,  my apologies - I had assumed you were discussing "unusual
languages" without re-reading the original email carefully.  It might
not be that bad.
2. Composite glyph positioning

Regarding composite characters made from multiple glyphs, the only scheme I can now find to adjust Y position is described at the very end of this link:  https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html  That shows the use of Td operator to do both X & Y position between glyphs, but doesn't show how to calculate X & Y from font metrics. It would seem that only linear kerning was a concern and was optimized in operators when the PDF format was designed (since it predates Unicode). The idea of composing glyphs on the fly probably hadn't crossed any English-speaking minds, back then. The first couple paragraphs at that link hint at that likelihood.

Speculation: Maybe there is some mechanism to create composite glyphs from the individual glyphs for the composite character codes, and embed that composite glyph in the PDF and use its internal code instead of positioning them in the stream via the Td operator... but I haven't found that... only a few things that seemed to hint at it. While Unicode didn't do that, because of the character code explosion that would result, any given PDF only needs to deal with the characters (individual or composite) actually used in any particular document. So there _might_ be a tradeoff between complexity of font embedding versus the complexity of font display.

http://www.alanwood.net/unicode/combining_diacritical_marks.html is an interesting page. It is intended to test browser capabilities to display combining characters... it seems to display well in Firefox on Windows.  Firefox is open source. However, it might well use OS services on various platforms to assist?

If it does, maybe that is the thing to do for reportlab too? Or maybe not, if the OS services translate Unicode string→displayed bits on screen. But if the OS services translate Unicode string→sequence of glyphs and positions, then it could be really handy.

On the other hand, if it uses other libraries to do the placement, maybe they would be available for reportlab as well, since FF is open source.
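For what it's worth, libraries of exactly that kind exist; Firefox's text shaping is done by HarfBuzz, which takes a Unicode string plus a font and hands back glyph indices with offsets and advances, i.e. the "sequence of glyphs and positions" form described above. A sketch using the uharfbuzz Python binding (my assumption of a suitable binding; ReportLab does not use it):

#################################################################
# Sketch: ask a shaping engine (HarfBuzz via uharfbuzz) for the positioned
# glyphs of a string containing combining marks. The output is a list of
# (glyph id, x_offset, y_offset, x_advance) in font units, which a PDF
# generator could turn into Td/Tm adjustments.
import uharfbuzz as hb

blob = hb.Blob.from_file_path('DejaVuSans.ttf')
face = hb.Face(blob)
font = hb.Font(face)

buf = hb.Buffer()
buf.add_str(u'\u0254\u0303 \u0186\u0303')   # open o / open O + combining tilde
buf.guess_segment_properties()
hb.shape(font, buf)

for info, pos in zip(buf.glyph_infos, buf.glyph_positions):
    print(info.codepoint, pos.x_offset, pos.y_offset, pos.x_advance)
#################################################################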

I converted the above page to PDF using the PDFize plugin for Firefox, the result is at nevcal.com/temporary/Combining Diacritical Marks – Test for Unicode support in Web browsers.pdf
and it looks pretty good, although I don't know how useful it will be to you.  My understanding is that FF uses the Cairo graphics package to do screen rendering and the extension plugs in to that to do PDF rendering also. Caveats about kerning and screen resolution sources apply.

Robin Becker

Apr 16, 2015, 5:52:02 AM
to reportlab-users
On 16/04/2015 06:46, Glenn Linderman wrote:
> From: Henning von Bargen <H.von...@t-p.com>
> Date: Tue, 6 Jan 2015 07:16:15 +0000
>
>> Wordaxe does support automatic hyphenation and kerning.
>>
>> See the SVN trunk (current revision is 110) at
>> http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/
>>
>> However, I failed to make it work with RL's ImageAndFlowables class.
>> That's why I did not release an official new version.
>>
>> For an example with kerning support, see the file
>> http://sourceforge.net/p/deco-cow/code/HEAD/tree/trunk/tests/test_truetype.py
>>
>> I agree with Andy that kerning slows the paragraph-wrapping process down,
>> so personally I would only use it for headings and title, not for the
>> main text content.


I corresponded with Henning about the problem he observed with ImageAndFlowables; it turned
out that his paragraph wrapping was looping forever for the case of negative
available widths, which could happen in some ImageAndFlowables instances. I did
put in a fix to the bitbucket repository, but haven't heard if that works for
Henning.
--
Robin Becker

Robin Becker

Apr 17, 2015, 10:22:39 AM
to reportlab-users
Who is responsible for glyph positioning? I believe it is the font plus the
renderer.

I wrote the script below to test various diacritic behaviours in reportlab.

The TLDR is as follows: the TTF fonts seem to know about diacritics. The Adobe
builtins may or may not know about them, but with our standard encoding
Helvetica clearly doesn't.

The script draws space + glyph + diacritic for some upper and lower case roman
letters. It also draws the same after unicode normalization.

Where seen, all the diacritics have zero width. The DejaVuSans font seems to do
slightly better than Arial in centring the common diacritics; where available,
the composed glyphs (obtained by normalization) seem much better.

With no width available for centring, it would seem we need to examine the curves
to get any kind of centring right. DejaVu & Arial have some built-in negative
shifts, as can be seen by examining the tilde:

> C:\tmp>python
> Python 2.7.8 (default, Jun 30 2014, 16:08:48) [MSC v.1500 64 bit (AMD64)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> from reportlab.pdfbase.pdfmetrics import registerFont
>>>> from reportlab.pdfbase.ttfonts import TTFont
>>>> registerFont(TTFont('DejaVuSans','DejaVuSans.ttf'))
>>>> from reportlab.graphics.charts.textlabels import _text2PathDescription
>>>> p=_text2PathDescription(u'\u0303',fontName='DejaVuSans',fontSize=2048)
>>>> p
> [('moveTo', -518, 1370), (u'lineTo', -575, 1425), (u'curveTo', -589, 1438, -602, 1448, -613, 1454),
> (u'curveTo', -624, 1460, -634, 1464, -643, 1464), (u'curveTo', -668, 1464, -687, 1452, -699, 1427),
> (u'curveTo', -711, 1403, -717, 1364, -719, 1309), (u'lineTo', -844, 1309),
> (u'curveTo', -843, 1399, -825, 1468, -791, 1517), (u'curveTo', -757, 1566, -710, 1591, -649, 1591),
> (u'curveTo', -624, 1591, -601, 1587, -579, 1577), (u'curveTo', -558, 1568, -535, 1552, -510, 1530),
> (u'lineTo', -453, 1475), (u'curveTo', -439, 1462, -426, 1452, -414, 1445),
> (u'curveTo', -404, 1439, -394, 1436, -385, 1436), (u'curveTo', -360, 1436, -341, 1448, -329, 1472),
> (u'curveTo', -317, 1496, -311, 1536, -309, 1591), (u'lineTo', -184, 1591),
> (u'curveTo', -185, 1501, -203, 1432, -237, 1382), (u'curveTo', -271, 1334, -318, 1309, -379, 1309),
> (u'curveTo', -404, 1309, -427, 1313, -449, 1323), (u'curveTo', -470, 1332, -493, 1348, -518, 1370),
> 'closePath']
>>>> registerFont(TTFont('Arial','Arial.ttf'))
>>>> pa=_text2PathDescription(u'\u0303',fontName='Arial',fontSize=2048)
>>>> pa
> [('moveTo', -909, 1547), (u'curveTo', -909, 1615, -891, 1670, -853, 1712),
> (u'curveTo', -816, 1754, -767, 1775, -706, 1775), (u'curveTo', -665, 1775, -609, 1757, -537, 1721),
> (u'curveTo', -498, 1701, -467, 1691, -443, 1691), (u'curveTo', -403, 1691, -378, 1720, -370, 1778),
> (u'lineTo', -240, 1778), (u'curveTo', -244, 1626, -309, 1550, -436, 1550),
> (u'curveTo', -478, 1550, -533, 1568, -602, 1606), (u'curveTo', -646, 1630, -679, 1642, -700, 1642),
> (u'curveTo', -752, 1642, -778, 1611, -776, 1547), (u'lineTo', -909, 1547), 'closePath']
>>>>



i.e. the curve starts at -518/2048 and goes at least to -844/2048, but it's clear
no single shift can match the various upper and lower case widths that could
occur. The Arial curve is even more negative.

If a combined glyph is in the font we should use it, but I'm not sure we even have
an API for that; TTFont has charToGlyph (unicode --> glyph number), but we have code
to escape if there are no glyph components defined for it, so the test is quite hard.

Otherwise, generating a missing combined glyph dynamically is probably the way
to go, but to do that we need information about how each combining character is
supposed to be positioned. The alternative is to attempt to do the adjustment
every time we render text using pdf operators; we still need the same information.

#################################################################
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfbase.pdfmetrics import registerFont
from reportlab.pdfgen.canvas import Canvas
from reportlab.lib.pagesizes import A4 as pagesize
from reportlab.lib.utils import uniChr
from unicodedata import normalize as unormalize
registerFont(TTFont("Arial", "Arial.ttf"))
registerFont(TTFont("DejaVuSans", "DejaVuSans.ttf"))

c = Canvas('tdiacritics.pdf', pagesize=pagesize)
y0 = pagesize[1]-12
for fontName in ('Arial','DejaVuSans','Helvetica'):
    c.setFont(fontName, 10)
    y = y0
    y -= 12
    c.drawString(18,y,fontName)
    for diacritic in range(0x300,0x370):
        if y-24 < 0:
            c.showPage()
            c.setFont(fontName, 10)
            y = y0
            y -= 12
            c.drawString(18,y,fontName)
        y -= 12
        x = 18
        diacritic = uniChr(diacritic)
        c.drawString(x,y,hex(ord(diacritic)))
        x += 40
        u = u' '+diacritic+(u' w=%s'%c.stringWidth(diacritic))
        c.drawString(x,y,u)
        x += max(c.stringWidth(u),40)
        for g in u'AEIOUYaeiouy':
            u = ' '+g+diacritic
            c.drawString(x,y,u)
            x += 20
    c.showPage()
    c.setFont(fontName, 10)
    y = y0
    y -= 12
    c.drawString(18,y,fontName+' normalized')
    for diacritic in range(0x300,0x370):
        if y-24 < 0:
            c.showPage()
            c.setFont(fontName, 10)
            y = y0
            y -= 12
            c.drawString(18,y,fontName+' normalized')
        y -= 12
        x = 18
        diacritic = uniChr(diacritic)
        c.drawString(x,y,hex(ord(diacritic)))
        x += 40
        u = u' '+diacritic+(u' w=%s'%c.stringWidth(diacritic))
        c.drawString(x,y,u)
        x += max(c.stringWidth(u),40)
        for g in u'AEIOUYaeiouy':
            u = unormalize('NFC',' '+g+diacritic)
            c.drawString(x,y,u)
            x += 20
    c.showPage()
c.save()
#################################################################
--
Robin Becker

Marius Gedminas

Apr 17, 2015, 12:54:43 PM
to reportl...@lists2.reportlab.com
On Fri, Apr 17, 2015 at 03:22:34PM +0100, Robin Becker wrote:
> Who is responsible for glyph positioning. I believe it is the font + the
> renderer who is responsible.

I think so too.

> I wrote the script below to test various diacritic behaviours in reportlab.
...
> The TLDR is as follows, the TTF fonts seem to know about diacritics. The
> adobe builtins may or may not know about them, but with our standard
> encoding Helvetica clearly doesn't.
>
> The script draws space + glyph + diacritic for some upper and lower case
> roman letters. It also draws the same after unicode normalization.
>
> Where seen, all the diacritics have zero width. The DejaVuSans font seems to
> do slightly better than Arial in centring the common diacritics, where
> available the composed glyphs (obtained by normalization) seem much better.
>
> With no width for centring it would seem we need to examine the curves to
> get any kind of centring right.

The font should have the positioning information in the GPOS table:
https://www.microsoft.com/typography/otspec/gpos.htm

I would hope that PDF renderers will take care to apply the position
adjustments, so the PDF generator doesn't have to emit any extra
positioning commands.

The problem is that ReportLab doesn't embed the font directly. Instead
it constructs multiple subsets (each with < 256 codepoints), and those
subsets constructed by ReportLab do not have GPOS information (check the
TTFontFile.makeSubset method to see what TTF tables are copied and how
they're transformed; my apologies about the terrible code you'll find
therein).

The GPOS table cannot be copied directly: subsetting changes glyph
numbering, so the GPOS table would have to be taken apart and
reconstructed with the renumbered glyphs.
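(For comparison: the renumbering problem described above is exactly what a general-purpose subsetter has to solve. A sketch with fontTools' subset module, which rebuilds GPOS against the new glyph ids; this is an external tool and an assumption on my part, not what TTFontFile.makeSubset does, and it still yields one full TrueType font rather than the 256-code subsets ReportLab emits.)

#################################################################
# Sketch: build a subset that keeps (renumbered) GPOS data, using fontTools.
# Illustrates what "taking the GPOS table apart and reconstructing it"
# involves; ReportLab's own subsetter does not do this.
from fontTools import subset
from fontTools.ttLib import TTFont

font = TTFont('DejaVuSans.ttf')
options = subset.Options()
options.layout_features = ['*']        # keep every GSUB/GPOS feature
subsetter = subset.Subsetter(options)
subsetter.populate(text=u'a\u0303 \u0254\u0303')  # only the glyphs actually used
subsetter.subset(font)                 # glyph ids renumbered, GPOS rewritten to match
font.save('DejaVuSans-subset.ttf')
#################################################################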

> Otherwise, generating a missing combined glyph dynamically is probably the
> way to go, but to do that we need information about how each combining
> character is supposed to be positioned. The alternative is to attempt to do
> the adjustment every time we render text using pdf operators; we still need
> the same information.

I hope the issue can be solved in a simpler way.

Marius Gedminas
--
The memory management on the PowerPC can be used to frighten small children.
-- Linus Torvalds

Glenn Linderman

Apr 17, 2015, 3:48:19 PM
to reportl...@lists2.reportlab.com
On 4/17/2015 7:22 AM, Robin Becker wrote:
Who is responsible for glyph positioning. I believe it is the font + the renderer who is responsible.

I believe you are correct, but from that Safari Books link I referenced a few emails ago:
<https://www.safaribooksonline.com/library/view/developing-with-pdf/9781449327903/ch04.html>

This means that many things that developers working in other file formats take for granted, such as just putting down Unicode codepoints and letting the renderer do all the hard work, have to be done manually with PDF.

So it seems to me that while most renderers render to bitmaps at some resolution, for PDF files you need to assume some (probably high) bitmap resolution, intercept the rendering process, and capture the glyphs and positions to convert to PDF instructions... where the fonts may well contain extra glyphs, not corresponding to Unicode codepoints, to be used for various composition techniques... the above statement implies to me that all the hard work of rendering has to be done before encoding to PDF, and that the PDF display tools only convert curves to a bitmap.

Back in 256-character font days, some fonts allocated codes for up to six different copies of the diacritical, all with zero width but variant side bearings, and called them "capital O <accent>", "capital E <accent>", "capital I <accent>" (and three more for lowercase); then the user/program had to pick the right variety of the accent to go with the preceding base character (A & E & U being roughly the same width, generally, to cover all the vowels; but note that some diacriticals go with consonants too).

I'm far from an expert in PDF files, not much more acquainted with font files, but that quote above from the Safaribooks, together with thinking about the complexity of transforming the tables in a useful manner, and doubt that PDF display tools would have freedom to use those tables even if they existed, and uncertainty about if they are even allowed to be embedded, makes me think that individual glyphs would have to be separately positioned by instructions in the PDF file.  The next paragraphs are an exposition of my thought process in arriving at this conclusion, but someone with more knowledge may well poke holes in it, and I'd be delighted to be presented with references to documentation that pokes those holes, that I couldn't find via Google.

The way I interpret that is that PDF display tools really only follow instructions about placing glyphs on the screen... in particular, since they need instructions (fairly well documented) to do simple things like kerning, I would find it surprising if they would not need instructions to do more complex things like character composition.

On the other hand, having read Marius' reply, I can't say with certainty whether, if more font tables were included, some particular PDF display tools might invoke a renderer that would do more work "during display". Still, with the stated goal that PDF files can be reproduced identically by any PDF tool, I'm guessing that leaving _any_ work up to the renderer, other than following exact curve descriptions from a font, would not be compatible with identical reproduction.

I'm not clear on how the font embedding works; Marius hints at the possibility of rebuilding some of the tables to correspond to renumbered glyphs, but I've no clue if those tables are allowed to be embedded, or, if embedded, if they are allowed to be used by the PDF display tools anyway, for identical reproduction.

I found a reference to some howto guides for a font creation tool, and it was talking about having alternate glyphs for certain types of uses; one example was using a shorter accent mark above uppercase (taller) letters than above lowercase (shorter) letters. I've also heard about fonts containing "attachment points" (I don't know if that is one of the available tables, or metadata per glyph) so that when characters are combined, they combine based on these attachment points.  Whether both glyphs of a combining pair must have attachment points, or only one, I couldn't say, but the goal would seem to be, for many diacriticals, to have the diacritical centered above or below the character... the zero-width thing for diacriticals doesn't achieve that because the base characters have different widths, but checking the left and right side-bearings may be an alternative way to determine the width for accents that want to be centered.

Of course some diacriticals are to be placed to the right or left of the characters, or connect two characters rather than being placed on one, making me wonder if there might be different attachment points for different types of diacriticals; a base character could have, potentially, up to 6 that I can think of, for diacriticals that should be centered, left edged, or right edged, times above and below.
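In OpenType terms those "attachment points" are the anchors in the GPOS mark-to-base subtables (lookup type 4): the mark glyph and the base glyph each carry an anchor, one per mark class, and the renderer is meant to translate the mark so the two anchors coincide. A rough sketch of reading them with fontTools (an assumption on my part; attribute names follow the OpenType spec, and real fonts may wrap these lookups in extension lookups, which this skips):

#################################################################
# Rough sketch: dump mark-to-base anchor points from a font's GPOS table.
# Attaching a mark means translating it so its anchor coincides with the
# base glyph's anchor for that mark class.
from fontTools.ttLib import TTFont

font = TTFont('DejaVuSans.ttf')
gpos = font['GPOS'].table
for lookup in gpos.LookupList.Lookup:
    if lookup.LookupType != 4:          # 4 = MarkToBase attachment
        continue
    for st in lookup.SubTable:
        for name, rec in zip(st.MarkCoverage.glyphs, st.MarkArray.MarkRecord):
            a = rec.MarkAnchor
            print('mark', name, 'class', rec.Class, (a.XCoordinate, a.YCoordinate))
        for name, rec in zip(st.BaseCoverage.glyphs, st.BaseArray.BaseRecord):
            anchors = [(a.XCoordinate, a.YCoordinate) if a else None
                       for a in rec.BaseAnchor]
            print('base', name, anchors)
#################################################################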

When fonts are carved up into little chunks for pre-Unicode PDF font subsets, many of the base characters may wind up in different font subsets than the diacriticals that want to attach to them... this makes me wonder if it is even possible to rework the font tables as Marius suggested, even if it is legal and could be useful for some? all? PDF display tools. Maybe some glyphs would have to be repeated in multiple subsets?

Robin Becker

Apr 20, 2015, 5:20:18 AM
to reportl...@lists2.reportlab.com
........

> The problem is that ReportLab doesn't embed the font directly. Instead
> it constructs multiple subsets (each with < 256 codepoints), and those
> subsets constructed by ReportLab do not have GPOS information (check the
> TTFontFile.makeSubset method to see what TTF tables are copied and how
> they're transformed; my apologies about the terrible code you'll find
> therein).
>
> The GPOS table cannot be copied directly: subsetting changes glyph
> numbering, so the GPOS table would have to be taken apart and
> reconstructed with the renumbered glyphs.
>

well I guess the way to go is

1) try an experiment to see if PDF renderers will accept the GPOS information in
a specific font and make good use of it. I guess we can use illustrator or
equivalent to make a sample document. Examining the dejaVuSans font shows it
certainly has GPOS information.

2) If the answer to 1 is yes then we'll need to parse the GPOS information and
construct subsets that keep the required pairs together. From my understanding
of the way PDF uses text I see little hope of constructing a single font that
does this for all glyphs in a simple way (section 3.2.3 of the 1.7 PDF spec says
"A string object consists of a series of bytes—unsigned integer values in the
range 0 to 255"), so we're apparently limited to encodings of length 256 or
less. Presumably we'll have to be really smart about constructing our encodings
if many glyph+diacritic pairs are used.

Glenn Linderman

Apr 20, 2015, 6:55:03 AM
to reportlab-users
On 4/20/2015 2:20 AM, Robin Becker wrote:
........
The problem is that ReportLab doesn't embed the font directly.  Instead
it constructs multiple subsets (each with < 256 codepoints), and those
subsets constructed by ReportLab do not have GPOS information (check the
TTFontFile.makeSubset method to see what TTF tables are copied and how
they're transformed; my apologies about the terrible code you'll find
therein).

The GPOS table cannot be copied directly: subsetting changes glyph
numbering, so the GPOS table would have to be taken apart and
reconstructed with the renumbered glyphs.


well I guess the way to go is

1) try an experiment to see if PDF renderers will accept the GPOS information in a specific font and make good use of it. I guess we can use illustrator or equivalent to make a sample document. Examining the dejaVuSans font shows it certainly has GPOS information.

Maybe. The attempt will also be instructive regarding how Illustrator might handle such combined characters... if it does (I don't have Illustrator to test with, but since it is from Adobe, it well might)... and what the generated PDF looks like... if it contains positioning instructions, or depends on the PDF display tools to have a good renderer.



2) If the answer to 1 is yes then we'll need to parse the GPOS information and construct subsets that keep the required pairs together. From my understanding of the way PDF uses text I see little hope of constructing a single font that does this for all glyphs in a simple way (section 3.2.3 of the 1.7 PDF spec says "A string object consists of a series of bytes—unsigned integer values in the range 0 to 255"), so we're apparently limited to encodings of length 256 or less. Presumably we'll have to be really smart about constructing our encodings if many glyph+diacritic pairs are used.

If #2 applies, such an analysis of encodings is probably best done after seeing all the combinations used in the file.  Would it make sense to have an iteration inside build() just to collect all the characters used in a document for such an analysis? I've really no clue at what iteration the current font subset generation takes place, whether it is first, last or somewhere in the middle... nor do I have a clue if more characters get added in various phases due to repagination, etc.


Robin Becker

Apr 20, 2015, 8:32:27 AM
to reportlab-users
On 20/04/2015 11:54, Glenn Linderman wrote:
> On 4/20/2015 2:20 AM, Robin Becker wrote:
>> ........
..........
>> 1) try an experiment to see if PDF renderers will accept the GPOS information
>> in a specific font and make good use of it. I guess we can use illustrator or
>> equivalent to make a sample document. Examining the dejaVuSans font shows it
>> certainly has GPOS information.
>
> Maybe. The attempt will also be instructive regarding how Illustrator might
> handle such combined characters... if it does (I don't have Illustrator to test
> with, but since it is from Adobe, it well might)... and what the generated PDF
> looks like... if it contains positioning instructions, or depends on the PDF
> display tools to have a good renderer.
>
yes, that's what I wanted to find out, i.e. (a) does the GPOS info help, (b) will
renderers take notice, and (c) how does a 'proper' implementation of subsets work.


>>
>> 2) If the answer to 1 is yes then we'll need to parse the GPOS information and
>.........
>> encodings if many glyph+diacritic pairs are used.
>
> If #2 applies, such an analysis of encodings is probably best done after seeing
> all the combinations used in the file. Would it make sense to have an iteration
> inside build() just to collect all the characters used in a document for such an
> analysis? I've really no clue at what iteration the current font subset
> generation takes place, whether it is first, last or somewhere in the middle...
> nor do I have a clue if more characters get added in various phases due to
> repagination, etc.
....
I'm not certain any extra pass will be required unless we want to be 'optimal'
in some sense. Currently we make subsets on demand, i.e. in response to the glyphs
that are actually used. If we see usage of diacritics we will need to ensure
that splitString(text,doc) --> (subset0, bytes0),... does the right thing: if it
sees a pair of glyphs that must be in the same font, then it will have to ensure
that, even if it means a particular glyph gets duplicate mappings. Normally we
create a new subset only when the previous one gets filled, but I suspect we may
need to allow subsets to be created in more places and for more reasons. Luckily
all these 'dynamic' fonts are lazily constructed afterwards, so we could maintain
separate diacritic usage subsets if needed.
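To make the 'duplicate mappings' idea concrete, here is a purely hypothetical sketch (names invented for illustration, not ReportLab API) of an allocator that keeps a base + combining mark group inside one 256-code subset, duplicating a glyph's code in another subset when needed:

#################################################################
# Hypothetical sketch of a subset allocator: each subset holds at most 256
# codes, and a base+combining-mark group must land in the same subset even
# if that means giving a glyph a second (duplicate) code elsewhere.
import unicodedata

class SubsetAllocator(object):
    def __init__(self, limit=256):
        self.limit = limit
        self.subsets = [[]]                 # each entry: list of codepoints

    def _place(self, group):
        for i, s in enumerate(self.subsets):
            if len(s) + len(group) <= self.limit:
                codes = []
                for cp in group:
                    if cp not in s:          # duplicates across subsets are allowed
                        s.append(cp)
                    codes.append((i, s.index(cp)))
                return codes
        self.subsets.append([])
        return self._place(group)

    def encode(self, text):
        """Return a list of (subset index, byte code) pairs for text."""
        out, i = [], 0
        while i < len(text):
            group = [ord(text[i])]
            j = i + 1                        # pull in any following combining marks
            while j < len(text) and unicodedata.combining(text[j]):
                group.append(ord(text[j]))
                j += 1
            out.extend(self._place(group))
            i = j
        return out

alloc = SubsetAllocator()
print(alloc.encode(u'\u0254\u0303 a\u0303'))
#################################################################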

Robin Becker

Apr 20, 2015, 11:38:10 AM
to reportlab-users
On 20/04/2015 11:54, Glenn Linderman wrote:
> On 4/20/2015 2:20 AM, Robin Becker wrote:
>..........
>>
>> 1) try an experiment to see if PDF renderers will accept the GPOS information
>> in a specific font and make good use of it. I guess we can use illustrator or
>> equivalent to make a sample document. Examining the dejaVuSans font shows it
>> certainly has GPOS information.
>
> Maybe. The attempt will also be instructive regarding how Illustrator might
> handle such combined characters... if it does (I don't have Illustrator to test
> with, but since it is from Adobe, it well might)... and what the generated PDF
> looks like... if it contains positioning instructions, or depends on the PDF
> display tools to have a good renderer.
>
>

well unfortunately the illustrator test produced exactly the wrong results; I
copied the text from my sample DejaVuSans output into an illustrator text box
with font set at dejavusans book. Illustrator or the copy and paste did exactly
the wrong thing and converted only those pairs that are in the font already.

I also tried the effect of hand typing an A and then selecting a diacritic from
illustrator's text/glyph window. The characters were sort of composed in the
input window, but they were not well displayed and looked the same in a saved
PDF. Our own output actually looked better for this case.

Result indecisive. I will have to do further work to test the actual embedded
font to see if it contains gpos info.

Glenn Linderman

Apr 20, 2015, 7:00:39 PM
to reportlab-users
On 4/20/2015 8:38 AM, Robin Becker wrote:
On 20/04/2015 11:54, Glenn Linderman wrote:
On 4/20/2015 2:20 AM, Robin Becker wrote:
..........

1) try an experiment to see if PDF renderers will accept the GPOS information
in a specific font and make good use of it. I guess we can use illustrator or
equivalent to make a sample document. Examining the dejaVuSans font shows it
certainly has GPOS information.

Maybe. The attempt will also be instructive regarding how Illustrator might
handle such combined characters... if it does (I don't have Illustrator to test
with, but since it is from Adobe, it well might)... and what the generated PDF
looks like... if it contains positioning instructions, or depends on the PDF
display tools to have a good renderer.



well unfortunately the illustrator test produced exactly the wrong results; I copied the text from my sample DejaVuSans output into an illustrator text box with font set at dejavusans book. Illustrator or the copy and paste did exactly the wrong thing and converted only those pairs that are in the font already.

I also tried the effect of hand typing an A and then selecting a diacritic from illustrator's text/glyph window. The characters were sort of composed in the input window, but they were not well displayed and looked the same in a saved PDF. Our own output actually looked better for this case.

Result indecisive. I will have to do further work to test the actual embedded font to see if it contains gpos info.

It would be interesting, perhaps, to file that as a bug report with Adobe, and see how they handle the report, or if they explain a workaround, or just admit to lack of support for such things.

Meantime, seeing your approach of looking at Illustrator output, I had a friend with Acrobat take my little test string and create a PDF from Acrobat.  Results look good, and are at: http://nevcal.com/temporary/openo-Acrobat.pdf  Maybe seeing what they do will help.  File is big enough they must have embedded something or another, font-wise.

Glenn

Andy Robinson

Apr 21, 2015, 2:14:39 AM
to reportlab-users
> Meantime, seeing your approach of looking at Illustrator output, I had a
> friend with Acrobat take my little test string and create a PDF from
> Acrobat. Results look good, and are at:
> http://nevcal.com/temporary/openo-Acrobat.pdf Maybe seeing what they do
> will help. File is big enough they must have embedded something or another,
> font-wise.

Glenn, could you ask your friend exactly what they did with Acrobat to
create this? i.e. did they use Acrobat Distiller to convert a
Postscript file, or create a word document and export it to PDF using
Acrobat? If we can observe another program "doing it right" it may
help.

- Andy

Glenn Linderman

Apr 21, 2015, 3:28:16 AM
to reportlab-users
On 4/20/2015 11:14 PM, Andy Robinson wrote:
Meantime, seeing your approach of looking at Illustrator output, I had a
friend with Acrobat take my little test string and create a PDF from
Acrobat.  Results look good, and are at:
http://nevcal.com/temporary/openo-Acrobat.pdf  Maybe seeing what they do
will help.  File is big enough they must have embedded something or another,
font-wise.
Glenn, could you ask your friend exactly what they did with Acrobat to
create this?  i.e. did they use Acrobat Distiller to convert a
Postscript file, or create a word document and export it to PDF using
Acrobat?  If we can observe another program "doing it right" it may
help.
Oh, he told me up front, I just didn't think the process mattered so much as the results.

He started from my email, selected and then copied the text string into the clipboard, and apparently Acrobat has a feature to create a PDF file from the clipboard contents, so he used that to create the file.  First time through he selected the whole email, but I thought that would just add clutter, so asked him to just do the "interesting" text.

He's quite fond of PDF editors, it would seem... he has several. 

Using the same process, there is now another file at
http://nevcal.com/temporary/openo-Nuance.pdf

He also tried the just-released Nitro 10, but it failed to create from the clipboard, failed to create from a UTF-8 text file, failed to create from a plain text file, and failed to create from a Word document... he has submitted a bug report, and is probably busy reinstalling the prior version of Nitro.  If Nitro 9 will do the job, that might give another file in the near future.

Actually, the reason he has so many, is that most of them have limitations, some are good for one sort of thing, but mess up on other things. Another will do the other things, but not something else. Etc.  He mostly uses the editing features to fine tune and work-around limitations in the PDF creation from other programs, rather than using them to create raw PDF files from other formats.

So in counting the input characters for my sample, there are 8 base/precomposed characters, and 4 combining diacriticals, for a total of 12.

The text stream from Acrobat is as follows:

15 0 obj 
<<
/Length 455
>>
stream
BT
/P <</MCID 0 >>BDC 
/CS0 cs 0.2 0 0.2  scn
/TT0 1 Tf
12 -0 0 12 72 709.2 Tm
( )Tj
/C2_0 1 Tf
36 -0 0 36 72 672.72 Tm
<0727>Tj
/TT0 1 Tf
0.443 0 Td
(\343)Tj
/C2_0 1 Tf
0.443 0 Td
<0727>Tj
0.447 -0.17 Td
<047A>Tj
/TT0 1 Tf
-0.003 0.17 Td
(\325)Tj
/C2_0 1 Tf
0.723 0 Td
<0690>Tj
0.557 0.047 Td
<047A>Tj
0.11 -0.047 Td
<072D072D>Tj
0.853 -0.17 Td
<047A>Tj
-0.013 0.17 Td
<0699>Tj
0.473 0.047 Td
<047A>Tj
/TT0 1 Tf
12 -0 0 12 218.16 672.72 Tm
( )Tj
EMC 
ET

endstream 
endobj 

The text stream from Nuance is as follows:

7 0 obj 
<<
/Length 600
>>
stream
0.1999 0 0.1999 rg
[]0 d 1 w 10 M 0 i 0 J 0 j 
BT
/F0 35.029 Tf
1 0 0 1 28.789 774.789 Tm 
0 Tc 0 Tw 0 Tr 100 Tz 0 Ts 
( ')Tj
1 0 0 1 44.268 774.789 Tm 
(\000m)Tj
ET
BT
/F0 35.029 Tf
1 0 0 1 59.868 774.789 Tm 
( ')Tj
1 0 0 1 75.467 769.03 Tm 
( z)Tj
ET
BT
/F1 35.029 Tf
0.9999 0 0 0.9999 75.778 774.782 Tm 
( )Tj
ET
BT
/F0 35.029 Tf
1 0 0 1 100.787 774.789 Tm 
(  )Tj
1 0 0 1 120.106 776.469 Tm 
( z)Tj
1 0 0 1 124.066 774.789 Tm 
( -)Tj
1 0 0 1 138.825 774.789 Tm 
( -)Tj
1 0 0 1 153.825 769.03 Tm 
( z)Tj
ET
BT
/F0 35.029 Tf
1 0 0 1 153.585 774.789 Tm 
(  )Tj
1 0 0 1 170.145 776.469 Tm 
( z)Tj
ET

endstream 
endobj 


I was rather surprised to see that Nuance had control characters in the Tj parameters.  Acrobat has some too, but mostly hex-quads.

Not counting the leading and trailing space characters that got included, Acrobat emits 12 characters, which means that it doesn't compose them in the font creation.

I'm really not sure how to count the control characters... most of the Nuance Tj have both a control character and a regular character.  Maybe that is a form of CID mapping?  It emits 23 characters using Tj, unless the control character+regular character pair should be counted as one, in which case it emits 12 characters... which sounds more correct.

What I notice particularly about this compared to other PDF files I have looked at the internals of, is that both Acrobat and Nuance emit text movement operators between _each_ character (except one pair, in the case of Acrobat, which are the sequential ɛ characters, one with and one without a diacritical).  Acrobat uses Td, and Nuance uses Tm.
So my "guessing about a lot of things I haven't figured out" conclusion, without knowing how to look at the embedded fonts, is that both Acrobat and Nuance are doing the kerning and character-composition positioning on the way into the PDF file, rather than expecting the PDF display tool's renderer to be smart. This is consistent with my guessing after reading the quote from that Safari Books Online book.  No clue how they figure out the numbers... no doubt it is either from the font files directly, using their own rendering code, or from some font rendering library, or from Windows somehow. The latter seems doubtful for Acrobat, since it also runs on Mac... although that is no guarantee it uses the same (recompiled) code on both platforms... it could get it from Windows on Windows and from OS X on Mac.

I didn't attempt to do the math to see what all those Tm and Td operations are doing, but they seem to have produced equivalent results, visually. I sort of know what a matrix multiply is, but have maybe only done one by hand in some math class many years ago, and haven't figured out _how_ they scale and move graphics blobs, or what the 6 or 9 numbers actually mean in 2D space; I am aware, though, that 2D graphics scaling, rotation, skewing, and translation can all be done via matrix math. There should be no skewing or rotation for this work, which probably simplifies the Tm operands... enough that Acrobat knows how to shoehorn them into a Td operator's smaller set of parameters. In fact, quickly referring to the documentation for Td and Tm, it seems that   p1 p2 Td   is essentially shorthand for   1 0 0 1 p1 p2 Tm   and I see that all of Nuance's Tm parameter lists start with 1 0 0 1, so they are missing a space optimization in their PDF generation (well, one of them is .9999 0 0 .9999, and that is likely a rounding error).
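To spell out what those six numbers do: a text matrix [a b c d e f] maps a text-space point (x, y) to (a*x + c*y + e, b*x + d*y + f); tx ty Td composes the translation [1 0 0 1 tx ty] with the current line matrix, while Tm sets the matrix outright, so for unrotated, unscaled text they do the same job. A tiny check using numbers from the Nuance stream above:

#################################################################
# Apply a PDF text matrix [a b c d e f] to a text-space point.
def apply_tm(m, x, y):
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

print(apply_tm((1, 0, 0, 1, 28.789, 774.789), 0, 0))            # pure translation, what Td does
print(apply_tm((0.9999, 0, 0, 0.9999, 75.778, 774.782), 0, 0))  # Nuance's near-identity scale
#################################################################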

I find it interesting that both tools generated two fonts... there are only 8 Unicode codepoints, but I could imagine that the "upper case" and "lower case" diacriticals could be separate glyphs, even though they look the same, but that would still only be 9. Hardly justification for a second font...

In the Acrobat file, it seems from the sequence of character codes used, that the /C2_0 font contains the precomposed characters, and all the others (base characters, and combining diacriticals) are in the /TT0 font.  For Nuance, the precomposed Õ seems to be the only one in /F1, with the rest all coming from /F0.  Curiously, that one is also the one that has the .9999 0 0 .9999 matrix data.

Well, that is about the end of what I can figure from this information.

Andy Robinson

Apr 21, 2015, 3:45:51 AM
to reportlab-users
Thanks.. It matters a lot as Acrobat is a whole family of tools, but
it does mean that we can repeat the process here in the office (e.g.
with "hello world") and see how simple a file it creates and what
structures are in it. Traditionally the simplest way was to write a
"Hello World" postscript program and run it through Distiller.

I don't have time to get involved in this for a couple more weeks, but
when things quiet down at work (admittedly I have been saying that for
about ten years now!) I need to get back to some of the things we did
in ReportLab's early days, along those lines; and also see if any of
the other open source PDF generators "get it right" so we can see how.

Robin Becker

Apr 21, 2015, 5:51:07 AM
to reportlab-users
Glenn,

my reading of the control sequence(s) is that these glyphs are being
individually positioned in PDF; I see 12 separate Tm operators.

Ideally we should see a single BT with a string containing 14 bytes, which
would imply that acrobat handles all the glyph positioning.

I believe that the text strings are actually using two bytes per glyph; the map
looks like

6 beginbfchar
<006d> <00e3>
<047a> <0303>
<0690> <0186>
<0699> <0190>
<0727> <0254>
<072d> <025b>
endbfchar

so the byte strings required correspond to the first of each pair.

006d = 00 m = \000m
047a = 04 z = ^Dz the tilde
06?? = 06 ?? = ^f?
0727 = 07 ' = ^G'
072d = 07 - = ^G-

etc etc. My mailer can't actually cope with the odd characters in the 06 lines.
--
Robin Becker

Glenn Linderman

Apr 21, 2015, 6:50:42 AM
to reportl...@lists2.reportlab.com
On 4/21/2015 2:51 AM, Robin Becker wrote:
Glenn,

my reading of the control sequence(s) is that these glyphs are being individually positioned in PDF; I see 12 separate Tm operators.

I agree.


I ideally we should see a single BT with a string containing 14 bytes which would imply that acrobat handles all the glyph positioning.

I think we are on the same wavelength here, but I think you meant to say "Adobe Reader (or other PDF display tool)" where you said "Acrobat".  I think it is the case that "Acrobat", (or other PDF generation tool), is doing all the positioning, and encoding it into the PDF file.

The below seems to be referring to the Nuance generated file, the Acrobat file used HEX codes.

"Ideally", of course, refers to the way it should work if the PDF viewer's renderer was responsible for combined glyph positioning. Of course, if it was, it should also be responsible for rendering the kerning too, and then you wouldn't be able to do right justification very well... it would have to be predicted in one place and matched in the other... so I think the PDF technique is to have the viewer only convert curves to pixels, following instructions by the PDF creator as to where those curves should be placed, actually produces more consistent results across platforms and devices... as much as it hurts to have to do the calculations for the Td or Tm parameters when generating the PDF.



I believe that the text strings are actually using two bytes per glyph; the map looks like

6 beginbfchar
<006d> <00e3>
<047a> <0303>
<0690> <0186>
<0699> <0190>
<0727> <0254>
<072d> <025b>
endbfchar

Ah, yes, I missed looking at the map... so I was unaware that it was legal to use the character codes themselves in the <>, I thought <> was only for HEX codes... but then again, that was just by observation of various PDF files, not from the spec... And I've not tried to understand very many.



so the byte strings required correspond to the first of each pair.

006d = 00 m = \000m
047a = 04 z = ^Dz   the tilde
06?? = 06 ?? = ^f?
0727 = 07 ' = ^G'
072d = 07 - = ^G-

etc etc. My mailer can't actually cope with the odd characters in the 06 lines.

Understood... my mailer seemed to drop those control characters, also.

Robin Becker

Apr 21, 2015, 8:45:04 AM
to reportlab-users
On 21/04/2015 11:50, Glenn Linderman wrote:
> On 4/21/2015 2:51 AM, Robin Becker wrote:
>> Glenn,
>>
>> my reading of the control sequence(s) is that these glyphs are being
>> individually positioned in PDF; I see 12 separate Tm operators.
>
> I agree.
>
>> Ideally we should see a single BT with a string containing 14 bytes, which
>> would imply that acrobat handles all the glyph positioning.
>
> I think we are on the same wavelength here, but I think you meant to say "Adobe
> Reader (or other PDF display tool)" where you said "Acrobat". I think it is the
> case that "Acrobat", (or other PDF generation tool), is doing all the
> positioning, and encoding it into the PDF file.

yes the positioning is not being done by the renderer (acrobat reader/evince etc
etc).

If that is the case then positioning has to be done by the software that
produces the PDF, i.e. illustrator/acrobat reader pro/reportlab. If this is true
then there's no point in including the GPOS information in the embedded fonts.

If reportlab has to do the positioning of glyphs it should not affect the
existing standard mechanisms. Probably we'll need a cumbersome, slow and fairly
complicated text output mechanism.
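
As a sketch of where such positions could come from if reportlab did the work itself, assuming fontTools is available and the embedded TTF carries a plain mark-to-base (lookup type 4) GPOS subtable; extension lookups, mark-to-mark/ligature attachment and device tables are ignored, and the font path is an assumption:

from fontTools.ttLib import TTFont

def mark_to_base_offset(font_path, base_char, mark_char, font_size):
    """(dx, dy) in points, relative to the base glyph's origin, that put the
    mark's anchor on the base's anchor; None if the font has no such data."""
    font = TTFont(font_path)
    if "GPOS" not in font:
        return None
    cmap = font.getBestCmap()
    base_g, mark_g = cmap[ord(base_char)], cmap[ord(mark_char)]
    scale = font_size / font["head"].unitsPerEm
    for lookup in font["GPOS"].table.LookupList.Lookup:
        if lookup.LookupType != 4:           # 4 = MarkToBase attachment
            continue
        for st in lookup.SubTable:
            marks, bases = st.MarkCoverage.glyphs, st.BaseCoverage.glyphs
            if mark_g not in marks or base_g not in bases:
                continue
            mrec = st.MarkArray.MarkRecord[marks.index(mark_g)]
            banchor = st.BaseArray.BaseRecord[bases.index(base_g)].BaseAnchor[mrec.Class]
            if banchor is None:
                continue
            dx = (banchor.XCoordinate - mrec.MarkAnchor.XCoordinate) * scale
            dy = (banchor.YCoordinate - mrec.MarkAnchor.YCoordinate) * scale
            return dx, dy
    return None

print(mark_to_base_offset("times.ttf", "\u0254", "\u0303", 24))

In practice a shaping library would be used rather than walking GPOS by hand, but the arithmetic is the same: align the mark's anchor with the base's anchor and scale font units to points.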
>
> The below seems to be referring to the Nuance generated file, the Acrobat file
> used HEX codes.
>
> "Ideally", of course, refers to the way it should work if the PDF viewer's
> renderer were responsible for combined glyph positioning. If it were, though,
> it should also be responsible for rendering the kerning, and then you wouldn't
> be able to do right justification very well... it would have to be predicted in
> one place and matched in the other... so I think the PDF approach, where the
> viewer only converts curves to pixels and follows the PDF creator's instructions
> as to where those curves should be placed, actually produces more consistent
> results across platforms and devices... as much as it hurts to
> have to do the calculations for the Td or Tm parameters when generating the PDF.
.........

well I think kerning is a separate issue. Here we are talking about a standard
unicode approach to composite glyph construction. Pairs/groups of glyphs are
supposed to be treated in a specific way; kerning is optional.

Glenn Linderman

unread,
Apr 23, 2015, 1:03:47 AM4/23/15
to reportl...@lists2.reportlab.com
On 4/21/2015 5:45 AM, Robin Becker wrote:
On 21/04/2015 11:50, Glenn Linderman wrote:
On 4/21/2015 2:51 AM, Robin Becker wrote:
Glenn,

my reading of the control sequence(s) is that these glyphs are being
individually positioned in PDF; I see 12 separate Tm operators.

I agree.

Ideally we should see a single BT with a string containing 14 bytes, which
would imply that acrobat handles all the glyph positioning.

I think we are on the same wavelength here, but I think you meant to say "Adobe
Reader (or other PDF display tool)" where you said "Acrobat".  I think it is the
case that "Acrobat", (or other PDF generation tool), is doing all the
positioning, and encoding it into the PDF file.

yes the positioning is not being done by the renderer (acrobat reader/evince etc etc).

If that is the case then positioning has to be done by the software that produces the PDF ie illustrator/acrobat reader pro/reportlab. If this is true then there's no point in including the GPOS information into the embedded fonts.

If reportlab has to do the positioning of glyphs it should not affect the existing standard mechanisms. Probably we'll need a cumbersome, slow and fairly complicated text output mechanism.

Sounds like it. Having read only small snippets of reportlab code, I wonder if there is a reasonable way to say up front... I want "typeset quality" (or some number of variant levels thereof†) versus I want "speed".




The below seems to be referring to the Nuance generated file, the Acrobat file
used HEX codes.

"Ideally", of course, refers to the way it should work if the PDF viewer's
renderer were responsible for combined glyph positioning. If it were, though, it
should also be responsible for rendering the kerning, and then you wouldn't be
able to do right justification very well... it would have to be predicted in
one place and matched in the other... so I think the PDF approach, where the
viewer only converts curves to pixels and follows the PDF creator's instructions
as to where those curves should be placed, actually produces more consistent
results across platforms and devices... as much as it hurts to
have to do the calculations for the Td or Tm parameters when generating the PDF.
.........

well I think kerning is a separate issue. Here we are talking about a standard unicode approach to composite glyph construction. Pairs/groups of glyphs are supposed to be treated in a specific way; kerning is optional.

Kerning and composite glyph construction are, indeed, two separate items. Both, however, are needed for quality typesetting. I suppose a case could be made that composite glyph construction is needed for accuracy, not just quality typesetting, while kerning is needed only for quality typesetting, and in that sense it could be considered optional... except to folks that want to produce quality typesetting... and really, I guess there are only two reasons to use PDF files: a ubiquitous format (display everywhere), and quality typesetting (printers/publishers accept the PDF format, and often accept _only_ PDF format).

†I could see the following "levels" of quality / complexity. Maybe individual features should be turned on/off separately, or maybe that would overly complicate the code, and some ordering among the following items would provide increasing capability without a combinatorial explosion.
1. Speed
2. Kerning
3. combining diacritical character composition
4. other combining characters (ligatures, alternate glyphs in localized context)
5. RTL language support
"Speed" is what you seem to have now, and there is certainly a benefit to it when the other needs are not present, and any additional work is going to sacrifice at least some speed. The order I list is not based on knowledge of the code, but is mostly based on "bang for the coding buck"... Kerning is useful for all Latin & Cyrillic languages (I'm not sure about others, not needed for Chinese, I guess), and a significant implementation exists, if it can be reasonably integrated; diacriticals allow support of many more Latin-based languages; the last two are critical to supporting some languages, but the incremental market is no doubt smaller.

Glenn Linderman

unread,
May 28, 2015, 10:08:47 PM5/28/15
to reportl...@lists2.reportlab.com
On 4/15/2015 10:46 PM, Glenn Linderman wrote:
There are, I think, 4 issues, the first two of which I could definitely use if implemented, and which sound relatively easy, but likely have performance impact. They would enable _higher quality typesetting_ of Latin-based text into PDF files. The others could be hard, but would be required to support a wider range of languages with non-Latin fonts.  I did read something recently about Micro$oft producing a font layout system (but they used a different word in the article that I cannot come up with right now) for all the various needs of different language systems... The closest thing I can find with Google right now is their DirectWrite, but whether it incorporates the technology I read about, I couldn't say, but maybe it does or will. I don't recall if this was something they were making generally available to make the world's typography improve, or if it was a proprietary come-on to promote/improve Windows. It sounded pretty general, language-wise.
  1. kerning
  2. composite glyph positioning
  3. Languages with huge numbers of ligatures, where characters appear differently, even to the point of requiring different glyphs, at the beginning or end of words (Arabic) or adjacent to other letters (Thai).
  4. RTL languages.

1. kerning

My research into kerning is below, since it was somewhat productive. Most of it was on this list. I have not had time to research composite glyph positioning, which

Seems I forgot to finish this sentence in the original email, as the next thing was a different paragraph. And I won't now either, because I don't know what I was going to say.

Regarding what I said in the first paragraph about M$ producing a font layout system, here is the link (which I finally found again) where you can read what I read about that.

http://blogs.windows.com/bloggingwindows/2015/02/23/windows-shapes-the-worlds-languages/

This sort of technology and complexity would be required for items 3 & 4 in my complexity list. Probably it also handles 1 & 2, but they would be the simple cases.

There are some interesting statistics in the article about numbers of languages that require shaping engine support, as well as a pointer to the full specification for the Universal Shaping Engine, which is somewhat eye-glazing, and yet only an overview of the categorizations done, not containing any implementation hints that I could see.

I've no idea how many languages are supported by Cairo, say, which is the graphics rendering system used by Firefox (and likely open source), nor how well. Cairo does handle all the cases I've mentioned.

Glenn Linderman

unread,
May 29, 2015, 12:55:16 AM5/29/15
to reportl...@lists2.reportlab.com
Hmm. Some more Googling found references to OpenType/Pango/Harfbuzz, start here, and they claim to be what is used by Firefox (rather than Cairo? In addition to Cairo?).  Pango presentation claims it transforms from UTF-8 text to positioned glyphs (exactly what is needed for PDF files, it would seem); maybe Cairo then is the "how to render positioned glyphs to graphics images" part (exactly what is done by PDF viewers, it would seem). All this stuff seems to be open source. Not sure if the Harfbuzz shaping engine supports all the languages that M$'s Universal Shaping Engine does... This old blog post contrasts what Pango and Harfbuzz are and aren't, and how they can work together. No doubt things have changed somewhat since then.
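
For what it's worth, the "UTF-8 text to positioned glyphs" step can be seen directly with the uharfbuzz Python binding (the font path below is an assumption); the x/y offsets it reports for the combining tilde are the sort of numbers a PDF generator would need when writing its Td/Tm operators:

import uharfbuzz as hb

with open("times.ttf", "rb") as f:       # font path is an assumption
    face = hb.Face(hb.Blob(f.read()))
font = hb.Font(face)

buf = hb.Buffer()
buf.add_str("\u0254\u0303\u025b\u0303")  # open o + tilde, open e + tilde
buf.guess_segment_properties()
hb.shape(font, buf)

# Offsets and advances are in font units (the face's unitsPerEm, by default).
for info, pos in zip(buf.glyph_infos, buf.glyph_positions):
    print(info.codepoint, pos.x_offset, pos.y_offset, pos.x_advance)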