Printing UTF8 (Unicode)

David Newall

Jan 21, 2022, 5:56:56 AM

to Glaukon

Hello All,

I've written some PostScript to allow me to print UTF8-encoded strings:

(UTF-8 Encoded String.....) utfshow

I'm happy to send you the full source, or, if appropriate, publish it
here; however, the exposition below includes everything you should need.

I use a UTF-8 decoder which was written (in C) by Bjoern Hoehrmann (see
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/):

%/ Copyright (c) 2008-2010 Bjoern Hoehrmann <bjo...@hoehrmann.de>
%/ See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.

/UTF8_ACCEPT 0 def
/UTF8_REJECT 12 def

/utf8d [
%/ The first part of the table maps bytes to character classes
%/ to reduce the size of the transition table and create bitmasks.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
10 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 11 6 6 6 5 8 8 8 8 8 8 8 8 8 8 8

%/ The second part is a transition table that maps a combination
%/ of a state of the automaton and a character class to a state.
0 12 24 36 60 96 84 12 12 12 48 72 12 12 12 12 12 12 12 12 12 12 12 12
12 0 12 12 12 12 12 0 12 0 12 12 12 24 12 12 12 12 12 24 12 24 12 12
12 12 12 12 12 12 12 24 12 12 12 12 12 24 12 12 12 12 12 12 12 24 12 12
12 12 12 12 12 12 12 36 12 36 12 12 12 36 12 12 12 12 12 36 12 36 12 12
12 36 12 12 12 12 12 12 12 12 12 12
] def

% codep state byte decode codep' state'
/decode {
utf8d 1 index get % type
% codep state byte type
2 index UTF8_ACCEPT ne % state not UTF8_ACCEPT?
{ exch 16#3F and 4 -1 roll 6 bitshift or }
{ dup neg 16#FF exch bitshift 3 -1 roll and 4 -1 roll pop }
ifelse % state type codep'
3 1 roll add 256 add utf8d exch get % codep' state'
} def

%***************************************************************************/
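A minimal sketch of driving decode over a whole string, collecting the code points into an array instead of painting them. It assumes utf8d, UTF8_ACCEPT, UTF8_REJECT and decode are already defined as above. The name utf8codepoints and the choice of U+FFFD for a rejected sequence are assumptions made for illustration, not part of the original code.

```postscript
% string utf8codepoints [codepoint ...]   (hypothetical helper)
/utf8codepoints {
  [ exch                       % mark string
  0 UTF8_ACCEPT                % mark string codep state
  3 -1 roll                    % mark codep state string
  {                            % mark ... codep state byte
    decode                     % mark ... codep' state'
    dup UTF8_ACCEPT eq {
      % complete code point: keep a copy below the decoder state
      1 index 3 1 roll         % mark ... codep' codep' state'
    } if
    dup UTF8_REJECT eq {
      % bad sequence: emit U+FFFD (an assumption) and restart the decoder
      pop pop 16#FFFD 0 UTF8_ACCEPT
    } if
  } forall
  pop pop                      % drop the decoder's codep and state
  ]
} def
```

For example, (na\303\257ve) utf8codepoints should leave a five-element array holding 110 97 239 118 101, since the bytes C3 AF decode to U+00EF.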

I also use a table which Adobe published ("UNICODE translation table for
non-ASCII characters"), which they say is for going from a glyph name to
a Unicode codepoint. I (ab)use it in the reverse direction. I turned
it into a dictionary keyed on the codepoint.

The table is currently at https://github.com/adobe-type-tools/agl-aglfn.
Some codepoints have multiple possible glyph names, so the dictionary
has an array of potential glyph names for each codepoint. Finally,
fonts often have glyphs named /uniHHHH, where HHHH is the codepoint.

I converted the table to PS using awk:

BEGIN{FS="[; ]"}
/^#/ || NF<2 {next} # skip comment and blank lines in glyphlist.txt
{
for(i=2; i<=NF; i++) {
if(!($i in h)) {h[$i]=++n;v[n]=$i}
g[$i]=g[$i]"/"$1
}
}
END{
print "/unicode <<"
for(i=1;i<=n;i++) print "\t16#"v[i]"["g[v[i]]"/uni"toupper(v[i])"]"
print ">> def"
}

Adobe's table is turned into this:

/unicode <<
16#0041[/A/uni0041]
16#00C6[/AE/uni00C6]
...
16#305A[/zuhiragana/uni305A]
16#30BA[/zukatakana/uni30BA]
>> def

The crux of printing Unicode code points is to find which of the
possible glyphs the current font defines. I search currentfont's
CharStrings.

% look for one of the glyphs in fontdict's CharStrings
% [/glyph ...] fontdict chooseglyph /glyph true
% false
/chooseglyph {
/CharStrings get exch % the glyphs defined in fontdict
false 3 1 roll % assume we don't find a glyph
% false CharStrings [glyphs]
{ 2 copy known {true 4 2 roll exch pop exit}{pop} ifelse } forall
pop % remove CharStrings
} def
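A short usage sketch for chooseglyph, assuming the procedure above is already defined. The font and the candidate glyph names are only illustrative.

```postscript
% Ask the current font which of two candidate names it defines.
/Helvetica findfont 12 scalefont setfont
[/Euro /uni20AC] currentfont chooseglyph
{ (using glyph: ) print = }                      % name is on the stack
{ (none of the candidates is in this font) = }
ifelse
```

The first candidate found in CharStrings wins, so the order of names in the array acts as a priority list.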

I've noticed that Symbol sometimes contains glyphs that other fonts
don't, so, if I don't find a glyph in currentfont I look through Symbol.

I thought it might be a good idea to also try ZapfDingbats. In
retrospect, that might be a red herring.

Adobe also publish a table like the Unicode table, giving the names of
that font's glyphs. It's at the same place, and converts using the same
awk:

/zapf <<
16#275E[/a100/uni275E]
16#2761[/a101/uni2761]
...
16#275D[/a99/uni275D]
16#2720[/a9/uni2720]
>> def

This is the code which prints a unicode code point (or .notdef if a
glyph cannot be found):

% SPDX-License-Identifier: LGPL-2.1-or-later
%
% Copyright (c) 2022 by davidnewall.com. All rights reserved.

% print a single unicode codepoint:
% integer unicodeshow -
/unicodeshow {
% load array of known glyph names for this code point
unicode 1 index known
{unicode exch get} % array of possible glyphs
{ pop []} % unknown code point
ifelse
{
dup currentfont chooseglyph { glyphshow exit } if
dup /ZapfDingbats findfont chooseglyph {
currentfont exch /ZapfDingbats currentfontsize selectfont
glyphshow setfont exit } if
dup /Symbol findfont chooseglyph {
currentfont exch /Symbol currentfontsize selectfont
glyphshow setfont exit } if
/.notdef glyphshow exit
} loop
pop
} def

I get the current font size using this:

/currentfontsize {
currentfont dup /OrigFont get
2 { /FontMatrix get 3 get exch } repeat div
} bind def

Finally (at last!), to print a UTF-8 string:

/utfshow {
UTF8_ACCEPT 0 UTF8_ACCEPT % prev codep current
4 -1 roll {
decode
dup UTF8_ACCEPT eq { 1 index unicodeshow } if
dup UTF8_REJECT eq {
(%% Bad UTF-8 sequence\n) print pop
UTF8_ACCEPT /.notdef glyphshow } if
3 -1 roll pop dup 3 1 roll % prev = current
} forall
pop pop pop
} def
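A usage sketch, assuming utfshow, unicodeshow, chooseglyph, currentfontsize and the unicode dictionary are all loaded as described above. The font, position and text are arbitrary; non-ASCII bytes are written as octal escapes so the source stays plain ASCII.

```postscript
/Helvetica findfont 24 scalefont setfont
72 720 moveto
% \303\251 is U+00E9 (e acute), \342\202\254 is U+20AC (Euro)
(Caf\303\251 \342\202\254 5) utfshow
```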

Regards,

David

Carlos

Jan 21, 2022, 8:23:07 AM

David Newall <dav...@davidnewall.com>:

> Hello All,
>
> I've written some PostScript to allow me to print UTF8-encoded
> strings:

This is great!

[...]

> % print a single unicode codepoint:
> % integer unicodeshow -
> /unicodeshow {

[...]

> /utfshow {
> UTF8_ACCEPT 0 UTF8_ACCEPT % prev codep current
> 4 -1 roll {
> decode
> dup UTF8_ACCEPT eq { 1 index unicodeshow } if

[...]

Doesn't "x glyphshow y glyphshow" lose the kerning between x and y?
(I'm not really sure)

If it does, an alternative could be to create a (probably composite)
temporary font out of the characters used in the string and "show" a
reencoded string using that font. Too complicated though :)

Carlos.

David Newall

Jan 21, 2022, 8:28:09 PM

On 22/1/22 12:23 am, Carlos wrote:
> David Newall <dav...@davidnewall.com>:

>> I've written some PostScript to allow me to print UTF8-encoded
>> strings:
>
> This is great!

Thank you. It seemed a problem which needed to be solved. I hope I've
made a start that's good enough to criticize.

> Doesn't "x glyphshow y glyphshow" lose the kerning between x and y?
> (I'm not really sure)

PostScript doesn't automatically kern. There are operators you can use
to do that, but it is something you have to do.
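A minimal sketch of adjusting the spacing by hand between two glyphshow calls. The -1.5 offset is an arbitrary illustrative value, not a real kerning pair taken from any font's metrics.

```postscript
/Helvetica findfont 24 scalefont setfont
72 700 moveto
/A glyphshow
-1.5 0 rmoveto        % pull the next glyph left by 1.5 units of user space
/V glyphshow
```

Real kerning values would have to come from the font's metrics (an AFM file, for instance), which the interpreter does not consult on its own.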

David Newall

Jan 22, 2022, 9:32:04 PM

On 21/1/22 9:56 pm, David Newall wrote:
> I've written some PostScript to allow me to print UTF8-encoded strings

There was an error in unicodeshow. I wasn't attempting /uniXXXX for
codepoints that weren't in Adobe's table.

Apparently it's also not uncommon to use /uXXXX through /uXXXXXX (4 to 6
hex digits), so I check for those, too.

% integer unicodeshow - show glyph for unicode code point
/unicodeshow {
% load array of known glyph names for this code point, supplemented
% with /uXXXXXX (4 - 6 hex chars) and /uniXXXX (when codepoint fits
% in 4 hex chars)
[
unicode 2 index known {unicode 2 index get aload pop} if
% convert number to hex for /uXXXX.. and /uniXXXX
(0000000) 6 counttomark 1 add index % string index number
{
% stop once every digit has been converted
dup 0 eq { pop exit } if
% number must fit in 6 hex digits
1 index 0 eq {
pop pop pop /unicodeshow cvx /rangecheck
/.error where {pop .error} {signalerror} ifelse
} if
3 copy 16 mod dup 9 gt { 55 } { 48 } ifelse add put
16 idiv exch 1 sub exch
} loop
% require min 4 hex digits
dup 2 gt { -1 3 { 1 index exch 16#30 put } for 2 } if
% /uXXXX - /uXXXXXX
2 copy 7 1 index sub getinterval dup 0 16#75 put cvn 3 1 roll
% /uniXXXX
2 eq { dup 0 (uni) putinterval dup cvn exch } if
pop
] exch pop
%[(candidates)2 index]== pstack(---)==
dup currentfont chooseglyph not { /.notdef } if glyphshow
pop
} bind def
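A simpler sketch of just the name-building step, using cvrs for the hex conversion rather than a hand-written digit loop. The helper name ucandidates is hypothetical. It returns /uXXXX (4 or more digits) and, when the value fits in 4 hex digits, /uniXXXX as well; it does not consult Adobe's table and does not range-check its argument.

```postscript
% integer ucandidates [/uXXXX..] or [/uXXXX /uniXXXX]   (hypothetical helper)
/ucandidates {
  16 8 string cvrs              % hex digits, upper case, no padding
  dup length 4 lt {             % left-pad with zeros to 4 digits
    (0000) 4 string copy        % hex padded
    dup 4 3 index length sub    % hex padded padded start
    4 -1 roll putinterval       % padded
  } if
  [ exch                        % mark hex
  dup length 1 add string       % mark hex ubuf
  dup 0 (u) putinterval
  dup 1 3 index putinterval     % ubuf = (u) followed by hex
  cvn exch                      % mark /uXXXX hex
  dup length 4 eq {
    7 string dup 0 (uni) putinterval
    dup 3 4 -1 roll putinterval % unibuf = (uni) followed by hex
    cvn                         % mark /uXXXX /uniXXXX
  } { pop } ifelse
  ]
} def
```

For example, 16#E9 ucandidates should give [/u00E9 /uni00E9], and 16#1F600 ucandidates should give [/u1F600].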

David Newall

Jan 22, 2022, 10:10:23 PM

Hi All,

I'm soliciting opinions...

On 21/1/22 9:56 pm, David Newall wrote:

> I've written some PostScript to allow me to print UTF8-encoded strings

> ...

> I also use a table which Adobe published ("UNICODE translation table for
> non-ASCII characters"), which they say is for going from a glyph name to
> a Unicode codepoint. I (ab)use it in the reverse direction. I turned
> it into a dictionary keyed on the codepoint.

Many (most?) fonts have glyphs which aren't in Adobe's table, or which
are named differently. Fontforge can write a table of glyphs in a font
and their corresponding codepoints. Using that table, unicodeshow looks
more like this:

% lookup a unicode codepoint (int) in a list of known glyphs (dict)
% and display the glyph found.
% dict int unicodeshow -
/unicodeshow {
2 copy known { get } { pop pop /.notdef } ifelse glyphshow
} bind def

While this looks much neater, it requires pre-generating a dictionary
for each font used.
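A sketch of what such a pre-generated dictionary and a call might look like, assuming the two-operand unicodeshow just above. The dictionary name myglyphs and its three entries are made up for illustration; a real table would be generated from the font.

```postscript
% Hypothetical per-font table: code point -> glyph name
/myglyphs <<
  16#0041 /A
  16#00E9 /eacute
  16#20AC /Euro
>> def

/Helvetica findfont 18 scalefont setfont
72 700 moveto
myglyphs 16#00E9 unicodeshow    % paints /eacute
myglyphs 16#2603 unicodeshow    % not in the table, so /.notdef
```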

I can't decide which approach is better.

I'm not delighted by needing to add a dictionary that's specific to the
current font to utfshow and unicodeshow because it feels wrong.

I suppose whatever fonts are used to print unicode will be embedded in
the PS, so I could add the table to each font's dictionary. I wonder if
that would cause confusion to anybody reading the code:

/unicodeshow { % int unicodeshow -
currentfont /unicode 2 copy known not {
pop pop /unicodeshow cvx /invalidfont

/.error where {pop .error} {signalerror} ifelse
} if

get exch 2 copy known { get } { pop pop /.notdef } ifelse glyphshow
} bind def

Maybe that's not so awful.

Opinions? Is adding to a font dictionary going to break things?
(I'm looking at you, Acrobat and Distiller.)

Regards,

David

Carlos

Jan 23, 2022, 7:35:14 AM

On Sun, 23 Jan 2022 13:31:54 +1100
David Newall <dav...@davidnewall.com> wrote:

> On 21/1/22 9:56 pm, David Newall wrote:
> > I've written some PostScript to allow me to print UTF8-encoded
> > strings
>
> There was an error in unicodeshow. I wasn't attempting /uniXXXX for
> codepoints that weren't in Adobe's table.
>
> Apparently it's also not uncommon to use /uXXXX through /uXXXXXX (4
> to 6 hex digits), so I check for those, too.

Adobe's table (or one similar to it) is included in Ghostscript
(AdobeGlyphList), and maybe other interpreters, too.

Here's an old snippet that gets a glyph name (or uniXXXX) based on its
code:

/RevList AdobeGlyphList length dict dup begin
AdobeGlyphList { exch def } forall
end def
% code -- (uniXXXX)
/uniX { 16 6 string cvrs dup length 7 exch sub exch
(uni0000) 7 string copy dup 4 2 roll putinterval } def
% font code -- glyphname
/unitoname { dup RevList exch known
{ RevList exch get }
{ uniX cvn } ifelse
exch /CharStrings get 1 index known not
{ pop /.notdef } if
} def

(It doesn't contemplate several names per code... I thought it was a
1-1 relationship.)
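A usage sketch for unitoname, assuming the definitions above are loaded in an interpreter that provides AdobeGlyphList (Ghostscript, per this post). The font and code points are only illustrative.

```postscript
/Helvetica findfont 18 scalefont setfont
72 700 moveto
% paints eacute when both the glyph list and the font know that name
currentfont 16#00E9 unitoname glyphshow
% falls back to /.notdef when the font has no glyph under the chosen name
currentfont 16#2603 unitoname glyphshow
```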

If you know you are dealing with modern fonts that include the uni/u
aliases, you can get rid of the Adobe table lookup altogether... You
don't need the canonical glyph names for those fonts.

Carlos

Jan 23, 2022, 7:56:13 AM

On Sun, 23 Jan 2022 14:10:12 +1100
David Newall <dav...@davidnewall.com> wrote:

> Hi All,
>
> I'm soliciting opinions...
>
> On 21/1/22 9:56 pm, David Newall wrote:
> > I've written some PostScript to allow me to print UTF8-encoded
> > strings ...
> > I also use a table which Adobe published ("UNICODE translation
> > table for non-ASCII characters"), which they say is for going from
> > a glyph name to a Unicode codepoint. I (ab)use it in the reverse
> > direction. I turned it into a dictionary keyed on the codepoint.
> Many (most?) fonts have glyphs which aren't in Adobe's table, or which
> are named differently. Fontforge can write a table of glyphs in a
> font and their corresponding codepoints. Using that table,
> unicodeshow looks more like this:
>
> % lookup a unicode codepoint (int) in a list of known glyphs (dict)
> % and display the glyph found.
> % dict int unicodeshow -
> /unicodeshow {
> 2 copy known { get } { pop pop /.notdef } ifelse glyphshow
> } bind def
>
> While this looks much neater, it requires pre-generating a dictionary
> for each font used.
>
> I can't decide which approach is better.

I think if a font has a mapping between unicode points and glyphs that
you can extract (with Fontforge or whatever), then it surely also has
uni/u aliases. The Adobe table is for older fonts that don't have them,
so it's the only lookup table you need.

> I'm not delighted by needing to add a dictionary that's specific to
> the current font to utfshow and unicodeshow because it feels wrong.

Also, having to pre-process the files to insert the tables is not good.

[...]

> Opinions? Is adding to a font dictionary going to break things?
> (I'm looking at you, Acrobat and Distiller.)

Don't know about that, I only use Ghostscript. But if the reason to add
a lookup is speed, a possible optimization could be not to call
unicodeshow on each codepoint, but identify string intervals where all
bytes are either <= 127 or > 127. Call show on the former, and utfshow
on the latter.
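A minimal sketch of that splitting idea, assuming utfshow from earlier in the thread. The wrapper name mixedshow is hypothetical. It walks the string, cuts it into maximal runs of bytes that are all below 128 or all 128 and above, and hands each run to show or utfshow. This relies on every byte of a multi-byte UTF-8 sequence being 128 or above, so a run never splits a sequence, and it assumes the font's encoding agrees with ASCII for the low bytes.

```postscript
% string mixedshow -   (hypothetical wrapper)
/mixedshow {
  {                                   % str
    dup length 0 eq { pop exit } if
    dup 0 get 128 lt                  % str ascii?  (class of the first byte)
    0                                 % str ascii? n
    {                                 % grow n while bytes share that class
      dup 3 index length ge { exit } if
      2 index 1 index get 128 lt      % str ascii? n thisascii?
      2 index ne { exit } if          % class changed: the run ends here
      1 add
    } loop                            % str ascii? n
    3 copy exch pop                   % str ascii? n str n
    0 exch getinterval                % str ascii? n run
    3 -1 roll                         % str n run ascii?
    { show } { utfshow } ifelse       % str n
    1 index length 1 index sub        % str n restlength
    getinterval                       % rest
  } loop
} def
```

With this, (ab\303\251cd) mixedshow would call show on (ab), utfshow on the two bytes of the accented letter, and show on (cd).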

C.

luser droog

Jan 24, 2022, 11:33:14 AM

On Saturday, January 22, 2022 at 9:10:23 PM UTC-6, David Newall wrote:

> Opinions? Is adding to a font dictionary going to break things?
> (I'm looking at you, Acrobat and Distiller.)
>
> Regards,
>
> David

I don't see how that could be a problem unless the additions conflict
with existing names. It's possible that findfont will give you a dictionary
without write access. But you could copy everything into a new dictionary
and then call `definefont` on that and you should be good to go. (Take
care *not* to copy the /UniqueID key since definefont will want to
generate a new one.)
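A sketch of that copy-and-definefont approach. The helper name addunicode and the /unicode key are assumptions that follow David's proposal; the copy skips /UniqueID as suggested above and also /FID, which definefont supplies itself.

```postscript
% fontname newname unicodedict addunicode font   (hypothetical helper)
/addunicode {
  3 -1 roll findfont            % newname unicodedict font
  dup length 1 add dict begin   % new dictionary with room for one more key
    { 1 index /FID eq 2 index /UniqueID eq or
      { pop pop } { def } ifelse
    } forall                    % copy every entry except FID and UniqueID
    /unicode exch def           % newname
    currentdict
  end                           % newname newfontdict
  definefont
} def
```

Usage might look like /Helvetica /Helvetica-U mytable addunicode pop followed by /Helvetica-U 18 selectfont, where mytable is a code point to glyph name dictionary prepared elsewhere.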

luser droog

Jan 24, 2022, 11:37:59 AM

On Sunday, January 23, 2022 at 6:56:13 AM UTC-6, Carlos wrote:
> On Sun, 23 Jan 2022 14:10:12 +1100
> David Newall <dav...@davidnewall.com> wrote:

> [...]
> > Opinions? Is adding to a font dictionary going to break things?
> > (I'm looking at you, Acrobat and Distiller.)
> Don't know about that, I only use Ghostscript. But if the reason to add
> a lookup is speed, a possible optimization could be not to call
> unicodeshow on each codepoint, but identify string intervals where all
> bytes are either <= 127 or > 127. Call show on the former, and utfshow
> on the latter.
>
> C.

Or if speed is not a problem, you could implement a replacement for
kshow instead of show. Then the whole show family can easily be built
off of that.

David Newall

Jan 25, 2022, 10:59:15 PM

Hi Carlos,

Thanks for your very useful feedback.

I will say, up-front, that using Adobe Glyph List (glyphlist.txt found
at https://github.com/adobe-type-tools/agl-aglfn) is often sufficient,
depending on what unicode values need to be painted and what font is to
be used. But I want to do better than "often".

I'm using https://antofthy.gitlab.io/info/data/utf8-demo.txt to test my
code. Its coverage is ... extensive (and my current code seems to work
for all of it -- font permitting.)

On 23/1/22 11:35 pm, Carlos wrote:
> Adobe's table (or one similar to it) is included in Ghostscript
> (AdobeGlyphList), and maybe other interpreters, too.

I didn't know about AdobeGlyphList. The one in Ghostscript (9.50) has
multiple names for some unicode values. Conversely, Adobe Glyph List
(glyphlist.txt found at https://github.com/adobe-type-tools/agl-aglfn)
contains multiple values for some names.

No font is guaranteed to use any of these names and many fonts that I've
examined use different names for unicode values (and different values
for some names.)

> If you know you are dealing with modern fonts that include the uni/u
> aliases, you can get rid of the Adobe table lookup altogether... You
> don't need the canonical glyph names for those fonts.

No font that I've examined includes uni/u names for every glyph, or even
for most glyphs.

One can't rely on any pre-determined glyph name, nor any pre-determined
lookup table. What a mess.

On 23/1/22 11:56 pm, Carlos wrote:
> I think if a font has a mapping between unicode points and glyphs that
> you can extract (with Fontforge or whatever), then it surely also has
> uni/u aliases. The Adobe table is for older fonts that don't have them,
> so it's the only lookup table you need.

I wish that were true, but it's not.

After your comment about older fonts, I examined Courier, a Type 1 font
(https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
The CharStrings array breaks my
assumptions and my code completely fails.

>> I'm not delighted by needing to add a dictionary that's specific to
>> the current font to utfshow and unicodeshow because it feels wrong.
>
> Also, having to pre-process the files to insert the tables is not good.

I completely agree. I don't like it. I want to be able to use any font
without preprocessing, but I can't see how.

> a possible optimization could be not to call
> unicodeshow on each codepoint, but identify string intervals where all
> bytes are either <= 127 or > 127. Call show on the former, and utfshow
> on the latter.

Agreed. Ps2pdf slows down dramatically with a large number of glyphshows.
https://antofthy.gitlab.io/info/data/utf8-demo.txt, which is 50K, takes
4 minutes to process using utf8show and ps2pdf. The utf8-decode phase
takes 20ms and Ghostscript takes 510ms.

For anyone interested, the code is at https://davidnewall/software/utf8show.
It's still a work in progress.

David

David Newall

Jan 25, 2022, 11:06:46 PM

On 25/1/22 3:33 am, luser droog wrote:
> On Saturday, January 22, 2022 at 9:10:23 PM UTC-6, David Newall wrote:
>
>> Opinions? Is adding to a font dictionary going to break things?
>> (I'm looking at you, Acrobat and Distiller.)
>

> I don't see how that could be a problem unless the additions conflict
> with existing names. It's possible that findfont will give you a dictionary
> without write access. But you could copy everything into a new dictionary
> and then call `definefont` on that and you should be good to go.

Thanks. I can't see how it could, either, but I have little experience
with actual Adobe software, as I use Ghostscript for almost all of my
PostScript work.

I might have been unclear in "adding to a font dictionary". I'm not
contemplating /name findfont { modify } definefont, but fontforge font;
awk '...' font.g2n; vi font.t42.

Regards,

David

Carlos

Feb 10, 2022, 9:05:40 AM

On Wed, 26 Jan 2022 14:59:09 +1100
David Newall <dav...@davidnewall.com> wrote:
> No font is guaranteed to use any of these names and many fonts that
> I've examined use different names for unicode values (and different
> values for some names.)
>
> > If you know you are dealing with modern fonts that include the uni/u
> > aliases, you can get rid of the Adobe table lookup altogether... You
> > don't need the canonical glyph names for those fonts.
>
> No font that I've examined includes uni/u names for every glyph, or
> even for most glyphs.
>
> One can't rely on any pre-determined glyph name, nor any
> pre-determined lookup table. What a mess.

Well, that's disappointing...

> After your comment about older fonts, I examined Courier, a Type 1
> font
> (https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
> The CharStrings array breaks my assumptions and my code completely
> fails.

What assumptions?

C.

David Newall

Feb 15, 2022, 9:55:40 PM

to Carlos

On 11/2/22 01:05, Carlos wrote:
>> After your comment about older fonts, I examined Courier, a Type 1
>> font
>> (https://web.archive.org/web/20010617080950/http://www.ctan.org/tex-archive/fonts/psfonts/courier/).
>> The CharStrings array breaks my assumptions and my code completely
>> fails.
> What assumptions?

The issue wasn't type 1 fonts after all; that was just the thread I
pulled at. The issue is CharStrings. Not all fonts have one. In
particular, type 3 fonts don't. Type 3 fonts have a BuildGlyph or
BuildChar procedure which often uses a CharProcs dictionary, but that's
not guaranteed.

I am now taking the position that a font must have CharStrings or CharProcs
to be used with this body of code. In practice that's unlikely to be a
problem.
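A minimal sketch of that rule as a lookup helper. The name glyphdict is hypothetical; it returns whichever of CharStrings or CharProcs the font has, so a procedure like chooseglyph could search the result instead of assuming CharStrings.

```postscript
% fontdict glyphdict dict true     (CharStrings, else CharProcs)
%                    false         (font has neither)
/glyphdict {
  dup /CharStrings known
  { /CharStrings get true }
  { dup /CharProcs known
    { /CharProcs get true }
    { pop false }
    ifelse }
  ifelse
} def
```

A Type 3 font whose BuildGlyph or BuildChar does not use CharProcs would still report false here, which matches the restriction stated above.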