Hello All,
I've written some PostScript that lets me print UTF-8 encoded strings:
(UTF-8 Encoded String.....) utfshow
I'm happy to send you the full source, or, if appropriate, publish it
here; however, the exposition below includes everything you should need.
I use a UTF-8 decoder which was written (in C) by Bjoern Hoehrmann (see
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/):
%/ Copyright (c) 2008-2010 Bjoern Hoehrmann <bjo...@hoehrmann.de>
%/ See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
/UTF8_ACCEPT 0 def
/UTF8_REJECT 12 def
/utf8d [
%/ The first part of the table maps bytes to character classes
%/ to reduce the size of the transition table and create bitmasks.
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
8 8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
10 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 11 6 6 6 5 8 8 8 8 8 8 8 8 8 8 8
%/ The second part is a transition table that maps a combination
%/ of a state of the automaton and a character class to a state.
0 12 24 36 60 96 84 12 12 12 48 72 12 12 12 12 12 12 12 12 12 12 12 12
12 0 12 12 12 12 12 0 12 0 12 12 12 24 12 12 12 12 12 24 12 24 12 12
12 12 12 12 12 12 12 24 12 12 12 12 12 24 12 12 12 12 12 12 12 24 12 12
12 12 12 12 12 12 12 36 12 36 12 12 12 36 12 12 12 12 12 36 12 36 12 12
12 36 12 12 12 12 12 12 12 12 12 12
] def
% codep state byte decode codep' state'
/decode {
    utf8d 1 index get % type
    % codep state byte type
    2 index UTF8_ACCEPT ne % state not UTF8_ACCEPT?
    % continuation byte: codep' = (byte & 16#3F) | (codep << 6)
    { exch 16#3F and 4 -1 roll 6 bitshift or }
    % first byte of a sequence: codep' = (16#FF >> type) & byte
    { dup neg 16#FF exch bitshift 3 -1 roll and 4 -1 roll pop }
    ifelse % state type codep'
    % state' = utf8d[256 + state + type]
    3 1 roll add 256 add utf8d exch get % codep' state'
} def
%***************************************************************************/
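As a worked example of the two branches in decode, here is the arithmetic for the two-byte sequence <C3 A9> (U+00E9), written out by hand. This is only a sketch: the class value 2 for byte 16#C3 is read off the table above, and the names lead and codep are illustrative, not part of the decoder.

```postscript
% Lead byte 16#C3 has class 2 in the table and the state is UTF8_ACCEPT,
% so codep = (16#FF >> 2) & byte.
/lead 16#FF -2 bitshift 16#C3 and def        % 16#3F and 16#C3 gives 3

% Continuation byte 16#A9 arrives while the state is not UTF8_ACCEPT,
% so codep = (byte & 16#3F) | (codep << 6).
/codep 16#A9 16#3F and lead 6 bitshift or def

codep 16 (        ) cvrs =                   % prints E9
```

The state moves from 0 to 24 after the lead byte and back to 0 (UTF8_ACCEPT) after the continuation byte, which is the point at which utfshow would print the code point.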
I also use a table which Adobe published ("UNICODE translation table for
non-ASCII characters"). They say it is for going from a glyph name to a
Unicode codepoint; I (ab)use it in the reverse direction, by turning it
into a dictionary keyed on the codepoint.
The table is currently at
https://github.com/adobe-type-tools/agl-aglfn.
Some codepoints have multiple possible glyph names, so each dictionary
entry is an array of candidate glyph names for that codepoint. Finally,
fonts often have glyphs named /uniHHHH, where HHHH is the codepoint in
hexadecimal, so I append that name to the end of every array.
I converted the table to PS using awk:
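BEGIN { FS = "[; ]" }
/^#/ || NF < 2 { next }   # skip comment lines and blank lines
{
    # $1 is the glyph name; $2..$NF are its codepoints
    for (i = 2; i <= NF; i++) {
        if (!($i in h)) { h[$i] = ++n; v[n] = $i }   # keep first-seen order
        g[$i] = g[$i] "/" $1                         # add this glyph name
    }
}
END {
    print "/unicode <<"
    for (i = 1; i <= n; i++)
        print "\t16#" v[i] "[" g[v[i]] "/uni" toupper(v[i]) "]"
    print ">> def"
}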
Adobe's table is turned into this:
/unicode <<
16#0041[/A/uni0041]
16#00C6[/AE/uni00C6]
...
16#305A[/zuhiragana/uni305A]
16#30BA[/zukatakana/uni30BA]
>> def
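To show how a dictionary of this shape is consulted, here is a small sketch using a two-entry stand-in. The names demounicode and candidates are hypothetical and exist only for this illustration; the real dictionary is the generated /unicode.

```postscript
% A two-entry stand-in for the generated dictionary (hypothetical name).
/demounicode <<
    16#0041 [/A /uni0041]
    16#00C6 [/AE /uni00C6]
>> def

% Known code point: fetch the candidate names and print each one.
demounicode 16#00C6 get { = } forall         % prints AE, then uni00C6

% Unknown code point: fall back to an empty array.
/candidates
    demounicode 16#305A 2 copy known { get } { pop pop [] } ifelse
def
candidates length =                          % prints 0
```

Because the keys are integers, a decoded code point can be used for the lookup directly, with no conversion to a name or string.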
The crux of printing Unicode code points is to find which of the
possible glyphs the current font defines. I search currentfont's
CharStrings.
% look for one of the glyphs in fontdict's CharStrings
% [/glyph ...] fontdict chooseglyph /glyph true
%                                   false
/chooseglyph {
    /CharStrings get exch % the glyphs defined in fontdict
    false 3 1 roll % assume we don't find a glyph
    % false CharStrings [glyphs]
    { 2 copy known { true 4 2 roll exch pop exit } { pop } ifelse } forall
    pop % remove CharStrings
} def
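The two possible results of chooseglyph can be seen without a real font. The sketch below assumes chooseglyph is already defined as above; mockfont is a hypothetical stand-in for a font dictionary, since chooseglyph reads only the CharStrings entry and only the keys matter.

```postscript
% Stand-in for a font dictionary (hypothetical).
/mockfont << /CharStrings << /A 0 /uni00C6 0 >> >> def

% /AE is missing but /uni00C6 is present: leaves /uni00C6 true
[/AE /uni00C6] mockfont chooseglyph
{ = } { (no glyph) = } ifelse                % prints uni00C6

% Neither name is present: leaves only false
[/zuhiragana /uni305A] mockfont chooseglyph
{ = } { (no glyph) = } ifelse                % prints no glyph
```

The names are tried in array order, so the first candidate the font defines is the one returned.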
I've noticed that Symbol sometimes contains glyphs that other fonts
don't, so if I don't find a glyph in currentfont I look through Symbol.
I thought it might be a good idea to also try ZapfDingbats. In
retrospect, that might be a red herring.
Adobe also publish a table like the Unicode table, giving the names of
ZapfDingbats' glyphs. It's at the same place, and converts using the
same awk:
/zapf <<
16#275E[/a100/uni275E]
16#2761[/a101/uni2761]
...
16#275D[/a99/uni275D]
16#2720[/a9/uni2720]
>> def
This is the code which prints a single Unicode code point (or .notdef if
no glyph can be found):
% SPDX-License-Identifier: LGPL-2.1-or-later
%
% Copyright (c) 2022 by davidnewall.com. All rights reserved.
% print a single unicode codepoint:
% integer unicodeshow -
/unicodeshow {
    % load array of known glyph names for this code point
    unicode 1 index known
    { unicode exch get } % array of possible glyphs
    { pop [] } % unknown code point
    ifelse
    {
        dup currentfont chooseglyph { glyphshow exit } if
        dup /ZapfDingbats findfont chooseglyph {
            currentfont exch /ZapfDingbats currentfontsize selectfont
            glyphshow setfont exit } if
        dup /Symbol findfont chooseglyph {
            currentfont exch /Symbol currentfontsize selectfont
            glyphshow setfont exit } if
        /.notdef glyphshow exit
    } loop
    pop
} def
I get the current font size using this:
/currentfontsize {
    % size = scaled FontMatrix[3] / original FontMatrix[3]
    currentfont dup /OrigFont get
    2 { /FontMatrix get 3 get exch } repeat div
} bind def
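The arithmetic behind currentfontsize can be checked with plain dictionaries instead of a real font. In this sketch origfont, scaledfont and size are hypothetical names, and the matrices are what one would typically expect for a 1000-unit font scaled to 12 points; that typical case is an assumption, not something taken from a particular font.

```postscript
% Hypothetical stand-ins holding only the entries currentfontsize reads.
/origfont   << /FontMatrix [0.001 0 0 0.001 0 0] >> def
/scaledfont << /FontMatrix [0.012 0 0 0.012 0 0] /OrigFont origfont >> def

% Same steps as currentfontsize, with scaledfont in place of currentfont.
/size
    scaledfont dup /OrigFont get
    2 { /FontMatrix get 3 get exch } repeat div
def
size =                                       % close to 12
```

Note that the procedure relies on the OrigFont entry, so it presumably needs a current font that has been through scalefont, makefont or selectfont; a font dictionary that lacks OrigFont would raise an undefined error at the get.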
Finally (at last!), to print a UTF-8 string:
/utfshow {
    UTF8_ACCEPT 0 UTF8_ACCEPT % prev codep current
    4 -1 roll {
        decode
        dup UTF8_ACCEPT eq { 1 index unicodeshow } if
        dup UTF8_REJECT eq {
            (%% Bad UTF-8 sequence\n) print pop
            UTF8_ACCEPT /.notdef glyphshow } if
        3 -1 roll pop dup 3 1 roll % prev = current
    } forall
    pop pop pop
} def
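For anyone unsure why the decoder is driven from forall: applied to a string, forall pushes each byte as an integer, and those integers are what utfshow hands to decode one at a time. A minimal sketch, where the name bytes is illustrative only:

```postscript
% The UTF-8 encoding of U+00E9, written as a hex string.
/bytes [ <C3A9> { } forall ] def
bytes { = } forall                           % prints 195, then 169

% With everything from this post loaded and a font selected, the same
% string could be shown like this (left commented out here):
% /Helvetica 12 selectfont 72 720 moveto <C3A9> utfshow showpage
```

Writing test strings in hex keeps them independent of whatever encoding the PostScript file itself happens to be saved in.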
Regards,
David