
"Unprintable" 8-bit characters


Conrad J. Sabatier
Nov 8, 2011, 7:42:36 PM
Pardon me if this may seem like a stupid question, but this is
something that's been bugging me for a long time, and none of my
research has turned up anything useful yet.

I've been trying to understand what the deal is with regards to the
displaying of the "extended" 8-bit character set, i.e., 8-bit characters
with the MSB set.

More specifically, I'm trying to figure out how to get the "ls" command
to properly display filenames containing characters in this extended
set. I have some MP3 files, for instance, whose names contain certain
European characters, such as the lowercase "u" with umlaut (code 0xfc
in the Latin set, according to gucharmap), that I just can't get ls to
display properly. These characters seem to be considered by ls as
"unprintable", and the best I've been able to produce in the ls
output is backslash interpretations of the characters using either the
-B or -b options, otherwise the default "?" is displayed in their place.

The strange thing is that these characters will display just fine in
xterm, gnome-terminal, etc. I can copy and paste them from the
gucharmap utility into a shell command line or other application, and
they appear as they should, but ls simply refuses to display them. I
can print them using the printf command, even bash's builtin echo seems
to have no problem with them. Only ls appears to have this problem.
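
The behavior is easy to reproduce (a sketch; the single-byte filename is purely illustrative):

```shell
# Create a file whose name is the lone byte 0xfc (the Latin-1 u-umlaut),
# then compare how ls renders it in the C locale.
cd "$(mktemp -d)"
printf '' > "$(printf '\374')"   # 0xfc is octal 374
LC_ALL=C ls -q                   # prints "?"    (unprintable byte masked)
LC_ALL=C ls -b                   # prints "\374" (octal backslash escape)
```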

I've experimented with using various locales, using the LC_*
variables, as well as the LANG variable (as documented in the
environment section of the ls man page), all to no avail.

Is this an inherent limitation of ls, or is there some workaround or
other solution? Do we need a new en_*.UTF-16 locale? Should we
consider extending the ls command to handle these characters? Or is
there just something about all of this that I'm just not "getting"?

As an additional note, I notice that in the text console, this same
character code (0xfc) produces an entirely different character (a
lowercase n in a raised position, as for the exponent in a mathematical
expression). Is there, in fact, no standardization re: the
representation of these "high bit" characters?

Thanks to anyone who can help clear up this long-standing mystery for
me.

--
Conrad J. Sabatier
con...@cox.net
_______________________________________________
freebsd-...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to "freebsd-questi...@freebsd.org"

Robert Bonomi
Nov 8, 2011, 8:17:27 PM

On Tue, 8 Nov 2011 18:42:36 -0600, "Conrad J. Sabatier" wrote:
>
> I've been trying to understand what the deal is with regards to the
> displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> with the MSB set.

Quite simply, Unix dates from the days when the 8th bit was used as a 'parity'
bit, allowing detection of *all* single-bit errors -- especially over the
notoriously unreliable connections known as 'serial ports'.
>
> More specifically, I'm trying to figure out how to get the "ls" command
> to properly display filenames containing characters in this extended
> set. I have some MP3 files, for instance, whose names contain certain
> European characters, such as the lowercase "u" with umlaut (code 0xfc
> in the Latin set, according to gucharmap), that I just can't get ls to
> display properly. These characters seem to be considered by ls as
> "unprintable", and the best I've been able to produce in the ls
> output is backslash interpretations of the characters using either the
> -B or -b options, otherwise the default "?" is displayed in their place.
>
> The strange thing is that these characters will display just fine in
> xterm, gnome-terminal, etc. I can copy and paste them from the
> gucharmap utility into a shell command line or other application, and
> they appear as they should, but ls simply refuses to display them. I
> can print them using the printf command, even bash's builtin echo seems
> to have no problem with them. Only ls appears to have this problem.
>
> I've experimented with using various locales, using the LC_*
> variables, as well as the LANG variable (as documented in the
> environment section of the ls man page), all to no avail.

Obviously you never read as far as the '-w' switch. <grin>

> Is this an inherent limitation of ls,

It is -not- a limitation; rather it is a _desired_ behavior -- so that
one can _tell_ where there is an 'unprintable' character (like \r or \b)
in a filename. There are *good*reasons*(TM) why -q is the default behavior
for 'terminal' output.

> or is there some workaround or
> other solution? Do we need a new en_*.UTF-16 locale? Should we
> consider extending the ls command to handle these characters?

There _are_ "improved" versions of ls that do understand the 'locale'
environment variables -- but those programs introduce a whole bunch of
*other* 'not necessarily desired' behaviors -- like sorting upper-case and
lower-case letters as 'equals', rather than regarding any upper-case as
sorting before any lowercase.

> Or is
> there just something about all of this that I'm just not "getting"?
>
> As an additional note, I notice that in the text console, this same
> character code (0xfc) produces an entirely different character (a
> lowercase n in a raised position, as for the exponent in a mathematical
> expression). Is there, in fact, no standardization re: the
> representation of these "high bit" characters?

"The nice thing about standards is that there are so many to choose from"
applies. WITH A VENGEANCE!!

There are at least FIFTEEN different sets of glyphs for the 'high bit set'
byte codes *JUST* for the 'iso-8859' base charset. Plus 'utf-8'. And that's not
counting the various bastardizations (e.g. 'CP-1252', etc.) that Microsoft
has introduced.

> Thanks to anyone who can help clear up this long-standing mystery for
> me.

<R>eading <t>he <f>ine <m>anpage -- with particular attention to the '-q'
and '-w' options should provide some enlightenment.
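
A sketch of what the manpage pointer amounts to (note: the raw-output -w described here is FreeBSD's ls; GNU ls uses -w for terminal width):

```shell
# A file whose name is the UTF-8-encoded u-umlaut: the bytes 0xc3 0xbc.
cd "$(mktemp -d)"
touch "$(printf '\303\274')"
ls | od -An -tx1        # c3 bc 0a -- the raw bytes really are in the name
# On FreeBSD, in the C locale both bytes count as non-printing:
#   LC_ALL=C ls -q   ->  ??   (masked; the default for terminal output)
#   LC_ALL=C ls -w   ->  ü    (raw bytes; a UTF-8 terminal renders them)
```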

Michael Ross
Nov 8, 2011, 8:51:31 PM
On 09.11.2011, 01:42, Conrad J. Sabatier <con...@cox.net> wrote:

> Pardon me if this may seem like a stupid question, but this is
> something that's been bugging me for a long time, and none of my
> research has turned up anything useful yet.
>
> I've been trying to understand what the deal is with regards to the
> displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> with the MSB set.
>
> More specifically, I'm trying to figure out how to get the "ls" command
> to properly display filenames containing characters in this extended
> set. I have some MP3 files, for instance, whose names contain certain
> European characters, such as the lowercase "u" with umlaut (code 0xfc
> in the Latin set, according to gucharmap), that I just can't get ls to
> display properly. These characters seem to be considered by ls as
> "unprintable", and the best I've been able to produce in the ls
> output is backslash interpretations of the characters using either the
> -B or -b options, otherwise the default "?" is displayed in their place.

Unsure if I understand you correctly.
("extended" 8-bit character set with MSB? utf-16?)
I'm confused by this charset stuff in general.

Assuming you want \xfc displayed as "ü",

> cat test.py && python test.py && ls -l

#!/usr/local/bin/python
# -*- coding: utf-8 -*-

f=open('\xfc','w')
f.close()
total 2

-rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
-rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü
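
The test.py above is Python 2; under Python 3, where str filenames get encoded per the locale, an equivalent sketch passes the name as bytes to sidestep that entirely:

```python
# Create a file whose name is the single byte 0xfc (ISO-8859-1 u-umlaut),
# working in a fresh temporary directory.
import os
import tempfile

os.chdir(tempfile.mkdtemp())
os.close(os.open(b"\xfc", os.O_CREAT | os.O_WRONLY, 0o644))
print(os.listdir(b"."))   # [b'\xfc']
```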


here is what works for me:

in my login class in /etc/login.conf:

:charset=ISO-8859-1:\
:lang=de_DE.ISO8859-1:\

``cap_mkdb /etc/login.conf'' after changes


in /etc/rc.conf:

scrnmap="iso-8859-1_to_cp437"
font8x8="cp850-8x8"
font8x14="cp850-8x14"
font8x16="cp850-8x16"


and in /etc/ttys, console type is set to ``cons25l1''


Regards,

Michael

Conrad J. Sabatier
Nov 8, 2011, 8:58:04 PM
On Tue, 8 Nov 2011 19:17:27 -0600 (CST)
Robert Bonomi <bon...@mail.r-bonomi.com> wrote:

>
> On Tue, 8 Nov 2011 18:42:36 -0600, "Conrad J. Sabatier" wrote:
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit
> > characters with the MSB set.
>
> Quite simply, Unix dates from the days when the 8th bit was used as a
> 'parity' bit, allowing detection of *all* single-bit errors --
> especially over the notoriously unreliable connections known as
> 'serial ports'.

Ah, yes! The "good old days". :-)

> > More specifically, I'm trying to figure out how to get the "ls"
> > command to properly display filenames containing characters in this
> > extended set. I have some MP3 files, for instance, whose names
> > contain certain European characters, such as the lowercase "u" with
> > umlaut (code 0xfc in the Latin set, according to gucharmap), that I
> > just can't get ls to display properly. These characters seem to be
> > considered by ls as "unprintable", and the best I've been able to
> > produce in the ls output is backslash interpretations of the
> > characters using either the -B or -b options, otherwise the default
> > "?" is displayed in their place.
> >
> > The strange thing is that these characters will display just fine in
> > xterm, gnome-terminal, etc. I can copy and paste them from the
> > gucharmap utility into a shell command line or other application,
> > and they appear as they should, but ls simply refuses to display
> > them. I can print them using the printf command, even bash's
> > builtin echo seems to have no problem with them. Only ls appears
> > to have this problem.
> >
> > I've experimented with using various locales, using the LC_*
> > variables, as well as the LANG variable (as documented in the
> > environment section of the ls man page), all to no avail.
>
> Obviously you never read as far as the '-w' switch. <grin>

Yes, somehow that one went right past me. Haste makes waste! :-)

> > Is this an inherent limitation of ls,
>
> It is -not- a limitation; rather it is a _desired_ behavior -- so
> that one can _tell_ where there is an 'unprintable' character (like
> \r or \b) in a filename. There are *good*reasons*(TM) why -q is the
> default behavior for 'terminal' output.

OK, I can see that. :-)

> > or is there some workaround or
> > other solution? Do we need a new en_*.UTF-16 locale? Should we
> > consider extending the ls command to handle these characters?
>
> There _are_ "improved" versions of ls that do understand the 'locale'
> environment variables -- but those programs introduce a whole bunch of
> *other* 'not necessarily desired' behaviors -- like sorting
> upper-case and lower-case letters as 'equals', rather than regarding
> any upper-case as sorting before any lowercase.

Well, *that* certainly won't do! That should be the exception, not the
rule.

> > Or is
> > there just something about all of this that I'm just not "getting"?
> >
> > As an additional note, I notice that in the text console, this same
> > character code (0xfc) produces an entirely different character (a
> > lowercase n in a raised position, as for the exponent in a
> > mathematical expression). Is there, in fact, no standardization
> > re: the representation of these "high bit" characters?
>
> "The nice thing about standards is that there are so many to choose
> from" applies. WITH A VENGEANCE!!
>
> There are at least FIFTEEN different sets of glyphs for the 'high bit
> set' byte codes *JUST* for the 'iso-8859' base charset. Plus
> 'utf-8'. And that's not counting the various bastardizations (e.g. 'CP-1252',
> etc.) that Microsoft has introduced.
>
> > Thanks to anyone who can help clear up this long-standing mystery
> > for me.
>
> <R>eading <t>he <f>ine <m>anpage -- with particular attention to the
> '-q' and '-w' options should provide some enlightenment.

Thank you very much. Some of this matched the suspicions I already had
re: this matter.

Don't know how I completely missed the -w switch. Mea culpa. :-)

So, what would be the safest bet as far as the most "universal"
representation for these characters? Something I've long wondered
about when I've e-mailed people and copied/pasted these characters (are
they really seeing what I'm seeing?). :-)

--
Conrad J. Sabatier
con...@cox.net

Conrad J. Sabatier
Nov 8, 2011, 9:09:20 PM
On Tue, 8 Nov 2011 19:17:27 -0600 (CST)
Robert Bonomi <bon...@mail.r-bonomi.com> wrote:

>
> On Tue, 8 Nov 2011 18:42:36 -0600, "Conrad J. Sabatier" wrote:
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit
> > characters with the MSB set.
>
> Quite simply, Unix dates from the days when the 8th bit was used as a
> 'parity' bit, allowing detection of *all* single-bit errors --
> especially over the notoriously unreliable connections known as
> 'serial ports'.
> >
> > More specifically, I'm trying to figure out how to get the "ls"
> > command to properly display filenames containing characters in this
> > extended set. I have some MP3 files, for instance, whose names
> > contain certain European characters, such as the lowercase "u" with
> > umlaut (code 0xfc in the Latin set, according to gucharmap), that I
> > just can't get ls to display properly. These characters seem to be
> > considered by ls as "unprintable", and the best I've been able to
> > produce in the ls output is backslash interpretations of the
> > characters using either the -B or -b options, otherwise the default
> > "?" is displayed in their place.
> >
> > The strange thing is that these characters will display just fine in
> > xterm, gnome-terminal, etc. I can copy and paste them from the
> > gucharmap utility into a shell command line or other application,
> > and they appear as they should, but ls simply refuses to display
> > them. I can print them using the printf command, even bash's
> > builtin echo seems to have no problem with them. Only ls appears
> > to have this problem.
> >
> > I've experimented with using various locales, using the LC_*
> > variables, as well as the LANG variable (as documented in the
> > environment section of the ls man page), all to no avail.
>
> Obviously you never read as far as the '-w' switch. <grin>

Just a quickie followup:

Setting LC_ALL=en_US.UTF-8 and using "ls -w" was, in fact, the magic
key (at least, in any of the X terminal apps; still getting the little
"exponential n" in the console)!

Thank you so much. I'll sleep much better tonight. :-)

Polytropon
Nov 8, 2011, 9:10:24 PM
On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> On 09.11.2011, 01:42, Conrad J. Sabatier <con...@cox.net> wrote:
>
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit characters
> > with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls" command
> > to properly display filenames containing characters in this extended
> > set. I have some MP3 files, for instance, whose names contain certain
> > European characters, such as the lowercase "u" with umlaut (code 0xfc
> > in the Latin set, according to gucharmap), that I just can't get ls to
> > display properly. These characters seem to be considered by ls as
> > "unprintable", and the best I've been able to produce in the ls
> > output is backslash interpretations of the characters using either the
> > -B or -b options, otherwise the default "?" is displayed in their place.
>
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.
>
> Assuming you want \xfc displayed as "ü",
>
> > cat test.py && python test.py && ls -l
>
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
>
> f=open('\xfc','w')
> f.close()
> total 2
>
> -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
> -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü
>
>
> here is what works for me:
>
> in my login class in /etc/login.conf:
>
> :charset=ISO-8859-1:\
> :lang=de_DE.ISO8859-1:\
>
> ``cap_mkdb /etc/login.conf'' after changes

Ah, thanks - that seems to be the proper way to have
the environmental variables set - instead of my (ab)use
of setenv's in the csh config file. :-)

Note the "precedence" of $LANG vs. $LC_* (as they can
be used to configure things more precisely, e. g.
regarding system messages or date formats; see example
following).
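
The resolution order in question is: LC_ALL first, then the specific LC_* variable, then LANG. A small sketch of that rule (the helper function is mine, a simplification of what setlocale(3) does when handed an empty string):

```python
def effective_locale(category: str, env: dict) -> str:
    """Resolve a category like 'LC_TIME' per POSIX precedence:
    LC_ALL overrides LC_<category>, which overrides LANG; default is C."""
    return env.get("LC_ALL") or env.get(category) or env.get("LANG") or "C"

env = {"LANG": "de_DE.ISO8859-1", "LC_TIME": "en_US.ISO8859-1"}
print(effective_locale("LC_TIME", env))     # en_US.ISO8859-1 (LC_* beats LANG)
print(effective_locale("LC_NUMERIC", env))  # de_DE.ISO8859-1 (falls back to LANG)
env["LC_ALL"] = "de_DE.UTF-8"
print(effective_locale("LC_TIME", env))     # de_DE.UTF-8 (LC_ALL beats both)
```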



> in /etc/rc.conf:
>
> scrnmap="iso-8859-1_to_cp437"

Hm? CP437? Codepage? Isn't that some MS-DOS thing?
I've never needed a screenmap to make "extended
characters" (everything beyond US-ASCII) work.



> font8x8="cp850-8x8"
> font8x14="cp850-8x14"
> font8x16="cp850-8x16"
>
>
> and in /etc/ttys, console type is set to ``cons25l1''

I have a similar setting here, but that does _not_ work
with UTF-8 encoded characters. If I want to use them, I
have to change some environmental variables, from

#-------GERMAN/ENGLISH------------------------ <=== DEFAULT
setenv LC_ALL en_US.ISO8859-1
setenv LC_MESSAGES en_US.ISO8859-1
setenv LC_COLLATE de_DE.ISO8859-1
setenv LC_CTYPE de_DE.ISO8859-1
setenv LC_MONETARY de_DE.ISO8859-1
setenv LC_NUMERIC de_DE.ISO8859-1
setenv LC_TIME de_DE.ISO8859-1
unsetenv LANG

to

#-------INTERNATIONAL-------------------------
setenv LC_ALL en_US.UTF-8
setenv LC_MESSAGES en_US.UTF-8
setenv LC_COLLATE de_DE.UTF-8
setenv LC_CTYPE de_DE.UTF-8
setenv LC_MONETARY de_DE.UTF-8
setenv LC_NUMERIC de_DE.UTF-8
setenv LC_TIME de_DE.UTF-8
setenv LANG de_DE.UTF-8

Then I can use UTF-8 characters inside rxvt-unicode. Of
course, text mode console is limited to the first set
of configuration, using the ISO 8859-1 character set.

This worked long before UTF-8 arrived with the glorious
idea that I should have 2 bytes where one is sufficient
to describe our (German) 6 umlauts and the Eszett ligature. :-)

Improper settings will result in [][] or A-tilde three
quarters upside-down question mark, depending on editor
or terminal used.


But returning to the original question, I think Robert
did explain it very well: There is no real consensus
about what the different codings should mean. They
were meant to unify the representation of a very large
set of characters, but basically there are many interpretations
now, and how they show up to the user depends
on the font in use, _if_ it has this mapping or that,
or none.

For running ls, -w is the right option to use - but IN
COMBINATION with correct settings for the terminal
emulation AND the presence of a font that will do.

Again a fine demonstration why file names should be
limited to printable ASCII and no spaces if you want
them to work everywhere. :-)



--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...

Polytropon
Nov 8, 2011, 9:19:55 PM
On Tue, 8 Nov 2011 19:58:04 -0600, Conrad J. Sabatier wrote:
> So, what would be the safest bet as far as the most "universal"
> representation for these characters? Something I've long wondered
> about when I've e-mailed people and copied/pasted these characters (are
> they really seeing what I'm seeing?). :-)

With lots of experience in "how not to do it", I would
like to suggest the following: Use US-ASCII letters only.
This makes _sure_ they will display correctly everywhere
and even on ultra-worst conditions (e. g. you are at a
real serial console, a real DEC vt100).

Filenames like kloesze_mit_muesli_foerdern_baerenhunger.mp3
can be processed by _any_ ls or mailer program. There is
no need to worry about... hmmm... do they have the same
character settings that I use? Do they have a font installed
that can show the file names properly?

Rules: Substitute umlauts properly (*e). Substitute ß
to sz ("teletype convention"). Remove accents or other
marks completely, as well as "strokes through characters"
or similar typographical specialities. If you can, use
lowercase only. No spaces, use _ instead. Avoid any other
special characters. Make everything plain ASCII, and you
can _still_ easily get the meaning.
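
Those rules can be sketched as a small sanitizer (the function name and mapping table are mine, not from the thread; dropping remaining non-ASCII characters outright is a simplification of "remove accents completely"):

```python
# Transliterate a filename to plain ASCII per the rules above:
# umlauts -> *e, ß -> sz, spaces -> _, lowercase, drop the rest.
def asciify(name: str) -> str:
    table = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
             "ß": "sz", " ": "_"}
    out = "".join(table.get(ch, ch) for ch in name)
    # Keep only printable ASCII, then lowercase everything.
    out = "".join(ch for ch in out if ch.isascii() and ch.isprintable())
    return out.lower()

print(asciify("Klöße mit Müsli fördern Bärenhunger.mp3"))
# kloesze_mit_muesli_foerdern_baerenhunger.mp3
```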

The file system ITSELF doesn't care for the meaning of
the characters. SAVING them and DISPLAYING them are two
fully different things. Nobody stops you from making
filenames like öÜÖß߀Łµ³¼`łøæſđ̣ĸ»¢.mp3, but they can
cause trouble you can't predict. You _never_ know...

Conrad J. Sabatier
Nov 8, 2011, 9:24:18 PM
On Wed, 09 Nov 2011 02:51:31 +0100
"Michael Ross" <g...@ross.cx> wrote:

> On 09.11.2011, 01:42, Conrad J. Sabatier
> <con...@cox.net> wrote:
>
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit
> > characters with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls"
> > command to properly display filenames containing characters in this
> > extended set. I have some MP3 files, for instance, whose names
> > contain certain European characters, such as the lowercase "u" with
> > umlaut (code 0xfc in the Latin set, according to gucharmap), that I
> > just can't get ls to display properly. These characters seem to be
> > considered by ls as "unprintable", and the best I've been able to
> > produce in the ls output is backslash interpretations of the
> > characters using either the -B or -b options, otherwise the default
> > "?" is displayed in their place.
>
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.

That is to say, "8-bit characters with the most significant bit set",
or "characters greater than 0x7f".

I can certainly appreciate your confusion; this is definitely a
confusing area. In gucharmap, selecting the umlauted "u" in the Latin
set, the "Character Details" tab reveals the following:

U+00FC LATIN SMALL LETTER U WITH DIAERESIS

General Character Properties

In Unicode since: 1.1
Unicode category: Letter, Lowercase
Canonical decomposition: U+0075 LATIN SMALL LETTER U + U+0308 COMBINING
DIAERESIS

Various Useful Representations

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

C octal escaped UTF-8: \303\274
XML decimal entity: &#252;

So apparently, it's a "wide" character in UTF-8, which really throws a
monkey wrench into the works in certain situations (for example, one of
the little scripts I've written to process MP3 files uses the "cut"
command, which complains about an "illegal byte sequence").

Even more confusing, selecting the character and copying it to the
clipboard, the UTF-16 representation (0xfc) is what actually gets
used. Pasting this single-byte version into an X terminal (any of
them: xterm, gnome-terminal, etc.) does display the correct character,
an umlauted "u", even if using an 8-bit locale, such as UTF-8. Majorly
confusing!
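
What is actually going on here: 0xFC is the Unicode code point (which coincides with the single ISO-8859-1 byte), while UTF-8 spends two bytes on the same character. The representations gucharmap lists can be checked directly (a sketch, not from the thread):

```python
# One abstract character, U+00FC, under several encodings -- only the
# byte representations differ, not the character itself.
u = "\u00fc"                      # ü, LATIN SMALL LETTER U WITH DIAERESIS
print(u.encode("utf-8"))          # b'\xc3\xbc'  -- the two-byte UTF-8 form
print(u.encode("utf-16-be"))      # b'\x00\xfc'  -- the 16-bit code unit 0x00FC
print(u.encode("latin-1"))        # b'\xfc'      -- the single ISO-8859-1 byte
print(f"&#{ord(u)};")             # &#252;       -- the XML decimal entity
```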

> Assuming you want \xfc displayed as "ü",

Yes, exactly.

> > cat test.py && python test.py && ls -l
>
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
>
> f=open('\xfc','w')
> f.close()
> total 2
>
> -rw-r--r-- 1 michael wheel 29 9 Nov 02:43 test.py
> -rw-r--r-- 1 michael wheel 0 9 Nov 02:44 ü
>
>
> here is what works for me:
>
> in my login class in /etc/login.conf:
>
> :charset=ISO-8859-1:\
> :lang=de_DE.ISO8859-1:\
>
> ``cap_mkdb /etc/login.conf'' after changes
>
>
> in /etc/rc.conf:
>
> scrnmap="iso-8859-1_to_cp437"
> font8x8="cp850-8x8"
> font8x14="cp850-8x14"
> font8x16="cp850-8x16"
>
>
> and in /etc/ttys, console type is set to ``cons25l1''

Thanks, I hadn't considered making those sorts of changes for the
console. I work so seldom nowadays in the console, I'd forgotten all
about that stuff (use it or lose it, as they say!). I'll certainly give
that a try.

Much appreciation for both yours and Robert's replies.

--
Conrad J. Sabatier
con...@cox.net

Daniel Staal
Nov 8, 2011, 8:27:16 PM
--As of November 8, 2011 7:58:04 PM -0600, Conrad J. Sabatier is alleged to
have said:

> So, what would be the safest bet as far as the most "universal"
> representation for these characters? Something I've long wondered
> about when I've e-mailed people and copied/pasted these characters (are
> they really seeing what I'm seeing?). :-)

--As for the rest, it is mine.

These days, the safest bet is UTF-8, or some other Unicode character set,
in something that can convey what character set it is in. (Email can,
depending on the mail client.)

Not that Unicode is universal yet, but it is designed to be (and is,
generally) a solution to the 'multiple character encodings' problem. (By,
of course, defining a new encoding.) It has a decent amount of traction,
and in a decade or so - once other options have been firmly deprecated -
I'd expect we could start discussing whether to switch ls to using it by
default. ;)

All this is of course if you *must* go beyond 7-bit ASCII. (Which all
forms of Unicode are designed to be a strict superset of.)
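
The "strict superset" property is easy to check: every 7-bit ASCII byte means the same thing in UTF-8, so a pure-ASCII filename is already valid UTF-8 (a sketch):

```python
# All 128 ASCII code points decode to the same characters under UTF-8,
# so ASCII-only text needs no conversion at all.
ascii_bytes = bytes(range(128))
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print("plain_ascii_name.mp3".encode("utf-8"))   # b'plain_ascii_name.mp3'
```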

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------

Conrad J. Sabatier
Nov 8, 2011, 9:59:48 PM
On Wed, 9 Nov 2011 03:10:24 +0100
Polytropon <fre...@edvax.de> wrote:

> On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> > On 09.11.2011, 01:42, Conrad J. Sabatier
> > <con...@cox.net> wrote:

[snip]

> > > I've been trying to understand what the deal is with regards to
> > > the displaying of the "extended" 8-bit character set, i.e., 8-bit
> > > characters with the MSB set.

[snip]

> > Unsure if I understand you correctly.
> > ("extended" 8-bit character set with MSB? utf-16?)
> > I'm confused by this charset stuff in general.
> >
> > Assuming you want \xfc displayed as "ü",

[snip]

> > here is what works for me:
> >
> > in my login class in /etc/login.conf:
> >
> > :charset=ISO-8859-1:\
> > :lang=de_DE.ISO8859-1:\
> >
> > ``cap_mkdb /etc/login.conf'' after changes
>
> Ah, thanks - that seems to be the proper way to have
> the environmental variables set - instead of my (ab)use
> of setenv's in the csh config file. :-)

Same here. I've been "guilty" as well of neglecting to properly adjust
my console configuration.

> Note the "precedence" of $LANG vs. $LC_* (as they can
> be used to configure things more precisely, e. g.
> regarding system messages or date formats; see example
> following).
>
>
>
> > in /etc/rc.conf:
> >
> > scrnmap="iso-8859-1_to_cp437"
>
> Hm? CP437? Codepage? Isn't that some MS-DOS thing?
> I've never needed a screenmap to make "extended
> characters" (everything beyond US-ASCII) work.
>
>
>
> > font8x8="cp850-8x8"
> > font8x14="cp850-8x14"
> > font8x16="cp850-8x16"
> >
> >
> > and in /etc/ttys, console type is set to ``cons25l1''
>
> I have a similar setting here, but that does _not_ work
> with UTF-8 encoded characters. If I want to use them, I
> have to change some environmental variables, from
>
> #-------GERMAN/ENGLISH------------------------ <=== DEFAULT
> setenv LC_ALL en_US.ISO8859-1
> setenv LC_MESSAGES en_US.ISO8859-1
> setenv LC_COLLATE de_DE.ISO8859-1
> setenv LC_CTYPE de_DE.ISO8859-1
> setenv LC_MONETARY de_DE.ISO8859-1
> setenv LC_NUMERIC de_DE.ISO8859-1
> setenv LC_TIME de_DE.ISO8859-1
> unsetenv LANG
>
> to
>
> #-------INTERNATIONAL-------------------------
> setenv LC_ALL en_US.UTF-8
> setenv LC_MESSAGES en_US.UTF-8
> setenv LC_COLLATE de_DE.UTF-8
> setenv LC_CTYPE de_DE.UTF-8
> setenv LC_MONETARY de_DE.UTF-8
> setenv LC_NUMERIC de_DE.UTF-8
> setenv LC_TIME de_DE.UTF-8
> setenv LANG de_DE.UTF-8

Doesn't using "LC_ALL" obviate the need to set any of the other LC_*
variables? At least, that's always been my understanding of it.

But, getting back to something you said earlier, what did you mean
exactly about the precedence of LANG vs. LC_*?

> Then I can use UTF-8 characters inside rxvt-unicode. Of
> course, text mode console is limited to the first set
> of configuration, using the ISO 8859-1 character set.
>
> This worked long before UTF-8 arrived with the glorious
> idea that I should have 2 bytes where one is sufficient,
> to describe our (german) 6 umlauts and the Eszett ligature. :-)

<grin>

Yes, and this is one area where the labels are more than a little
misleading as well. My natural inclination is to think of UTF-8 as being a
single-byte representation for each character in the set, whereas
UTF-16, as the name implies, would be the "wide", 2-byte version.
Nonetheless, as I posted earlier in this thread, according to the info
in gucharmap, the representations of the umlauted "u" are just the
opposite of this:

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

Go figure, huh? :-)

> Improper settings will result in [][] or A-tilde three
> quarters upside-down question mark, depending on editor
> or terminal used.

Yes, I will definitely have to try using the recommendations that have
come up in this thread re: the console.

> But returning to the original question, I think Robert
> did explain it very well: There is no real consensus
> about what the different codings should mean. They
> were meant to unify the representation of a very large
> set of characters, but basically there are many inter-
> pretations now, and how they show up to the user depends
> on the font in use, _if_ it has this mapping or that,
> or none.

This seems rather unfortunate to me. You would think that, by now,
some "standard" character set might have emerged that would allow one
to use, at the very least, the "Western" characters (as opposed to
the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
expectation that others will see what was intended.

> For running ls, -w is the right option to use - but IN
> COMBINATION with correct settings for the terminal
> emulation AND the presence of a font that will do.

Yes. I'm still a little embarrassed for having completely overlooked
that option earlier. Hasty (impatient) reading of man pages. :-)

> Again a fine demonstration why file names should be
> limited to printable ASCII and no spaces if you want
> them to work everywhere. :-)

Well, for myself, personally, I'm a bit of a stickler for "language
authenticity", you might call it. Having studied both German and
French rather extensively in my younger days, I'm quite fond of both
languages, and rather keen on seeing them represented accurately (I
especially wince at the use of the plain, unaccented vowel followed by
an "e" in place of the umlaut, and to a lesser degree, the use of "ss"
in place of Eszett), which has caused me no small amount of confusion,
aggravation and frustration over the years, to be sure! :-)

--
Conrad J. Sabatier
con...@cox.net

Conrad J. Sabatier
Nov 8, 2011, 10:29:09 PM
On Tue, 08 Nov 2011 21:27:16 -0400
Daniel Staal <DSt...@usa.net> wrote:

> --As of November 8, 2011 7:58:04 PM -0600, Conrad J. Sabatier is
> alleged to have said:
>
> > So, what would be the safest bet as far as the most "universal"
> > representation for these characters? Something I've long wondered
> > about when I've e-mailed people and copied/pasted these characters
> > (are they really seeing what I'm seeing?). :-)
>
> --As for the rest, it is mine.
>
> These days, the safest bet is UTF-8, or some other Unicode character
> set, in something that can convey what character set it is in.
> (Email can, depending on the mail client.)
>
> Not that Unicode is universal yet, but it is designed to be (and is,
> generally) a solution to the 'multiple character encodings' problem.
> (By, of course, defining a new encoding.) It has a decent amount of
> traction, and in a decade or so - once other options have been firmly
> deprecated - I'd expect we could start discussing whether to switch
> ls to using it by default. ;)
>
> All this is of course if you *must* go beyond 7-bit ASCII. (Which
> all forms of Unicode are designed to be a strict superset of.)

That sounds sane and sensible. :-)

I've adjusted my environment to include:

export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8

And also adjusted my console configuration to display these characters:

font8x14="iso-8x14"
font8x16="iso-8x16"
font8x8="iso-8x8"

And, last but not least, aliased "ls" to ensure these characters will
actually be displayed:

alias ls='ls -Fw'

Looking good here now:

conrads:~$ cd "Music/Progressive Rock/Yes/The Yes Album"
conrads:~/Music/Progressive Rock/Yes/The Yes Album$ ls *03*
Yes - The Yes Album - 03 - Starship Trooper: a. Life Seeker - b.
Disillusion - c. Würm.mp3

Many thanks to everyone for all the very helpful, useful information.

Conrad J. Sabatier

Nov 8, 2011, 10:35:20 PM
On Tue, 8 Nov 2011 20:24:18 -0600
"Conrad J. Sabatier" <con...@cox.net> wrote:

> Even more confusing, selecting the character and copying it to the
> clipboard, the UTF-16 representation (0xfc) is what actually gets
> used. Pasting this single-byte version into an X terminal (any of
> them: xterm, gnome-terminal, etc.) does display the correct character,
> an umlauted "u", even if using an 8-bit locale, such as UTF-8.
> Majorly confusing!

Just realized on reading this how weird it sounds. What I was getting
at here was that the (single-byte) UTF-16 code displays the correct
character in a UTF-8 locale, even though the UTF-8 code for the
character is supposedly a 2-byte sequence.
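The puzzle has a tidy explanation: 0x00FC is the Unicode *code point* of the umlauted "u" (which happens to coincide with its single Latin-1 byte), while UTF-8 is an *encoding* of that code point as two bytes. X selections transfer characters, and the pasting terminal converts them to whatever bytes its locale demands. A quick Python sketch (an editorial illustration, not part of the original thread):

```python
# The "single-byte" value 0xFC is the Unicode code point U+00FC,
# which coincides with the Latin-1 (ISO 8859-1) byte for the character.
u_umlaut = "\u00fc"
assert ord(u_umlaut) == 0xFC

# The UTF-8 *encoding* of that code point is a two-byte sequence --
# this is what a UTF-8 terminal actually receives on paste.
assert u_umlaut.encode("utf-8") == b"\xc3\xbc"

# And the Latin-1 encoding really is the single byte 0xFC:
assert u_umlaut.encode("latin-1") == b"\xfc"
```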

Anyway, enough about that. I've managed to get the results I was
hoping for now, so I'm satisfied. :-)

Thanks again for all the responses.

Robert Bonomi

Nov 9, 2011, 12:04:25 AM

"Conrad J. Sabatier" <con...@cox.net> wrote:
>
> <grin>
>
> Yes, and this is one area where the labels are more than a little
> misleading as well. My natural inclination is to think of UTF-8 as being a
> single-byte representation for each character in the set, whereas
> UTF-16, as the name implies, would be the "wide", 2-byte version.

"Not exactly."

> Nonetheless, as I posted earlier in this thread, according to the info
> in gucharmap, the representations of the umlauted "u" are just the
> opposite of this:

"not exactly." Again.

> UTF-8: 0xC3 0xBC
> UTF-16: 0x00FC
>
> Go figure, huh? :-)

In UTF-16, every code unit _is_ a 16-bit entity (characters outside the
Basic Multilingual Plane take two such units, a "surrogate pair"). Notice
that 0x00FC has -four- nybbles after the '0x'. Every character boundary
falls on a multiple of 16 bits.

In UTF-8, the 'base' charset -- the 7-bit US-ASCII range -- is represented
by a single byte. 'Extended' characters are represented by two to four
bytes. Thus, 'characters' have a *variable*length* representation. A
character, whether it is represented by one byte or several, can begin on
-any- byte boundary within a data stream, depending on 'what came before
it'. UTF-8 multi-byte representations are designed such that one can jump
to any _byte_ offset within the file, and determine -- by looking *only*
at the value of that byte -- whether it is (a) a single-byte character,
(b) the leading byte of a multi-byte sequence, or (c) a continuation byte
within such a sequence.
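That self-synchronizing property is easy to demonstrate. A short Python sketch (an editorial illustration, not part of the original thread; "Würm" is the filename from the earlier ls example):

```python
def utf8_byte_kind(b: int) -> str:
    """Classify one byte of a UTF-8 stream by its value alone."""
    if b < 0x80:
        return "single-byte (ASCII) character"
    if b < 0xC0:                       # 0b10xxxxxx
        return "continuation byte"
    return "leading byte of a multi-byte sequence"

data = "Würm".encode("utf-8")          # b'W\xc3\xbcrm'
kinds = [utf8_byte_kind(b) for b in data]
# 'W'  -> single-byte (ASCII) character
# 0xC3 -> leading byte of a multi-byte sequence
# 0xBC -> continuation byte
# 'r', 'm' -> single-byte (ASCII) character
```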

With UTF-16 (for text confined to the Basic Multilingual Plane) you can
position directly to any -character-, by jumping to a _byte_ offset that
is twice the index of the character you want. Given a byte offset, you
always know the 'equivalent' _character_ offset.

With UTF-8, you have to read the character stream, counting 'characters'
as you go, to get to the desired point. You can seek to an arbitrary
_byte_ offset, but you do not know how many 'characters' into the file
that offset is.

UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8) and
simplicity of addressing/representation (UTF-16).
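That trade-off can be made concrete (again an editorial Python sketch; the 2*i indexing for UTF-16 holds only for text confined to the Basic Multilingual Plane):

```python
s = "Würm"

# UTF-16 (BMP text): character i sits at byte offset 2*i.
utf16 = s.encode("utf-16-le")
i = 1                                    # the 'ü'
assert utf16[2 * i : 2 * i + 2].decode("utf-16-le") == "ü"

# UTF-8: finding character i requires a scan, counting character
# starts and skipping continuation bytes (those matching 0b10xxxxxx).
def utf8_char_offset(data: bytes, char_index: int) -> int:
    count = 0
    for off, b in enumerate(data):
        if b & 0xC0 != 0x80:             # start of a new character
            if count == char_index:
                return off
            count += 1
    raise IndexError(char_index)

assert utf8_char_offset(s.encode("utf-8"), 2) == 3  # 'r' follows the 2-byte 'ü'
```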

> This seems rather unfortunate to me. You would think that, by now,
> some "standard" character set might have emerged that would allow one
> to use, at the very least, the "Western" characters (as opposed to
> the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
> expectation that others will see what was intended.

Heh.

How many 'character' codes are you willing to devote to national 'currency
symbols', just for starters? Probable minimum of two per currency -- one
for the minimum coinage unit (cent, pence, pfennig, etc.) and one for
the denomination unit (dollar, pound, mark, kroner, etc.)

Now, one (obviously) has to have the basic 'Roman' alphabet.

Then there are all the diacritical markings (accent, accent grave, dot,
umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels. And cedilla,
tilde, etc., for select consonants. Plus language-specific symbols like
ess-zett, 'thorn', etc.

How about phonetic symbols, like 'schwa' ?

And Greek for all sorts of scientific use?

What about Cyrillic characters, for many Eastern European languages?

Now, consider punctuation marks:
the 'typewriter' basics,
How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen' are needed?
How many of 'accent, accent grave, apostrophe, opening/closing single-quote'
are needed?
opening/closing double-quotes, and/or a 'position neutral' double-quote?

"Other symbols", like --
digits,
common fractions,
'Trademark','Registered trademark','copyright'
'paragraph','section',
superscripts -- exponents, footnotes, etc.
subscripts -- chemical formulae, etc.
"Simple line-drawing graphics"

Diphthongs?? Ligatures??

Start counting things up.

An 8-bit 'address space' gets used up _really_ quickly.

<wry grin>

Polytropon

Nov 9, 2011, 12:25:44 PM
On Tue, 8 Nov 2011 20:59:48 -0600, Conrad J. Sabatier wrote:
> Same here. I've been "guilty" as well of neglecting to properly adjust
> my console configuration.

Sometimes "it just works" combined with laziness beats
all proper concepts of doing things. :-)



> Doesn't using "LC_ALL" obviate the need to set any of the other LC_*
> variables? At least, that's always been my understanding of it.

I have to admit that I haven't fully understood everything
in that area, but the individual $LC_* variables each control
a "subset" of the locale (messages, collation, character
classification, ...), while $LC_ALL overrides them all.
Language and character set can thus be assigned independently
(e. g. English program messages, but German file names
properly displayed).



> But, getting back to something you said earlier, what did you mean
> exactly about the precedence of LANG vs. LC_*?

If I remember correctly, $LANG serves only as the fallback:
a set $LC_* variable takes precedence over $LANG for its
category, and $LC_ALL takes precedence over both.

http://www.freebsd.org/doc/handbook/using-localization.html
See 24.3.4.1.1.1 and 24.3.4.1.2.


> Yes, and this is one area where the labels are more than a little
> misleading as well. My natural inclination is to think of UTF-8 as being a
> single-byte representation for each character in the set, whereas
> UTF-16, as the name implies, would be the "wide", 2-byte version.
> Nonetheless, as I posted earlier in this thread, according to the info
> in gucharmap, the representations of the umlauted "u" are just the
> opposite of this:
>
> UTF-8: 0xC3 0xBC
> UTF-16: 0x00FC
>
> Go figure, huh? :-)

I think Robert explained it very well: While UTF-16 is a
fixed-width (2-byte) representation for most characters, UTF-8
is variable width (one to four bytes per character).
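Two widths are worth a quick check (an editorial Python sketch): UTF-8 sequences run up to four bytes, and UTF-16 is only truly fixed-width within the Basic Multilingual Plane; anything beyond it takes a surrogate pair.

```python
bmp = "\u00fc"          # 'ü', inside the Basic Multilingual Plane
astral = "\U0001F600"   # an emoji, outside the BMP

assert len(bmp.encode("utf-16-le")) == 2     # one 16-bit code unit
assert len(astral.encode("utf-16-le")) == 4  # surrogate pair: two units
assert len(astral.encode("utf-8")) == 4      # four bytes in UTF-8
```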



> > But returning to the original question, I think Robert
> > did explain it very well: There is no real consensus
> > about what the different codings should mean. They
> > were meant to unify the representation of a very large
> > set of characters, but basically there are many inter-
> > pretations now, and how they show up to the user depends
> > on the font in use, _if_ it has this mapping or that,
> > or none.
>
> This seems rather unfortunate to me. You would think that, by now,
> some "standard" character set might have emerged that would allow one
> to use, at the very least, the "Western" characters (as opposed to
> the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
> expectation that others will see what was intended.

Assumptions, wishes, conclusions and hopes do differ from
reality. :-)

For example, in October I had to assist with a
document containing German text and Chinese characters.
Decision: We use UTF-8 so the Chinese characters can appear
in the input. A name: Weng Tonghe [][][]. The brackets
should symbolize the three characters for that name.
They did show up properly in the editor, but on the
printed page... Weng Tonghe [][]. What? Two? But there
were three on input! As we found out, the "he" used
in input was the wrong one (there are several "he"s),
and the font used to render the text did not have that
particular "he". When we found the correct one, finally
three characters appeared, as intended and correct.

This should show: You _never_ know where things are
wrong when something is missing - settings, fonts,
who knows. In relation to file names, this is not a
problem of the file system as it will store any name
you want, but if you can actually SEE or USE that
file name - that's a completely different thing.



> > Again a fine demonstration why file names should be
> > limited to printable ASCII and no spaces if you want
> > them to work everywhere. :-)
>
> Well, for myself, personally, I'm a bit of a stickler for "language
> authenticity", you might call it. Having studied both German and
> French rather extensively in my younger days, I'm quite fond of both
> languages, and rather keen on seeing them represented accurately (I
> especially wince at the use of the plain, unaccented vowel followed by
> an "e" in place of the umlaut, and to a lesser degree, the use of "ss"
> in place of Esszett), which has caused me no small amount of confusion,
> aggravation and frustration over the years, to be sure! :-)

Make sure to call it "Eszett" ("Es" = S and "Zett" = Z).
The teletyping convention suggests dissolving "ß" into "sz",
because recombining "sz" into "ß" is likely to be correct,
whereas recombining "ss" into "ß" is often wrong, as there
are too many legitimate "ss" sequences in ordinary text.

Example:
Mißwirtschaft -> Miszwirtschaft -> Mißwirtschaft ===> good.
Messer -> Meßer ===> wrong.

In names (e. g. of towns): Staßfurt (right) != Stassfurt (wrong).

Note that "sz" <-> "ß" does not hold in all cases, nor does
"ss" <-> "ß": the rule states that only a non-truncatable "ss"
is to be set as Eszett. There are only a few "real" occurrences
of "sz", typically at word boundaries, e. g. Reiszange. :-)



The "funny" things start when diacritic marks and other
non-US-ASCII representable elements change the meaning
of a word. In such cases, it's often justified to use
the proper localized representation. However, this is
also the point where problems may start if you're doing
it wrong (which means: others do not conform to the
language settings or fonts you're using).

The (limited) US-ASCII set of characters is the only
easy way to avoid that. It may not _always_ look pretty,
but even in the worst case it works - and you can RELY on that.



--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...

David Brodbeck

Nov 9, 2011, 12:45:33 PM
It's worth noting, too, that most of the non-Unicode encoding systems
predate the Internet. When computers weren't really talking to each
other, there was no real emphasis on interoperability, and every OS
tended to come up with its own way of encoding foreign languages.
Languages like French, German, and English generally have it easy --
almost everything ended up being Latin1 (aka ISO 8859-1). For other
languages it can be much more complicated. There are at least three
commonly used encoding systems for Chinese. Unicode is gradually
winning, but you'll still find, for example, a lot of Chinese
documents in GB2312 and Big5.
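Python's standard codecs can show the same character under three of those systems (an editorial sketch; byte values per the respective code tables):

```python
ch = "\u4e2d"   # the character '中'

assert ch.encode("utf-8") == b"\xe4\xb8\xad"   # Unicode (UTF-8)
assert ch.encode("gb2312") == b"\xd6\xd0"      # GB2312 (mainland China)
assert ch.encode("big5") == b"\xa4\xa4"        # Big5 (Taiwan, Hong Kong)
```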

Conrad J. Sabatier

Nov 10, 2011, 8:12:34 PM
On Tue, 8 Nov 2011 23:04:25 -0600 (CST)
Robert Bonomi <bon...@mail.r-bonomi.com> wrote:

>
> "Conrad J. Sabatier" <con...@cox.net> wrote:
> >
> > <grin>
> >
> > Yes, and this is one area where the labels are more than a little
> > misleading as well. My natural inclination is to think of UTF-8 as
> > being a single-byte representation for each character in the set,
> > whereas UTF-16, as the name implies, would be the "wide", 2-byte
> > version.
>
> "Not exactly."
>
> > Nonetheless, as I posted earlier in this thread, according to the
> > info in gucharmap, the representations of the umlauted "u" are just
> > the opposite of this:
>
> "not exactly." Again.
>
> > UTF-8: 0xC3 0xBC
> > UTF-16: 0x00FC
> >
> > Go figure, huh? :-)
>
> In UTF-16, everything _is_ a 16-bit entity. Notice that 0x00FC has
> -four- nybbles after the '0x.' Every character boundary is on a
> multiple of 16 bits.

Ah yes! I hadn't noticed that.

What's really weird: as I mentioned in a private email to
Polytropon last night, the copy-and-paste in gucharmap suddenly
decided to start copying the UTF-8 code instead of the UTF-16. I have
no idea why that changed.

> In UTF-8, the 'base' charset -- the 'C0' and 'C1' groups are
> represented by a single byte. 'extended' characters are represented
> by two bytes. Thus, 'characters' have a *variable*length*
> representation -- one or two bytes. A character, whether it is
> represented by one or two bytes, can begin on -any- byte boundary
> within a data stream, depending on 'what came before it'. UTF-8
> 2-byte representations are designed such that one can jump to any
> _byte_ offset within the file, and determine -- by looking *only* at
> the value of that byte whether it is (a) a single-byte character, (b)
> the first byte of a two-byte sequence, or (c) the second byte of a
> two-byte sequence.
>
> With UTF-16 you can position directly to any -character-, by jumping
> to a _byte_ offset that is twice the index of the character you want.
> Given a byte offset, you always know the 'equivalent' _character_
> offset.
>
> With UTF-8, you have to read the character stream, counting
> 'characters' as you go, to get to the desired point. You can seek to
> an arbitrary _byte_ offset, but you do not know how many 'characters'
> into the file that offset is.

I see. Yes, that could certainly complicate things.

> UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and
> simplicity of addressing/representation (UTF-16).
>
> > This seems rather unfortunate to me. You would think that, by now,
> > some "standard" character set might have emerged that would allow
> > one to use, at the very least, the "Western" characters (as opposed
> > to the "Eastern" or "Oriental" or "Asian", if you will) with a
> > reasonable expectation that others will see what was intended.
>
I certainly get the point. :-) Thanks for that very thorough
elucidation. :-)

Now I just have to figure out what the heck's going on here, why
suddenly I'm seeing the exact opposite of what I was seeing yesterday.
Thought I had everything straightened out for a while there. :-(

Oh, this is madness! :-)

--
Conrad J. Sabatier
con...@cox.net