Filenames with "é" in the names

herman...@invalid.be

Jul 19, 2009, 6:32:42 AM

I have a lot of docs (genealogy book) which I edited a few years ago (.sdw
files of OO1.5?). I had these on CD.

Now it is time to do some updates, so I copied all files from the CD using
Dolphin. All works well, except for two files which happen to have an "é"
(or "è") in the filename (they are French texts). I get the message that the
file does not exist, although they are shown, and they have a real size.

I could overcome the copy problem by copying via the CLI, but then these
files would not open in OO3.0, giving the same "does not exist" message. That
I could also overcome in the CLI by renaming the files.

I think this all is caused by the fact that in earlier KDE3 versions I
always used ISO-8859-15 character sets, but now (MDV2009.1 KDE4.2) on
installation I think I have accepted the default UTF-8.

I've been chasing around in the configuration apps of KDE and MCC, but I
couldn't find where to change this character set.

HELP

Herman Viaene
--
Veel mensen danken hun goed geweten aan hun slecht geheugen. (G. Bomans)

Lots of people owe their good conscience to their bad memory (G. Bomans)

Marcel Bruinsma

Jul 19, 2009, 10:17:46 AM

herman...@invalid.be wrote:

> I have a lot of docs (genealogy book) which I edited a few years ago
> (.sdw files of OO1.5?). I had these on CD.
>
> Now it is time to do some updates, so I copied all files from the CD
> using Dolphin. All works well, except for two files which happen to
> have an "é" (or "è") in the filename (they are French texts). I get
> the message that the file does not exist, although they are shown, and
> they have a real size.
>
> I could overcome the copy problem by copying via the CLI, but then
> these files would not open in OO3.0, giving the same "does not exist"
> message. That I could also overcome in the CLI by renaming the files.
>
> I think this all is caused by the fact that in earlier KDE3 versions I
> always used ISO-8859-15 character sets, but now (MDV2009.1 KDE4.2) on
> installation I think I have accepted the default UTF-8.

Try to rename the files via the CLI, for example:

$ ln f*minin.sdw féminin.sdw

If that went well (check with ls), then move the new (utf8) link
to a different directory, and remove the old link:

$ rm f*minin.sdw

Or, if you are certain the name is in ISO-8859-15 encoding:

$ mv "$(printf '%s' féminin.sdw | iconv -f UTF-8 -t ISO-8859-15)" féminin.sdw

--
printf -v email $(echo \ 155 141 162 143 145 154 142 162 165 151 \
156 163 155 141 100 171 141 150 157 157 056 143 157 155|tr \ \\\\)
# Live every life as if it were your last! #

Marcel Bruinsma

Jul 19, 2009, 10:46:04 AM

Marcel Bruinsma wrote:

> herman...@invalid.be wrote:
>
>> I have a lot of docs (genealogy book) which I edited a few years ago
>> (.sdw files of OO1.5?). I had these on CD.
>>
>> Now it is time to do some updates, so I copied all files from the CD
>> using Dolphin. All works well, except for two files which happen to
>> have an "é" (or "è") in the filename (they are French texts). I get
>> the message that the file does not exist, although they are shown,
>> and they have a real size.
>>
>> I could overcome the copy problem by copying via the CLI, but then
>> these files would not open in OO3.0, giving the same "does not exist"
>> message. That I could also overcome in the CLI by renaming the files.
>>
>> I think this all is caused by the fact that in earlier KDE3 versions
>> I always used ISO-8859-15 character sets, but now (MDV2009.1 KDE4.2)
>> on installation I think I have accepted the default UTF-8.
>
> Try to rename the files via the CLI, for example:

Oops. Sorry, you mentioned „CD”: no rename possible. Try:

$ mkdir ~/test
$ new="féminin.sdw"
$ old="$(printf '%s' "$new"|iconv -futf8 -tISO-8859-15)"
$ cp "$old" ~/test/"$new"
$ ls -l ~/test

If you get garbage, run “rm -rf ~/test”.
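
To see what the iconv step above actually changes, here is a small sketch that writes both spellings of the name with octal escapes and compares them byte by byte. It assumes only printf, wc and od; the variable names are made up for illustration. A plausible reading (an assumption, not something confirmed in this thread) is that the file manager asks for the UTF-8 spelling while the CD holds the ISO-8859-15 one, which would explain the "does not exist" message.

```shell
# The same name in both encodings, written with octal escapes.
utf8_name=$(printf 'f\303\251minin.sdw')   # é as the two bytes 0303 0251
latin_name=$(printf 'f\351minin.sdw')      # é as the single byte 0351

# Count bytes; the arithmetic expansion strips any padding wc adds.
utf8_len=$(( $(printf '%s' "$utf8_name" | wc -c) ))
latin_len=$(( $(printf '%s' "$latin_name" | wc -c) ))
printf 'utf8: %s bytes, iso-8859-15: %s bytes\n' "$utf8_len" "$latin_len"

# Dump each name byte by byte; non-ASCII bytes show up as octal numbers.
printf '%s' "$utf8_name"  | od -An -c
printf '%s' "$latin_name" | od -An -c
```

The first line should report 12 bytes for the UTF-8 name and 11 for the ISO-8859-15 one: the same name on screen, different bytes on disk.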

James Kerr

Jul 19, 2009, 3:42:25 PM

herman...@invalid.be wrote:

> I have a lot of docs (genealogy book) which I edited a few years ago
> (.sdw files of OO1.5?). I had these on CD.
>
> Now it is time to do some updates, so I copied all files from the CD
> using Dolphin. All works well, except for two files which happen to
> have an "é" (or "è") in the filename (they are French texts). I get
> the message that the file does not exist, although they are shown,
> and they have a real size.
>
> I could overcome the copy problem by copying via the CLI, but then
> these files would not open in OO3.0, giving the same "does not exist"
> message. That I could also overcome in the CLI by renaming the
> files.
>
> I think this all is caused by the fact that in earlier KDE3 versions
> I always used ISO-8859-15 character sets, but now (MDV2009.1 KDE4.2)
> on installation I think I have accepted the default UTF-8.
>
> I've been chasing around in the configuration apps of KDE and MCC,
> but I couldn't find where to change this characterset.
>

This page may be of some help:

http://wiki.mandriva.com/en/Development/Howto/UTF8_Migration

Jim

herman...@invalid.be

Jul 20, 2009, 3:50:55 AM

James Kerr wrote:

No, it doesn't, since you can't change anything on a CD, and trying to
fiddle with the device is also not an option, since you don't know what
you will put into it next. Right?

Herman

herman...@invalid.be

Jul 20, 2009, 3:52:57 AM

Marcel Bruinsma wrote:

> Marcel Bruinsma wrote:
>
>> herman...@invalid.be wrote:
>>
>>> I have a lot of docs (genealogy book) which I edited a few years ago
>>> (.sdw files of OO1.5?). I had these on CD.
>>>
>>> Now it is time to do some updates, so I copied all files from the CD
>>> using Dolphin. All works well, except for two files which happen to
>>> have an "é" (or "è") in the filename (they are French texts). I get
>>> the message that the file does not exist, although they are shown,
>>> and they have a real size.
>>>
>>> I could overcome the copy problem by copying via the CLI, but then
>>> these files would not open in OO3.0, giving the same "does not exist"
>>> message. That I could also overcome in the CLI by renaming the files.
>>>
>>> I think this all is caused by the fact that in earlier KDE3 versions
>>> I always used ISO-8859-15 character sets, but now (MDV2009.1 KDE4.2)
>>> on installation I think I have accepted the default UTF-8.
>>
>> Try to rename the files via the CLI, for example:
>
> Oops. Sorry, you mentioned „CD”: no rename possible. Try:
>
> $ mkdir ~/test
> $ new="féminin.sdw"
> $ old="$(printf '%s' "$new"|iconv -futf8 -tISO-8859-15)"
> $ cp "$old" ~/test/"$new"
> $ ls -l ~/test
>
> If you get garbage, run “rm -rf ~/test”.
>

As I mentioned in my original post, I resolved the problem by renaming the
files in the copy. But I saved your suggestion, it might come in handy some
other time.

Herman

Doug Laidlaw

Jul 20, 2009, 11:28:04 PM

herman...@invalid.be wrote:

>
> As I mentioned in my original post, I resolved the problem by renaming the
> files in the copy. But I saved your suggestion, it might come in handy
> some other time.
>
> Herman
>

As you said higher up, you can't rename files on the CD. You could make a
fresh CD with everything in UTF-8.

Another thought: Perhaps you could install a French font. Can you discover
what character set the original CD uses? You may be able to install that
and let it read the original filenames. I don't know enough to say that it
would work.

Doug.
--
Life is either a daring adventure, or nothing.
- Helen Keller

herman...@invalid.be

Jul 21, 2009, 3:43:38 AM

Doug Laidlaw wrote:

> herman...@invalid.be wrote:
>
>>
>> As I mentioned in my original post, I resolved the problem by renaming
>> the files in the copy. But I saved your suggestion, it might come in
>> handy some other time.
>>
>> Herman
>>
> As you said higher up, you can't rename files on the CD. You could make a
> fresh CD with everything in UTF-8.
>
> Another thought: Perhaps you could install a French font. Can you
> discover
> what character set the original CD uses? You may be able to install that
> and let it read the original filenames. I don't know enough to say that
> it would work.
>

I used to pick ISO-8859-15 Western, if that is what you mean.

And how would you install that? Is that a separate rpm?

David W. Hodgins

Jul 22, 2009, 12:09:57 AM

On Tue, 21 Jul 2009 03:43:38 -0400, <herman...@invalid.be> wrote:

> I used to pick ISO-8859-15 Western, if that is what you mean.

I'd try creating an fstab entry for the old cds. Use the uuid
format, rather than /dev/cdrom, so it doesn't interfere with
automounting of other cds.

Include the users,umask=0,iocharset=iso8859-15,codepage=850,noauto
parameters. Then manually mount/umount the cd.
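
A minimal sketch of what such an entry could look like, built as a string so it can be inspected before it goes anywhere near /etc/fstab. The UUID and the mount point are placeholders, not values from this thread. umask=0 and codepage=850 are left out on the assumption that they belong to FAT-type filesystems rather than iso9660. According to the mount(8) man page, iocharset on iso9660 applies to the 16-bit Unicode (Joliet) names, so whether it helps depends on how the CD was written.

```shell
# All three values below are placeholders for illustration only.
cd_uuid="2005-01-01-00-00-00-00"   # hypothetical; blkid on the CD device should report the real one
mount_point="/mnt/oldcd"           # hypothetical; the directory must exist
options="ro,users,noauto,iocharset=iso8859-15"

# Assemble the fstab line: device, mount point, type, options, dump, pass.
fstab_line="UUID=$cd_uuid $mount_point iso9660 $options 0 0"
printf '%s\n' "$fstab_line"
```

With a line like that in /etc/fstab (added as root), "mount /mnt/oldcd" and "umount /mnt/oldcd" should then work for an ordinary user.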

Regards, Dave Hodgins

--
Change nomail.afraid.org to ody.ca to reply by email.
(nomail.afraid.org has been set up specifically for
use in usenet. Feel free to use it yourself.)

Aragorn

Jul 22, 2009, 12:30:24 AM

On Tuesday 21 July 2009 09:43, someone identifying as
*herman...@invalid.be* wrote in /alt.os.linux.mandriva:/

> Doug Laidlaw wrote:
>
>> herman...@invalid.be wrote:
>>
>>> As I mentioned in my original post, I resolved the problem by
>>> renaming the files in the copy. But I saved your suggestion, it
>>> might come in handy some other time.
>>>

>> As you said higher up, you can't rename files on the CD. You could
>> make a fresh CD with everything in UTF-8.
>>
>> Another thought: Perhaps you could install a French font. Can you
>> discover what character set the original CD uses? You may be able to
>> install that and let it read the original filenames. I don't know
>> enough to say that it would work.
>>
> I used to pick ISO-8859-15 Western, if that is what you mean.

The most commonly used character sets for Western languages are UTF-8,
ISO-8859-15 and ISO-8859-1. The latter two are nearly identical:
ISO-8859-15 swaps out eight little-used symbols of ISO-8859-1, most
notably to make room for the Euro symbol. That ISO-8859-1 lacks it could
be why it's the preferred character set for Americans; saves up a few
bytes of memory and all that... :pp

Still, all funny remarks aside, most distributions are now switching to
UTF-8 for the default characterset, as it's intended to be "universal".
All Unicode encodings - there is more than just UTF-8 - support all
characters available and give them the same code points, but they store
them in units of different size and so might occupy more or less memory;
e.g. someone who primarily writes East Asian text, where most characters
take three bytes in UTF-8, may prefer UTF-16, which stores them in two
and also contains Western characters.

For all intents and purposes, and some conservative notions set aside,
ISO-8859-1 and ISO-8859-15 are becoming deprecated. At least, in the
*nix* world, because Windows will always use its own charactersets.
<grin>

> And how would you install that? Is that a separate rpm?

UTF-8 should already be available on your system as it is, even if it
hasn't been set up to use it as the default. You should be able to
change this from within the MCC. ;-)

--
*Aragorn*
(registered GNU/Linux user #223157)

herman...@invalid.be

Jul 22, 2009, 4:16:41 AM

David W. Hodgins wrote:

> On Tue, 21 Jul 2009 03:43:38 -0400, <herman...@invalid.be> wrote:
>
>> I used to pick ISO-8859-15 Western, if that is what you mean.
>
> I'd try creating an fstab entry for the old cds. Use the uuid
> format, rather than /dev/cdrom, so it doesn't interfere with
> automounting of other cds.
>
> Include the users,umask=0,iocharset=iso8859-15,codepage=850,noauto
> parameters. Then manually mount/umount the cd.

Yes, that's a possibility. I'll keep that in mind. But what I still do not
understand is why at the CLI I can copy/rename these files - although the
"é" is shown as a black-and-white inverted "?" - while apps in KDE stumble
upon those files.

Maurice Batey

Jul 22, 2009, 7:17:05 AM

On Wed, 22 Jul 2009 06:30:24 +0200, Aragorn wrote:

> UTF-8 should already be available on your system as it is, even if it
> hasn't been set up to use it as the default. You should be able to
> change this from within the MCC. ;-)

Whereabouts in MCC, please?
--
/\/\aurice
(Replace "nomail.afraid" by "bcs" to reply by email)

Aragorn

Jul 22, 2009, 7:32:44 AM

On Wednesday 22 July 2009 13:17, someone identifying as *Maurice Batey*
wrote in /alt.os.linux.mandriva:/

> On Wed, 22 Jul 2009 06:30:24 +0200, Aragorn wrote:
>
>> UTF-8 should already be available on your system as it is, even if it
>> hasn't been set up to use it as the default. You should be able to
>> change this from within the MCC. ;-)
>
> Whereabouts in MCC, please?

Hey, you're the guy with Mandriva on his hard disk, so look around. ;-)
I'm still using an old Mandrake 10.0 on this box here. :-)

Maurice Batey

Jul 22, 2009, 7:50:54 AM

On Wed, 22 Jul 2009 13:32:44 +0200, Aragorn wrote:

> you're the guy with Mandriva on his hard disk, so look around. ;-)

Correct - and I have looked around, but didn't find...
That's why I asked!

David W. Hodgins

Jul 22, 2009, 1:15:01 PM

On Wed, 22 Jul 2009 07:50:54 -0400, Maurice Batey <mau...@nomail.afraid.org> wrote:

> Correct - and I have looked around, but didn't find...
> That's why I asked!

mcc/System/Manage localization for your system/Advanced.
Check the "Old compatibility (non UTF-8) encoding" box to
turn off UTF-8. Log out and back in for it to take effect.
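
After logging back in, the result can be checked from a terminal. A rough sketch, assuming the usual precedence of the locale variables (LC_ALL over LC_CTYPE over LANG); the two function names are invented for this example and the locale names in the comments are only samples:

```shell
# The effective character-type locale: LC_ALL wins, then LC_CTYPE, then LANG.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

# Succeeds when a locale name (e.g. nl_BE.UTF-8) carries a UTF-8 codeset suffix.
is_utf8_locale() {
    case $1 in
        *.UTF-8*|*.utf8*|*.UTF8*|*.utf-8*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_utf8_locale "$(effective_ctype)"; then
    echo "session locale is UTF-8"
else
    echo "session locale is not UTF-8"
fi
```

The standard "locale charmap" command prints the codeset of the current session directly and is the quicker check when it is available.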

David W. Hodgins

Jul 22, 2009, 12:38:22 PM

On Wed, 22 Jul 2009 04:16:41 -0400, <herman...@invalid.be> wrote:

> Yes, that's a possibility. I'll keep that in mind. But what I still do not
> understand is why at the CLI I can copy/rename these files - although the
> "é" is shown as a black-and-white inverted "?" - while apps in KDE stumble
> upon those files.

Using characters in file names that require "escaping", such as space,
newline, or characters outside of the current character set, is
pretty straightforward on the command line, but it can quickly
become a royal pain when dealing with multiple levels of scripts,
and with some commands.

If the CD is mounted with the correct iocharset, then I'd expect
both CLI and KDE applications to work properly.

If the KDE apps still fail when the correct charset is used for
the mount point, then it is a bug in KDE, and should be reported
to Mandriva at http://qa.mandriva.com/

I don't know if it is possible to determine what charset was used
when the filesystem was created. If not, it would be nice if udev
or the kde helper scripts could check for invalid characters in
the filenames, and at least make it easier for the user to specify
which charset to use. I think an enhancement bug report would
be appropriate.
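
Checking a filename for bytes that cannot be UTF-8 is at least cheap to do. A rough sketch in plain shell, assuming a structural check is good enough (it ignores most overlong forms and surrogates); the function name is_utf8 is made up for illustration and is not part of udev or KDE:

```shell
# Return 0 if the argument is structurally valid UTF-8, 1 otherwise.
# The bytes are dumped in octal by od and walked one at a time.
is_utf8() {
    need=0   # continuation bytes still expected
    for o in $(printf '%s' "$1" | od -An -v -to1); do
        b=$((0$o))   # octal text to number
        if [ "$need" -gt 0 ]; then
            # Inside a multi-byte sequence: only 0200..0277 is allowed.
            [ "$b" -ge 128 ] && [ "$b" -le 191 ] || return 1
            need=$((need - 1))
        elif [ "$b" -lt 128 ]; then
            :                       # plain ASCII
        elif [ "$b" -ge 194 ] && [ "$b" -le 223 ]; then
            need=1                  # lead byte of a two-byte sequence
        elif [ "$b" -ge 224 ] && [ "$b" -le 239 ]; then
            need=2                  # lead byte of a three-byte sequence
        elif [ "$b" -ge 240 ] && [ "$b" -le 244 ]; then
            need=3                  # lead byte of a four-byte sequence
        else
            return 1                # stray continuation or unused byte value
        fi
    done
    [ "$need" -eq 0 ]               # a sequence cut short is invalid too
}

is_utf8 "$(printf 'f\351minin.sdw')" || echo "name is not valid UTF-8"
```

This only says that a name is not UTF-8; it cannot tell which legacy charset the name was written in, so the user would still have to pick one.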

Maurice Batey

Jul 22, 2009, 1:49:23 PM

On Wed, 22 Jul 2009 13:15:01 -0400, David W. Hodgins wrote:

> mcc/System/Manage localization

Ah! The one place I assumed had nothing to do with UTF-8...
Box was not checked, i.e. UTF-8 in force.

Many thanks!

Aragorn

Jul 23, 2009, 4:08:56 AM

On Wednesday 22 July 2009 18:38, someone identifying as *David W.
Hodgins* wrote in /alt.os.linux.mandriva:/

> On Wed, 22 Jul 2009 04:16:41 -0400, <herman...@invalid.be> wrote:
>
>> Yes, that's a possibility. I'll keep that in mind. But what I still
>> do not understand is why at the CLI I can copy/rename these files -
>> although the "é" is shown as a black-and-white inverted "?" - while
>> apps in KDE stumble upon those files.
>
> Using characters in file names that require "escaping", such as space,
> newline, or characters outside of the current character set, is
> pretty straightforward on the command line, but it can quickly
> become a royal pain when dealing with multiple levels of scripts,
> and with some commands.
>
> If the cd is mounted with the correct iocharset, then I'd expect
> both cli and kde applications to work properly.

CLI and KDE are two entirely different and independently configurable
environments, just as was the case with Windows 3.x and the DOS version
it ran on top of, albeit that this is not entirely a correct analogy.

The main differences between the configuration of Windows 3.x (and
Windows 95/98/ME) and their underlying DOS version were a matter of
memory management - i.e. DOS ran in real mode and the Windows part ran
in DPMI (DOS Protected Mode Interface), which thus stored its data
outside of the DOS-accessible memory range - while on UNIX, the
distinction exists because of the nature of the environment, i.e. the
X11 protocol is an entirely different environment from a console
terminal. As such, X11 uses its own keyboard layout and
character/codepage settings, albeit that in any modern GNU/Linux or
UNIX version, consistency dictates that they should be set up in a
compatible manner at install time.

Yet, what /should/ be, *is* not necessarily the case. The differences
in nature between a character mode terminal being displayed at the
console and the X11 protocol allow for such discrepancies, as some
working environments may even require it.

One must not forget or overlook the fact that UNIX is not Windows, and
that what one is seeing on one's screen - be it either character mode
or X11 - is not per definition "the local screen", as is the case with
a single-user operating system like Windows. In UNIX, your screen is
your terminal, and there is a distinct difference between a character
mode shell and an X11 session. Character mode shells are started
by /init,/ while X11 is a server of its own, with whatever one sees on
one's screen as a client, with emphasis on the "a" - newsreader
formatting standards don't allow bold/underscored/italicized single
characters. ;-)

In addition to this - but I am out on a limb here - I do not know how
far KDE's own settings differ from those of the underlying X11 server.
Given that we've already seen reports of discrepancies between KDE's
own DPMS settings and those of the underlying X.Org, I suspect that
KDE's own interpretation of character sets and codepages may differ
there too, adding another layer where possible discrepancies could
occur.

If this then indeed occurs, then that can be filed as a bug report with
the distromaker - in casu Mandriva - because although X11 and a
character mode terminal on the local console may actually require being
set up differently under certain circumstances, KDE as a desktop
environment should normally be and remain fully consistent with the X11
server it is running on.

> For the kde apps to fail, when the correct charset is used for
> the mount point, then it is a bug in kde, and should be reported
> to Mandriva at http://qa.mandriva.com/

It need not necessarily be considered a bug. Traditionally, UNIX does
not display characters in filenames when those characters lie outside
of the available codepage, although it does offer the means to convert
them into characters that do fall within the specified codepage or
characterset.

It is actually a nice feature of /bash/ - I am presuming that this is
the shell my fellow countryman Herman is using - that it displays the
unknown character positions in the name as color-inverted question
marks, because to my knowledge, the original Bourne shell and the
non-builtins would not do this unless specifically told to via a
commandline option. Perhaps /bash/ also needs such commandline
options, but then it still gives credit to either Mandriva or the GNU
developers that they enabled this feature by default.

> I don't know if it is possible to determine what charset was used
> when the filesystem was created.

Neither do I, and I strongly suspect that it isn't. However, depending
on who made the CDs - I haven't followed the entire thread - this could
be seen as an indicator of what characterset and codepages were used.

If it was created on Windows, then it will for most part be the Windows
default. If it was done on GNU/Linux, then it'll most likely be an
ISO-8859-1/15 or UTF-8 characterset.

> If not, it would be nice if udev or the kde helper scripts could check
> for invalid characters in the filenames, and at least make it easier
> for the user to specify which charset to use. I think an enhancement
> bug report would be appropriate.

Such an enhancement would require a lot of coding and would by its
complexity seriously slow down the system, so I don't think it would be
desirable. ;-)

Marcel Bruinsma

Jul 23, 2009, 7:06:52 AM

Aragorn wrote:

> It is actually a nice feature of /bash/ - I am presuming that this is
> the shell my fellow countryman Herman is using - that it displays the
> unknown character positions in the name as color-inverted question
> marks,

Actually, that's a feature of the unicode standard and the terminal
emulator. Unicode defines the code point “REPLACEMENT
CHARACTER” (U+fffd), which is “used to replace an incoming
character whose value is unknown or unrepresentable in Unicode.”
The terminal emulator ‘translates’ these unknown/unrepresentable
characters to U+fffd. The glyph in the font that Herman selected
happens to be an inverted question mark.

In the font I'm using to type this article, the glyph for U+fffd is a
small rectangle. This here is U+fffd: �. What does it look like
to you?

The good thing about bash is, that, although it fully supports
multibyte character encodings (such as utf-8), it doesn't
change illegal multibyte sequences. Suppose, there are two
files with latin1 encoded names in the same directory; for
example fóó (f\363\363 in latin1) and fòò (f\362\362).
Suppose, bash would change the \363 and \362 bytes,
which are both an “illegal sequence” in UTF-8, to U+fffd,
then “mv f* ..” would effectively unlink one of the original
names, because both names would become f�� (f\357\277
\275\357\277\275 that is).

So, in conclusion: It is actually a nice feature of /bash/
that it leaves the unknown character positions in the
name the way they are. :-)
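
The collision described above can be reproduced without touching any real files. A sketch under stated assumptions: the function "lossy" is hypothetical and simply writes U+fffd (357 277 275) in place of every byte above 0177, which is cruder than a real decoder but gives the same result for these two names.

```shell
# Two latin1-encoded names, built from octal escapes.
name1=$(printf 'f\363\363')   # fóó in latin1
name2=$(printf 'f\362\362')   # fòò in latin1

# Hypothetical lossy conversion: dump the bytes in octal and print the
# three bytes of U+fffd for every byte that is not plain ASCII.
lossy() {
    for o in $(printf '%s' "$1" | od -An -v -to1); do
        if [ "$((0$o))" -gt 127 ]; then
            printf '357 277 275 '
        else
            printf '%s ' "$o"
        fi
    done
}

lossy1=$(lossy "$name1")
lossy2=$(lossy "$name2")
printf '%s\n%s\n' "$lossy1" "$lossy2"

if [ "$name1" != "$name2" ] && [ "$lossy1" = "$lossy2" ]; then
    echo "two different names collapse into one"
fi
```

Because the raw bytes are kept, the two names stay distinct and both can still be moved or renamed.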

herman...@invalid.be

Jul 23, 2009, 9:03:18 AM

David W. Hodgins wrote:

> On Wed, 22 Jul 2009 04:16:41 -0400, <herman...@invalid.be> wrote:
>
>> Yes, that's a possibility. I'll keep that in mind. But what I still do
>> not understand is why at the CLI I can copy/rename these files -
>> although the "é" is shown as a black-and-white inverted "?" - while apps
>> in KDE stumble upon those files.
>
> Using characters in file names that require "escaping", such as space,
> newline, or characters outside of the current character set, is
> pretty straightforward on the command line, but it can quickly
> become a royal pain when dealing with multiple levels of scripts,
> and with some commands.

I understand this general remark, but in my case there is no escape
sequence involved; the replacement character sits in the middle of a "normal"
sequence of characters.

>
> If the cd is mounted with the correct iocharset, then I'd expect
> both cli and kde applications to work properly.
>
> For the kde apps to fail, when the correct charset is used for
> the mount point, then it is a bug in kde, and should be reported
> to Mandriva at http://qa.mandriva.com/
>
> I don't know if it is possible to determine what charset was used
> when the filesystem was created. If not, it would be nice if udev
> or the kde helper scripts could check for invalid characters in
> the filenames, and at least make it easier for the user to specify
> which charset to use. I think an enhancement bug report would
> be appropriate.

As I stated already before - and I am referring here to other posts of
Aragorn and Maurice - the CDs were created when I was using a KDE3 version
(going back to Mandrake 9.X or 10.X, I can't remember) and at that time I
consistently chose ISO-8859-15 as character set.

I understand that trying to read text (or a filename) written in another
character set than the one I am currently using can produce strange-looking
results.
After all, the problem existed already way back in the times of DOS, when you
got files from Spain or wherever, where people used another character set
than what is used in many of the English-speaking parts of the world (or
those conforming to them).

What bothers me is that KDE4 just is not able to give you a decent way out
to handle such instances, instead of just blocking everything. Is that a
bug? I should think so, but then I cannot imagine nobody ever stumbled upon
this before. My access to the KDE bug database - not being a registered
developer (or whatever it is called) - is limited. I have an account, but other
bugs I submitted always had to travel via someone else, so I am not sure I
have full visibility on all of the database.
Otherwise, I'm quite willing to submit a bug.

Herman

>
> Regards, Dave Hodgins

Aragorn

Jul 23, 2009, 9:31:01 AM

On Thursday 23 July 2009 15:03, someone identifying as

*herman...@invalid.be* wrote in /alt.os.linux.mandriva:/

> David W. Hodgins wrote:

>
>> On Wed, 22 Jul 2009 04:16:41 -0400, <herman...@invalid.be> wrote:
>>
>>> Yes, that's a possibility. I'll keep that in mind. But what I still
>>> do not understand is why at the CLI I can copy/rename these files
>>> - although the "é" is shown as a black-and-white inverted "?" -
>>> while apps in KDE stumble upon those files.
>>
>> Using characters in file names that require "escaping", such as
>> space, newline, or characters outside of the current character set,
>> is pretty straightforward on the command line, but it can quickly
>> become a royal pain when dealing with multiple levels of scripts,
>> and with some commands.
>
> I understand this general remark, but in my case there is no escape
> sequence involved; the replacement character sits in the middle of a
> "normal" sequence of characters.

This would indeed be the way they are represented, if the proper switch
option is given to whatever external command or shell built-in you are
using, but I presume that the GNU developers have enabled this by
default in /bash./

> As I stated already before - and I am referring here to other posts of
> Aragorn and Maurice - the CDs were created when I was using a KDE3
> version (going back to Mandrake 9.X or 10.X, I can't remember) and at
> that time I consistently chose ISO-8859-15 as character set.

There should be tools available on your hard disk - as scripts - to
convert filenames from one characterset to another one. I haven't
properly looked into this, but my old Mandrake 10.0 here seems to
already have them, so I suspect that a newer Mandriva release would
also carry them.

> I understand that trying to read text (or filename) written in another
> characterset than the one I am currently using can produce strange
> looking results.

Indeed so. :-)

> After all, the problem existed already way back in the times of DOS,
> when you got files from Spain or wherever, where people used another
> character set than what is used in many of the English-speaking parts
> of the world (or those conforming to them).

Hmm... No, that was not a different characterset, but a different
codepage setting. A characterset is exactly what its name says, i.e. a
set of characters. A codepage defines the order of the available
characters in the characterset.

DOS used an ASCII-compatible characterset, but the differences in
codepages would change the way those characters appeared on-screen, or
at least, for those above the 127th position in the table, as the first
128 (positions 0 to 127) were supposed to be identical across codepages.
Or at least, so I gather - I didn't play around with those too often,
but I remember the differences between cp437 and cp850. ;-)

Aragorn

Jul 23, 2009, 9:50:12 AM

On Thursday 23 July 2009 13:06, someone identifying as *Marcel Bruinsma*
wrote in /alt.os.linux.mandriva:/

> Aragorn wrote:
>
>> It is actually a nice feature of /bash/ - I am presuming that this is
>> the shell my fellow countryman Herman is using - that it displays the
>> unknown character positions in the name as color-inverted question
>> marks,
>
> Actually, that's a feature of the unicode standard and the terminal
> emulator.

Right, I forgot about the terminal emulator there for a second. :-)

> Unicode defines the code point “REPLACEMENT CHARACTER” (U+fffd),
> which is “used to replace an incoming character whose value is unknown
> or unrepresentable in Unicode.”
> The terminal emulator ‘translates’ these unknown/unrepresentable
> characters to U+fffd. The glyph in the font that Herman selected
> happens to be an inverted question mark.

The inverting would then be a function of the terminal, but the question
mark is traditionally the way illegal characters are displayed in the
Bourne Shell, and thus also in GNU Bash. :-)

> In the font I'm using to type this article, the glyph for U+fffd is a
> small rectangle. This here is U+fffd: �. What does it look like
> to you?

A small rectangle, yes.

> The good thing about bash is, that, although it fully supports
> multibyte character encodings (such as utf-8), it doesn't
> change illegal multibyte sequences. Suppose, there are two
> files with latin1 encoded names in the same directory; for
> example fóó (f\363\363 in latin1) and fòò (f\362\362).
> Suppose, bash would change the \363 and \362 bytes,
> which are both an “illegal sequence” in UTF-8, to U+fffd,
> then “mv f* ..” would effectively unlink one of the original
> names, because both names would become f�� (f\357\277
> \275\357\277\275 that is).

I understand that, yes, but why would "fóó" or "fòò" be illegal
sequences in UTF-8? Accented characters are part of UTF-8, are they
not?

> So, in conclusion: It is actually a nice feature of /bash/
> that it leaves the unknown character positions in the
> name the way they are. :-)

Yes, one has to explicitly rename the files while using the "?"
wildcard, e.g.:

mv f?? foo

However, what I was referring to was the fact that an illegal character
might not show up at all in the output of a standard /ls/ command as it
existed in proprietary UNIX without a special switch to /ls/ - which
one I have forgotten - but that either the GNU version of */bin/ls*
or /bash/ - I don't know which one handles the output exactly, and the
terminal itself also comes into play, of course - converts an
"invisible" character to the illegal character glyph by default, albeit
that KDE might not be doing that; I haven't come across any illegal
characters in filenames yet in KDE's filemanager, but then again I
usually handle files from the commandline.

Either way, my point to Herman was that if KDE is indeed not showing any
illegal characters, then this is not a bug but rather legacy UNIX
behavior, and if /bash/ or any other commandline GNU tools do show
those characters by default, then this can only serve to give them
credit. :-)

Aragorn

Jul 23, 2009, 9:58:25 AM

On Thursday 23 July 2009 15:50, someone identifying as *Aragorn* wrote
in /alt.os.linux.mandriva:/

> On Thursday 23 July 2009 13:06, someone identifying as *Marcel
> Bruinsma* wrote in /alt.os.linux.mandriva:/
>

>> So, in conclusion: It is actually a nice feature of /bash/
>> that it leaves the unknown character positions in the
>> name the way they are. :-)
>
> Yes, one has to explicitly rename the files while using the "?"
> wildcard, e.g.:
>
> mv f?? foo

Correction, the proper example would have been...

mv f??oo foo

... where the "?" represents an invisible character.

Example of the context (in traditional UNIX):

- /ls/ shows a single file called abcd.txt
- when trying to move abcd.txt, /mv/ complains that
  no such file exists; conclusion, the file contains
  invisible illegal characters.
- the output of "ls -q" shows that the file is actually
  called a?bc?d.txt, with "?" being the glyph for illegal
  characters
- one can move/rename the file using "?" - i.e. the real
  question mark character - as a wildcard, i.e.

  mv a?bc?d.txt abcd.txt

Sorry for poorly having expressed myself. I do that sometimes. :-/

Marcel Bruinsma

Jul 23, 2009, 8:08:05 PM

Aragorn wrote:

>> In the font I'm using to type this article, the glyph for U+fffd is a
>> small rectangle. This here is U+fffd: �. What does it look like
>> to you?
>
> A small rectangle, yes.

To see the glyph that Herman described, open a Konsole
window, select the DejaVu Sans Mono font, and run
printf '\357\277\275\n' (the utf8 encoding of U+fffd).
It should display as a question mark in a ‘zeshoekige’
(hexagonal?) shape.

>> The good thing about bash is, that, although it fully supports
>> multibyte character encodings (such as utf-8), it doesn't
>> change illegal multibyte sequences. Suppose, there are two
>> files with latin1 encoded names in the same directory; for
>> example fóó (f\363\363 in latin1) and fòò (f\362\362).
>> Suppose, bash would change the \363 and \362 bytes,
>> which are both an “illegal sequence” in UTF-8, to U+fffd,
>> then “mv f* ..” would effectively unlink one of the original
>> names, because both names would become f�� (f\357\277
>> \275\357\277\275 that is).
>
> I understand that, yes, but why would "fóó" or "fòò" be illegal
> sequences in UTF-8? Accented characters are part of UTF-8,
> are they not?

The ‘o with grave accent’ (ò) is represented by one byte with the
value 0362 (octal notation) in latin1 encoding. In utf8 encoding
a byte of that value can only be the first byte of a four-byte
sequence, and it must then be followed by three continuation bytes
(0200 to 0277); the single-byte sequences in utf8 are between 0 and
0177. Therefore, a 0362 that is followed by anything else, as in
fòò, is by definition an illegal sequence in utf8, i.e. 0362 is
valid in latin1 context, but invalid in utf8 context.

The ‘o with grave accent’ is encoded as the two byte sequence
0303 0262 in utf8 encoding, which represents the unicode code
point U+00f2 (official name: LATIN SMALL LETTER O WITH
GRAVE). Obviously, 0xf2 is equal to 0362, so iso-8859-1 and
unicode ò have the same value, but the utf8 encoding of U+00f2
consists of two bytes.
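
For anyone who wants to look at those bytes directly, here is a small sketch, assuming iconv and od are installed. The variable names are only for illustration:

```shell
# ò as a single Latin-1 byte (octal 362), dumped in octal.
latin1=$(printf '\362' | od -An -to1 | xargs)

# The same character after conversion to UTF-8: two bytes.
utf8=$(printf '\362' | iconv -f ISO-8859-1 -t UTF-8 | od -An -to1 | xargs)

printf 'latin1: %s\n' "$latin1"   # latin1: 362
printf 'utf8:   %s\n' "$utf8"     # utf8:   303 262
```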

If a multi-byte aware program discovers an illegal sequence, it
can respond in any way it wants, for example: print an error
message and quit, or (if run interactively) ask the user how to
proceed, or discard the illegal sequence, or copy the illegal
sequence verbatim. Bash chooses the latter option, which is
the most sensible action for a shell.
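
Three of those responses can be imitated with standard tools. This is a sketch only, assuming an iconv that supports the -c option (glibc's does); the file name is made up, and cat merely stands in for a program that copies bytes untouched:

```shell
# A name that is valid Latin-1 (fóó.txt) but not valid UTF-8.
name=$(printf 'f\363\363.txt')

# Response 1: refuse. iconv exits non-zero on the illegal sequence.
if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    verdict=valid
else
    verdict=invalid
fi

# Response 2: discard. iconv -c drops what it cannot convert.
# (Some iconv versions still exit non-zero here, hence the "|| true".)
stripped=$(printf '%s' "$name" | iconv -c -f UTF-8 -t UTF-8 2>/dev/null || true)

# Response 3: copy verbatim; the bytes come out exactly as they went in.
verbatim=$(printf '%s' "$name" | cat | od -An -to1 | xargs)

printf '%s\n' "$verdict" "$stripped" "$verbatim"
```

Only the third response leaves a name that still matches what is stored in the directory.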

>> So, in conclusion: It is actually a nice feature of /bash/
>> that it leaves the unknown character positions in the
>> name the way they are. :-)
>
> Yes, one has to explicitly rename the files while using
> the "?" wildcard, e.g.:
>
> mv f?? foo

Unless you know the encoding of the file name. In that case
something like

mv "$(printf 'fóó'|iconv -futf8 -tl1)" fóó

which renames from the iso-8859-1 name to the utf-8 name (assuming
LC_ALL or LC_CTYPE or LANG has .UTF-8 set), is
probably the safest course of action.
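
With more than a couple of files, the same idea can be put in a loop. This is a rough sketch, not a finished tool: it assumes every name that fails a UTF-8 check really is ISO-8859-1, and the two test files are created just for the demonstration:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Two Latin-1 encoded names: fóó and fòò.
touch "$(printf 'f\363\363')" "$(printf 'f\362\362')"

for old in *; do
    # Leave names alone that are already valid UTF-8.
    if printf '%s' "$old" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        continue
    fi
    new=$(printf '%s' "$old" | iconv -f ISO-8859-1 -t UTF-8) || continue
    # -n (no-clobber) keeps mv from overwriting an existing target.
    mv -n -- "$old" "$new"
done

# Show the resulting names byte by byte.
LC_ALL=C ls -b
```

Checking for valid UTF-8 first makes the loop safe to run twice; converting an already converted name would otherwise garble it.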

> However, what I was referring to was the fact that an illegal
> character might not show up at all in the output of a standard /ls/
> command as it existed in proprietary UNIX

Those were the days when some terminals would ‘hang’ upon
receipt of characters with the most significant bit set, right? :-)

> without a special switch
> to /ls/ - which one I have forgotten - but that either the GNU version
> of */bin/ls* or /bash/ - I don't know which one handles the output
> exactly, and the terminal itself also comes into play, of course -
> converts an "invisible" character to the illegal character glyph by
> default, albeit that KDE might not be doing that;

I think the problem could be that some KDE applications either
convert an illegal sequence, or just strip it from the input. In both
cases the resulting file name would no longer match the name
stored in the directory, and OS functions, such as open() or
chdir(), will fail with a “no such file or directory” error.
Unfortunately, GUI applications rarely tell you what they're
up to in a clear, unambiguous way.
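
That failure mode is easy to reproduce from the shell. This sketch does not show what any KDE program actually does internally; it only shows why a name that has been stripped or converted can no longer be found. The directory name is an example:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# The name as stored in the directory: Latin-1 "cinéma" (é is octal 351).
real=$(printf 'cin\351ma')
mkdir -- "$real"

# What a program ends up with if it strips the byte ...
stripped=$(printf '%s' "$real" | LC_ALL=C tr -d '\351')
# ... or replaces it with U+FFFD (octal 357 277 275 in UTF-8).
replaced=$(printf 'cin\357\277\275ma')

for candidate in "$real" "$stripped" "$replaced"; do
    if [ -d "$candidate" ]; then echo found; else echo "no such file or directory"; fi
done
```

Only the first lookup succeeds, because the kernel compares file names byte for byte.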

> I haven't come
> across any illegal characters in filenames yet in KDE's filemanager,
> but then again I usually handle files from the commandline.

So do I ;-) , except for ftp downloads of complete sub-trees; just
one drag-and-drop to copy 5000+ files, can't beat that with any
number of CLI tools!

> Either way, my point to Herman was that if KDE is indeed not showing
> any illegal characters, then this is not a bug but rather legacy UNIX
> behavior, and if /bash/ or any other commandline GNU tools do show
> those characters by default, then this can only serve to give them
> credit. :-)

I ran:

$ mkdir $(printf cinéma|iconv -futf8 -tl1)
$ mkdir cinéma
$ ls -ld ci*
cinéma cin?ma
$ echo ci*
cinéma cin�ma

Interesting! The ls command still converts the (in my UTF-8 locale)
unprintable latin1 byte (0351) to a question mark, but it does use
the iswprint() function. The echo command (a bash built-in) OTOH
converts this unprintable byte to the unicode U+fffd code point.

Konqueror displays the latin1 encoded name as ‘cinRRma’, where
each of the Rs is a small rectangle. Very peculiar! I can, however,
open the directory, then open the embedded terminal, and bash
shows ‘cinRma’ in its prompt. Then:

~/cin�ma> ls -ld "$(pwd)"
drwxr-xr-x 2 marcel users 4096 jul 24 01:52 /home/marcel/cin?ma

Well, well, amazing! Fortunately, it's only the way they display the
name that differs; none of these programs reports a “no such file
or directory” error. Internally they all use the same name.
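
When the display cannot be trusted, the bytes can be shown unambiguously. This is a sketch assuming an ls that supports -b; forcing LC_ALL=C makes every byte above 0177 count as non-graphic, so both names come out as octal escapes:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# One Latin-1 name and one UTF-8 name for "cinéma".
mkdir -- "$(printf 'cin\351ma')" "$(printf 'cin\303\251ma')"

# -b prints non-graphic bytes as C-style octal escapes.
listing=$(LC_ALL=C ls -b)
printf '%s\n' "$listing"

# The same information per name, straight from od.
for d in */; do
    printf '%s' "${d%/}" | od -An -to1
done
```

Whatever glyph a given program chooses to draw, the escaped form shows which of the two names is which.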

Aragorn

Jul 23, 2009, 10:15:30 PM

On Friday 24 July 2009 02:08, someone identifying as *Marcel Bruinsma*
wrote in /alt.os.linux.mandriva:/

> Aragorn wrote:

>
>>> In the font I'm using to type this article, the glyph for U+fffd is
>>> a small rectangle. This here is U+fffd: �. What does it look like
>>> to you?
>>
>> A small rectangle, yes.
>
> To see the glyph that Herman described, open a Konsole
> window, select the DejaVu Sans Mono font, and run
> printf '\357\277\275\n' (the utf8 encoding of U+fffd).
> It should display as a question mark in a ‘zeshoekige’
> (hexagonal?) shape.

"Hexagonal" is indeed the correct word in English, as well as in many
other languages. :-)

Oh, okay, now I understand what you're saying. ;-) For a moment there,
I thought that you meant to imply that the glyph of the character
itself does not exist in Unicode - which of course didn't make sense as
I could see and type both accented "o"'s and I am using Unicode myself
here. (I chose ISO-8859-15 at installation time many summers ago -
animals were still talking back then :p - but have recently begun
converting everything to UTF-8 on this machine.)

> If a multi-byte aware program discovers an illegal sequence, it
> can respond in any way it wants, for example: print an error
> message and quit, or (if run interactively) ask the user how to
> proceed, or discard the illegal sequence, or copy the illegal
> sequence verbatim. Bash chooses the latter option, which is
> the most sensible action for a shell.

From what I recall, the original Bourne Shell used to
discard illegal sequences by default. A lot sure has changed since the
days of the old Bourne Shell. GNU Bash is simply awesome, like all the
GNU tools for that matter. It's no wonder Linus Torvalds already
referred to GNU as "professional stuff" when he first announced his
kernel project on Usenet in 1991. ;-)

As an aside, this is one of the reasons why I refer to our beloved
operating system as GNU/Linux - or on occasion, vocally as "GNU plus
Linux" - instead of just plain "Linux". Linux is only a kernel, and it
wouldn't be what it is today without the GNU toolchain.

Of course, I realize that not everything is GNU - e.g. X.Org uses a
different, BSD-style license, and Apache has its own GPL-compatible
license - but X.Org and Apache are not components of the so-called
"baselayout" of the operating system, which consists solely of the
Linux kernel and the GNU userland, plus a few extra GNU tools.

I therefore think it's a shame that Linus Torvalds, Andrew Morton et al
prefer stroking their egos instead of giving the GNU initiative (and by
consequence, the FSF) full recognition for their work, while the FSF
itself does recognize and confirm the importance of the Linux kernel.

One of the kernel developers - it might have been Linus himself, but I'm
not sure about that - even went so far as to say "What have those FSF
guys done, really? It's not like they write any code or anything."
Quite obviously the man was totally oblivious to the
existence of /gcc,/ /bash,/ /glibc/ or even /Emacs,/ which was already
written by Richard Stallman himself long before Linux even existed. ;-)

>>> So, in conclusion: It is actually a nice feature of /bash/
>>> that it leaves the unknown character positions in the
>>> name the way they are. :-)
>>
>> Yes, one has to explicitly rename the files while using
>> the "?" wildcard, e.g.:
>>
>> mv f?? foo
>
> Unless you know the encoding of the file name, e.g
>
> mv $(printf 'fóó'|iconv -futf8 -tl1) fóó
>
> for a conversion from iso-8859-1 to utf-8 (assuming
> LC_ALL or LC_CTYPE or LANG has .UTF-8 set), is
> probably the safest course of action.

This is interesting stuff... ;-) I've never actually fiddled with that
before. :-)

>> However, what I was referring to was the fact that an illegal
>> character might not show up at all in the output of a standard /ls/
>> command as it existed in proprietary UNIX
>
> Those were the days when some terminals would ‘hang’ upon
> receipt of characters with the most significant bit set, right? :-)

Oh yeah... :-) Can we say "Unisys"? :p

>> without a special switch to /ls/ - which one I have forgotten - but
>> that either the GNU version of */bin/ls* or /bash/ - I don't know
>> which one handles the output exactly, and the terminal itself also
>> comes into play, of course - converts an "invisible" character to the
>> illegal character glyph by default, albeit that KDE might not be
>> doing that;
>
> I think the problem could be that some KDE applications either
> convert an illegal sequence, or just strip it from the input. In both
> cases the resulting file name would no longer match the name
> stored in the directory, and OS functions, such as open() or
> chdir(), will fail with a “no such file or directory” error.
> Unfortunately, GUI applications rarely tell you what they're
> up to in a clear, unambiguous way.

That is probably due to the fact that writing a GUI is an entirely
different thing from actually dealing with low-level stuff like
filesystem access.

This is one of those things that aggravated me during my brief
experiences with Windows NT on my own machine between 1997 and 1999 and
the very limited experiences with other Wintendo versions on the
computers of other people. Windows would often spit out error messages
that had nothing whatsoever to do with the actual error and totally set
you off on the wrong foot in trying to hunt down the problem. I guess
it's the same thing with any GUI. ;-)

>> I haven't come across any illegal characters in filenames yet in
>> KDE's filemanager, but then again I usually handle files from the
>> commandline.
>
> So do I ;-) , accept for ftp downloads of complete sub-trees; just
> one drag-and-drop to copy 5000+ files, can't beat that with any
> number of CLI tools!

Of course! ;-) One has to use common sense in all of this. Some people
totally abhor the thought of a commandline, and others are
totally repelled by anything desktop-environment-related and insist
that we should all use the commandline only - there was a thread about
this subject not too long ago in either /comp.os.linux.hardware/
or /comp.os.linux.setup;/ I forgot which one it was again.

Some guy who identified as Sidney Lamb (or something) was raving against
KDE all the time and even went so far as to insist that KDE was a
corporate conspiracy initiative to dumb down the enduser. Now I'm
highly interested in conspiracy theories and most of them actually do
contain a lot of truth, and in addition to that I'm also not too happy
with the decision of both distromakers and the KDE developers to try
and make KDE (4.x) look like a Windows clone, but in this case there
was obviously paranoid delusion at play.

According to a number of regulars in said group, this poster was the
same person as the one going by the pseudonym "Alan Connor". I don't
know, really. There were similarities in stance and paranoia, yes. I
suppose it's possible.

>> Either way, my point to Herman was that if KDE is indeed not showing
>> any illegal characters, then this is not a bug but rather legacy UNIX
>> behavior, and if /bash/ or any other commandline GNU tools do show
>> those characters by default, then this can only serve to give them
>> credit. :-)
>
> I ran:
>
> $ mkdir $(printf cinéma|iconv -futf8 -tl1)
> $ mkdir cinéma
> $ ls -ld ci*
> cinéma cin?ma
> $ echo ci*
> cinéma cin�ma
>
> Interesting! The ls command still converts the (in my UTF-8 locale)
> unprintable latin1 byte (0362) to a question mark, but it does use
> the iswprint() function. The echo command (a bash built-in) OTOH
> converts this unprintable byte to the unicode U+fffd code point.

The difference here is interesting, but in essence they are both
reporting an illegal character as such, because in the output from
your /ls/ command, a question mark is used to substitute a character in
the directoryname, while a question mark is a wildcard character and
therefore illegal. At least, in theory, because every literal
character becomes legal once it has been escaped; that's why we can
have filenames with spaces or even newlines in them.

> Konqueror displays the latin1 encoded name as ‘cinRRma’, where
> each of the Rs is a small rectangle. Very peculiar! I can, however,
> open the directory, then open the embedded terminal, and bash
> shows ‘cinRma’ in its prompt. Then:
>
> ~/cin�ma> ls -ld "$(pwd)"
> drwxr-xr-x 2 marcel users 4096 jul 24 01:52 /home/marcel/cin?ma
>
> Well, well, amazing! Fortunately, it's only the way they display the
> name that differs; none of these programs reports a “no such file
> or directory” error. Internally they all use the same name.

This is possibly because they are all actually looking for the inode
rather than for a file or directory by its given name - that is the
proper UNIX way. Still, it's bizarre how they each seem to have their
own interpretation on what to output to the screen when encountering
illegal characters. It's inconsistent, and that is *not* the UNIX way.

Thanks for this interesting post. I'm archiving this for future
reference on the subject. ;-)

Namaste. ;-)

Maurice Batey

Jul 24, 2009, 2:34:42 PM

On Wed, 22 Jul 2009 06:30:24 +0200, Aragorn wrote:

> For all intents and purposes, and some conservative notions set aside,
> ISO-8859-1 and ISO-8859-15 are becoming deprecated.
>

> UTF-8 should already be available on your system

Just came across a weird complication when replying to a posting in
another thread here.

The posting contained the Norwegian town name Tønsberg (copied/pasted
here).
When I tried to reply, using my Pan-nominated editor Kwrite (with
Configure/Open-Save/General/Encoding set to UTF-8) the keyboard was
dead so far as Kwrite was concerned and the 2nd letter of the name
appeared as a black triangle containing a '?', but when I changed
that setting to ISO-8859-15, the 2nd letter appeared correctly as
"ø", and I could enter characters from the keyboard.

Why Kwrite should in the first case ignore the keyboard I don't
understand, but as a result of the above episode I have turned UTF-8
off in MCC!

Aragorn

Jul 25, 2009, 3:03:37 AM

On Friday 24 July 2009 20:34, someone identifying as *Maurice Batey*
wrote in /alt.os.linux.mandriva:/

> On Wed, 22 Jul 2009 06:30:24 +0200, Aragorn wrote:

>
>> For all intents and purposes, and some conservative notions set
>> aside, ISO-8859-1 and ISO-8859-15 are becoming deprecated.
>>
>> UTF-8 should already be available on your system
>
> Just came across a weird complication when replying to a posting in
> another thread here.
>

> The posting contained the Norwegian town name Tønsberg (copied/pasted
> here).

And displayed correctly in my UTF-8 set-up. :-)

(I just had to check again whether it was still set up to use UTF-8 as I
have just had to hard-reset my machine due to a hardware problem - this
box really is dying :-/ - and I had a /reiserfs/ journal playback, of
course. This often tends to screw up things, and one of the things I
had lost was my already years-old custom look & feel, i.e. fonts and
font sizes, color scheme, and even spellchecker set up. However, the
UTF-8 setting still seems to have stuck.)

> When I tried to reply, using my Pan-nominated editor Kwrite (with
> Configure/Open-Save/General/Encoding set to UTF-8) the keyboard was
> dead so far as Kwrite was concerned and the 2nd letter of the name

> appeared as a black triangle containing a '?', [...

Ah yes, the black triangles... We've sent a couple of F-16s after them
once, but they were too fast. Rumor has it that they are
Lockheed-Martin TR3-Bs, using antigravity propulsion reverse-engineered
from crashed UFOs. :p

> ..] but when I changed that setting to ISO-8859-15, the 2nd letter
> appeared correctly as "ø", and I could enter characters from the
> keyboard.

Are you so sure the keyboard was dead? Couldn't it have been that you
incidentally stumbled upon "dead keys" or something?

I'm just guessing here myself, but Pan is not a KDE application - isn't
it built on GTK libraries? - while KWrite on the other hand is a
KDE/Qt-linked editor. There might be some discrepancy there. In fact,
in my experience, there usually is, even.

> Why Kwrite should in the first case ignore the keyboard I don't
> understand, but as a result of the above episode I have turned UTF-8
> off in MCC!

I can't say that I have an idea on what might have gone wrong, other
than that KDE and non-KDE stuff - and particularly GTK stuff - usually
don't communicate very well. Yet, the MCC is /supposed/ to set up the
system consistently across all environments. It's odd... :-/

Maurice Batey

Jul 25, 2009, 12:13:50 PM

On Sat, 25 Jul 2009 09:03:37 +0200, Aragorn wrote:

> And displayed correctly in my UTF-8 set-up. :-)

Which set-up? (The problem appears in kwrite.)

> Are you so sure the keyboard was dead? Couldn't it have been that
> you incidentally stumbled upon "dead keys" or something?

That was my first thought. Actually, not completely dead; the
up/down arrow keys worked, but Enter and any character key were a
no-op. This is with the Norwegian town name Tønsberg in the file.

> I'm just guessing here myself, but Pan is not a KDE application -
> isn't it built on GTK libraries? - while KWrite on the other hand
> is a KDE/Qt-linked editor.

Pan is a Gnome application.

Never had such a problem before, and have been using the Pan/Kwrite
combination since 2008.0.

However, just to eliminate the Pan connection, I fired up Kwrite
separately from Pan, changed the Kwrite setting back from ISO-8859-15
to UTF-8, and pasted the Norwegian name.
SAME PROBLEM!
Changed the Kwrite setting back to ISO-8859-15, and all is well. QED

Any other suggestions as to WTF is going on here?!

(This is on 2009.1 PowerPack, KDE 4.2.4, and with MCC/System/Local
'UTF-8' turned off during previous boot.)

Perhaps simply a Kwrite bug?

Aragorn

Jul 25, 2009, 1:54:34 PM

On Saturday 25 July 2009 18:13, someone identifying as *Maurice Batey*
wrote in /alt.os.linux.mandriva:/

> On Sat, 25 Jul 2009 09:03:37 +0200, Aragorn wrote:

>
>> And displayed correctly in my UTF-8 set-up. :-)
>
> Which set-up? (The problem appears in kwrite.)

Just about everything here is UTF-8 now, although it was ISO8859-15 when
I installed the system.

>> Are you so sure the keyboard was dead? Couldn't it have been that
>> you incidentally stumbled upon "dead keys" or something?
>
> That was my first thought. Actually, not completely dead; the
> up/down arrow keys worked, but Enter and any character key were a

> no-op. This is with the Norwegian town name Tønsberg in the file.

Extremely bizarre... :-/

>> I'm just guessing here myself, but Pan is not a KDE application -
>> isn't it built on GTK libraries? - while KWrite on the other hand
>> is a KDE/Qt-linked editor.
>
> Pan is a Gnome application.

And Gnome is GTK-based. ;-)

> Never had sach a problem before, and have been using the Pan/Kwrite
> combination since 2008.0.
>
> However, just to eliminate the Pan connection, I fired up Kwrite
> separately from Pan, changed the Kwrite setting back from ISO-8859-15
> to UTF-8, and pasted the Norwegian name.
> SAME PROBLEM!

Okay, that narrows it down to KWrite then.

> Changed the Kwrite setting back to ISO-8859-15, and all is well. QED
>
> Any other suggestions as to WTF is going on here?!

An insect. Ehm... Excuse me, I mean a bug. :p

> (This is on 2009.1 PowerPack, KDE 4.2.4, and with MCC/System/Local
> 'UTF-8' turned off during previous boot.)
>
> Perhaps simply a Kwrite bug?

Yes, but a very serious one in terms of usability. They should fix that
pretty soon.

Maurice Batey

Jul 25, 2009, 2:31:20 PM

On Sat, 25 Jul 2009 09:03:37 +0200, Aragorn wrote:

> Ah yes, the black triangles..

Just been looking in Kmail for black triangles. Found one msg with
one. (The char before "1000" is the UK 'pound' sign):

(1)
Message window: to win ?1000, sign up

View source: to win =A31000, sign up
('charset="UTF-8"')

(2) Here is one with the pound sign displayed correctly:

Message window: Total: £8.98

View source: Total: =C2=A38.98
("charset=UTF-8")

So, when View Source shows the pound sign as "=A3", the black
triangle appears, but when View Source shows it as "=C2=A3", then the
pound sign displays correctly.

I used to understand character sets (code pages), but now...

N.B. When I Saved this message, I got the warning:

"The selected encoding cannot encode every unicode character in this
document. Do you really want to save it? There could be some data
lost."

Maurice Batey

Jul 25, 2009, 3:56:01 PM

On Sat, 25 Jul 2009 19:54:34 +0200, Aragorn wrote:

> a bug

Now reported:

https://qa.mandriva.com/show_bug.cgi?id=52449

Aragorn

Jul 26, 2009, 12:35:21 AM

On Saturday 25 July 2009 21:56, someone identifying as *Maurice Batey*
wrote in /alt.os.linux.mandriva:/

> On Sat, 25 Jul 2009 19:54:34 +0200, Aragorn wrote:

>
>> a bug
>
> Now reported:
>
> https://qa.mandriva.com/show_bug.cgi?id=52449

You did well. However, I suspect that the bug is not Mandriva-specific
but rather KDE 4.x-specific, and so the Mandriva developers will have
to relay this one upstream and wait for a fix from the KDE team. ;-)

Robert Riches

Jul 26, 2009, 12:37:03 AM

With many of the bug reports I have filed that have been upstream
issues, the Mandriva folks have instructed me that I (user,
customer) must file the upstream bug report.

Good luck.

--
Robert Riches
spamt...@verizon.net
(Yes, that is one of my email addresses.)

Marcel Bruinsma

Jul 26, 2009, 6:56:49 AM

Maurice Batey wrote:

> However, just to eliminate the Pan connection, I fired up Kwrite
> separately from Pan, changed the Kwrite setting back from
> ISO-8859-15 to UTF-8, and pasted the Norwegian name.
> SAME PROBLEM!
> Changed the Kwrite setting back to ISO-8859-15, and all is well.
> QED

Still using KDE 3.5, can't reproduce this. It's most likely one
of the many bugs introduced in KDE 4.

Of course, there's also a bug in Pan. When Pan passes a text to an
external program it should convert the encoding of that text to
utf-8 (or whatever your default locale is set to). Pan knows the
encoding from the article headers, the external program (Kwrite)
doesn't have this information, and even if it did, it wouldn't
understand the headers (it's an editor not a newsreader).

Next bug report. :-)

Marcel Bruinsma

Jul 26, 2009, 6:57:46 AM

Maurice Batey wrote:

> Just been looking in Kmail for black triangles. Found one msg
> with one. (The char before "1000" is the UK 'pound' sgn):
>
> (1)
> Message window: to win ?1000, sign up
>
> View source: to win =A31000, sign up
> ('charset="UTF-8"')

The =A3 is not utf-8. It could be iso-8859-1, iso-8859-3,
iso-8859-7, iso-8859-8, iso-8859-9, iso-8859-13, iso-8859-14,
iso-8859-15, or windows-1252. If the “Content-Type:” header
contains “charset=UTF-8” and Kmail is set to automatic
recognition, Kmail is innocent. The sending MUA should
have set ‘charset’ to the correct value.

> (2) Here is one with the pound sign displayed correctly:
>
> Message window: Total: £8.98
>
> View source: Total: =C2=A38.98
> ("charset=UTF-8")

Yep, that's utf8 pound sign (Unicode code point U+00A3, is utf8
two byte sequence 0302 0243, is equal to 0xC2 0xA3).

> So, when View Source shows the pound sign as "=A3", the black
> triangle appears, but when View Source shows it as "=C2=A3",
> then the pound sign displays correctly.

Two different character encodings, Kmail can handle both, but it
needs the correct headers. If Kmail reads the =A3 in an utf-8
context, the =A3 is an illegal sequence, which is converted to the
Unicode “REPLACEMENT CHARACTER” (U+FFFD). The
actual glyph (triangle, hexagon, rectangle, et cætera) varies
from font to font.
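
The two cases can be checked from a shell. In quoted-printable, "=A3" stands for the single byte 0xA3 (octal 243) and "=C2=A3" for the bytes 0xC2 0xA3. This is a small sketch, assuming iconv is available; check_utf8 is a helper made up for the occasion:

```shell
check_utf8() {
    # Succeeds only if stdin is well-formed UTF-8.
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# The lone byte 0xA3, as in "=A3".
if printf '\243' | check_utf8; then lone=valid; else lone=invalid; fi

# The pair 0xC2 0xA3, as in "=C2=A3".
if printf '\302\243' | check_utf8; then pair=valid; else pair=invalid; fi

# Read as ISO-8859-1 instead, the lone byte converts to the UTF-8 pound sign.
fixed=$(printf '\243' | iconv -f ISO-8859-1 -t UTF-8 | od -An -to1 | xargs)

printf '%s %s %s\n' "$lone" "$pair" "$fixed"
```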

> I used to understand character sets (code pages), but now...

Character sets and character transfer encodings are not the same
thing! Unicode can be represented by at least 6 different transfer
encodings (utf-8 is the most common).
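
To make the difference concrete, here is the same code point (U+00A3, the pound sign) in several transfer encodings. This is a sketch assuming an iconv that knows these encoding names; encode_pound is only an illustrative helper:

```shell
encode_pound() {
    # $1 is an encoding name that iconv understands.
    printf '\302\243' | iconv -f UTF-8 -t "$1" | od -An -tx1 | xargs
}

for enc in UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE; do
    printf '%-9s %s\n' "$enc" "$(encode_pound "$enc")"
done
```

The character set (Unicode) and the code point stay the same in every row; only the bytes on the wire change.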

> N.B. When I Saved this message, I got the warning:
>
> "The selected encoding cannot encode every unicode character in this
> document. Do you really want to save it? There could be some data
> lost."

This means that your default character encoding is not utf-8 (or
utf-16, or utf-32), that's your bug, fix it. :-)

Maurice Batey

Jul 26, 2009, 7:53:01 AM

On Sun, 26 Jul 2009 06:35:21 +0200, Aragorn wrote:

> suspect that the bug is not Mandriva-specific
> but rather KDE 4.x-specific, and so the Mandriva developers will have to
> relay this one upstream and wait for a fix from the KDE team. ;-)

So either they will relay it or (as they have done recently with
others) they will ask me to...

Maurice Batey

Jul 26, 2009, 7:58:52 AM

On Sun, 26 Jul 2009 12:56:49 +0200, Marcel Bruinsma wrote:

> there's also a bug in Pan. When Pan passes a text to an
> external program it should convert the encoding of that text to utf-8 (or
> whatever your default locale is set to).

Although the problem also occurs when calling kwrite outside of Pan.
However in this case the result is the same, as the troublesome word
was pasted from within a Pan posting.

Marcel Bruinsma

Jul 26, 2009, 8:11:36 AM

Maurice Batey wrote:

> On Sun, 26 Jul 2009 12:56:49 +0200, Marcel Bruinsma wrote:
>
>> there's also a bug in Pan. When Pan passes a text to an
>> external program it should convert the encoding of that text to utf-8
>> (or whatever your default locale is set to).
>
> Although the problem also occurs when calling kwrite outside of Pan.
> However in this case the result is the same, as the troublesome word
> was pasted from within a Pan posting.

Sure, but if Pan would have converted to utf8, as it should, there
would not have been a problem. Both applications have a bug. The
Kwrite bug, locking part of your keyboard is, of course, more
serious.

Maurice Batey

Jul 26, 2009, 8:19:56 AM

On Sun, 26 Jul 2009 12:57:46 +0200, Marcel Bruinsma wrote:

> This means that your default character encoding is not utf-8

It is now, once again...

Maurice Batey

Jul 26, 2009, 8:28:49 AM

On Sun, 26 Jul 2009 12:57:46 +0200, Marcel Bruinsma wrote:

> If the “Content-Type:” header contains
> “charset=UTF-8” and Kmail is set to automatic recognition, Kmail is
> innocent.

It is so set (View-->Set Encoding--> 'Auto' selected).

Marcel Bruinsma

Jul 26, 2009, 8:39:56 AM

Maurice Batey wrote:

> On Sun, 26 Jul 2009 12:57:46 +0200, Marcel Bruinsma wrote:
>
>> If the “Content-Type:” header contains
>> “charset=UTF-8” and Kmail is set to automatic recognition, Kmail is
>> innocent.
>
> It is so set (View-->Set Encoding--> 'Auto' selected).

That explains the ‘triangle’. Kmail expects utf8, but gets windows
or iso. If you select one of those, instead of ‘Auto’, the pound sign
will display correctly. And don't forget to inform the sender of the
mail, that their software is broken.

Maurice Batey

Jul 26, 2009, 8:41:07 AM

On Sun, 26 Jul 2009 12:56:49 +0200, Marcel Bruinsma wrote:

> Of course, there's also a bug in Pan. When Pan passes a text to an
> external program it should convert the encoding of that text to utf-8 (or
> whatever your default locale is set to).

Actually, having now looked into that, I found that the Pan editor's
character set was not set to UTF-8!
Have now set it to that, and now kwrite (reverted to UTF-8
setting) - when called from Pan - handles the file correctly, with no
keyboard problem! So Pan (version 1) does appear to convert.

But as you say elsewhere, kwrite's keyboard input locking when
given a file with an invalid UTF-8 character is a different matter
altogether...

Marcel Bruinsma

Jul 26, 2009, 10:54:57 AM

Maurice Batey wrote:

> On Sun, 26 Jul 2009 12:56:49 +0200, Marcel Bruinsma wrote:
>
>> Of course, there's also a bug in Pan. When Pan passes a text to an
>> external program it should convert the encoding of that text to utf-8
>> (or whatever your default locale is set to).
>
> Actually, having now looked into that, I found that the Pan editor's
> character set was not set to UTF-8!

Well, then you better inform the maintainers of “Maurice Batey”,
that their product malfunctioned. ;-)

Seriously, I've always thought this is a design flaw in both Gnome
and KDE. For CLI programs you set LANG (and possibly one or
more LC_* environment variables) in ~/.profile or ~/.login, and
almost all programs will use that. Changing locale is simple, user
friendly. Gnome and KDE have a centralised locale config too,
but many gnome/kde programs have their separate, private config
for the character encoding, which makes it more complicated to
change (or experiment).
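
For the CLI side, the whole configuration is a line or two in ~/.profile, and the order in which the variables are consulted is fixed: LC_ALL overrides LC_CTYPE, which overrides LANG. A sketch follows; the locale name is only an example, and effective_ctype is a made-up helper that mimics that precedence:

```shell
# A line one might put in ~/.profile (the locale name is just an example):
#   export LANG=en_US.UTF-8

effective_ctype() {
    # $1 = LC_ALL, $2 = LC_CTYPE, $3 = LANG; the first non-empty value wins.
    for v in "$1" "$2" "$3"; do
        if [ -n "$v" ]; then
            printf '%s\n' "$v"
            return
        fi
    done
    # Nothing set at all: programs fall back to the POSIX ("C") locale.
    printf 'POSIX\n'
}

effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

On systems that ship the locale utility, "locale charmap" reports the character encoding that results from these settings.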

Oh, well, it took a long time to get from the Bourne shell to today's
bash/zsh. I suppose 25 years from now GUI developers might start
to realise that a tiny bit of user-friendliness doesn't hurt. :-)

Dave Farrance

Jul 26, 2009, 12:16:55 PM

Marcel Bruinsma <we-love-...@gmail.com> wrote:

>That explains the ‘triangle’. Kmail expects utf8, but gets windows
>or iso. If you select one of those, in stead of ‘Auto’, the pound sign
>will display correctly. And don't forget to inform the sender of the
>mail, that their software is broken.

The majority of emails are "broken" then. Windows-1252 character set with
no header. It's annoying, but we should work with the real world here. At
the moment, if an email program doesn't assume that received messages
without a charset header are Windows-1252, then it will usually misrender
them.
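
That fallback can be sketched in a few lines of shell. This is not how any particular mail program does it, merely the idea: keep the text if it is already valid UTF-8, otherwise assume Windows-1252. The function name to_utf8 is made up, and iconv is assumed to know the name CP1252:

```shell
to_utf8() {
    # Read all of stdin; the trailing "x" protects final newlines
    # from being stripped by the command substitution.
    body=$(cat; printf x)
    body=${body%x}
    if printf '%s' "$body" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        # Already valid UTF-8: pass it through untouched.
        printf '%s' "$body"
    else
        # Not valid UTF-8: guess Windows-1252 and convert.
        printf '%s' "$body" | iconv -f CP1252 -t UTF-8
    fi
}

# A lone 0xA3 byte (octal 243) becomes a proper UTF-8 pound sign.
printf 'win \2431000' | to_utf8 | od -An -to1
```

Trying UTF-8 first is reasonably safe, since text in a single-byte encoding rarely happens to be well-formed UTF-8 unless it is pure ASCII.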

--
Dave Farrance

Maurice Batey

Jul 26, 2009, 1:05:00 PM

On Sun, 26 Jul 2009 16:16:55 +0000, Dave Farrance wrote:

> if an email program doesn't assume that received messages
> without a charset header are Windows-1252, then it will usually misrender
> them.

I went through the whole list of alternatives in Kmail's
'View/Set encoding', and couldn't find one that would show the £ sign
in the sample I quoted earlier ("win ?1000").
The first I tried was "Western European (cp 1252)", but that
failed.

Dave Farrance

Jul 26, 2009, 3:22:18 PM

Strange... What about this: £100

Marcel Bruinsma

Jul 26, 2009, 5:56:48 PM

Dave Farrance wrote:

Most of the email with broken headers that I get declares itself to be
ISO-8859-1 while the body contains UTF-8. Old configurations
that need an update. When informed /politely/, senders usually
adjust their configuration files. If they are unable to work out
their config, I tell Kmail to replace the broken headers (and make
a mental note of the sender's incapacity).

The ‘Windows software fails to insert (correct|any) headers’ case
happens too (less frequently). My experience is that companies will fix
this problem when informed and instructed. It's cluelessness, not
rudeness towards their paying customers, that causes them to use broken
software (or to fail to configure it properly).

Of course, I might also receive email that uses UTF-8 without declaring
it. Such email is broken too, but I'll never notice. :-)
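
For illustration, a small sketch of what a wrong declaration does to
the pound sign. The helper name misread is made up; it encodes text in
one charset and decodes the bytes in another, which is roughly what a
mail client does when it trusts a wrong header.

```python
def misread(text, actual, assumed):
    """Encode text as `actual`, then decode the bytes as `assumed`."""
    return text.encode(actual).decode(assumed, errors="replace")

if __name__ == "__main__":
    # UTF-8 body declared as ISO-8859-1: the two-byte pound sign
    # shows up as two separate characters.
    print(misread("win £1000", "utf-8", "iso-8859-1"))   # win Â£1000

    # ISO-8859-1 body read as UTF-8: the lone 0xA3 byte is invalid
    # and becomes the replacement character U+FFFD.
    print(misread("win £1000", "iso-8859-1", "utf-8"))

    # Undeclared UTF-8 read by a client that defaults to UTF-8:
    # technically broken, but nobody notices.
    print(misread("win £1000", "utf-8", "utf-8"))        # win £1000
```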

Marcel Bruinsma

Jul 26, 2009, 6:10:33 PM

Maurice Batey wrote:

> I went through the whole list of alternatives in Kmail's
> 'View/Set encoding', and couldn't find one that would show
> the £ sign in the sample I quoted earlier ("win ?1000").
> The first I tried was "Western European (cp 1252)", but that
> failed.

Odd. If your mail is stored locally (in ~/.kde4/…), you
might try to quit Kmail, then edit the mailbox, replacing
“charset="UTF-8"” by “charset="ISO-8859-15"”, and
start Kmail again. (make a backup copy first)
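
If editing by hand is impractical, something along these lines could
do it. This is only a sketch under assumptions: it treats the mailbox
as one plain file, the function name relabel_charset is made up, and it
relabels every occurrence, including messages that really are UTF-8.
Quit Kmail before running it.

```python
import shutil
from pathlib import Path

def relabel_charset(mailbox,
                    old=b'charset="UTF-8"',
                    new=b'charset="ISO-8859-15"'):
    """Back up the file, then replace every `old` declaration with `new`.

    Works on raw bytes so that message bodies are not re-encoded.
    Returns the number of declarations that were replaced.
    """
    mailbox = Path(mailbox)
    backup = mailbox.with_name(mailbox.name + ".bak")
    shutil.copy2(mailbox, backup)          # backup copy first
    data = mailbox.read_bytes()
    count = data.count(old)
    mailbox.write_bytes(data.replace(old, new))
    return count
```

Call it with the path of the mailbox file; the untouched original
stays next to it with a .bak suffix.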

Aragorn

Jul 27, 2009, 2:12:32 AM

On Sunday 26 July 2009 14:41, someone identifying as *Maurice Batey*
wrote in /alt.os.linux.mandriva:/

> On Sun, 26 Jul 2009 12:56:49 +0200, Marcel Bruinsma wrote:

On the other hand, it /does/ make sense that something like this could
happen, or at least in part. Allow me to elaborate...

A keyboard is only an input device, and there is nothing special about
the Enter/Return key. It's a key like any other key, which sends a
certain scancode to the computer. In real mode - i.e. up until the
kernel is loaded into memory, but before the kernel image is
decompressed, as that already happens with the boot processor in
protected mode - this is handled by the BIOS using a default mapping,
and in protected mode this is handled by the kernel.

Now, as soon as the operating system is loaded and userspace is
initiated, the kernel puts the keyboard in RAW mode, which means that
every keystroke is passed on "as is" to whatever userspace process is
monitoring the keyboard. In KDE, this is the underlying X11 server,
which relays the keyboard's input to whatever application has focus,
insofar as said keyboard input has not been reserved for any special
functions, like the /Ctrl+Alt+Del/ sequence to log out of KDE,
or /Ctrl+Alt+Backspace/ to kill the X server.

As such, when the application receiving keyboard input has
difficulty interpreting the scancodes because of an invalid
codepage translation, the keyboard does not really get locked up
- although it /appears/ that way - but instead every keypress is in
essence misinterpreted by the client window as something
illegal and thus improperly handled.

It *is* a bug, but the above is the mechanism by which this bug can
manifest itself, and the mechanism itself is as it should be. The bug
itself is therefore to be found in the fact that KWrite misinterprets
the keystrokes and considers them to be illegal characters.

Another possible and additional bug - although this is open to
interpretation - lies with the fact that if the system's administrator
makes use of the MCC to switch the system over to UTF-8, then the MCC
should perform this action consistently and not just to KDE/Qt-only
applications while leaving the Gnome/GTK-based end untouched, or vice
versa.

Of course, GTK-based stuff is completely separate from Qt-based stuff,
but it is, in my humble opinion, the task of the Mandriva Control
Center to consistently administer such changes systemwide, or at the
very least for everything requiring the X server, and not just to a
subset of the available GUI applications. And considering that the MCC
itself is GTK-based, it seems even more bizarre that it doesn't seem to
know how to apply those changes to the configuration file of another
GTK-app.

They've integrated just about everything else, from desktop menu entries
to even using a "common" theming for both KDE and Gnome applications,
so why not integrate character set and codepage settings across both
these environments?

And the latter, given that it pertains to the MCC, is a Mandriva bug,
not a KDE or Gnome bug.

Maurice Batey

Jul 27, 2009, 12:40:25 PM

On Mon, 27 Jul 2009 00:10:33 +0200, Marcel Bruinsma wrote:

> edit the mailbox

Easier said than done - I have so many msgs in the Inbox.
I'm not curious to the extent that I would try that!

Maurice Batey

Jul 28, 2009, 12:02:25 PM

On Sun, 26 Jul 2009 12:57:46 +0200, Marcel Bruinsma wrote:

> The sending MUA should have set ‘charset’ to the correct
> value.

Have just received another email from the same source, containing
the same phrase ("win £1000"), and this time the msg header declares
"char.set=us-ascii" and the pound symbol does display correctly!
Guess someone else must have got to them first...

Dave Farrance

Jul 28, 2009, 3:11:41 PM

Maurice Batey <mau...@nomail.afraid.org> wrote:

> Have just received another email from the same source, containing
>the same phrase ("win £1000"), and this time the msg header declares
>"char.set=us-ascii" and the pound symbol does display correctly!
> Guess someone else must have got to them first...

Except that us-ascii is a 7-bit character set that does not contain the
pound symbol. It's still messed up. Spammers probably hire software
writers that can't be hired anywhere else, anyway. Don't reply to them!

I'd guess that most newsreaders would treat any 8-bit character that
appeared in a "us-ascii" post as ISO-8859-1, since that's the most
common 8-bit character set.
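
A sketch of that lenient behaviour, assuming the reader simply falls
back to ISO-8859-1 (the function name decode_declared_ascii is made up,
and real newsreaders may choose a different fallback):

```python
def decode_declared_ascii(data, lenient_charset="iso-8859-1"):
    """Decode bytes that were declared as us-ascii.

    Returns (text, was_valid). Strict ASCII decoding fails on any
    byte >= 0x80, in which case the lenient charset is used instead.
    """
    try:
        return data.decode("ascii"), True
    except UnicodeDecodeError:
        return data.decode(lenient_charset), False

if __name__ == "__main__":
    print(decode_declared_ascii(b"win 1000"))
    print(decode_declared_ascii(b"win \xa31000"))
```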

I've set the Content-Type of this post to us-ascii, so let's see what
happens here. I've listed what these characters should be in ISO-8859-1
but maybe some people might see (e.g.) ISO-8859-15 characters instead:

0xA3: £ (UK Pound currency)
0xA4: ¤ (Currency sign)
0xBC: ¼ (1/4)
0x80: € (Nothing, but is Euro sign in Microsoft CP1252)
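
The same four bytes can be decoded under each charset with a few lines
of Python, as a rough cross-check. The names SAMPLE_BYTES, CHARSETS,
render and table are made up for this sketch; the codecs are Python's
standard ones.

```python
SAMPLE_BYTES = (0xA3, 0xA4, 0xBC, 0x80)
CHARSETS = ("iso-8859-1", "iso-8859-15", "cp1252")

def render(byte, charset):
    """Decode one byte; show C1 control codes as U+XXXX (no glyph)."""
    char = bytes([byte]).decode(charset)
    if 0x80 <= ord(char) <= 0x9F:
        return f"U+{ord(char):04X}"
    return char

def table():
    """Map each byte (as '0xNN') to its rendering in every charset."""
    return {f"0x{b:02X}": [render(b, cs) for cs in CHARSETS]
            for b in SAMPLE_BYTES}

if __name__ == "__main__":
    print("byte", *CHARSETS)
    for byte, cells in table().items():
        print(byte, *cells)
```

Only 0xA4, 0xBC and 0x80 differ between the three. In both ISO
charsets 0x80 is a control code with no glyph, which many fonts draw
as a box containing the code number.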

--
Dave Farrance

Maurice Batey

Jul 28, 2009, 3:59:20 PM

On Tue, 28 Jul 2009 19:11:41 +0000, Dave Farrance wrote:

> 0xA3: £ (UK Pound currency)
> 0xA4: ¤ (Currency sign)
> 0xBC: ¼ (1/4)
> 0x80: € (Nothing, but is Euro sign in Microsoft CP1252)

In your posting I see:

(1) Pound symbol
(2) Cross with a circle in the middle!
(3) 'quarter' symbol
(4) Square with '00 80' inside
(but in this reply (via Kwrite) it shows as Euro symbol)

Maurice Batey

Jul 28, 2009, 4:00:57 PM

On Tue, 28 Jul 2009 19:11:41 +0000, Dave Farrance wrote:

> Spammers probably hire software
> writers that can't be hired anywhere else, anyway. Don't reply to them!

Wasn't from a spammer, but from a top-notch company in the UK
('SAGA')!

David W. Hodgins

Jul 28, 2009, 4:01:53 PM

On Tue, 28 Jul 2009 15:11:41 -0400, Dave Farrance <DaveFa...@omitthisyahooandthis.co.uk> wrote:

Using Opera here. Using the regular message view, the characters
are being translated.

> 0xA3: £ (UK Pound currency)
> 0xA4: ¤ (Currency sign)
> 0xBC: ¼ (1/4)
> 0x80: € (Nothing, but is Euro sign in Microsoft CP1252)

All except the currency sign show as described. The currency
sign looks like a small letter o, but with squished corners.

If I select "View all headers and message", to view the message
in raw mode, the characters all show as a period.

As I'm composing this article, the characters are shown ok.
I'll check to see what Opera selects for the character encoding.

Regards, Dave Hodgins

--
Change nomail.afraid.org to ody.ca to reply by email.
(nomail.afraid.org has been set up specifically for
use in usenet. Feel free to use it yourself.)

Marcel Bruinsma

Jul 28, 2009, 10:42:30 PM

Dave Farrance wrote:

> 0xA3: £ (UK Pound currency)
> 0xA4: ¤ (Currency sign)
> 0xBC: ¼ (1/4)
> 0x80: € (Nothing, but is Euro sign in Microsoft CP1252)

All as described.

Jim Beard

Jul 28, 2009, 11:26:29 PM

Dave Farrance wrote:
> 0xA3: £ (UK Pound currency)
> 0xA4: ¤ (Currency sign)
> 0xBC: ¼ (1/4)
> 0x80: € (Nothing, but is Euro sign in Microsoft CP1252)

Top to bottom, UK Pound sign, looks like a tv with a dot
at each corner, one-quarter, euro symbol

Maurice Replied:

On Tue, 28 Jul 2009 19:11:41 +0000, Dave Farrance wrote:

> > 0xA3: £ (UK Pound currency)
> > 0xA4: ¤ (Currency sign)
> > 0xBC: ¼ (1/4)

> > 0x80: ? (Nothing, but is Euro sign in Microsoft CP1252)

In your posting I see:

(1) Pound symbol
(2) Cross with a circle in the middle!
(3) 'quarter' symbol
(4) Square with '00 80' inside
(but in this reply (via Kwrite) it shows as Euro symbol)
-- /\/\aurice

AND in Maurice Replied the Euro sign looks like the
OE squashed together of Oedipus, 0x80.

Marcel Bruinsma wrote:
> > 0xA3: £ (UK Pound currency)
> > 0xA4: ¤ (Currency sign)
> > 0xBC: ¼ (1/4)

> > 0x80: ? (Nothing, but is Euro sign in Microsoft CP1252)
All as described.

BUT in Marcel's reply the OE I saw is replaced by the box
with 00 80 inside in what I saw, yet in what I pasted here it
too is OE.

In none of the above did I see a cross with a circle in the
middle, nor did 0xA4 ever appear as a currency sign.
The pound sign and the one-quarter consistently displayed
correctly for me, but the Euro symbol/OE/box with 00 80 seems to
have done whatever it felt like at the time.

Interesting what conversions result from cut/paste.

And Thunderbird has complained that some of the characters in
this message will not display in the default character encoding
and proposed UTF8 (refused). Who knows what will result...

--
UNIX is not user unfriendly; it merely
expects users to be computer-friendly.

Peter D.

Jul 28, 2009, 11:45:45 PM

on Tue, 28 Jul 2009 02:40 am
in the Usenet newsgroup alt.os.linux.mandriva
Maurice Batey wrote:

> On Mon, 27 Jul 2009 00:10:33 +0200, Marcel Bruinsma wrote:
>
>> edit the mailbox
>
> Easier said than done - I have so many msgs in the Inbox.
> I'm not curious to the extent that I would try that!

Doesn't Mandriva provide a script to do that? I certainly
wouldn't want to do it "by hand".

--
Peter D.
Sig goes here..

Aragorn

Jul 29, 2009, 2:36:21 AM

On Tuesday 28 July 2009 21:11, someone identifying as *Dave Farrance*
wrote in /alt.os.linux.mandriva:/

> I've set the Content-Type of this post to us-ascii, so let's see what
> happens here. I've listed what these characters should be in
> ISO-8859-1 but maybe some people might see (e.g.) ISO-8859-15
> characters instead:

I am using UTF-8 myself, but of course KNode recognizes other
character sets. I see... ->

> 0xA3: £ (UK Pound currency)

Yep.

> 0xA4: ¤ (Currency sign)

A small horizontally oriented rectangle of which the corners seem to
have been stretched out away from the center of the rectangle.

> 0xBC: ¼ (1/4)

One quarter, indeed.

> 0x80: € (Nothing, but is Euro sign in Microsoft CP1252)

An upright and empty rectangle.

Peter D.

Jul 29, 2009, 5:16:48 AM

on Wed, 29 Jul 2009 05:11 am

in the Usenet newsgroup alt.os.linux.mandriva

Dave Farrance wrote:

> Maurice Batey <mau...@nomail.afraid.org> wrote:
>
>> Have just received another email from the same source, containing
>>the same phrase ("win £1000"), and this time the msg header declares
>>"char.set=us-ascii" and the pound symbol does display correctly!
>> Guess someone else must have got to them first...
>
> Except that us-ascii is a 7-bit character set that does not contain the
> pound symbol. It's still messed up. Spammers probably hire software
> writers that can't be hired anywhere else, anyway. Don't reply to them!
>
> I'd guess that most newsreaders would treat any 8-bit character that
> appeared in a "us-ascii" post as ISO-8859-1, since that's the most
> common 8-bit character set.
>
> I've set the Content-Type of this post to us-ascii, so let's see what
> happens here. I've listed what these characters should be in ISO-8859-1
> but maybe some people might see (e.g.) ISO-8859-15 characters instead:

Using 2009.1 with KDE 4.2.4 and Knode 0.99.01. My home directory is
very old and I don't know what my default charset is.

There is an unknown character in the Subject box, a question mark on
a black hexagon. Other posts in this thread showed an accented e.

> 0xA3: £ (UK Pound currency)
> 0xA4: ¤ (Currency sign)
> 0xBC: ¼ (1/4)
> 0x80: € (Nothing, but is Euro sign in Microsoft CP1252)

With Knode's charset at Automatic I see;
Pounds,
currency,
1/4, and
Euro.

Western European (IBM850);
lower case u rising accent (acute?),
lower case n squiggle (tilde?),
lower right corner twin line,
upper case C cedilla.

Western European (ISO 8859-1);
Pounds,
currency,
1/4, and
Euro.

Western European (ISO 8859-14);
Pounds,
upper case C dot,
lower case y descending accent (grave?), and
Euro.

Western European (ISO 8859-15);
Pounds,
Euro,
upper case OE ligature,
Euro again.

Western European (cp 1252);
Pounds,
currency,
1/4, and
Euro.

There are a few things that are wrong here.