Improved console UTF-8 support for the Linux kernel?


Simos Xenitellis

Dec 11, 2004, 12:10:11 PM

Hi All,
The current UTF-8 keyboard input (for the console) of the Linux kernel
does not support "composing" or writing characters with accents. This
affects quite a few languages that require accents (French, German,
Danish, Swedish?, Greek, Cyrillic-based?, others?).

In general, UTF-8 console support is good enough to display text in different character sets,
which makes it possible to configure a distribution to use UTF-8 locales for both
the console and Xorg. However, while it used to be possible to write in German, Spanish, French, etc.,
that is no longer possible.

While looking into the problem, I noticed that there is work to make
Linux console handle Unicode better.

Two links are of interest:
A. Improved UTF-8 support for the Linux kernel, by Chris Heath
http://chris.heathens.co.nz/linux/utf8.html
B. Notes on the Linux console, by Innocenti Maresin
http://www.comtv.ru/~av95/linux/console/

Discussion of these issues takes place on the linux-utf8 mailing list, archived at
http://groups-beta.google.com/group/nlo.lists.linux-utf8

Chris Heath has a set of incremental patches
(http://chris.heathens.co.nz/linux/utf8.html) to enhance Unicode for the
console.
I noticed that he contacted this list in May 2003
(http://seclists.org/lists/linux-kernel/2003/May/7956.html) but
unfortunately the discussion was diverted to coding styles.

Is there an interest for re-submission of mentioned patches for
inclusion in the kernel (yeah, provided coding style is "normalised")?

Simos

p.s.
I am not sending this e-mail on behalf of any of the authors, just
myself.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

David Gómez

Dec 11, 2004, 12:40:08 PM

Hi Simos ;),

> The current UTF-8 keyboard input (for the console) of the Linux kernel
> does not support "composing" or writing characters with accents.

Yes, I recently found that out when trying to switch my whole system to
UTF-8. But the patch from Chris that you mention below works very well
for me (and, I guess, for anybody who needs to type composed characters for
languages covered by the latin1 encoding).

> affects quite a few languages that require accents (French, German,
> Danish, Swedish?, Greek, Cyrillic-based?, others?).

Spanish ;))

> Chris Heath has a set of incremental patches
> (http://chris.heathens.co.nz/linux/utf8.html) to enhance Unicode for the
> console.
> I noticed that he contacted this list in May 2003
> (http://seclists.org/lists/linux-kernel/2003/May/7956.html) but
> unfortunately the discussion was diverted to coding styles.

Chris told me on the utf-8 mailing list that he doesn't think his patch to
make the kernel generate UTF-8 characters from the compose tables will be
included in the main kernel, basically because it is not a full solution that
covers all the cases... But there is nothing better, so maybe it would be a
good idea to include it. The current state is that, in the 2.6 kernel, the text console
is broken in UTF-8 mode because it cannot generate UTF-8 composed characters.

> Is there an interest for re-submission of mentioned patches for
> inclusion in the kernel (yeah, provided coding style is "normalised")?

At least, I am _really_ interested :)

regards,

--
David Gómez Jabber ID: dav...@jabber.org

Jan Engelhardt

Dec 11, 2004, 2:10:11 PM

>> The current UTF-8 keyboard input (for the console) of the Linux kernel
>> does not support "composing" or writing characters with accents.

That's weird, because "ö" (LATIN O WITH DIAERESIS) -- which clearly lies
outside the 7-bit range, is working on my system without myself poking the
kernel. Both hitting the key or using compose mode. This also applies to
A-with-DIAERESIS, U-with-DIAERESIS, sharp german S, but does not for anything
else, e.g. compose-'-e to generate E with accent aigu.

>Yes, I recently found that out when trying to switch my whole system to
>UTF-8. But the patch from Chris that you mention below works very well
>for me (and, I guess, for anybody who needs to type composed characters for
>languages covered by the latin1 encoding).
>

>> Is there an interest for re-submission of mentioned patches for
>> inclusion in the kernel (yeah, provided coding style is "normalised")?
>
>At least, I am _really_ interested :)

So am I. I have to use xterm for anything fancy now...
(especially for the even-more fancy stuff that begins at three-byte UTF8
sequences, such as Japanese :-)

Jan Engelhardt
--
ENOSPC

David Gómez

Dec 11, 2004, 4:30:14 PM

Hi Jan ;),

On Dec 11 at 08:07:11, Jan Engelhardt wrote:
> >> The current UTF-8 keyboard input (for the console) of the Linux kernel
> >> does not support "composing" or writing characters with accents.
>
> That's weird, because "ö" (LATIN O WITH DIAERESIS) -- which clearly lies
> outside the 7-bit range, is working on my system without myself poking the
> kernel.

Indeed, it is weird. Are you sure your keyboard is generating a UTF-8
encoded "ö"? Just check it with echo:

$ echo -n ö | od -t x1

0000000 c3 b6
0000002
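
The same check can be sketched outside the shell. This is a minimal Python illustration, assuming nothing beyond the standard library; the helper name `hex_bytes` is made up for this example. It shows the bytes that "ö" (U+00F6) becomes under UTF-8 and under Latin-1:

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated hex, like `od -t x1`."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))


o_umlaut = "\u00f6"  # "ö", LATIN SMALL LETTER O WITH DIAERESIS

print(hex_bytes(o_umlaut, "utf-8"))    # two bytes: c3 b6
print(hex_bytes(o_umlaut, "latin-1"))  # one byte: f6
```

If the keyboard delivers the single byte f6, the console is effectively receiving Latin-1 rather than UTF-8.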

I'm using kernel 2.6.9 + Chris patch

> So am I. I have to use xterm for anything fancy now...
> (especially for the even-more fancy stuff that begins at three-byte UTF8
> sequences, such as Japanese :-)

I know :)). By the way, and this is off-topic, have you checked out uim? I
was testing it the other day with good results, and I like it a lot as
a Japanese (or other script, although I only use it for Japanese) input
method. I've used it with anthy; I still have to try it with skk.

regards,

--
David Gómez Jabber ID: dav...@jabber.org

Jan Engelhardt

Dec 11, 2004, 4:50:07 PM

>Indeed, it is weird. Are you sure your keyboard is generating a UTF-8
>encoded "ö"? Just check it with echo:
>
>$ echo -n ö | od -t x1
>
>0000000 c3 b6
>0000002

Yes it does generate 0xC3B6 (otherwise it would show up as garbage, because it
would not be utf8-compliant if it only output 0xF6)
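
The point about compliance can be illustrated with a short Python sketch. It relies only on the standard `bytes.decode` behaviour; the helper name `is_valid_utf8` is hypothetical. A lone 0xF6 byte is not a valid UTF-8 sequence, while the pair 0xC3 0xB6 is:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\xc3\xb6"))  # the two-byte form of "ö"
print(is_valid_utf8(b"\xf6"))      # the bare Latin-1 byte
```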

>I'm using kernel 2.6.9 + Chris patch

I am using SUSE's KOTD 20041202 (2.6.8 + 2.6.9-rc2)

>I know :)). By the way, and this is offtopic, have you checked uim? I
>was testing it the other day with good results, and like it a lot as
>a japanese (or another script, although i only use this japanese) input
>method. I've used it with anthy, just have to check it with skk.

Have not seen it. What is it? Some sort of xterm?


Jan Engelhardt
--
ENOSPC

David Gómez

Dec 11, 2004, 5:10:09 PM

Hi Jan ;),

On Dec 11 at 10:39:55, Jan Engelhardt wrote:
> Yes it does generate 0xC3B6 (otherwise it would show up as garbage, because it
> would not be utf8-compliant if it only output 0xF6)
>
> >I'm using kernel 2.6.9 + Chris patch
>
> I am using SUSE's KOTD 20041202 (2.6.8 + 2.6.9-rc2)

Maybe the patch or a fix has already been included in rc2/rc3, or in
SUSE's version :??

> >method. I've used it with anthy, just have to check it with skk.
>
> Have not seen it. What is it? Some sort of xterm?

Just an input system. To be able to write Japanese all over the place ;))

regards,

--
David Gómez Jabber ID: dav...@jabber.org

Gene Heskett

Dec 11, 2004, 5:30:14 PM

On Saturday 11 December 2004 16:39, Jan Engelhardt wrote:
>>Indeed, it is weird. Are you sure your keyboard is generating a UTF-8
>>encoded "ö"? Just check it with echo:
>>
>>$ echo -n ö | od -t x1
>>
>>0000000 c3 b6
>>0000002
>
>Yes it does generate 0xC3B6 (otherwise it would show up as garbage,
> because it would not be utf8-compliant if it only output 0xF6)

Which is exactly (0xF6) what I'm getting. Kernel version
2.6.10-rc2-mm3-V0.7.32-18

As an American, I've often wondered how to go about getting those
accented characters out of a standard American keyboard. I used to be
able to get all those accented characters and other stuff out of my Amiga's
keyboard, things like the Beta sign and so on. No can do now, and I
miss it.

[...]

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.30% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com attorneys please note, additions to this message
by Gene Heskett are:
Copyright 2004 by Maurice Eugene Heskett, all rights reserved.

Jan Engelhardt

Dec 11, 2004, 7:10:10 PM

>I am a bit confused. Could you please comment on the following, as a
>common test steps?

I do it a bit faster (i.e. without "od"): if I see something after a Compose
operation, it must have been UTF-8. If not (e.g. because only 8 bits were
output), the current line gets screwed up a little. That is my test strategy; it does not
need `od` to confirm.

>I am not sure how you wrote the above characters. According to UTF-8,
>characters with codepoints above 0x7F require at least two bytes in order to be
>valid. When you compose "ö" (you press something like ";", then "o") in
>the console?

ö is a "native key" on my keyboard, i.e. i do not need to play with compose to
generate ö.

>For simplicity, let's assume you do something like
% loadkeys --unicode
compose '/' 'e' to U+00F6 # ('ö')
^D
%

I did that (a shortened form of yours), and saw with `dumpkeys` that there was
now only one compose table entry, so I think `loadkeys --unicode` overwrites
the table? Rightly so.
Still, even though there are now no compose table entries other than
that one, I can still generate ö. <compose><"><o> gives two 7-bit
characters (rightly so at this point).

>Good. I hope more people raise their hands for this.

...want kanji-on-console, but I guess that will not come true with VGA, which
only supports 256 (512) chars. OTOH, [free]BSD's mouse support uses a graphical
mouse pointer rather than a "block" one like gpm does, and as I think of it,
such a graphical mouse (most Norton apps for DOS also had such) needs some VGA
magic or so to set single "pixels"/bits. If single bits within the 8x{16,etc.}
char cell can be set, we could have kanji.
Can anyone elaborate on this graphical mouse stuff?

Jan Engelhardt
--
ENOSPC

David Gómez

Dec 11, 2004, 7:50:09 PM

Hi Jan ;),

> >I am not sure how you wrote the above characters. According to UTF-8,
> >characters with codepoints above 0x7F require at least two bytes in order to be
> >valid. When you compose "ö" (you press something like ";", then "o") in
> >the console?
>
> ö is a "native key" on my keyboard, i.e. i do not need to play with compose to
> generate ö.

Aaahh ;), you should have said that before. The whole problem with the
kernel is with the compose tables. If you have a native key for "ö" on
your keyboard you will not have problems. I can type, for example, an 'n
with tilde' on my keyboard because it too is a native key, but for
accented characters, UTF-8 output requires applying the patch :-/

regards,

--
David Gómez Jabber ID: dav...@jabber.org

Simos Xenitellis

Dec 12, 2004, 9:10:09 AM

Jan Engelhardt wrote:
> >> The current UTF-8 keyboard input (for the console) of the Linux kernel
> >> does not support "composing" or writing characters with accents.
>
> That's weird, because "ö" (LATIN O WITH DIAERESIS) -- which clearly lies
> outside the 7-bit range, is working on my system without myself poking the
> kernel. Both hitting the key or using compose mode. This also applies to
> A-with-DIAERESIS, U-with-DIAERESIS, sharp german S, but does not for anything
> else, e.g. compose-'-e to generate E with accent aigu.

I am a bit confused. Could you please comment on the following, as a
common set of test steps?

I am not sure how you wrote the above characters. According to UTF-8,
characters with codepoints above 0x7F require at least two bytes in order to be
valid. Do you compose "ö" (pressing something like ";", then "o") in
the console?
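
The byte-count rule can be sketched in Python. This hand-written encoder is only an illustration of the standard UTF-8 bit layout (the function name `utf8_encode` is made up here; real code would simply call `str.encode`). Codepoints up to 0x7F take one byte, up to 0x7FF two, up to 0xFFFF three, and the rest four:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode codepoint as UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x7F, 0x80, 0xF6, 0x3042):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```

U+3042 (a hiragana letter) is included only as an example of the three-byte range mentioned earlier in the thread.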

For simplicity, let's assume you do something like
% loadkeys --unicode
keycode 53 = 0x0d2f
compose '/' 'q' to U+00F6
compose '/' 'w' to U+00F7
compose '/' 'e' to U+00F8
compose '/' 'r' to U+00F9
compose '/' 't' to U+0100
compose '/' 'y' to U+0101
keycode 2 = U+00F6
keycode 3 = U+00F7
keycode 4 = U+00F8
keycode 5 = U+00F9
keycode 6 = U+0100
keycode 7 = U+0101
^D
%

Dead key (due to "0d") is the character "/" (0x2f).
Keycodes 2-7 are keys for numbers 1-6.
To test, I type
% cat > test.txt
<we try out all key compositions to generate U+00F6-U+0101>
^D

When we try keys 1-6, we get
% od -x test.txt
0000000 b6c3 b7c3 b8c3 b9c3 80c4 81c4 000a
0000015
%
which is correct.

When we try using the dead key "/" and q-y, we get
% od -x test.txt
0000000 f7f6 f9f8 0100 000a
0000007
%

To get the keyboard back into a sane mode, run "loadkeys --unicode -d".

From this we see that there is no conversion to UTF-8 whatsoever.

In the second case, the kernel cannot return the full character when it
is in Unicode mode.
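
The two dumps above can be reproduced with a small Python model. This is only a sketch under an assumption inferred from the second dump: that the compose path keeps just the low 8 bits of each codepoint. The thread does not confirm the kernel internals, and the helper name `od_x` is hypothetical:

```python
def od_x(data: bytes) -> str:
    """Format bytes as little-endian 16-bit hex words, like `od -x` on x86."""
    if len(data) % 2:
        data += b"\x00"  # an odd trailing byte is padded with zero
    words = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    return " ".join(f"{w:04x}" for w in words)


code_points = [0x00F6, 0x00F7, 0x00F8, 0x00F9, 0x0100, 0x0101]

# Path 1: keys bound directly to U+xxxx; each codepoint is UTF-8 encoded.
direct = "".join(chr(cp) for cp in code_points).encode("utf-8") + b"\n"

# Path 2 (assumption): the compose result fits in a single byte,
# so only the low 8 bits of each codepoint survive.
composed = bytes(cp & 0xFF for cp in code_points) + b"\n"

print(od_x(direct))    # b6c3 b7c3 b8c3 b9c3 80c4 81c4 000a
print(od_x(composed))  # f7f6 f9f8 0100 000a
```

Under that assumption U+0100 and U+0101 collapse to the bytes 00 and 01, which matches the `0100` word in the second dump.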

> >Yes, I recently found that out when trying to switch my whole system to
> >UTF-8. But the patch from Chris that you mention below works very well
> >for me (and, I guess, for anybody who needs to type composed characters for
> >languages covered by the latin1 encoding).
> >
> >> Is there an interest for re-submission of mentioned patches for
> >> inclusion in the kernel (yeah, provided coding style is "normalised")?
> >
> >At least, I am _really_ interested :)
>

> So am I. I have to use xterm for anything fancy now...
> (especially for the even-more fancy stuff that begins at three-byte UTF8
> sequences, such as Japanese :-)

Good. I hope more people raise their hands for this.

Simos

[I am sending this again. It did not make it to the kernel mailing list in the first^Wsecond post for some reason..]

Marc A. Lehmann

Dec 12, 2004, 10:50:25 AM

On Sun, Dec 12, 2004 at 01:05:49AM +0100, Jan Engelhardt <jen...@linux01.gwdg.de> wrote:
> Can anyone elaborate on this graphical mouse stuff?

What Norton does is simply use a few characters that happen to look like a
mouse cursor on top of characters (or, more correctly, that Norton forces to
look that way). You can do that for a single object (like the mouse cursor),
and a few more, but of course you can display far fewer characters that way
than with a standard method, as it eats 4 characters/object.
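
The idea of spending 4 characters per object can be sketched as follows. This is a hypothetical Python illustration, not Norton's actual code: it assumes 8x16 character cells and a cursor at most 8 pixels wide, and the function name `overlay_cursor` is invented. The cursor is OR-ed onto copies of the four glyphs it overlaps; those copies would then be loaded into reserved font slots and shown in place of the original characters:

```python
CELL_W, CELL_H = 8, 16  # assumed text cell size in pixels


def overlay_cursor(block, cursor, x, y):
    """Overlay `cursor` onto a 2x2 block of character cells.

    block:  [[top_left, top_right], [bottom_left, bottom_right]]; each glyph
            is a list of CELL_H ints, one 8-bit pixel row each (bit 7 = left).
    cursor: list of ints, each one 8-bit pixel row of the cursor shape.
    x, y:   pixel offset of the cursor inside the top-left cell.
    Returns four new glyphs in the same layout; `block` is not modified.
    """
    out = [[list(glyph) for glyph in row] for row in block]
    for i, bits in enumerate(cursor):
        py = y + i
        if py >= 2 * CELL_H:
            break  # cursor row falls below the 2x2 block
        cell_row, glyph_row = divmod(py, CELL_H)
        # Put the 8 cursor bits into a 16-bit row spanning both cells,
        # then shift right by the horizontal pixel offset.
        wide = (bits << CELL_W) >> x
        out[cell_row][0][glyph_row] |= (wide >> CELL_W) & 0xFF
        out[cell_row][1][glyph_row] |= wide & 0xFF
    return out


blank = [0] * CELL_H
glyphs = overlay_cursor([[blank, blank], [blank, blank]], [0xFF, 0xFF], 4, 15)
print([f"{glyphs[r][c][15 if r == 0 else 0]:02x}" for r in (0, 1) for c in (0, 1)])
```

Each on-screen object drawn this way ties up four font slots, which is why the trick does not scale to many objects.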

--
The choice of a |
-----==- _GNU_ |
----==-- _ generation Marc Lehmann +--
---==---(_)__ __ ____ __ p...@goof.com |e|
--==---/ / _ \/ // /\ \/ / http://schmorp.de/ --+
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE |

Simos Xenitellis

Dec 12, 2004, 5:20:09 PM

David Gómez wrote:
> Hi Jan ;),
>
> > >I am not sure how you wrote the above characters. According to UTF-8,
> > >characters with codepoints above 0x7F require at least two bytes in order to be
> > >valid. When you compose "ö" (you press something like ";", then "o") in
> > >the console?
> >
> > ö is a "native key" on my keyboard, i.e. i do not need to play with compose to
> > generate ö.
>
> Aaahh ;), you should have said that before. The whole problem with the
> kernel is with the compose tables. If you have a native key for "ö" on
> your keyboard you will not have problems. I can type, for example, an 'n
> with tilde' on my keyboard because it too is a native key, but for
> accented characters, UTF-8 output requires applying the patch :-/

And that's the whole issue.

As soon as the kernel is in Unicode mode for the console, currently
there is no way to input accented characters through a dead key
(composed).
Some years back, when 8-bit encodings were used, there was no problem;
however, now all distros are broken with regard to this.

I do not know what is the next step to consider adding the patch. Do we
get a kernel maintainer related to console I/O speak up and say "Hmm, I
*might* consider a patch, if I see it and people say they are happy"?

simos

Jan Engelhardt

Dec 12, 2004, 5:30:10 PM

>> Aaahh ;), you should have said that before. The whole problem with the
>> kernel is with the compose tables. If you have a native key for "ö" on
>> your keyboard you will not have problems. I can type, for example, an 'n
>> with tilde' on my keyboard because it too is a native key, but for
>> accented characters, UTF-8 output requires applying the patch :-/
>
>As soon as the kernel is in Unicode mode for the console, currently
>there is no way to input accented characters through a dead key
>(composed).
>Some years back when 8-bit encodings were used there was no problem,
>however now all distros are broken with regards to this.

Take it; AFAIK, the DOS box in Windows XP does not support UTF-8 either.

>I do not know what is the next step to consider adding the patch. Do we
>get a kernel maintainer related to console I/O speak up and say "Hmm, I
>*might* consider a patch, if I see it and people say they are happy"?

The proposed patch is working and that's ok. I am happy ÷)
(first composed smiley hehe <compose><:><-><)> )


Jan Engelhardt
--
ENOSPC

David Gómez

Dec 12, 2004, 6:10:13 PM

Hi Simos ;),

On Dec 12 at 10:08:22, Simos Xenitellis wrote:
> > Aaahh ;), you should have said that before. The whole problem with the
> > kernel is with the compose tables. If you have a native key for "ö" on
> > your keyboard you will not have problems. I can type, for example, an 'n
> > with tilde' on my keyboard because it too is a native key, but for
> > accented characters, UTF-8 output requires applying the patch :-/
>
> And that's the whole issue.
>
> As soon as the kernel is in Unicode mode for the console, currently
> there is no way to input accented characters through a dead key
> (composed).

True.

> Some years back when 8-bit encodings were used there was no problem,
> however now all distros are broken with regards to this.

I guess that some distros use their own patches, as seems to be the case with
SuSE, but it is something that is broken in the Linux console and
should be fixed.

> I do not know what is the next step to consider adding the patch.

Submitting the patch to lkml to discuss its possible
inclusion would be a good start. I don't know who the console maintainer is,
Vojtech Pavlik perhaps?

Regards,

--
David Gómez Jabber ID: dav...@jabber.org

Andries Brouwer

Dec 12, 2004, 7:00:28 PM

On Sun, Dec 12, 2004 at 10:08:22PM +0000, Simos Xenitellis wrote:

> I do not know what is the next step to consider adding the patch. Do we
> get a kernel maintainer related to console I/O speak up and say "Hmm, I
> *might* consider a patch, if I see it and people say they are happy"?

You can send me patches if you want.
If I like them I'll submit them.

Very long ago I used to take care of console stuff.

Andries
a...@cwi.nl
