pine and utf-8 again

Tzafrir Cohen

Jul 8, 2002, 4:22:49 AM
A followup on my problems with pine and UTF-8

Some progress: It seems that pine inserts '^' marks into the message for
some reason. In many cases there is a caret between the two UTF8 bytes of
a char, and therefore the char is unreadable.

So to view UTF-8 text, I have to do something like piping the message
through:

tr -d '^' | iconv -f UTF-8 -t ISO8859-8

Why are all of those carets there?

Tested with: pine 4.33, pine 4.44 (with the Hebrew patch, but this patch
shouldn't change such stuff, AFAIK)
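
For anyone who wants to see what that pipeline does byte by byte, here is a rough Python sketch of the same workaround. It assumes the carets are literal '^' bytes sitting between the two bytes of each UTF-8 character, which is only what the symptom described above suggests. The function name is made up for illustration, and, just like `tr -d`, it would also eat any genuine carets in the text.

```python
def strip_carets_and_convert(raw: bytes) -> bytes:
    """Drop every '^' byte, then transcode UTF-8 to ISO-8859-8."""
    cleaned = raw.replace(b"^", b"")
    text = cleaned.decode("utf-8")
    return text.encode("iso8859_8")


# "shalom" in Hebrew, with a caret forced between the two bytes of each letter
word = "\u05e9\u05dc\u05d5\u05dd"
utf8 = word.encode("utf-8")
damaged = b"".join(utf8[i:i + 1] + b"^" + utf8[i + 1:i + 2]
                   for i in range(0, len(utf8), 2))

repaired = strip_carets_and_convert(damaged)
print(repaired.hex())
```

This only repairs the bytes; whether the result displays correctly still depends on the terminal using an ISO-8859-8 font.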

--
Tzafrir Cohen
mailto:tza...@technion.ac.il
http://www.technion.ac.il/~tzafrir

Andreas Prilop

Jul 8, 2002, 8:57:12 AM
On Mon, 8 Jul 2002, Tzafrir Cohen wrote:

> Some progress: It seems that pine inserts '^' marks into the message for
> some reason. In many cases there is a caret between the two UTF8 bytes of
> a char, and therefore the char is unreadable.
>

> Why are all of those carets there?

Yeah - what did I say? :-(

Subject: Re: Bug report? Re: pc-pine and international charactersets
Date: Mon, 1 Jul 2002 16:10:17 +0200
Message-ID: <Pine.GSO.4.10.10207011550270.722-100000@s5b004>

--
http://www.unics.uni-hannover.de/nhtcapri/plonk.txt
E-mail from .com addresses is automatically deleted.

Tzafrir Cohen

Jul 8, 2002, 9:55:48 AM
On Mon, 8 Jul 2002, Andreas Prilop wrote:

> On Mon, 8 Jul 2002, Tzafrir Cohen wrote:
>
> > Some progress: It seems that pine inserts '^' marks into the message for
> > some reason. In many cases there is a caret between the two UTF8 bytes of
> > a char, and therefore the char is unreadable.
> >
> > Why are all of those carets there?
>
> Yeah - what did I say? :-(
>
> Subject: Re: Bug report? Re: pc-pine and international charactersets
> Date: Mon, 1 Jul 2002 16:10:17 +0200
> Message-ID: <Pine.GSO.4.10.10207011550270.722-100000@s5b004>

So I'll rephrase the question: Is there anybody here working with pine who
reads UTF-8 messages? Preferably ones that contain chars outside the
0-255 range of UCS.

Andreas Prilop

Jul 8, 2002, 10:21:51 AM
On Mon, 8 Jul 2002, Tzafrir Cohen wrote:

> So I'll rephrase the question: Is there anybody here working with pine who
> reads UTF-8 messages? Preferably ones that contain chars outside the
> 0-255 range of UCS.

Not me.
But why use UTF-8 in the first place? I would like to know whether it is
possible to exchange messages in ISO-8859-8 or Windows-1255 with PC-Pine.
Have you ever tried PC-Pine? Can you use right-to-left text in PC-Pine
under Hebrew MS Windows?

--
http://www.unics.uni-hannover.de/nhtcapri/hebrew.html

Tzafrir Cohen

Jul 8, 2002, 10:42:38 AM
to Andreas Prilop
On Mon, 8 Jul 2002, Andreas Prilop wrote:

> On Mon, 8 Jul 2002, Tzafrir Cohen wrote:
>
> > So I'll rephrase the question: Is there anybody here working with pine who
> > reads UTF-8 messages? Preferably ones that contain chars outside the
> > 0-255 range of UCS.
>
> Not me.
> But why use UTF-8 in the first place?

Because then I avoid all the charset bullshit in the first place.

What if I want to use both {Hebrew|Russian|Arabic|Whatever} and accented
Latin characters?

> I would like to know whether it is
> possible to exchange messages in ISO-8859-8 or Windows-1255 with PC-Pine.
> Have you ever tried PC-Pine? Can you use right-to-left text in PC-Pine
> under Hebrew MS Windows?

You should always use RTL text in email. (The Hebrew patch for Unix pine
is obsolete, in that sense. Contact me privately for an updated copy, which
I have no time to test currently.)

windows-1255 is basically a superset of ISO-8859-8. However, in email you
should use "ISO-8859-8-i" for (implicit) logical Hebrew. ISO-8859-8-i and
windows-1255 are practically the same.
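
The "basically a superset" claim can be checked roughly with Python's built-in codecs. This is a minimal sketch, assuming Python's `iso8859_8` and `cp1255` codecs are faithful stand-ins for the two MIME charsets. The "-i" suffix says nothing about byte values, only about text ordering, so no codec can show that difference.

```python
letters = "\u05e9\u05dc\u05d5\u05dd"   # shin, lamed, vav, final mem
pointed = "\u05e9\u05b8"               # shin followed by the vowel point qamats

# Plain Hebrew letters get the same bytes in both charsets
same_bytes = letters.encode("iso8859_8") == letters.encode("cp1255")

# Vowel points exist only in windows-1255
try:
    pointed.encode("iso8859_8")
    iso_has_points = True
except UnicodeEncodeError:
    iso_has_points = False

cp1255_has_points = len(pointed.encode("cp1255")) == 2

print(same_bytes, iso_has_points, cp1255_has_points)
```

So unpointed text is byte-for-byte the same under either label, while pointed text needs windows-1255.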

If your copy of Windows can display Hebrew scripts, then you should have
no problems viewing Hebrew messages. I don't have experience with
non-Hebrew Windows and PC-pine.

But if you decide to follow-up on this, please change the subject.

Tzafrir Cohen

Jul 8, 2002, 10:58:22 AM
On Mon, 8 Jul 2002, Andreas Prilop wrote:

> On Mon, 8 Jul 2002, Tzafrir Cohen wrote:
>
> > Some progress: It seems that pine inserts '^' marks into the message for
> > some reason. In many cases there is a caret between the two UTF8 bytes of
> > a char, and therefore the char is unreadable.
> >

> > Why are all of those carets there?


>
> Yeah - what did I say? :-(
>
> Subject: Re: Bug report? Re: pc-pine and international charactersets
> Date: Mon, 1 Jul 2002 16:10:17 +0200
> Message-ID: <Pine.GSO.4.10.10207011550270.722-100000@s5b004>

Well, OK. Setting "Show control character as is" allowed me to view the
message properly. But I'm not sure that this is a reasonable solution.
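
One plausible reading of why that setting helps (an assumption, not something confirmed from pine's source): a viewer that thinks in ISO-8859-x terms treats bytes 0x80-0x9F as C1 control characters, and the second byte of many UTF-8-encoded Hebrew letters falls in exactly that range. A small sketch, with made-up function names:

```python
def is_c1_control(byte: int) -> bool:
    """True for bytes 0x80-0x9F, the C1 control range of the ISO-8859 family."""
    return 0x80 <= byte <= 0x9F


def flagged_letters(text: str) -> list:
    """Characters whose UTF-8 encoding contains at least one C1-range byte."""
    return [ch for ch in text
            if any(is_c1_control(b) for b in ch.encode("utf-8"))]


alphabet = "".join(chr(cp) for cp in range(0x05D0, 0x05EB))  # alef .. tav
hits = flagged_letters(alphabet)
print(len(hits), "of", len(alphabet), "Hebrew letters contain a C1-range byte")
```

That would fit the "in many cases, but not all" pattern of carets: only some letters are affected, the rest encode with a second byte above 0x9F.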

My pet example for a terminal-mode program with good support for
character sets is lynx. It is well aware of the character set of the
terminal it is in and of the character set of the page it displays, and tries
to make the most of what it has (including some conversions).

Also note that many messages have incorrect charset headers.

So I figure that I have a feature request:

* A number of settings:

- Display character set
- Default message character set

* On-the-fly conversion of charsets.

* A run-time option to override the current character set


If this is not possible, then:

Make pine aware of the current charset, and make its chars non-control
chars

Andreas Prilop

Jul 8, 2002, 11:53:02 AM
On Mon, 8 Jul 2002, Tzafrir Cohen wrote:

> My pet example for a terminal-mode program with good support for
> character sets is lynx. It is well aware of the character set of the
> terminal it is in and of the character set of the page it displays, and tries
> to make the most of what it has (including some conversions).
>

> So I figure that I have a feature request:

You might want to read
<http://groups.google.com/groups?th=fb20df01f88d0f2a>
and then pick up a different e-mail program :-(

Tzafrir Cohen

Jul 8, 2002, 2:56:26 PM
On Mon, 8 Jul 2002, Andreas Prilop wrote:

> On Mon, 8 Jul 2002, Tzafrir Cohen wrote:
>
> > My pet example for a terminal-mode program with good support for
> > character sets is lynx. It is well aware of the character set of the
> > terminal it is in and of the character set of the page it displays, and tries
> > to make the most of what it has (including some conversions).
> >
> > So I figure that I have a feature request:
>
> You might want to read
> <http://groups.google.com/groups?th=fb20df01f88d0f2a>
> and then pick up a different e-mail program :-(
>

So I'll raise this issue once again: Is there anything in the direction of
better unicode support?

Will pine be able to make use of a unicode terminal (such is becoming more
and more common)?

Another question: is there any way to make a display filter script that
accepts the charset as a parameter? (And maybe the current terminal's
charset as a second parameter). This could be much easier to implement in
pine.
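
To make the idea concrete, here is a minimal sketch of what such a filter could look like if pine did hand over both charsets. How pine would pass those arguments is exactly the open question, so the argument order and the names `convert` and `main` are assumptions for illustration only.

```python
import io
import sys


def convert(data: bytes, source: str, target: str) -> bytes:
    """Transcode data, replacing anything the target charset cannot represent."""
    text = data.decode(source, errors="replace")
    return text.encode(target, errors="replace")


def main(argv, stdin, stdout) -> int:
    """Filter entry point: argv is [message_charset, terminal_charset]."""
    if len(argv) != 2:
        sys.stderr.write("usage: charset-filter MESSAGE_CHARSET TERMINAL_CHARSET\n")
        return 2
    stdout.write(convert(stdin.read(), argv[0], argv[1]))
    return 0


# Demonstration with in-memory streams standing in for stdin and stdout.
# A standalone script would instead end with:
#   sys.exit(main(sys.argv[1:], sys.stdin.buffer, sys.stdout.buffer))
out = io.BytesIO()
message = io.BytesIO("\u05e9\u05dc\u05d5\u05dd".encode("utf-8"))
status = main(["utf-8", "iso8859_8"], message, out)
```

A single script like this would replace a whole table of per-charset filters, which is why having the charset as a parameter looks attractive.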

Note that with Hebrew, 'ISO-8859-8' is used in font names, etc. However, when
used in the context of a MIME charset, it means "visual Hebrew", which is
considered deprecated. "ISO-8859-8-i" ("-i" at the end) is "logical
Hebrew", which is what people use (when they don't use "windows-1255").

Mark Crispin

Jul 8, 2002, 3:40:29 PM
On Mon, 8 Jul 2002, Tzafrir Cohen wrote:
> So I'll raise this issue once again: Is there anything in the direction of
> better unicode support?

Yes

> Will pine be able to make use of a unicode terminal (such is becoming more
> and more common)?

Yes, this most certainly is intended. Unfortunately "more common" does
not yet mean "ubiquitous". Do you know of a freely-available
Windows-based Unicode terminal emulator with SSH support and *correct*
ANSI terminal emulation (both of which exclude Microsoft's telnet)?

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.

Will Yardley

Jul 8, 2002, 4:37:27 PM
In article <Pine.LNX.4.50.020708...@shiva0.cac.washington.edu>,

Mark Crispin wrote:
> On Mon, 8 Jul 2002, Tzafrir Cohen wrote:

>> So I'll raise this issue once again: Is there anything in the
>> direction of better unicode support?

> Yes

>> Will pine be able to make use of a unicode terminal (such is becoming
>> more and more common)?

> Yes, this most certainly is intended. Unfortunately "more common"
> does not yet mean "ubiquitous". Do you know of a freely-available
> Windows-based Unicode terminal emulator with SSH support and *correct*
> ANSI terminal emulation (both of which exclude Microsoft's telnet)?

i believe that putty has unicode support:
http://www.chiark.greenend.org.uk/~sgtatham/putty/

i haven't used kermit, but there's a 21 day free trial version; the
software itself isn't free though:

http://www.columbia.edu/kermit/k95_20_ann.html

--
No copies, please.
To reply privately, simply reply; don't remove anything.

Alan J. Flavell

Jul 10, 2002, 6:46:10 AM
On Jul 8, Mark Crispin inscribed on the eternal scroll:

> Yes, this most certainly is intended. Unfortunately "more common" does
> not yet mean "ubiquitous". Do you know of a freely-available
> Windows-based Unicode terminal emulator with SSH support and *correct*
> ANSI terminal emulation (both of which exclude Microsoft's telnet)?

For my everyday terminal requirements from Windows, I get to choose
between the ttssh/teraterm combination, and putty.

Whether they would satisfy your criterion of "correct" ANSI emulation,
I don't know.

ttssh offers me the advantage of very convenient printing from the
remote application to the so-called "ANSI printer", which brings up a
normal Windows print dialog, facilitating printing to any
Windows-supported printer. This works well from Lynx ('p' command)
and PINE ('%' command) for example. I saw no suggestion of any
Unicode support, though.

putty offers the benefit of unicode terminal support; support for
printing was a later addition, and it's not yet working in the
transparent way that one gets in ttssh. The author recognised this as
a reasonable user requirement, but now appears to say on his wish list
that

* Improved ANSI printer control support

is unlikely to be implemented by himself, but he's open to seeing
it implemented by someone else.

My other problem with using putty with a wide character repertoire
would be finding a _monospace_ font with a wide enough repertoire.
None of the available monospace Windows fonts that I could find as
part of the products (Windows, Office etc.) comes anywhere near the
repertoire of Arial Unicode MS.

I managed to find

http://bibliofile.mc.duke.edu/gww/fonts/Monospace/index.html

which has a good character repertoire, but is unfortunately missing
the key piece of data which tells Windows that it's a monospaced font,
and so putty misses it off its font selection menu. I can force the
font name into the registry entry, and then putty (like other windows
apps) is willing to _use_ it as a monospaced font; but any attempt to
_reconfigure_ via the application's own menu results in this setting
being lost again. (The same goes for trying to use it as a monospace
font in IE, for example).

I'm no expert on font formats, so that's where I got to. Hope that's
a useful jumping-off point for somebody.

Andreas Prilop

Jul 10, 2002, 9:43:34 AM
On Wed, 10 Jul 2002, Alan J. Flavell wrote:

> http://bibliofile.mc.duke.edu/gww/fonts/Monospace/index.html
>
> which has a good character repertoire, but is unfortunately missing
> the key piece of data which tells Windows that it's a monospaced font,
> and so putty misses it off its font selection menu.

I thought I explained in
<http://groups.google.com/groups?selm=Pine.GSO.4.10.10206131725230.9037-100000@s5b004>

Alan J. Flavell

Jul 10, 2002, 2:37:35 PM
On Wed, 10 Jul 2002, Andreas Prilop wrote:

Indeed you did, and I considered mentioning that here, but then it
struck me that anyone who was in a position to do anything practical
with the font would already know what was needed.

I _had_ followed some of your links (they're still purple in my
browser), but ran out of time/effort to do anything myself to fix the
font.

| If his font-creating software doesn't allow editing of the 'OS/2'
| table, he might use third-party tools:
| <http://developer.apple.com/fonts/Tools/>
| <http://www.truetex.com/ttf_edit.htm>

thanks.

Tzafrir Cohen

Jul 11, 2002, 3:59:12 AM
On Mon, 8 Jul 2002, Mark Crispin wrote:

> On Mon, 8 Jul 2002, Tzafrir Cohen wrote:
> > So I'll raise this issue once again: Is there anything in the direction of
> > better unicode support?
>
> Yes
>
> > Will pine be able to make use of a unicode terminal (such is becoming more
> > and more common)?
>

> Yes, this most certainly is intended. Unfortunately "more common" does
> not yet mean "ubiquitous". Do you know of a freely-available
> Windows-based Unicode terminal emulator with SSH support and *correct*
> ANSI terminal emulation (both of which exclude Microsoft's telnet)?

1. why not make pine the first?

2. try putty (google for putty). Try it on a W2K or XP system. I'm not
sure, though. I also haven't tried ssh's terminal (free for use in
universities)

3. What about pine in a linux/unix xterm? Or am I supposed to use mutt in
a unix-only environment?

--
Tzafrir Cohen /"\
mailto:tza...@technion.ac.il \ / ASCII Ribbon Campaign
Taub 229, 972-4-829-3942, X Against HTML Mail
http://www.technion.ac.il/~tzafrir / \

Trevor Jenkins

Jul 11, 2002, 5:45:44 AM
On Thu, 11 Jul 2002 10:59:12 +0300, Tzafrir Cohen <tza...@technion.ac.il> wrote:
> On Mon, 8 Jul 2002, Mark Crispin wrote:
>
> > On Mon, 8 Jul 2002, Tzafrir Cohen wrote:

> > > Will pine be able to make use of a unicode terminal (such is becoming more
> > > and more common)?

I've come late to this discussion. Now I find that I too need pine to
support ISO 10646 (Unicode) at least in its UTF-8 guise. Its lack of
support for this goes against the claim that pine implements the relevant
standards.

> 3. What about pine in a linux/unix xterm? Or am I supposed to use mutt in
> a unix-only environment?

I thought you'd thrown me a life-line there. But at least with my setup
(RH6.2 with Gnome desktop and the GNOME Terminal acting as an xterm) even
mutt doesn't display Hebrew.

Regards, Trevor

British Sign Language is not inarticulate handwaving; it's a living language.
Support the campaign for formal recognition by the British government now!
Details at http://www.fdp.org.uk/

--

<>< Re: deemed!

Andreas Prilop

Jul 11, 2002, 7:31:02 AM
On Wed, 10 Jul 2002, Alan J. Flavell wrote:

> Indeed you did, and I considered mentioning that here, but then it
> struck me that anyone who was in a position to do anything practical
> with the font would already know what was needed.

Not "anyone" - the author of the fonts should fix the fonts.
I think it would be a silly idea to let zillions of people
download the fonts and require everyone of them to edit
the fonts afterwards.

Tzafrir Cohen

Jul 11, 2002, 8:53:57 AM
On 11 Jul 2002, Trevor Jenkins wrote:

> On Thu, 11 Jul 2002 10:59:12 +0300, Tzafrir Cohen <tza...@technion.ac.il> wrote:
> > On Mon, 8 Jul 2002, Mark Crispin wrote:
> >
> > > On Mon, 8 Jul 2002, Tzafrir Cohen wrote:
>
> > > > Will pine be able to make use of a unicode terminal (such is becoming
> > > > more and more common)?
>
> I've come late to this discussion. Now I find that I too need pine to
> support ISO 10646 (Unicode) at least in its UTF-8 guise. Its lack of
> support for this goes against the claim that pine implements the relevant
> standards.
>
> > 3. What about pine in a linux/unix xterm? Or am I supposed to use mutt in
> > a unix-only environment?
>
> I thought you'd thrown me a life-line there. But at least with my setup
> (RH6.2 with Gnome desktop and the GNOME Terminal acting as an xterm) even
> mutt doesn't display Hebrew.

What Hebrew? ISO-8859-8[-i]/windows-1255: use any ISO-8859-8 font (e.g:
'xterm -fn heb8x13')

<off-topic>
Terminals with unicode support:

* The upcoming gnome2 (although I really didn't like their gnome-terminal)
* kde2/3
* xterm, of a recent enough version, with the switch -u8
* mlterm (http://mlterm.sf.net), with bidi support

Having glibc 2.2 can also help.

I don't know if any of the above comes with RH6.2
</off-topic>

Trevor Jenkins

Jul 11, 2002, 1:01:48 PM
On Thu, 11 Jul 2002 15:53:57 +0300, Tzafrir Cohen <tza...@technion.ac.il> wrote:
> On 11 Jul 2002, Trevor Jenkins wrote:
>
> What Hebrew? ISO-8859-8[-i]/windows-1255: use any ISO-8859-8 font (e.g:
> 'xterm -fn heb8x13')

Which Hebrew depends upon the choices made by my correspondents. They
seem to be using anything and everything.

The -fn works ... but it stops me writing in Swedish; something I do fairly
often and more than Hebrew. I really can't deal with restarting pine depending
upon the language I want to read/write. I'm looking for a single approach that
lets me deal with all languages together, which is what I'd understood the
purpose of ISO 10646/Unicode to be.

Mark Crispin

Jul 11, 2002, 3:14:17 PM
On 11 Jul 2002, Trevor Jenkins wrote:
> I'm looking for a single approach that
> lets me deal with all languages together, which is what I'd understood the
> purpose of ISO 10646/Unicode to be.

That is exactly what it is, and the MIME charset used is UTF-8.

Pine 4.50 will have some additional internationalization capabilities,
mostly of the form of recognizing equivalent and similar character sets
(e.g. the Windows vs. ISO) and doing some conversion between the two.

It will not be fully Unicode-capable; we want to get 4.50 out this year,
and fancy threading is the big feature in this version.

It is not, however, necessary to keep on telling us that you want Unicode.
We know. The delay isn't because we don't care either. We do care. We
have many tasks on our plate and limited funding to do it.

Alan J. Flavell

Jul 11, 2002, 3:48:25 PM
On Jul 10, Thorsten Glaser inscribed on the eternal scroll:

> In http://yg.mine.nu/~tg/vgaoem.fon (dyndns address!) you may find a 10k
> .FON (not TrueType) font (14- and 16-pixel) which CAN be selected by
> PuTTY (note you MUST select "Use font in OEM mode only" for proper
> output). I use this as generic terminal font in Windows, DOS, X, etc.

Sorry, the issue was to find a monospaced font with a good character
repertoire, for utf-8 use.

You seem to be answering a quite different question - but I have no
shortage of selectable fixed-space fonts to cover a Latin-1 repertoire
or similar.


Trevor Jenkins

Jul 11, 2002, 4:25:40 PM
On Thu, 11 Jul 2002 12:14:17 -0700, Mark Crispin <m...@CAC.Washington.EDU> wrote:
> On 11 Jul 2002, Trevor Jenkins wrote:
> > I'm looking for a single approach that
> > lets me deal with all languages together, which is what I'd understood the
> > purpose of ISO 10646/Unicode to be.
>
> That is exactly what it is, and the MIME charset used is UTF-8.
>
> Pine 4.50 will have some additional internationalization capabilities,
> mostly of the form of recognizing equivalent and similar character sets
> (e.g. the Windows vs. ISO) and doing some conversion between the two.
>
> It will not be fully Unicode-capable; we want to get 4.50 out this year,
> and fancy threading is the big feature in this version.

I'm no fan of threading. ISO 10646 compliance would be ahead of that in
my book.

> It is not, however, necessary to keep on telling us that you want Unicode.
> We know. The delay isn't because we don't care either. We do care. We
> have many tasks on our plate and limited funding to do it.

;-) But, awhile ago, one of the pine team said, maybe off group, that if
I ever discovered some area where pine was non-compliant with the various
standards and RFCs then I should say so. So I've said so; pine isn't ISO
10646 compliant. Okay that was one last time.

Mark Crispin

Jul 11, 2002, 7:13:31 PM
On 11 Jul 2002, Trevor Jenkins wrote:
> I'm no fan of threading. ISO 10646 compliance would be ahead of that in
> my book.

You are, I fear, in a minority.

> ;-) But, awhile ago, one of the pine team said, maybe off group, that if
> I ever discovered some area where pine was non-compliant with the various
> standards and RFCs then I should say so. So I've said so; pine isn't ISO
> 10646 compliant. Okay that was one last time.

There is a very significant difference between "implement" and
"compliant". "Not implement a particular standard" does not imply "not
compliant with the various standards."

If you think for a minute, you would understand why; if matters were
otherwise, then there would be no such thing as software that is
"compliant with the various standards" because someone could always find a
standard that that software does not implement.

Unicode (or, if you prefer, ISO 10646) is not yet widely implemented. Its
implementation base is growing. Pine uses Unicode internally.
Nevertheless, the vast majority of Internet mail is not in UTF-8 (the
encoding of Unicode designated for use in Internet mail) but rather in
legacy character sets. Furthermore, the majority of Internet mail users
are still not able to read UTF-8 messages (although that will soon not be
true).

PS: It is more accurate to say "Unicode", instead of ISO 10646; because
what the Internet has selected is Unicode. Also, it's a pretty safe bet
that a lot of Internet email implementations are not going to handle
characters outside the BMP.

Tzafrir Cohen

Jul 12, 2002, 12:18:26 AM
On Thu, 11 Jul 2002, Mark Crispin wrote:

> On 11 Jul 2002, Trevor Jenkins wrote:

> If you think for a minute, you would understand why; if matters were
> otherwise, then there would be no such thing as software that is
> "compliant with the various standards" because someone could always find a
> standard that that software does not implement.
>
> Unicode (or, if you prefer, ISO 10646) is not yet widely implemented. Its
> implementation base is growing. Pine uses Unicode internally.
> Nevertheless, the vast majority of Internet mail is not in UTF-8 (the
> encoding of Unicode designated for use in Internet mail) but rather in
> legacy character sets. Furthermore, the majority of Internet mail users
> are still not able to read UTF-8 messages (although that will soon not be
> true).

* Outlook Express (at least as of version 5?)
* Outlook (as of version ??; IIRC Outlook 2000 can handle and send UTF-8
well)
* Mozilla/Netscape 6 mail
* kmail on Linux (the default on a number of distros. The "ISO-10646-1"
charset is the default at least for Hebrew and many other languages.
Mails from kmail users caused me to start this, because I can't ignore
them as I ignore mails from Outlook)

The above are mailers for which I'm certain about the support. I am sure
many others have good support.

Jungshik Shin

Jul 12, 2002, 3:39:14 AM
In <Pine.LNX.4.50.020711...@shiva0.cac.washington.edu>, Mark Crispin wrote:

: PS: It is more accurate to say "Unicode", instead of ISO 10646; because


: what the Internet has selected is Unicode.

I have little idea what you meant by Unicode having been selected by the
Internet. Unicode and ISO 10646 have been aligned with each other
on a regular basis and they're basically the same standard at least in
terms of character repertoire. 'U' in UTF-8 stands for both 'Unicode'
and 'UCS' (Universal Character Set : ISO 10646 term). There are a lot
of overlaps in terms of people working on them. That is, a lot of people
belong to both UTC (Unicode Technical Committee) and ISO/IEC JTC1/SC2/WG2
(responsible for ISO 10646).

: Also, it's a pretty safe bet


: that a lot of Internet email implementations are not going to handle
: characters outside the BMP.

That may be true of many email clients as of today, but it'll
change in the future. However, that doesn't have anything to do with
Unicode vs ISO 10646. Both Unicode 3.2 and ISO 10646-2:2001 have
characters outside BMP (CJK ideograph Ext. B and Ext.C, additional
characters for typesetting mathematics, and so forth) and they're exactly
the same. I'm sure you're well aware of that, but you appeared to imply
otherwise in what you wrote above. (see http://www.i18nguy.com/unicode
for a real life example of characters outside BMP.)
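
For a concrete idea of what "outside the BMP" means at the byte level, here is a small sketch using U+20000, the first code point of CJK Extension B (picked here purely as an example). Such characters need four octets in UTF-8 and a surrogate pair in UTF-16, which is where software built around 16-bit characters tends to trip.

```python
ch = chr(0x20000)  # first code point of CJK Unified Ideographs Extension B

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")

# Split the UTF-16 form into its two 16-bit code units (a surrogate pair)
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

print(len(utf8), utf8.hex())
print([hex(u) for u in units])
```

The encoding rules are the same whether one calls the character set Unicode or ISO 10646.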

Jungshik Shin

Jungshik Shin

Jul 12, 2002, 4:34:33 AM
In <Pine.GSO.4.33_heb2.09.0207111055461.13097-100000@csd>, Tzafrir Cohen wrote:

: 1. why not make pine the first?

My patch to Pine 4.44 in this direction is available at
<http://jshin.net/i18n/pine4.44.iconv.patch>.

I have tested this only under Linux with glibc 2.2.x, but it should also
work under any Unix-like OS with libiconv (a free implementation of iconv
by Bruno Haible, available at http://www.gnu.org/software/libiconv) or any
other OS to which libiconv has been ported. My patch uses a glibc/libiconv
extension of iconv(3) (iconv itself is specified in SUS3/POSIX): the
glibc/libiconv implementation of iconv(3) does transliteration when
'//TRANSLIT' is added at the end of encoding names. The default iconv(3)
under OSes like Solaris 8/9 may not have this extension and won't work
with my patch.

To compile it, you have to use

% ./build EXTRACFLAGS="-DHAVE_ICONV" target

Three configuration options are added. I got the idea for
two of them from Mutt 1.4.x/1.5.x

* assumed-charset : a lot of emails sent by non-standards-compliant
MUAs/web mail programs have _raw_ 8bit characters (i.e. not encoded per
RFC 2047). Setting this to the charset most common among them would help
you read those emails (subject, from, to, etc). For instance, Western
European users would want to set this to ISO-8859-1/Windows-1252. Chinese
users would set this to GB2312. This does NOT work for an _untagged_ (no
MIME charset is specified in the C-T header) message body, yet. For an
untagged message body, you have to define the display filter for US-ASCII
as follows:

_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f ISO-8859-1 -t UTF-8

or

_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8

or

_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f GB2312 -t UTF-8

* charset-aliases : Some MUAs use non-standard MIME charset names. For
instance, MS Outlook Express uses ks_c_5601-1987 for EUC-KR or CP949
(X-Windows-949). You can specify pairs of non-standard and standard MIME
charset names, with the pairs delimited by commas. In each pair, the
non-standard charset name and the standard name should be delimited by a
colon. For instance, I have

ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949

* iconv-aliases : Iconv codeset names are not always the same as the
standard MIME charset names. For instance, 'x-windows-949' in the glibc
implementation of iconv is 'mscp949', so I have the following:

x-windows-949:mscp949,euc-kr:mscp949

Although EUC-KR is understood by the glibc implementation of iconv, I also
have 'euc-kr:mscp949' because some emails in X-Windows-949 are MISLABELLED
as EUC-KR. X-Windows-949 (CP949) is upward compatible with EUC-KR and
there's no harm in treating genuine EUC-KR text as X-Windows-949. The same
thing happens with ISO-8859-1 and Windows-1252, and you can add
'iso-8859-1:windows-1252'. You can get the identical effect by adding it to
the charset-aliases list.
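
To illustrate how the two alias lists chain together, here is a rough Python sketch of the lookup. It is not taken from the patch: the function names, the case-insensitive matching, and the order of the two lookups are all assumptions made for the example.

```python
def parse_aliases(value: str) -> dict:
    """Parse 'from:to,from:to' into a dict keyed by lower-cased names."""
    aliases = {}
    for pair in value.split(","):
        pair = pair.strip()
        if not pair:
            continue
        name, _, target = pair.partition(":")
        aliases[name.strip().lower()] = target.strip().lower()
    return aliases


def iconv_name(mime_charset: str, charset_aliases: dict, iconv_aliases: dict) -> str:
    """Map a charset label found in a message to the name handed to iconv."""
    name = mime_charset.strip().lower()
    name = charset_aliases.get(name, name)  # non-standard label -> standard MIME name
    return iconv_aliases.get(name, name)    # standard MIME name -> iconv codeset name


charset_aliases = parse_aliases("ks_c_5601-1987:x-windows-949,ksc5601:x-windows-949")
iconv_aliases = parse_aliases("x-windows-949:mscp949,euc-kr:mscp949")

print(iconv_name("ks_c_5601-1987", charset_aliases, iconv_aliases))
```

With the example values above, both the Outlook Express label and a plain EUC-KR label end up at the same iconv codeset, which is the effect described for mislabelled mail.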


You also have to set 'character-set' to 'UTF-8' and run Pine in a UTF-8
terminal (putty, Thomas Dickey's xterm-16x, Solaris dtterm under UTF-8
locale, etc).

In addition, you have to define a bunch of display filters because
my patch doesn't use iconv internally to do automatic encoding/MIME
charset conversion for the message body. However, it does automatic
conversion for the message header. I have the following defined
in my pinerc. I haven't checked yet whether '-c' option is
specified in SUS3/POSIX. It may be a glibc/libiconv extension.

display-filters=_CHARSET(EUC-KR)_ /usr/bin/iconv -c -f EUC-KR -t UTF-8,
_CHARSET(ks_c_5601-1987)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
_CHARSET(US-ASCII)_ /usr/bin/iconv -c -f MSCP949 -t UTF-8,
_CHARSET(ISO-8859-1)_ /usr/bin/iconv -c -f Windows-1252 -t UTF-8,
_CHARSET(ISO-8859-15)_ /usr/bin/iconv -c -f ISO8859-15 -t UTF-8,
_CHARSET(ISO-2022-JP)_ /usr/bin/iconv -c -f ISO-2022-JP -t UTF-8,
_CHARSET(GB2312)_ /usr/bin/iconv -c -f GB2312 -t UTF-8,
_CHARSET(BIG5)_ /usr/bin/iconv -c -f BIG5 -t UTF-8,
_CHARSET(Windows-1251)_ /usr/bin/iconv -c -f WINDOWS-1251 -t UTF-8,
_CHARSET(Windows-1252)_ /usr/bin/iconv -c -f WINDOWS-1252 -t UTF-8,
_CHARSET(Windows-1253)_ /usr/bin/iconv -c -f WINDOWS-1253 -t UTF-8,
_CHARSET(ISO-8859-2)_ /usr/bin/iconv -c -f ISO8859-2 -t UTF-8,
_CHARSET(ISO-8859-3)_ /usr/bin/iconv -c -f ISO8859-3 -t UTF-8,
_CHARSET(ISO-8859-4)_ /usr/bin/iconv -c -f ISO8859-4 -t UTF-8,
_CHARSET(ISO-8859-5)_ /usr/bin/iconv -c -f ISO8859-5 -t UTF-8,
_CHARSET(ISO-8859-6)_ /usr/bin/iconv -c -f ISO8859-6 -t UTF-8,
_CHARSET(ISO-8859-7)_ /usr/bin/iconv -c -f ISO8859-7 -t UTF-8,
_CHARSET(ISO-8859-8)_ /usr/bin/iconv -c -f ISO8859-8 -t UTF-8,
_CHARSET(ISO-8859-9)_ /usr/bin/iconv -c -f ISO8859-9 -t UTF-8,
_CHARSET(ISO-8859-10)_ /usr/bin/iconv -c -f ISO8859-10 -t UTF-8,
_CHARSET(ISO-8859-11)_ /usr/bin/iconv -c -f ISO8859-11 -t UTF-8,
_CHARSET(ISO-8859-13)_ /usr/bin/iconv -c -f ISO8859-13 -t UTF-8,
_CHARSET(ISO-8859-14)_ /usr/bin/iconv -c -f ISO8859-14 -t UTF-8,
_CHARSET(ISO-8859-16)_ /usr/bin/iconv -c -f ISO8859-16 -t UTF-8,
_CHARSET(KOI8-R)_ /usr/bin/iconv -c -f KOI8-R -t UTF-8,
_CHARSET(KOI8-U)_ /usr/bin/iconv -c -f KOI8-U -t UTF-8,
_CHARSET(Windows-874)_ /usr/bin/iconv -c -f CP874 -t UTF-8,
_CHARSET(UTF-7)_ /usr/bin/iconv -c -f UTF-7 -t UTF-8
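
Since every entry in that list follows the same pattern, it can be generated rather than typed. The following is only a convenience sketch: the function name is invented, the tab-indented continuation lines are an assumption about pinerc layout, and the `terminal` parameter is there so the same table can target a non-UTF-8 terminal.

```python
ICONV = "/usr/bin/iconv"


def display_filters(charsets: dict, terminal: str = "UTF-8") -> str:
    """Build a display-filters setting from {MIME charset: iconv source name}."""
    entries = [
        f"_CHARSET({mime})_ {ICONV} -c -f {source} -t {terminal}"
        for mime, source in charsets.items()
    ]
    # Continuation lines are indented here; adjust to whatever your pinerc uses
    return "display-filters=" + ",\n\t".join(entries)


sample = display_filters({
    "EUC-KR": "EUC-KR",
    "ISO-8859-1": "Windows-1252",
    "KOI8-R": "KOI8-R",
})
print(sample)
```

Check the generated text against a known-good pinerc before relying on it.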

: 3. What about pine in a linux/unix xterm? Or am I supposed to use mutt in
: a unix-only environment?

My patch works pretty well under Thomas Dickey's xterm-16x. For a
couple of months, I have been using my patched Pine 4.44 under xterm-16x
to send all my outgoing emails in UTF-8 and read incoming emails in
various encodings (UTF-8, ISO-8859-x, Windows-12xx, EUC-KR, ISO-2022-JP,
GB2312, Big5, KOI8-R, etc)


There are a couple of problems with my patch.

One of them is that I haven't done anything to fix the 'one octet ->
one column width' model. In legacy encodings (both single-byte encodings
like ISO-8859-x, Windows-12xx, KOI8-R/U and multibyte encodings like
EUC-JP, Shift_JIS, EUC-KR, GB2312, Big5), this holds true. In UTF-8,
this assumption completely breaks down except for characters in
US-ASCII (U+0020 - U+007E). Some characters take two/three/four octets but
take only one column, while others take three/four octets but take
only two columns. Therefore, in the message display screen, lines
are wrapped prematurely and in the message index screen, headers (subject,
recipient, etc) are truncated.
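
The mismatch is easy to see by counting octets and columns side by side. This is a rough approximation in Python, not what a real terminal or the patch does: it treats combining marks as zero columns and East Asian wide/fullwidth characters as two, using the standard `unicodedata` module. The function name is made up.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Approximate terminal columns: 0 for combining marks, 2 for wide, else 1."""
    columns = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        columns += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return columns


# ASCII word, e-acute, Hebrew shin, a Hangul syllable, a CJK Extension B ideograph
samples = ["pine", "\u00e9", "\u05e9", "\ud55c", chr(0x20000)]
table = [(len(s.encode("utf-8")), display_columns(s)) for s in samples]

for s, (octets, columns) in zip(samples, table):
    print(f"U+{ord(s[0]):04X}... octets={octets} columns={columns}")
```

Only the ASCII sample has equal octet and column counts, so any wrapping or truncation logic that counts octets will cut UTF-8 lines too early.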

The other is that somehow the link to 'email list management
information' at the end of a message with a 'list management information'
header does not work. I guess it's easy to fix, but I haven't gotten
around to looking into it yet.

There may be other problems as well. I'll be glad to hear about them,
although I may not be able to fix them as quickly as I wish to.


BTW, Pine 4.44 with my patch can also be run under a non-UTF-8 terminal.
In that case, you have to set 'character-set' to the encoding of
your terminal (say, EUC-JP) and define your display filters accordingly.


My goal was to make Pine a text-terminal version of MS OE or
Mozilla-mail in terms of I18N support. With my patch, Pine got
closer to that goal, but is still far from it. Some of the features
I want to see include:

- The encoding(MIME charset) for outgoing emails should be
decoupled from the encoding of a terminal under which Pine
is launched.

- It should be possible to change the encoding(MIME charset)
of outgoing messages _at the time of_ composition
(as is possible with MS OE and Mozilla-Mail.)
Although going all the way to UTF-8 is desirable,
the reality is that some of my correspondents cannot
deal with UTF-8 messages. For them, I have to
write in legacy encodings. Currently, I have to
launch another Pine with a separate pinerc to compose
my email in a legacy encoding.

Hope a lot of people find my patch useful,

Jungshik Shin

Jungshik Shin

Jul 12, 2002, 8:27:40 AM
In <Pine.GSO.4.44_heb2.09.02...@techunix.technion.ac.il>, Tzafrir Cohen wrote:

: On Thu, 11 Jul 2002, Mark Crispin wrote:
:> On 11 Jul 2002, Trevor Jenkins wrote:
....

:> legacy character sets. Furthermore, the majority of Internet mail users
:> are still not able to read UTF-8 messages (although that will soon not be
:> true).

I think that's already not true. Users of all the email clients
listed below, plus Hotmail and some other web mail services, can easily
be the majority of Internet email users. (UTF-8 and I18N support in most
web mail services sucks, but depending on the web browser used, viewing
UTF-8 mail is just a click away, although sending standard-compliant
UTF-8 email messages with them is not yet possible.)


: * Outlook Express (at least as of version 5 (?))
: * Outlook (as of version ??; IIRC Outlook 2000 can handle and send UTF-8
: well)
: * Mozilla/Netscape 6 mail
: * kmail on linux (default of a number of distros. The "ISO-10646-1"

Which version of kmail supports UTF-8? A rare complaint about my
UTF-8 emails came from a kmail 1.3.x user.

: charset is the default at least for Hebrew and many other languages.
: Mails from kmail users caused me to start this, because I can't ignore
: them as I ignore mails from Outlook)

And, mutt 1.5.x running under a UTF-8 terminal (xterm 16x, mlterm,
putty, Solaris's dtterm under UTF-8 locale). In a sense, Pine 4.4x
without any patch can also qualify if it's run under a UTF-8 terminal.

Jungshik Shin

Mark Crispin

Jul 12, 2002, 1:09:08 PM
On Fri, 12 Jul 2002, Jungshik Shin wrote:
> : PS: It is more accurate to say "Unicode", instead of ISO 10646; because
> : what the Internet has selected is Unicode.
> I have little idea what you meant by Unicode having been selected by the
> Internet. Unicode and ISO 10646 have been aligned with each other
> on a regular basis and they're basically the same standard at least in
> terms of character repertoire.

The Internet is using the BMP and possibly also planes 1 to 16, not the
full 31-bit (2^31) codepoint space of ISO 10646.

For the most part, this is a difference that makes no difference, since
ISO will have 14 planes to fill first; but it is a difference nonetheless
and one can not blithely say "ISO 10646" when one means "Unicode".

Mark Crispin

Jul 12, 2002, 1:31:36 PM
On Fri, 12 Jul 2002, Jungshik Shin wrote:
> : Also, it's a pretty safe bet
> : that a lot of Internet email implementations are not going to handle
> : characters outside the BMP.
> That may be true of many email clients as of today, but it'll
> change in the future.

When you leave the BMP, you have to lose the assumption that everything is
just UCS-2. This is a very tempting assumption for programmers to make.
Leave aside the matter that a character is not necessarily a glyph on the
screen.

As soon as you leave the BMP, things get ugly. In 16-bit form, you have
surrogates with UTF-16. A character is now 16 bits or 32 bits. You get
no joy if you expand the surrogates, because you don't have 20 bit
characters; you have 20 bits plus a bit. Programmers like to think of
limits as being a power of 2 minus 1, but not here where the limit is
0x10ffff.
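
The surrogate arithmetic behind that remark can be sketched as below. This is a minimal illustration in Python of the standard UTF-16 mapping; the function name to_surrogate_pair is invented for the example and does not come from any mail client.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


for cp in (0x10000, 0x1F600, 0x10FFFF):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> {high:04X} {low:04X}")
```

Subtracting 0x10000 before splitting is what makes the upper limit 0x10FFFF rather than a power of two minus one: the pair carries 20 bits for planes 1 to 16, and the BMP is the extra "bit".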

The non-BMP planes have become places to exile characters that are either
rarely-used or special-use (e.g. typesetting math), with a view to making
the BMP be adequate for almost all communication. You're going to find
that programmers will take those simplifying assumptions, and there is
little that you or I can do about it.

UTF-8 can, of course, represent all 2^31 characters of UCS-4. But UTF-8
is not the best form for internal handling, and a lot of software is
going to use a 16-bit representation internally. At best, that'll be
UTF-16 (hence the BMP + 16 planes) but a lot of code undoubtedly is going
to take the easy way out and assume UCS-2.

Jungshik Shin

Jul 12, 2002, 6:04:42 PM
In <Pine.LNX.4.50.02071...@shiva1.cac.washington.edu>, Mark Crispin wrote:

: On Fri, 12 Jul 2002, Jungshik Shin wrote:
:> : Also, it's a pretty safe bet
:> : that a lot of Internet email implementations are not going to handle
:> : characters outside the BMP.
:> That may be true of many email clients as of today, but it'll
:> change in the future.

: When you leave the BMP, you have to lose the assumption that everything is
: just UCS-2. This is a very tempting assumption for programmers to make.

Well, with a lot of public-domain/commercial libraries written for
I18N support, not all programmers have to write their own I18N
infrastructure. In many cases, they can just take one of them and link
against it.

<well known problems of UTF-16 and surrogate pairs snipped...>

You may turn out to be right, but maybe not. UTF-16 is not so much a
beast as multibyte legacy encodings are. Moreover, if necessary for the
sake of convenient handling in memory, they can use UTF-32 (they don't
have to worry about 20.1 bit stuff) like some C libraries do for their
iconv(3) and wchar-related C API implementations.

For terminal based email programs, virtually all they need is a
terminal that supports characters beyond BMP along with fonts and input
methods. Of course, they should be aware that UTF-8 is not limited to
3 bytes but can be 4 bytes long as well.

Jungshik Shin

Mark Crispin

Jul 12, 2002, 8:24:30 PM

Try 6 bytes if you want to support the full UCS-4 of ISO 10646. Believe
it or not, there are actually people using codepoints up there.
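
For reference, the octet counts under discussion follow from the original 31-bit UTF-8 layout (as in RFC 2279). The following is a small sketch in Python with a hypothetical helper, utf8_length; it only computes lengths and does not encode anything.

```python
def utf8_length(code_point: int) -> int:
    """Octets needed for a code point under the original 31-bit UTF-8 scheme."""
    if code_point < 0 or code_point > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    # Largest code point that fits in 1, 2, 3, 4, 5 and 6 octets respectively.
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    for octets, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return octets


for cp in (0x41, 0x5E9, 0xD55C, 0x10FFFF, 0x7FFFFFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} octet(s)")
```

Everything up to U+10FFFF fits in 4 octets. The 5- and 6-octet forms are only needed for codepoints beyond plane 16.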

Also, many programs actually do want to deal with "characters", and UTF-8
is poor for that purpose because it is of variable width unless one thinks
in terms of certain ASCII characters (e.g. Latin alphabetics and digits)
and "everything else".

UCS-2 is a so-called "attractive nuisance", especially since there are any
number of examples in which UCS-2 *is* good enough.

Jungshik Shin

Jul 12, 2002, 9:12:47 PM
: On Fri, 12 Jul 2002, Jungshik Shin wrote:
:> : PS: It is more accurate to say "Unicode", instead of ISO 10646; because
:> : what the Internet has selected is Unicode.
:> I have little idea what you meant by Unicode having been selected by the
:> Internet. Unicode and ISO 10646 have been aligned with each other
:> on a regular basis and they're basically the same standard at least in
:> terms of character repertoire.

: The Internet is using the BMP and possibly also planes 1 to 16,
: not the full 31-bit (2^31) codepoint space of ISO 10646.

You're right that formally ISO 10646 is still a 31 bit character set
standard. However, ISO/IEC JTC1/SC2/WG2 will not assign any character
beyond plane 16 (0x10). That means ISO 10646 will _effectively_ become
a 20.1 bit (with just 17 planes: plane 0 to 16) character set standard,
just like Unicode, once amendment 1 to ISO 10646-1:2000 is passed, if it
hasn't been already.
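
The "20.1 bit" figure is just the base-2 logarithm of the number of code positions in 17 planes. A quick check in Python (plain arithmetic, nothing specific to any library):

```python
import math

PLANES = 17            # planes 0 through 16 (0x10)
PER_PLANE = 0x10000    # 65,536 code positions per plane

total = PLANES * PER_PLANE
print(total)                       # number of code positions
print(hex(total - 1))              # highest code position
print(round(math.log2(total), 2))  # bits needed, a little over 20
```

The highest position comes out as 0x10ffff, and the logarithm lands just above 20, which is why the figure is quoted as 20.1 bits and rounded up to 21 in practice.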


Let me quote Kenneth Whistler (a member of both UTC and ISO/IEC
JTC1/SC2/WG2) (this was posted to Unicode mailing list in March, 2000)

KW> You may want to also consider the *21*-bit option now. WG2 has just
KW> approved the proposal to remove the user planes and user groups from
KW> 10646 past Plane 16. That means there will be *no* assigned characters
KW> past U-0010FFFF, and that UTF-16, UTF-8, and UCS-4 (=UTF-32) will all
KW> have the same code ranges and be completely interoperable. This closes
KW> the last main architectural mismatch between 10646 and the Unicode
KW> Standard.

KW> This approval still needs to be ballotted in the next amendment for
KW> 10646 Part 1, but while not a foregone conclusion, its approval seems
KW> quite likely to me.

KW> What this means is that for the foreseeable future (50 - 100 years?),
KW> 10646 characters will be constrained to 20.1 bits (round up to 21).

Here's also an excerpt from PDAM (proposed draft amendment) 1 to ISO
10646-1:2000 (http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n2308.pdf):

-----------
9.1 Planes reserved for future standardization

Planes 11 (17 in decimal) to FF in group 00 and all planes in all
other groups (i.e. Planes 00 to FF in Groups 01 to 7F) are reserved
for future standardization, and thus those code positions shall not be
used for any other purpose.

Code positions in these planes do not have a mapping to the UTF-16
form (see Annex C)

Note: To ensure continued interoperability between the UTF-16 form and
other coded representation of the UCS, it is intended that *no* characters
will *ever* be allocated to code positions above 0010FFFF.
-----------

In the final draft (not available on-line), it's likely that 'reserved'
in the first paragraph has been replaced by 'permanently reserved'
because some nat'l standard bodies requested a revision along that line.


: For the most part, this is a difference that makes no difference, since
: ISO will have 14 planes to fill first; but it is a difference nonetheless
: and one can not blithely say "ISO 10646" when one means "Unicode".

You have a point. But for all practical purposes, that's a difference
that makes no difference (at least in terms of the character repertoire
of both standards) as you wrote.

Jungshik Shin

Jungshik Shin

Jul 15, 2002, 4:22:35 PM
Jungshik Shin wrote:
> In <Pine.GSO.4.33_heb2.09.0207111055461.13097-100000@csd>, Tzafrir Cohen wrote:
>
> : 1. why not make pine the first?
>
> My patch to Pine 4.44 in this direction is available at
> <http://jshin.net/i18n/pine4.44.iconv.patch>.

This is now available at <http://www.i18nl10n.com/pine4.44.iconv.patch>

Jungshik Shin

Jul 15, 2002, 11:25:54 PM
In <3D332F0B...@jtan.com>, Jungshik Shin wrote:

When I announced my patch, I listed some of its shortcomings, but
that list was not exhaustive. Here's another try for others who might
be interested in improving my patch as well as for my own reference.


- The encoding (MIME charset) for outgoing emails should be
*decoupled* from the encoding of a terminal under which Pine
is launched. In place of 'character-set', we need two separate
configuration options, 'send-encoding' and 'display-encoding' which
can be different from each other.

- It should be possible to change the encoding (MIME charset)
of outgoing messages _at the time of_ composition
(as is possible with MS OE and Mozilla-Mail)
without changing the default send-encoding mentioned above.


Although going all the way to UTF-8 is desirable,
the reality is that some of my correspondents cannot
deal with UTF-8 messages. For them, I have to
write in legacy encodings. Currently, I have to
launch another Pine with a separate pinerc to compose
my email in a legacy encoding.

- The internal encoding conversion (as opposed to relying on
users setting display filters correctly in pinerc) with iconv

- 'assumed-charset' should be settable on a per-folder basis as well as
globally. And, 'assumed-charset' should work for the message body
(without setting the display filter for US-ASCII) as well as for the
message header.

- In place of 'one octet-one column width' model, we need to make use
of something like 'wcwidth()'. This might be the hardest part.
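
A rough idea of what width-aware truncation of an index line might look like, sketched in Python rather than in Pine's C. The names char_columns and truncate_to_columns are hypothetical, and the width rule (2 columns for East Asian wide/fullwidth, 0 for combining marks, 1 otherwise) is an approximation of wcwidth(), not a replacement for it.

```python
import unicodedata


def char_columns(ch: str) -> int:
    """Estimated terminal columns for one character (approximates wcwidth)."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def truncate_to_columns(text: str, max_cols: int) -> str:
    """Keep the longest prefix of text that fits in max_cols columns."""
    used = 0
    kept = []
    for ch in text:
        width = char_columns(ch)
        if used + width > max_cols:
            break  # never split a double-width character in half
        kept.append(ch)
        used += width
    return "".join(kept)


subject = "\ud55c\uae00 test"
print(repr(truncate_to_columns(subject, 5)))
print(repr(truncate_to_columns(subject, 3)))
```

The point of the design is that the budget is counted in columns, not octets, so a subject is cut at the same screen position regardless of how many octets each character takes in UTF-8.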

Jungshik

P.S. I'm writing this under a Putty-ssh2 session. Just in case, 'line code
page' in Windows|Translation has to be set to 'utf-8' to make Putty a
UTF-8 terminal emulator. Pine 4.44 with my patch worked under Putty-ssh2
(UTF-8) as well as under xterm-16x (under a UTF-8 locale or with '-u8'
option). BTW, xterm-16x has been ported to a number of Unix/X11 platforms.

J.B. Moreno

Jul 18, 2002, 11:50:17 AM
Mark Crispin <m...@CAC.Washington.EDU> wrote:

> It is not, however, necessary to keep on telling us that you want Unicode.
> We know. The delay isn't because we don't care either. We do care. We
> have many tasks on our plate and limited funding to do it.

Can Pine read and post to UTF8 newsgroups like dk.test.utf8-æøå?

--
J.B. Moreno
