[Haskell-cafe] invalid character encoding

John Goerzen

Mar 14, 2005, 8:38:52 PM

to haskel...@haskell.org

I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
the new hugs with:

<handle>: IO.getContents: protocol error (invalid character encoding)

What is going on, and how can I fix it?

Thanks,
John
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Ross Paterson

Mar 15, 2005, 5:44:48 AM

to John Goerzen, 299...@bugs.debian.org, haskel...@haskell.org

On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
> I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
> the new hugs with:
>
> <handle>: IO.getContents: protocol error (invalid character encoding)
>
> What is going on, and how can I fix it?

A Haskell 98 Handle is a character stream, and doesn't support binary
I/O. This would have bitten you sooner or later on systems that do CRLF
conversion, but Hugs is now much stricter, because character streams now
use the encoding determined by the current locale (for the C locale, that
means ASCII only).

You can select binary I/O using the openBinaryFile and hSetBinaryMode
functions from System.IO. After that, the Chars you get from that Handle
are actually bytes.
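
A minimal sketch of the binary-mode approach described above, assuming the bytes are wanted as a `[Word8]`. The helper names `readBytes`, `writeBytes` and `readStdinBytes` are made up for illustration; only `openBinaryFile`, `withBinaryFile` and `hSetBinaryMode` come from System.IO.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO

-- Read a whole file as raw bytes. On a binary-mode Handle each Char
-- carries one byte (0..255), so no locale decoding or CRLF translation
-- is applied.
readBytes :: FilePath -> IO [Word8]
readBytes path = do
  h <- openBinaryFile path ReadMode
  s <- hGetContents h
  let bytes = map (fromIntegral . ord) s
  -- hGetContents is lazy: force the whole list before closing the Handle.
  length bytes `seq` hClose h
  return bytes

-- Write raw bytes; withBinaryFile closes (and so flushes) the Handle.
writeBytes :: FilePath -> [Word8] -> IO ()
writeBytes path bs =
  withBinaryFile path WriteMode $ \h ->
    hPutStr h (map (chr . fromIntegral) bs)

-- The same idea for standard input, e.g. when gzip data arrives on a pipe.
readStdinBytes :: IO [Word8]
readStdinBytes = do
  hSetBinaryMode stdin True
  fmap (map (fromIntegral . ord)) getContents
```

For code like the gzip case in the original question, switching the existing Handle with `hSetBinaryMode` before calling `getContents` is probably the smallest change.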

John Goerzen

Mar 15, 2005, 9:13:19 AM

to haskel...@haskell.org, 299...@bugs.debian.org

On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
> On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
> > I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
> > the new hugs with:
> >
> > <handle>: IO.getContents: protocol error (invalid character encoding)
> >
> > What is going on, and how can I fix it?
>
> A Haskell 98 Handle is a character stream, and doesn't support binary
> I/O. This would have bitten you sooner or later on systems that do CRLF

Yes, probably so..

> conversion, but Hugs is now much stricter, because character streams now
> use the encoding determined by the current locale (for the C locale, that
> means ASCII only).

Hmm, this seems to be completely undocumented. So yes, I'll try using
openBinaryFile, but the docs I have seen still talk only about CRLF and
^Z.

Anyway, I'm interested in this new feature (I assume GHC 6.4 has it as
well?) Would it, for instance, automatically convert from Latin-1 to
UTF-16 on read, and the inverse on write? Or to/from UTF-8?

Thanks,

-- John

Ross Paterson

Mar 15, 2005, 9:24:24 AM

to John Goerzen, 299...@bugs.debian.org, haskel...@haskell.org

On Tue, Mar 15, 2005 at 08:12:48AM -0600, John Goerzen wrote:
> > [...] but Hugs is now much stricter, because character streams now
> > use the encoding determined by the current locale (for the C locale, that
> > means ASCII only).
>
> Hmm, this seems to be completely undocumented.

It's mentioned in the release history in the User's Guide, which refers
to section 3.3 for (some) more details.

Ian Lynagh

Mar 15, 2005, 10:59:56 PM

to Ross Paterson, Simon Peyton-Jones, Simon Marlow, Malcolm Wallace, haskel...@haskell.org, 299...@bugs.debian.org

On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:

> On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
> > I've got some gzip (and Ian Lynagh's Inflate) code that breaks under
> > the new hugs with:
> >
> > <handle>: IO.getContents: protocol error (invalid character encoding)
> >
> > What is going on, and how can I fix it?
>
> A Haskell 98 Handle is a character stream, and doesn't support binary
> I/O. This would have bitten you sooner or later on systems that do CRLF
> conversion, but Hugs is now much stricter, because character streams now
> use the encoding determined by the current locale (for the C locale, that
> means ASCII only).

Do you have a list of functions which behave differently in the new
release to how they did in the previous release?
(I'm not interested in changes that will affect only whether something
compiles, not how it behaves given it compiles both before and after).

Simons, Malcolm, are there any such functions in the new ghc/nhc98?

Also, are you all agreed that the hugs interpretation of the report is
correct, and thus ghc at least is buggy in this respect? (I'm afraid I
haven't been able to test nhc98 yet).

Finally, the hugs behaviour seems a little odd to me. The below shows 4
cases where iconv complains when asked to convert utf8 to utf8, but hugs
only gives an error in one of them. In the others it just truncates the
input. Is this really correct? It also seems to behave the same for me
regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C.

Thanks
Ian

printf "\x00\x7F" > inp1
printf "\x00\x80" > inp2
printf "\x00\xC4" > inp3
printf "\xFF\xFF" > inp4
printf "\xb1\x41\x00\x03\x65\x6d\x70\x74\x79\x00\x03\x00\x00\x00\x00\x00" > inp5
echo 'main = do xs <- getContents; print xs' > run.hs
for i in `seq 1 5`; do runhugs run.hs < inp$i; done
for i in `seq 1 5`; do runghc6 run.hs < inp$i; done
for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 < inp$i; done

which gives me the following output:

$ for i in `seq 1 5`; do runhugs run.hs < inp$i; done
"\NUL\DEL"
"\NUL"
"\NUL"
""
"
Program error: <stdin>: IO.getContents: protocol error (invalid character encoding)
$ for i in `seq 1 5`; do runghc6 run.hs < inp$i; done
"\NUL\DEL"
"\NUL\128"
"\NUL\196"
"\255\255"
"\177A\NUL\ETXempty\NUL\ETX\NUL\NUL\NUL\NUL\NUL"
$ for i in `seq 1 5`; do echo $i; iconv -f utf8 -t utf8 < inp$i; done
1
2
iconv: illegal input sequence at position 1
3
iconv: incomplete character or shift sequence at end of buffer
4
iconv: illegal input sequence at position 0
5
iconv: illegal input sequence at position 0
$

Simon Marlow

Mar 16, 2005, 4:34:11 AM

to Ian Lynagh, Ross Paterson, Simon Peyton-Jones, Malcolm Wallace, haskel...@haskell.org, 299...@bugs.debian.org

On 16 March 2005 03:54, Ian Lynagh wrote:

> On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
>> On Mon, Mar 14, 2005 at 07:38:09PM -0600, John Goerzen wrote:
>>> I've got some gzip (and Ian Lynagh's Inflate) code that breaks
>>> under the new hugs with:
>>>
>>> <handle>: IO.getContents: protocol error (invalid character
>>> encoding)
>>>
>>> What is going on, and how can I fix it?
>>
>> A Haskell 98 Handle is a character stream, and doesn't support binary
>> I/O. This would have bitten you sooner or later on systems that do
>> CRLF conversion, but Hugs is now much stricter, because character
>> streams now use the encoding determined by the current locale (for
>> the C locale, that means ASCII only).
>
> Do you have a list of functions which behave differently in the new
> release to how they did in the previous release?
> (I'm not interested in changes that will affect only whether something
> compiles, not how it behaves given it compiles both before and after).
>
> Simons, Malcolm, are there any such functions in the new ghc/nhc98?
>
> Also, are you all agreed that the hugs interpretation of the report is
> correct, and thus ghc at least is buggy in this respect? (I'm afraid I
> haven't been able to test nhc98 yet).

GHC (and nhc98) assumes a locale of ISO8859-1 for I/O. You could
consider that to be a bug, I suppose. We don't plan to do anything
about it in the context of the current IO library, at least.

Cheers,
Simon

Ross Paterson

Mar 16, 2005, 6:55:37 AM

to haskel...@haskell.org, 299...@bugs.debian.org

On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
> Do you have a list of functions which behave differently in the new
> release to how they did in the previous release?
> (I'm not interested in changes that will affect only whether something
> compiles, not how it behaves given it compiles both before and after).

I got lost in the negatives here. It affects all Haskell 98 primitives
that do character I/O, or that exchange C strings with the C library.

It doesn't affect functions added by the hierarchical libraries, i.e.
those functions are safe only with the ASCII subset. (There is a vague
plan to make Foreign.C.String conform to the FFI spec, which mandates
locale-based encoding, and thus would change all those, but it's still
up in the air.)

> Finally, the hugs behaviour seems a little odd to me. The below shows 4
> cases where iconv complains when asked to convert utf8 to utf8, but hugs
> only gives an error in one of them. In the others it just truncates the
> input. Is this really correct? It also seems to behave the same for me
> regardless of whether I export LC_CTYPE to en_GB.UTF-8 or C.

It's a bug: an unrecognized encoding at the end of the input was being
ignored instead of triggering the exception. Now fixed in CVS
(rev. 1.14 of src/char.c if anyone's backporting). It was an accident
of this example that the behaviour in all locales was the same.

Duncan Coutts

Mar 16, 2005, 8:08:21 AM

to Ross Paterson, Haskell Cafe

On Wed, 2005-03-16 at 11:55 +0000, Ross Paterson wrote:
> On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
> > Do you have a list of functions which behave differently in the new
> > release to how they did in the previous release?
> > (I'm not interested in changes that will affect only whether something
> > compiles, not how it behaves given it compiles both before and after).
>
> I got lost in the negatives here. It affects all Haskell 98 primitives
> that do character I/O, or that exchange C strings with the C library.
>
> It doesn't affect functions added by the hierarchical libraries, i.e.
> those functions are safe only with the ASCII subset. (There is a vague
> plan to make Foreign.C.String conform to the FFI spec, which mandates
> locale-based encoding, and thus would change all those, but it's still
> up in the air.)

Hmm. I'm not convinced that automatically converting to the current
locale is the ideal behaviour (it'd certainly break all my programs!).
Certainly a function for converting into the encoding of the current
locale would be useful for many users but it's important to be able to
know the encoding with certainty. For example some libraries (eg Gtk+)
take all strings in UTF-8 irrespective of the current locale (it does
locale-dependent conversions on IO etc but the internal representation
is always UTF8). We do the conversion to UTF8 on the Haskell side and so
produce a byte string which we marshal using the FFI CString functions.

If the implementations get fixed to conform to the FFI spec, I suppose
we could roll our own version of withCString that marshals [Word8] ->
char*.

Duncan

Duncan Coutts

Mar 16, 2005, 8:15:25 AM

to Haskell Cafe

On Wed, 2005-03-16 at 13:09 +0000, Duncan Coutts wrote:
> On Wed, 2005-03-16 at 11:55 +0000, Ross Paterson wrote:

> > It doesn't affect functions added by the hierarchical libraries, i.e.
> > those functions are safe only with the ASCII subset. (There is a vague
> > plan to make Foreign.C.String conform to the FFI spec, which mandates
> > locale-based encoding, and thus would change all those, but it's still
> > up in the air.)
>
> Hmm. I'm not convinced that automatically converting to the current
> locale is the ideal behaviour (it'd certainly break all my programs!).
> Certainly a function for converting into the encoding of the current
> locale would be useful for many users but it's important to be able to
> know the encoding with certainty. For example some libraries (eg Gtk+)
> take all strings in UTF-8 irrespective of the current locale (it does
> locale-dependent conversions on IO etc but the internal representation
> is always UTF8). We do the conversion to UTF8 on the Haskell side and so
> produce a byte string which we marshal using the FFI CString functions.

Silly me! There are C marshaling functions that are specified to do just
this but I never noticed them before!

withCAString and similar functions treat haskell Strings as byte
strings.
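
A small sketch of that byte-string behaviour, assuming the data is held as `[Word8]` and carried in a String whose Chars are all in the range 0..255. The helper names are hypothetical; `withCAStringLen` and `peekCAStringLen` are the length-carrying relatives of `withCAString` in Foreign.C.String.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import Foreign.C.String (peekCAStringLen, withCAStringLen)

-- Carry bytes in a String: every Char is in 0..255.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)

stringToBytes :: String -> [Word8]
stringToBytes = map (fromIntegral . ord)

-- Copy the bytes into C memory and read them straight back.
-- No encoding step is involved, so the bytes come back unchanged.
roundTripBytes :: [Word8] -> IO [Word8]
roundTripBytes ws =
  fmap stringToBytes (withCAStringLen (bytesToString ws) peekCAStringLen)
```

The Len variants are used here so that an embedded NUL byte survives; with plain `withCAString` the C side would see the string end at the first NUL.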

Marcin 'Qrczak' Kowalczyk

Mar 16, 2005, 8:16:40 AM

to haskel...@haskell.org

Duncan Coutts <duncan...@worc.ox.ac.uk> writes:

>> It doesn't affect functions added by the hierarchical libraries,
>> i.e. those functions are safe only with the ASCII subset. (There is
>> a vague plan to make Foreign.C.String conform to the FFI spec,
>> which mandates locale-based encoding, and thus would change all
>> those, but it's still up in the air.)
>
> Hmm. I'm not convinced that automatically converting to the current
> locale is the ideal behaviour (it'd certainly break all my programs!).
> Certainly a function for converting into the encoding of the current
> locale would be useful for many users but it's important to be able to
> know the encoding with certainty.

It should only be the default, not the only option. It should be
possible to specify the encoding explicitly.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Glynn Clements

Mar 16, 2005, 12:13:50 PM

to haskel...@haskell.org

Marcin 'Qrczak' Kowalczyk wrote:

> >> It doesn't affect functions added by the hierarchical libraries,
> >> i.e. those functions are safe only with the ASCII subset. (There is
> >> a vague plan to make Foreign.C.String conform to the FFI spec,
> >> which mandates locale-based encoding, and thus would change all
> >> those, but it's still up in the air.)
> >
> > Hmm. I'm not convinced that automatically converting to the current
> > locale is the ideal behaviour (it'd certainly break all my programs!).
> > Certainly a function for converting into the encoding of the current
> > locale would be useful for many users but it's important to be able to
> > know the encoding with certainty.
>
> It should only be the default, not the only option.

I'm not sure that it should be available at all.

> It should be possible to specify the encoding explicitly.

Conversely, it shouldn't be possible to avoid specifying the encoding
explicitly.

Personally, I wouldn't provide an all-in-one "convert String to
CString using locale's encoding" function, just in case anyone was
tempted to actually use it.

The decision as to the encoding belongs in application code; not in
(most) libraries, and definitely not in the language.

[Libraries dealing with file formats or communication protocols which
mandate a specific encoding are an exception. But they will be using a
fixed encoding, not the locale's encoding.]

If application code chooses to use the locale's encoding, it can
retrieve it then pass it as the encoding argument to any applicable
functions.
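
To make the "encoding as an argument" idea concrete, here is a sketch that relies on GHC-specific modules which postdate this thread: `GHC.Foreign.withCString` takes a `TextEncoding` as its first argument, and `GHC.IO.Encoding.getLocaleEncoding` retrieves the locale's encoding. The helpers `encodedBytes` and `localeBytes` are made up for illustration and are not part of any library.

```haskell
import Data.Word (Word8)
import Foreign.C.Types (CChar)
import Foreign.Marshal.Array (peekArray0)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, getLocaleEncoding, latin1, utf8)

-- The bytes a C function would see for a String under a given encoding.
-- The encoding is an explicit parameter, never implicit state.
encodedBytes :: TextEncoding -> String -> IO [Word8]
encodedBytes enc s =
  GHC.withCString enc s $ \p ->
    fmap (map fromIntegral) (peekArray0 (0 :: CChar) p)

-- Application code that does want the locale's encoding asks for it
-- and passes it along like any other encoding.
localeBytes :: String -> IO [Word8]
localeBytes s = do
  enc <- getLocaleEncoding
  encodedBytes enc s
```

With this shape, `encodedBytes utf8 "\233"` and `encodedBytes latin1 "\233"` produce different byte sequences for the same String, and the choice is visible at the call site.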

If application code doesn't want to use the locale's encoding, it
shouldn't be shoe-horned into doing so because a library developer
decided to duck the encoding issues by grabbing whatever encoding was
readily to hand (i.e. the locale's encoding).

--
Glynn Clements <gl...@gclements.plus.com>

Marcin 'Qrczak' Kowalczyk

Mar 16, 2005, 12:51:14 PM

to haskel...@haskell.org

Glynn Clements <gl...@gclements.plus.com> writes:

>> It should be possible to specify the encoding explicitly.
>
> Conversely, it shouldn't be possible to avoid specifying the
> encoding explicitly.

What encoding should a binding to readline or curses use?

Curses in C comes in two flavors: the traditional byte version and a
wide character version. The second version is easy if we can assume
that wchar_t is Unicode, but it's not always available and until
recently in ncurses it was buggy. Let's assume we are using the byte
version. How to encode strings?

A terminal uses an ASCII-compatible encoding. Wide character version
of curses convert characters to the locale encoding, and byte version
passes bytes unchanged. This means that if a Haskell binding to the
wide character version does the obvious thing and passes Unicode
directly, then an equivalent behavior can be obtained from the byte
version (only limited to 256-character encodings) by using the locale
encoding.

The locale encoding is the right encoding to use for conversion of the
result of strerror, gai_strerror, msg member of gzip compressor state
etc. When an I/O error occurs and the error code is translated to a
Haskell exception and then shown to the user, why would the application
need to specify the encoding and how?

> If application code doesn't want to use the locale's encoding, it
> shouldn't be shoe-horned into doing so because a library developer
> decided to duck the encoding issues by grabbing whatever encoding
> was readily to hand (i.e. the locale's encoding).

If a C library is written with the assumption that texts are in the
locale encoding, a Haskell binding to such library should respect that
assumption.

Only some libraries allow to work with different, explicitly specified
encodings. Many libraries don't, especially if the texts are not the
core of the library functionality but error messages.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

John Meacham

Mar 16, 2005, 6:37:09 PM

to haskel...@haskell.org

On Wed, Mar 16, 2005 at 05:13:25PM +0000, Glynn Clements wrote:
>
> Marcin 'Qrczak' Kowalczyk wrote:
>
> > >> It doesn't affect functions added by the hierarchical libraries,
> > >> i.e. those functions are safe only with the ASCII subset. (There is
> > >> a vague plan to make Foreign.C.String conform to the FFI spec,
> > >> which mandates locale-based encoding, and thus would change all
> > >> those, but it's still up in the air.)
> > >
> > > Hmm. I'm not convinced that automatically converting to the current
> > > locale is the ideal behaviour (it'd certainly break all my programs!).
> > > Certainly a function for converting into the encoding of the current
> > > locale would be useful for many users but it's important to be able to
> > > know the encoding with certainty.
> >
> > It should only be the default, not the only option.
>
> I'm not sure that it should be available at all.
>
> > It should be possible to specify the encoding explicitly.
>
> Conversely, it shouldn't be possible to avoid specifying the encoding
> explicitly.
>
> Personally, I wouldn't provide an all-in-one "convert String to
> CString using locale's encoding" function, just in case anyone was
> tempted to actually use it.

But this is exactly what is needed for most C library bindings. Which is
why I had to write my own and proposed it to the FFI. Most C libraries
expect char * to be in the standard encoding of the current locale.
When a binding explicitly uses another encoding, then great, we can use
different marshaling functions. In any case, we need tools to be able to
conform to the common cases of ascii-only (withCAString) and current
locale (withCString).

withUTF8String would be a nice addition, but is much less important to
come standard as it can easily be written by end users, unlike locale
specific versions which are necessarily system dependent.
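
As an illustration of how little is needed, here is one way an end user might write it. This is only a sketch: `withUTF8String` and `encodeUTF8` are not library functions, surrogate code points are not rejected, and a NUL character in the input would end the C string early.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.C.Types (CChar)
import Foreign.Marshal.Array (withArray0)

-- Encode each Unicode code point as 1 to 4 UTF-8 bytes.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap (map fromIntegral . go . ord)
  where
    go :: Int -> [Int]
    go c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    -- Continuation byte: 10xxxxxx with the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Marshal a String to a NUL-terminated, UTF-8 encoded char*.
withUTF8String :: String -> (CString -> IO a) -> IO a
withUTF8String s = withArray0 0 (map fromIntegral (encodeUTF8 s) :: [CChar])
```

Nothing here depends on the system, which is the contrast being drawn with the locale-specific versions.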

John

--
John Meacham - ⑆repetae.net⑆john⑈

Marcin 'Qrczak' Kowalczyk

Mar 16, 2005, 7:06:15 PM

to haskel...@haskell.org

John Meacham <jo...@repetae.net> writes:

> In any case, we need tools to be able to conform to the common cases
> of ascii-only (withCAString) and current locale (withCString).
>
> withUTF8String would be a nice addition, but is much less important to
> come standard as it can easily be written by end users, unlike locale
> specific versions which are necessarily system dependent.

IMHO the encoding should be a parameter of an extended variant of
withCString (and peekCString etc.).

We need a framework for implementing encoders/decoders first.
A problem with designing the framework is that it should support
both pure Haskell conversions and C functions like iconv which work
on arrays. We must also provide a way to signal errors.

A bonus is a way to handle errors coming from another recoder without
causing it to fail completely. That way one could add a fallback for
unrepresentable characters, e.g. HTML entities or approximations with
stripped accents.
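
A rough, pure-Haskell sketch of the fallback idea, using Latin-1 as the target encoding. The `Fallback` type and all function names are invented here; a real framework would also have to cover array-based converters such as iconv, which this does not attempt.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- A fallback decides which bytes to emit for a character the target
-- encoding cannot represent; Nothing means "report an error".
type Fallback = Char -> Maybe [Word8]

-- Encode to Latin-1. Errors are signalled with Left, carrying the
-- offending character.
encodeLatin1 :: Fallback -> String -> Either Char [Word8]
encodeLatin1 fallback = fmap concat . mapM enc
  where
    enc c
      | ord c < 0x100 = Right [fromIntegral (ord c)]
      | otherwise     = maybe (Left c) Right (fallback c)

-- No fallback: any unrepresentable character is an error.
strict :: Fallback
strict _ = Nothing

-- Replace unrepresentable characters with a decimal HTML entity.
htmlEntity :: Fallback
htmlEntity c = Just (map (fromIntegral . ord) ("&#" ++ show (ord c) ++ ";"))
```

The caller picks the policy: `encodeLatin1 strict` fails on the first unrepresentable character, while `encodeLatin1 htmlEntity` keeps going and substitutes an entity.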

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Ian Lynagh

Mar 17, 2005, 1:22:37 AM

to haskel...@haskell.org

On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
>

> You can select binary I/O using the openBinaryFile and hSetBinaryMode
> functions from System.IO. After that, the Chars you get from that Handle
> are actually bytes.

What about the ones sent to it?
Are all the following results intentional?
Am I doing something stupid?

[in brief: hugs' (hPutStr h) now behaves differently to
(mapM_ (hPutChar h)), and ghc writes the empty string for both when
told to write "\128"]

Running the following with new ghc 6.4 and hugs 20050308 or 20050317:

echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; hPutStr ho "\128"' > run1.hs
echo 'import System.IO; import System.Environment; main = do [o] <- getArgs; ho <- openBinaryFile o WriteMode; mapM_ (hPutChar ho) "\128"' > run2.hs
runhugs run1.hs hugs1
runhugs run2.hs hugs2
runghc run1.hs ghc1
runghc run2.hs ghc2
ls -l hugs1 hugs2 ghc1 ghc2
for f in hugs1 hugs2 ghc1 ghc2; do echo $f; hexdump -C $f; done

gives:

-rw-r--r-- 1 igloo igloo 0 Mar 17 06:15 ghc1
-rw-r--r-- 1 igloo igloo 0 Mar 17 06:15 ghc2
-rw-r--r-- 1 igloo igloo 1 Mar 17 06:15 hugs1
-rw-r--r-- 1 igloo igloo 1 Mar 17 06:15 hugs2
hugs1
00000000 3f |?|
00000001
hugs2
00000000 80 |.|
00000001
ghc1
ghc2

With ghc 6.2.2 and hugs "November 2003" I get:

-rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 ghc1
-rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 ghc2
-rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 hugs1
-rw-r--r-- 1 igloo igloo 1 Mar 17 06:16 hugs2
hugs1
00000000 80 |.|
00000001
hugs2
00000000 80 |.|
00000001
ghc1
00000000 80 |.|
00000001
ghc2
00000000 80 |.|
00000001

Incidentally, "make check" in CVS hugs said:

cd tests && sh testScript | egrep -v '^--( |-----)'
./../src/hugs +q -w -pHugs: static/mod154.hs < /dev/null
expected stdout not matched by reality
*** static/Loaded.output Fri Jul 19 22:41:51 2002
--- /tmp/runtest11949.3 Thu Mar 17 05:46:05 2005
***************
*** 1,2 ****
! Type :? for help
Hugs:[Leaving Hugs]
--- 1,3 ----
! ERROR "static/mod154.hs" - Conflicting exports of entity "sort"
! *** Could refer to Data.List.sort or M.sort
Hugs:[Leaving Hugs]

Thanks
Ian

Ian Lynagh

Mar 17, 2005, 6:41:04 AM

to haskel...@haskell.org

On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:
>
> [in brief: hugs' (hPutStr h) now behaves differently to
> (mapM_ (hPutChar h)), and ghc writes the empty string for both when
> told to write "\128"]

Ah, Malcolm's commit messages have just reminded me of the finaliser
changes requiring hflushes in new ghc, so it's just the hugs output that
confuses me now.
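
For reference, a variant of the test programs that does not depend on a finaliser to flush the Handle. This is only a sketch: `writeByte128` and `readFirstByte` are made-up names, and it assumes `withBinaryFile` from System.IO is available, which closes (and therefore flushes) the Handle when the action finishes.

```haskell
import System.IO

-- Write the single byte 0x80. The Handle is closed on exit from
-- withBinaryFile, so the byte reaches the file without an explicit hFlush.
writeByte128 :: FilePath -> IO ()
writeByte128 path = withBinaryFile path WriteMode $ \h -> hPutStr h "\128"

-- Read the first byte of a file back as a Char in 0..255.
readFirstByte :: FilePath -> IO Char
readFirstByte path = withBinaryFile path ReadMode hGetChar

-- Size of the file in bytes.
fileSize :: FilePath -> IO Integer
fileSize path = withBinaryFile path ReadMode hFileSize
```

An explicit `hClose ho` (or `hFlush ho`) at the end of run1.hs and run2.hs would presumably have the same effect.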

Ross Paterson

Mar 17, 2005, 7:47:28 AM

to haskel...@haskell.org

On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:

> On Tue, Mar 15, 2005 at 10:44:28AM +0000, Ross Paterson wrote:
> > You can select binary I/O using the openBinaryFile and hSetBinaryMode
> > functions from System.IO. After that, the Chars you get from that Handle
> > are actually bytes.
>
> What about the ones sent to it?
> Are all the following results intentional?
> Am I doing something stupid?

No, I was. Output primitives other than hPutChar were ignoring binary
mode (and Hugs has more of these things as primitives than GHC does).
Now fixed in CVS (rev. 1.95 of src/char.c).

Ross Paterson

Mar 17, 2005, 8:35:27 AM

to haskel...@haskell.org

On Thu, Mar 17, 2005 at 06:22:25AM +0000, Ian Lynagh wrote:

> Incidentally, "make check" in CVS hugs said:
>
> cd tests && sh testScript | egrep -v '^--( |-----)'
> ./../src/hugs +q -w -pHugs: static/mod154.hs < /dev/null
> expected stdout not matched by reality
> *** static/Loaded.output Fri Jul 19 22:41:51 2002
> --- /tmp/runtest11949.3 Thu Mar 17 05:46:05 2005
> ***************
> *** 1,2 ****
> ! Type :? for help
> Hugs:[Leaving Hugs]
> --- 1,3 ----
> ! ERROR "static/mod154.hs" - Conflicting exports of entity "sort"
> ! *** Could refer to Data.List.sort or M.sort
> Hugs:[Leaving Hugs]

This is a documented bug (though the notes in tests ought to mention
this too).

Glynn Clements

Mar 17, 2005, 2:26:06 PM

to Marcin 'Qrczak' Kowalczyk, haskel...@haskell.org

Marcin 'Qrczak' Kowalczyk wrote:

> Glynn Clements <gl...@gclements.plus.com> writes:
>
> >> It should be possible to specify the encoding explicitly.
> >
> > Conversely, it shouldn't be possible to avoid specifying the
> > encoding explicitly.
>
> What encoding should a binding to readline or curses use?
>
> Curses in C comes in two flavors: the traditional byte version and a
> wide character version. The second version is easy if we can assume
> that wchar_t is Unicode, but it's not always available and until
> recently in ncurses it was buggy. Let's assume we are using the byte
> version. How to encode strings?

The (non-wchar) curses API functions take byte strings (char*), so the
Haskell bindings should take CString or [Word8] arguments. If you
provide "wrapper" functions which take String arguments, either they
should have an encoding argument or the encoding should be a mutable
per-terminal setting.

> A terminal uses an ASCII-compatible encoding. Wide character version
> of curses convert characters to the locale encoding, and byte version
> passes bytes unchanged. This means that if a Haskell binding to the
> wide character version does the obvious thing and passes Unicode
> directly, then an equivalent behavior can be obtained from the byte
> version (only limited to 256-character encodings) by using the locale
> encoding.

I don't know enough about the wchar version of curses to comment on
that.

I do know that, to work reliably, the normal (byte) version of curses
needs to pass "printable" bytes through unmodified.

It is possible for curses to be used with a terminal which doesn't use
the locale's encoding. Specifically, a single process may use curses
with multiple terminals with differing encodings, e.g. an airport
public information system displaying information in multiple
languages.

Also, it's quite common to use non-standard encodings with terminals
(e.g. codepage 437, which has graphic characters beyond the ACS_* set
which terminfo understands).

> The locale encoding is the right encoding to use for conversion of the
> result of strerror, gai_strerror, msg member of gzip compressor state
> etc. When an I/O error occurs and the error code is translated to a
> Haskell exception and then shown to the user, why would the application
> need to specify the encoding and how?

Because the application may be using multiple locales/encodings.
Having had to do this in C (i.e. repeatedly calling setlocale() to
select the correct encoding), I would much prefer to have been able to
pass the locale as a parameter.

[The most common example is printf("%f"). You need to use the C locale
(decimal point) for machine-readable text but the user's locale
(locale-specific decimal separator) for human-readable text. This
isn't directly related to encodings per se, but a good example of why
parameters are preferable to state.]

> > If application code doesn't want to use the locale's encoding, it
> > shouldn't be shoe-horned into doing so because a library developer
> > decided to duck the encoding issues by grabbing whatever encoding
> > was readily to hand (i.e. the locale's encoding).
>
> If a C library is written with the assumption that texts are in the
> locale encoding, a Haskell binding to such library should respect that
> assumption.

C libraries which use the locale do so as a last resort. K&R C
completely ignored I18N issues. ANSI C added the locale mechanism
as a hack to provide minimal I18N support while maintaining backward
compatibility and in a minimally-intrusive manner.

The only reason that the C locale mechanism isn't a major nuisance is
that you can largely ignore it altogether. Code which requires real
I18N can use other mechanisms, and code which doesn't require any I18N
can just pass byte strings around and leave encoding issues to code
which actually has enough context to handle them correctly.

> Only some libraries allow to work with different, explicitly specified
> encodings. Many libraries don't, especially if the texts are not the
> core of the library functionality but error messages.

And most such libraries just treat text as byte strings. They don't
care about their interpretation, or even whether or not they are valid
in the locale's encoding.

--
Glynn Clements <gl...@gclements.plus.com>

Glynn Clements

Mar 17, 2005, 2:55:27 PM

to John Meacham, haskel...@haskell.org

John Meacham wrote:

> > > >> It doesn't affect functions added by the hierarchical libraries,
> > > >> i.e. those functions are safe only with the ASCII subset. (There is
> > > >> a vague plan to make Foreign.C.String conform to the FFI spec,
> > > >> which mandates locale-based encoding, and thus would change all
> > > >> those, but it's still up in the air.)
> > > >
> > > > Hmm. I'm not convinced that automatically converting to the current
> > > > locale is the ideal behaviour (it'd certainly break all my programs!).
> > > > Certainly a function for converting into the encoding of the current
> > > > locale would be useful for many users but it's important to be able to
> > > > know the encoding with certainty.
> > >
> > > It should only be the default, not the only option.
> >
> > I'm not sure that it should be available at all.
> >
> > > It should be possible to specify the encoding explicitly.
> >
> > Conversely, it shouldn't be possible to avoid specifying the encoding
> > explicitly.
> >
> > Personally, I wouldn't provide an all-in-one "convert String to
> > CString using locale's encoding" function, just in case anyone was
> > tempted to actually use it.
>
> But this is exactly what is needed for most C library bindings.

I very much doubt that "most" is accurate.

C functions which take a "char*" fall into three main cases:

1. Unspecified encoding, i.e. it's a string of bytes, not characters.

2. Locale's encoding, as determined by nl_langinfo(CODESET);
essentially, whatever was set with setlocale(LC_CTYPE), defaulting to
C/POSIX if setlocale() hasn't been called.

3. Fixed encoding, e.g. UTF-8, ISO-2022, US-ASCII (or EBCDIC on IBM
mainframes).

Historically, library functions have tended to fall into category 1
unless they *need* to know the interpretation of a given byte or
sequence of bytes (e.g. <ctype.h>), in which case they fall into
category 2. Most of libc falls into category 1, with a minority of
functions in category 2.

Code which is designed to handle multiple languages simultaneously is
more likely to fall into category 3, using one of the "universal"
encodings (typically ISO-2022 in southeast Asia and UTF-8 elsewhere).

E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the
use of the locale's encoding for filenames (if you have filenames in
multiple encodings, you lose; filenames using the "wrong" encoding
simply don't appear in file selectors).

--
Glynn Clements <gl...@gclements.plus.com>

Marcin 'Qrczak' Kowalczyk

Mar 17, 2005, 3:01:37 PM

to haskel...@haskell.org

Glynn Clements <gl...@gclements.plus.com> writes:

> The (non-wchar) curses API functions take byte strings (char*),
> so the Haskell bindings should take CString or [Word8] arguments.

Programmers will not want to use such interface. When they want to
display a string, it will be in Haskell String type.

And it prevents having a single Haskell interface which uses either
the narrow or wide version of curses interface, depending on what is
available.

> If you provide "wrapper" functions which take String arguments,
> either they should have an encoding argument or the encoding should
> be a mutable per-terminal setting.

There is already a mutable setting. It's called "locale".

> I don't know enough about the wchar version of curses to comment on
> that.

It uses wcsrtombs or equivalents to display characters, and the
reverse to interpret keystrokes.

> It is possible for curses to be used with a terminal which doesn't
> use the locale's encoding.

No, it will break under the new wide character curses API, and it will
confuse programs which use the old narrow character API.

The user (or the administrator) is responsible for matching the locale
encoding with the terminal encoding.

> Also, it's quite common to use non-standard encodings with terminals
> (e.g. codepage 437, which has graphic characters beyond the ACS_* set
> which terminfo understands).

curses doesn't support that.

>> The locale encoding is the right encoding to use for conversion of the
>> result of strerror, gai_strerror, msg member of gzip compressor state
>> etc. When an I/O error occurs and the error code is translated to a
>> Haskell exception and then shown to the user, why would the application
>> need to specify the encoding and how?
>
> Because the application may be using multiple locales/encodings.

But strerror always returns messages in the locale encoding.
Just like Gtk+2 always accepts texts in UTF-8.

For compatibility the default locale is "C", but new programs
which are prepared for I18N should do setlocale(LC_CTYPE, "")
and setlocale(LC_MESSAGES, "").

There are places where the encoding is settable independently,
or stored explicitly. For them Haskell should have withCString /
peekCString / etc. with an explicit encoding. And there are
places which use the locale encoding instead of having a separate
switch.
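
As a hedged illustration of marshalling "with an explicit encoding":
later releases of GHC's base library provide GHC.Foreign, whose
withCStringLen and peekCStringLen take a TextEncoding argument. The
sketch below uses them to turn a String into bytes and back. The names
encodeWith and decodeWith are made up for this example and are not part
of any library.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Array (peekArray, withArrayLen)
import Foreign.Ptr (castPtr)
import qualified GHC.Foreign as GHC
import System.IO (TextEncoding, latin1, utf8)

-- | Encode a String to bytes using the encoding passed as a parameter.
-- (Hypothetical helper name; only the GHC.Foreign calls are real.)
encodeWith :: TextEncoding -> String -> IO [Word8]
encodeWith enc s =
  GHC.withCStringLen enc s $ \(ptr, len) ->
    peekArray len (castPtr ptr)

-- | Decode bytes to a String using the encoding passed as a parameter.
-- Invalid input for the chosen encoding raises an IO exception.
decodeWith :: TextEncoding -> [Word8] -> IO String
decodeWith enc bytes =
  withArrayLen bytes $ \len ptr ->
    GHC.peekCStringLen enc (castPtr ptr, len)

main :: IO ()
main = do
  let s = "caf\233"              -- "cafe" with an acute accent on the e
  asUtf8   <- encodeWith utf8 s
  asLatin1 <- encodeWith latin1 s
  print asUtf8                   -- the accented letter takes two bytes
  print asLatin1                 -- the accented letter takes one byte
  back <- decodeWith utf8 asUtf8
  print (back == s)
```

The same String yields different byte sequences depending on the
encoding argument, which is the case where neither the locale nor a
fixed default can be assumed.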

> [The most common example is printf("%f"). You need to use the C
> locale (decimal point) for machine-readable text but the user's
> locale (locale-specific decimal separator) for human-readable text.

This is a different thing, and it is what IMHO C did wrong.

> This isn't directly related to encodings per se, but a good example
> of why parameters are preferable to state.]

The LC_* environment variables are the parameters for the encoding.
There is no other convention to pass the encoding to be used for
textual output to stdout for example.

> C libraries which use the locale do so as a last resort.

No, they do it by default.

> The only reason that the C locale mechanism isn't a major nuisance
> is that you can largely ignore it altogether.

Then how would a Haskell program know what encoding to use for stdout
messages? How would it know how to interpret filenames for graphical
display?

Do you want to invent a separate mechanism for communicating that, so
that an administrator has to set up a dozen environment variables
and teach each program separately about the encoding it should assume
by default? We had this mess 10 years ago, and parts of it are still
alive today - you must sometimes configure xterm or Emacs
separately, but it's becoming more common that programs know to use the
system-supplied setting and don't have to be configured separately.

> Code which requires real I18N can use other mechanisms, and code
> which doesn't require any I18N can just pass byte strings around and
> leave encoding issues to code which actually has enough context to
> handle them correctly.

Haskell can't just pass byte strings around without turning the
Unicode support into a joke (which it is now).

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Marcin 'Qrczak' Kowalczyk

Mar 17, 2005, 3:05:52 PM

to haskel...@haskell.org

Glynn Clements <gl...@gclements.plus.com> writes:

> E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the
> use of the locale's encoding for filenames (if you have filenames in
> multiple encodings, you lose; filenames using the "wrong" encoding
> simply don't appear in file selectors).

Actually they do appear, even though you can't type their names
from the keyboard. The name shown in the GUI used to be escaped in
different ways by different programs or even different places in one
program (question marks, %hex escapes, \oct escapes), but recently
they added some functions to glib to make the behavior uniform.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Keean Schupke

Mar 17, 2005, 4:53:52 PM

to Marcin 'Qrczak' Kowalczyk, haskel...@haskell.org

I cannot help feeling that all this multi-language support is a mess.

All strings should be coded in a universal encoding (like UTF8) so that
the code for a character is the same independent of locale.

It seems stupid that the locale affects the character encodings... the
code for an 'a' should be the same all over the world... as should the
code for a particular Japanese character.

In other words the locale should have no effect on character encodings,
it should select between multi-lingual error messages which are supplied
as distinct strings for each region.

While we may have to inter-operate with 'C' code, we could have a
Haskell library that does things properly.

Keean.


Glynn Clements

Mar 17, 2005, 8:34:51 PM

to Marcin 'Qrczak' Kowalczyk, haskel...@haskell.org

Marcin 'Qrczak' Kowalczyk wrote:

> > If you provide "wrapper" functions which take String arguments,
> > either they should have an encoding argument or the encoding should
> > be a mutable per-terminal setting.
>
> There is already a mutable setting. It's called "locale".

It isn't a per-terminal setting.

> > It is possible for curses to be used with a terminal which doesn't
> > use the locale's encoding.
>
> No, it will break under the new wide character curses API,

Or expose the fact that the WC API is broken, depending upon your POV.

> and it will confuse programs which use the old narrow character API.

It has no effect on the *byte* API. Characters don't come into it.

> The user (or the administrator) is responsible for matching the locale
> encoding with the terminal encoding.

Which is rather hard to do if you have multiple encodings.

> > Also, it's quite common to use non-standard encodings with terminals
> > (e.g. codepage 437, which has graphic characters beyond the ACS_* set
> > which terminfo understands).
>
> curses don't support that.

Sure it does. You pass the appropriate bytes to waddstr() etc and they
get sent to the terminal as-is. Curses doesn't have ACS_* macros for
those characters, but it doesn't mean that you can't use them.

> >> The locale encoding is the right encoding to use for conversion of the
> >> result of strerror, gai_strerror, msg member of gzip compressor state
> >> etc. When an I/O error occurs and the error code is translated to a
> >> Haskell exception and then shown to the user, why would the application
> >> need to specify the encoding and how?
> >
> > Because the application may be using multiple locales/encodings.
>
> But strerror always returns messages in the locale encoding.

Sorry, I misread that paragraph. I replied to "why would ..." without
thinking about the context.

When you know that a string is in the locale's encoding, you need to
use it for the conversion. In that case you need to do the conversion
(or at least record the actual encoding) immediately, in case the
locale gets switched.

> Just like Gtk+2 always accepts texts in UTF-8.

Unfortunately. The text probably originated in an encoding other than
UTF-8, and will probably end up getting displayed using a font which
is indexed using the original encoding (rather than e.g. UCS-2/4).
Converting to Unicode then back again just introduces the potential
for errors. [Particularly for CJK where, due to Han unification,
Chinese characters may mutate into Japanese characters, or vice-versa.
Fortunately, that doesn't seem to have started any wars. Yet.]

> For compatibility the default locale is "C", but new programs
> which are prepared for I18N should do setlocale(LC_CTYPE, "")
> and setlocale(LC_MESSAGES, "").

In practice, you end up continuously calling setlocale(LC_CTYPE, "")
and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
to be human-readable (locale-dependent) or a machine-readable format
(locale-independent, i.e. "C" locale).

> > [The most common example is printf("%f"). You need to use the C
> > locale (decimal point) for machine-readable text but the user's
> > locale (locale-specific decimal separator) for human-readable text.
>
> This is a different thing, and it is what IMHO C did wrong.

It's a different example of the same problem. I agree that C did it
wrong; I'm objecting to the implication that Haskell should make the
same mistakes.

> > This isn't directly related to encodings per se, but a good example
> > of why parameters are preferable to state.]
>
> The LC_* environment variables are the parameters for the encoding.

But they are only really "parameters" at the exec() level.

Once the program starts, the locale settings become global mutable
state. I would have thought that, more than anyone else, the
readership of this list would understand what's bad about that
concept.

> There is no other convention to pass the encoding to be used for
> textual output to stdout for example.

That's up to the application. Environment variables are a convenience;
there's no reason why you can't have a command-line switch to select
the encoding. For more complex applications, you often have
user-selectable options and/or encodings specified in the data which
you handle.

Another problem with having a single locale: if a program isn't
working, and you need to communicate with its developers, you will
often have to run the program in an English locale just so that you
will get error messages which the developers understand.

> > C libraries which use the locale do so as a last resort.
>
> No, they do it by default.

By default, libc uses the C locale. setlocale() includes a convenience
option to use the LC_* variables. Other libraries may or may not use
the locale settings, and plenty of code will misbehave if the locale
is wrong (e.g. using fprintf("%f") without explicitly setting the C
locale first will do the wrong thing if you're trying to generate
VRML/DXF/whatever files).

Beyond that, libc uses the locale mechanism because it was the
simplest way to retrofit minimal I18N onto K&R C. It also means that
most code can easily duck the issues (i.e. so you don't have to pass a
locale parameter to isupper() etc).

OTOH, if you don't want to duck the issue, global locale settings are
a nuisance.

> > The only reason that the C locale mechanism isn't a major nuisance
> > is that you can largely ignore it altogether.
>
> Then how would a Haskell program know what encoding to use for stdout
> messages?

It doesn't necessarily need to. If you are using message catalogues,
you just read bytes from the catalogue and write them to stdout. The
issue then boils down to using the correct encoding for the
catalogues; the code doesn't need to know.

> How would it know how to interpret filenames for graphical
> display?

An option menu on the file selector is one option; heuristics are
another.

Both tend to produce better results in non-trivial cases than either
of Gtk-2's choices: i.e. filenames must be either UTF-8 or must match
the locale (depending upon the G_BROKEN_FILENAMES setting), otherwise
the filename simply doesn't exist. At least Gtk-1 would attempt to
display the filename; you would get the odd question mark but at least
you could select the file; ultimately, the returned char* just gets
passed to open(), so the encoding only really matters for display.
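
A rough sketch of that "display lossily, open exactly" approach,
assuming a filename is kept as raw bytes and only converted for
display. RawName and displayName are invented names for illustration;
a real file selector would try a proper decoder before falling back to
question marks.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- | A filename exactly as the OS returned it: bytes, no encoding assumed.
newtype RawName = RawName { rawBytes :: [Word8] }

-- | Lossy, display-only rendering: printable ASCII is shown as-is and
-- every other byte becomes '?'. The raw bytes are left untouched, so
-- they can still be handed to open() unchanged.
displayName :: RawName -> String
displayName (RawName bs) = map toDisplay bs
  where
    toDisplay b
      | b >= 0x20 && b < 0x7F = chr (fromIntegral b)
      | otherwise             = '?'

main :: IO ()
main = do
  let name = RawName [0xFF, 0x2E, 0x70, 0x70, 0x6D]  -- "\377.ppm"
  putStrLn (displayName name)   -- prints "?.ppm"
  print (rawBytes name)         -- the original bytes are still available
```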

> > Code which requires real I18N can use other mechanisms, and code
> > which doesn't require any I18N can just pass byte strings around and
> > leave encoding issues to code which actually has enough context to
> > handle them correctly.
>
> Haskell can't just pass byte strings around without turning the
> Unicode support into a joke (which it is now).

If you try to pretend that I18N comes down to shoe-horning everything
into Unicode, you will turn the language into a joke.

Haskell's Unicode support is a joke because the API designers tried to
avoid the issues related to encoding with wishful thinking (i.e. you
open a file and you magically get Unicode characters out of it).

The "current locale" mechanism is just a way of avoiding the issues as
much as possible when you can't get away with avoiding them
altogether.

Unicode has been described (accurately, IMHO) as "Esperanto for
computers". Both use the same approach to try to solve essentially the
same problem. And both will be about as successful in the long run.

--
Glynn Clements <gl...@gclements.plus.com>

Glynn Clements

Mar 17, 2005, 9:20:48 PM

to Marcin 'Qrczak' Kowalczyk, haskel...@haskell.org

Marcin 'Qrczak' Kowalczyk wrote:

> > E.g. Gtk-2.x uses UTF-8 almost exclusively, although you can force the
> > use of the locale's encoding for filenames (if you have filenames in
> > multiple encodings, you lose; filenames using the "wrong" encoding
> > simply don't appear in file selectors).
>
> Actually they do appear, even though you can't type their names
> from the keyboard. The name shown in the GUI used to be escaped in
> different ways by different programs or even different places in one
> program (question marks, %hex escapes \oct escapes), but recently
> they added some functions to glib to make the behavior uniform.

In the last version of Gtk-2.x which I tried, "invalid" filenames are
just omitted from the list. Gtk-1.x displayed them (I think with
question marks, but it may have been a box).

I've just tried with a more recent version (2.6.2); the default
behaviour is similar, although you can now get around the issue by
using G_FILENAME_ENCODING=ISO-8859-1. Of course, if your locale is
a long way from ISO-8859-1, that isn't a particularly good solution.

The best test case would be a system used predominantly by Japanese,
where (apparently) it's common to have a mixture of both EUC-JP and
Shift-JIS filenames (occasionally wrapped in ISO-2022, but usually
raw).

--
Glynn Clements <gl...@gclements.plus.com>

Wolfgang Thaller

Mar 17, 2005, 11:17:01 PM

to haskel...@haskell.org, Glynn Clements

> If you try to pretend that I18N comes down to shoe-horning everything
> into Unicode, you will turn the language into a joke.

How common will those problems you are describing be by the time this
has been implemented?
How common are they even now?
I haven't yet encountered a unix box where the file names were not in
the system locale encoding. On all reasonably up-to-date Linux boxes
that I've seen recently, they were in UTF-8 (and the system locale
agreed).
On both Windows and Mac OS X, filenames are stored in Unicode, so it is
always possible to convert them to unicode.
So we can't do Unicode-based I18N because there exist a few unix
systems with messed-up file systems?

> Haskell's Unicode support is a joke because the API designers tried to
> avoid the issues related to encoding with wishful thinking (i.e. you
> open a file and you magically get Unicode characters out of it).

OK, that part is purely wishful thinking, but assuming that filenames
are text that can be represented in Unicode is wishful thinking that
corresponds to 99% of reality. So why can't the remaining 1 percent of
reality be fixed instead?

Cheers,

Wolfgang

Marcin 'Qrczak' Kowalczyk

Mar 18, 2005, 6:17:21 AM

to haskel...@haskell.org

Glynn Clements <gl...@gclements.plus.com> writes:

>> > If you provide "wrapper" functions which take String arguments,
>> > either they should have an encoding argument or the encoding should
>> > be a mutable per-terminal setting.
>>
>> There is already a mutable setting. It's called "locale".
>
> It isn't a per-terminal setting.

A separate setting would force users to configure an encoding just
for the purposes of Haskell programs, as if the configuration wasn't
already too fragmented. It's unwise to propose a new standard when an
existing standard works well enough.

>> > It is possible for curses to be used with a terminal which doesn't
>> > use the locale's encoding.
>>
>> No, it will break under the new wide character curses API,
>
> Or expose the fact that the WC API is broken, depending upon your POV.

It's the only curses API which allows writing full-screen programs in
UTF-8 mode.

>> > Also, it's quite common to use non-standard encodings with terminals
>> > (e.g. codepage 437, which has graphic characters beyond the ACS_* set
>> > which terminfo understands).
>>
>> curses don't support that.
>
> Sure it does. You pass the appropriate bytes to waddstr() etc and they
> get sent to the terminal as-is.

It doesn't support that, and it will switch the terminal mode to the "user"
encoding (which is usually ISO-8859-x) at the first opportunity, e.g. after
an ACS_* macro was used, or maybe even at initialization.

curses supports two families of encodings: the current locale encoding
and ACS. The locale encoding may be UTF-8 (works only with wide
character API).

>> For compatibility the default locale is "C", but new programs
>> which are prepared for I18N should do setlocale(LC_CTYPE, "")
>> and setlocale(LC_MESSAGES, "").
>
> In practice, you end up continuously calling setlocale(LC_CTYPE, "")
> and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
> to be human-readable (locale-dependent) or a machine-readable format
> (locale-independent, i.e. "C" locale).

I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting,
it only affects the encoding of texts emitted by gettext (including
strerror) and the meaning of isalpha, toupper etc.

>> The LC_* environment variables are the parameters for the encoding.
>
> But they are only really "parameters" at the exec() level.

This is usually the right place to specify it. It's rare that they
are even set separately for the given program - usually they are
per-system or per-user.

> Once the program starts, the locale settings become global mutable
> state. I would have thought that, more than anyone else, the
> readership of this list would understand what's bad about that
> concept.

You can treat it as immutable. Just don't call setlocale with
different arguments again.

> Another problem with having a single locale: if a program isn't
> working, and you need to communicate with its developers, you will
> often have to run the program in an English locale just so that you
> will get error messages which the developers understand.

You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.

>> Then how would a Haskell program know what encoding to use for
>> stdout messages?
>
> It doesn't necessarily need to. If you are using message catalogues,
> you just read bytes from the catalogue and write them to stdout.

gettext uses the locale to choose the encoding. Messages are
internally stored as UTF-8 but emitted in the locale encoding.

You are using the semantics I'm advocating without knowing that...

>> How would it know how to interpret filenames for graphical
>> display?
>
> An option menu on the file selector is one option; heuristics are
> another.

Heuristics won't distinguish various ISO-8859-x from each other.

An option menu on the file selector is user-unfriendly because users
don't want to configure it for each program separately. They want to
set it in one place and expect it to work everywhere.

Currently there are two such places: the locale, and
G_FILENAME_ENCODING (or older G_BROKEN_FILENAMES) for glib. It's
unwise to introduce yet another convention, and it would be a horrible
idea to make it per-program.

> At least Gtk-1 would attempt to display the filename; you would get
> the odd question mark but at least you could select the file;

Gtk+2 also attempts to display the filename. It can be opened
even though the filename has inconvertible characters escaped.

> The "current locale" mechanism is just a way of avoiding the issues
> as much as possible when you can't get away with avoiding them
> altogether.

It's a way to communicate the encoding of the terminal, filenames,
strerror, gettext etc.

> Unicode has been described (accurately, IMHO) as "Esperanto for
> computers". Both use the same approach to try to solve essentially the
> same problem. And both will be about as successful in the long run.

Unicode has no viable competition.
Esperanto had English.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Glynn Clements

Mar 18, 2005, 2:01:01 PM

to Wolfgang Thaller, haskel...@haskell.org

Wolfgang Thaller wrote:

> > If you try to pretend that I18N comes down to shoe-horning everything
> > into Unicode, you will turn the language into a joke.
>
> How common will those problems you are describing be by the time this
> has been implemented?
> How common are they even now?

Right now, GHC assumes ISO-8859-1 whenever it has to automatically
convert between String and CString. Conversions to and from ISO-8859-1
cannot fail, and encoding and decoding are exact inverses.

OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
the correct encoding, but that doesn't actually matter a lot of the
time; frequently, you're just grabbing a "blob" of data from one
function and passing it to another.
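
A minimal sketch of why the ISO-8859-1 assumption is lossless for such
"blobs": each byte maps to the code point with the same number, so
decoding cannot fail and re-encoding gives back the original bytes. The
function names are illustrative rather than GHC's internal ones, and
encodeLatin1 is only meaningful for Chars up to '\255'.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Every byte 0..255 is a valid ISO-8859-1 character, so this is total.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- | Inverse of decodeLatin1 on its range. Chars above '\255' would be
-- silently truncated here, which is where the trouble starts for
-- Strings that did not come from decodeLatin1.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

main :: IO ()
main = do
  let allBytes = [minBound .. maxBound] :: [Word8]
  -- Round trip over every possible byte value.
  print (encodeLatin1 (decodeLatin1 allBytes) == allBytes)  -- prints True
```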

The problems will only appear once you start dealing with fallible or
non-reversible encodings such as UTF-8 or ISO-2022. If and when that
happens, I guess we'll find out how common the problems are. Of
course, it's quite possible that the only test cases will be people
using UTF-8-only (or even ASCII-only) systems, in which case you won't
see any problems.

> I haven't yet encountered a unix box where the file names were not in
> the system locale encoding. On all reasonably up-to-date Linux boxes
> that I've seen recently, they were in UTF-8 (and the system locale
> agreed).

I've encountered boxes where multiple encodings were used; primarily
web and FTP servers which were shared amongst multiple clients. Each
client used whichever encoding(s) they felt like. IIRC, the most
common non-ASCII encoding was MS-DOS codepage 850 (the clients were
mostly using Windows 3.1 at that time).

I haven't done sysadmin for a while, so I don't know the current
situation, but I don't think that the world has switched to UTF-8 in
the meantime. [Most of the non-ASCII filenames which I've seen
recently have been either ISO-8859-1 or Win-12XX; I haven't seen much
UTF-8.]

> On both Windows and Mac OS X, filenames are stored in Unicode, so it is
> always possible to convert them to unicode.
> So we can't do Unicode-based I18N because there exist a few unix
> systems with messed-up file systems?

Declaring such systems to be "messed up" won't make the problems go
away. If a design doesn't work in reality, it's the fault of the
design, not of reality.

> > Haskell's Unicode support is a joke because the API designers tried to
> > avoid the issues related to encoding with wishful thinking (i.e. you
> > open a file and you magically get Unicode characters out of it).
>
> OK, that part is purely wishful thinking, but assuming that filenames
> are text that can be represented in Unicode is wishful thinking that
> corresponds to 99% of reality.
> So why can't the remaining 1 percent of reality be fixed instead?

The issue isn't whether the data can be represented as Unicode text,
but whether you can convert it to and from Unicode without problems.
To do this, you need to know the encoding, you need to store the
encoding so that you can convert the wide string back to a byte
string, and the encoding needs to be reversible.

--
Glynn Clements <gl...@gclements.plus.com>

Glynn Clements

Mar 18, 2005, 2:53:04 PM

to Marcin 'Qrczak' Kowalczyk, haskel...@haskell.org

Marcin 'Qrczak' Kowalczyk wrote:

> >> > If you provide "wrapper" functions which take String arguments,
> >> > either they should have an encoding argument or the encoding should
> >> > be a mutable per-terminal setting.
> >>
> >> There is already a mutable setting. It's called "locale".
> >
> > It isn't a per-terminal setting.
>
> A separate setting would force users to configure an encoding just
> for the purposes of Haskell programs, as if the configuration wasn't
> already too fragmented.

encoding <- localeEncoding
Curses.setupTerm encoding handle

Not a big deal.
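
The two lines above are pseudocode (Curses.setupTerm is hypothetical).
For file handles rather than curses terminals, later GHC releases offer
a comparable shape: GHC.IO.Encoding.getLocaleEncoding reads the
locale's encoding and System.IO.hSetEncoding attaches an encoding to
one handle. A small sketch, with setupHandle being a name invented
here:

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO

-- | Attach an explicitly chosen encoding to one handle and report what
-- the handle uses afterwards. (Hypothetical helper, for illustration.)
setupHandle :: TextEncoding -> Handle -> IO String
setupHandle enc h = do
  hSetEncoding h enc
  maybe "binary" show <$> hGetEncoding h

main :: IO ()
main = do
  encoding <- getLocaleEncoding      -- the default comes from the locale
  name <- setupHandle encoding stdout
  putStrLn ("stdout now uses: " ++ name)
```

The locale only supplies the default; the encoding itself is a
per-handle setting that the program can override.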

> It's unwise to propose a new standard when an existing standard
> works well enough.

Existing standard? The standard curses API deals with bytes; encodings
don't come into it. AFAIK, the wide-character curses API isn't yet a
standard.

> >> > It is possible for curses to be used with a terminal which doesn't
> >> > use the locale's encoding.
> >>
> >> No, it will break under the new wide character curses API,
> >
> > Or expose the fact that the WC API is broken, depending upon your POV.
>
> It's the only curses API which allows to write full-screen programs in
> UTF-8 mode.

All the more reason to fix it.

And where does UTF-8 come into it? I would have expected it to use
wide characters throughout.

> >> > Also, it's quite common to use non-standard encodings with terminals
> >> > (e.g. codepage 437, which has graphic characters beyond the ACS_* set
> >> > which terminfo understands).
> >>
> >> curses don't support that.
> >
> > Sure it does. You pass the appropriate bytes to waddstr() etc and they
> > get sent to the terminal as-is.
>
> It doesn't support that and it will switch the terminal mode to "user"
> encoding (which is usually ISO-8859-x) on a first occasion, e.g. after
> an ACS_* macro was used, or maybe even at initialization.
>
> curses support two families of encodings: the current locale encoding
> and ACS. The locale encoding may be UTF-8 (works only with wide
> character API).

I'm talking about standard (XSI) curses, which will just pass
printable (non-control) bytes straight to the terminal. If your
terminal uses CP437 (or some other non-standard encoding), you can
just pass the appropriate bytes to waddstr() etc and the corresponding
characters will appear on the terminal.

ACS_* codes are a completely separate issue; they allow you to use
line graphics in addition to a full 8-bit character set (e.g.
ISO-8859-1). If you only need ASCII text, you can use the other 128
codes for graphics characters and never use the ACS_* macros or the
"acsc" capability.

> >> For compatibility the default locale is "C", but new programs
> >> which are prepared for I18N should do setlocale(LC_CTYPE, "")
> >> and setlocale(LC_MESSAGES, "").
> >
> > In practice, you end up continuously calling setlocale(LC_CTYPE, "")
> > and setlocale(LC_CTYPE, "C"), depending upon whether the text is meant
> > to be human-readable (locale-dependent) or a machine-readable format
> > (locale-independent, i.e. "C" locale).
>
> > I wrote LC_CTYPE, not LC_ALL. LC_CTYPE doesn't affect %f formatting,
> it only affects the encoding of texts emitted by gettext (including
> strerror) and the meaning of isalpha, toupper etc.

Sorry, I'm confusing two cases here. With LC_CTYPE, the main reason
for continuous switching is when using wcstombs(). printf() uses
LC_NUMERIC, which is switched between the "C" locale and the user's
locale.

> > Once the program starts, the locale settings become global mutable
> > state. I would have thought that, more than anyone else, the
> > readership of this list would understand what's bad about that
> > concept.
>
> You can treat it as immutable. Just don't call setlocale with
> different arguments again.

Which limits you to a single locale. If you are using the locale's
encoding, that limits you to a single encoding.

> > Another problem with having a single locale: if a program isn't
> > working, and you need to communicate with its developers, you will
> > often have to run the program in an English locale just so that you
> > will get error messages which the developers understand.
>
> You don't need to change LC_CTYPE for that. Just set LC_MESSAGES.

I'm starting to think that you're misunderstanding on purpose. Again.

The point is that a single program often generates multiple streams of
text, possibly for different "audiences" (e.g. humans and machines).
Different streams may require different conventions (encodings,
numeric formats, collating orders), but may use the same functions.

Those functions need to obtain the conventions from somewhere, and
that means either parameters or state.

Having dealt with state (libc's locale mechanism), I would rather have
parameters.
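
To make the "parameters" option concrete, here is a small sketch in
which the numeric convention is an argument rather than global state.
NumberFormat, machineFormat, commaFormat and formatFixed are all names
invented for this example; only Numeric.showFFloat is a real library
function.

```haskell
import Numeric (showFFloat)

-- | The conventions a formatting function needs, passed explicitly.
newtype NumberFormat = NumberFormat { decimalSeparator :: Char }

-- | Convention for machine-readable output (what the "C" locale gives).
machineFormat :: NumberFormat
machineFormat = NumberFormat '.'

-- | Convention for readers who expect a decimal comma.
commaFormat :: NumberFormat
commaFormat = NumberFormat ','

-- | Fixed-point formatting with the separator taken from the argument.
formatFixed :: NumberFormat -> Int -> Double -> String
formatFixed fmt digits x = map swap (showFFloat (Just digits) x "")
  where
    swap '.' = decimalSeparator fmt
    swap c   = c

main :: IO ()
main = do
  -- Two output streams with different conventions, no state switching.
  putStrLn (formatFixed machineFormat 2 3.14159)  -- prints "3.14"
  putStrLn (formatFixed commaFormat 2 3.14159)    -- prints "3,14"
```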

> >> Then how would a Haskell program know what encoding to use for
> >> stdout messages?
> >
> > It doesn't necessarily need to. If you are using message catalogues,
> > you just read bytes from the catalogue and write them to stdout.
>
> gettext uses the locale to choose the encoding. Messages are
> internally stored as UTF-8 but emitted in the locale encoding.

It didn't use to be that way, but I can see why they would have
changed it (a single catalogue for encoding variants of a given
locale).

> >> How would it know how to interpret filenames for graphical
> >> display?
> >
> > An option menu on the file selector is one option; heuristics are
> > another.
>
> Heuristics won't distinguish various ISO-8859-x from each other.

You treat the locale's encoding as a heuristic. If it looks like
ISO-8859-x, and the locale's encoding is ISO-8859-x, you use that. If
it looks like Shift-JIS, you don't complain and give up just because
the locale is UTF-8.
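
A simplified sketch of that kind of heuristic, restricted to telling
UTF-8 apart from "something else". It assumes encodings are identified
by plain name strings, and guessEncoding and its fallback argument are
invented for this example. The UTF-8 check is structural only (it does
not reject overlong forms or surrogates).

```haskell
import Data.Word (Word8)

-- | Structural UTF-8 check: each lead byte must be followed by the
-- right number of continuation bytes.
looksLikeUtf8 :: [Word8] -> Bool
looksLikeUtf8 [] = True
looksLikeUtf8 (b:bs)
  | b < 0x80              = looksLikeUtf8 bs
  | b >= 0xC2 && b < 0xE0 = continue 1 bs
  | b >= 0xE0 && b < 0xF0 = continue 2 bs
  | b >= 0xF0 && b < 0xF5 = continue 3 bs
  | otherwise             = False
  where
    continue n rest =
      let (cont, rest') = splitAt n rest
      in length cont == n && all isCont cont && looksLikeUtf8 rest'
    isCont c = c >= 0x80 && c < 0xC0

-- | Use the locale's encoding as a hint, not as a verdict.
--   localeName: encoding named by the locale
--   fallback:   what to assume when the locale says UTF-8 but the
--               bytes are not valid UTF-8
guessEncoding :: String -> String -> [Word8] -> String
guessEncoding localeName fallback bytes
  | any (>= 0x80) bytes && looksLikeUtf8 bytes = "UTF-8"
  | localeName /= "UTF-8"                      = localeName
  | all (< 0x80) bytes                         = localeName
  | otherwise                                  = fallback

main :: IO ()
main = do
  -- "\377.ppm" under a UTF-8 locale: not valid UTF-8, so fall back.
  putStrLn (guessEncoding "UTF-8" "ISO-8859-1" [0xFF, 0x2E, 0x70, 0x70, 0x6D])
  -- Valid two-byte UTF-8 sequence under an ISO-8859-2 locale.
  putStrLn (guessEncoding "ISO-8859-2" "ISO-8859-1" [0xC3, 0xA9])
```

A real implementation would need more candidates (Shift-JIS, EUC-JP and
so on) and would still get some cases wrong, which is where a manual
override comes in.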

> An option menu on the file selector is user-unfriendly because users
> don't want to configure it for each program separately. They want to
> set it in one place and expect it to work everywhere.

Nothing will work everywhere. An option menu allows the user to force
the encoding for individual cases when whatever other mechanism(s) you
use get it wrong.

I've needed to use Mozilla's "View -> Character Encoding" menu enough
times when the browser's guess turned out to be wrong (and blindly
honouring the charset specified by HTTP's Content-Type: or HTML's META
tags would be a disaster).

> > At least Gtk-1 would attempt to display the filename; you would get
> > the odd question mark but at least you could select the file;
>
> Gtk+2 also attempts to display the filename. It can be opened
> even though the filename has inconvertible characters escaped.

This isn't my experience; I just get messages like:

Gtk-Message: The filename "\377.ppm" couldn't be converted to UTF-8. (try setting the environment variable G_FILENAME_ENCODING): Invalid byte sequence in conversion input

and the filename is omitted altogether.

> > The "current locale" mechanism is just a way of avoiding the issues
> > as much as possible when you can't get away with avoiding them
> > altogether.
>
> It's a way to communicate the encoding of the terminal, filenames,
> strerror, gettext etc.

It's *a* way, but it's not a very good way. It sucks when you can't
apply a single convention to everything.

> > Unicode has been described (accurately, IMHO) as "Esperanto for
> > computers". Both use the same approach to try to solve essentially the
> > same problem. And both will be about as successful in the long run.
>
> Unicode has no viable competition.

There are two viable alternatives. Byte strings with associated
encodings and ISO-2022. In CJK environments, ISO-2022 is still far
more widespread than UTF-8, and will likely remain so for the
foreseeable future. And byte strings with associated encodings are
probably still the most common of all.

--
Glynn Clements <gl...@gclements.plus.com>

Wolfgang Thaller

Mar 19, 2005, 1:10:38 AM

to Glynn Clements, haskel...@haskell.org

Glynn Clements wrote:

> OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
> the correct encoding, but that doesn't actually matter a lot of the
> time; frequently, you're just grabbing a "blob" of data from one
> function and passing it to another.

Yes. Of course, this also means that Strings representing non-ASCII
filenames will *always* be nonsense on Mac OS X and other UTF8-based
platforms.

> The problems will only appear once you start dealing with fallible or
> non-reversible encodings such as UTF-8 or ISO-2022.

In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
file name that is converted to Unicode cannot be converted back any
more (assuming you know for sure that it was ISO-2022 in the first
place)?

> Of course, it's quite possible that the only test cases will be people
> using UTF-8-only (or even ASCII-only) systems, in which case you won't
> see any problems.

I'm kind of hoping that we can just ignore a problem that is so rare
that a large and well-known project like GTK2 can get away with
ignoring it. Also, IIRC, Java strings are supposed to be unicode, too -
how do they deal with the problem?

>> So we can't do Unicode-based I18N because there exist a few unix
>> systems with messed-up file systems?
>
> Declaring such systems to be "messed up" won't make the problems go
> away. If a design doesn't work in reality, it's the fault of the
> design, not of reality.

In general, yes. But we're not talking about all of reality here, we're
talking about one small part of reality - the question is, can the part
of reality where the design doesn't work be ignored?

For example, as soon as we use any kind of path names in our APIs, we
are ignoring reality on good old "Classic" Mac OS (may it rest in
peace). Path names don't always uniquely denote a file there (although
they do most of the time). People writing cross-platform software have
been ignoring this fact for a long time now.

I think that if we wait long enough, the filename encoding problems
will become irrelevant and we will live in an ideal world where unicode
actually works. Maybe next year, maybe only in ten years. And while we
are arguing about how far we are from that ideal world, we should think
about alternatives. The current hack is really just a hack, and I don't
want to see this hack become the new accepted standard.

Do we have other alternatives? Preferably something that provides other
advantages over a unicode String than just making things work on
systems that many users never encounter, otherwise almost no one will
bother to use it. So maybe we should start looking for _other_ reasons
to represent file names and paths by an abstract datatype or something?

Cheers,

Wolfgang

Einar Karttunen

Mar 19, 2005, 4:34:54 AM

to Wolfgang Thaller, Glynn Clements, haskel...@haskell.org

Wolfgang Thaller <wolfgang...@gmx.net> writes:
> In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
> file name that is converted to Unicode cannot be converted back any
> more (assuming you know for sure that it was ISO-2022 in the first
> place)?

I am no expert on ISO-2022 so the following may contain errors,
please correct if it is wrong.

ISO-2022 -> Unicode is always possible.
Also Unicode -> ISO-2022 should be always possible, but is a relation
not a function. This means there are an infinite? ways of encoding a
particular unicode string in ISO-2022.

ISO-2022 works by providing escape sequences to switch between different
character sets. One can freely use these escapes in almost any way you
wish. Also ISO-2022 makes a difference between the same character in
japanese/chinese/korean - which unicode does not do.

See here for more info on the topic:
http://www.ecma-international.org/publications/files/ecma-st/ECMA-035.pdf

Also trusting system locale for everything is problematic and makes
things quite unbearable for I18N. e.g. on my desktop 95% of things run
with iso-8859-1, 3% of things use utf-8 and a few apps use EUC-JP...

Using filenames as opaque blobs causes the least problems. If the
program wishes to display them in a graphical environment then they have
to be converted to a string, but very many apps never display the
filenames...

- Einar Karttunen

Marcin 'Qrczak' Kowalczyk

Mar 19, 2005, 6:56:14 AM

to haskel...@haskell.org

Glynn Clements <gl...@gclements.plus.com> writes:

>> A separate setting would force users to configure an encoding just
>> for the purposes of Haskell programs, as if the configuration wasn't
>> already too fragmented.
>
> encoding <- localeEncoding
> Curses.setupTerm encoding handle

In a properly configured system curses is always supposed to be used
like this. That is, it can as well use the locale encoding directly,
without complicating the API.

I don't want to force bindings to be implemented like this, but to allow
it, because it's a good default.

>> It's unwise to propose a new standard when an existing standard
>> works well enough.
>
> Existing standard? The standard curses API deals with bytes; encodings
> don't come into it. AFAIK, the wide-character curses API isn't yet a
> standard.

It's described in Single Unix Spec along with the narrow character
version (but in an earlier version; the newest version doesn't
describe curses at all).

But I meant a standard for communicating the encoding of the terminal
to programs. If programs are supposed to check the locale to determine
that, it can be done automatically by bindings to readline & curses.

>> > Or expose the fact that the WC API is broken, depending upon your POV.
>>
>> It's the only curses API which allows to write full-screen programs in
>> UTF-8 mode.
>
> All the more reason to fix it.
>
> And where does UTF-8 come into it? I would have expected it to use
> wide characters throughout.

The wide character API works with any encoding.

The narrow character API works only with encodings where one byte
corresponds to one character.

(In the wide character API wchar_t doesn't have to correspond to one
character cell; combining characters are attached to base characters,
and some characters are double-wide.)

> I'm talking about standard (XSI) curses, which will just pass
> printable (non-control) bytes straight to the terminal. If your
> terminal uses CP437 (or some other non-standard encoding), you can
> just pass the appropriate bytes to waddstr() etc and the corresponding
> characters will appear on the terminal.

Which terminal uses CP437?

Linux console doesn't, except temporarily after switching the mapping
to builtin CP437 (but this state is not used by curses) or after
loading CP437 as the user map (nobody does this, and it won't work
properly with all characters from the range 0x80-0x9F anyway).

>> You can treat it as immutable. Just don't call setlocale with
>> different arguments again.
>
> Which limits you to a single locale. If you are using the locale's
> encoding, that limits you to a single encoding.

There is no support for changing the encoding of a terminal on the fly
by programs running inside it.

> The point is that a single program often generates multiple streams of
> text, possibly for different "audiences" (e.g. humans and machines).
> Different streams may require different conventions (encodings,
> numeric formats, collating orders), but may use the same functions.

A single program has a single stdout and a single filesystem. The
contexts which use the locale encoding don't need multiple encodings.

Multiple encodings are needed e.g. for exchanging data with other
machines over the network, for reading contents of text files after the
user has specified an encoding explicitly etc. In these cases an API
with explicitly provided encoding should be used.

>> Gtk+2 also attempts to display the filename. It can be opened
>> even though the filename has inconvertible characters escaped.
>
> This isn't my experience; I just get messages like:
>
> Gtk-Message: The filename "\377.ppm" couldn't be converted to UTF-8.
> (try setting the environment variable G_FILENAME_ENCODING): Invalid
> byte sequence in conversion input
>
> and the filename is omitted altogether.

Works for me, e.g. in gedit-2.8.2. The filename is displayed with
escapes like \377 and can be opened.

>> > The "current locale" mechanism is just a way of avoiding the issues
>> > as much as possible when you can't get away with avoiding them
>> > altogether.
>>
>> It's a way to communicate the encoding of the terminal, filenames,
>> strerror, gettext etc.
>
> It's *a* way, but it's not a very good way. It sucks when you can't
> apply a single convention to everything.

It's not so bad as to justify inventing our own conventions and forcing
users to configure the encoding of Haskell programs separately.

>> Unicode has no viable competition.
>
> There are two viable alternatives. Byte strings with associated
> encodings and ISO-2022.

ISO-2022 is an insanely complicated brain-damaged mess. I know it's
being used in some parts of the world, but the sooner it will die,
the better.

Byte strings with associated encodings coexist with Unicode and are
being slowly replaced by it, by using UTF-8 as the encoding more
often.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

David Roundy

Mar 19, 2005, 7:14:54 AM

to haskel...@haskell.org

On Sat, Mar 19, 2005 at 12:55:54PM +0100, Marcin 'Qrczak' Kowalczyk wrote:
> Glynn Clements <gl...@gclements.plus.com> writes:

> > The point is that a single program often generates multiple streams of
> > text, possibly for different "audiences" (e.g. humans and machines).
> > Different streams may require different conventions (encodings,
> > numeric formats, collating orders), but may use the same functions.
>
> A single program has a single stdout and a single filesystem. The
> contexts which use the locale encoding don't need multiple encodings.

That's not true, there could be many filesystems, each of which uses a
different encoding for the filenames. In the case of removable media, this
scenario isn't even unlikely.
--
David Roundy
http://www.darcs.net

Glynn Clements

Mar 19, 2005, 9:33:13 AM

to Einar Karttunen, Wolfgang Thaller, haskel...@haskell.org

Einar Karttunen wrote:

> > In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
> > file name that is converted to Unicode cannot be converted back any
> > more (assuming you know for sure that it was ISO-2022 in the first
> > place)?
>
> I am no expert on ISO-2022 so the following may contain errors,
> please correct if it is wrong.
>
> ISO-2022 -> Unicode is always possible.
> Also Unicode -> ISO-2022 should be always possible, but is a relation
> not a function. This means there are an infinite? ways of encoding a
> particular unicode string in ISO-2022.
>
> ISO-2022 works by providing escape sequences to switch between different
> character sets. One can freely use these escapes in almost any way you
> wish.

Exactly.

Moreover, while there are an infinite number of equivalent
representations in theory (you can add as many redundant switching
sequences as you wish), there are multiple "plausible" equivalent
representations in practice.
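
The point about redundant switching sequences can be made concrete with a
small sketch. This is a toy decoder, not a real ISO-2022 implementation: it
assumes only two designations (ESC ( B for ASCII and ESC ( J for JIS X 0201
Roman, as used in ISO-2022-JP), and the names decodeToy, plain, redundant
and viaRoman are made up for the illustration. Decoding is a function, but
several distinct byte strings map to the same text, so no encoder can
reproduce the original bytes in general.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- The two character sets this toy decoder can switch between.
data Charset = Ascii | JisRoman

-- Decode a toy subset of ISO-2022-JP. Escape sequences only change the
-- decoder state; they produce no characters of their own.
decodeToy :: [Word8] -> String
decodeToy = go Ascii
  where
    go :: Charset -> [Word8] -> String
    go _ (0x1B : 0x28 : 0x42 : rest) = go Ascii rest     -- ESC ( B
    go _ (0x1B : 0x28 : 0x4A : rest) = go JisRoman rest  -- ESC ( J
    go cs (b : rest) = toChar cs b : go cs rest
    go _ [] = []

    -- JIS X 0201 Roman differs from ASCII in two positions.
    toChar :: Charset -> Word8 -> Char
    toChar JisRoman 0x5C = '\x00A5'  -- yen sign instead of backslash
    toChar JisRoman 0x7E = '\x203E'  -- overline instead of tilde
    toChar _ b = chr (fromIntegral b)

-- Three different byte strings for the same two-letter text.
plain, redundant, viaRoman :: [Word8]
plain     = [0x61, 0x62]
redundant = [0x1B, 0x28, 0x42, 0x61, 0x1B, 0x28, 0x4A, 0x1B, 0x28, 0x42, 0x62]
viaRoman  = [0x1B, 0x28, 0x4A, 0x61, 0x62, 0x1B, 0x28, 0x42]
```

All three byte strings decode to the same String, so a filename that goes
bytes -> String -> bytes may come back as a different name on disk.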

--
Glynn Clements <gl...@gclements.plus.com>

Glynn Clements

Mar 19, 2005, 10:04:22 AM

to Marcin 'Qrczak' Kowalczyk, haskel...@haskell.org

Marcin 'Qrczak' Kowalczyk wrote:

> > I'm talking about standard (XSI) curses, which will just pass
> > printable (non-control) bytes straight to the terminal. If your
> > terminal uses CP437 (or some other non-standard encoding), you can
> > just pass the appropriate bytes to waddstr() etc and the corresponding
> > characters will appear on the terminal.
>
> Which terminal uses CP437?

Most software terminal emulators can use any encoding. Traditional
comms packages tend to support this (including their own "VGA" font if
necessary) because of its widespread use on BBSes which were targeted
at MS-DOS systems.

There exist hardware terminals (I can't name specific models, but I
have seen them in use) which support this, specifically for use with
MS-DOS systems.

> Linux console doesn't, except temporarily after switching the mapping
> to builtin CP437 (but this state is not used by curses) or after
> loading CP437 as the user map (nobody does this, and it won't work
> properly with all characters from the range 0x80-0x9F anyway).

I *still* encounter programs written for the linux console which
assume that the built-in CP437 font is being used (if you use an
ISO-8859-1 font, you get dialogs with accented characters where you
would expect line-drawing characters).

> >> You can treat it as immutable. Just don't call setlocale with
> >> different arguments again.
> >
> > Which limits you to a single locale. If you are using the locale's
> > encoding, that limits you to a single encoding.
>
> There is no support for changing the encoding of a terminal on the fly
> by programs running inside it.

If you support multiple terminals with different encodings, and the
library uses the global locale settings to determine the encoding, you
need to switch locale every time you write to a different terminal.

> > The point is that a single program often generates multiple streams of
> > text, possibly for different "audiences" (e.g. humans and machines).
> > Different streams may require different conventions (encodings,
> > numeric formats, collating orders), but may use the same functions.
>
> A single program has a single stdout and a single filesystem. The
> contexts which use the locale encoding don't need multiple encodings.
>
> Multiple encodings are needed e.g. for exchanging data with other
> machines over the network, for reading contents of text files after the
> user has specified an encoding explicitly etc. In these cases an API
> with explicitly provided encoding should be used.

An API which is used for reading and writing text files or sockets is
just as applicable to stdin/stdout.

> >> > The "current locale" mechanism is just a way of avoiding the issues
> >> > as much as possible when you can't get away with avoiding them
> >> > altogether.
> >>
> >> It's a way to communicate the encoding of the terminal, filenames,
> >> strerror, gettext etc.
> >
> > It's *a* way, but it's not a very good way. It sucks when you can't
> > apply a single convention to everything.
>
> It's not so bad as to justify inventing our own conventions and forcing
> users to configure the encoding of Haskell programs separately.

I'm not suggesting inventing conventions. I'm suggesting leaving such
issues to the application programmer who, unlike the library
programmer, probably has enough context to be able to reliably
determine the correct encoding in any specific instance.

> >> Unicode has no viable competition.
> >
> > There are two viable alternatives. Byte strings with associated
> > encodings and ISO-2022.
>
> ISO-2022 is an insanely complicated brain-damaged mess. I know it's
> being used in some parts of the world, but the sooner it will die,
> the better.

ISO-2022 has advantages and disadvantages relative to UTF-8. I don't
want to go on about the specifics here because they aren't
particularly relevant. What's relevant is that it isn't likely to
disappear any time soon.

A large part of the world already has a universal encoding which works
well enough; they don't *need* UTF-8, and aren't going to rebuild
their IT infrastructure from scratch for the sake of it.

--
Glynn Clements <gl...@gclements.plus.com>

Keean Schupke

Mar 19, 2005, 11:13:42 AM

to David Roundy, haskel...@haskell.org

David Roundy wrote:

>That's not true, there could be many filesystems, each of which uses a
>different encoding for the filenames. In the case of removable media, this
>scenario isn't even unlikely.
>
>

I agree - I can quite easily see the situation occurring where a student
(say from japan) brings in a zip-disk or USB key formatted with a
japanese filename encoding, that I need to read on my computer (with a
UK locale).

Also can different windows have different encodings? I might have a web
browser (written in haskell?) running and have windows with several
different encodings open at the same time, whilst saving things on
filesystems with differing encodings.

Keean.

Mark Carroll

Mar 19, 2005, 11:27:31 AM

to haskel...@haskell.org

On Sat, 19 Mar 2005, David Roundy wrote:

> That's not true, there could be many filesystems, each of which uses a
> different encoding for the filenames. In the case of removable media, this
> scenario isn't even unlikely.

The nearest desktop machine to me right now has in its directory structure
filesystems that use different encodings. So, yes, it's probably not all
that rare.

Mark.

--
Haskell vacancies in Columbus, Ohio, USA: see http://www.aetion.com/jobs.html

Glynn Clements

Mar 19, 2005, 12:35:30 PM

to Wolfgang Thaller, haskel...@haskell.org

Wolfgang Thaller wrote:

> > Of course, it's quite possible that the only test cases will be people
> > using UTF-8-only (or even ASCII-only) systems, in which case you won't
> > see any problems.
>
> I'm kind of hoping that we can just ignore a problem that is so rare
> that a large and well-known project like GTK2 can get away with
> ignoring it.

1. The filename issues in GTK-2 are likely to be a major problem in
CJK locales, where filenames which don't match the locale (which is
seldom UTF-8) are common.

2. GTK's filename handling only really applies to file selector
dialogs. Most other uses of filenames in a GTK-based application don't
involve GTK; they use the OS API functions which just deal with byte
strings.

3. GTK is a GUI library. Most of the text which it deals with is going
to be rendered, so it *has* to be interpreted as characters. Treating
it as blobs of data won't work. IOW, on the question of whether or not
to interpret byte strings as character strings, GTK is at the far end
of the scale.

> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Files are represented by instances of the File class:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

An abstract representation of file and directory pathnames.

You can construct Files from Strings, and convert Files to Strings.

The File class includes two sets of directory enumeration methods:
list() returns an array of Strings, while listFiles() returns an array
of Files.

The documentation for the File class doesn't mention encoding issues
at all. However, with that interface, it would be possible to
enumerate and open filenames which cannot be decoded.

> >> So we can't do Unicode-based I18N because there exist a few unix
> >> systems with messed-up file systems?
> >
> > Declaring such systems to be "messed up" won't make the problems go
> > away. If a design doesn't work in reality, it's the fault of the
> > design, not of reality.
>
> In general, yes. But we're not talking about all of reality here, we're
> talking about one small part of reality - the question is, can the part
> of reality where the design doesn't work be ignored?

Sure, you *can* ignore it; K&R C ignored everything other than ASCII.
If you limit yourself to locales which use the Roman alphabet (i.e.
ISO-8859-N for N=1/2/3/4/9/15), you can get away with a lot.

Most such users avoid encoding issues altogether by dropping the
accents and sticking to ASCII, at least when dealing with files which
might leave their system.

To get a better idea, you would need to consult users whose language
doesn't use the roman alphabet, e.g. CJK or cyrillic. Unfortunately,
you don't usually find too many of them on lists such as this.

I'm only familiar with one OSS project which has a sizeable CJK user
base, and that's XEmacs (whose I18N revolves around ISO-2022, and most
of the documentation is in Japanese). Even there, there are separate
mailing lists for English and Japanese, and the two seldom
communicate.

> I think that if we wait long enough, the filename encoding problems
> will become irrelevant and we will live in an ideal world where unicode
> actually works. Maybe next year, maybe only in ten years.

Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually
the least eager; you are more likely to find UTF-8 in an
English-language HTML page or email message than a Japanese one.

--
Glynn Clements <gl...@gclements.plus.com>

Marcin 'Qrczak' Kowalczyk

Mar 19, 2005, 1:18:51 PM

to haskel...@haskell.org

Wolfgang Thaller <wolfgang...@gmx.net> writes:

> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Java (Sun)
----------

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.
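
The replacement policy described above (U+FFFD on the way in, "?" on the
way out) can be sketched in Haskell. This is only an illustration: it
assumes an ASCII-only locale encoding, and decodeLossy and encodeLossy are
hypothetical names, not functions from any library.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Bytes -> String: undecodable bytes become U+FFFD (REPLACEMENT CHARACTER).
decodeLossy :: [Word8] -> String
decodeLossy = map toChar
  where
    toChar b
      | b < 0x80  = chr (fromIntegral b)
      | otherwise = '\xFFFD'

-- String -> bytes: unencodable characters become '?' (0x3F).
encodeLossy :: String -> [Word8]
encodeLossy = map toByte
  where
    toByte c
      | ord c < 0x80 = fromIntegral (ord c)
      | otherwise    = 0x3F
```

With these definitions, encodeLossy (decodeLossy [0x31, 0xA2]) gives
[0x31, 0x3F], not the original bytes, so a name read from a directory
listing can no longer be used to open the file it came from.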

Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8.
Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by "?".

C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.
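
The "try a list of encodings in order" idea can be sketched as follows.
This is not Mono's code: Decoder, asciiStrict, latin1 and decodeWith are
hypothetical names, and the two toy decoders stand in for whatever
encodings the list would really contain.

```haskell
import Data.Char (chr)
import Data.Maybe (listToMaybe, mapMaybe)
import Data.Word (Word8)

-- A decoder either succeeds on the whole byte string or fails.
type Decoder = [Word8] -> Maybe String

-- Strict ASCII: fails if any byte has the high bit set.
asciiStrict :: Decoder
asciiStrict bs
  | all (< 0x80) bs = Just (map (chr . fromIntegral) bs)
  | otherwise       = Nothing

-- ISO-8859-1: every byte maps to the code point with the same value,
-- so this decoder never fails.
latin1 :: Decoder
latin1 = Just . map (chr . fromIntegral)

-- Try each decoder in order and keep the first success.
decodeWith :: [Decoder] -> [Word8] -> Maybe String
decodeWith decoders bs = listToMaybe (mapMaybe ($ bs) decoders)
```

A never-failing decoder such as latin1 only makes sense at the end of the
list, since nothing after it is ever consulted.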

a) Interpreting. If a filename cannot be converted, it's skipped in
a directory listing.

The documentation says that if a filename, a command line argument
etc. looks like valid UTF-8, it is treated as such first, and
MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
The reality seems to not match this (mono-1.0.5).

b) Creating. If UTF-8 is used, U+0000 throws an exception
(System.ArgumentException: Path contains invalid chars), paired
surrogates are treated correctly, and an isolated surrogate causes
an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, unpaired surrogates are converted to pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.

--
__("< Marcin Kowalczyk
\__/ qrc...@knm.org.pl
^^ http://qrnik.knm.org.pl/~qrczak/

Ian Lynagh

Mar 19, 2005, 2:14:42 PM

to haskel...@haskell.org, 299...@bugs.debian.org

On Wed, Mar 16, 2005 at 11:55:18AM +0000, Ross Paterson wrote:
> On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
> > Do you have a list of functions which behave differently in the new
> > release to how they did in the previous release?
> > (I'm not interested in changes that will affect only whether something
> > compiles, not how it behaves given it compiles both before and after).
>
> I got lost in the negatives here. It affects all Haskell 98 primitives
> that do character I/O, or that exchange C strings with the C library.

In the below, it looks like there is a bug in getDirectoryContents.

Also, the error from w.hs is going to stdout, not stderr.

Most importantly, though: is there any way to remove this file without
doing something like an FFI import of unlink?

Is there anything LC_CTYPE can be set to that will act like C/POSIX but
accept 8-bit bytes as chars too?

(in the POSIX locale)
$ echo 'import Directory; main = getDirectoryContents "." >>= print' > q.hs
$ runhugs q.hs
[".","..","q.hs"]
$ touch 1`printf "\xA2"`
$ runhugs q.hs
runhugs: Error occurred

ERROR - Garbage collection fails to reclaim sufficient space

$ echo 'import Directory; main = removeFile "1\xA2"' > w.hs
$ runhugs w.hs

Program error: 1?: Directory.removeFile: does not exist (file does not exist)
$ strace -o strace.out runhugs w.hs > /dev/null
$ grep unlink strace.out | head -c 14 | hexdump -C
00000000 75 6e 6c 69 6e 6b 28 22 31 3f 22 29 20 20 |unlink("1?") |
0000000e
$ strace -o strace2.out rm 1*
$ grep unlink strace2.out | head -c 14 | hexdump -C
00000000 75 6e 6c 69 6e 6b 28 22 31 a2 22 29 20 20 |unlink("1.") |
0000000e
$

Now consider this e.hs:

--------------------
import IO

main = do hWaitForInput stdin 10000
          putStrLn "Input is ready"
          r <- hReady stdin
          print r
          c <- hGetChar stdin
          print c
          putStrLn "Done!"
--------------------

$ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
Input is ready
True

Program error: <stdin>: IO.hGetChar: protocol error (invalid character encoding)
$

It takes 30 seconds for this error to be printed. This shows two issues:
First of all, I think you should be giving an error as soon as you have
a prefix that is the start of no character. Second, hReady now only
guarantees hGetChar won't block on a binary mode handle, but I guess
there is not much we can do except document that (short of some hideous
hacks).

Thanks
Ian

Wolfgang Thaller

Mar 19, 2005, 6:56:37 PM

to Glynn Clements, haskel...@haskell.org

>> Also, IIRC, Java strings are supposed to be unicode, too -
>> how do they deal with the problem?
>
> Files are represented by instances of the File class:

> [...]

> The documentation for the File class doesn't mention encoding issues
> at all.

... which led me to conclude that they don't deal with the problem
properly.

>> I think that if we wait long enough, the filename encoding problems
>> will become irrelevant and we will live in an ideal world where
>> unicode
>> actually works. Maybe next year, maybe only in ten years.
>
> Maybe not even then. If Unicode really solved encoding problems, you'd
> expect the CJK world to be the first adopters, but they're actually
> the least eager; you are more likely to find UTF-8 in an
> English-language HTML page or email message than a Japanese one.

Hmm, that's possibly because english-language users can get away with
just marking their ASCII files as UTF-8. But I'm not arguing files or
HTML pages here, I'm only concerned with filenames. I prefer unicode
nowadays because I was born within a hundred kilometers of the "border"
between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language
texts, but as soon as I write about where I went for vacation, I need a
few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody
ever tried to sell ISO-2022 to me, so unicode was the only alternative.

So you've now convinced me that there is a considerable number of
computers using ISO-2022, where there's more than one way to encode the
same text (how do people use this from the command line??). There are
also multi-user systems where the users don't agree on a single
encoding. I still reserve the right to call those systems messed-up,
but that's just my personal opinion and "reality" couldn't care less
about what I think.

So, as I don't want to stick with the status quo forever (lists of
bytes that pretend to be lists of unicode chars, even on platforms
where unicode is used anyway), how about we get to work - what do we
want?

I don't think we want a type class here, a plain (abstract) data type
will do:

> data File

Obviously, we'll need conversion from and to C strings. On Mac OS X,
they'd be guaranteed to be in UTF-8.

> withFilePathCString :: String -> (CString -> IO a) -> IO a
> fileFromCString :: CString -> IO File

We will need functions for converting to and from unicode strings. I'm
pretty sure that we want to keep those functions pure, otherwise
they'll be very annoying to use.

> fileFromPath :: String -> File

Any impure operations that might be needed to decide how to encode the
file name will have to be delayed until the File is actually used.

> fileToPath :: File -> String

Same here: any impure operation necessary to convert the File to a
unicode string needs to be done when the file is created.

What about failure? If you go from String to File, errors should be
reported when you actually access the file. At an earlier time, you
can't know whether the file name is valid (e.g. if you mount a
"classic" HFS volume on Mac OS X, you can only create files there whose
names can be represented in the volume's file name encoding - but you
only find that out once you try to create a file).

For going from File to String, I'm not so sure, but I would be very
annoyed if I had to deal with a Maybe String return type on platforms
where it will always succeed. Maybe there should be separate functions
for different purposes - i.e. for display, you'd use a File -> String
function that will silently use '?'s when things can't be decoded, but
in other situations you might use a File -> Maybe String function and
check for Nothing.
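
A minimal sketch of how such an abstract type might look, under loud
assumptions: the File value simply keeps the raw bytes, only ASCII bytes
count as decodable, and the names fileFromBytes, fileToBytes,
fileToPathMaybe and fileToDisplayString are invented here for illustration
(they are not the functions proposed above, which would also have to
consult the platform's encoding).

```haskell
import Data.Char (chr)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- Hypothetical representation: the name exactly as the OS stores it.
newtype File = File [Word8] deriving (Eq, Show)

fileFromBytes :: [Word8] -> File
fileFromBytes = File

fileToBytes :: File -> [Word8]
fileToBytes (File bs) = bs

-- Assumption for the sketch: only ASCII bytes can be decoded.
decodeByte :: Word8 -> Maybe Char
decodeByte b
  | b < 0x80  = Just (chr (fromIntegral b))
  | otherwise = Nothing

-- Strict conversion: Nothing if any byte cannot be decoded.
fileToPathMaybe :: File -> Maybe String
fileToPathMaybe (File bs) = mapM decodeByte bs

-- Conversion for display: never fails, undecodable bytes become '?'.
fileToDisplayString :: File -> String
fileToDisplayString (File bs) = map (fromMaybe '?' . decodeByte) bs
```

Because the File keeps the original bytes, a program can show "1?" to the
user and still pass the exact name back to the OS.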

If people want to implement more sophisticated ways of decoding file
names than can be provided by the library, they'd get the C string and
do the same things.

Of course, there should also be lots of other useful functions that
make it more or less unnecessary to deal with path names directly in
most cases.

Thoughts?

Cheers,

Wolfgang

ro...@soi.city.ac.uk

Mar 19, 2005, 8:34:32 PM

to 299...@bugs.debian.org, haskel...@haskell.org

On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
> In the below, it looks like there is a bug in getDirectoryContents.

Yes, now fixed in CVS.

> Also, the error from w.hs is going to stdout, not stderr.

It's a nuisance, but no one has got around to changing it.

> Most importantly, though: is there any way to remove this file without
> doing something like an FFI import of unlink?
>
> Is there anything LC_CTYPE can be set to that will act like C/POSIX but
> accept 8-bit bytes as chars too?

en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
behaviour (and the GHC behaviour).
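
The reason an ISO-8859-1 locale accepts every byte is that the encoding
maps each byte to the code point with the same number, so the conversion
cannot fail and round-trips exactly. A minimal sketch (bytesToString and
stringToBytes are illustrative names, not library functions):

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1 decoding: byte n becomes the character with code point n.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)

-- The inverse for characters up to U+00FF. In this sketch anything
-- above that is silently truncated to its low 8 bits.
stringToBytes :: String -> [Word8]
stringToBytes = map (fromIntegral . ord)
```

Every byte string survives bytes -> String -> bytes unchanged, even though
the intermediate String is only meaningful as text if the data really was
ISO-8859-1.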

Indeed it's possible to have filenames (under POSIX, anyway) that H98
programs can't touch (under Hugs). That pretty much follows from
the Haskell definition FilePath = String. The other thread under this
subject has touched on the need for an (additional) API using an abstract
FilePath type.

> Now consider this e.hs:
>
> --------------------
> import IO
>
> main = do hWaitForInput stdin 10000
>           putStrLn "Input is ready"
>           r <- hReady stdin
>           print r
>           c <- hGetChar stdin
>           print c
>           putStrLn "Done!"
> --------------------
>
> $ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
> Input is ready
> True
>
> Program error: <stdin>: IO.hGetChar: protocol error (invalid character encoding)
> $
>
> It takes 30 seconds for this error to be printed. This shows two issues:
> First of all, I think you should be giving an error as soon as you have
> a prefix that is the start of no character. Second, hReady now only
> guarantees hGetChar won't block on a binary mode handle, but I guess
> there is not much we can do except document that (short of some hideous
> hacks).

Yes, I don't see how to avoid this when using mbtowc() to do the
conversion: it makes no distinction between a bad byte sequence and an
incomplete one.
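
For comparison, here is a sketch of a decoder front end that does keep
"bad" and "incomplete" apart, for UTF-8 only. It is a simplified
illustration rather than what Hugs does: Prefix and classify are
hypothetical names, and the check does not reject overlong three- and
four-byte forms or encoded surrogates.

```haskell
import Data.Word (Word8)

-- What the bytes seen so far amount to.
data Prefix
  = Complete Int  -- a whole character, occupying this many bytes
  | Incomplete    -- could still become a character if more bytes arrive
  | Invalid       -- can never become a character
  deriving (Eq, Show)

-- Classify the start of a UTF-8 byte stream.
classify :: [Word8] -> Prefix
classify [] = Incomplete
classify (b : rest)
  | b < 0x80               = Complete 1
  | b >= 0xC2 && b <= 0xDF = conts 1
  | b >= 0xE0 && b <= 0xEF = conts 2
  | b >= 0xF0 && b <= 0xF4 = conts 3
  | otherwise              = Invalid
  where
    -- Expect n continuation bytes after the lead byte.
    conts n = check n (take n rest)
    check n bs
      | any (not . isCont) bs = Invalid
      | length bs < n         = Incomplete
      | otherwise             = Complete (n + 1)
    isCont c = c >= 0x80 && c <= 0xBF
```

In the test case above, a single 0xC2 is Incomplete, but 0xC2 followed by
another 0xC2 is already Invalid, so the error could be reported without
waiting for more input.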

John Meacham

Mar 19, 2005, 9:26:01 PM

to haskel...@haskell.org

On Sat, Mar 19, 2005 at 03:04:04PM +0000, Glynn Clements wrote:
> I'm not suggesting inventing conventions. I'm suggesting leaving such
> issues to the application programmer who, unlike the library
> programmer, probably has enough context to be able to reliably
> determine the correct encoding in any specific instance.

But the whole point of Foreign.C.String is to interface to existing C
code. And one of the most common conventions of said interfaces is to
represent strings in the current locale, which is why locale-honoring
conversion routines are useful.

I don't think anyone is arguing that this is the end-all of charset
conversion, far from it. A general conversion library and parameterized
conversion routines are also needed for many of the reasons you said,
and will probably appear at some point. I have my own iconv interface,
which I used for my initial implementation of with/peekCString etc., and
I am sure other people have written their own; eventually one will be
standardized. A general conversion facility has been on the wishlist for
a long time.

However, at the moment, the FFI is tackling a much simpler goal of
interfacing with existing C code, and non-parameterized locale-honoring
conversion routines are extremely useful for that. Even if we had a nice
generalized conversion routine, a simple locale-honoring front end would
be a very useful interface because it is so commonly needed when
interfacing to C code.
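
As a small illustration of the kind of call under discussion, the sketch below hands a Haskell String to C's strlen through withCString from Foreign.C.String. The helper name cStrlen is made up for this example. How non-ASCII characters are marshalled depends on the implementation and the locale, so only the ASCII case is predictable here.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.String (CString, withCString)
import Foreign.C.Types (CSize(..))

-- The standard C strlen, imported through the FFI.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Hypothetical helper: marshal a String to a temporary NUL-terminated
-- C string and ask C how many bytes it occupies.
cStrlen :: String -> IO Int
cStrlen s = withCString s (fmap fromIntegral . c_strlen)
```

For a pure ASCII argument such as "hello" the result is the number of characters. For anything else the byte count reflects whatever conversion withCString applies, which is exactly the behaviour being debated in this thread.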

However, I am sure everyone would be happy if a nice cabalized general
charset conversion library appeared... I have the start of one here, which
should work on any POSIXy system, even if wchar_t is not Unicode (no
Windows support though):
http://repetae.net/john/recent/out/HsLocale.html

John

--
John Meacham - ⑆repetae.net⑆john⑈

Dimitry Golubovsky

Mar 19, 2005, 11:13:29 PM

to haskel...@haskell.org

Glynn Clements wrote:

> To get a better idea, you would need to consult users whose language
> doesn't use the roman alphabet, e.g. CJK or cyrillic. Unfortunately,
> you don't usually find too many of them on lists such as this.

In Russia, we still have multiple one-byte encodings for Cyrillic: KOI-8
(Unix), CP1251 (Windows), and the increasingly obsolete CP866
(MSDOS, OS/2). Regarding filenames, I am sure Windows stores them in
Unicode regardless of locale (I tried various chcp numbers in a console
window, listing a directory containing both Russian and German
filenames; it showed "non-characters" as question marks when a
locale-based codepage was set, and showed everything with chcp 65001,
which is Unicode). AFAIK Unix users do not create files named in Russian
very often, whereas Windows users do so frequently.

Dimitry Golubovsky
Middletown, CT

Ian Lynagh

Mar 19, 2005, 11:34:28 PM

to haskel...@haskell.org, 299...@bugs.debian.org

On Sun, Mar 20, 2005 at 01:33:44AM +0000, ro...@soi.city.ac.uk wrote:
> On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
>
> > Most importantly, though: is there any way to remove this file without
> > doing something like an FFI import of unlink?
> >
> > Is there anything LC_CTYPE can be set to that will act like C/POSIX but
> > accept 8-bit bytes as chars too?
>
> en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
> behaviour (and the GHC behaviour).

This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591
(or en_US). My /etc/locale.gen contains:

en_GB ISO-8859-1
en_GB.ISO-8859-15 ISO-8859-15
en_GB.UTF-8 UTF-8

So is there anything that /always/ works?

> Indeed it's possible to have filenames (under POSIX, anyway) that H98
> programs can't touch (under Hugs). That pretty much follows from
> the Haskell definition FilePath = String. The other thread under this
> subject has touched on the need for an (additional) API using an abstract
> FilePath type.

Hmm. I can't say I'm convinced by all this without having something like
that API.

> Yes, I don't see how to avoid this when using mbtowc() to do the
> conversion: it makes no distinction between a bad byte sequence and an
> incomplete one.

Perhaps you could use mbrtowc instead?

My manpage says

If the n bytes starting at s do not contain a complete multibyte
character, mbrtowc returns (size_t)(-2). This can happen even if n >=
MB_CUR_MAX, if the multibyte string contains redundant shift sequences.

If the multibyte string starting at s contains an invalid multibyte
sequence before the next complete character, mbrtowc returns
(size_t)(-1) and sets errno to EILSEQ. In this case, the effects on *ps
are undefined.

For both functions my manpage says

CONFORMING TO
ISO/ANSI C, UNIX98

Thanks
Ian

Keean Schupke

Mar 20, 2005, 8:06:44 AM

to Ian Lynagh, 299...@bugs.debian.org, haskel...@haskell.org

One thing I don't like about this automatic conversion is that it is
hidden magic - and could catch people out. Let's say I don't want to use
it... How can I do the following
(ie what are the new API calls):

Open a file with a name that is invalid in the current locale (say a
zip disc from a computer with a different locale setting).

Open a file with contents in an unknown encoding.

What are the new binary API calls for file IO?

What type is returned from 'getChar' on a binary file? Should it
even be called getChar? What about getWord8 (getWord16, getWord32 etc...)?

Does the encoding translation occur just on the filename or the
contents as well? What if I have an encoded filename with binary
contents, or vice versa?

Keean.

(I guess I now have to rewrite a lot of file IO code!)

ro...@soi.city.ac.uk

Mar 20, 2005, 1:43:38 PM

to Keean Schupke, 299...@bugs.debian.org, haskel...@haskell.org

On Sun, Mar 20, 2005 at 12:59:52PM +0000, Keean Schupke wrote:
> How can I do the following (ie what are the new API calls):
>
> Open a file with a name that is invalid in the current locale (say a
> zip disc from a computer with a different locale setting).

A new API is needed for this.

> Open a file with contents in an unknown encoding.
>
> What are the new binary API calls for file IO?

see System.IO

> What type is returned from 'getChar' on a binary file. Should it
> even be called getChar? what about getWord8 (getWord16, getWord32 etc...)

Char, of course. And yes, it's not ideal. There's also a byte array
interface.
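
A minimal sketch of what binary I/O through System.IO can look like, using openBinaryFile and treating each Char read from the handle as one byte. The helper names writeBytes and readBytes are invented for this example and are not part of any library.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import System.IO
  (IOMode(ReadMode, WriteMode), hClose, hGetContents, hPutStr, openBinaryFile)

-- Hypothetical helper: write raw bytes, one Char in '\0'..'\255' per byte.
writeBytes :: FilePath -> [Word8] -> IO ()
writeBytes path bytes = do
  h <- openBinaryFile path WriteMode
  hPutStr h (map (chr . fromIntegral) bytes)
  hClose h

-- Hypothetical helper: read a whole file in binary mode; no locale
-- decoding or newline translation is applied to the Chars returned.
readBytes :: FilePath -> IO [Word8]
readBytes path = do
  h <- openBinaryFile path ReadMode
  s <- hGetContents h                 -- lazy read
  let bytes = map (fromIntegral . ord) s
  length bytes `seq` hClose h         -- force the whole read before closing
  return bytes
```

For a handle that is already open, such as stdin, hSetBinaryMode from the same module switches it to binary mode.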

> (I guess I now have to rewrite a lot of file IO code!)

If it was doing binary I/O on H98 Handles, it already needed rewriting.
There's nothing to be done for filenames until a new API emerges.

Ross Paterson

Mar 21, 2005, 5:35:32 AM

to haskel...@haskell.org, 299...@bugs.debian.org

On Sun, Mar 20, 2005 at 04:34:12AM +0000, Ian Lynagh wrote:
> On Sun, Mar 20, 2005 at 01:33:44AM +0000, ro...@soi.city.ac.uk wrote:
> > On Sat, Mar 19, 2005 at 07:14:25PM +0000, Ian Lynagh wrote:
> > > Is there anything LC_CTYPE can be set to that will act like C/POSIX but
> > > accept 8-bit bytes as chars too?
> >
> > en_GB.iso88591 (or indeed any .iso88591 locale) will match the old
> > behaviour (and the GHC behaviour).
>
> This works for me with en_GB.iso88591 (or en_GB), but not en_US.iso88591
> (or en_US). My /etc/locale.gen contains:
>
> en_GB ISO-8859-1
> en_GB.ISO-8859-15 ISO-8859-15
> en_GB.UTF-8 UTF-8
>
> So is there anything that /always/ works?

Since systems may have no locale other than C/POSIX, no.

> > Yes, I don't see how to avoid this when using mbtowc() to do the
> > conversion: it makes no distinction between a bad byte sequence and an
> > incomplete one.
>
> Perhaps you could use mbrtowc instead?

Indeed. Thanks for pointing it out.
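
The distinction mbrtowc draws between an incomplete sequence and an invalid one can be sketched in pure Haskell. The code below is an illustration only: it assumes UTF-8 rather than consulting the locale, it ignores overlong forms and surrogates, and the names Prefix, expectedLength, isContinuation and classify are made up for the example.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Outcome of inspecting the bytes buffered so far.
data Prefix
  = Complete Int   -- a whole character occupying this many bytes
  | Incomplete     -- a valid start, more input needed (compare mbrtowc's -2)
  | Invalid        -- can never become a character (compare mbrtowc's -1)
  deriving (Eq, Show)

-- Length of a UTF-8 sequence, judged from its first byte alone.
expectedLength :: Word8 -> Maybe Int
expectedLength b
  | b < 0x80               = Just 1
  | b >= 0xC2 && b <= 0xDF = Just 2
  | b >= 0xE0 && b <= 0xEF = Just 3
  | b >= 0xF0 && b <= 0xF4 = Just 4
  | otherwise              = Nothing  -- stray continuation or illegal lead byte

-- Continuation bytes have the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Classify the buffered bytes without waiting for more input.
classify :: [Word8] -> Prefix
classify []       = Incomplete
classify (b:rest) =
  case expectedLength b of
    Nothing -> Invalid
    Just n
      | not (all isContinuation (take (n - 1) rest)) -> Invalid
      | length rest >= n - 1                          -> Complete n
      | otherwise                                     -> Incomplete
```

With the input from the e.hs test, a single 0xC2 is classified as Incomplete, but as soon as the second 0xC2 arrives the pair is Invalid, so under this scheme the error could be reported without waiting for end of input.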

Glynn Clements

Mar 21, 2005, 5:27:55 PM

to John Meacham, haskel...@haskell.org

John Meacham wrote:

> > I'm not suggesting inventing conventions. I'm suggesting leaving such
> > issues to the application programmer who, unlike the library
> > programmer, probably has enough context to be able to reliably
> > determine the correct encoding in any specific instance.
>
> But the whole point of Foreign.C.String is to interface to existing C
> code. And one of the most common conventions of said interfaces is to
> represent strings in the current locale, which is why locale-honoring
> conversion routines are useful.

My point is that most C functions which accept or return char*s will
work regardless of whether those char*s can be decoded according to
the current locale. E.g.

    while (d = readdir(dir), d)
    {
        stat(d->d_name, &st);
        ...
    }

will stat() every filename in the directory regardless of whether or
not the filenames are valid in the locale's encoding.

The Haskell equivalent using FilePath (i.e. String),
getDirectoryContents etc currently only works because the char* <->
String conversions are hardcoded to ISO-8859-1, which is infallible
and reversible. If it used e.g. UTF-8, it would fail on any filename
which wasn't valid UTF-8 even though it never actually needs to know
the string of characters which the filename represents.
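
To make "infallible and reversible" concrete, here is a small sketch of the ISO-8859-1 mapping between bytes and Chars. The function names are invented for the example; this is not the code any Haskell implementation actually uses.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1 decoding: every byte maps to the code point with the same
-- number, so there is no input on which this can fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Inverse of decodeLatin1; only meaningful for Chars up to '\255'.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- A filename taken from one call and handed to another makes this trip:
-- bytes -> String -> bytes.
roundTrips :: [Word8] -> Bool
roundTrips bs = encodeLatin1 (decodeLatin1 bs) == bs
```

Under this mapping roundTrips holds for any byte string, including ones such as 0xC2 0xC2 that are not valid UTF-8. A strict UTF-8 decoder would have to reject that input or substitute something for it, and either way the original bytes could no longer be recovered from the String.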

The same applies to reading filenames from argv[] and passing them to
open() etc. This is one of the most common idioms in Unix programming,
and it doesn't care about encodings at all. Again, it would cease to
work reliably in Haskell if the automatic char* <-> String conversions
in getArgs etc started using the locale.

I'm not arguing about *how* char* <-> String conversions should be
performed so much as arguing about *whether* these conversions should
be performed. The conversion issues are only problems because the
conversions are being done at all.

--
Glynn Clements <gl...@gclements.plus.com>