invalid character encoding

31.  Wolfgang Thaller
Newsgroups: fa.haskell
From: Wolfgang Thaller <wolfgang.thal...@gmx.net>
Date: Fri, 18 Mar 2005 04:17:01 GMT
Local: Thurs, Mar 17 2005 11:17 pm
Subject: Re: [Haskell-cafe] invalid character encoding

> If you try to pretend that I18N comes down to shoe-horning everything
> into Unicode, you will turn the language into a joke.

How common will those problems you are describing be by the time this
has been implemented?
How common are they even now?
I haven't yet encountered a unix box where the file names were not in
the system locale encoding. On all reasonably up-to-date Linux boxes
that I've seen recently, they were in UTF-8 (and the system locale
agreed).
On both Windows and Mac OS X, filenames are stored in Unicode, so it is
always possible to convert them to unicode.
So we can't do Unicode-based I18N because there exist a few unix
systems with messed-up file systems?

> Haskell's Unicode support is a joke because the API designers tried to
> avoid the issues related to encoding with wishful thinking (i.e. you
> open a file and you magically get Unicode characters out of it).

OK, that part is purely wishful thinking, but assuming that filenames
are text that can be represented in Unicode is wishful thinking that
corresponds to 99% of reality. So why can't the remaining 1 percent of
reality be fixed instead?

Cheers,

Wolfgang

_______________________________________________
Haskell-Cafe mailing list
Haskell-C...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


 
32.  Glynn Clements
Newsgroups: fa.haskell
From: Glynn Clements <gl...@gclements.plus.com>
Date: Fri, 18 Mar 2005 19:01:01 GMT
Local: Fri, Mar 18 2005 2:01 pm
Subject: Re: [Haskell-cafe] invalid character encoding

Wolfgang Thaller wrote:
> > If you try to pretend that I18N comes down to shoe-horning everything
> > into Unicode, you will turn the language into a joke.

> How common will those problems you are describing be by the time this
> has been implemented?
> How common are they even now?

Right now, GHC assumes ISO-8859-1 whenever it has to automatically
convert between String and CString. Conversions to and from ISO-8859-1
cannot fail, and encoding and decoding are exact inverses.

OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
the correct encoding, but that doesn't actually matter a lot of the
time; frequently, you're just grabbing a "blob" of data from one
function and passing it to another.

The problems will only appear once you start dealing with fallible or
non-reversible encodings such as UTF-8 or ISO-2022. If and when that
happens, I guess we'll find out how common the problems are. Of
course, it's quite possible that the only test cases will be people
using UTF-8-only (or even ASCII-only) systems, in which case you won't
see any problems.
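
To make the difference concrete, here is a minimal sketch. It assumes the modern `bytestring` and `text` packages (which postdate this thread), and the helper names are invented for illustration; it is not how GHC performs the conversion internally.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE
import Data.Char (chr, ord)
import Data.Either (isLeft)
import Data.Word (Word8)

-- ISO-8859-1: every byte maps to the code point with the same number,
-- so decoding cannot fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Inverse of decodeLatin1 for strings whose characters are all <= U+00FF.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- True when the bytes survive a decode/encode round trip via ISO-8859-1.
roundTripsLatin1 :: [Word8] -> Bool
roundTripsLatin1 bytes = encodeLatin1 (decodeLatin1 bytes) == bytes

-- True when strict UTF-8 decoding rejects the bytes.
failsAsUtf8 :: [Word8] -> Bool
failsAsUtf8 = isLeft . TE.decodeUtf8' . B.pack
```

For the bytes `[0x31, 0xA2]`, `roundTripsLatin1` holds and `failsAsUtf8` is also `True`: ISO-8859-1 accepts anything, while strict UTF-8 does not.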

> I haven't yet encountered a unix box where the file names were not in
> the system locale encoding. On all reasonably up-to-date Linux boxes
> that I've seen recently, they were in UTF-8 (and the system locale
> agreed).

I've encountered boxes where multiple encodings were used; primarily
web and FTP servers which were shared amongst multiple clients. Each
client used whichever encoding(s) they felt like. IIRC, the most
common non-ASCII encoding was MS-DOS codepage 850 (the clients were
mostly using Windows 3.1 at that time).

I haven't done sysadmin for a while, so I don't know the current
situation, but I don't think that the world has switched to UTF-8 in
the meantime. [Most of the non-ASCII filenames which I've seen
recently have been either ISO-8859-1 or Win-12XX; I haven't seen much
UTF-8.]

> On both Windows and Mac OS X, filenames are stored in Unicode, so it is
> always possible to convert them to unicode.
> So we can't do Unicode-based I18N because there exist a few unix
> systems with messed-up file systems?

Declaring such systems to be "messed up" won't make the problems go
away. If a design doesn't work in reality, it's the fault of the
design, not of reality.

> > Haskell's Unicode support is a joke because the API designers tried to
> > avoid the issues related to encoding with wishful thinking (i.e. you
> > open a file and you magically get Unicode characters out of it).

> OK, that part is purely wishful thinking, but assuming that filenames
> are text that can be represented in Unicode is wishful thinking that
> corresponds to 99% of reality.
> So why can't the remaining 1 percent of reality be fixed instead?

The issue isn't whether the data can be represented as Unicode text,
but whether you can convert it to and from Unicode without problems.
To do this, you need to know the encoding, you need to store the
encoding so that you can convert the wide string back to a byte
string, and the encoding needs to be reversible.

--
Glynn Clements <gl...@gclements.plus.com>

33.  Wolfgang Thaller
Newsgroups: fa.haskell
From: Wolfgang Thaller <wolfgang.thal...@gmx.net>
Date: Sat, 19 Mar 2005 06:10:38 GMT
Local: Sat, Mar 19 2005 1:10 am
Subject: Re: [Haskell-cafe] invalid character encoding

Glynn Clements wrote:
> OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
> the correct encoding, but that doesn't actually matter a lot of the
> time; frequently, you're just grabbing a "blob" of data from one
> function and passing it to another.

Yes. Of course, this also means that Strings representing non-ASCII
filenames will *always* be nonsense on Mac OS X and other UTF8-based
platforms.

> The problems will only appear once you start dealing with fallible or
> non-reversible encodings such as UTF-8 or ISO-2022.

In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
file name that is converted to Unicode cannot be converted back any
more (assuming you know for sure that it was ISO-2022 in the first
place)?

> Of course, it's quite possible that the only test cases will be people
> using UTF-8-only (or even ASCII-only) systems, in which case you won't
> see any problems.

I'm kind of hoping that we can just ignore a problem that is so rare
that a large and well-known project like GTK2 can get away with
ignoring it. Also, IIRC, Java strings are supposed to be unicode, too -
how do they deal with the problem?

>> So we can't do Unicode-based I18N because there exist a few unix
>> systems with messed-up file systems?

> Declaring such systems to be "messed up" won't make the problems go
> away. If a design doesn't work in reality, it's the fault of the
> design, not of reality.

In general, yes. But we're not talking about all of reality here, we're
talking about one small part of reality - the question is, can the part
of reality where the design doesn't work be ignored?

For example, as soon as we use any kind of path names in our APIs, we
are ignoring reality on good old "Classic" Mac OS (may it rest in
peace). Path names don't always uniquely denote a file there (although
they do most of the time). People writing cross-platform software have
been ignoring this fact for a long time now.

I think that if we wait long enough, the filename encoding problems
will become irrelevant and we will live in an ideal world where unicode
actually works. Maybe next year, maybe only in ten years. And while we
are arguing about how far we are from that ideal world, we should think
about alternatives. The current hack is really just a hack, and I don't
want to see this hack become the new accepted standard.

Do we have other alternatives? Preferably something that provides other
advantages over a unicode String than just making things work on
systems that many users never encounter, otherwise almost no one will
bother to use it. So maybe we should start looking for _other_ reasons
to represent file names and paths by an abstract datatype or something?

Cheers,

Wolfgang


34.  Einar Karttunen
Newsgroups: fa.haskell
From: Einar Karttunen <ekart...@cs.helsinki.fi>
Date: Sat, 19 Mar 2005 09:34:54 GMT
Local: Sat, Mar 19 2005 4:34 am
Subject: Re: [Haskell-cafe] invalid character encoding

Wolfgang Thaller <wolfgang.thal...@gmx.net> writes:
> In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
> file name that is converted to Unicode cannot be converted back any
> more (assuming you know for sure that it was ISO-2022 in the first
> place)?

I am no expert on ISO-2022 so the following may contain errors,
please correct if it is wrong.

ISO-2022 -> Unicode is always possible.
Also Unicode -> ISO-2022 should be always possible, but is a relation
not a function. This means there are infinitely many (?) ways of encoding a
particular unicode string in ISO-2022.

ISO-2022 works by providing escape sequences to switch between different
character sets. One can freely use these escapes in almost any way one
wishes. Also, ISO-2022 distinguishes between the same character in
Japanese/Chinese/Korean, which Unicode does not do.
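
A small sketch of the "relation, not a function" point. This is a toy decoder for a tiny subset of ISO-2022-JP: it understands only the escapes ESC ( B (switch to ASCII) and ESC ( J (switch to JIS X 0201 Roman), and the names `Charset` and `decodeToy` are made up for this example. It is nowhere near a complete ISO-2022 implementation.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- The two single-byte character sets handled by this toy decoder.
data Charset = Ascii | JisRoman

-- Decode a tiny subset of ISO-2022-JP; anything else yields Nothing.
decodeToy :: [Word8] -> Maybe String
decodeToy = go Ascii
  where
    go _ [] = Just []
    go _ (0x1B : 0x28 : 0x42 : rest) = go Ascii rest     -- ESC ( B
    go _ (0x1B : 0x28 : 0x4A : rest) = go JisRoman rest  -- ESC ( J
    go cs (b : rest)
      | b == 0x1B || b >= 0x80 = Nothing                 -- not supported here
      | otherwise = (toChar cs b :) <$> go cs rest

    -- JIS X 0201 Roman differs from ASCII in two positions.
    toChar JisRoman 0x5C = '\x00A5'  -- yen sign instead of backslash
    toChar JisRoman 0x7E = '\x203E'  -- overline instead of tilde
    toChar _ b = chr (fromIntegral b)
```

The plain bytes for "abc", the same bytes prefixed with ESC ( B, and the same bytes wrapped in ESC ( J ... ESC ( B all decode to the same string, so there is no unique way back from the text to the original bytes.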

See here for more info on the topic:
http://www.ecma-international.org/publications/files/ecma-st/ECMA-035...

Also trusting system locale for everything is problematic and makes
things quite unbearable for I18N. e.g. on my desktop 95% of things run
with iso-8859-1, 3% of things use utf-8 and a few apps use EUC-JP...

Using filenames as opaque blobs causes the least problems. If the
program wishes to display them in a graphical environment then they have
to be converted to a string, but very many apps never display the
filenames...

- Einar Karttunen

35.  Glynn Clements
Newsgroups: fa.haskell
From: Glynn Clements <gl...@gclements.plus.com>
Date: Sat, 19 Mar 2005 14:33:13 GMT
Local: Sat, Mar 19 2005 9:33 am
Subject: Re: [Haskell-cafe] invalid character encoding

Exactly.

Moreover, while there are an infinite number of equivalent
representations in theory (you can add as many redundant switching
sequences as you wish), there are multiple "plausible" equivalent
representations in practice.

--
Glynn Clements <gl...@gclements.plus.com>

36.  Glynn Clements
Newsgroups: fa.haskell
From: Glynn Clements <gl...@gclements.plus.com>
Date: Sat, 19 Mar 2005 17:35:30 GMT
Local: Sat, Mar 19 2005 12:35 pm
Subject: Re: [Haskell-cafe] invalid character encoding

Wolfgang Thaller wrote:
> > Of course, it's quite possible that the only test cases will be people
> > using UTF-8-only (or even ASCII-only) systems, in which case you won't
> > see any problems.

> I'm kind of hoping that we can just ignore a problem that is so rare
> that a large and well-known project like GTK2 can get away with
> ignoring it.

1. The filename issues in GTK-2 are likely to be a major problem in
CJK locales, where filenames which don't match the locale (which is
seldom UTF-8) are common.

2. GTK's filename handling only really applies to file selector
dialogs. Most other uses of filenames in a GTK-based application don't
involve GTK; they use the OS API functions which just deal with byte
strings.

3. GTK is a GUI library. Most of the text which it deals with is going
to be rendered, so it *has* to be interpreted as characters. Treating
it as blobs of data won't work. IOW, on the question of whether or not
to interpret byte strings as character strings, GTK is at the far end
of the scale.

> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Files are represented by instances of the File class:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

        An abstract representation of file and directory pathnames.

You can construct Files from Strings, and convert Files to Strings.

The File class includes two sets of directory enumeration methods:
list() returns an array of Strings, while listFiles() returns an array
of Files.

The documentation for the File class doesn't mention encoding issues
at all. However, with that interface, it would be possible to
enumerate and open filenames which cannot be decoded.

> >> So we can't do Unicode-based I18N because there exist a few unix
> >> systems with messed-up file systems?

> > Declaring such systems to be "messed up" won't make the problems go
> > away. If a design doesn't work in reality, it's the fault of the
> > design, not of reality.

> In general, yes. But we're not talking about all of reality here, we're
> talking about one small part of reality - the question is, can the part
> of reality where the design doesn't work be ignored?

Sure, you *can* ignore it; K&R C ignored everything other than ASCII.
If you limit yourself to locales which use the Roman alphabet (i.e.
ISO-8859-N for N=1/2/3/4/9/15), you can get away with a lot.

Most such users avoid encoding issues altogether by dropping the
accents and sticking to ASCII, at least when dealing with files which
might leave their system.

To get a better idea, you would need to consult users whose language
doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately,
you don't usually find too many of them on lists such as this.

I'm only familiar with one OSS project which has a sizeable CJK user
base, and that's XEmacs (whose I18N revolves around ISO-2022, and most
of the documentation is in Japanese). Even there, there are separate
mailing lists for English and Japanese, and the two seldom
communicate.

> I think that if we wait long enough, the filename encoding problems
> will become irrelevant and we will live in an ideal world where unicode
> actually works. Maybe next year, maybe only in ten years.

Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually
the least eager; you are more likely to find UTF-8 in an
English-language HTML page or email message than a Japanese one.

--
Glynn Clements <gl...@gclements.plus.com>

37.  Wolfgang Thaller
Newsgroups: fa.haskell
From: Wolfgang Thaller <wolfgang.thal...@gmx.net>
Date: Sat, 19 Mar 2005 23:56:37 GMT
Local: Sat, Mar 19 2005 6:56 pm
Subject: Re: [Haskell-cafe] invalid character encoding

>> Also, IIRC, Java strings are supposed to be unicode, too -
>> how do they deal with the problem?

> Files are represented by instances of the File class:
> [...]
> The documentation for the File class doesn't mention encoding issues
> at all.

... which led me to conclude that they don't deal with the problem
properly.

>> I think that if we wait long enough, the filename encoding problems
>> will become irrelevant and we will live in an ideal world where
>> unicode
>> actually works. Maybe next year, maybe only in ten years.

> Maybe not even then. If Unicode really solved encoding problems, you'd
> expect the CJK world to be the first adopters, but they're actually
> the least eager; you are more likely to find UTF-8 in an
> English-language HTML page or email message than a Japanese one.

Hmm, that's possibly because English-language users can get away with
just marking their ASCII files as UTF-8. But I'm not arguing files or
HTML pages here, I'm only concerned with filenames. I prefer unicode
nowadays because I was born within a hundred kilometers of the "border"
between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language
texts, but as soon as I write about where I went for vacation, I need a
few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody
ever tried to sell ISO-2022 to me, so unicode was the only alternative.

So you've now convinced me that there is a considerable number of
computers using ISO-2022, where there's more than one way to encode the
same text (how do people use this from the command line??). There are
also multi-user systems where the users don't agree on a single
encoding. I still reserve the right to call those systems messed-up,
but that's just my personal opinion and "reality" couldn't care less
about what I think.

So, as I don't want to stick with the status quo forever (lists of
bytes that pretend to be lists of unicode chars, even on platforms
where unicode is used anyway), how about we get to work - what do we
want?

I don't think we want a type class here, a plain (abstract) data type
will do:

 > data File

Obviously, we'll need conversion from and to C strings. On Mac OS X,
they'd be guaranteed to be in UTF-8.

 > withFilePathCString :: File -> (CString -> IO a) -> IO a
 > fileFromCString :: CString -> IO File

We will need functions for converting to and from unicode strings. I'm
pretty sure that we want to keep those functions pure, otherwise
they'll be very annoying to use.

 > fileFromPath :: String -> File

Any impure operations that might be needed to decide how to encode the
file name will have to be delayed until the File is actually used.

 > fileToPath :: File -> String

Same here: any impure operation necessary to convert the File to a
unicode string needs to be done when the file is created.

What about failure? If you go from String to File, errors should be
reported when you actually access the file. At an earlier time, you
can't know whether the file name is valid (e.g. if you mount a
"classic" HFS volume on Mac OS X, you can only create files there whose
names can be represented in the volume's file name encoding - but you
only find that out once you try to create a file).

For going from File to String, I'm not so sure, but I would be very
annoyed if I had to deal with a Maybe String return type on platforms
where it will always succeed. Maybe there should be separate functions
for different purposes - i.e. for display, you'd use a File -> String
function that will silently use '?'s when things can't be decoded, but
in other situations you might use a File -> Maybe String function and
check for Nothing.
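
A rough sketch of how those two flavours could look. Big assumptions here: the name is stored as raw bytes in a `ByteString` and always treated as UTF-8, the modern `bytestring` and `text` packages are used, and `fileToPathMaybe` and `fileToDisplayPath` are hypothetical names. The delayed, impure encoding decisions discussed above are not modelled at all.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Encoding.Error as TEE

-- Abstract file name: the raw bytes are kept, so nothing is lost.
newtype File = File B.ByteString
  deriving (Eq, Show)

-- Pure conversion from a Unicode string (UTF-8 is assumed here).
fileFromPath :: String -> File
fileFromPath = File . TE.encodeUtf8 . T.pack

-- Strict conversion: Nothing when the bytes are not valid UTF-8.
fileToPathMaybe :: File -> Maybe String
fileToPathMaybe (File bytes) =
  either (const Nothing) (Just . T.unpack) (TE.decodeUtf8' bytes)

-- Lenient conversion for display: undecodable bytes become '?'.
fileToDisplayPath :: File -> String
fileToDisplayPath (File bytes) =
  T.unpack (TE.decodeUtf8With (TEE.replace '?') bytes)
```

With this split, code that only shows names to the user never has to handle `Nothing`, while code that needs an exact name can still detect the undecodable case.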

If people want to implement more sophisticated ways of decoding file
names than can be provided by the library, they'd get the C string and
do the same things.

Of course, there should also be lots of other useful functions that
make it more or less unnecessary to deal with path names directly in
most cases.

Thoughts?

Cheers,

Wolfgang


38.  Dimitry Golubovsky
Newsgroups: fa.haskell
From: Dimitry Golubovsky <dimi...@golubovsky.org>
Date: Sun, 20 Mar 2005 04:13:29 GMT
Local: Sat, Mar 19 2005 11:13 pm
Subject: Re: [Haskell-cafe] invalid character encoding

Glynn Clements wrote:
> To get a better idea, you would need to consult users whose language
> doesn't use the Roman alphabet, e.g. CJK or Cyrillic. Unfortunately,
> you don't usually find too many of them on lists such as this.

In Russia, we still have multiple one-byte encodings for Cyrillic: KOI-8
(Unix), CP1251 (Windows), and the increasingly obsolete CP866
(MSDOS, OS/2). Regarding filenames, I am sure Windows stores them in
Unicode regardless of locale (I tried various chcp numbers in a console
window, printing a directory containing filenames in both Russian and
German: it showed "non-characters" as question marks when a
locale-based codepage was set, and showed everything with chcp 65001,
which is UTF-8). AFAIK Unix users do not create files named in Russian
very often, while Windows users do this frequently.

Dimitry  Golubovsky
Middletown, CT


39.  Marcin 'Qrczak' Kowalczyk
Newsgroups: fa.haskell
From: "Marcin 'Qrczak' Kowalczyk" <qrc...@knm.org.pl>
Date: Sat, 19 Mar 2005 18:18:51 GMT
Local: Sat, Mar 19 2005 1:18 pm
Subject: Re: [Haskell-cafe] invalid character encoding

Wolfgang Thaller <wolfgang.thal...@gmx.net> writes:
> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Java (Sun)
----------

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8.
   Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by "?".

C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.

a) Interpreting. If a filename cannot be converted, it's skipped in
   a directory listing.

   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
   The reality seems to not match this (mono-1.0.5).

b) Creating. If UTF-8 is used, U+0000 throws an exception
   (System.ArgumentException: Path contains invalid chars), paired
   surrogates are treated correctly, and an isolated surrogate causes
   an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, unpaired surrogates are converted to pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.
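
The "try each encoding in order" strategy described for Mono can be sketched in Haskell as follows. This is only an illustration of the idea, not Mono's code: it does not read MONO_EXTERNAL_ENCODINGS, the decoder list is supplied by the caller, all names are invented, and it assumes the modern `bytestring` and `text` packages.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Maybe (listToMaybe)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A decoder either produces a string or gives up.
type Decoder = B.ByteString -> Maybe String

-- Strict UTF-8: fails on any malformed sequence.
utf8Strict :: Decoder
utf8Strict = either (const Nothing) (Just . T.unpack) . TE.decodeUtf8'

-- ISO-8859-1: each byte becomes the code point with the same number.
latin1 :: Decoder
latin1 = Just . B8.unpack

-- Try the decoders in order; report which one succeeded first.
decodeFirst :: [(String, Decoder)] -> B.ByteString -> Maybe (String, String)
decodeFirst decoders bytes =
  listToMaybe [(name, s) | (name, decode) <- decoders, Just s <- [decode bytes]]
```

Note that a decoder which never fails, such as `latin1`, makes everything listed after it unreachable, so the order of the list matters.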

--
   __("<         Marcin Kowalczyk
   \__/       qrc...@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

40.  Ian Lynagh
Newsgroups: fa.haskell
From: Ian Lynagh <ig...@earth.li>
Date: Sat, 19 Mar 2005 19:14:42 GMT
Local: Sat, Mar 19 2005 2:14 pm
Subject: Re: [Haskell-cafe] invalid character encoding

On Wed, Mar 16, 2005 at 11:55:18AM +0000, Ross Paterson wrote:
> On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
> > Do you have a list of functions which behave differently in the new
> > release to how they did in the previous release?
> > (I'm not interested in changes that will affect only whether something
> > compiles, not how it behaves given it compiles both before and after).

> I got lost in the negatives here.  It affects all Haskell 98 primitives
> that do character I/O, or that exchange C strings with the C library.

In the below, it looks like there is a bug in getDirectoryContents.

Also, the error from w.hs is going to stdout, not stderr.

Most importantly, though: is there any way to remove this file without
doing something like an FFI import of unlink?

Is there anything LC_CTYPE can be set to that will act like C/POSIX but
accept 8-bit bytes as chars too?

(in the POSIX locale)
$ echo 'import Directory; main = getDirectoryContents "." >>= print' > q.hs
$ runhugs q.hs
[".","..","q.hs"]
$ touch 1`printf "\xA2"`
$ runhugs q.hs
runhugs: Error occurred

ERROR - Garbage collection fails to reclaim sufficient space

$ echo 'import Directory; main = removeFile "1\xA2"' > w.hs
$ runhugs w.hs

Program error: 1?: Directory.removeFile: does not exist (file does not exist)
$ strace -o strace.out runhugs w.hs > /dev/null
$ grep unlink strace.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 3f 22 29 20 20        |unlink("1?")  |
0000000e
$ strace -o strace2.out rm 1*
$ grep unlink strace2.out | head -c 14 | hexdump -C
00000000  75 6e 6c 69 6e 6b 28 22  31 a2 22 29 20 20        |unlink("1.")  |
0000000e
$
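
For reference, the FFI fallback mentioned above (importing unlink directly) might look roughly like the sketch below. It is POSIX-only, assumes GHC with the modern `bytestring` package rather than Hugs, and `removeRawFile` is a made-up name. The point is that the file name travels as raw bytes and is never decoded.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString as B
import Foreign.C.String (CString)
import Foreign.C.Types (CInt (..))

-- Direct binding to the POSIX unlink(2) call.
foreign import ccall unsafe "unistd.h unlink"
  c_unlink :: CString -> IO CInt

-- Remove a file whose name is given as raw bytes.
-- Returns True when unlink reports success, False otherwise.
removeRawFile :: B.ByteString -> IO Bool
removeRawFile name =
  -- useAsCString passes a NUL-terminated copy of the bytes, unchanged.
  B.useAsCString name $ \cstr -> do
    result <- c_unlink cstr
    pure (result == 0)
```

The file from the transcript would then be addressed as `removeRawFile (B.pack [0x31, 0xA2])`, so the byte 0xA2 reaches unlink as-is instead of being turned into "?".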

Now consider this e.hs:

--------------------
import IO

main = do hWaitForInput stdin 10000
          putStrLn "Input is ready"
          r <- hReady stdin
          print r
          c <- hGetChar stdin
          print c
          putStrLn "Done!"
--------------------

$ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs
Input is ready
True

Program error: <stdin>: IO.hGetChar: protocol error (invalid character encoding)
$

It takes 30 seconds for this error to be printed. This shows two issues:
First of all, I think you should be giving an error as soon as you have
a prefix that is the start of no character. Second, hReady now only
guarantees hGetChar won't block on a binary mode handle, but I guess
there is not much we can do except document that (short of some hideous
hacks).
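
To illustrate the first point, here is a simplified sketch of the kind of prefix check I mean, assuming the input is UTF-8. The names `Prefix` and `classify` are invented, and this is not what Hugs does internally. It also skips the finer rules (overlong forms, surrogates), so it is a sketch rather than a full validator.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- State of the bytes read so far for ONE character.
data Prefix = Complete | NeedMore | Invalid
  deriving (Eq, Show)

classify :: [Word8] -> Prefix
classify [] = NeedMore
classify (lead : rest)
  | lead < 0x80 = if null rest then Complete else Invalid
  | lead >= 0xC2 && lead <= 0xDF = continue 1  -- two-byte sequence
  | lead >= 0xE0 && lead <= 0xEF = continue 2  -- three-byte sequence
  | lead >= 0xF0 && lead <= 0xF4 = continue 3  -- four-byte sequence
  | otherwise = Invalid                        -- stray continuation or bad lead
  where
    isContinuation b = b .&. 0xC0 == 0x80
    continue :: Int -> Prefix
    continue n
      | not (all isContinuation rest) = Invalid  -- fail as early as possible
      | length rest > n = Invalid
      | length rest == n = Complete
      | otherwise = NeedMore
```

With the input above, `classify [0xC2]` is `NeedMore`, but `classify [0xC2, 0xC2]` is already `Invalid`, so the error could be raised after the second byte without waiting for more input.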

Thanks
Ian
