Newsgroups: fa.haskell
From: Wolfgang Thaller <wolfgang.thal...@gmx.net>
Date: Fri, 18 Mar 2005 04:17:01 GMT
Local: Thurs, Mar 17 2005 11:17 pm
Subject: Re: [Haskell-cafe] invalid character encoding
> If you try to pretend that I18N comes down to shoe-horning everything
> into Unicode, you will turn the language into a joke.

How common will those problems you are describing be by the time this
has been implemented? How common are they even now?

I haven't yet encountered a unix box where the file names were not in
the system locale encoding. On all reasonably up-to-date Linux boxes
that I've seen recently, they were in UTF-8 (and the system locale
agreed).

On both Windows and Mac OS X, filenames are stored in Unicode, so it is
always possible to convert them to unicode.

So we can't do Unicode-based I18N because there exist a few unix
systems with messed-up file systems?

> Haskell's Unicode support is a joke because the API designers tried to
> avoid the issues related to encoding with wishful thinking (i.e. you
> open a file and you magically get Unicode characters out of it).

OK, that part is purely wishful thinking, but assuming that filenames
are text that can be represented in Unicode is wishful thinking that
corresponds to 99% of reality. So why can't the remaining 1 percent of
reality be fixed instead?

Cheers,

Wolfgang
Newsgroups: fa.haskell
From: Glynn Clements <gl...@gclements.plus.com>
Date: Fri, 18 Mar 2005 19:01:01 GMT
Local: Fri, Mar 18 2005 2:01 pm
Subject: Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller wrote:

> > If you try to pretend that I18N comes down to shoe-horning everything
> > into Unicode, you will turn the language into a joke.
>
> How common will those problems you are describing be by the time this
> [...]

Right now, GHC assumes ISO-8859-1 whenever it has to automatically
convert between String and CString. Conversions to and from ISO-8859-1
cannot fail, and encoding and decoding are exact inverses.

OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
the correct encoding, but that doesn't actually matter a lot of the
time; frequently, you're just grabbing a "blob" of data from one
function and passing it to another.

The problems will only appear once you start dealing with fallible or
non-reversible encodings such as UTF-8 or ISO-2022.

> I haven't yet encountered a unix box where the file names were not in
> the system locale encoding. On all reasonably up-to-date Linux boxes
> that I've seen recently, they were in UTF-8 (and the system locale
> agreed).

I've encountered boxes where multiple encodings were used; primarily
web and FTP servers which were shared amongst multiple clients. Each
client used whichever encoding(s) they felt like. IIRC, the most common
non-ASCII encoding was MS-DOS codepage 850 (the clients were mostly
using Windows 3.1 at that time).

I haven't done sysadmin for a while, so I don't know the current [...]

> On both Windows and Mac OS X, filenames are stored in Unicode, so it is
> always possible to convert them to unicode.
> So we can't do Unicode-based I18N because there exist a few unix
> systems with messed-up file systems?

Declaring such systems to be "messed up" won't make the problems go
away. If a design doesn't work in reality, it's the fault of the
design, not of reality.

> > Haskell's Unicode support is a joke because the API designers tried to
> > avoid the issues related to encoding with wishful thinking (i.e. you
> > open a file and you magically get Unicode characters out of it).
>
> OK, that part is purely wishful thinking, but assuming that filenames
> [...]

The issue isn't whether the data can be represented as Unicode text,
but whether you can convert it to and from Unicode without problems.
To do this, you need to know the encoding, you need to store the
encoding so that you can convert the wide string back to a byte string,
and the encoding needs to be reversible.
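
A minimal sketch of the distinction drawn above, assuming nothing beyond
the base library: ISO-8859-1 decoding is total and has an exact inverse,
while a strict UTF-8 decoder must be able to fail. The names
decodeLatin1, encodeLatin1, decodeUtf8 and multi are illustrative
helpers, not the functions GHC itself uses.

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr, ord)
import Data.Word (Word8)

-- ISO-8859-1: byte n is code point n, so decoding can never fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Exact inverse of decodeLatin1 (assumes every Char is <= U+00FF,
-- which holds for any String produced by decodeLatin1).
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- Strict UTF-8 decoder (illustrative): Nothing on malformed input.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b:bs)
  | b < 0x80               = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80 bs
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800 bs
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000 bs
  | otherwise              = Nothing

-- Consume n continuation bytes; reject truncated, overlong,
-- surrogate and out-of-range sequences.
multi :: Int -> Int -> Int -> [Word8] -> Maybe String
multi n lead minCp bs
  | length conts == n && all isCont conts && valid =
      (chr cp :) <$> decodeUtf8 rest
  | otherwise = Nothing
  where
    (conts, rest) = splitAt n bs
    isCont c = c .&. 0xC0 == 0x80
    cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
               lead conts
    valid = cp >= minCp && cp <= 0x10FFFF
            && not (cp >= 0xD800 && cp <= 0xDFFF)
```

With these definitions, encodeLatin1 . decodeLatin1 is the identity on
every byte string, whereas decodeUtf8 rejects the two-byte filename
0x31 0xA2, so a UTF-8 based API has to decide what to do with that
file.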
Newsgroups: fa.haskell
From: Wolfgang Thaller <wolfgang.thal...@gmx.net>
Date: Sat, 19 Mar 2005 06:10:38 GMT
Local: Sat, Mar 19 2005 1:10 am
Subject: Re: [Haskell-cafe] invalid character encoding
Glynn Clements wrote:

> OK, so the intermediate string will be nonsense if ISO-8859-1 isn't
> the correct encoding, but that doesn't actually matter a lot of the
> time; frequently, you're just grabbing a "blob" of data from one
> function and passing it to another.

Yes. Of course, this also means that Strings representing non-ASCII
filenames will *always* be nonsense on Mac OS X and other UTF8-based
platforms.

> The problems will only appear once you start dealing with fallible or
> non-reversible encodings such as UTF-8 or ISO-2022.

In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
file name that is converted to Unicode cannot be converted back any
more (assuming you know for sure that it was ISO-2022 in the first
place)?

> Of course, it's quite possible that the only test cases will be people
> using UTF-8-only (or even ASCII-only) systems, in which case you won't
> see any problems.

I'm kind of hoping that we can just ignore a problem that is so rare
that a large and well-known project like GTK2 can get away with
ignoring it. Also, IIRC, Java strings are supposed to be unicode, too -
how do they deal with the problem?

>> So we can't do Unicode-based I18N because there exist a few unix
>> systems with messed-up file systems?
>
> Declaring such systems to be "messed up" won't make the problems go
> [...]

In general, yes. But we're not talking about all of reality here, we're
talking about one small part of reality - the question is, can the part
of reality where the design doesn't work be ignored?

For example, as soon as we use any kind of path names in our APIs, we
[...]

I think that if we wait long enough, the filename encoding problems
will become irrelevant and we will live in an ideal world where unicode
actually works. Maybe next year, maybe only in ten years.

Do we have other alternatives? Preferably something that provides other
[...]

Cheers,

Wolfgang
Newsgroups: fa.haskell
From: Einar Karttunen <ekart...@cs.helsinki.fi>
Date: Sat, 19 Mar 2005 09:34:54 GMT
Local: Sat, Mar 19 2005 4:34 am
Subject: Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller <wolfgang.thal...@gmx.net> writes:

> In what way is ISO-2022 non-reversible? Is it possible that an ISO-2022
> file name that is converted to Unicode cannot be converted back any
> more (assuming you know for sure that it was ISO-2022 in the first
> place)?

I am no expert on ISO-2022, so the following may contain errors;
please correct it if it is wrong.

ISO-2022 -> Unicode is always possible.

ISO-2022 works by providing escape sequences to switch between different
[...]

See here for more info on the topic: [...]

Also trusting system locale for everything is problematic and makes
[...]

Using filenames as opaque blobs causes the least problems. If the
[...]

- Einar Karttunen
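
One way to see how an escape-sequence encoding can fail to round-trip
is a toy model. The sketch below is not a real ISO-2022 decoder: it
assumes a stream that only ever uses ASCII, and it recognises a single
escape sequence, ESC ( B, which in ISO-2022 designates ASCII. The names
decodeToy and encodeToy are made up for the illustration.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Toy decoder for an ASCII-only subset of an ISO-2022-style stream.
-- ESC ( B (0x1B 0x28 0x42) selects ASCII; since ASCII is the only
-- character set this sketch knows, the sequence changes nothing and
-- is simply dropped. Any other escape or 8-bit byte is rejected.
decodeToy :: [Word8] -> Maybe String
decodeToy [] = Just []
decodeToy (0x1B : 0x28 : 0x42 : rest) = decodeToy rest
decodeToy (b : rest)
  | b < 0x80 && b /= 0x1B = (chr (fromIntegral b) :) <$> decodeToy rest
  | otherwise             = Nothing

-- Canonical encoder: never emits a redundant escape.
-- Assumes its input is ASCII-only (other characters would be truncated).
encodeToy :: String -> [Word8]
encodeToy = map (fromIntegral . ord)
```

Here the byte strings "ab" and ESC ( B a ESC ( B b both decode to the
String "ab", so decoding succeeds in both cases, but re-encoding can
reproduce at most one of the two original filenames.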
Newsgroups: fa.haskell
From: Glynn Clements <gl...@gclements.plus.com>
Date: Sat, 19 Mar 2005 14:33:13 GMT
Local: Sat, Mar 19 2005 9:33 am
Subject: Re: [Haskell-cafe] invalid character encoding
Exactly.
Moreover, while there are an infinite number of equivalent [...]
Newsgroups: fa.haskell
From: Glynn Clements <gl...@gclements.plus.com>
Date: Sat, 19 Mar 2005 17:35:30 GMT
Local: Sat, Mar 19 2005 12:35 pm
Subject: Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller wrote:

> > Of course, it's quite possible that the only test cases will be people
> > using UTF-8-only (or even ASCII-only) systems, in which case you won't
> > see any problems.
>
> I'm kind of hoping that we can just ignore a problem that is so rare
> [...]

1. The filename issues in GTK-2 are likely to be a major problem in
CJK locales, where filenames which don't match the locale (which is
seldom UTF-8) are common.

2. GTK's filename handling only really applies to file selector [...]

3. GTK is a GUI library. Most of the text which it deals with is going
[...]

> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Files are represented by instances of the File class:

	http://java.sun.com/j2se/1.5.0/docs/api/java/io/File.html

	An abstract representation of file and directory pathnames.

You can construct Files from Strings, and convert Files to Strings.
The File class includes two sets of directory enumeration methods:
[...]

The documentation for the File class doesn't mention encoding issues
[...]

> >> So we can't do Unicode-based I18N because there exist a few unix
> >> systems with messed-up file systems?
> >
> > Declaring such systems to be "messed up" won't make the problems go
> > [...]
>
> In general, yes. But we're not talking about all of reality here, we're
> [...]

Sure, you *can* ignore it; K&R C ignored everything other than ASCII.
If you limit yourself to locales which use the Roman alphabet (i.e.
ISO-8859-N for N=1/2/3/4/9/15), you can get away with a lot.

Most such users avoid encoding issues altogether by dropping the [...]

To get a better idea, you would need to consult users whose language
doesn't use the roman alphabet, e.g. CJK or cyrillic. Unfortunately,
you don't usually find too many of them on lists such as this.

I'm only familiar with one OSS project which has a sizeable CJK user
[...]

> I think that if we wait long enough, the filename encoding problems
> will become irrelevant and we will live in an ideal world where unicode
> actually works. Maybe next year, maybe only in ten years.

Maybe not even then. If Unicode really solved encoding problems, you'd
expect the CJK world to be the first adopters, but they're actually the
least eager; you are more likely to find UTF-8 in an English-language
HTML page or email message than a Japanese one.
Newsgroups: fa.haskell
From: Wolfgang Thaller <wolfgang.thal...@gmx.net>
Date: Sat, 19 Mar 2005 23:56:37 GMT
Local: Sat, Mar 19 2005 6:56 pm
Subject: Re: [Haskell-cafe] invalid character encoding
>> Also, IIRC, Java strings are supposed to be unicode, too -
>> how do they deal with the problem?
>
> Files are represented by instances of the File class:
> [...]

... which led me to conclude that they don't deal with the problem
properly.

>> I think that if we wait long enough, the filename encoding problems
>> will become irrelevant and we will live in an ideal world where
>> unicode actually works. Maybe next year, maybe only in ten years.
>
> Maybe not even then. If Unicode really solved encoding problems, you'd
> [...]

Hmm, that's possibly because english-language users can get away with
just marking their ASCII files as UTF-8. But I'm not arguing files or
HTML pages here, I'm only concerned with filenames. I prefer unicode
nowadays because I was born within a hundred kilometers of the "border"
between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language
texts, but as soon as I write about where I went for vacation, I need a
few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody
ever tried to sell ISO-2022 to me, so unicode was the only alternative.

So you've now convinced me that there is a considerable number of [...]

So, as I don't want to stick with the status quo forever (lists of
[...]

I don't think we want a type class here, a plain (abstract) data type
[...]

> data File

Obviously, we'll need conversion from and to C strings. On Mac OS X,
[...]

> withFilePathCString :: String -> (CString -> IO a) -> IO a

We will need functions for converting to and from unicode strings. I'm
[...]

> fileFromPath :: String -> File

Any impure operations that might be needed to decide how to encode the
[...]

> fileToPath :: File -> String

Same here: any impure operation necessary to convert the File to a
[...]

What about failure? If you go from String to File, errors should be
[...]

For going from File to String, I'm not so sure, but I would be very
[...]

If people want to implement more sophisticated ways of decoding file
[...]

Of course, there should also be lots of other useful functions that
[...]

Thoughts?

Cheers,

Wolfgang
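
To make the proposal concrete, here is one possible sketch of such an
abstract File type, under several assumptions that are not part of the
message above: the File stores the raw bytes of the name, the platform
encoding is taken to be UTF-8, and fileToPath answers the failure
question by substituting U+FFFD for undecodable bytes. The names
fileFromBytes, fileBytes, encodeChar and withFileCString are
hypothetical; withFileCString takes a File rather than a String, unlike
the withFilePathCString signature quoted above.

```haskell
import Data.Bits ((.&.), (.|.), shiftR)
import Data.Char (chr, ord)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (withArray0)

-- Abstract in spirit: a real module would not export the constructor.
newtype File = File [Word8] deriving (Eq, Show)

-- Hypothetical escape hatches that keep the original bytes intact.
fileFromBytes :: [Word8] -> File
fileFromBytes = File

fileBytes :: File -> [Word8]
fileBytes (File bs) = bs

-- Assumption: names are encoded as UTF-8 (this cannot fail).
fileFromPath :: String -> File
fileFromPath = File . concatMap encodeChar

encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- Lossy by design: each undecodable byte becomes U+FFFD.
fileToPath :: File -> String
fileToPath (File bytes) = go bytes
  where
    go [] = []
    go (b:bs)
      | b < 0x80               = chr (fromIntegral b) : go bs
      | b >= 0xC2 && b <= 0xDF = multi 1 (b .&. 0x1F) 0x80 bs
      | b >= 0xE0 && b <= 0xEF = multi 2 (b .&. 0x0F) 0x800 bs
      | b >= 0xF0 && b <= 0xF4 = multi 3 (b .&. 0x07) 0x10000 bs
      | otherwise              = '\xFFFD' : go bs
    multi n lead minCp bs
      | length conts == n && all isCont conts && ok = chr cp : go rest
      | otherwise = '\xFFFD' : go bs
      where
        (conts, rest) = splitAt n bs
        isCont c = c .&. 0xC0 == 0x80
        cp = foldl (\acc c -> acc * 64 + fromIntegral (c .&. 0x3F))
                   (fromIntegral lead) conts
        ok = cp >= minCp && cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF)

-- Pass the stored bytes, NUL-terminated, to a C function.
withFileCString :: File -> (CString -> IO a) -> IO a
withFileCString (File bytes) = withArray0 0 (map fromIntegral bytes)
```

Because the File keeps the original bytes, a name obtained from a
directory listing can still be passed back to the operating system
unchanged through withFileCString, even when fileToPath could only
display it with replacement characters.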
Newsgroups: fa.haskell
From: Dimitry Golubovsky <dimi...@golubovsky.org>
Date: Sun, 20 Mar 2005 04:13:29 GMT
Local: Sat, Mar 19 2005 11:13 pm
Subject: Re: [Haskell-cafe] invalid character encoding
Glynn Clements wrote:

> To get a better idea, you would need to consult users whose language
> doesn't use the roman alphabet, e.g. CJK or cyrillic. Unfortunately,
> you don't usually find too many of them on lists such as this.

In Russia, we still have multiple one-byte encodings for Cyrillic: KOI-8
(Unix), CP1251 (Windows), and the increasingly obsolete CP866 (MSDOS,
OS/2).

Regarding filenames, I am sure Windows stores them in Unicode regardless
of locale (I tried various chcp numbers in a console window, listing a
directory containing filenames in both Russian and German; it showed
"non-characters" as question marks when a locale-based codepage was
set, and showed everything with chcp 65001, which is Unicode).

AFAIK Unix users do not create files named in Russian very often, and
Windows users do this frequently.

Dimitry Golubovsky
Newsgroups: fa.haskell
From: "Marcin 'Qrczak' Kowalczyk" <qrc...@knm.org.pl>
Date: Sat, 19 Mar 2005 18:18:51 GMT
Local: Sat, Mar 19 2005 1:18 pm
Subject: Re: [Haskell-cafe] invalid character encoding
Wolfgang Thaller <wolfgang.thal...@gmx.net> writes:

> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Java (Sun)
----------

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
[...]

b) Creating. All Java characters are representable in Java-modified UTF-8.

Command line arguments are interpreted according to the locale.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
[...]

C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
[...]

a) Interpreting. If a filename cannot be converted, it's skipped in
[...]

The documentation says that if a filename, a command line argument
[...]

b) Creating. If UTF-8 is used, U+0000 throws an exception [...]

Command line arguments are treated in the same way, except that if an
[...]

Console.WriteLine emits UTF-8. Paired surrogates are treated [...]

Console.ReadLine interprets text as UTF-8. Bytes which cannot be [...]
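
The replacement policy listed under Java (Sun) can be modelled in a few
lines to show why it loses information. This is only a sketch: plain
ASCII stands in for "the locale encoding", and interpret and create are
made-up names, not Java or Haskell library functions.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Interpreting: bytes that cannot be converted become U+FFFD.
-- (ASCII stands in for the locale encoding in this sketch.)
interpret :: [Word8] -> String
interpret = map (\b -> if b < 0x80 then chr (fromIntegral b) else '\xFFFD')

-- Creating: characters that cannot be converted become '?' (0x3F).
create :: String -> [Word8]
create = map (\c -> if ord c < 0x80 then fromIntegral (ord c) else 0x3F)
```

A file named with the bytes 0x31 0xA2 is listed as "1" followed by
U+FFFD, and passing that String back produces the bytes 0x31 0x3F,
i.e. "1?", which names a different file. A program using such an API
can see the file in a listing but cannot open or remove it.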
Newsgroups: fa.haskell
From: Ian Lynagh <ig...@earth.li>
Date: Sat, 19 Mar 2005 19:14:42 GMT
Local: Sat, Mar 19 2005 2:14 pm
Subject: Re: [Haskell-cafe] invalid character encoding
On Wed, Mar 16, 2005 at 11:55:18AM +0000, Ross Paterson wrote:
> On Wed, Mar 16, 2005 at 03:54:19AM +0000, Ian Lynagh wrote:
> > Do you have a list of functions which behave differently in the new
> > release to how they did in the previous release?
> > (I'm not interested in changes that will affect only whether something
> > compiles, not how it behaves given it compiles both before and after).
>
> I got lost in the negatives here. It affects all Haskell 98 primitives
> [...]

In the below, it looks like there is a bug in getDirectoryContents.

Also, the error from w.hs is going to stdout, not stderr.

Most importantly, though: is there any way to remove this file without
[...]

Is there anything LC_CTYPE can be set to that will act like C/POSIX but
[...]

(in the POSIX locale)
[...]
ERROR - Garbage collection fails to reclaim sufficient space
$ echo 'import Directory; main = removeFile "1\xA2"' > w.hs
[...]
Program error: 1?: Directory.removeFile: does not exist (file does not exist)

Now consider this e.hs:

--------------------
main = do hWaitForInput stdin 10000
[...]

$ { printf "\xC2\xC2\xC2\xC2\xC2\xC2\xC2"; sleep 30; } | runhugs e.hs

Program error: <stdin>: IO.hGetChar: protocol error (invalid character encoding)

It takes 30 seconds for this error to be printed. This shows two issues:
[...]

Thanks