File IO with unicode

sailor...@gmail.com

Aug 31, 2006, 11:50:49 PM
Hi,

With simple text file I/O as follows:

(with-open-file (stream "/some/file/name.txt")
  (format t "~a~%" (read-line stream)))

I tried two text files, both in Traditional Chinese:
one is Big-5 (code page 950), the other is UTF-8.

[1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
       (format t "~a~%" (read-line stream)))
中文
NIL

This works in CLISP 2.39, but in LispBox (SLIME with CLISP upgraded to
2.39) it shows:

Character #\u4E2D cannot be represented in the character set
CHARSET:ISO-8859-1
[Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]


[2]> (with-open-file (stream "/temp/UTF8_Chinese.txt")
       (format t "~a~%" (read-line stream)))

*** - POSIX library error 42 (EILSEQ): Invalid multibyte or wide character
The following restarts are available:
ABORT :R1 ABORT
Break 1 [3]> :R1

Pascal Bourguignon

Sep 1, 2006, 12:26:05 AM
"sailor...@gmail.com" <sailor...@gmail.com> writes:

The Common Lisp standard specifies the standard character set to be
exactly these characters, plus Newline:


SP ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~

Nothing less, nothing more.

So why do you expect to be able to read a character file containing
any characters other than these, using only the standard API?

Now, if you read the error message a bit more closely, you might notice
something. Try reading it again:


Character #\u4E2D cannot be represented in the character set
CHARSET:ISO-8859-1


What does this error message tell us?


You may also want to reread the CLHS page on OPEN:

http://www.lispworks.com/documentation/HyperSpec/Body/f_open.htm

and the clisp Implementation Notes

http://clisp.cons.org/impnotes/stream-dict.html#open

(just for a start; don't hesitate to follow further links, such as:

http://clisp.cons.org/impnotes/encoding.html#def-file-enc
).


--
__Pascal Bourguignon__ http://www.informatimago.com/

Nobody can fix the economy. Nobody can be trusted with their finger
on the button. Nobody's perfect. VOTE FOR NOBODY.

Raffael Cavallaro

Sep 1, 2006, 12:39:58 AM
On 2006-08-31 23:50:49 -0400, "sailor...@gmail.com"
<sailor...@gmail.com> said:

> With the simple file text IO as follows:
>
> (with-open-file (stream "/some/file/name.txt")
> (format t "~a~%" (read-line stream)))
>
> I tried two text files, both are Traditional Chinese,
> one is Big-5(Codepage 950), the other is UTF-8

Maybe you need the :external-format keyword option to with-open-file?
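
A minimal sketch of that suggestion, assuming CLISP, where encodings are
named by symbols in the CHARSET package. Whether charset:big5 is available
may depend on how CLISP was built, and other implementations spell external
formats differently. The paths are the ones from the original post.

```lisp
;; CLISP-specific sketch: name each file's encoding explicitly instead of
;; relying on the default file encoding.
;; Assumption: this CLISP build provides charset:big5.
(with-open-file (stream "/temp/Big5_Chinese.txt"
                        :external-format charset:big5)
  (format t "~a~%" (read-line stream)))

(with-open-file (stream "/temp/UTF8_Chinese.txt"
                        :external-format charset:utf-8)
  (format t "~a~%" (read-line stream)))
```

This only covers decoding the file. Printing the line still goes through
the encoding of the output stream, which appears to be what the
ISO-8859-1 error in the SLIME session is about.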

Sam Steingold

Sep 1, 2006, 12:42:25 AM
to sailor...@gmail.com
> * sailor...@gmail.com <fnvybe...@tznvy.pbz> [2006-08-31 20:50:49 -0700]:

>
> Character #\u4E2D cannot be represented in the character set
> CHARSET:ISO-8859-1
> [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]

http://clisp.cons.org/impnotes/faq.html#faq-enc-err

--
Sam Steingold (http://www.podval.org/~sds) on Fedora Core release 5 (Bordeaux)
http://camera.org http://thereligionofpeace.com http://memri.org
http://honestreporting.com http://jihadwatch.org http://mideasttruth.com
Marriage is the sole cause of divorce.

sailor...@gmail.com

Sep 1, 2006, 1:45:35 AM

I expected the file I/O library to detect the encoding of a text file
from its BOM.
http://en.wikipedia.org/wiki/Byte_Order_Mark

Pascal Bourguignon

Sep 1, 2006, 2:15:21 AM
"sailor...@gmail.com" <sailor...@gmail.com> writes:

That works only for Unicode files.

What about ISO-8859-1 files? What about ISO-2022-JP files? What
about BIG5 files? What about US-ASCII files?
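
To illustrate the limitation, here is a rough sketch of a BOM check in
portable Common Lisp. The name sniff-bom is hypothetical, not part of
CLISP or the standard. It reads the first bytes of the file as octets and
can only ever answer for the Unicode encodings; for a Big5, ISO-8859-1 or
US-ASCII file, or a UTF-8 file written without a BOM, it returns NIL.

```lisp
;; Hypothetical helper: guess an encoding from a leading byte order mark.
;; Returns :UTF-8, :UTF-16LE, :UTF-16BE, or NIL when no BOM is recognized.
;; Note: a UTF-32LE BOM (FF FE 00 00) would be reported as :UTF-16LE here.
(defun sniff-bom (path)
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (let* ((buffer (make-array 3 :element-type '(unsigned-byte 8)
                                 :initial-element 0))
           ;; N is the number of octets actually read (0 to 3).
           (n (read-sequence buffer in)))
      (flet ((starts-with (&rest bytes)
               ;; EVERY stops at the end of the shorter sequence, so only
               ;; the first (length bytes) octets of BUFFER are compared.
               (and (>= n (length bytes))
                    (every #'= bytes buffer))))
        (cond ((starts-with #xEF #xBB #xBF) :utf-8)
              ((starts-with #xFF #xFE) :utf-16le)
              ((starts-with #xFE #xFF) :utf-16be)
              (t nil))))))
```

A NIL result says nothing about which legacy encoding the file uses, so
the caller still has to supply an external format for those cases.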

--
__Pascal Bourguignon__ http://www.informatimago.com/

The rule for today:
Touch my tail, I shred your hand.
New rule tomorrow.

kavenchuk

Sep 1, 2006, 2:43:58 AM

sailor...@gmail.com wrote:

> http://en.wikipedia.org/wiki/Byte_Order_Mark

Did you read it?

"... Quite a lot of Windows software (including Windows Notepad) adds
one to UTF-8 files. However in Unix-like systems (which make heavy use
of text files for configuration) this practice is not recommended, as
it will interfere with correct processing of important codes such as
the hash-bang at the start of an interpreted script."

WBR, Yaroslav Kavenchuk.

Stephen Compall

Sep 1, 2006, 2:47:27 AM
sailor...@gmail.com wrote:
> [1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
> (format t "~a~%" (read-line stream)))
> 中文
> NIL
>
> This works in CLISP 2.39, but in LispBox (SLIME/ with CLISP upgraded to
> 2.39),
> it shows
>
> Character #\u4E2D cannot be represented in the character set
> CHARSET:ISO-8859-1
> [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]

I would guess that this relates to the coding system for communication
between Emacs and CLISP, if you are saying this works with plain CLISP
but not when connecting with SLIME.

In CLISP, after loading Swank but before starting the server, do:

(setq swank::*coding-system* :utf-8-unix)

In Emacs, after loading SLIME but before connecting to CLISP, do:

(setq slime-net-coding-system 'utf-8-unix)

I forget how to fix the inferior-lisp buffer to do this right, maybe
something about C-x <RET> f?

--
Stephen Compall
http://scompall.nocandysw.com/blog

Timofei Shatrov

Sep 1, 2006, 3:48:58 AM
On Fri, 01 Sep 2006 06:47:27 GMT, Stephen Compall
<stephen...@gmail.com> tried to confuse everyone with this message:

>sailor...@gmail.com wrote:
>> [1]> (with-open-file (stream "/temp/Big5_Chinese.txt")
>> (format t "~a~%" (read-line stream)))
>> 中文
>> NIL
>>
>> This works in CLISP 2.39, but in LispBox (SLIME/ with CLISP upgraded to
>> 2.39),
>> it shows
>>
>> Character #\u4E2D cannot be represented in the character set
>> CHARSET:ISO-8859-1
>> [Condition of type EXT:SIMPLE-CHARSET-TYPE-ERROR]
>
>I would guess that this relates to the coding system for communication
>between Emacs and CLISP, if you are saying this works with plain CLISP
> but not when connecting with SLIME.
>
>In CLISP, after loading Swank but before starting the server, do:
>
>(setq swank::*coding-system* :utf-8-unix)

I don't think it is necessary, because the next step sets it up already:

>In Emacs, after loading SLIME but before connecting to CLISP, do:
>
>(setq slime-net-coding-system 'utf-8-unix)

Put this line into .emacs

--
|Don't believe this - you're not worthless ,gr---------.ru
|It's us against millions and we can't take them all... | ue il |
|But we can take them on! | @ma |
| (A Wilhelm Scream - The Rip) |______________|

Christopher Brown

Sep 1, 2006, 11:21:31 AM
For what it's worth, this also fixed my problem (earlier thread about
sbcl & non-ascii filenames).
After I applied Yaroslav Kavenchuk's patches to sbcl, I found slime
would hang on directory listings. Changing the external-format as
below fixed that problem.

Cheers,
Chris
