clisp parse-namestring and Chinese characters on Windows

zea...@126.com

Apr 19, 2012, 11:36:44 PM

I am a Chinese clisp user, working on Windows Chinese Edition with
Emacs23.

I have found that clisp's implementation of parse-namestring cannot
correctly handle pathname strings that include Chinese characters.

For example, all my encodings are gbk, and in particular
*pathname-encoding* is gbk. When I call the function:

(parse-namestring "F:/工具")

there is an error: "PARSE-NAMESTRING: syntax error in filename
"F:/工具" at position 4". I think that means the Chinese character "具"
cannot be handled by clisp.

With some other experiments, I found that SOME Chinese characters
indeed cannot be handled correctly. Anyone can get the list with the
code below:

(loop for i from #x4e00 to #x9fa5
      do (multiple-value-bind (ret cond)
             (ignore-errors (parse-namestring (string (code-char i))))
           (when (and (null ret) cond)
             (princ (code-char i)))))

Alternatively, I set all my encodings to utf-8:

clisp -E UTF-8

Now, when I call the function:

(parse-namestring "F:/工具")

I get the right result, but there are still OTHER characters that cannot
be handled. You can run the code above again to get another list of
characters that can't be handled in UTF-8 encoding under Windows
Chinese Edition.

I tried the same function calls in Ubuntu, and always got the right
result.

Maybe that's a BUG? Or did I miss something?

BTW, clisp's ext:probe-directory signals an internal error in the
Windows Chinese Edition environment with gbk encoding. For example:

(ext:probe-directory "F:/工具/")

ERROR: PARSE-NAMESTRING: syntax error in filename "F:/工具" at position 4,

which means probe-directory invokes parse-namestring, which cannot
handle the character "具" as stated above. But "书籍" CAN BE HANDLED BY
parse-namestring, AND probe-directory SIGNALS ANOTHER ERROR:

(ext:probe-directory "F:/书籍/")

ERROR: Internal error: statement in file "../src/pathname.d", line 6144 has been reached!!

Maybe I cannot use clisp on Windows?

tar...@google.com

Apr 20, 2012, 4:54:24 PM

On Thursday, April 19, 2012 8:36:44 PM UTC-7, zea...@126.com wrote:
> I am a Chinese clisp user, working on Windows Chinese Edition with
> Emacs23.
>
> I have found that clisp's implementation of parse-namestring cannot
> correctly handle pathname strings that include Chinese characters.

...

> Alternatively, I set all my encodings to utf-8:
>
> clisp -E UTF-8
>
> Now, when I call the function:
>
> (parse-namestring "F:/工具")
>
> I get the right result, but there are still OTHER characters that cannot
> be handled. You can run the code above again to get another list of
> characters that can't be handled in UTF-8 encoding under Windows
> Chinese Edition.
>
> I tried the same function calls in Ubuntu, and always got the right
> result.
>
> Maybe that's a BUG? Or did I miss something?

...

> Maybe I cannot use clisp on Windows?

Possibly.
I know nothing about the file system interface in Windows, or whether it can properly support UTF-8 character encodings.
You might want to try your question on a clisp-specific mailing list like clisp-devel.

But the other thing to look into is to make sure that the input you are giving clisp is actually proper utf-8 encoding.

If you are reading the forms from a file, you need to make sure the file is saved in utf-8 format.

For starters, then, I would put the PARSE-NAMESTRING call inside a test function whose name consists only of ASCII characters. That will let you test without worrying about how the typing gets encoded into characters. Put that function in a text file and make sure the text file is properly saved in UTF-8 format. Load the test file and try to call your test function.
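
One way to take the source encoding out of the picture entirely is to build the test string from Unicode code points, so that the test file contains only ASCII. The following is a minimal sketch in standard Common Lisp. The names `test-path` and `test-parse` are made up for illustration, and it assumes the implementation's characters cover Unicode.

```lisp
;; Hypothetical helper names; only standard Common Lisp is used.
(defun test-path ()
  "Build \"F:/\" followed by U+5DE5 and U+5177 without any non-ASCII source text."
  (coerce (list #\F #\: #\/ (code-char #x5DE5) (code-char #x5177))
          'string))

(defun test-parse (&optional (path (test-path)))
  "Return the parsed pathname, or the condition if PARSE-NAMESTRING signals an error."
  (handler-case (values (parse-namestring path))
    (error (condition) condition)))
```

Calling `(test-parse)` under each encoding setting shows whether the failure depends on how the literal reached the reader. If the same syntax error appears here, the encoding of the typed or saved text is probably not the cause.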

If you are typing the characters at the terminal interactively, you would need to make sure that the characters that are passed are really proper UTF-8 characters.

I would start by trying to examine the string that you use for the name and making sure that it is a proper UTF-8 string. That will likely require you to use some clisp-specific functions to look at the string and its encoding.
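
As a rough sketch of such an inspection, the character codes of the string can be listed with standard Common Lisp alone. The helper names below are hypothetical. The last definition is CLISP-only and relies on `ext:convert-string-to-bytes` and `custom:*pathname-encoding*` behaving as described in the CLISP implementation notes, which is an assumption worth checking against your version.

```lisp
;; Hypothetical helper names; the first two functions are standard Common Lisp.
(defun string-codes (string)
  "Return the character codes of STRING as a list of integers."
  (map 'list #'char-code string))

(defun print-codes (string &optional (stream *standard-output*))
  "Print one line per character of STRING: its index and its code in hex."
  (loop for code in (string-codes string)
        for i from 0
        do (format stream "~D: U+~4,'0X~%" i code)))

;; CLISP only, and an assumption taken from the CLISP implementation notes:
;; EXT:CONVERT-STRING-TO-BYTES returns the bytes of STRING in an encoding,
;; here the one CLISP currently uses for pathnames.
#+clisp
(defun pathname-bytes (string)
  (ext:convert-string-to-bytes string custom:*pathname-encoding*))
```

A correctly decoded "F:/工具" has five characters, with codes 46, 3A, 2F, 5DE5 and 5177 in hex. More than five characters, or different codes in the last two positions, would suggest that the literal was decoded with the wrong encoding before parse-namestring ever saw it.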

Pascal J. Bourguignon

Apr 20, 2012, 5:16:32 PM

It may be more complicated than using proper UTF-8. There are several
normalized forms of UTF-8, and file systems may expect a single one
(that's the case of MacOSX, I don't know about MS-Windows).

I note that in the external-formats there is no way to specify a
normalized form along with the utf-8 encoding, whatever the
implementation.

--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.

Raymond Toy

Apr 21, 2012, 3:51:33 PM

What do you mean by normalized forms of UTF-8? I thought there was only
one utf-8 encoding of a string, but there are at least four different
normalized forms of the string.

>
>
> I note that in the external-formats there is no way to specify a
> normalized form along with the utf-8 encoding, whatever the
> implementation.

If you're talking about normalized forms, isn't that a property of the
string and not of the encoding? CMUCL provides four functions to
convert a string to one of the four normalized forms:
lisp:string-to-nfc, lisp:string-to-nfkc, lisp:string-to-nfd and
lisp:string-to-nfkd. (Symbols are always normalized to one of these
forms before being interned. I forget which form.)
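
To illustrate the distinction, here is a small sketch in portable Common Lisp. It does not call any normalization function. Instead it writes out by hand the NFC and NFD forms of "é" and encodes each with a hand-written UTF-8 encoder. The names `utf-8-octets`, `*composed*` and `*decomposed*` are hypothetical, and the encoder ignores surrogate code points.

```lisp
;; Hypothetical names; standard Common Lisp only.
(defun utf-8-octets (string)
  "Encode STRING as a list of UTF-8 octets (surrogates are not checked)."
  (loop for ch across string
        for code = (char-code ch)
        append (cond ((< code #x80)
                      (list code))
                     ((< code #x800)
                      (list (logior #xC0 (ash code -6))
                            (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (list (logior #xE0 (ash code -12))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F))))
                     (t
                      (list (logior #xF0 (ash code -18))
                            (logior #x80 (logand (ash code -12) #x3F))
                            (logior #x80 (logand (ash code -6) #x3F))
                            (logior #x80 (logand code #x3F)))))))

;; NFC form of "é": the single precomposed character U+00E9.
(defparameter *composed* (string (code-char #xE9)))

;; NFD form of "é": "e" followed by U+0301 COMBINING ACUTE ACCENT.
(defparameter *decomposed* (coerce (list #\e (code-char #x301)) 'string))
```

Each string has exactly one UTF-8 encoding: `(utf-8-octets *composed*)` gives the octets C3 A9, and `(utf-8-octets *decomposed*)` gives 65 CC 81. The two strings are not `string=`, so a file system that compares names octet by octet would treat them as different names. The ideographs in the range U+4E00 to U+9FA5 tested earlier in the thread have no decompositions, so normalization by itself seems unlikely to explain those failures.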

Ray

Pascal J. Bourguignon

Apr 21, 2012, 4:10:46 PM

Raymond Toy <toy.r...@gmail.com> writes:

>> It may be more complicated than using proper UTF-8. There are several
>> normalized forms of UTF-8, and file systems may expect a single one
>> (that's the case of MacOSX, I don't know about MS-Windows).
>
> What do you mean by normalized forms of UTF-8? I thought there was only
> one utf-8 encoding of a string, but there are at least four different
> normalized forms of the string.

Yes, I meant utf-8 encoding of each of the four different normalized
forms of the string.

>> I note that in the external-formats there is no way to specify a
>> normalized form along with the utf-8 encoding, whatever the
>> implementation.
>
> If you're talking about normalized forms, isn't that a property of the
> string and not of the encoding? CMUCL provides four functions to
> convert a string to one of the four normalized forms:
> lisp:string-to-nfc, lisp:string-to-nfkc, lisp:string-to-nfd and
> lisp:string-to-nfkd. (Symbols are always normalized to one of these
> forms before being interned. I forget which form.)

Good. Now let's see what other implementations do about it?

Raymond Toy

Apr 21, 2012, 6:07:32 PM

On 4/21/12 1:10 PM, Pascal J. Bourguignon wrote:

> Raymond Toy <toy.r...@gmail.com> writes:
>>> I note that in the external-formats there is no way to specify a
>>> normalized form along with the utf-8 encoding, whatever the
>>> implementation.
>>
>> If you're talking about normalized forms, isn't that a property of the
>> string and not of the encoding? CMUCL provides four functions to
>> convert a string to one of the four normalized forms:
>> lisp:string-to-nfc, lisp:string-to-nfkc, lisp:string-to-nfd and
>> lisp:string-to-nfkd. (Symbols are always normalized to one of these
>> forms before being interned. I forget which form.)
>
> Good. Now let's see what other implementations do about it?

I'm guessing that other implementations don't normalize strings before
interning them. There's an ansi-test case that cmucl fails because it
expects the symbol name to be exactly the same as the given string.
Since cmucl normalizes, the test fails because normalization has changed
the string.

Oh, cmucl normalizes strings to NFC form for symbols. And it does the
same for package names.

Ray