unicode

paul johnston

Jun 15, 2000, 3:00:00 AM

I've got a bit of a problem, so could anyone out there help me?
As a computer officer in the Dept of Language Engineering at UMIST I am
asked to supply software for various projects.

We tend to have to work in several languages with non-Latin scripts,
e.g. Greek, Cyrillic and even Arabic. Does anyone have a suggestion as to
a Unicode-compatible Lisp that we can use?
We have Allegro CL ver 5.0. Has anyone any experience in using non-Latin
scripts with this, either under NT or Solaris 7?
Many Thanks

--
Paul Johnston
System Admin
Language Engineering
UMIST
Tel 0161 200 3111

William Deakin

Jun 16, 2000, 3:00:00 AM

paul johnston wrote:

> We tend to have to work in several languages with non-latin scripts,
> i.e. Greek Cyrillic and even Arabic. Does anyone have a suggestion as to
> a unicode compatible lisp that we can use.
> We have Allegro CL ver 5.0 has anyone any experience in using non-latin
> scripts with this, either under NT or Solaris7?

Although I have very limited experience with unicode (I wrote a couple of
strings in Italian once), there was an interesting discussion on c.l.l. a
short while ago about issues related to this but I am not sure if it is
exactly what you wanted[1].

I would also contact Franz directly. Finally, having run a quick search on
the Franz website, I found references to International Allegro CL, which has
support for Japanese (kanji, ganji &c), so I would have thought Greek,
Cyrillic, Arabic or whatever must be tractable.

Best Regards,

:) will

[1] This was the thread `strings and characters'; see
www.deja.com/getdoc.xp?AN=598005460. Deja is also always a good starting
point for researching historic postings from c.l.l.

Arthur Lemmens

Jun 16, 2000, 3:00:00 AM

paul johnston wrote:
>
> I've got a bit of a problem so could anyone out there help me.
> As a computer officer in the Dept of Language Engineering at UMIST I am
> asked to supply software for various projects.
>

> We tend to have to work in several languages with non-latin scripts,
> i.e. Greek Cyrillic and even Arabic. Does anyone have a suggestion as to
> a unicode compatible lisp that we can use.

LispWorks has reasonably good support for Unicode. I've used it to edit and
process some Unicode files that contained characters from the ASCII, Latin1
and Cyrillic character blocks.

It was pretty easy to configure the editor so it could switch between two
sets of keyboard bindings. (I think it would be easy to support Greek in the
same way, but configuring the editor for working with Arabic would be a lot
more difficult, as you probably know.)

I had a few small problems with LispWorks' Unicode support, but no major
gotchas.

Arthur

hai...@clisp.cons.org

Jun 20, 2000, 3:00:00 AM

paul johnston <pa...@ccl.umist.ac.uk> asks:

> Does anyone have a suggestion as to a unicode compatible lisp that
> we can use.

The Linux Unicode HOWTO [1], section 5.3, answers your question:

The Common Lisp standard specifies two character types: `base-char'
and `character'. It's up to the implementation to support Unicode or
not. The language also specifies a keyword argument `:external-format'
to `open', as the natural place to specify a character set or
encoding.
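
As a rough illustration of the `:external-format' hook described above, here is a minimal portable sketch. The function name and the file name are made up for this example and are not from the HOWTO. It passes `:default', which is the only external format the standard itself defines; any other value is implementation-specific and has to be looked up in the vendor's documentation.

```lisp
;; Sketch only: WRITE-THEN-READ-LINE is a hypothetical helper name.
(defun write-then-read-line (path text)
  "Write TEXT to PATH, then read the first line back, both as character data."
  (with-open-file (out path :direction :output
                            :if-exists :supersede
                            :if-does-not-exist :create
                            :element-type 'character
                            ;; :DEFAULT is the only value the standard defines;
                            ;; named encodings are implementation extensions.
                            :external-format :default)
    (write-line text out))
  (with-open-file (in path :direction :input
                          :element-type 'character
                          :external-format :default)
    ;; READ-LINE returns the line without its trailing newline.
    (read-line in)))
```

Calling (write-then-read-line "demo.txt" "plain ASCII text") should hand back the same string. Whether non-ASCII text survives the round trip depends on what `:default' means in the implementation and locale at hand.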

Among the free Common Lisp implementations, only CLISP http://clisp.cons.org/
supports Unicode. You need a CLISP version from March 2000 or
newer. ftp://clisp.cons.org/pub/lisp/clisp/source/clispsrc.tar.gz.
The types `base-char' and `character' are both equivalent to 16-bit
Unicode. The functions char-width and string-width provide an API
comparable to wcwidth() and wcswidth(). The encoding used for file or
socket/pipe I/O can be specified through the `:external-format'
argument. The encodings used for tty I/O and the default encoding for
file/socket/pipe I/O are locale dependent.

Among the commercial Common Lisp implementations, only Eclipse
http://www.elwood.com/eclipse/eclipse.htm supports Unicode. See
http://www.elwood.com/eclipse/char.htm. The type `base-char' is
equivalent to ISO-8859-1, and the type `character' contains all
Unicode characters. The encoding used for file I/O can be specified
through a combination of the `:element-type' and `:external-format'
arguments to `open'. Limitations: Character attribute functions are
locale dependent. Source and compiled source files cannot contain
Unicode string literals.

The commercial Common Lisp implementation Allegro CL does not support
Unicode yet, but Erik Naggum is working on it.

Bruno http://clisp.cons.org/~haible/

hai...@clisp.cons.org

Jun 20, 2000, 3:00:00 AM

> The Linux Unicode HOWTO [1], section 5.3

Oops, here's the URL:
ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html

Erik Naggum

Jun 20, 2000, 3:00:00 AM

* Bruno Haible

| The Linux Unicode HOWTO [1], section 5.3, answers your question:
:

| The commercial Common Lisp implementation Allegro CL does not support
| Unicode yet, but Erik Naggum is working on it.

Franz Inc has had Unicode support in Allegro CL for Windows for
quite some time, now, thanks to the efforts of Charles Cox. He has
also been working on Unicode support for Allegro CL for Unix for
quite some time, now. Allegro CL 6.0 supports Unicode natively.

#:Erik
--
If this is not what you expected, please alter your expectations.

Marco Antoniotti

Jun 20, 2000, 3:00:00 AM

Erik Naggum <er...@naggum.no> writes:

> * Bruno Haible
> | The Linux Unicode HOWTO [1], section 5.3, answers your question:
> :
> | The commercial Common Lisp implementation Allegro CL does not support
> | Unicode yet, but Erik Naggum is working on it.
>
> Franz Inc has had Unicode support in Allegro CL for Windows for
> quite some time, now, thanks to the efforts of Charles Cox. He has
> also been working on Unicode support for Allegro CL for Unix for
> quite some time, now. Allegro CL 6.0 supports Unicode natively.

Now the question is: are CLisp, ECLipse and ACL compatible in their
treatment of Unicode?

Cheers

--
Marco Antoniotti ===========================================

Erik Naggum

Jun 20, 2000, 3:00:00 AM

* Marco Antoniotti <mar...@parades.rm.cnr.it>

| Now the question is: are CLisp, ECLipse and ACL compatible in their
| treatment of Unicode?

Since basically the only useful thing to do with Unicode (data) is
to have _real_ wide strings, with characters at least 16 bits wide
_each_ and real character types that reflect real Unicoditude, and
since Unicode (the standard) defines pretty much what you can do in
the outside world, the question of what it means to be compatible
appears to be a question of how each Common Lisp treats _streams_ of
Unicode characters.

Kent M Pitman

Jun 20, 2000, 3:00:00 AM

Erik Naggum <er...@naggum.no> writes:

>
> * Marco Antoniotti <mar...@parades.rm.cnr.it>
> | Now the question is: are CLisp, ECLipse and ACL compatible in their
> | treatment of Unicode?
>
> Since basically the only useful thing to do with Unicode (data) is
> to have _real_ wide strings, with characters at least 16 bits wide
> _each_ and real character types that reflect real Unicoditude, and
> since Unicode (the standard) defines pretty much what you can do in
> the outside world, the question of what it means to be compatible
> appears to be a question of how each Common Lisp treats _streams_ of
> Unicode characters.

Not having played with it but just thinking about it for a second, I'd
think there'd also be the issue of what #\xxx you write to refer to such
a character, and whether a unicode A is char= to a non-unicode A (intended
to be constrained by the CL spec, but...), and probably many other
little details. It would certainly be interesting to hear about differences
people uncover.

Marco Antoniotti

Jun 20, 2000, 3:00:00 AM

Erik Naggum <er...@naggum.no> writes:

> * Marco Antoniotti <mar...@parades.rm.cnr.it>
> | Now the question is: are CLisp, ECLipse and ACL compatible in their
> | treatment of Unicode?
>
> Since basically the only useful thing to do with Unicode (data) is
> to have _real_ wide strings, with characters at least 16 bits wide
> _each_ and real character types that reflect real Unicoditude, and
> since Unicode (the standard) defines pretty much what you can do in
> the outside world, the question of what it means to be compatible
> appears to be a question of how each Common Lisp treats _streams_ of
> Unicode characters.

Well, Bruno mentioned CHAR-WIDTH and STRING-WIDTH. He also mentioned
the treatment of :EXTERNAL-FORMAT.

Does ACL have these functions?

Erik Naggum

Jun 21, 2000, 3:00:00 AM

* Marco Antoniotti <mar...@parades.rm.cnr.it>

| Well, Bruno mentioned CHAR-WIDTH and STRING-WIDTH.

From the names, I guess these are relics of coding systems. If you
think you need to work with coding systems, you are mistaken. If
you still think you need to work with coding systems, measuring the
width of characters in bytes is wrong.

| He also mentioned the treatment of :EXTERNAL-FORMAT.

The various external-formats you will need in the complex world of
universal character sets are not covered by the standard. Nor
should they. There are, however, several conflicting attempts to
enumerate them outside of the Lisp world, and it is not necessarily
useful to standardize on one of those.

| Does ACL have these functions?

I hope to <deity> that there won't be any char-width or similar
cruftitude in Allegro 6.0. If anything, we should have learned from
the Great Emacs Experience that exposing coding systems internals to
users is Just Plain Wrong.

hai...@clisp.cons.org

Jun 21, 2000, 3:00:00 AM

Marco Antoniotti <mar...@parades.rm.cnr.it> asked:

>
> Now the question is: are CLisp, ECLipse and ACL compatible in their
> treatment of Unicode?

Let me try to give a summary of the features in
- CLISP 2000-03-06
- Eclipse
- ACL 6.0 (not yet released)
- LispWorks 4.0.1

* Character and string types

In Eclipse, ACL, LispWorks the type BASE-CHAR includes only Latin-1
characters, whereas the CHARACTER type includes all of Unicode (16 bit).

In CLISP, BASE-CHAR and CHARACTER are equivalent and include all of
Unicode (16 bit). The memory representation of read-only strings
(e.g. symbol print names and program literals) is optimized to 1
byte/character if possible.
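
For anyone wanting to check where a given Lisp falls between these two arrangements, a small sketch using only standard operators might look like this. The function name and the plist keys are invented for illustration.

```lisp
;; Sketch only: CHARACTER-TYPE-REPORT is a hypothetical helper name.
(defun character-type-report ()
  "Return a plist describing how this implementation arranges its character types."
  (list :char-code-limit char-code-limit
        ;; Always true: the standard requires BASE-CHAR to be a subtype
        ;; of CHARACTER.
        :base-char-is-character (values (subtypep 'base-char 'character))
        ;; True only where the two types coincide, as described for CLISP.
        :character-is-base-char (values (subtypep 'character 'base-char))
        ;; The concrete type chosen for a string literal is up to the
        ;; implementation.
        :literal-string-type (type-of "abc")))
```

The results are implementation-dependent by design, so the output is something to compare across Lisps rather than a fixed answer.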

* Supported external formats of streams

- CLISP 2000-03-06: Around 80 external formats, including all of the
ones supported by browsers and Linux locales.
- Eclipse: Only :ASCII (1 byte/character), :UCS (2 bytes/character),
and :MULTI-BYTE (locale dependent multibyte representation, works
only on OSes for which wchar_t is Unicode).
- ACL 6.0: Lots of external formats, mostly table-driven.
- LispWorks: Around 10 external formats, including Latin-1, Unicode
(2 bytes/character), UTF-8, and the most important Japanese encodings
(but not ISO-2022-JP).

Different end-of-line conventions are indicated to OPEN through the
:external-format argument in CLISP and LispWorks, and through an extra
argument to OPEN in Eclipse.

* Additional API

- CLISP: STRING-WIDTH returns the display width of a string, used by
FORMAT ~T.
- Eclipse: none.
- ACL: unknown.
- LispWorks: functions for guessing the encoding of a file (important
for Japanese environments)

* FFI support

- CLISP: FFI can pass strings only with single-byte encodings.
- Eclipse, ACL: unknown
- LispWorks: a few specialized macros for passing strings from/to C.

Bruno

hai...@clisp.cons.org

Jun 21, 2000, 3:00:00 AM

Kent M Pitman <pit...@world.std.com> wrote:
>
> there'd also the issue of what #\xxx you write to refer to such
> a character

LispWorks uses the syntax #\U+203E, CLISP uses #\U203E, and ACL and Eclipse
have no such notation.
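
Since the reader syntax differs, code that has to load in more than one Lisp can avoid the #\ notation and go through CODE-CHAR instead. The following is only a sketch: the function name is made up, and it assumes that the implementation's character codes coincide with Unicode code points, which the standard does not promise.

```lisp
;; Sketch only: CHAR-FROM-CODE is a hypothetical helper name.
;; Assumption: CHAR-CODE values are Unicode code points in this Lisp.
(defun char-from-code (code)
  "Return the character whose code is CODE, or NIL if this Lisp has none."
  (and (integerp code)
       ;; CODE-CHAR is only defined for codes below CHAR-CODE-LIMIT.
       (<= 0 code (1- char-code-limit))
       (code-char code)))
```

With that, (char-from-code #x203E) stands in for the literal and returns NIL on a Lisp whose character repertoire does not reach that far.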

> whether a unicode A is char= to a non-unicode A

The entire idea of Unicode is that there are no characters outside
Unicode. The only non-Unicode characters I have ever seen in use in Web pages
are Inuktitut (some Eskimo people in Canada).

The bits and font attributes in CL are a different issue, of course.

Bruno

hai...@clisp.cons.org

Jun 21, 2000, 3:00:00 AM

Erik Naggum <er...@naggum.no> wrote:
>| Well, Bruno mentioned CHAR-WIDTH and STRING-WIDTH.
>
> From the names, I guess these are relics of coding systems.

Naggum, you are guessing wrong, because you neglected to look up the
documentation of the things you are talking about.
http://clisp.sourceforge.net/impnotes.html#string-width

These functions are needed for anyone wanting to perform tabular output
or word wrapping, assuming an output device with a fixed size font
and a double size font, like kterm or xterm.
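
To make the distinction concrete, here is a rough sketch of what a display-width function computes: terminal columns, not bytes. This is not CLISP's implementation. The function names are made up, the list of double-width ranges is a small illustrative subset rather than a complete table, and the code assumes character codes are Unicode code points.

```lisp
;; Sketch only: hypothetical names, and an incomplete range table.
;; Assumption: CHAR-CODE values are Unicode code points in this Lisp.
(defun rough-char-width (char)
  "Return 2 for characters in a few East Asian blocks, otherwise 1."
  (let ((code (char-code char)))
    (if (or (<= #x1100 code #x115F)   ; Hangul Jamo initial consonants
            (<= #x2E80 code #xA4CF)   ; CJK radicals through Yi
            (<= #xAC00 code #xD7A3)   ; Hangul syllables
            (<= #xF900 code #xFAFF)   ; CJK compatibility ideographs
            (<= #xFF01 code #xFF60))  ; fullwidth forms
        2
        1)))

(defun rough-string-width (string)
  "Return the number of fixed-width columns STRING would occupy."
  (reduce #'+ string :key #'rough-char-width))
```

A plain ASCII string such as "abc" comes out as 3 columns, while each character from the listed blocks counts for 2, which is the number needed to pad table cells or wrap lines.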

> If you still think you need to work with coding systems, measuring the
> width of characters in bytes is wrong.

Measuring the width of characters in bytes is *only* useful when you deal
with memory allocation, which you don't normally do in Lisp.

> If anything, we should have learned from the Great Emacs Experience
> that exposing coding systems internals to users is Just Plain Wrong.

I agree with you. What do you think about the NATIVE-STRING-SIZEOF
function in ACL 5.0.1
(see http://www.franz.com/support/documentation/5.0.1/doc/cl/iacl.htm) ?

Bruno

Erik Naggum

Jun 21, 2000, 3:00:00 AM

* Bruno Haible

| Naggum, you are guessing wrong, because you neglected to look up the
| documentation of the things you are talking about.

I love you, too.

| Measuring the width of characters in bytes is *only* useful when you deal
| with memory allocation, which you don't normally do in Lisp.

Well, gee, _I_ think you may not have paid attention to the Great
Emacs Experiment, but you wouldn't do something that would even make
it _possible_ to claim you haven't looked up the documentation of
the things you're talking about, would you? Nah, of course not.

| I agree with you.

I don't generally consider that comforting. This is no exception.

Pekka P. Pirinen

Jun 23, 2000, 3:00:00 AM

hai...@clisp.cons.org writes:
> Let me try to give a summary of the features in
> - CLISP 2000-03-06
> - Eclipse
> - ACL 6.0 (not yet released)
> - LispWorks 4.0.1

Current version is LW 4.1, but the differences should be small. Liquid
5.0 has many of the same interfaces.

> * Supported external formats of streams

> - LispWorks: Around 10 external formats, including Latin-1, Unicode
> (2 bytes/character), UTF-8, and the most important Japanese encodings
> (but not ISO-2022-JP).

LW support would probably help you, if you needed another external
format. There's a relatively painless way of adding one.

Also, on Windows, all the installed codepages are available as
external formats.

> * Additional API

> - LispWorks: functions for guessing the encoding of a file (important
> for Japanese environments)

Also functions for code conversions (see package EXTERNAL-FORMAT in
the Reference Manual), lots of string and character types and
predicates (package LISPWORKS), and *DEFAULT-CHARACTER-ELEMENT-TYPE*
for controlling the default size of strings &c.

> * FFI support

> - LispWorks: a few specialized macros for passing strings from/to C.

That's true, but lest people think that describes a limitation, the
FLI is pretty C-oriented, and the macros provided together with the
foreign types :EF-MB-STRING and :EF-WC-STRING allow passing of strings
in any encoding. The types take an external-format parameter, that
defaults to the encoding used by the current C locale (assuming you
tell LW what that is, see SET-LOCALE in the FLI manual).
--
Pekka P. Pirinen, Adaptive Memory Management Team, Harlequin Limited
Controlling complexity is the essence of computer programming.
- Kernighan

Steven M. Haflich

Jun 24, 2000, 3:00:00 AM

hai...@clisp.cons.org wrote:
>
> Marco Antoniotti <mar...@parades.rm.cnr.it> asked:

> Let me try to give a summary of the features in
> - CLISP 2000-03-06
> - Eclipse
> - ACL 6.0 (not yet released)
> - LispWorks 4.0.1

> ...

I'm sorry to say that Bruno's information about ACL (both 5.0 and 6.0)
is incorrect in a number of regards. Anyone wanting accurate information
should get it directly from Franz, not usenet.