I've been thinking a lot about languages lately. Languages are cool.
Lest you get the impression that I mean programming languages, which
can also be cool, I'm talking about natural languages. English,
German, Russian ... that sort of thing.
Given that Perl was started by a linguist, namely Larry, and has also
been under the influence of another, namely Tom, it seemed to me that
if any language would be friendly to languages other than boring
English (and Esperanto!) with non-accented, 7-bit ASCII characters, it
would be Perl.
Trying to feed things like character codes (such as 0x81) to parsers
has been known to produce results that one might generously classify
as "suboptimal". But if there's any parser that
can handle such things, thought I, it would be written by Larry,
Mr. Weird Parsing himself. And if there is a language anywhere for
which a parser to handle such things has been written, it would be
Perl.
Since we have an English module that allows one to use the English
equivalent of such bizarre things that we might classify as "native
Perl", it seemed only fair that we would have other modules that would
allow one to use keywords and whatnot based on languages other than
English, in their native writing systems. (I recognize the difficulty
of multibyte encoding schemes, so let's pretend for the time being
that this isn't a problem.)
I mean, really, wouldn't it be cool to be able to do something like
#!/usr/bin/perl
use Français; # minor bootstrapping problem. :-)
écrivez "Bonjour, monde!\n";
I certainly think so.
We might write a small program to see if it can be done:
#!/usr/local/bin/perl
écrivez "Bonjour, monde!\n";
sub écrivez ($) {
    my $m = shift;
    print $m;
}
In so doing, we're likely to make this discovery:
$ ./foo.pl
Unrecognized character \351 at ./foo.pl line 3.
Well, that's no fun.
How about Python?
#!/usr/local/bin/python
def écrivez (m):
    print m

écrivez("Bonjour, monde!")
hmm...
$ ./foo.py
File "./foo.py", line 5
def écrivez (m):
^
SyntaxError: invalid syntax
What a bore!
Let's see ... what other (programming) languages might support this
sort of thing? How about Lisp? XEmacs Lisp, at that?
(defun écrivez (m)
  "Écrivez un message."
  (message m))
Sure enough, evaluating the expression `(écrivez "Bonjour, monde!")'
will write `Bonjour, monde!' in the minibuffer completely sans
complaint.
This is too much. The implications are astounding. Newfangled vi
implementations that use Perl as their customization language are
actually less functional than XEmacs Lisp! Lisp can do something that
Perl can't!
Pray, what do we do? Larry, will you start using Lisp? Tom, will you
start hacking Lisp to customize your Emacs sessions? Is this the
beginning of the end?
Gleefully yours,
Just Another (Lisp|Perl|Java|Unix|.*)+ Hacker
--
Matt Curtin cmcu...@interhack.net http://www.interhack.net/people/cmcurtin/
CLISP Common Lisp has no problems with this either:
[1]> (defun écrivez (m)
"Écrivez un message."
(write-line m))
ÉCRIVEZ
[2]> (écrivez "Bonjour, tout le monde!")
Bonjour, tout le monde!
"Bonjour, tout le monde!"
[3]>
À bientôt!
The next release of CLISP will not only be 8-bit clean, it will also support
Unicode 16-bit characters.
Bruno http://clisp.cons.org/
This is a good thing, although only relatively.
If you are going to release a Unicode version (or UTF8) of CLISP, it'd
probably be a good thing to post some document that describes the
addendum so that other implementors can take advantage of it.
On top of that, since it seems there will be another round of the ANSI
CL committee, this could be meat for that venue.
Cheers
--
Marco Antoniotti ===========================================
PARADES, Via San Pantaleo 66, I-00186 Rome, ITALY
tel. +39 - (0)6 - 68 10 03 17, fax. +39 - (0)6 - 68 80 79 26
http://www.parades.rm.cnr.it
> [1]> (defun écrivez (m)
> "Écrivez un message."
> (write-line m))
> ÉCRIVEZ
> [2]> (écrivez "Bonjour, tout le monde!")
> Bonjour, tout le monde!
> "Bonjour, tout le monde!"
> [3]>
> À bientôt!
>
> The next release of CLISP will not only be 8-bit clean, it will also support
> Unicode 16-bit characters.
>
> Bruno http://clisp.cons.org/
With Macintosh Common Lisp you can program in Kanji, if you like. ;-)
It would be a very nice feature to have several CL functions localized,
so you don't have to invent your own routines to do this. I would like
to mention
- format (with date, time, floats)
- char-upcase etc. (e.g. Allegro is wrong when German special characters
  are involved)
- Daylight saving time & time zones
Since every serious OS supports localization, Lisp implementations
should be forced to use it.
Johannes Beck
--
Johannes Beck be...@informatik.uni-wuerzburg.de
http://www-info6.informatik.uni-wuerzburg.de/~beck/
Tel.: +49 931 312198
Fax.: +49 931 7056120
PGP Public Key available by finger://be...@informatik.uni-wuerzburg.de
[16-bit characters galore deleted]
> On top of that, since it seems there will be another round of the ANSI
> CL committee, this could be meat for that venue.
Hmmm... MULL (MUlti Lingual Lisp)? This isn't going to make users of
8-bit character sets experience increased storage overhead for the
exact same string objects and a performance hit in string bashing
functions, now is it?
On the upside, unicode support could give an additional excuse for
Lisp's apparent "slowness" in certain situations. In my Java class the
instructor seems to always bring up unicode support as part of the
excuse for Java's lousy performance (hmm... this isn't really
comforting for some reason though...).
Christopher
> It would be a very nice feature to have several CL functions localized,
> so you don't have to invent your own routines to do this. I would like
> to mention
> - format (with date, time, floats)
(format t "Let me be the ~:R ~R to mention format with ~~R!" 1 1)
--
Thomas A. Russ, USC/Information Sciences Institute t...@isi.edu
> Johannes Beck <be...@informatik.uni-wuerzburg.de> writes:
>
> > It would be a very nice feature to have several CL functions localized,
> > so you don't have to invent your own routines to do this. I would like
> > to mention
> > - format (with date, time, floats)
>
> (format t "Let me be the ~:R ~R to mention format with ~~R!" 1 1)
(format t "~@R'll ~:R that. ~
And ~@R want to be ~0@*~:R to take a ~:R out to mention ~~@R!"
1 2 1)
You've just run into what I believe is a misunderstanding, and one of
my pet peeves. A while back in "IEEE Computer" magazine, some yahoo
decided that we don't need to use 16 bits to handle international
characters; instead, we usually need only 8 bits at a time, and we
would get better performance by using 8-bit characters for everything
along with a locally understood "current char set". They eventually
printed a "letter to the editor" I sent, and the whole thing bugs me
enough that I'm going to repeat it here.
One issue you bring up that is not covered in the letter is whether
speed is affected in Lisp by simultaneously supporting BOTH ASCII and
Unicode. I admit that runtime dispatching between the two different
string representations would cost time if the compiler can't figure out
at compile time which is being used. In principle, proper declarations
fix this. Of course, if you don't have any declarations at all, then
dispatching between all the different sequence types won't be any more
expensive if there are two kinds of strings possible. However, it's
true that there is a volume of code (including system-supplied macros)
which declares things to be simply of type string (as opposed to
base-string or extended-string), and the benefits of these declarations
might be mostly lost if there are two kinds of strings. One solution
would be for an implementation to simply ALWAYS use extended-strings,
and it is either this situation or fully declared code that is assumed
in the letter.
Anyway, here's my rant on trying to save space by using 8-bit
characters everywhere. If I'm rattling this off too quickly, I'll be
happy to expand on any of the points.
----
I am confused by Neville Holmes's essay "Toward Decent Text Encoding"
(Computer, Aug. 1998, p. 108). If I understand correctly, Holmes argues
that 16-bit Unicode encoding of characters wastes space. Instead,
different regions of the world should use an implicitly understood
8-bit local encoding. Have we truly learned nothing from the Y2K
problem?
My understanding is as follows:
It is not quite correct to refer to Unicode as a 16-bit standard.
Unicode actually uses a 32-bit space. It is one of the more popular
subsets of Unicode, UCS-2, that happens to fit in 16 bits.
The distinction between in-memory encoding and external representation
cannot be overemphasized. Within a program, many algorithms rely for
their efficiency on being able to assume that characters within a
string are represented using a uniform width. But uniform width is much
less important for external representations, so multibyte and shifting
encodings can be used for data transfer. Conversion between external
and uniform-width internal representations is needed, by definition,
only where external resources are involved, and those are slow anyway,
so throughput need not be affected.
Performance should not be greatly affected by the choice of an 8-bit or
a 16-bit uniform representation within memory. On the other hand, using
nonuniform or shifting encodings in memory would have a much greater
effect on performance.
Interface performance is also not an issue. When sending character data
to a file or over the Internet, compression or alternate encodings can
be used. For example, the UTF-8 encoding of Unicode is, bit for bit,
identical to ASCII when used for text that happens to contain only
ASCII characters.
Many international and de facto standards involving written
representations, especially for programming languages, include keywords
and punctuation from the European/Latin local encoding. Within such
documents, then, this 8-bit encoding must coexist simultaneously with
other (presumably 8-bit) "local" encodings. I do not believe,
therefore, that outside Europe and North America 8 bits alone would be
consistently sufficient within a single application or even within a
single document.
localization and internationalization have done more to destroy what
was left of intercultural communication and respect for cultural needs
than anything else in the entire history of computing. the people who
are into these things should not be allowed to work with them, for the
same reason that people who want power should be the last people to
get it.
| I would like to mention
| - format (with date, time, floats)
date and time should be in ISO 8601. people who want something else can
write their own printers (and parsers). nobody agrees on date
representations even within the same office if left to themselves, let
alone in a whole country or culture. it's _much_ worse to put something
in a standard that people will compete with than not to put it in a
standard at all.
I have a printer and reader for date and time that goes like this:
(describe (get-full-time tz:pacific))
#[1999-02-09 13:16:33.692-08:00] is an instance of #<standard-class full-time>:
The following slots have :instance allocation:
sec 3127583793
msec 692
zone #<timezone pacific loaded from "US/Pacific" @ #x202b1e02>
format nil
[1970-01-01]
=> 2208988800
(setf *parse-time-default* '(1999 02 09))
[21:19]
=> 3127583940
a friend of mine commented that using [...] (returns a universal-time)
and #[...] (returns a full-time object) for this was kind of a luxury
syntax, but the application this was written for reads and writes dates
and times millions of times a day.
(format nil "~/ISO:8601/" [21:19])
=> "1999-02-09 21:19:00.000"
(format nil "~920:/ISO:8601/" [21:19])
=> "21:19"
(let ((*default-timezone* tz:oslo))
(describe (read-from-string "#[22:19]")))
=> #[1999-02-09 22:19:00.000+01:00] is an instance of #<standard-class full-time>:
The following slots have :instance allocation:
sec 3127583940
msec 0
zone -1
format 920
now, floating-point values. I _completely_ fail to see the charm of the
comma as a decimal point, and have never used it. (I remember how
grossly unfair I thought it was to be reprimanded for refusing to succumb
to the ambiguity of the comma in third grade. it was just too stupid to
use the same symbol both inside and between numbers, so I used a dot.)
if you want this abomination, it will be output-only, upon special
request. none of the pervasive default crap that localization in C
uses. e.g., a version of Emacs failed mysteriously on Digital Unix
systems and the maintainers just couldn't figure out why, until the
person in question admitted to having used Digital Unix's fledgling
"localization" support. of course, Emacs Lisp now reads floating point
numbers in the "C" locale to avoid this braindamage. another pet peeve
is that "ls" has a ridiculously stupid format, but what do people do?
instead of getting it right and reversing the New Jersey stupidity,
they just translate it _halfway_ into other cultures. sigh. and since
programs have to deal with the output of other programs, there are some
things you _can't_ just translate without affecting everything. the
result is that people can't use these "localizations" except under
carefully controlled conditions.
| - char-upcase etc. (e.g. Allegro is wrong when German special
|   characters are involved)
well, I have written functions to deal with this, too.
(system-character-set)
=> #<naggum-software::character-set ASCII @ #x20251be2>
(string-upcase "soylent grün ist menschen fleisch!")
=> "SOYLENT GRüN IST MENSCHEN FLEISCH!"
;;;; ^
(setf (system-character-set) ISO:8859-1)
=> #<naggum-software::character-set ISO 8859-1 @ #x20250f52>
(string-upcase "soylent grün ist menschen fleisch!")
=> "SOYLENT GRÜN IST MENSCHEN FLEISCH!"
;;;; ^
note, upcase rules that deal with ß->SS and ÿ->Ÿ are not implemented;
this is still a simple character-to-character translation, so it leaves
these two characters alone.
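for comparison, a minimal sketch of the same character-to-character
idea, hard-coded for ISO 8859-1 only. the function names below are made
up for this sketch (they are not the functions used above), and it
assumes CHAR-CODE returns Latin-1 code points.
(defun latin-1-char-upcase (char)
  ;; codes #xE0-#xFE are the lowercase accented letters, except #xF7,
  ;; the division sign.  #xDF (ß) and #xFF (ÿ) are left alone, as in
  ;; the translation described above.
  (let ((code (char-code char)))
    (cond ((char<= #\a char #\z)
           (char-upcase char))
          ((and (<= #xE0 code #xFE) (/= code #xF7))
           (code-char (- code #x20)))
          (t char))))
(defun latin-1-string-upcase (string)
  (map 'string #'latin-1-char-upcase string))
(latin-1-string-upcase "soylent grün ist menschen fleisch!")
=> "SOYLENT GRÜN IST MENSCHEN FLEISCH!"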
| - Daylight saving time & time zones
Common Lisp is too weak in this respect, and so are most other solutions.
it is wrong to let a time zone be just a number when parsing or decoding
time specifications. it is wrong to allow only one time zone to be fully
supported. I needed to fix this, so time zone data is fetched from the
timezone database on demand, since the time zone names need to be loaded
before they can be referenced.
e.g., tz:berlin is initialized like this:
(define-timezone #"berlin" "Europe/Berlin")
after which tz:berlin is bound to a timezone object:
(describe tz:berlin) ->
#<timezone berlin lazy-loaded from "Europe/Berlin" @ #x202b1fd2> is an instance
of #<standard-class timezone>:
The following slots have :instance allocation:
name timezone:berlin
filename "Europe/Berlin"
zoneinfo <unbound>
reversed <unbound>
using it loads the data automatically:
(get-full-time tz:berlin)
=> #[1999-02-09 22:40:24.517+01:00]
tz:berlin
=> #<timezone berlin loaded from "Europe/Berlin" @ #x202b1fd2>
you can ask for just the timezone of a particular time and zone, and you
get the timezone and the universal-times of the previous and next
changes, so it's possible to know how long a day in local time is
without serious waste. (i.e., it is 23 or 25 hours at the change of
timezone due to the infinitely stupid daylight-savings-time crap, but
people won't switch to UTC, so we have to accommodate them fully.)
(time-zone [1999-07-04 12:00] tz:pacific)
=> 7 t "PDT" 3132208800 3150349200
| Since every serious OS supports localization, Lisp implementations
| should be forced to use it.
I protest vociferously. let's get this incredible mess right. if there
is anything that causes more grief than the mind-bogglingly braindamaged
attempts that, e.g., Microsoft makes at adapting to other cultures, I
don't know what it is, and the Unix world is just tailing behind them,
making the same idiotic mistakes. IBM has done an incredible job in this
area, but they _still_ listen to the wrong people, and don't realize that
there are as many ways to write a date in each language as there are in
the United States, so calling one particular format "Norwegian" is just
plain wrong. forcing one format on all Americans in the silly belief
that they are all alike would perhaps cause sufficient rioting to get
somebody's attention, whereas countries with smaller populations than
some U.S. cities just won't be heard.
e.g., if you want to use the supposedly "standard" Norwegian notation,
that's 9.2.99, but people will want to write 9/2-99 or 9/2 1999, and if
you do this, those who actually have to communicate with people elsewhere
in the world will now be crippled unless they turn _off_ this cultural
braindamage, and revert to whatever choice they get with the default.
computers and programmers should speak English. if you want to talk to
people in your own culture, first consider international standards that
get things right (like ISO 8601 for dates and times), then the smartest
thing you can think of, onwards through to the stupidest thing you can
think of, then perhaps what people have failed to understand is wrong.
you don't have to adapt to anyone -- nobody adapts to you, and adapting
should be a reciprocal thing, so do whatever is right and explain it to
people. 90% of them will accept it. the rest can go write their own
software. force accountants to see four-digit years, force Americans and
the British to see 24-hour clocks, use dot as a decimal point, write
dates and times with all numbers in strictly decreasing unit order, lie
to managers when they ask if they can have the way they learned stuff in
grade school in 1950 and say it's impossible in this day and age.
computers should be instruments of progress. if that isn't OK with some
doofus, give him a keypunch, which is what computers looked like back
when the other things they ask computers to do today were normal. if
people want you to adapt, put them to the test and see if they think
adaptation is any good when it happens to them. if they do, great --
they do what you say. if not, you tell them "neither do I", and force
them to accept your way, anyway. it's that simple.
#:Erik
--
Y2K conversion simplified: Januark, Februark, March, April, Mak, June,
Julk, August, September, October, November, December.
give me a break. Common Lisp has all it needs to move to a smart wide
character set such as Unicode. we even support external character set
codings in the :EXTERNAL-FORMAT argument to stream functions. it's all
there. all the stuff that is needed to handle input and output should
also be properly handled by the environment -- if not, there's no use for
such a feature since you can neither enter nor display nor print Unicode
text.
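something like the following, for instance. note that the
external-format name (:utf-8) is implementation-dependent -- the
standard only guarantees :default -- and the filename is made up.
(with-open-file (out "message.txt" :direction :output
                     :if-exists :supersede
                     :external-format :utf-8)  ; implementation-dependent name
  (write-line "Bonjour, tout le monde!" out))
(with-open-file (in "message.txt" :direction :input
                    :external-format :utf-8)
  (read-line in))
=> "Bonjour, tout le monde!"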
| This isn't going to make users of 8-bit character sets experience
| increased storage overhead for the exact same string objects and a
| performance hit in string bashing functions, now is it?
there are performance reasons to use 16 bits per character over 8 bits in
modern hardware already, but if you need only 8 bits, use BASE-STRING
instead of STRING. it's only a vector, anyway, and Common Lisp can
already handle specialized vectors of various size elements.
if it is important to distinguish between STRING and BASE-STRING, I'm sure a
smart implementation would do the same for strings as the standard does
for floats: *READ-DEFAULT-FLOAT-FORMAT*.
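a sketch of what that separation already looks like with nothing but
standard functions; what BASE-CHAR and CHARACTER upgrade to is up to
the implementation, so the last two values will vary.
(let ((narrow (make-array 5 :element-type 'base-char
                            :initial-contents "abcde"))
      (wide   (make-array 5 :element-type 'character
                            :initial-contents "abcde")))
  (values (typep narrow 'base-string)              ; => T
          (typep wide 'string)                     ; => T
          (upgraded-array-element-type 'base-char)
          (upgraded-array-element-type 'character)))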
| On the upside, unicode support could give an additional excuse for Lisp's
| apparent "slowness" in certain situations. In my Java class the
| instructor seems to always bring up unicode support as part of the excuse
| for Java's lousy performance (hmm... this isn't really comforting for
| some reason though...).
criminy. can teachers be sued for malpractice? if so, go for it.
the first ISO 10646 draft actually had this feature for character sets
that only need 8 bits, complete with "High Octet Prefix", which was
intended as a stateful encoding that would _never_ be useful in memory.
this was a vastly superior coding scheme to UTF-8, which unfortunately
penalizes everybody outside of the United States. I actually think UTF-8
is one of the least intelligent solutions to this problem around: it
thwarts the whole effort of the Unicode Consortium and has already proven
to be a reason why Unicode is not catching on.
instead of this stupid encoding, only a few system libraries need to be
updated to understand the UCS signature, #xFEFF, at the start of strings
or streams. it can even be byte-swapped without loss of information. I
don't think two bytes is a great loss, but the stateless morons in New
Jersey couldn't be bothered to figure something like this out. argh!
when the UCS signature becomes widespread, any string or stream can be
viewed initially as a byte sequence, and upon first access can easily be
inspected for its true nature and the object can then change class into
whatever the appropriate class should be. it might even be byteswapped
if appropriate. this is not at all rocket science. I think the UCS
signature is among the smarter things in Unicode. that #xFFFE is an
invalid code and #xFEFF is a zero-width no-break space are signs of a
brilliant mind at work. I don't know who invented this, but I _do_ know
that UTF-8
is a New Jersey-ism.
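the inspection itself is trivial. a sketch, using plain
(unsigned-byte 8) streams; the function names are made up.
(defun ucs-signature (octet-1 octet-2)
  ;; classify the first two octets of a stream or byte sequence.
  (cond ((and (= octet-1 #xFE) (= octet-2 #xFF)) :ucs-2-big-endian)
        ((and (= octet-1 #xFF) (= octet-2 #xFE)) :ucs-2-little-endian)
        (t :unknown)))                  ; no signature; treat as bytes
(defun next-ucs-2-code (in order)
  ;; read one 16-bit code from the octet stream IN in byte ORDER.
  (let ((b1 (read-byte in))
        (b2 (read-byte in)))
    (if (eq order :ucs-2-big-endian)
        (logior (ash b1 8) b2)
        (logior (ash b2 8) b1))))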
| One issue you bring up that is not covered in the letter is whether speed
| is effected in Lisp by simultaneously supporting BOTH ASCII and Unicode.
there is actually a lot of evidence that UTF-8 slows things down because
it has to be translated, but UTF-16 can be processed faster than ISO
8859-1 on most modern computers because the memory access is simpler with
16-bit units than with 8-bit units. odd addresses are not free.
| It is not quite correct to refer to Unicode as a 16-bit standard.
| Unicode actually uses a 32-bit space. It is one of the more popular
| subsets of Unicode, UCS-2, that happens to fit in 16 bits.
well, Unicode 1.0 was 16 bits, but Unicode is now 16 bits + 20 bits worth
of extended space encoded as 32 bits using 1024 high and 1024 low codes
from the set of 16-bit codes. ISO 10646 is a 31-bit character set
standard without any of this stupid hi-lo cruft.
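the hi-lo arithmetic, for reference; the function names are made up,
but the numbers are the 1024 x 1024 codes described above.
(defun code-point-to-surrogates (code)
  ;; code points above #xFFFF are split into a high surrogate
  ;; (#xD800 + upper 10 bits) and a low surrogate (#xDC00 + lower
  ;; 10 bits) of (code - #x10000).
  (if (<= code #xFFFF)
      (values code nil)
      (let ((c (- code #x10000)))
        (values (+ #xD800 (ldb (byte 10 10) c))
                (+ #xDC00 (ldb (byte 10 0) c))))))
(defun surrogates-to-code-point (high low)
  (+ #x10000 (ash (- high #xD800) 10) (- low #xDC00)))
(code-point-to-surrogates #x10400)
=> 55297 56320                          ; i.e. #xD801 and #xDC00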
your point about the distinction between internal and external formats is
generally lost on people who have never seen the concepts provided by the
READ and WRITE functions in Common Lisp. Lispers are used to dealing
with different internal and external representations, and therefore have
a strong propensity to understand much more complex issues than people
who are likely to argue in favor of writing raw bytes from memory out to
files as a form of "interchange", and who deal with all text as _strings_
and repeatedly maul them with regexps.
my experience is that there's no point in trying to argue with people who
don't understand the concepts of internal and external representation --
if you want to reach them at all, that's where you have to start, but be
prepared for a paradigm shift happening in your audience's brain. (it
has been instructive to see how people suddenly grasp that a date is
always read and written in ISO 8601 format although the machine actually
deals with it as a large integer, the number of seconds since an epoch.
Unix folks who are used to seeing the number _or_ a hacked-up crufty
version of `ctime' output are truly amazed by this.) if you can explain
how and why conflating internal and external representation is bad karma,
you can usually watch people get a serious case of revelation and
their coding style changes there and then. but just criticizing their
choice of an internal-friendly external coding doesn't ring a bell.
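the date example needs nothing but the standard functions, by the way,
although then the time zone is a bare number and there are no full-time
objects; ISO-8601 below is a made-up name for this sketch.
(defun iso-8601 (universal-time &optional (zone 0))
  ;; internal representation: a universal-time integer.
  ;; external representation: ISO 8601 text, produced only at the edges.
  (multiple-value-bind (sec min hour date month year)
      (decode-universal-time universal-time zone)
    (format nil "~4,'0D-~2,'0D-~2,'0D ~2,'0D:~2,'0D:~2,'0D"
            year month date hour min sec)))
(iso-8601 (encode-universal-time 33 16 13 9 2 1999 8) 8)
=> "1999-02-09 13:16:33"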
EMPHASIS ON THE FOLLOWING!
> but just criticizing their
> choice of an internal-friendly external coding doesn't ring a bell.
>
> #:Erik
I think that's good advice. On the grounds that Lispers are used to
distinguishing between internal and external representations for objects
in general, programs, structures, lists, floating point numbers, dates,
etc., I'll repeat something here that people might use in evangelizing,
er, trying to explain things to others:
I/O is much slower than computation!
For example, consider mobile code that you want to distribute around the
internet. With limited bandwidth and modern processors, it's often
faster to send a compressed encoding (e.g., gzipped) of the source code
for some expressive language (like Lisp), and then, on the receiving
end, to uncompress and compile it, than it is to send the "binaries".
(I think
there was a paper on this, with numbers, within the last 2 years
somewhere. Comm. of the ACM?)
There are similar processing wins for dealing with file-systems and just
about anything with moving parts. In some cases, it's even better to do
lots more computation within a program just to avoid making a lot of
system calls to the operating system.
The point is that if you want to make a program fast, do it right. When
taken with Erik's other point about how 16 (or even 32) bit characters
may be faster than 8-bit anyway (due to word access in memory (Erik: I'm
not sure if this stays true with instruction prefetching, etc.)), one
can support Unicode characters within programs, and then do whatever is
needed on I/O.
Don't worry about the extra space WITHIN MEMORY, and use processing
power to get rid of the extra space EXTERNALLY when needed.
> my experience is that there's no point in trying to argue with people who
> don't understand the concepts of internal and external representation --
> if you want to reach them at all, that's where you have to start, but be
> prepared for a paradigm shift happening in your audience's brain. (it
> has been instructive to see how people suddenly grasp that a date is
> always read and written in ISO 8601 format although the machine actually
> deals with it as a large integer, the number of seconds since an epoch.
> Unix folks who are used to seeing the number _or_ a hacked-up crufty
> version of `ctime' output are truly amazed by this.) if you can explain
> how and why conflating internal and external representation is bad karma,
> you can usually watch people get a serious case of revelation and
> their coding style changes there and then. but just criticizing their
> choice of an internal-friendly external coding doesn't ring a
> bell.
just to throw my two cents in
unix does do the internal/external dance in the filesystem. files are
known to the user by a string. files are identified to the system by
inode number. directories are simply maps from string to number.
from the user's perspective, the inode numbers hardly exist at all.
(that the mapping ought to be performed by a hash table rather than an
association list is not relevant to the abstraction at work.) this
pathname abstraction is one of the things that unix does do correctly.
the same principle should be applied to char-sets. you shouldn't care
what bits it takes to represent the letter `A'. it should be taken
out of the user's domain to prevent worrying about it and getting it
wrong. the user generally assumes a constant width character
representation. breaking this invites trouble. worrying about
optimization of space should be delayed.
to support all the languages, i think 16 bit chars sounds like a good
thing. i wouldn't mind having a cpu which could only address 16 bit
bytes (it'd double your address space in one fell swoop) and say
goodbye to 8 bit processing altogether. a few (un)packing
instructions could be thrown in for space saving representations.
after all, how much is text these days anyway? if you doubled the
size of the text files, how much more disk could it possibly use? if
you are worried about it, a second packed 8 bit system could be used.
better yet, a huffman table (for your language or application) could be
applied to small sections, perhaps on (roughly) a line-by-line basis,
resulting in far higher data density than 8 bits per char. this
packing should be, as much as possible, transparent to the user.
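a rough sketch of the per-line idea: compute huffman code lengths for
the characters of one line and see how many bits per character you end
up with. the function names are made up, and storing the table and
actually packing the bits are left out.
(defun huffman-code-lengths (line)
  ;; each node is (weight . characters); merging the two lightest nodes
  ;; adds one bit to the code of every character under them.
  (let ((counts (make-hash-table)))
    (loop for ch across line do (incf (gethash ch counts 0)))
    (let ((nodes (loop for ch being the hash-keys of counts
                       using (hash-value n)
                       collect (cons n (list ch))))
          (lengths (make-hash-table)))
      (loop while (> (length nodes) 1)
            do (setf nodes (sort nodes #'< :key #'car))
               (let ((a (pop nodes))
                     (b (pop nodes)))
                 (dolist (ch (append (cdr a) (cdr b)))
                   (incf (gethash ch lengths 0)))
                 (push (cons (+ (car a) (car b))
                             (append (cdr a) (cdr b)))
                       nodes)))
      lengths)))
(defun bits-per-character (line)
  (let ((lengths (huffman-code-lengths line)))
    (/ (loop for ch across line sum (gethash ch lengths 1))
       (max 1 (length line)))))
for ordinary english text this comes out well under 8 bits per
character, before you even count the cost of storing the table.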
--
johan kullstam
[...]
> to support all the languages, i think 16 bit chars sounds like a good
> thing. i wouldn't mind having a cpu which could only address 16 bit
> bytes (it'd double your address space in one fell swoop) and say
> goodbye to 8 bit processing altogether. a few (un)packing
> instructions could be thrown in for space saving representations.
>
> after all, how much is text these days anyway? if you doubled the
> size of the text files, how much more disk could it possibly use?
Well, since I roughly estimate that over 80% of the files on the
filesystems of my 9GB disk are ASCII encoded source code and
postscript, html and text documentation and configuration files, quite
a bit.
Christopher