A lot of functional languages represent strings as lists rather than
arrays (Haskell also does that, and only recently got bytestrings)
because lists are their basic collection datatype (due to being a
recursive structure and everything), and this allows the use of all
the list-related functions on strings.
Representing strings as arrays of bytes or characters (which is pretty
much also what Java does, by the way) is an attribute of imperative
languages whose basic collection datatype is the array.
My guess is that's the reason why: a lot of string operations were
already implemented on lists (reducing code duplication), and string
efficiency wasn't really of much importance in the Erlang world until
fairly recently, so strings being represented as lists of integers
wasn't much of a problem.
That is the reason. Hysterical Raisins.
There was a time when Erlang didn't have binaries. Someone thought it
would be a good idea to make "ABC" a way to write [65,66,67]. If you
look at the old "eddie" web load balancer project you see the dns
protocol being decoded using lists. The "" syntax for lists is a
pragmatic solution to make code more readable when you need ascii
sequences in your lists.
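You can see that directly in the shell; a string literal and the
corresponding list of integers are the same term:

```erlang
1> "ABC" =:= [65,66,67].
true
2> [$A, $B | "C"].      % character literals consed onto a string literal
"ABC"
3> hd("ABC").           % the "first character" is just an integer
65
```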
I would not ask "Why does erlang internally represent strings as lists?".
Erlang does not have strings. It has a shorthand syntax for creating
lists. If you still consider "ABC" to be a string, then the list is
certainly not an "internal representation". Go ahead and treat it as a
list.
I would ask "Why do some programmers store their large text-masses as lists?"
Of course, I know the answer already; because there is a 'string'
module that operates on lists as strings. Lazy buggers.
Alternative ways to handle larger text-masses:
- binaries (features representation that is 1:1 with the character
encoding itself, now also (R12B) with efficient scanning and
tail-construction)
- iolists (features cheap concatenation of large texts)
- list of words and a word-dictionary (features quicker scanning of
...words, efficient storage too)
It all comes down to what you really are doing with your large texts.
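To make the first two alternatives concrete, a small sketch (the file
name and contents are made up for illustration):

```erlang
Bin = <<"hello world">>,               % binary: one byte per latin-1 character
IoL = [<<"hello">>, $\s, "world"],     % iolist: concatenation without copying
ok = file:write_file("out.txt", IoL),  % files and ports accept iolists as-is
Flat = iolist_to_binary(IoL).          % flatten only when a flat copy is needed
```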
PS.
For the scanning of protocols, I have been looking at Ragel as a tool
to create C-code FSMs as a loadable driver that recognizes tokens and
sends these tokens to the port owner process. The port owner in turn
feeds the port binary chunks, since incremental parsing isn't much of a
problem for state machines.
Of course, I have only reached so far as to teach myself Ragel and
realizing that it is still easy to make mistakes. It would be nice
with a Ragel that produces erlang code.
Has anyone else experimented with Ragel? I know that the angry-ruby-guy
used it for Mongrel; that is how I found out about it.
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions
> I think it all boils down to what you are going to *do* with these
> strings. If you are just going to store them somewhere for later,
> then converting them to a binary definitely saves space. If, however,
> you are going to *work* with them, then having them as lists is
> definitely much better. It is so much easier than having a fixed
> sequence of octets. Also most, if not all, declarative languages,
> functional and logic, have very optimised list handling because
> lists are so practical to work with.
Hold on... lists aren't really a particularly convenient or efficient
data structure for working with strings. First off, I append to
strings a lot more than I prepend to them. Yeah, I could work with
reversed strings, but that's a hack to deal with using the wrong data
type. Plus, I probably prefix match more often than suffix matching
(although this is less lopsided than append vs. prepend). Of course,
I also like to do substring matching and regular expressions quite a
bit, and Boyer-Moore is definitely more efficient with arrays than
lists.
It's probably worth noting that none of the languages which are
considered _good_ for working with strings (AFAIK) use a list
representation.
-kevin
Small correction: UTF-16 and UTF-32 are practically dead, you certainly
need to think in terms of UTF-8 nowadays.
Christian S wrote:
>
> I would ask "Why do some programmers store their large text-masses as
> lists?"
>
> Of course, I know the answer already; because there is a 'string'
> module that operates on lists as strings. Lazy buggers.
>
Still, there is a need for a standard string datatype, which would be
good for 90% of uses, and it should be accepted by all the standard libs.
I represent strings as binaries, and my code becomes much more verbose
(almost unreadable), e.g. using:
* <<"ABC">>, instead of "ABC"
* <<S1/bytes,S2/bytes>> instead of S1++S2
* using file:delete(binary_to_list(Filename)) instead of
file:delete(Filename)
* xmerl and erlsom parse into lists and not binaries (I heard about the
expat port, which can parse binary XML, but I don't know how to extract
its code out of ejabberd).
etc.
Christian S wrote:
>
> - list of words and a word-dictionary (features quicker scanning of
> ...words, efficient storage too)
>
I want to implement something like this, but using atoms for words. Is this
a good idea?
There is a limit on the number of atoms in the VM (I think ~1M). I can
preload lists of atoms-per-word and then use only list_to_existing_atom ...
I'll have around 100,000 words/atoms. Do you think that it's much better
to use ets with integer word IDs mapped to binaries?
Christian S wrote:
>
> For the scanning of protocols, I have been looking at Ragel as a tool
> to create C-code FSMs as a loadable driver that recognizes tokens and
> sends these tokens to the port owner process. The port owner in turn
> feeds the port binary chunks, since incremental parsing isnt much of a
> problem for state machines.
>
How is Ragel better than other lexical analysers? Do you use it primarily
because it parses binary input, while the Erlang lexer works with lists?
BR,
Zvi
--
View this message in context: http://www.nabble.com/Strings-as-Lists-tp15436835p15448906.html
Sent from the Erlang Questions mailing list archive at Nabble.com.
>
> Small correction: UTF-16 and UTF-32 are practically dead, you certainly
> need to think in terms of UTF-8 nowadays.
Only for input and output. Internally, it is much easier to have a list
with one element per Unicode character.
/Bjorn
--
Björn Gustavsson, Erlang/OTP, Ericsson AB
> On Feb 12, 2008, at 2:13 PM, Robert Virding wrote:
>
> Hold on... lists aren't really a particular convenient or efficient
> data structure for working with strings. First off, I append to
> strings a lot more than I prepend to them.
You can append by building a deep list and only flattening it at the end.
NewString = [AListOfChars|AnotherListOfChars]
or
NewString = [AListOfChars,ACharacter]
Or you can simply do a recursion (not tail-recursion) and use
the '++' operator. That will be efficient, because the recursion will
ensure that the '++' operators are executed in a right-to-left order.
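A body-recursive join along those lines might look like this (a sketch;
join/1 is a made-up name):

```erlang
%% The '++' operations are evaluated right-to-left, so every
%% character is copied at most once into the final result.
join([H | T]) -> H ++ join(T);
join([]) -> [].
```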
If you ask me, that is string(). It is good for the majority of uses.
> * <<"ABC">>, instead of "ABC"
Yes, this is a bit annoying to type.
I'm not the biggest fan of the syntactical appearance of binary
comprehensions either.
I have been pondering the use of LEFT and RIGHT-POINTING DOUBLE ANGLE
QUOTATION MARK in latin1 as shorthand for <<"">>. Of course, I realize
that ~99.8% (everyone but me :) of all Erlang users don't want
non-ascii characters in the syntax, even though Erlang source code is
specified to be in latin1.
> * <<S1/bytes,S2/bytes>> instead of S1++S2
[S1,S2] and then do iolist_to_binary/1 if you need it flat.
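That is, concatenation becomes a constant-time cons, and the flattening
cost is paid once, only if a flat binary is actually required:

```erlang
S1 = <<"foo">>, S2 = <<"bar">>,
Joined = [S1, S2],                    % O(1): no bytes are copied
Flat = iolist_to_binary(Joined).      % <<"foobar">>, only when needed
```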
> * using file:delete(binary_to_list(Filename)) instead of
> file:delete(Filename)
Why handle filenames as binaries?
> > - list of words and a word-dictionary (features quicker scanning of
> > ...words, efficient storage too)
> I want to implement something like this, but using atoms for words. Is this
> a good idea?
[snip]
Go with your own dictionary and word ids. Erlang handles small
integers as fixnums so they're as efficient to compare as atoms.
If you have an a-priori known dictionary then you can of course map to atoms.
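A minimal sketch of such a dictionary using two ets tables (module and
function names are made up; single-process use is assumed):

```erlang
-module(word_dict).
-export([new/0, id/2, word/2]).

new() ->
    {ets:new(w2i, [set]), ets:new(i2w, [set])}.

%% Intern a word (a binary), returning its small-integer id.
id({W2I, I2W}, Word) when is_binary(Word) ->
    case ets:lookup(W2I, Word) of
        [{_, Id}] -> Id;
        [] ->
            Id = ets:info(W2I, size) + 1,
            ets:insert(W2I, {Word, Id}),
            ets:insert(I2W, {Id, Word}),
            Id
    end.

%% Map an id back to its word.
word({_W2I, I2W}, Id) ->
    [{_, Word}] = ets:lookup(I2W, Id),
    Word.
```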
> How is Ragel better than other lexical analysers? Do you use it primarily
> because it parses binary input, while the Erlang lexer works with lists?
Leex is very cool and I have been playing with it some. It generates
erlang code which is good. I mostly see it as the solution for parsing
files.
I want non-greedy matching, and also some push down automata support.
Ragel can do this. With it you can parse quoted strings as a single
token, and still have incremental parsing (i.e. in chunks). The fact
that you generate C code from Ragel can also be beneficial for speed,
but is of course risky. The higher-level programming of using Ragel
can hopefully decrease the risk of security problems.
I'm really just experimenting with Ragel as a tool. I still suck at it
after having spent ~6 hours or so with it.
> Robert Virding wrote:
>> I think it all boils down to what you are going to *do* with these
>> strings. If you are just going to store them somewhere for later
>> then converting them to a binary definitely save space. If,
>> however, you are going to *work* with them then having them as
>> lists is definitely much better. It is so much easier than having
>> fixed sequence of octets. Also most, if not all declarative
>> languages functional and logic, have very optimised list handling
>> because lists are so practical to work with.
>> As mentioned in the next mail you can also keep them as iolists
>> while processing to make it efficient to send the strings into the
>> big wide world. This is sort of the best of both worlds.
>> Also having them as lists means you get UTF-16 and 32 for free, and
>> most of your libraries still work straight out of the bag. This,
>> UTF-16/32, I think will become much more important in the future
>> when the number of internet users who don't have a latin charset as
>> their base increases. Think of the influence of a few hundred
>> million indians and chinese who want 32 bit charsets. :-)
>
> Small correction: UTF-16 and UTF-32 are practically dead, you
> certainly
> need to think in terms of UTF-8 nowadays.
>
I need to think in terms of none of these. They're all transformation
formats, in other words exchange formats. An internal representation of
characters on 32 bits means we should get raw unicode codepoints and
be done with it; the codepoints are the universal theoretical "values"
for each character and there is *no reason* to use a UTF or a UCS
format as the internal representation of characters.
You can do that only because ISO-8859-1 is enough for you. Your code will be useless for me, because I could not use it! One char is not equal to one byte, remember that! Lists are the best solution for the whole non-English world, because one list element equals one character. If you want to make monolingual programs, continue with your practice. God save you.
The tokenizer expects latin1 input:
http://erlang.org/doc/reference_manual/introduction.html#1.6
Of course, if you put utf8-encoded data into your strings it will
happily interpret your latin "Ä" as [$Ã, $Ä] or whatever sequence
the utf8-encoded Ä looks like when viewed as latin1. The list
[195,132] is what shows up if I enter it using my utf8 xterm.
Didn't the list have a long thread about io character encodings a
couple of years ago? Or am I mixing it up with the character
encoding issues in common lisp's io system?
There are some issues that show up if you write an alternative lexer
that decodes utf8 into Unicode code points in lists, such as
list_to_binary() expecting a string() type and choking if there is an
integer above 255. This must be what the list had a thread about: the
need for unicode_list_to_utf8_binary/1 and a dozen other target
encodings.
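The choking is easy to demonstrate, and the unicode module that later
OTP releases added (R13) is one answer to the missing conversions:

```erlang
1> list_to_binary([1000]).                % code point above 255
** exception error: bad argument
2> unicode:characters_to_binary([1000]).  % later OTP: explicit conversion
<<207,168>>
```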
And yet we see so few programs written in Swedish. :)
-vVnce
Well you really should take strings out of source code if you need
i18n. The literals one can keep in source code are those for protocol
framing... say the "HTTP 200 OK" reply in http or "EHLO fqdn" in smtp.
Nobody would love lots of case Lang of en -> <<"Hello">>; sv ->
<<"Hej">>; es -> <<"Hola">> end in the code.
Once you go i18n(hello) you can treat everything as utf8 encoded
binaries and just piece it together.
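Something along these lines (a sketch; the messages table and the
fallback behaviour are made up):

```erlang
%% Look up a translated message, stored as a utf8 binary in ets.
i18n(Key, Lang) ->
    case ets:lookup(messages, {Lang, Key}) of
        [{_, Utf8}] -> Utf8;
        [] -> atom_to_list(Key)   % fall back to the key itself
    end.
```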
Oh sure, but with a state-stack one can suddenly parse more than
regular grammars.
Parsing a quoted string as a token from leex is difficult if you know
that the end-quote might not be included in the chunk you just fed
into leex, but the next chunk read from the tcp stream.
With fully recursive grammars I can see how one wants to let yecc
handle it, but a quoted string is not really recursive: you can't have
a quoted string inside a quoted string the same way you can have, say,
an if-expression inside an if-expression inside an if-expression etc
in a programming language.
Leex is a tool I would use when I know I have some file of finite
length and could do a two-pass parsing with yecc as the second
stage; I would not use it for tokenizing SMTP/IRC/NNTP...
I'm looking for a better tool (as in quicker and easier code to
maintain/extend) than writing protocol parsing "by hand".
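For comparison, the hand-written version of the chunk-carrying part is
not hard; it is the token recognition itself that gets tedious. A
sketch using the (later-added) binary module:

```erlang
%% Feed each TCP chunk in together with the leftover from the last call;
%% complete CRLF-terminated lines come out, the tail is carried over, so
%% a token split across two chunks is no problem.
scan(Chunk, Leftover) ->
    split_lines(<<Leftover/binary, Chunk/binary>>, []).

split_lines(Buf, Acc) ->
    case binary:split(Buf, <<"\r\n">>) of
        [Line, Rest] -> split_lines(Rest, [Line | Acc]);
        [_NoCrlf]    -> {lists:reverse(Acc), Buf}   % Buf is the new leftover
    end.
```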
PS.
I reserve the right to be completely mistaken about everything.
There was a time when Erlang didn't have _anything_,
but that's almost surely not the reason.
> Someone thought it
> would be a good idea to make "ABC" a way to write [65,66,67].
Don't forget, Erlang was heavily influenced by Prolog and Strand84.
If memory serves me correctly, the original prototype implementation
was done in Prolog, and the Prolog reader has read "ABC" as [65,66,67]
since the late 1970s at least.
It's not Hysterical Raisins at all: it is simplicity (the preferred
sequence type in Erlang is lists, and strings are just sequences of
characters), power (because any time someone defines a function on
lists you get to use it on strings, and there are *lots* of useful
list functions), and processing efficiency (because working down
one character at a time doesn't require allocating *any* new storage,
not even for slices).
ByteStrings were added to Haskell (actually in two flavours) to support
high volume I/O. One of the reasons they've seen a lot of acceptance is
that various people have gone to a lot of trouble to make them LOOK as
much like lists as possible. Thanks to Haskell's typeclass machinery,
that can be "very much indeed" (see the ListLike package, the use of
which
means that you can continue to write code that works on lists *or*
strings).
The one thing that bytestrings aren't much good for these days, of
course,
is holding *decoded* characters.
It's important to realise that in ANY programming language there is
more than
one way to represent strings. For example, even for character=byte
there are
three different representations supported in C89, and C89 programs
often had
to deal with a fourth. You don't *have* to stick with the one that the
compiler translates "ABC" to.
> Why does erlang internally represent strings as lists? In every
> language I've used other than Java, a string is a sequence of
> octets, just like Erlang's binary type.
Recall the programming proverb:
it is better to have one data type with 100 functions
than 10 data types with 10 functions each.
It is actually quite commonplace in declarative languages (Prolog,
Mercury,
Haskell, some others) to implement strings as lists of characters
because
that way you get a vast number of functions you can usefully apply to
strings.
Not only that, you can process strings *incrementally* if they are
lists.
I never tire of telling this story:
I was on the team that ported Quintus Prolog from the "UNIX" world
to the Xerox Lisp machines (the 1108, 1109, 1185, and 1186, aka
Dandelion, Dandetiger, Daybreak, and something else I forget).
These machines compiled Interlisp to bytecode which was then
executed by microcode, and gave very respectable performance for
Lisp, in their day. The existing microcode supported Interlisp
strings, which were byte vectors as you describe.
There wasn't much microcode space left to support Quintus Prolog,
so we had a microcoded WAM with plain old 2 word list cells and
strings represented by 1 list cell per character.
A series of benchmarks I did showed string processing going
FIVE TIMES FASTER in Prolog using lists than in Lisp using byte
vectors, largely because we could represent "the rest of a string"
using NO new allocations whatever.
The Java string representation is (slice of (array of uint16_t))
which means
that "the rest of a string" costs only O(1) time and space, but O(1)
extra
space isn't O(0) extra space. I would expect most Javascript
implementations
to use something similar to this.
The guiding rule is
- if you just want to hold onto a string for a while, use a binary
- if you want to build or process a string, use a list (possibly in
Erlang a deep list).
- if you want to represent something that has structure, and you want
your program to be aware of that structure, turn it into a
structured
data value and work with it in that form.
Basically, most languages get strings embarrassingly wrong. Another
story
I like to tell is how I was able to do in 10 pages of C what a friend of
mine needed 150 pages of PL/I to accomplish, largely because PL/I *does*
have a "real" string data type and C doesn't, so I could accomplish what
I needed very simply, while he had to fight the language every step
of the
way.
For some people they are.
> First off, I append to
> strings a lot more than I prepend to them.
I take this to mean that you do stuff like
S0 = ~0~,
S1 = S0 ++ ~1~,
...
Sn = Sn_1 ++ ~n~
perhaps in a loop. Right, this is not efficient. But it is
spectacularly
inefficient in programming languages with more conventional
representations.
It is O(n**2). For example,
x = ""
for (i = 1; i <= 100000; i++) x = x "a"
just took 30.5 seconds in awk on my machine, 62.2 seconds in Perl,
and a massive
631 seconds in Java. That was using gcj; I lost patience with the
Sun SDK and
killed it. (AWK faster than Java? Yes, it often is.) Building the
same string
in Erlang using
loop(100000, "")
where
loop(0, S) -> lists:reverse(S);
loop(N, S) -> loop(N-1, "a"++S).
takes 0.15 second on the same machine.
> Yeah, I could work with
> reversed strings, but that's a hack to deal with using the wrong data
> type.
No, it has to do with the fact that appending on the right is O(n)
whether one uses lists or arrays. Arguably, you should be using the
right data type THE RIGHT WAY. Can you provide an example of your code?
Alternatively, maybe iolists would suit you better:
loop(0, S) -> lists:flatten(S);
loop(N, S) -> loop(N-1, [S|"a"]).
also takes 0.15 second, and is appending on the right. It's quite a
general
principle that the data structure or traversal that is convenient for
building
something up isn't necessarily the same as the data structure or
traversal that
is convenient for processing it afterwards.
> Plus, I probably prefix match more often than suffix matching
> (although this is less lopsided than append vs. prepend). Of course,
> I also like to do substring matching and regular expressions quite a
> bit, and Boyer-Moore is definitely more efficient with arrays than
> lists.
The Boyer-Moore algorithm requires space proportional to the alphabet
size.
The Unicode alphabet size is enormous (last time I looked there were
nearly
100 000 characters defined). Unicode substring matching is
definitely nasty,
given that, for example, you would like (e,floating acute) to match
é, and
the Boyer-Moore algorithm assumes a unique encoding of any given string,
which Unicode does not even begin to have. Then there are fun things
like
U+0028 is the code for the "(" character, but if I see a "(" and go
looking
for it, I might have to look for a U+0029 instead, and there really
doesn't
seem to be any way of dealing with that without processing a bunch of
control characters. (I know I am talking about appearance here, not
semantics,
but sometimes one wants to search based on appearance.)
>
> It's probably worth noting that none of the languages which are
> considered _good_ for working with strings (AFAIK) use a list
> representation.
Not really. Perl is normally "considered _good_ for working with
strings",
but it is pretty much hopeless at building strings by concatenation.
(Yes, I know about StringBuffer and StringBuilder, but they are
DIFFERENT from String.) The only language I know that's really good at it is
Smalltalk, where you can do
result := String writeStream: [:s |
"any code you want, writing to the character stream s"].
and the internal representation of the intermediate result is none of
your
business.
There are also libraries from the functional programming world that
give you efficient incremental update of large strings, but they are
fairly heavyweight.
I think you are confusing a datatype with its implementation/representation.
My biggest problem with Erlang's standard string representation is that in
64-bit mode, each character takes 16 bytes. So a typical 10KB message,
after being converted to XML, becomes 20KB, and after being parsed by a
64-bit Erlang VM becomes 320KB. I am talking about server-side code, so I
need 320KB per client. Using binaries, even if I copy this XML 3 times in
memory, I can still handle 5-6 times more clients.
Also, you forget that what was good for LISP Machines is not good for
current machine architectures. Binary strings can be handled very
efficiently using vectorized SIMD code, and they use modern caches much
more effectively than lists. Deep lists and iolists are essentially
analogous to Java's StringBuffer or C++ STL's ostringstream. iolists can
also be more efficient for IO, if the underlying OS has a gather-write
system call.
So in general I want an immutable String ADT. The ADT implementation
should be smart enough to switch to the best representation according to
usage, similar to some languages which implement an associative-array ADT
using various data structures, i.e. property lists for small arrays and
hashtables for large ones. Or I could give hints about which
implementation of the String ADT I want to use (using, for example,
parametrized modules), i.e. string(deepList):concat(S1,S2) or
string(binarySlice):substr(S,1,3).
BTW: the most sophisticated string implementation I know about was in
SNOBOL. I think it was a list of unique substrings.
Zvi
Best regards,
Kirill
A minor correction: appending at the end of an array (std::vector<>) is
an amortized O(1) operation in C++. The trick is that it allocates a bit
more space than is actually required. If the vector gets full, it
allocates a block twice as big and copies the array into it. In this way,
the average number of times each element is copied is at most 2. Of
course, it requires 1.5 times more memory on average.
> For example,
> x = ""
> for (i = 1; i <= 100000; i++) x = x "a"
> just took 30.5 seconds in awk on my machine, 62.2 seconds in Perl,
> and a massive
> 631 seconds in Java.
All of the C++ counterparts require unmeasurably little time (< 4ms) on
my laptop:
std::vector<char> x;
for (int i = 1; i <= 100000; i++) x.push_back('a');
std::string x;
for (int i = 1; i <= 100000; i++) x.push_back('a');
std::string x;
for (int i = 1; i <= 100000; i++) x+='a';
std::string x;
for (int i = 1; i <= 100000; i++) x+="a";
Regards,
Alpar
Yep, that's a widely known fact, so it's not surprising, and it's why
people usually suggest using the equivalent of IOLists in imperative
languages with immutable array-based strings. In the case of Java, a
StringBuilder or StringBuffer (I just did the test on my 2GHz macbook,
timing with `time`: using a StringBuilder for that loop yields 0.314s,
using regular strings... 243s).
Erlang currently sucks for working with Unicode, and as a consequence,
sucks for working with strings.
This isn't a fault of the language, just the lack of libraries.
Pretending that lists with a bit of DIY are good enough doesn't help.
Yeah, you can load text in any Unicode encoding into an Erlang list
with no problems... but there's much more to supporting Unicode than
that.
For example, say you've got the string "привет" (which is
Russian for "hi") encoded in UTF-8 in list L:
L = [208, 191, 209, 128, 208, 184, 208, 178, 208, 181, 209, 130]
Now say you want to convert it to uppercase. Well, you can't.
string:to_upper() won't work, as the only encoding it's aware of is
ISO Latin-1.
As soon as you've got text in anything other than ISO Latin-1, the
arguments about niceties of being able to do maps/folds/
comprehensions on lists pretending to be strings become void. You
can't reliably iterate over each character in a UTF-8 or UTF-16
string in a plain list, because they are variable-width encodings.
Neither could you do it even if your strings were in UTF-32, because
they may have composed characters, and you'd have to normalize the
string first... and then you're well on your way to re-implementing
Unicode in Erlang yourself. Good luck.
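For later readers: the failure mode is easy to reproduce, and OTP 20+
eventually shipped Unicode-aware string functions:

```erlang
1> string:to_upper("привет").   % Latin-1 only: code points above 255 pass through
"привет"
2> string:uppercase("привет").  % OTP 20+ string module is Unicode-aware
"ПРИВЕТ"
```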
Anyway, I've been working on an Erlang Unicode string library based
on ICU (http://www.icu-project.org/) for the past week. It's coming
along nicely, and I'll release an alpha version in another week or so.
Erlang is a great language and platform, and non-existent Unicode
support is probably the biggest drawback it has. I hope we'll get it
fixed soon.
You should apply a fun "to_upper" to that list. Since the OTP libraries
do not have to_upper defined for Cyrillic, you need to write that
yourself.
When storing your Unicode strings, it is a good idea to convert them to
utf-8 and then to a binary. Storing this binary is cheaper than storing
the Unicode list. Lists in Erlang consume a lot of space.
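The difference is easy to measure (a sketch; note that erts_debug is an
internal, undocumented module):

```erlang
L = lists:duplicate(1000, $a),
B = list_to_binary(L),
%% list: 2 words per character -> 16000 bytes on a 64-bit VM
ListBytes = erts_debug:size(L) * erlang:system_info(wordsize),
%% binary: 1 byte per character (plus a small header)
BinBytes = byte_size(B).
```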
/Peter
Hasan Veldstra wrote:
> Erlang currently sucks for working with Unicode, and as a
> consequence, sucks for working with strings.
>
> This isn't a fault of the language, just the lack of libraries.
<...>
> As soon as you've got text in anything other than ISO Latin-1, the
> arguments about niceties of being able to do maps/folds/
> comprehensions on lists pretending to be strings become void. You
> can't reliably iterate over each character in a UTF-8 or UTF-16
> string in a plain list, because they are variable-width encodings.
> Neither could you do it even if your strings were in UTF-32, because
> they may have composed characters, and you'd have to normalize the
> string first... and then you're well on your way to re-implementing
> Unicode in Erlang yourself. Good luck.
I have run into this brick wall as well.
> Anyway, I've been working on an Erlang Unicode string library based
> on ICU (http://www.icu-project.org/) for the past week. It's coming
> along nicely, and I'll release an alpha version in another week or so.
Excellent!
> Erlang is a great language and platform, and non-existent Unicode
> support is probably the biggest drawback it has. I hope we'll get it
> fixed soon.
Why don't you open a project on Google Code so other folks can chip in?
I, for one, would also like to see this capability added to Erlang.
Sincerely,
[X || X <- [47,66,111,98,32,59,41]].
Sean
On Tue, 12 Feb 2008 18:00:30 +0100, you said:
RC> Strings as lists is simple and flexible (i.e., if you already have lists,
RC> you don't need to add another data type). Functions that work on lists,
RC> such as append, reverse, etc., can be used directly on strings; you don't
RC> need to program in different styles if you're traversing a list or a
RC> string; etc. Other languages that represent strings as lists include
RC> Prolog (which was a big influence on Erlang) and Haskell. That said, in
RC> larger systems it is better to represent string constants in a more
RC> space-efficient way. Binaries let you do this in Erlang, but they were a
RC> later addition to the language, and the syntax for constructing and
RC> decomposing binaries came even later.
Some time ago I made a VList[1] implementation and altered it to handle
lists of bytes efficiently. If the list construction and destruction
operations (cons, hd, tl) are replaced transparently by the compiler with
VList ones, then you get:
Pros:
* No need to rewrite programs and add new syntax, just continue to use
lists.
* Lists of bytes consume approximately from n to 2n bytes.
* Lists containing other values consume approximately from n to 2n words.
* length/1, lists:nth/2, and lists:nthtail/2 work in O(log n) time.
* Binaries can use the same representation (this way it is efficient to
add bytes to the beginning while in the current implementation it is
efficient to add them to the end of a binary).
Cons:
* List destruction allocates memory (but can sometimes be optimized by
the compiler).
* Cons, hd and tl are slower than in the traditional list implementation,
though a program can work faster due to less memory usage and less GC load.
* Small lists consume more memory.
Denis
> -----Original Message-----
> From: erlang-quest...@erlang.org [mailto:erlang-questions-
> bou...@erlang.org] On behalf of Richard Carlsson
> Sent: Tuesday, 12 February 2008 12:01
> To: tsuraan
> Cc: Erlang Questions
> Subject: Re: [erlang-questions] Strings as Lists
>
> tsuraan wrote:
> > Why does erlang internally represent strings as lists? In every
> > language I've used other than Java, a string is a sequence of octets,
> > just like Erlang's binary type. I know that you can represent a string
> > efficiently by using <<"string">> rather than just "string", but why
> > doesn't erlang do this by default? Is it just because pre-12B binary
> > handling wasn't as efficient as list handling, or is Erlang intended to
> > support UTF-32?
>
> Strings as lists is simple and flexible (i.e., if you already have lists,
> you don't need to add another data type). Functions that work on lists,
> such as append, reverse, etc., can be used directly on strings; you
> don't need to program in different styles if you're traversing a list
> or a string; etc. Other languages that represent strings as lists include
> Prolog (which was a big influence on Erlang) and Haskell. That said, in
> larger systems it is better to represent string constants in a more
> space-efficient way. Binaries let you do this in Erlang, but they were
> a later addition to the language, and the syntax for constructing and
> decomposing binaries came even later.
>
> /Richard
On the contrary. Remember the topic of this thread? Someone complained
that Erlang uses lists for strings (as its default implementation). I have
explained why this is a perfectly reasonable, indeed seriously good, way to
go for *some* uses of strings, but I have *also* explicitly made the point
that There Is More Than One Way To Do It, and that using the "default"
representation in ANY programming language is not always a good idea.
> My biggest problem with Erlang's standard string representation is that
> in 64-bit mode each character takes 16 bytes.
Now *you* are focussing on a particular implementation.
Erlang isn't standing behind your chair pointing a gun at your head and
saying "you MUST use lists for text or I will kill you"!
Given that Erlang processes in a single program may easily be scattered
across any number of machines, it seems to me that if you were seriously
worried about storage, you would be worried about ALL storage, not just
strings, and would be running many 32-bit OS-level processes on a machine
in order to halve your storage requirements. As long as Erlang processes
that communicated a lot tended to be in the same OS process, this should
work. On today's multicore machines, this is (or should be) an attractive
approach.
Use a tuple, and you will still be taking 8 bytes per Unicode character
when 3 will suffice. By packing 3 Unicode characters to an integer,
and using unrolled strict lists instead of plain lists (I have a first
draft of an ursl module with a subset of the lists: operations, if anyone
wants it) you can get 3*4 characters in 6*8 bytes, or 4 bytes each.
That's a factor of four. And it's easy enough to write a preprocessor
to convert string literals (possibly flagged in some way) to that form.
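A hedged sketch of what such packing might look like (pack3/unpack3 are
hypothetical helpers, not the ursl draft mentioned above; 21 bits per
code point covers the full Unicode range up to 16#10FFFF):

```erlang
%% Pack 3 code points per integer, 21 bits each. A trailing chunk of
%% 1-2 characters is left unpacked; unpack3/1 relies on a packed triple
%% with a non-zero first character exceeding 16#10FFFF, so strings
%% containing NUL are not handled by this sketch.
pack3([A, B, C | T]) ->
    [(((A bsl 21) bor B) bsl 21) bor C | pack3(T)];
pack3(Short) ->
    Short.

unpack3([N | T]) when N > 16#10FFFF ->
    [(N bsr 42) band 16#1FFFFF,
     (N bsr 21) band 16#1FFFFF,
     N band 16#1FFFFF | unpack3(T)];
unpack3(L) ->
    L.
</imports>
```

Unrolling the packed integers into list cells of several elements each is
what then brings the per-character overhead down further.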
More importantly, holding quantities of text in memory that are big enough
for the space required to be a serious problem is Not The Erlang Way. The
idea is to stream *chunks* of text through the system.
> So typical message of 10KB after converted to XML become 20KB
Why are you converting to XML?
I know XML is bulky, but a doubling in size like that suggests a poorly
chosen XML schema. Why are you not converting to *compressed* XML?
> and after parsed by 64bit Erlang VM become 320KB.
But why do that? Lists of characters are, and are intended to be,
useful
for *processing* (chunks of) text in Erlang. If someone is handing you
globs of stuff that are you simply going to squirt down a wire, then a
binary is exactly what you want.
> Also, you forget that what was good for LISP Machines is not good for
> current machine architectures.
Please, do not make a habit of assuming your correspondent is a moron.
You appear to have missed the point, which is that what people THOUGHT
was good for Lisp machines WASN'T good for Lisp machines. The only real
way to KNOW what is good or not is to implement and MEASURE. You had
better believe that on a machine with a 16-bit bus, where we had a total
of 4MB address space available for Prolog's use, people complained bitterly
about the *obvious* inefficiency of using 8 bytes per character instead of
1. With respect to time, they were simply wrong. With respect to space,
the answer was "program so it doesn't matter". We didn't go around
converting bitmaps to lists!
> Binary strings can be handled very efficiently using vectorized SIMD
> code and they use modern caches much more effectively than lists.
Um. There are so many presuppositions in there that it's really not
practical to address them all in a single message. First, the number of
things that can be done efficiently to *Unicode* using Vis/MMX/AltiVec
instructions is rather limited. Second, what it's mostly limited to is
things you should not be doing.
But the big thing is that for many text processing tasks *both* lists *and*
binaries are spectacularly inefficient because of the limits they put on
sharing. For example, my XML processing kit in C cannot and will not
*change* a document representation, so for most of the transformations I
do, the transformed XML data structure shares a large proportion of its
memory with the input. You can't do that with a byte-vector string, and
SIMD instructions don't help with it. DAGs are wonderful!
>
> Deep lists and io-lists are essentially an analog of Java's StringBuffer
> or C++ STL's ostringstream.
Not "essentially". They are *essentially* trees that offer O(1)
concatenation and shared substructure. StringBuilder (I do hope your Java
code doesn't use StringBuffer any more) doesn't offer either of those.
> So in general I want an immutable String ADT. The ADT implementation
> should be smart enough to switch to the best representation, according
> to usage.
Who was the famous computer scientist who said
"If anyone says to you, 'I want a programming language in which
I just say what I want', give him a lollipop."
You want a data structure which can't change (it's immutable) but does
change (it switches to another representation)?
We've been there. Anyone else in this mailing list remember SETL?
Anyone remember the hopes for automatic selection of representation?
As I recall it, the optimiser got so big that they were never able to
run it over the compiler. (And if anyone DOES remember SETL and knows
how I can get in touch with David Bacon, I'd be grateful. And if you have
a copy of On Programming that you don't want, I could give it a loving
home.)
>
> BTW: the most sophisticated string implementation I know about was in
> SNOBOL. I think it was a list of unique substrings.
The Bell Labs SNOBOL4 system and the SPITBOL system used different string
representations. The main problem for SNOBOL was trying to support
computer hardware that didn't have byte addressing.
Parsing a quoted string as a token from leex is difficult if you know
that the end-quote might not be included in the chunk you just fed
into leex, but in the next chunk read from the TCP stream.
With fully recursive grammars I can see how one wants to let yecc
handle it, but a quoted string is not really recursive: you can't have
a quoted string inside a quoted string the same way you can have, say,
an if-expression inside an if-expression inside an if-expression etc.
in a programming language.
Leex is a tool I would use when I know I have some file of finite
length and I could do a two-pass parse with yecc as the second
stage; I would not use it for tokenizing SMTP/IRC/NNTP...
Since this is the *Erlang* mailing list, this should have been read as
"whether one uses (immutable) lists or (immutable) arrays".
On 14 Feb 2008, at 8:37 pm, Alpár Jüttner wrote:
> A minor correction: appending at the end of an array (std::vector<>)
> is an O(1) operation in C++.
std::vector<> is not a class of immutable arrays.
Suppose we have two byte-strings of length n and m.
If they are immutable, the concatenation cost is O(n+m).
If you are willing to smash the one on the left, and it is "stretchy",
the cost is STILL not O(1), it is O(m).
The technique Alpár Jüttner discussed is, or should be, in chapter 1 or 2
of any good data structures and algorithms book.
>
> All of these C++ counterparts require unmeasurably low time (< 4ms) on
> my laptop:
And they all do it by doing something wildly irrelevant to Erlang:
SMASHING a mutable data structure.
Last year, on the same general topic, someone mentioned Phil Bagwell's
VList data structure. Annoyingly, his paper doesn't really analyse the
asymptotic performance of most operations. His comparison tables give
specific times on specific machines for specific problems, not general
formulas. He says that they could be made to support additions at either
end but doesn't go into detail. Anyway, that could be quite a good
structure for adding small amounts of text at either end, but I don't
think it would handle general concatenation well. There *are* sequence
data structures in the functional community that handle concatenation
well.
This is what you should have in your list:
1> Text = [16#442, 16#435, 16#43a, 16#441, 16#442].
[1090,1077,1082,1089,1090]
You can convert it to utf8 for output
2> xmerl_ucs:to_utf8(Text).
[209,130,208,181,208,186,209,129,209,130]
And you can reverse it and convert that to utf8.
3> xmerl_ucs:to_utf8(lists:reverse(Text)).
[209,130,209,129,208,186,208,181,209,130]
> On 13 Feb 2008, at 12:41 pm, Kevin Scaldeferri wrote:
>> Hold on... lists aren't really a particularly convenient or efficient
>> data structure for working with strings.
>
> For some people they are.
>
>> First off, I append to
>> strings a lot more than I prepend to them.
>
> I take this to mean that you do stuff like
>
> S0 = "0",
> S1 = S0 ++ "1",
> ...
> Sn = Sn_1 ++ "n"
No, certainly not one character at a time. But I frequently build up
an HTML document or some report piece by piece.
>
>
> perhaps in a loop. Right, this is not efficient. But it is spectacularly
> inefficient in programming languages with more conventional
> representations. It is O(n**2).
No, in a decent string implementation it is O(n).
> For example,
> x = ""
> for (i = 1; i <= 100000; i++) x = x "a"
> just took 30.5 seconds in awk on my machine, 62.2 seconds in Perl, and
> a massive 631 seconds in Java.
You either have a very old perl, a very slow machine, or have
implemented it quite badly:
% time perl -le 'my $x = ""; $x .= "a" for 1..100_000; print length $x'
100000
perl -le '...' 0.02s user 0.00s system 89% cpu 0.024 total
even faster (but even less realistic, and not really the same benchmark):
% time perl -le 'my $x = "a" x 100_000; print length $x'
100000
perl -le '...' 0.00s user 0.00s system 85% cpu 0.006 total
>
>> Yeah, I could work with
>> reversed strings, but that's a hack to deal with using the wrong data
>> type.
>
> No, it has to do with the fact that appending on the right is O(n)
> whether one uses lists or arrays. Arguably, you should be using the
> right data type THE RIGHT WAY. Can you provide an example of your
> code?
Someone else explained how C++ strings/vectors handle this. Perl uses
essentially the same strategy.
>
>
>> Plus, I probably prefix match more often than suffix matching
>> (although this is less lopsided than append vs. prepend). Of course,
>> I also like to do substring matching and regular expressions quite a
>> bit, and Boyer-Moore is definitely more efficient with arrays than
>> lists.
>
> The Boyer-Moore algorithm requires space proportional to the alphabet
> size. The Unicode alphabet size is enormous (last time I looked there
> were nearly 100 000 characters defined).
If you use UTF-8 encoded strings, you can use Boyer-Moore unmodified
(matching at the level of bytes, rather than characters). For a fixed
width encoding, you can also work with bytes, but you'll have to take
the character width into account when considering how to slide.
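For illustration: later OTP releases ship a binary module whose
binary:match/2 does exactly this kind of byte-level searching (it uses
Boyer-Moore for a single pattern). A sketch, where find/2 is a made-up
wrapper and xmerl_ucs does the UTF-8 encoding:

```erlang
%% Byte-level substring search on UTF-8 text. UTF-8 is self-synchronizing,
%% so a byte-level match of a complete pattern can only start at a
%% character boundary; no decoding is needed during the search.
find(PatternChars, SubjectChars) ->
    Pat  = list_to_binary(xmerl_ucs:to_utf8(PatternChars)),
    Subj = list_to_binary(xmerl_ucs:to_utf8(SubjectChars)),
    binary:match(Subj, Pat).   % {ByteOffset, ByteLength} | nomatch
```

Note the returned offset and length are in bytes, not characters, which is
usually fine if the result is only used for slicing the same binary.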
> Unicode substring matching is definitely nasty, given that, for example,
> you would like (e, floating acute) to match é, and the Boyer-Moore
> algorithm assumes a unique encoding of any given string, which Unicode
> does not even begin to have. Then there are fun things like: U+0028 is
> the code for the "(" character, but if I see a "(" and go looking for
> it, I might have to look for a U+0029 instead,
Indeed, Unicode is really a PITA like this, but I don't think a list
representation makes this magically go away, either. In fact, in your
example with combining characters, you need to be able to backtrack,
which is more awkward and less performant on lists than arrays. But,
if you disagree, I think this would be a very interesting sort of
benchmark to do. Currently, Erlang appears to perform quite awfully
on anything to do with strings:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=hipe&lang2=perl
Now, most of these conflate string handling and file I/O, so it could
be that file I/O is more of the problem. Or maybe they are badly
implemented. Since you (and many on this list) are highly concerned
with Unicode string handling, I think it would be quite interesting to
design a benchmark for that. Perhaps something that did substring (or
regexp) matching on a large number of candidate strings. Perhaps
Erlang would do better. Certainly I'd expect it to be much less code
than C/C++ (although that's practically a given). I'd be very
interested to compare to Perl's Unicode strings, or Java.
>>
>> It's probably worth noting that none of the languages which are
>> considered _good_ for working with strings (AFAIK) use a list
>> representation.
>
> Not really. Perl is normally "considered _good_ for working with
> strings", but it is pretty much hopeless at building strings by
> concatenation.
I hope I've demonstrated above that this claim is false.
-kevin
That's because the second line is currently not a legal Erlang program.
The tokenizer will assume that your source code is encoded using Latin-1,
and since you are giving the compiler garbage input, it gives you garbage
output. Basically, the compiler thinks that you wrote "Ñ\202екÑ\201Ñ\202",
not "текст", and the reverse of that is indeed "\202Ñ\201ѺеÐ\202Ñ",
which is what you got (regardless of what you _wanted_).
What Erlang needs to support non Latin-1 languages, is filters for decoding
input and encoding output. (Right now, you have to write the conversion
functions yourself if you want to work with Russian text.) The internal
string representation - lists of integers using one integer per code
point - needs no modification, whether it's ASCII, Latin-1, or Unicode;
what I said before applies equally well to all of them. Multibyte encodings
are not practical for general string manipulations regardless of how they
are stored in memory.
/Richard
> What Erlang needs to support non Latin-1 languages, is filters for decoding
> input and encoding output. (Right now, you have to write the conversion
> functions yourself if you want to work with Russian text.) The internal
> string representation - lists of integers using one integer per code
> point - needs no modification, whether it's ASCII, Latin-1, or Unicode;
> what I said before applies equally well to all of them. Multibyte encodings
> are not practical for general string manipulations regardless of how they
> are stored in memory.
I can confirm that it is possible to use lists of Unicode characters, and
quite easy too. In Wings 3D, I have implemented my own limited support for
Unicode.
1. To translate lists of UTF-8 bytes to lists of Unicode characters,
there is the function wings_util:expand_utf8/1. (Wings keeps all text strings
for other languages than English in text files, which are read as needed.
If you want to have Russian text in strings in the actual source code files,
you could write a simple parse transform to handle the translation.)
2. As a simple replacement for io:format/2, there is
wings_util:format/2 which doesn't have all the functionality of
io:format/2, but allows arguments to ~s to be lists containing Unicode
characters.
3. For output, Wings has its own fonts and font handling (meaning that Wings
has no need to translate back to UTF-8 on output).
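As an illustration of point 1 (a from-scratch sketch in the spirit of
wings_util:expand_utf8/1, not the actual Wings 3D code), a UTF-8 expander
might look like this; it does no error handling for malformed input:

```erlang
%% Turn a list of UTF-8 bytes into a list of Unicode code points.
%% Sequence lengths are recognized from the leading byte's high bits.
expand_utf8([B | T]) when B < 16#80 ->
    [B | expand_utf8(T)];
expand_utf8([B1, B2 | T]) when B1 band 16#E0 =:= 16#C0 ->
    [((B1 band 16#1F) bsl 6) bor (B2 band 16#3F) | expand_utf8(T)];
expand_utf8([B1, B2, B3 | T]) when B1 band 16#F0 =:= 16#E0 ->
    [((B1 band 16#0F) bsl 12) bor ((B2 band 16#3F) bsl 6)
         bor (B3 band 16#3F) | expand_utf8(T)];
expand_utf8([B1, B2, B3, B4 | T]) when B1 band 16#F8 =:= 16#F0 ->
    [((B1 band 16#07) bsl 18) bor ((B2 band 16#3F) bsl 12)
         bor ((B3 band 16#3F) bsl 6) bor (B4 band 16#3F) | expand_utf8(T)];
expand_utf8([]) ->
    [].
```

A production version would also validate continuation bytes and reject
overlong encodings.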
/Bjorn
--
Björn Gustavsson, Erlang/OTP, Ericsson AB
Yep. How extensive would the changes be to have a configurable
tokenizer? Something like Python, where you can specify the encoding of
your source code if you want something other than the default (which, in
Python, is ASCII)?
This would not work on a string with combining characters, e.g. ü
represented as u followed by ¨, or a CJKV ideograph.
A lot of glyphs *cannot* be represented by a single Unicode codepoint.
Plain lists or binaries are good enough in two cases:
1. You don't need to support anything other than ISO Latin-1 (i.e.
Western European languages).
2. You don't need to do much with the Unicode text apart from simply
storing it and spitting it back to the user as-is.
For any other case, what Erlang/OTP offers now is subpar compared to
other modern languages / platforms.
Implementing Unicode from scratch is nasty, and the DIY attitude is
unproductive and dangerous. There needs to be a standard library,
used and tested by everyone.
As I already mentioned in this thread, I'm working on such a library,
and will release an alpha version soon.
Cheers,
DBM
> I wrote:
>>> No, it has to do with the fact that appending on the right is O(n)
>>> whether one uses lists or arrays.
>
> Since this is the *Erlang* mailing list, this should have been read as
> "whether one uses (immutable) lists or (immutable) arrays".
>
> On 14 Feb 2008, at 8:37 pm, Alpár Jüttner wrote:
>> A minor correction: appending at the end of an array (std::vector<>)
>> is an O(1) operation in C++.
>
> std::vector<> is not a class of immutable arrays.
> Suppose we have two byte-strings of length n and m.
> If they are immutable, the concatenation cost is O(n+m).
> If you are willing to smash the one on the left, and it is "stretchy",
> the cost is STILL not O(1), it is O(m).
Pardon my ignorance, but how is it that concatenating to the end of a
length-n immutable list is not O(n)? Isn't it necessary to copy all
the list elements?
(It's also worth noting that while strictly speaking the append on a
mutable array is O(m), in practice the coefficient is very small since
it's implemented as a single memcpy.)
-kevin
>
> This would not work on a string with combining characters, e.g. ü
> represented as u followed by ¨, or a CJKV ideograph.
>
> A lot of glyphs *cannot* be represented by a single Unicode codepoint.
>
> Plain lists or binaries are good enough in two cases:
> 1. You don't need to support anything other than ISO Latin-1 (i.e.
> Western European languages).
> 2. You don't need to do much with the Unicode text apart from simply
> storing it and spitting it back to the user as-is.
How about this:
A string is a list of characters.
A character is one or more Unicode code points. A single code point can be
represented by an integer. Multiple code points can be represented by a
tuple. A list wouldn't be good as flatten would then destroy this structure.
Utility functions convert between UTF8 in binaries and this structure.
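A hedged sketch of this proposal (is_combining/1 here only knows the
Combining Diacritical Marks block, U+0300-U+036F; a real implementation
would consult the full Unicode character database):

```erlang
%% Crude stand-in for a Unicode property lookup.
is_combining(CP) -> CP >= 16#300 andalso CP =< 16#36F.

%% Group a flat code-point list into "characters": a lone code point
%% stays an integer, a base plus combining marks becomes a tuple.
to_chars([Base | T]) ->
    {Marks, Rest} = lists:splitwith(fun is_combining/1, T),
    case Marks of
        [] -> [Base | to_chars(Rest)];
        _  -> [list_to_tuple([Base | Marks]) | to_chars(Rest)]
    end;
to_chars([]) -> [].
```

With this structure, lists:reverse/1 no longer separates a "u" from its
following diaeresis, since the pair travels as one tuple.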
--
Anthony Shipman Mamas don't let your babies
a...@iinet.net.au grow up to be outsourced.
When you need to do a length-preserving manipulation on a string,
you need reverse. For example, to_upper() can be implemented as
to_upper(L) -> lists:foldr(fun(C, A) -> [char_upper(C) | A] end, [], L).
or
to_upper(L) -> lists:map(fun char_upper/1, L).
both of which have an O(n) space requirement, or via
to_upper(L) -> lists:reverse(to_upper(L, [])).
to_upper([], Acc) -> Acc;
to_upper([C|T], Acc) -> to_upper(T, [char_upper(C)|Acc]).
which is also O(n) in space.
Both are O(n) in time, though with different coefficients.
--
Lev Walkin
v...@lionet.info
My question was in regards to reversing strings, not lists of characters.
Specifically, Hasan Veldstra's complaint that representing strings as lists
doesn't work when you use lists:reverse to reverse them:
> This would not work on a string with combining characters, e.g. ü
> represented as u followed by ¨, or a CJKV ideograph.
>
> A lot of glyphs *cannot* be represented by a single Unicode codepoint.
Your example is a case of "unreversing" a reversal done during the
up-casing process.
My guess is that in the "ü represented as u followed by ¨" case, it would
work just right: the "u" would be up-cased to "U", and the "¨" would
follow capital "U" (following the lists:reverse to unreverse the list).
I don't think up-casing a CJKV ideograph makes any sense, so you'd
probably end up with the same string you started with.
So the question goes back to Mr. Veldstra (or anyone) as to why you would
want to reverse a Unicode string (unless it is to unreverse a previous
algorithmic reversal, in which case we have no problem with combining
characters).
Many tail-recursive algorithms produce reversed lists, so you need
lists:reverse/1 to put them back again at the end. Typical example:
to_upper(List) ->
    Rev = to_upper(List, []),
    lists:reverse(Rev).

to_upper([H|T], Acc) ->
    to_upper(T, [H band bnot 16#20|Acc]);
to_upper([], Acc) ->
    Acc.
Yes, there are other ways of writing to_upper/1.
Yes, I could have put a guard on it.
Yes, I can hear the non-ASCII people grinding their teeth over the band bnot
No, the lists:reverse/1 does not change the big O
Matt
Well, sometimes you do need to trick certain imps to go back to the
dimension they came from. (http://en.wikipedia.org/wiki/Mister_Mxyzptlk)
Apart from that, there is not much real use for reversed strings.
drahciR\
> My question was in regards to reversing strings, not lists of
> characters.
> Specifically, Hasan Veldstra's complaint that representing strings
> as lists
> doesn't work when you use lists:reverse to reverse them:
That wasn't what I said. I gave an example of when a string reversal
would fail as a consequence of treating Unicode codepoints as
characters ("characters" from user's point of view, not how Unicode
defines "characters").
>> This would not work on a string with combining characters, e.g. ü
>> represented as u followed by ¨, or a CJKV ideograph.
>>
>> A lot of glyphs *cannot* be represented by a single Unicode
>> codepoint.
>
> Your example is a case on "unreversing" a reversal done during the
> up-casing
> process.
Sorry, I'm not following you here. I didn't even mention upcasing in
my last message.
> My guess is that in the "ü represented as u followed by ¨" case,
> it would
> work just right: the "u" would be up-cased to "U", and the "¨"
> would follow
> capital "U" (following the list:reverse to unreverse the list).
Yes, maybe this would work, thanks to Erlang's awareness of Western
European scripts. How would you convert this string to uppercase in
Erlang though: "Καλημέρα κόσμε"? With libraries that are
available now, it's impossible.
How about doing case-insensitive comparisons of strings containing
Russian text? Or even doing a case-insensitive comparison of "straße"
and "STRASSE"? Again, no library support.
Or how about comparing two strings that look identical when printed,
but one of them contains the pre-composed "ü" character, while the
other contains "u" followed by "¨"? Again, you can't do this and
similar comparisons reliably using plain lists. Unless you implement
Unicode from scratch yourself, of course.
> I don't think up-casing a CJKV ideograph makes any sense
I know little about East Asian scripts, and I don't know if they have
the uppercase/lowercase distinction, but I never said you'd want to
upcase a CJKV ideograph.
> So the question goes back to Mr. Veldstra (or anyone) as to why you
> would
> want to reverse a Unicode string
I don't know. String reversal was a convenient example for the point
I was trying to make.
As you can see from the quoted text above, that's EXACTLY what I said.
>
> (It's also worth noting that while strictly speaking the append on
> a mutable array is O(m), in practice the coefficient is very small
> since it's implemented as a single memcpy.)
memcpy() isn't that fast.
By using suitable (immutable) trees, you can get concatenation down to
O(1) while still keeping random access at O(lg n). The way Erlang
(ab)uses lists as "iolists", you can get concatenation in O(1) followed
by O(n) flattening, so it's easy for an Erlang program to build a string
cheaply in either left-to-right or right-to-left order using lists, but
it wouldn't be using binaries.
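A minimal illustration of that iolist style (build/2 and flattened/1 are
illustrative names), mirroring the 100,000-append benchmark discussed
earlier in the thread:

```erlang
%% Append N pieces. Each step wraps the accumulator in a two-element
%% deep list: O(1) per append, no copying of earlier pieces.
build(0, Acc) -> Acc;
build(N, Acc) -> build(N - 1, [Acc, "a"]).

%% One O(n) flattening pass at the very end.
flattened(N) -> iolist_to_binary(build(N, [])).
```

Most I/O functions accept the deep list directly, so even the final
flattening is often unnecessary.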
To me, embedding regexps, LaTeX etc. in strings is painful and I make
loads of mistakes forgetting to quote things. Would it be a good idea to
have something like Python string literals, changing the details of
course, just to confuse Python programmers :-)
To turn off quoting, write ~n"abc\n", being short for [97,98,99,92,110],
as opposed to "abc\n", which is short for [97,98,99,10].
This would be especially useful for regexps ~r"..." and would allow things
like regexp:match(~r".......", ...) to be compiled far better than is
possible today.
We could use
~n"...." turn off quoting
~r"...." string is a regexp
~x"..." string is xml
~x/FunkyStuff ... FunkyStuff (string is xml terminated by FunkyStuff)
~myExpander/FunkyStuff .... FunkyStuff
(expand the stuff between the FunkyStuff literals with the function
myExpander (which must be defined in a parse transform)) - this mechanism
would generalise the idea of "..." being syntactic sugar for a list.
So ~Op"......" would mean that "......" was syntactic sugar for *anything*.
Is this a good idea or an invitation to write totally unreadable code?
This would make Erlang more powerful (and is backwards compatible) but
less readable - is the extra power worth the readability?
/Joe Armstrong
2008/2/12 tsuraan <tsu...@gmail.com>:
> Why does erlang internally represent strings as lists? In every language
> I've used other than Java, a string is a sequence of octets, just like
> Erlang's binary type. I know that you can represent a string efficiently by
> using <<"string">> rather than just "string", but why doesn't erlang do this
> by default? Is it just because pre-12B binary handling wasn't as efficient
> as list handling, or is Erlang intended to support UTF-32?
>
> Thanks for any input!
> ~n"...." turn off quoting
> ~r"...." string is a regexp
> ~x"..." string is xml
> ~x/FunkyStuff ... FunkyStuff (string is xml terminated by
> FunkyStuff)
> ~myExpander/FunkyStuff .... FunkyStuff
I agree with the difficulty of embedding languages into strings, and
avoid it myself whenever possible. Mostly because I'm puritanical, but
whatever.
My issue regards overloading the double-quote character. I'd rather
something completely generic to avoid situations like this:
~r"/"/"
~xml"<root attr="oops"/>"
and so on.
perl's qr/w/x operators might be worth looking at. They don't
completely fix the issue, but they work around it by allowing the
programmer to specify delimiters.
Personally, I'd rather have natural syntax. Both regexps and XML
naturally terminate or produce errors, so switching the parser into an
xml/regexp mode seems reasonable to me.
-bjc
It then becomes a problem to parse the straight HTML, which could
contain JavaScript. The browser is supposed to have similar smarts
in how it treats javascript quoting inside a quoted attribute.
However, this would ask for the order-2 smarts from the erlang parser.
I'd propose something like trac code, which is
{{{literal"string"<xml>, code, etc.}}}
where the number and the shape of braces is debatable. To avoid
confusion with tuples, perhaps 3 angle braces would do. Example:
<<<<html><head>"some invalid> htmlcode</html>>>>
which parses quite straightforwardly.
--
vlm
> On 18-Feb-2008, at 09:49, Joe Armstrong wrote:
>
>> ~n"...." turn off quoting
>> ~r"...." string is a regexp
>> ~x"..." string is xml
>> ~x/FunkyStuff ... FunkyStuff (string is xml terminated by
>> FunkyStuff)
>> ~myExpander/FunkyStuff .... FunkyStuff
>
> perl's qr/w/x operators might be worth looking at. They don't
> completely fix the issue, but they work around it by allowing the
> programmer to specify delimiters.
Interesting ideas! For comparison's sake, and food for thought, here's
how this issue is handled in several other languages:
-----------------------
Perl: per Brian's comment above, and more (~8 variations)
http://perldoc.perl.org/perlop.html#Quote-and-Quote-like-Operators
Python:
http://docs.python.org/ref/strings.html
* single or double-quoted : normal string literals, all escapes
processed
* Triple-quoted strings ("""example""", '''example''') may contain
unescaped newlines or quotes
* These may be prefixed by [uU] and/or [rR]
* u"", U"" = unicode string
* r"", R"" = raw (regexp) string, not interpreted for escape
sequences
PHP:
http://www.php.net/manual/en/language.types.string.php
* single-quoted : limited escapes
* double-quoted : all escapes processed, plus variables ($foo) expanded
* "heredoc syntax", ala Perl or Bourne shell
Ruby:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/tut_stdtypes.html#S2
* single-quoted : limited escapes
* double-quoted : all escapes processed
* %q or %Q : user-defined delimiter (ala Perl qr/w/x operators)
* "heredoc syntax"
-----------------------
I like Python's approach, because the elements are "stackable": you
can combine prefixes to say
ur"\u0062\n"
which yields three unicode characters ('LATIN SMALL LETTER B',
'REVERSE SOLIDUS', 'LATIN SMALL LETTER N'). I also find Python
triple-quotes to be as useful as "heredocs" in perl/php/ruby/bash, but
the syntax is simpler.
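A concrete check of that prefix behaviour (shown in Python 3, where the combined `ur` prefix is gone but plain raw strings and triple quotes behave as described above):

```python
# Raw strings leave backslash escapes uninterpreted.
raw = r"\u0062\n"
assert list(raw) == ["\\", "u", "0", "0", "6", "2", "\\", "n"]

# Ordinary strings process the escapes: \u0062 is 'b'.
cooked = "\u0062\n"
assert cooked == "b\n"

# Triple-quoted strings may contain unescaped quotes and newlines.
block = """He said "Foo!"
and that was that."""
assert '"Foo!"' in block and "\n" in block
```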
Cheers,
--Matt
--
Matt Kangas
kan...@bway.net – www.p16blog.com
> To me, embedding regexps, LaTeX etc. in strings is painful and I make
> loads of mistakes forgetting to quote things.
Patient: Doctor, it hurts when I do <this>.
Doctor: Then don't do that.
We've had essentially this discussion before.
I'm reminded of the classic design botch in SGML.
XML has <![CDATA[...]]>
in which the only special characters are the "]]>" ending quote,
the others being taken literally. Too bad if you want to nest them!
SGML also has <!ELEMENT tag - - CDATA>
which lets you use <tag>...</tag> to quote the characters ... *and*
to put a wrapper around them. It would be perfect for quoting bits
of SGML in tutorials except that (here comes the botch): it is closed
by *any* end-tag, not just the one it started with.
There really is no programming language which handles "textual things
inside textual things" terribly well. The quoting and meta-quoting stuff
ends up being pretty nasty no matter what you do. The specific proposal
Joe made just now
- would make life horrible for editors like Emacs
- would make things very confusing for people
- would STILL be hard to use.
So, in the light of the old joke, let's not do that.
Principle 1:
NO NESTING.
I love nested blocks, and have since Algol 60.
I love nested expressions, and have since Lisp.
But when you want to combine multiple notations, as for example
XSLT (hiss, spit) does, nesting is for the birds.
I would have called this ONE HEADACHE AT A TIME, but cannot see
how to get by with fewer than two.
Principle 2:
NAME AND CONQUER.
If it's big enough to be a problem, it's big enough to have a name.
Principle 3:
WHAT I SEE ISN'T WHAT HE GETS.
In order to be used in Erlang programs, a notation has to be accepted
by the Erlang tool chain, but it does not have to be part of the Erlang
language or understood by the Erlang compiler. We have already
accepted this idea for Yecc and Leex. Keeping it out of the compiler
is also suggested by the next principle:
Principle 4:
LET A HUNDRED SCHOOLS CONTEND.
It's most unlikely that we'll come up with the right design, or even
the right *kind* of design, on the first pop. Maybe the right way to
do it is to write all our code in Microsoft Word (hiss, spit, screech,
jump, claw!) using styles to distinguish one reading of the text from
another. Maybe we should be using an SGML-based or XML-based markup
language (not entirely unlike the one I proposed some years ago,
perhaps) with something like Amaya as our editor. Maybe we should be
asking the aliens from Zeta Reticuli to do our programming for us.
Principle 5:
RUN IT UP THE FLAGPOLE AND SEE IF ANYONE FAINTS.
Here's a sketch of something that can handle large chunks of text in a
mixture of notations. The key ideas are
- there are Notations (hmm, haven't I heard that before, oh yes, it
was SGML...). A Notation provides a rule for quoting interpolated
text, and may also be associated with a syntax checker.
- there is interpolation, of two kinds. In the case of Literal
interpolation, a string in any notation is interpolated as literal
data according to the Notation's rule. In the case of Interpreted
interpolation, a text in one notation may be interpolated in text
of the same notation only, the result being subject to the syntax
check of the notation, if any.
- @id@ indicates literal interpolation
@@ is a plain @
%id% indicates interpreted interpolation
%% is a plain %
%\n is removed; it's continuation.
These characters were chosen to be minimally obtrusive in LaTeX
and regular expressions and Erlang text.
- Text chunks have names, which can be used in Erlang code.
- Text chunks may have arguments.
<text definition> ::=
<function name> '(' [<arguments>] ')' ['/' <notation name>] 'is' '\n'
<data line>*
'.' '\n'
<data line> ::=
<one white space character> <data item>* ['%'] '\n'
<data item> ::=
'%' <expr> '%'
| '%' '%'
| '@' <expr> '@'
| '@' '@'
| [^%@\n]
<expr> ::=
<variable>
| <function name> '(' [<expr> {',' <expr>}] ')'
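To show how small the <data item> level really is, here is a rough Python tokenizer for a single data line (my own reading of the sketch above; the item names `char`, `literal`, and `interp` are invented for the example):

```python
import re

# Try @@ and %% before the bracketed forms so doubled characters
# are recognised as escapes, per the <data item> grammar.
_TOKEN = re.compile(r'@@|%%|@[^@]*@|%[^%]*%|[^@%]+')

def scan_data_line(line):
    """Split one data line into (kind, text) items per the sketch grammar."""
    items = []
    for tok in _TOKEN.findall(line):
        if tok == '@@':
            items.append(('char', '@'))        # @@ is a plain @
        elif tok == '%%':
            items.append(('char', '%'))        # %% is a plain %
        elif tok.startswith('@'):
            items.append(('literal', tok[1:-1]))   # literal interpolation
        elif tok.startswith('%'):
            items.append(('interp', tok[1:-1]))    # interpreted interpolation
        else:
            items.append(('char', tok))
    return items

assert scan_data_line('@time()@') == [('literal', 'time()')]
```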
The set of notations we'd need has yet to be determined,
but it would certainly include
latex
regexp
xml
string (" and \ are special)
atom (' and \ are special)
url (\ is illegal)
Example:
time()/regexp is
^1?[0-9]:[0-5][0-9] [AP]M$
.
explanation()/latex is
The regular expression \verb|@time()@| matches
any string of the form
\textit{h}:\textit{m}\verb*| |\textit{ampm}
where \textit{h} is one or two decimal digits,
representing an hour 1--12, \textit{m} is two
decimal digits, with a leading zero if necessary,
representing a minute 00--59, and \textit{ampm}
is either AM (\textit{ante meridiem}) or
PM (\textit{post meridiem}).\footnote{For the pedants
amongst you, note that it is meridiEM, not
meridiAN}
.
base()/url is
http://erlang.example.org/%
.
relative()/url is
erlang/doc/preproc/hundred.html%
.
complete()/url is
%base()%%relative()%%
.
omnium_gatherum() is
{@time()@,@explanation()@,@complete()@}%
.
A preprocessor would turn this into plain Erlang.
There isn't actually much, if anything, in this that is specific to Erlang.
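For what it's worth, the time() pattern from the example can be exercised directly (here with Python's re module, assuming the regexp dialect carries over unchanged):

```python
import re

time_re = re.compile(r'^1?[0-9]:[0-5][0-9] [AP]M$')

assert time_re.match('9:05 AM')
assert time_re.match('12:59 PM')
assert not time_re.match('9:5 AM')     # minutes need two digits
# Note the pattern is slightly looser than the prose explanation:
# 1?[0-9] also admits hours like 0 or 13-19.
assert time_re.match('13:00 PM')
```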
> Principle 5:
> RUN IT UP THE FLAGPOLE AND SEE IF ANYONE FAINTS.
Absolutely. :-)
> Principle 4:
> LET A HUNDRED SCHOOLS CONTEND.
Erm... are you the one now expressing PERL envy? :-) One of Perl's
mottos, after all, is: "There's more than one way to do it."
I think there is value in providing *one obvious solution* to a
problem. If it solves > 90% of common use-cases, is syntactically
simple, and easy to use in the default configuration, then it's
probably worth considering. The value here is not total expressive
power, but instead the likelihood of adoption by users, and thus
likelihood of solving real, in-the-wild problems.
(Yes, this is the standard "Zen of Python" retort to Perl mongers.)
I'm fascinated by the flexibility you propose, but confused about the
implications. Should we need to support a Tower of Hanoi for
notations? How likely are users to ever embed > 1 notation? > 2?
--Matt
--
Matt Kangas
kan...@bway.net – www.p16blog.com
On Wed, Feb 13, 2008 at 10:07:52PM -0800, Zvi wrote:
>
> I think you're confusing the datatype with its implementation/representation.
> My biggest problem with the Erlang standard string representation is that in
> 64-bit mode, each character takes 16 bytes.
From the "use the right datatype for the job" corner:
I was afraid of the 8 bytes / character problem when I started my pet
project. However, I keep the data inside mnesia, and I observed that
mnesia seems to store strings in a compressed form, so I never bothered
to add explicit to-binary conversion (... myself?)
Regards,
-is
Yes, when the integers are all 255 or smaller (a true string() type value)
the list can (will?) be turned into a vector in the external
representation [1].
A guess is that it wouldn't be difficult to add a 16-bit vector
representation for an imaginary string16() type, but it would be a bit
more difficult to measure whether it is worth it, especially when binary
comprehensions make it so easy to do something similar explicitly.
[1] : http://www.erlang.org/doc/apps/erts/erl_ext_dist.html#8.12
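That external-format optimisation is easy to see by encoding a small string by hand. A sketch of what term_to_binary("ABC") produces, assuming the documented external term format (version byte 131, STRING_EXT tag 107, 2-byte big-endian length, then the raw bytes):

```python
import struct

def encode_string_ext(s):
    """Encode a list of small integers (an Erlang "string") as STRING_EXT."""
    data = s.encode("latin-1")      # every element must fit in one byte
    assert len(data) <= 0xFFFF      # STRING_EXT length field is 16 bits
    return bytes([131, 107]) + struct.pack(">H", len(data)) + data

assert encode_string_ext("ABC") == bytes([131, 107, 0, 3, 65, 66, 67])
```

So on the wire "ABC" costs 3 bytes of payload, not 3 cons cells.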
What's wrong with heredocs, like ruby or sh has?
$stdout.write <<EOM
this is some text
"this is some more" text
this isn't ugly text at all
EOM
You can put any token where the EOM is, and that's what finishes the
string. If you want to have nice indentation, you can use <<-EOM, and
then the EOM can be indented. Vim recognizes the tag that you use,
and switches to that language, if it exists, so you can do <<HTML, and
then it uses html highlighting.
What are the problems with this style of string?
They are OK so long as the language supports indented here docs which
remove leading whitespace, so that they don't screw up the layout of the
code.
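The indentation-stripping behaviour asked for here is easy to see in miniature. Python's textwrap.dedent (used purely as a neutral illustration, since Erlang has no heredocs) removes the common leading whitespace so the literal can follow the surrounding code's layout:

```python
import textwrap

def heredoc(s):
    """Strip the common leading indentation, as an indented heredoc would."""
    return textwrap.dedent(s)

body = heredoc("""\
    this is some text
    "this is some more" text
    """)
assert body == 'this is some text\n"this is some more" text\n'
```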
Tony.
--
f.a.n.finch <d...@dotat.at> http://dotat.at/
SOUTHEAST ICELAND: SOUTHERLY, VEERING WESTERLY FOR A TIME, 6 TO GALE 8. ROUGH
OR VERY ROUGH. RAIN OR SHOWERS. MODERATE OR GOOD, OCCASIONALLY POOR.
Your take on it is interesting/nice though.
I am *not* suggesting that there ought to be many ways to do it in the
language. What I am suggesting is pretty much the opposite. The idea
in Perl is to avoid design work. I'm saying let's try lots of designs
BEFORE adding anything to the language, and then let's add at most ONE
thing. After all, I said to let a hundred schools CONTEND, not to let
a hundred schools PREVAIL.
> I think there is value in providing *one obvious solution* to a
> problem.
"For every problem, there is an answer which is simple, obvious, and
WRONG."
Joe probably thinks his sketch is close to being "one obvious solution" to
the needs he expressed. But I think it is horribly ugly, error prone, and
more of a pain in the rectal region to deal with than the problem it is
supposed to solve.
> If it solves > 90% of common use-cases,
Then it isn't a solution. Let's face it, what we have NOW solves > 90%
of common use-cases.
> is syntactically simple,
The solution Joe proposed is NOT syntactically simple. It puts a great
deal of syntactic (and some semantic) processing in the lexical analyser,
which is not a really wonderful place to put it. For the problem he is
trying to deal with, this may not be avoidable: trying to embed several
levels of lexical structure that were never designed to fit together is
NOT going to be easy.
>
> I'm fascinated by the flexibility you propose, but confused about the
> implications. Should we need to support a Tower of Hanoi for
> notations? How likely are users to ever embed > 1 notation? > 2?
I see no tower of Hanoi here. In fact it is precisely the point of my
design that to support an additional notation
- the "meta-notation" (function header, lines, %% and @@ insertions,
and dot) are all handled by the *framework*, which remains completely
ignorant of any specific notation
- you add ONE function that takes a string and adds the quotation
needed for your particular notation.
Instead of towers, there are at most bucket brigades.
Here's another approach.
My emacs-like text editor "thief" has a command ESC [ ` which means
"convert region to HTML by changing the characters <>"'& to entity
references."
(All the HTML commands are on ESC [.) So I can (and do) write whatever I
want, such as embedded programming language text, just the way it looks,
and then convert it.
I also have a library package with quoting and unquoting code for
AWK
C and C++ (without trigraphs)
C and C++ (with trigraphs)
Csh
DEC-10 Prolog
Fortran 77 and 90 (but only printing characters)
Java
Lisp
M4
Quintus Prolog
sh
TeX
I happen not to have needed Eiffel, Erlang, or Haskell in this library
yet, but it is really quite a small matter of programming to do that. It
would also be a small matter of programming to plug these into my editor.
So if I wanted a fragment of TeX in the query of a URL, I would then
be able to
1. Type the text the way I would normally type it.
2. Select the TeX part and ESC ` u (quote region as URL)
3. Select the whole URL and ESC ` e (quote region as Erlang)
Please remember, the framework for this DOES exist, but quote-as-URL and
quote-as-Erlang currently do NOT. What I am demonstrating here is a DESIGN.
This design completely solves the problem of WRITING embedded notations
without ANY language change whatever. (As does my previous proposal; that
was for a fairly language-independent preprocessor, NOT for something to
go in the Erlang compiler.)
It doesn't really solve the problem of writing embedded notations READABLY,
which my previous proposal did (and which Joe's proposal failed to).
The simplest, most obvious design that could solve Joe's problem in a
readable way has four levels:
(1) lexical: some kind of 'literal' string
(2) syntactic: no change to the existing language whatever
(3) library: a suite of functions that take a string and add whatever
quotation is needed for a specified notation to treat all the
characters literally (rather like my C library mentioned above).
(4) optimisation: the compiler is allowed, but not required, to
evaluate calls to certain functions with known arguments at
compile time; the quoting functions may but need not be in the
set of such functions.
The difficult thing is (1), which I think Joe wants anyway. I'm aware of
several lexical devices for this, and they all stink in one way or another,
because there isn't ANY delimiter character that you might not want to
include in the data; it is even conceivable that the data might include at
least one instance of every character. The only lexical design that doesn't
have that problem is the old Fortran 66
<count>H<characters>
notation, which is easy enough to generate with a text editor. Perhaps we
could say that a literal string begins with n+2 quotation marks and a single
character that is not a letter, digit, space, or tab, and then ends with
another copy of that single character followed by n+2 quotation marks. (In
Erlang, the quotation marks could be " for a string or ' for an atom.) For
any literal string, there is some longest block of quotation marks, so it
is always possible to select a bracketing run that is longer. Note that the
single character that ends the run of quotation marks could be a new line,
so we could have
Literal_String = """
Here is `'"\$^some literal text with an embedded
but no trailing newline
""",
Another = ""!He said "Foo!" But that was not the end!!""
Hm. I think I may finally have something simple, obvious, readable, and
it just might work.
here's an excerpt from http://www.lua.org/manual/5.1/manual.html#2.1
specifying how literal strings can be defined in Lua:
Literal strings can also be defined using a long format enclosed by
long brackets. We define an opening long bracket of level n as an
opening square bracket followed by n equal signs followed by another
opening square bracket. So, an opening long bracket of level 0 is
written as [[, an opening long bracket of level 1 is written as [=[,
and so on. A closing long bracket is defined similarly; for instance,
a closing long bracket of level 4 is written as ]====]. A long string
starts with an opening long bracket of any level and ends at the first
closing long bracket of the same level. Literals in this bracketed
form may run for several lines, do not interpret any escape sequences,
and ignore long brackets of any other level. They may contain anything
except a closing bracket of the proper level.
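A sketch (in Python, purely for illustration) of how a lexer finds the end of such a Lua long string: having seen an opening bracket of level n, it simply searches for the first closing bracket of the same level. (Real Lua additionally drops a newline immediately after the opening bracket, which this sketch ignores.)

```python
import re

def read_long_string(src):
    """Given text starting at a Lua long bracket, return (contents, rest).

    An opening long bracket of level n is '[' + '='*n + '['; the string
    runs to the first ']' + '='*n + ']' of the same level.
    """
    m = re.match(r'\[(=*)\[', src)
    if not m:
        raise ValueError("not a long bracket")
    close = ']' + m.group(1) + ']'
    end = src.index(close, m.end())          # first close of the same level
    return src[m.end():end], src[end + len(close):]

body, rest = read_long_string('[==[a "quoted" ]] string]==] tail')
assert body == 'a "quoted" ]] string'       # lower-level ]] passes through
assert rest == ' tail'
```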
Richard -- re: your last post on this subject, which I found quite
illuminating as well...
A syntax for literal string declarations is, by definition, syntactic
sugar. It's not making the core language more expressive. It's for
improving the human-machine interface in specific situations. Since
usability is really the goal, then your first-and-foremost requirement
is to... make it usable in those situations. Syntactic power (nesting
multiple syntaxes) is *very nice*, but should be a distinctly
secondary goal.
As you said, no one quoting character sequence will suffice for _all_
situations. In the case of multi-line strings, approaches we've seen
include:
1) Define a framework, have the framework know (and hide) the
appropriate terminating char-sequence
2) Let the user define a terminating char sequence. (Perl/PHP/Ruby's
answer)
3) Have one terminating multi-line char sequence. (Python's answer)
4) Have one char-sequence, but permit its length to vary to allow
nesting, within reason. (Lua's answer)
Richard, from your last post:
> Literal_String = """
> Here is `'"\$^some literal text with an embedded
> but no trailing newline
> """,
> Another = ""!He said "Foo!" But that was not the end!!""
That example falls into camp (2), user-defined terminating char, yes?
I suppose the motivation for (3) or (4) could be, perhaps, a desire to
make the string-enclosing syntax consistent, thus making it easier to
read unfamiliar code. The reader doesn't have to guess (or look
carefully for) what terminating-sequence the author chose. (4)
encourages consistency while still permitting nesting.
As I said before, I think nesting is a nice, not necessary, property.
I presume you won't consider anything a "solution" unless it _is_
nestable. :)
Comparing (1) and (2), I believe the programmer who's writing the code
is best-positioned to decide what's an appropriate terminating sequence.
I think hiding the terminating sequence behind a name ("/xml", "/latex",
"/url") is likely to cause bugs, or at least weird compilation errors.
And.. we haven't discussed "raw" strings for regexes. Doh!
Joe's original proposal was:
> ~n"...." turn off quoting
> ~r"...." string is a regexp
> ~x"..." string is xml
> ~x/FlunkyStuff ... FunkyStuff (string is xml terminated by
> FunkyStuff)
> ~myExpander/FunkyStuff .... FunckyStuff
Richard, which parts of this seem especially troublesome, and which
are salvageable?
(IMO, seems like a combination of camps (1) and (2) per above...)
--Matt
--
Matt Kangas
kan...@bway.net – www.p16blog.com
> 1) Define a framework, have the framework know (and hide) the
> appropriate terminating char-sequence
> 2) Let the user define a terminating char sequence. (Perl/PHP/
> Ruby's answer)
> 3) Have one terminating multi-line char sequence. (Python's answer)
> 4) Have one char-sequence, but permit its length to vary to allow
> nesting, within reason. (Lua's answer)
>
> Richard, from your last post:
>
>> Literal_String = """
>> Here is `'"\$^some literal text with an embedded
>> but no trailing newline
>> """,
>> Another = ""!He said "Foo!" But that was not the end!!""
>
>
> That example falls into camp (2), user-defined terminating char, yes?
It's really closer to (1), or arguably to (4).
The framework says the opening and closing sequences are x y and y x
respectively, where x is 2 or more quotation marks and y is either
a newline or a printing character that is not a quotation mark.
The user only gets to choose how many quotation marks to use and which
non-quotation-mark character. *ALL* strings still begin and end with a
quotation mark.
It is important that there isn't any such animal as a user-defined
terminating CHARACTER, only a user-selected terminating SEQUENCE.
> I suppose the motivation for (3) or (4) could be, perhaps, a desire
> to make the string-enclosing syntax consistent, thus making it
> easier to read unfamiliar code. The reader doesn't have to guess
> (or look carefully for) what terminating-sequence the author chose.
> (4) encourages consistency while still permitting nesting.
My suggestion also addresses this: "funny" strings always begin and end
with multiple quotation marks, so anything without multiple quotation
marks isn't a "funny" string.
>
> Comparing (1) and (2), I believe the programmer who's writing the
> code is best-positioned to decide what's an appropriate terminating
> sequence.
I can't believe this. My experience of using \verb|x| in LaTeX is that
time after time I've found myself choosing a delimiting character that
doesn't work. Fond as I am of TeX, it has warts, and this is one of
them.
What's even worse is that something that *did* work may cease to after
what looks like a small edit.
With my proposed literal string notation, and with several others, it would
be very straightforward to have an editor command literal-quote-region that
(a) determined the length of the longest run of quotation marks and
(b) always generated "..."| .... |"..." or, if | were frequent, or for any
other reason, selected some other suitable character.
Combine this with literal-unquote-region, and one can easily
- unquote the region
- make an edit
- requote it
and expect the result to work, whereas author-chosen terminators are less
likely to work.
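For concreteness, the quoting half of such an editor command fits in a few lines. A hypothetical Python sketch of the rule: bracket the text with a run of quotation marks one longer than any run inside it, so the closing sequence cannot occur in the payload:

```python
import re

def literal_quote(text, sep="|"):
    """Wrap text as "..."|text|"..." per the proposed literal-string rule.

    sep is the single delimiter character (must not be a letter, digit,
    space, or tab); pick another if '|' occurs often in the text.
    """
    runs = re.findall('"+', text)
    longest = max((len(r) for r in runs), default=0)
    quotes = '"' * max(longest + 1, 2)   # at least two, and longer than any run inside
    return quotes + sep + text + sep + quotes

assert literal_quote('He said "Foo!"') == '""|He said "Foo!"|""'
assert literal_quote('a""b') == '"""|a""b|"""'   # run of 2 inside -> bracket with 3
```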
> I think hiding the terminating sequence behind a name ("/xml", "/latex",
> "/url") is likely to cause bugs, or at least weird compilation errors.
My proposal does *NOT* hide terminating sequences behind /xml or /latex
or /url or anything else. Those things (mainly) name *ESCAPING* rules
determining what happens *after* the string has been read; they have
nothing whatsoever to do with deciding where the string *ends*.
>
> And.. we haven't discussed "raw" strings for regexes. Doh!
It's there.
>
> Joe's original proposal was:
>
>> ~n"...." turn off quoting
>> ~r"...." string is a regexp
>> ~x"..." string is xml
>> ~x/FlunkyStuff ... FunkyStuff (string is xml terminated by
>> FunkyStuff)
>> ~myExpander/FunkyStuff .... FunckyStuff
>
> Richard, which parts of this seem especially troublesome, and which
> are salvageable?
For one thing, ~n obviously cannot work; nor can anything which relies on
the termination sequence being a single character. Actually, it doesn't
turn off quoting; it quotes really hard. What it turns off is presumably
*escaping*.
For another, "n" is just too little a letter to bear a heavy freight of
meaning. All of these single-letter modifiers are just too Perlish, too
arcane, too obscure. I often irritate my daughters by quoting one of
T. S. Eliot's "Sweeney" poems to them: "You gotta use words when you
talk to me."
While strings may be a *compact* notation for regular expressions, they are
often a grossly inconvenient one. With hindsight, I realise that I have
spent more time desperately hacking away at regexp backslashes than I would
have lost by using some kind of S-expression-like format. Stringy
representations are popular in C and AWK because they don't *have* any
S-expression-like format, but Erlang does. Why stretch the syntax to
breaking point just in order to make it easier to do the wrong thing?
The same goes for ~x. Last year I explained how easy it would be to mix
XML with Erlang syntax, and why this would be so much *better* than having
XML strings. I don't want to have to go through all that again. Even
without that, an S-expression-like form (such as I use when hacking XML in
Scheme) is unutterably more convenient in almost every way than a
string-like form.
And of course I repeat that the main point of my proposal is keeping all
this stuff *out* of the language until we have some experience with several
solutions and know which one(s) work(s) best.
> so it's easy for an Erlang program to build a string cheaply in
> either left
> to right or right to left order using lists, but wouldn't be using
> binaries.
>
>
iolists are great for cheaply constructing data that is about to go out
on a port soon. iolists suck when some code later on has to
traverse/manipulate them. The port will just automatically flatten the
iolist in O(n), and that n is negligible because the port code has to
copy the data to an output buffer anyway, which is _also_ O(n).
One of the original goals with the binary() datatype was that
appending to the right should be as cheap as it is for an iolist
and it is.
List2 = [List1 | More]
or
List2 = [List1 , More]
doesn't really matter as long as More indeed is a list
Anyway,
Bin2 = <<Bin1/binary, More/binary>>
is as efficient as the iolist construction. It's just pointer
manipulation. O(1)
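Klacke's cost argument can be modelled outside Erlang too: right-appending an iolist is one cons (O(1), no copying), and the single O(n) flatten happens only when the data reaches the port. A rough Python sketch, with nested lists standing in for iolists:

```python
def append(iolist, more):
    """O(1) right-append: wrap in a new two-element node, no copying.

    The analogue of List2 = [List1 | More] in the post above."""
    return [iolist, more]

def flatten(iolist):
    """O(n) traversal in total output size; what the port driver does once."""
    if isinstance(iolist, str):
        return iolist
    return "".join(flatten(part) for part in iolist)

doc = "header"
doc = append(doc, "body")     # each append is constant time...
doc = append(doc, "footer")
assert flatten(doc) == "headerbodyfooter"   # ...the linear cost is paid here
```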
/klacke
Archaeological rear-view mirror coming up. The binary() datatype was
introduced by me when Per Hedeland and I, ages ago, ripped out all
explicit UNIX syscalls from the emulator proper.
Especially the file calls to load code. That's also when we introduced
the driver concept. A driver (C code), which from Erlang looks like
a Port, reads the actual file and sends it back up to Erlang. All the
code-loading BIFs that used to take a filename as parameter were changed
to take a binary() as a param instead.
Those were the days .... 1994 ??
/klacke