
[Caml-list] Correct way of programming a CGI script


Tom

unread,
Oct 8, 2007, 11:11:45 AM10/8/07
to Caml-list List
Hi! I am in the process of making a website (which might receive substantial
amounts of traffic), and am considering options for the backend. I discarded
PHP and other similar server-side scripting languages due to performance
problems (I suspect that PHP and the like would not scale well unless I
implemented complex caching techniques). I plan to use OCaml to generate
static .html documents from the content in the database. Since the content
will probably change less often than it is accessed, I believe this is
the better way (as opposed to accessing the database every time a user wants
to load the page).

So, OCaml programs will only run occasionally, to access the database and
generate HTML files using the data fetched from the DB. However, I am still
worried whether this would cause too much of a performance impact.

I have heard that OCaml is particularly slow (and probably memory-inefficient)
when it comes to string manipulation. What is the preferred way of handling
strings (building long strings from short parts, the kind of thing
StringBuilder would be used for in Java)? Does anybody have any experience
with this kind of application?

What about the startup time and memory usage of the program? Could these
affect the stability and efficiency of the web server?

(Hope someone will be able to decipher my language and care to answer :P )

- Tom

Dario Teixeira

unread,
Oct 8, 2007, 11:33:28 AM10/8/07
to Tom, Caml-list List
> Hi! I am in a process of making a website (which might receive substantial
> amounts of traffic), and am considering options for the backend. I discarded
> PHP and other similar server-side scripting languages, due to performance
> (...)

Hi Tom,

I suggest you take a look at Ocsigen (http://www.ocsigen.org/). It's
a fully featured web server written in OCaml that not only supports
static pages and traditional CGI programming, but also has a module
called Eliom that lets you build dynamic websites using the
best features of the OCaml language.

As for performance, the bottleneck will surely be the database backend.
Even when generating dynamic pages with Eliom, Ocsigen can easily output
close to a hundred pages per second on a decent machine. (And of course
it's even faster with static content!)

Cheers,
Dario Teixeira


_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Gerd Stolpmann

unread,
Oct 8, 2007, 12:05:53 PM10/8/07
to Tom, Caml-list List
On Monday, 08.10.2007, at 17:08 +0200, Tom wrote:
> Hi! I am in a process of making a website (which might receive
> substantial amounts of traffic), and am considering options for the
> backend. I discarded PHP and other similar server-side scripting
> languages, due to performance problems (I suspect that PHP and similar
> could not scale well, unless I implemented complex caching
> techniques). I plan to use OCaml to generate static .html documents
> from the content from the database. Since the content will probably
> change not as often as it will be accessed, I believe this is the
> better way (as opposed to accessing the database every time a user
> wants to load the page).
>
> So, OCaml programs will only be run seldomly to access the database
> and generate HTML files, using the data fetched from the DB. However,
> I am still worried whether this would cause too much performance
> impact.
>
> I heard that OCaml is particularly slow (and probably
> memory-inefficient) when it comes to string manipulation. What is the
> preferred way in handling strings (building long strings from short
> parts - something StringBuilder would be used in Java)? Does anybody
> have any experience concerning this kind of applications?

No, this is nonsense. Of course, you can slow everything down by using
strings in an inappropriate way, like

let rec concat_list l =
  match l with
  | [] -> ""
  | s :: l' -> s ^ concat_list l'

Use the Buffer module instead:

let concat_list l =
  let b = Buffer.create 243 in
  let rec concat l =
    match l with
    | [] -> ()
    | s :: l' ->
        Buffer.add_string b s;
        concat l'
  in
  concat l;
  Buffer.contents b
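
For the common case of joining a whole list of strings, the standard
library's String.concat does the same job in a single call and allocates
the result only once (essentially equivalent to the function above):

let concat_list l = String.concat "" l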

>
> What about the startup time and memory usage of the program? Could
> these affect the stability and efficiency of the web server?
>
> (Hope someone will be able to decipher my language and care to
> answer :P )

Have a look at ocamlnet (ocamlnet.sf.net). It has plenty of ways of
building web apps. For example, you can easily run your own HTTP server
in a multi-processing or multi-threading setup, or you can connect your
web app to Apache using FastCGI or a few other available protocols.
All of this scales well.

There is no support for generating HTML, however.

An example of a stand-alone web server (accompanied only by a
config file):

https://godirepo.camlcity.org/wwwsvn/trunk/code/examples/nethttpd/netplex.ml?rev=1122&root=lib-ocamlnet2&view=auto

Here is the same for the "connect to Apache" approach:

https://godirepo.camlcity.org/wwwsvn/trunk/code/examples/cgi/netcgi2-plex/?root=lib-ocamlnet2

Either way, it is possible to keep the connection to the database open in
case you need it for generating the page.

Hope this helps,

Gerd
--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
ge...@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------

Loup Vaillant

unread,
Oct 8, 2007, 12:12:31 PM10/8/07
to Caml-list List
2007/10/8, Tom <tom.pr...@gmail.com>:

>
> I heard that OCaml is particularly slow (and probably memory-inefficient)
> when it comes to string manipulation. What is the preferred way in handling
> strings (building long strings from short parts - something StringBuilder
> would be used in Java)? Does anybody have any experience concerning this
> kind of applications?

Someone (I don't remember the name) implemented ropes in OCaml. Ropes
were specifically designed for string manipulation, if I remember
correctly. Maybe it is worth a look.

Loup Vaillant

Christophe TROESTLER

unread,
Oct 8, 2007, 3:08:34 PM10/8/07
to caml...@inria.fr
On Mon, 8 Oct 2007 18:11:52 +0200, Loup Vaillant wrote:
>
> 2007/10/8, Tom <tom.pr...@gmail.com>:
> >
> > I heard that OCaml is particularly slow (and probably memory-inefficient)
> > when it comes to string manipulation. What is the preferred way in handling
> > strings (building long strings from short parts - something StringBuilder
> > would be used in Java)? Does anybody have any experience concerning this
> > kind of applications?
>
> Someone (don't remember the name) implemented ropes in Ocaml. Ropes
> were specifically designed for string manipulation, if I remember
> well. Maybe this is worth a look.

Have a look here: http://sourceforge.net/projects/ocaml-rope

ChriS

skaller

unread,
Oct 8, 2007, 5:38:14 PM10/8/07
to Gerd Stolpmann, Caml-list List
On Mon, 2007-10-08 at 18:04 +0200, Gerd Stolpmann wrote:

> > I heard that OCaml is particularly slow (and probably
> > memory-inefficient) when it comes to string manipulation.
>

> No, this is nonsense. Of course, you can slow everything down by using
> strings in an inappropriate way, like
>
> let rec concat_list l =
> match l with
> [] -> ""
> | s :: l' -> s ^ concat_list l'

Now Gerd, I would not call the claim nonsense. If you can't
use a data structure in a natural way, I'd say the claim indeed
has some weight.

The example above is ugly because it isn't tail recursive.
If you consider an imperative loop to concat the strings
in an array

let s = ref "" in
for i = 0 to Array.length a do
s := !s ^ a.[i]
done;

then Ocaml is likely to do this slowly. C++ on the other
hand will probably do this faster, especially if you reserve
enough storage first.

The poor performance of Ocaml in such situations is a result
of two factors. The first is the worst possible choice of
data representation: mutable characters and immutable length.
The mutability of characters has limited utility and prevents
easy functional optimisations; the useful mutability would
have to include both the content and the length (as in C++).

The second issue would probably make a functional string have
poor performance: Ocaml doesn't do any serious optimisations,
so it probably wouldn't recognize an optimisation opportunity
anyhow.

Note this is by design policy, it isn't a bug or laziness.
[I'm sure someone can quote a ref to Xavier's comments on this]

The effect is that you do have to make fairly low level choices
in Ocaml to get good performance. The good thing about this
is that the optimisation techniques are manifest in the
source code so you have control over them.

Felix does high-level optimisations, and sometimes a tiny
change in the code can cause orders-of-magnitude performance
differences; when I notice, it can take me (the author)
quite some time to track down what triggered the difference
in the generated code.

Now, back to the issue: in the Felix compiler itself, the
code generator is printing out C++ code. This is primarily
done with Ocaml string concatenation of exactly the kind which
one might call 'inappropriate'. Buffer is used too, but only
for aggregating large strings.

The reason, trivially, is that it is easier and clearer to write

"class " ^ name ^ " {\n" ^
catmap "\n" string_of_member members ^
"\n};\n"


than to use the imperative Buffer everywhere. The above gives more
clue to what the output looks like.

Despite the cost of using strings this way .. the compiler backend
code generator isn't a performance bottleneck.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

Erik de Castro Lopo

unread,
Oct 8, 2007, 6:22:54 PM10/8/07
to caml...@inria.fr
skaller wrote:

> Now Gerd, I would not call the claim nonsense. If you can't
> use a data structure in a natural way, I'd say the claim indeed
> has some weight.

The original claim was:

>> I heard that OCaml is particularly slow (and probably memory-inefficient)

>> when it comes to string manipulation. What is the preferred way in handling
>> strings (building long strings from short parts - something StringBuilder
>> would be used in Java)? Does anybody have any experience concerning this
>> kind of applications?

i.e. comparing Ocaml string handling to Java and other web languages like
PHP, Perl, Ruby and Python.

While I agree that yes, it is possible to write slow code in Ocaml
(or any other language), I suspect that idiomatic Ocaml string handling
compiled to a binary is just as fast if not faster than Java/Perl/Python/
Ruby/PHP/whatever.

Erik
--
-----------------------------------------------------------------
Erik de Castro Lopo
-----------------------------------------------------------------
"Windows was created to keep stupid people away from UNIX."
-- Tom Christiansen

skaller

unread,
Oct 8, 2007, 7:06:39 PM10/8/07
to caml...@inria.fr
On Tue, 2007-10-09 at 08:21 +1000, Erik de Castro Lopo wrote:
> skaller wrote:

> While I agree that yes, it is possible to write slow code in Ocaml
> (or any other language), I suspect that idiomatic Ocaml string handling
> compiled to a binary is just as fast if not faster than Java/Perl/Python/
> Ruby/PHP/whatever.

Fraid not. Python eats Ocaml alive. Python:

s= "a"
x = ""
for i in xrange(0,10000000):
x = x+s
print "done"

Time: 6 seconds. Without optimisation switched on.

Ocaml:
let x = ref "";;
let s = "a";;
for i = 0 to 100000 do
  x := !x ^ s
done;;
print_endline "done";;

Time: 4.5 seconds.

Notice one TINY difference ... Ocaml is processing only 100K strings.
Python is processing 10 MILLION strings in about the same time.

I cannot measure Ocaml's performance when the number is increased
to even 1 million because I have run out of coffee.
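
For reference, the same loop written with Buffer instead of repeated ^
(a sketch; no timings claimed here) appends in amortised constant time
per step:

let b = Buffer.create 16;;
let s = "a";;
for i = 1 to 10_000_000 do
  Buffer.add_string b s
done;;
print_endline "done";;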


--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


skaller

unread,
Oct 8, 2007, 7:21:11 PM10/8/07
to caml...@inria.fr
On Tue, 2007-10-09 at 09:05 +1000, skaller wrote:
> On Tue, 2007-10-09 at 08:21 +1000, Erik de Castro Lopo wrote:
> > skaller wrote:
>
> > While I agree that yes, it is possible to write slow code in Ocaml
> > (or any other language), I suspect that idiomatic Ocaml string handling
> > compiled to a binary is just as fast if not faster than Java/Perl/Python/
> > Ruby/PHP/whatever.
>
> Fraid not. Python eats Ocaml alive. Python:
>
> s= "a"
> x = ""
> for i in xrange(0,10000000):
> x = x+s
> print "done"
>
> Time: 6 seconds. Without optimisation switched on.

And here is the Felix (C++) version:

int i;
var x = "";
s := "a";
forall i in 1 upto 10_000_000 do
x += s;
done;
println$ len x;

Time: 0m0.198s

Which eats Python for breakfast .. forget about Ocaml.
For 100 million strings, time: 0m1.795s.
I don't have enough RAM to test the next order of magnitude.

Arnaud Spiwack

unread,
Oct 8, 2007, 7:24:31 PM10/8/07
to Caml List

> And here is the Felix (C++) version:
>
> int i;
> var x = "";
> s := "a";
> forall i in 1 upto 10_000_000 do
> x += s;
> done;
> println$ len x;
>
> Time: 0m0.198s

Out of curiosity, does it work as well (meaning as fast) if you write
"x = s + x" instead?


Arnaud Spiwack

skaller

unread,
Oct 8, 2007, 7:43:24 PM10/8/07
to caml...@inria.fr
On Tue, 2007-10-09 at 08:21 +1000, Erik de Castro Lopo wrote:
> skaller wrote:
>
> > Now Gerd, I would not call the claim nonsense. If you can't
> > use a data structure in a natural way, I'd say the claim indeed
> > has some weight.
>
> The original claim was:
>
> >> I heard that OCaml is particularly slow (and probably memory-inefficient)
> >> when it comes to string manipulation. What is the preferred way in handling
> >> strings (building long strings from short parts - something StringBuilder
> >> would be used in Java)? Does anybody have any experience concerning this
> >> kind of applications?
>
> ie comparing Ocaml string handling to Java and other web languages like
> php, perl, ruby and python.
>
> While I agree that yes, it is possible to write slow code in Ocaml
> (or any other language), I suspect that idiomatic Ocaml string handling
> compiled to a binary is just as fast if not faster than Java/Perl/Python/
> Ruby/PHP/whatever.

So, as shown in other posts, Ocaml really is SLOW with strings.
But here Erik says 'idiomatic'. I haven't tested this, but
again, this is probably wrong.

If you use Buffer for concatenation you'll get faster times than
the Ocaml (^) operator on strings, but what this misses is that other
operations on strings (such as searching, substring, etc.)
aren't available for Buffer. So in order to use these you'd have to

a) make a Buffer and add strings into it
b) get the string OUT of the buffer
c) call the functions
d) put the results back into some Buffer

This is not only extremely ugly because it is mixing functional
and imperative code .. it is probably as slow as two wet weeks
because of the conversions back and forth.

C++ strings on the other hand combine access to many operations,
both functional and mutating, and automatically provide
the 'Buffer'ing functionality as well. Unlike Buffer, however,
they're passed by value, so the 'string const' data type is
purely functional.

Note that Python strings are immutable, so surprisingly of all
the languages I considered .. Python's string operations are
actually purely functional.


--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


skaller

unread,
Oct 8, 2007, 7:54:34 PM10/8/07
to Arnaud Spiwack, Caml List
On Tue, 2007-10-09 at 01:23 +0200, Arnaud Spiwack wrote:
> > And here is the Felix (C++) version:
> >
> > int i;
> > var x = "";
> > s := "a";
> > forall i in 1 upto 10_000_000 do
> > x += s;
> > done;
> > println$ len x;
> >
> > Time: 0m0.198s
>
> Out of curiosity, does it work as well (meaning as fast) if you write "x
> = s+x" instead ?

I checked:

x = x + s

and that's slow, I guess x = s + x is also slow. In Felix I would be
able to fix the x = x + s case with an imperative reduction,
however Felix only supports functional reductions at the moment.
[A reduction is a user defined term rewriting rule].

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


David Teller

unread,
Oct 9, 2007, 1:50:51 AM10/9/07
to skaller, caml...@inria.fr
Here is the obligatory reference to the Shootout:

The two string manipulation benchmarks are
http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all
(not so much string manipulation as regexps; OCaml is among the
best here, but Python is about 2x faster -- I've tried improving it with
PCRE but the final result is not as fast as with Str)

http://shootout.alioth.debian.org/gp4/benchmark.php?test=fasta&lang=all
(not so much string manipulation as outputting strings; OCaml is
still among the best here, and Python is about 25x slower)

Fwiw.

Cheers,
David


On Tue, 2007-10-09 at 09:05 +1000, skaller wrote:

> On Tue, 2007-10-09 at 08:21 +1000, Erik de Castro Lopo wrote:
> > skaller wrote:
>
> > While I agree that yes, it is possible to write slow code in Ocaml
> > (or any other language), I suspect that idiomatic Ocaml string handling
> > compiled to a binary is just as fast if not faster than Java/Perl/Python/
> > Ruby/PHP/whatever.
>
> Fraid not. Python eats Ocaml alive. Python:


Christophe TROESTLER

unread,
Oct 9, 2007, 6:16:06 AM10/9/07
to caml...@inria.fr
On Tue, 09 Oct 2007 09:05:03 +1000, skaller wrote:
>
> On Tue, 2007-10-09 at 08:21 +1000, Erik de Castro Lopo wrote:
> > skaller wrote:
>
> > While I agree that yes, it is possible to write slow code in Ocaml
> > (or any other language), I suspect that idiomatic Ocaml string handling
> > compiled to a binary is just as fast if not faster than Java/Perl/Python/
> > Ruby/PHP/whatever.
>
> Fraid not. Python eats Ocaml alive. Python:

Are you sure you are comparing string manipulation and languages here?

> s= "a"
> x = ""
> for i in xrange(0,10000000):
> x = x+s
> print "done"
>
> Time: 6 seconds. Without optimisation switched on.

Time: 6.238s Without optimisation switched on.

> Ocaml:

let x = ref (Rope.of_string "")
let s = Rope.of_string "a";;
for i = 0 to 10_000_000 do
  x := Rope.concat2 !x s
done;;
print_endline "done"

Time: 2.047s Without optimisation switched on.

Cheers,
ChriS

Christophe TROESTLER

unread,
Oct 9, 2007, 6:21:12 AM10/9/07
to O'Caml Mailing List
On Tue, 09 Oct 2007 09:37:49 +1000, skaller wrote:
>
> If you use Buffer for concatenation you'll get faster times than
> Ocaml (^) operator on strings, but what this misses is that other
> operations on strings (such as searching, substring etc etc)
> aren't available for Buffer.

The other operations are implemented for ropes (except regular
expressions which will happen when I have some time or some help!)

> Note that Python strings are immutable,

So are ropes.

Cheers,
ChriS

Gerd Stolpmann

unread,
Oct 9, 2007, 6:27:50 AM10/9/07
to skaller, Caml-list List
On Tuesday, 09.10.2007, at 07:37 +1000, skaller wrote:
> On Mon, 2007-10-08 at 18:04 +0200, Gerd Stolpmann wrote:
>
> > > I heard that OCaml is particularly slow (and probably
> > > memory-inefficient) when it comes to string manipulation.
> >
> > No, this is nonsense. Of course, you can slow everything down by using
> > strings in an inappropriate way, like
> >
> > let rec concat_list l =
> > match l with
> > [] -> ""
> > | s :: l' -> s ^ concat_list l'
>
> Now Gerd, I would not call the claim nonsense. If you can't
> use a data structure in a natural way, I'd say the claim indeed
> has some weight.

I don't know what the "nature" of strings is. I'm rather inclined to believe
they are artifacts, and that there are several ways of defining strings,
mostly resulting in different runtime behaviour.

The point here is that ^ always copies strings, and this is generally
expensive, especially in this example, because the same bytes are copied
again every time the result string is extended.

I'm fully aware that you can get rid of this copying in the definition
of strings, but this has a price for some other operations. As I said, you
can implement strings in various ways.

> If you consider an imperative loop to concat the strings
> in an array
>
> let s = ref "" in
> for i = 0 to Array.length a do
> s := !s ^ a.[i]
> done;

I would call this version even uglier... But tastes differ.

The point is that neither the O'Caml runtime representation of strings
nor the compiler (which could recognize this specific use of ^ and
implicitly convert the code so that it uses a buffer) does anything to
avoid this trap.

But we have to be fair. It is simply nonsense to call O'Caml string
manipulation slow as a whole. You have access to all the operations you
need to do it fast; you just have to know how to code it.

> then Ocaml is likely to do this slowly. C++ on the other
> hand will probably do this faster, especially if you reserve
> enough storage first.

> Now, back to the issue: in the Felix compiler itself, the


> code generator is printing out C++ code. This is primarily
> done with Ocaml string concatenation of exactly the kind which
> one might call 'inappropriate'. Buffer is used too, but only
> for aggregating large strings.
>
> The reason, trivially, is that it is easier and clearer to write
>
> "class " ^ name ^ " {\n" ^
> catmap "\n" string_of_member members ^
> "\n};\n"
>
>
> than to use the imperative Buffer everywhere. The above gives more
> clue to what the output looks like.

Well, if you only concatenate a few strings it isn't going to be a problem,
and it is probably as fast as using a buffer (which also has some cost).

Gerd
--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
ge...@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------


Brian Hurt

unread,
Oct 9, 2007, 9:01:46 AM10/9/07
to skaller, Caml-list List
skaller wrote:

>On Mon, 2007-10-08 at 18:04 +0200, Gerd Stolpmann wrote:
>
>
>
>>>I heard that OCaml is particularly slow (and probably
>>>memory-inefficient) when it comes to string manipulation.
>>>
>>>
>>No, this is nonsense. Of course, you can slow everything down by using
>>strings in an inappropriate way, like
>>
>>let rec concat_list l =
>> match l with
>> [] -> ""
>> | s :: l' -> s ^ concat_list l'
>>
>>
>
>Now Gerd, I would not call the claim nonsense. If you can't
>use a data structure in a natural way, I'd say the claim indeed
>has some weight.
>
>The example above is ugly because it isn't tail recursive.
>If you consider an imperative loop to concat the strings
>in an array
>
> let s = ref "" in
> for i = 0 to Array.length a do
> s := !s ^ a.[i]
> done;
>
>then Ocaml is likely to do this slowly. C++ on the other
>hand will probably do this faster, especially if you reserve
>enough storage first.
>
>

And if you don't, and thus have to repeatedly allocate more memory, C++
is likely to be slower than Ocaml (poor allocation performance).
In fact, I'm willing to bet you can get near C++ speed by doing things
the C++ way: allocate the string once (with enough space), and then
use String.blit to fill it in.
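
A sketch of that preallocate-and-blit approach for joining an array of
strings (using the mutable strings of the time; concat_array is a name
made up here):

let concat_array a =
  let total = Array.fold_left (fun n s -> n + String.length s) 0 a in
  let out = String.create total in          (* one allocation up front *)
  let pos = ref 0 in
  Array.iter
    (fun s ->
      String.blit s 0 out !pos (String.length s);
      pos := !pos + String.length s)
    a;
  out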

That said, there are better implementations of strings for Ocaml. So
what? Ocaml isn't a string-processing language. Yes, there are things
which are probably better done in Perl/Python/Ruby. A language doesn't
have to be the perfect language for all purposes in order to be a good
language; in fact, in my experience, languages that try to be everything
to everybody end up being useless for all purposes (C++ being example #1
here).

Brian

Jon Harrop

unread,
Oct 9, 2007, 9:50:34 AM10/9/07
to caml...@yquem.inria.fr
On Tuesday 09 October 2007 11:20:04 Christophe TROESTLER wrote:
> On Tue, 09 Oct 2007 09:37:49 +1000, skaller wrote:
> > If you use Buffer for concatenation you'll get faster times than
> > Ocaml (^) operator on strings, but what this misses is that other
> > operations on strings (such as searching, substring etc etc)
> > aren't available for Buffer.
>
> The other operations are implemented for ropes (except regular
> expressions which will happen when I have some time or some help!)

Out of curiosity, do your ropes handle UTF-8 and UTF-16?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e

William D. Neumann

unread,
Oct 9, 2007, 10:03:02 AM10/9/07
to skaller, caml...@inria.fr
On Tue, 09 Oct 2007 09:05:03 +1000, skaller wrote

> Fraid not. Python eats Ocaml alive.

Sure. If you want to go about your task in a hideously naive manner. Let's
try something different:

# let time f a =
    let t0 = Unix.gettimeofday () in
    let r = f a in
    r, (Unix.gettimeofday () -. t0);;
val time : ('a -> 'b) -> 'a -> 'b * float = <fun>

# let a_cat n =
    let rec build_as acc = function
      | 0 -> acc
      | n -> build_as ("a" :: acc) (pred n)
    in String.concat "" (build_as [] n);;
val a_cat : int -> string = <fun>

# snd (time a_cat 1_000_000);;
- : float = 0.55100011825561523

Now, it's not necessarily the first thing a person might think of for this
task, and it's not applicable to all uses of concatenation, but for the task
that started this thread (slapping together bits of text to make a web page)
and for tasks like hard-coded concatenations it's very convenient.

--

William D. Neumann

Jon Harrop

unread,
Oct 9, 2007, 10:06:03 AM10/9/07
to caml...@yquem.inria.fr
On Monday 08 October 2007 17:04:49 Gerd Stolpmann wrote:
> > I heard that OCaml is particularly slow (and probably
> > memory-inefficient) when it comes to string manipulation. What is the
> > preferred way in handling strings (building long strings from short
> > parts - something StringBuilder would be used in Java)? Does anybody
> > have any experience concerning this kind of applications?
>
> No, this is nonsense...

In this context, yes. In general, strings are not as efficient as the
equivalent concrete data structure in C. Specifically, using strings as a
byte array and applying arithmetic operations to the elements is
significantly slower in OCaml than C.

The only option you have in OCaml is to blow your memory wad and use an int
array, which is fast but wastes enormous amounts of space and still has
different modulo-arithmetic properties (you might want 8-bit for some apps).
Consequently, OCaml is not very good for arithmetic operations over byte
arrays.
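
One middle ground (a sketch; no claim that it beats strings for any
particular workload) is a Bigarray of unsigned bytes, which stores one
byte per element and truncates stores modulo 256:

let () =
  let bytes = Bigarray.Array1.create Bigarray.int8_unsigned Bigarray.c_layout 256 in
  for i = 0 to 255 do
    bytes.{i} <- i * 3               (* stored values are taken modulo 256 *)
  done;
  print_int bytes.{100}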

I discovered this on my Sudoku solver and revisited it with the Brainf*ck
interpreter. This has never bitten me in practice though.

Perhaps this is an issue for bioinformaticians or some image processing
applications?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e


skaller

unread,
Oct 9, 2007, 11:17:20 AM10/9/07
to Gerd Stolpmann, Caml-list List
On Tue, 2007-10-09 at 12:26 +0200, Gerd Stolpmann wrote:
> Am Dienstag, den 09.10.2007, 07:37 +1000 schrieb skaller:

> But we have to be fair. It is simply nonsense to call the whole O'Caml
> string manipulation slow. You have access to all operations you need to
> do it fast. You just have to know how to code it.

No you don't, that's the point. There is no fast way to append using
string. You can use Buffer, but then you can't (for example)
search. You can convert back and forth, but then you
pay an extra conversion cost.

C++ strings provide all the operations of both String and Buffer
and do not pay this cost.


--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


William D. Neumann

unread,
Oct 9, 2007, 11:19:14 AM10/9/07
to Jon Harrop, caml...@yquem.inria.fr
On Tue, 9 Oct 2007 14:56:37 +0100, Jon Harrop wrote

> In this context, yes. In general, strings are not as efficient as
> the equivalent concrete data structure in C. Specifically, using
> strings as a byte array and applying arithmetic operations to the
> elements is significantly slower in OCaml than C.
>
> The only option you have in OCaml is to blow your memory wad and use
> an int array, which is fast but wastes enormous amounts of space and
> still has different modulo-arithmetic properties (you might want 8-
> bit for some apps). Consequently, OCaml is not very good for
> arithmetic operations over byte arrays.

I'd moaned about this a few years ago, and Xavier pointed out the following:

"A better alternative is to declare

external get_byte: string -> int -> int = "%string_safe_get"
external set_byte: string -> int -> int -> unit = "%string_safe_set"

and use these two functions to access strings as if they were byte
arrays. set_byte will store the low 8 bits of its third argument, so
you'd save on "land 0xFF" operations too."

It works pretty well for getting and setting the bytes of a string. There
are also the int8_* bigarrays, but I've not used them much, so I can't say
whether they're of much help; they certainly weren't horrible, though.
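
A small usage sketch of those declarations (the loop itself is made up
here, not part of Xavier's message), bumping every byte of a mutable
string in place:

external get_byte : string -> int -> int = "%string_safe_get"
external set_byte : string -> int -> int -> unit = "%string_safe_set"

let incr_bytes s =
  for i = 0 to String.length s - 1 do
    set_byte s i (get_byte s i + 1)   (* only the low 8 bits are stored *)
  done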

--

William D. Neumann

skaller

unread,
Oct 9, 2007, 11:26:43 AM10/9/07
to William D. Neumann, caml...@inria.fr
On Tue, 2007-10-09 at 09:02 -0500, William D. Neumann wrote:
> On Tue, 09 Oct 2007 09:05:03 +1000, skaller wrote
>
> > Fraid not. Python eats Ocaml alive.
>
> Sure. If you want to go about your task in a hideously naive manner.

If you allow arbitrary code .. you could use the previously mentioned
ropes library in Ocaml and possibly do well .. and you could write fast
code in C++ using some other data structure too.

It's not clear then that you're using "strings".

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


skaller

unread,
Oct 9, 2007, 11:30:14 AM10/9/07
to Christophe TROESTLER, caml...@inria.fr

Of course that's nice, but Rope isn't the standard data structure.
Maybe it should be ..

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


William D. Neumann

unread,
Oct 9, 2007, 11:32:01 AM10/9/07
to skaller, Gerd Stolpmann, Caml-list List
On Wed, 10 Oct 2007 01:16:31 +1000, skaller wrote

> No you don't, that's the point. There is no fast way to append using
> string. You can use Buffer, but then you can't do (for example)
> search. You can convert back and forth, and then you
> pay an extra conversion cost.

So use buffer.ml with a slightly modified interface to create a rawBuffer
module that gives you direct access to the normally hidden string (and the
position of the end of the buffer). Presto: Buffer-like operations with a
string you can directly touch, search, blit, whatever.

No, it's not the default way the stdlib works, and again, it may not be the
first thing someone thinks of when they face this problem. But the
language makes this option available with a minimum of work. Is it ideal?
No. Is it awful? Again, no.
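
A minimal sketch of that idea (module and function names are hypothetical,
not the stdlib's buffer.ml):

module Raw_buffer = struct
  type t = { mutable data : string; mutable len : int }

  let create n = { data = String.create (max n 16); len = 0 }

  (* Append s, doubling the backing string when it runs out of room. *)
  let add_string b s =
    let n = String.length s in
    if b.len + n > String.length b.data then begin
      let bigger = String.create (max (2 * String.length b.data) (b.len + n)) in
      String.blit b.data 0 bigger 0 b.len;
      b.data <- bigger
    end;
    String.blit s 0 b.data b.len n;
    b.len <- b.len + n

  (* Direct access to the backing string and the end of the filled prefix,
     so it can be searched or blitted without an intermediate copy. *)
  let raw b = b.data
  let length b = b.len
  let contents b = String.sub b.data 0 b.len
end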
--

William D. Neumann

William D. Neumann

unread,
Oct 9, 2007, 11:33:35 AM10/9/07
to skaller, caml...@inria.fr
On Wed, 10 Oct 2007 01:25:45 +1000, skaller wrote

> It's not clear then you're using "strings".

You'd think it would be if you're using functions from the String module.

--

William D. Neumann

Vincent Hanquez

unread,
Oct 9, 2007, 11:50:17 AM10/9/07
to skaller, Christophe TROESTLER, caml...@inria.fr
On Wed, Oct 10, 2007 at 01:29:16AM +1000, skaller wrote:
> Of course that's nice, but Rope isn't the standard data structure.
> Maybe it should be ..

It should definitely not be standard, but it should be available as a choice
alongside OCaml strings. Each implementation has use cases where it performs
better (memory- or CPU-wise).

--
Vincent Hanquez

Vincent Hanquez

unread,
Oct 9, 2007, 11:57:29 AM10/9/07
to Jon Harrop, caml...@yquem.inria.fr
On Tue, Oct 09, 2007 at 02:40:48PM +0100, Jon Harrop wrote:
> Out of curiosity, do your ropes handle UTF-8 and UTF-16?

Out of curiosity, why would a string implementation (a bunch of
chars bundled together) have to handle UTF-X?

--
Vincent Hanquez

Jon Harrop

unread,
Oct 9, 2007, 11:58:06 AM10/9/07
to caml...@yquem.inria.fr
On Tuesday 09 October 2007 16:25:45 skaller wrote:
> It's not clear then you're using "strings".

It never was.

The concrete data structures used to represent strings in these languages are
different. So you've just picked a concrete data structure with slow append
and shown that its append is slower than that of a concrete data structure
with slow random access and worse memory usage.

This is just swings and roundabouts.

You might like to compare the performance of setting a single char in a string
in Python and OCaml...

> C++ strings provide all the operations of both String and Buffer
> and do not pay this cost.

Can C++ escape a string using OCaml syntax?

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e


Jon Harrop

unread,
Oct 9, 2007, 12:10:35 PM10/9/07
to caml...@yquem.inria.fr
On Tuesday 09 October 2007 16:49:13 Vincent Hanquez wrote:
> On Wed, Oct 10, 2007 at 01:29:16AM +1000, skaller wrote:
> > Of course that's nice, but Rope isn't the standard data structure.
> > Maybe it should be ..
>
> it should definitely not be standard, but be available as choice over
> ocaml strings. each implementation has some use cases when their perform
> better (memory/cpu wise).

Yes indeed. However, I'd like to pattern match over all of them.

Oh, and I'd like pattern matching over strings to be fast. Isn't there some
sexy Gray code or something that isn't just usefully fast but also qualifies
as research? Having the optimizing pattern match compiler generate linear
searches is just silly.

Oh, and I want OpenGL integrated into the language. I don't know why, or what
that's got to do with strings but I think its very important. And it must be
faster than C++. ;-)

--
Dr Jon D Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/products/?e


Loup Vaillant

unread,
Oct 9, 2007, 12:43:31 PM10/9/07
to Vincent Hanquez, caml...@yquem.inria.fr
2007/10/9, Vincent Hanquez <t...@snarc.org>:

> On Tue, Oct 09, 2007 at 02:40:48PM +0100, Jon Harrop wrote:
> > Out of curiosity, do your ropes handle UTF-8 and UTF-16?
>
> Out of curiosity, why would a string implementation (has a handle of
> chars bundle together) has to handle UTF-X ?

My 2 cents:

It is more convenient to consider strings as character arrays. Then
these characters are handled as atoms, even if they take several bytes
in the chosen encoding. Of course, multi-byte characters must be
supported as well.

Still, I can use byte arrays as strings. But that limits me to ASCII and
Latin-like encodings: if I want to do UTF-X, then I have to worry
about multi-byte characters myself. Internationalization made hard...

I would find it very convenient to have plain unicode strings (and
chars), with appropriate scan, print, byte_array_from_string and
string_from_byte_array functions, one bundle per supported encoding,
so that I don't need to think about the internals of such a string.

Loup Vaillant

Vincent Hanquez

unread,
Oct 9, 2007, 12:57:13 PM10/9/07
to Loup Vaillant, caml...@yquem.inria.fr
On Tue, Oct 09, 2007 at 06:42:32PM +0200, Loup Vaillant wrote:
> 2007/10/9, Vincent Hanquez <t...@snarc.org>:
> > On Tue, Oct 09, 2007 at 02:40:48PM +0100, Jon Harrop wrote:
> > > Out of curiosity, do your ropes handle UTF-8 and UTF-16?
> >
> > Out of curiosity, why would a string implementation (has a handle of
> > chars bundle together) has to handle UTF-X ?
>
> My 2 cents:
>
> It is more convenient to consider strings as characters arrays. Then,
> these characters are handled as atoms, even if they take several bytes
> in the chosen encoding. Of course, multi-byte characters must be
> supported as well.
>
> Still, I can use byte arrays as strings. But it limits me to ASCII and
> Latin-like encodings: if I want to do UTF-X, then I have to worry
> about multi-bytes characters myself. Internationalization made hard...
>
> I would find very convenient to have plain unicode strings (and
> chars), with appropriate scan, print, byte_array_from_string, and
> string_from_byte_array functions, one bundle per supported encoding.
> So I don't need to think about the internals of such a string.

By my question I wasn't suggesting that everybody should do
internationalization by hand.

We definitely also need some UTF string type library (which can use ropes,
strings, whatever internally), with all the common operations
(appending, finding, ...), but it's just a specific sub-case and also
a different type, not compatible with strings (in OCaml terminology).

--
Vincent Hanquez

Loup Vaillant

unread,
Oct 9, 2007, 1:33:21 PM10/9/07
to Vincent Hanquez, caml...@yquem.inria.fr
2007/10/9, Vincent Hanquez <t...@snarc.org>:

>
> By my question i wasn't suggesting that everybody should do
> internationalization by hand.

Sorry, I misinterpreted you.

> definitely we also need some UTFstring type library (which can use rope,
> string, whatever internally), with all common type of operations
> (appending, finding, ...), but it's a just a specific sub case and also
> a different type not compatible with strings (in OCaml terminology).

Then we should have both byte arrays (the native OCaml strings) and
unicode strings. We would also need proper syntactic sugar for unicode
strings: operators, and literal values (like #"example"). Only then could
ropes feel like native strings, and be useful as such.

> [...] it's a just a specific sub case [...]

Internationalization is, mere text crunching is not. (You meant that,
right?) With properly interfaced unicode strings, I can do my text
crunching without worrying about internationalization, and with no
programming overhead. Then, when (if) I have to internationalize, it
is much easier.

About the incompatibility: the two types of strings are incompatible
anyway. Maybe even more than ints and floats. Surely you once tried some
"Obj.magic" conversions of a non-English text with emacs. :-)

Loup Vaillant

ols...@verizon.net

unread,
Oct 9, 2007, 2:35:25 PM10/9/07
to

>
> Fraid not. Python eats Ocaml alive. Python:
>
> s= "a"
> x = ""
> for i in xrange(0,10000000):
> x = x+s
> print "done"
>
> Time: 6 seconds. Without optimisation switched on.
>


And just for the record, that's the 'slow' way to do it in Python. It
comes up that it's slow on comp.lang.python all the time. The fast
way is to append the individual elements to a list and then call
"".join(elements) as a homemade string builder. I can't time the
difference at the moment, though.

Vincent Hanquez

unread,
Oct 9, 2007, 3:51:56 PM10/9/07
to Loup Vaillant, caml...@yquem.inria.fr
On Tue, Oct 09, 2007 at 07:32:25PM +0200, Loup Vaillant wrote:
> > definitely we also need some UTFstring type library (which can use rope,
> > string, whatever internally), with all common type of operations
> > (appending, finding, ...), but it's a just a specific sub case and also
> > a different type not compatible with strings (in OCaml terminology).
>
> Then, we should have both byte arrays (the native Ocaml strings), and
> unicode strings. We will also need proper syntactic sugar for unicode
> strings. Operators, and literal values (like #"example"). Only then,
> ropes could feel like native strings --and be useful as such.

Not sure if I see your point here, since you are mixing ropes and
unicode. However, I think we are missing some other type of string
implementation (maybe ropes) *alongside* the current implementation of
string.

While we also miss unicode support somehow integrated, which
implementation of the underlying basic byte string is used is
irrelevant.

> > [...] it's a just a specific sub case [...]
>
> Internationalization is, mere text crunching is not. (You meant that,
> right?) With properly interfaced unicode strings, I can do my text
> crunching without worrying about internationalization, and with no
> programming overhead. Then, when (if) I have to internationalize, it
> is much easier.

Absolutely. What I meant basically boils down to this: unicode strings are
just a subset of strings (as arrays of bytes). You can store a unicode
string in a byte string, whereas you can't store a byte string in a
unicode string.

I want a UTF library to be able to do something like:

type ustring = unicode_type * string
of_string: string -> ustring (* raise if not unicode compliant *)
to_string: ustring -> string
append: ustring -> ustring -> ustring
..etc

That way, when I'm manipulating unicode strings, I won't try to append a
binary string to a unicode string. I can code safely with my unicode
strings (whatever the format, UTF-{8..32}), and certainly expect the type
system to complain loudly when I do something that might break unicode.

> About the incompatibility, the two types of strings are incompatible
> anyway.
>
> Maybe even more than ints and floats. Sure you once tried some
> "Obj.magic" conversions of an non-English text with emacs. :-)

I use vim ;), but heh after using Obj.magic you're on your own :)

--
Vincent Hanquez

Loup Vaillant

unread,
Oct 9, 2007, 5:07:34 PM10/9/07
to Vincent Hanquez, caml...@yquem.inria.fr
2007/10/9, Vincent Hanquez <t...@snarc.org>:

> On Tue, Oct 09, 2007 at 07:32:25PM +0200, Loup Vaillant wrote:
> > > definitely we also need some UTFstring type library (which can use rope,
> > > string, whatever internally), with all common type of operations
> > > (appending, finding, ...), but it's a just a specific sub case and also
> > > a different type not compatible with strings (in OCaml terminology).
> >
> > Then, we should have both byte arrays (the native Ocaml strings), and
> > unicode strings. We will also need proper syntactic sugar for unicode
> > strings. Operators, and literal values (like #"example"). Only then,
> > ropes could feel like native strings --and be useful as such.
>
> not sure If i see your point here, since your are mixing rope and
> unicode. however I think we are missing some other type of string
> implementation (maybe rope) *along* the current implementation of
> string.

Sorry. I just saw ropes as a possible implementation for unicode
strings (because there is no standard yet, we have the choice). By the
way, there is a good chance they are better than flat strings for most
purposes.

> while we also miss unicode support somehow integrated, what
> implementation of the underlaying basic byte string is used, is
> irrevelant.

I agree.

> > > [...] it's a just a specific sub case [...]
> >
> > Internationalization is, mere text crunching is not. (You meant that,
> > right?) With properly interfaced unicode strings, I can do my text
> > crunching without worrying about internationalization, and with no
> > programming overhead. Then, when (if) I have to internationalize, it
> > is much easier.
>
> Absolutely. What I meant basicly resume into, that unicode strings are
> just a subset of strings (as array of bytes). you can store a unicode
> string in a byte string, whereas you can't store a byte string into a
> unicode string.
>
> i want a UTF library to be able to do something like:
>
> type ustring = unicode_type * string
> of_string: string -> ustring (* raise if not unicode compliant *)
> to_string: ustring -> string
> append: ustring -> ustring -> ustring

> ...etc

Err, I would prefer this:

type ustring (* the type should be abstract, I think *)
of_string_utf8: string -> ustring (* raise if not utf8 compliant *)
to_string_utf8: ustring -> string
of_string_utf16: string -> ustring (* raise if not utf16 compliant *)
to_string_utf16: ustring -> string
of_string_latin1: string -> ustring (* raise if not latin1 compliant *)
to_string_latin1: ustring -> string (* raise if characters cannot be encoded
in Latin1 (the exception should contain a useful result) *)
(* etc for each encoding *)
scan_utf8 (* raise if not utf8 compliant (but do not lose a possible
partial result) *)
print_utf8
(* etc for each encoding *)

append: ustring -> ustring -> ustring

(* etc *) (* I agree on these ones *)

So I can mix-up different encodings:

print
(append
scan_Latin1
(of_string text))
(* this is not Lisp *)

It's just that the sample code you wrote suggests you could have different
types of unicode strings. I want only one type, so that I don't mind the
encoding, except when reading and printing (to files and native
strings). To mind even less, you could have general scan and
from_string functions which guess which encoding is used (not very
safe, but cool).

> that way when I'm manipulating unicode string, i won't try to append a
> binary string to a unicode string. I can code safely with my unicode
> string (whatever the format utf-{8..32}), and certainly expect the type
> system to complain loudly when doing something that might break unicode.

The exception system can do that, but how could the type system?
(Would be better if it could.)

Loup Vaillant

Chris King

unread,
Oct 9, 2007, 6:07:17 PM10/9/07
to Vincent Hanquez, caml...@yquem.inria.fr
On 10/9/07, Vincent Hanquez <t...@snarc.org> wrote:
> i want a UTF library to be able to do something like:
>
> type ustring = unicode_type * string
> of_string: string -> ustring (* raise if not unicode compliant *)
> to_string: ustring -> string
> append: ustring -> ustring -> ustring
> ...etc

Have you checked out Camomile [1]? It handles such things quite nicely.

[1] http://camomile.sourceforge.net/

Vincent Hanquez

unread,
Oct 10, 2007, 3:36:01 AM10/10/07
to Loup Vaillant, caml...@yquem.inria.fr
On Tue, Oct 09, 2007 at 11:06:59PM +0200, Loup Vaillant wrote:
> Err, I would prefer this:
>
> type ustring (* the type should be abstract, I think *)
> of_string_utf8: string -> ustring (* raise if not utf8 compliant *)
> to_string_utf8: ustring -> string
> of_string_utf16: string -> ustring (* raise if not utf16 compliant *)
> to_string_utf16: ustring -> string
> of_string_latin1: string -> ustring (* raise if not latin1 compliant *)
> to_string_latin1: ustring -> string (* raise if characters not encoded
> in Latin1 (the exeption should contain a usefull result) *)
> (* etc for each encoding *)
> scan_utf8 (* raise if not utf8 compliant (but do not lose a possible
> partial result) *)
> print_utf8
> (* etc for each encoding *)

Yes, I agree, that's what I had in mind, but I didn't want to clutter my example.
Internally the ustring can hold the format. It should look something like:

type unicode_type = UTF8 | UTF16 | UTF32 | Latin | .............


type ustring = unicode_type * string

of_string : string -> unicode_type -> ustring (* raise if not correct type *)

which is the same as what you describe, except that you have one parsing
function for every type ;)

> append: ustring -> ustring -> ustring
> (* etc *) (* I agree on these ones *)
>
> So I can mix-up different encodings:

yes it would be ideal.

> print
> (append
> scan_Latin1
> (of_string text))
> (* this is not Lisp *)
>
> Just that the sample code you wrote suggest you could have different
> types of unicode strings. I want only one type, so I don't mind the
> encoding, except when reading and printing (to files and native
> strings). To mind even less, you could have a general scan and
> from_string functions which guess which encoding is used (not very
> safe, but cool)

Or have an autodetect option in the of_string wanted-format argument, along
with the other encodings ;)

> > that way when I'm manipulating unicode string, i won't try to append a
> > binary string to a unicode string. I can code safely with my unicode
> > string (whatever the format utf-{8..32}), and certainly expect the type
> > system to complain loudly when doing something that might break unicode.
>
> The exception system can do that, but how could the type system?
> (Would be better if it could.)

Well, it does now, if you define ustring as an opaque type. You do your
parsing in one place (from string to ustring), and there it can
raise an exception if the input is not in a proper format. But once you're
manipulating ustrings, it's safe to do whatever you want with them.
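
A sketch of that scheme (names are hypothetical and the validator is
stubbed out): all checking happens in of_string, and the abstract type
keeps unvalidated bytes out afterwards.

module Ustring : sig
  type t
  val of_string : string -> t        (* raises Invalid_argument on bad input *)
  val to_string : t -> string
  val append : t -> t -> t
end = struct
  type t = string                    (* internally: validated UTF-8 bytes *)

  (* Placeholder: a real implementation would check the UTF-8 encoding here. *)
  let is_valid_utf8 _s = true

  let of_string s =
    if is_valid_utf8 s then s
    else invalid_arg "Ustring.of_string: not valid UTF-8"

  let to_string s = s
  let append = (^)
end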

--
Vincent Hanquez

Loup Vaillant

unread,
Oct 10, 2007, 4:05:58 AM10/10/07
to Vincent Hanquez, caml...@yquem.inria.fr
2007/10/10, Vincent Hanquez <t...@snarc.org>:

> On Tue, Oct 09, 2007 at 11:06:59PM +0200, Loup Vaillant wrote:
> > Err, I would prefer this:
> >
> >(* big snipet *)

>
> yes I agree, that what I had in mind, but didn't want to clutter my example.
> internally the ustring can hold the format. it should look something like:
>
> type unicode_type = UTF8 | UTF16 | UTF32 | Latin | .............
> type ustring = unicode_type * string
>
> of_string : string -> unicode_type -> ustring (* raise if not correct type *)
>
> which is the same as what you describe, except that you have one parsing
> function for every type ;)

That's even better.

> > print
> > (append
> > scan_Latin1
> > (of_string text))
> > (* this is not Lisp *)
> >
> > Just that the sample code you wrote suggest you could have different
> > types of unicode strings. I want only one type, so I don't mind the
> > encoding, except when reading and printing (to files and native
> > strings). To mind even less, you could have a general scan and
> > from_string functions which guess which encoding is used (not very
> > safe, but cool)
>
> or have a autodetect format in the of_string format wanted along with
> the other encoding ;)

But of course.


> > > that way when I'm manipulating unicode string, i won't try to append a
> > > binary string to a unicode string. I can code safely with my unicode
> > > string (whatever the format utf-{8..32}), and certainly expect the type
> > > system to complain loudly when doing something that might break unicode.
> >
> > The exception system can do that, but how could the type system?
> > (Would be better if it could.)
>
> well it does now if you define ustring as an opaque type. you do your
> parsing at one place (from string to ustring), at this place it can
> raise exception if not a proper format. but once you're manipulating
> ustring, it's safe to do whatever you want with them.

OK.


So, it seems we agree more or less on the interface. Now what about the
implementation? Ropes? Flat? I like ropes, personally: catenation is
made fast, and look-ups are still sub-linear. In general, ropes look
cool for functional strings. Last but not least, they were almost
mandatory for saving Endo at the ICFP contest this year. ;-)

Loup Vaillant

Vincent Hanquez

unread,
Oct 11, 2007, 9:04:31 AM10/11/07
to Chris King, caml...@yquem.inria.fr
On Tue, Oct 09, 2007 at 06:04:57PM -0400, Chris King wrote:
> On 10/9/07, Vincent Hanquez <t...@snarc.org> wrote:
> > i want a UTF library to be able to do something like:
> >
> > type ustring = unicode_type * string
> > of_string: string -> ustring (* raise if not unicode compliant *)
> > to_string: ustring -> string
> > append: ustring -> ustring -> ustring
> > ...etc
>
> Have you checked out Camomile [1]? It handles such things quite nicely.

The problem with Camomile is that it's so massive (27 modules) that
it's scary ;)

What I have in mind, for what should be in the standard library, is a very
simple module that defines something like 20 functions at most, providing
a valid unicode string type (for all common unicode encodings) and the
common operations (append, find, ...) on it.

--
Vincent Hanquez

Vincent Hanquez

unread,
Oct 11, 2007, 9:24:25 AM10/11/07
to Loup Vaillant, caml...@yquem.inria.fr
On Wed, Oct 10, 2007 at 10:05:11AM +0200, Loup Vaillant wrote:
> So, it seem we agree more or less on the interface. Now what about the
> implementation? Ropes? Flat? I like ropes, personally: catenation is
> made fast, and look-up are still sub-linear. In general Ropes look
> cool for functional strings. Last but not the least, they were almost
> mandatory for saving Endo at the ICFP this year. ;-)

Well, I think that plain OCaml strings are not the way to go; I'd rather
use some immutable string for that. I haven't looked at the ropes code in
detail, but that's probably fine as well.

--
Vincent Hanquez

skaller

unread,
Oct 11, 2007, 9:55:17 AM10/11/07
to Vincent Hanquez, caml...@yquem.inria.fr, Chris King
On Thu, 2007-10-11 at 15:03 +0200, Vincent Hanquez wrote:
> On Tue, Oct 09, 2007 at 06:04:57PM -0400, Chris King wrote:
> > On 10/9/07, Vincent Hanquez <t...@snarc.org> wrote:
> > > i want a UTF library to be able to do something like:
> > >
> > > type ustring = unicode_type * string
> > > of_string: string -> ustring (* raise if not unicode compliant *)
> > > to_string: ustring -> string
> > > append: ustring -> ustring -> ustring
> > > ...etc
> >
> > Have you checked out Camomile [1]? It handles such things quite nicely.
>
> The problem with camomile as it's actually so massive (27 modules) that
> it's scary ;)
>
> What i have in mind, in what should be the standard library, is a very
> simple module that define something like 20 functions at most, to
> have valid unicode strings type (of all common unicode types) and do
> common operations (append, find, ...) on it.

You can't: Camomile is massive for a reason.. the problem it
aims to solve is complex and hard to do efficiently without
a large set of specialised functions.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net


Vincent Hanquez

unread,
Oct 11, 2007, 10:22:38 AM10/11/07
to skaller, caml...@yquem.inria.fr, Chris King
On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote:
> You can't: Camomile is massive for a reason.. the problem it
> aims to solve is complex and hard to do efficiently without
> a large set of specialised functions.

You are assuming that I want efficiency, where I just want to print a few
unicode strings in a UI here and there. I *DON'T* want to be exposed to
full unicode; I need something like 1/100 of the Camomile library.

If I need to do something complex with unicode, or to control everything the
library is doing, I'd use Camomile.

--
Vincent Hanquez

Benjamin Monate

unread,
Oct 11, 2007, 10:28:20 AM10/11/07
to caml...@yquem.inria.fr
Vincent Hanquez wrote:

> On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote:
>> You can't: Camomile is massive for a reason.. the problem it
>> aims to solve is complex and hard to do efficiently without
>> a large set of specialised functions.
>
> You are assuming that i want efficiency where i want to print few
> unicode string in an ui here and there. I *DON'T* want to be exposed to
> full unicode, i need something like 1/100 of camomile library.
>
> If i need to do something complex with unicode or control everything the
> library is doing, i'ld use camomile.
>

If your UI happens to be using lablgtk2, you might consider using the Glib.Utf8 module.

Cheers,
--
| Benjamin Monate | mailto:benjami...@cea.fr |
| Ingénieur-Chercheur | CEA-LIST/DRT/DTSI/SOL/LSL |
| Bât. 528 Pt. 115a | 91191 Gif-sur-Yvette CEDEX |
| Tél. 01 69 08 94 09 | Fax : 01 69 08 83 95 |

skaller

unread,
Oct 11, 2007, 10:48:57 AM10/11/07
to Vincent Hanquez, caml...@yquem.inria.fr, Chris King
On Thu, 2007-10-11 at 16:21 +0200, Vincent Hanquez wrote:
> On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote:
> > You can't: Camomile is massive for a reason.. the problem it
> > aims to solve is complex and hard to do efficiently without
> > a large set of specialised functions.
>
> You are assuming that i want efficiency where i want to print few
> unicode string in an ui here and there. I *DON'T* want to be exposed to
> full unicode, i need something like 1/100 of camomile library.

In that case, you can use an int array for Unicode, provided 31 bits
are enough OR you have a 64-bit machine. These routines
should help with converting to and from UTF-8:

(* Parse the first utf8-encoded character of a string s
   starting at index position i; return a pair
   consisting of the decoded integer and the position
   of the first character not decoded.

   If the first byte is bad, it is returned;
   otherwise, if the encoding is bad, the result is
   an unspecified value.

   Fails if the index is past or at
   the end of the string.

   COMPATIBILITY NOTE: if this function is called
   with a SINGLE character string, it will return
   the usual value for the character, in the range
   0 .. 255
*)

let parse_utf8 (s : string) (i : int) : int * int =
  let ord = int_of_char
  and n = (String.length s) - i
  in
  if n <= 0 then
    failwith
    (
      "parse_utf8: index " ^ string_of_int i ^
      " >= " ^ string_of_int (String.length s) ^
      " = length of '" ^ s ^ "'"
    )
  else let lead = ord (s.[i]) in
  if (lead land 0x80) = 0 then
    lead land 0x7F, i+1                          (* ASCII *)
  else if lead land 0xE0 = 0xC0 && n > 1 then    (* 2-byte sequence *)
    ((lead land 0x1F) lsl 6) lor
    (ord(s.[i+1]) land 0x3F), i+2
  else if lead land 0xF0 = 0xE0 && n > 2 then    (* 3-byte sequence *)
    ((lead land 0x0F) lsl 12) lor
    ((ord(s.[i+1]) land 0x3F) lsl 6) lor
    (ord(s.[i+2]) land 0x3F), i+3
  else if lead land 0xF8 = 0xF0 && n > 3 then    (* 4-byte sequence *)
    ((lead land 0x07) lsl 18) lor
    ((ord(s.[i+1]) land 0x3F) lsl 12) lor
    ((ord(s.[i+2]) land 0x3F) lsl 6) lor
    (ord(s.[i+3]) land 0x3F), i+4
  else if lead land 0xFC = 0xF8 && n > 4 then    (* 5-byte sequence *)
    ((lead land 0x03) lsl 24) lor
    ((ord(s.[i+1]) land 0x3F) lsl 18) lor
    ((ord(s.[i+2]) land 0x3F) lsl 12) lor
    ((ord(s.[i+3]) land 0x3F) lsl 6) lor
    (ord(s.[i+4]) land 0x3F), i+5
  else if lead land 0xFE = 0xFC && n > 5 then    (* 6-byte sequence *)
    ((lead land 0x01) lsl 30) lor
    ((ord(s.[i+1]) land 0x3F) lsl 24) lor
    ((ord(s.[i+2]) land 0x3F) lsl 18) lor
    ((ord(s.[i+3]) land 0x3F) lsl 12) lor
    ((ord(s.[i+4]) land 0x3F) lsl 6) lor
    (ord(s.[i+5]) land 0x3F), i+6
  else lead, i+1  (* error: just pass the bad lead byte through *)


(* convert an integer into a utf-8 encoded string of bytes *)
let utf8_of_int i =
  let chr x = String.make 1 (Char.chr x) in
  if i < 0x80 then
    chr(i)
  else if i < 0x800 then
    chr(0xC0 lor ((i lsr 6) land 0x1F)) ^
    chr(0x80 lor (i land 0x3F))
  else if i < 0x10000 then
    chr(0xE0 lor ((i lsr 12) land 0xF)) ^
    chr(0x80 lor ((i lsr 6) land 0x3F)) ^
    chr(0x80 lor (i land 0x3F))
  else if i < 0x200000 then
    chr(0xF0 lor ((i lsr 18) land 0x7)) ^
    chr(0x80 lor ((i lsr 12) land 0x3F)) ^
    chr(0x80 lor ((i lsr 6) land 0x3F)) ^
    chr(0x80 lor (i land 0x3F))
  else if i < 0x4000000 then
    chr(0xF8 lor ((i lsr 24) land 0x3)) ^
    chr(0x80 lor ((i lsr 18) land 0x3F)) ^
    chr(0x80 lor ((i lsr 12) land 0x3F)) ^
    chr(0x80 lor ((i lsr 6) land 0x3F)) ^
    chr(0x80 lor (i land 0x3F))
  else
    chr(0xFC lor ((i lsr 30) land 0x1)) ^
    chr(0x80 lor ((i lsr 24) land 0x3F)) ^
    chr(0x80 lor ((i lsr 18) land 0x3F)) ^
    chr(0x80 lor ((i lsr 12) land 0x3F)) ^
    chr(0x80 lor ((i lsr 6) land 0x3F)) ^
    chr(0x80 lor (i land 0x3F))

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________
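
As a quick sanity check of the two routines above, a round trip through them
in the toplevel should behave roughly as follows (a hypothetical session; the
code points are arbitrary examples):

# utf8_of_int 0x20AC;;                        (* EURO SIGN: a 3-byte sequence *)
- : string = "\226\130\172"
# parse_utf8 (utf8_of_int 0x20AC) 0;;
- : int * int = (8364, 3)
# parse_utf8 ("a" ^ utf8_of_int 0x20AC) 1;;   (* start decoding at index 1 *)
- : int * int = (8364, 4)
# String.concat "" (List.map utf8_of_int [0x48; 0x20AC]);;
- : string = "H\226\130\172"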

Alain Frisch

unread,
Oct 11, 2007, 5:19:36 PM10/11/07
to skaller, Caml-list ml
skaller wrote:
> In that case, you can use an int Array.t for Unicode provided
> it is only 31 bit OR you have a 64 bit machine. These routines
> should help converting to and from UTF-8:

Unicode code points will always fit in 31 bits, and 21 bits are actually
enough for now (the largest valid code point is 0x10FFFF).

-- Alain

Julien Moutinho

unread,
Oct 15, 2007, 4:34:51 PM10/15/07
to caml...@yquem.inria.fr
On Fri, Oct 12, 2007 at 12:48:16AM +1000, skaller wrote:
> On Thu, 2007-10-11 at 16:21 +0200, Vincent Hanquez wrote:
> > On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote:
> > > You can't: Camomile is massive for a reason.. the problem it
> > > aims to solve is complex and hard to do efficiently without
> > > a large set of specialised functions.
> >
> > You are assuming that I want efficiency, whereas I just want to print a few
> > Unicode strings in a UI here and there. I *DON'T* want to be exposed to
> > full Unicode; I need something like 1/100 of the Camomile library.
>
> In that case, you can use an int Array.t for Unicode provided
> it is only 31 bit OR you have a 64 bit machine. These routines
> should help converting to and from UTF-8:
> [...]

Just in case someone wants to use this parse_utf8:
be aware that, depending on how much you trust your input,
doing so may be strongly discouraged.
Indeed, this code does not comprehensively check for invalid characters.

e.g. for characters with an overlong form [1]:

# let mk = List.fold_left
    (fun acc c -> acc ^ String.make 1 (Char.chr c)) "";;
val mk : int list -> string = <fun>
# let p l = parse_utf8 (mk l) 0;;
val p : int list -> int * int = <fun>

(* Unicode 0 encoded in an overlong UTF-8 form *)
# p [0b11_000000; 0b10_000000];;
- : int * int = (0, 2)

Nor does it check for invalid trailing bytes:

(* Unicode 64 (@) with an invalid trailing byte,
 * which happens to be a zero *)
# p [0b11_000001; 0b00_00000];;
- : int * int = (64, 2)

Besides, a Unicode value now needs only 21 bits,
and therefore a UTF-8 character fits in at most 4 bytes,
not 6 as the code handles.

[1] http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations

skaller

unread,
Oct 15, 2007, 8:00:55 PM10/15/07
to Julien Moutinho, caml...@yquem.inria.fr

On Mon, 2007-10-15 at 22:35 +0200, Julien Moutinho wrote:
> On Fri, Oct 12, 2007 at 12:48:16AM +1000, skaller wrote:
> > On Thu, 2007-10-11 at 16:21 +0200, Vincent Hanquez wrote:
> > > On Thu, Oct 11, 2007 at 11:54:24PM +1000, skaller wrote:
> > > > You can't: Camomile is massive for a reason.. the problem it
> > > > aims to solve is complex and hard to do efficiently without
> > > > a large set of specialised functions.
> > >
> > > You are assuming that I want efficiency, whereas I just want to print a few
> > > Unicode strings in a UI here and there. I *DON'T* want to be exposed to
> > > full Unicode; I need something like 1/100 of the Camomile library.
> >
> > In that case, you can use an int Array.t for Unicode provided
> > it is only 31 bit OR you have a 64 bit machine. These routines
> > should help converting to and from UTF-8:
> > [...]
>
> Just in case someone wants to use this parse_utf8:
> be aware that, depending on how much you trust your input,
> doing so may be strongly discouraged.
> Indeed, this code does not comprehensively check for invalid characters.

That is correct. It is specifically designed NOT to do so.
The last thing you want in 99% of codec use is to abort due
to an error.

Try switching codecs on Firefox.. do you really want to abort
if you have bad input or the wrong codec?

UTF-8 is primarily used for Unicode which is human readable text.
Errors and faults in the text aren't important most of the time.

It has nothing to do with a 'trusted' source. It has to do with
the fact that the human text is an approximation in the first
place.


--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________
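
Concretely, with the parse_utf8 posted earlier in this thread, a byte that
cannot start a sequence is simply handed back as-is and the index advances by
one, so a caller can carry on instead of aborting (a hypothetical toplevel
session):

# parse_utf8 "\xFF" 0;;          (* 0xFF can never be a UTF-8 lead byte *)
- : int * int = (255, 1)
# parse_utf8 "ab\xC3" 2;;        (* 2-byte sequence truncated at end of string *)
- : int * int = (195, 3)

Whether silently passing such bytes through is acceptable is precisely what
the following messages disagree about.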

Julien Moutinho

unread,
Oct 15, 2007, 10:20:53 PM10/15/07
to caml...@inria.fr
On Tue, Oct 16, 2007 at 09:51:16AM +1000, skaller wrote:
> On Mon, 2007-10-15 at 22:35 +0200, Julien Moutinho wrote:
> > Just in case someone wants to use this parse_utf8:
> > be aware that, depending on how much you trust your input,
> > doing so may be strongly discouraged.
> > Indeed, this code does not comprehensively check for invalid characters.
>
> That is correct. It is specifically designed NOT to do so.
No offense meant; it's at your own risk.

> The last thing you want in 99% of codec use is to abort due
> to an error.
>
> Try switching codecs on Firefox.. do you really want to abort
> if you have bad input or the wrong codec?

I would say that whatever Firefox does, I want it to be minimally safe.

> UTF-8 is primarily used for Unicode which is human readable text.
> Errors and faults in the text aren't important most of the time.

Unless they are deliberately inserted by a malicious attacker.
Cf. [1], where a backslash is encoded with the overlong UTF-8 form "\xC1\x9C"
instead of "\x5C", fooling IIS's string search algorithm.

> It has nothing to do with a 'trusted' source. It has to do with
> the fact that the human text is an approximation in the first
> place.

[1] http://www.securityfocus.com/bid/1806/discuss

Julien Moutinho

unread,
Oct 16, 2007, 2:45:48 PM10/16/07
to caml...@inria.fr
Here I have reused some old code of mine to secure and extend J. Skaller's:
unicode_of_utf8 ~ parse_utf8
utf8_of_unicode ~ utf8_of_int
May it help, and may it not be too buggy.

exception Bad_utf8 of string * (string * int * int * int)
(* raised with an error description and its location:
* bytes
* start (0 < start <= String.length bytes)
* size (0 < size <= String.length bytes)
* position (0 <= position <= size) *)
exception Insufficient of int
(* raised when more bytes are needed.
* The absolute value of the integer is the minimal amount of bytes needed.
* A positive sign means that they have to be appended.
* A negative sign means that they have to be prepended. *)

let in_bounds
    ~(size: int)
    ~(pos: int) =
  if size <> 0 then begin
    if pos < 0 then begin
      let i = size - ((- pos) mod size) in
      if i = size then 0 else i
    end else (pos mod size)
  end else 0

let position__char_size__offset
    (bytes: string)
    ?(start = 0)
    ?(size = String.length bytes)
    ~(pos: int) : int * int * int =
  if size <= 0 then (0, 0, 0)
  else begin
    let pos = in_bounds ~size ~pos in
    let char_pos = start + pos in
    let char_start = ref char_pos in
    let on_tail = ref true in
    let loc = (bytes, start, size, pos) in

    (* go backward to find a head *)
    while !on_tail do
      if char_pos - !char_start > 3
      then raise (Bad_utf8 ("cannot find a head nearby", loc))
      else if !char_start < start
      then raise (Insufficient (-1))
      else begin
        let cod = Char.code bytes.[!char_start] in
        if (cod land 0b1100_0000) = 0b1000_0000 (* on a trailing byte *)
        then decr char_start
        else on_tail := false
      end
    done;
    let char_start = !char_start in

    (* decode the head *)
    let head = Char.code bytes.[char_start] in
    let overlong boo =
      (* check for overlong forms (when a character uses more trailing bytes than needed),
       * see http://en.wikipedia.org/wiki/UTF-8#Overlong_forms.2C_invalid_input.2C_and_security_considerations *)
      if boo then raise (Bad_utf8 ("overlong form", loc))
    in
    let may_be_overlong = ref false in
    let char_size = (* get the size of the character *)
      (* 0zzzzzzz -> 0zzzzzzz = 7 bits *)
      if (head land 0b1_0000000) = 0b0_0000000 then 1
      (* 110YYYYy 10zzzzzz -> 00000yyy yyzzzzzz = 11 bits *)
      else if (head land 0b111_00000) = 0b110_00000
      then (overlong ((head land 0b000_11110) = 0); 2)
      (* 1110XXXX 10Yyyyyy 10zzzzzz -> xxxxyyyy yyzzzzzz = 16 bits *)
      else if (head land 0b1111_0000) = 0b1110_0000
      then (may_be_overlong := ((head land 0b0000_1111) = 0); 3)
      (* 11110WWW 10XXxxxx 10yyyyyy 10zzzzzz -> 000wwwxx xxxxyyyy yyzzzzzz = 21 bits *)
      else if (head land 0b1111_1000) = 0b1111_0000
      then (may_be_overlong := ((head land 0b00000_111) = 0); 4)
      (* 4 bytes is the maximum size of a UTF-8 character by now *)
      else raise (Bad_utf8 ("invalid head", loc))
    in

    (* decode the tail *)
    let off = ref (char_start + 1) in
    let t_end = start + size in
    let char_end = char_start + char_size in
    let max_off = min char_end t_end in
    (* check whether the trailing bytes of a character
     * are of the form 0b10_xxxxxx *)
    while !off < max_off do
      let cod = (Char.code bytes.[!off]) in
      if (cod land 0b11_000000) <> 0b10_000000
      then raise (Bad_utf8 ("invalid tail", loc));
      incr off
    done;
    (* complete the overlong check *)
    if max_off >= char_start + 1 (* if there is a second byte *)
    && !may_be_overlong
    then overlong
      ( (char_size = 3
         && ((Char.code bytes.[char_start + 1]) land 0b00_100000) = 0)
     || (char_size = 4
         && ((Char.code bytes.[char_start + 1]) land 0b00_110000) = 0) );
    (* check the tail length *)
    if char_end > t_end
    then raise (Insufficient (char_end - (char_pos + 1)));

    (pos, char_size, char_pos - char_start)
  end

let unicode_of_utf8
    (bytes: string)
    ?(start = 0)
    ?(size = String.length bytes)
    (pos: int) : int * int =
  let pos, char_size, offset =
    position__char_size__offset bytes ~start ~size ~pos in
  let char_start = start + pos - offset in  (* absolute index of the head byte *)
  let unicode =
    match char_size with
    | 1 -> (* 0zzzzzzz -> 0zzzzzzz *)
        Char.code bytes.[char_start]
    | 2 -> (* 110yyyyy 10zzzzzz -> 00000yyy yyzzzzzz *)
        let cod0 = Char.code bytes.[char_start] in
        let cod1 = Char.code bytes.[char_start + 1]
        in ((cod0 land 0b000_11111) lsl 6)
           lor (cod1 land 0b00_111111)
    | 3 -> (* 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyy yyzzzzzz *)
        let cod0 = Char.code bytes.[char_start] in
        let cod1 = Char.code bytes.[char_start + 1] in
        let cod2 = Char.code bytes.[char_start + 2]
        in ((cod0 land 0b0000_1111) lsl 12)
           lor ((cod1 land 0b00_111111) lsl 6)
           lor (cod2 land 0b00_111111)
    | 4 -> (* 11110www 10xxxxxx 10yyyyyy 10zzzzzz -> 000wwwxx xxxxyyyy yyzzzzzz *)
        let cod0 = Char.code bytes.[char_start] in
        let cod1 = Char.code bytes.[char_start + 1] in
        let cod2 = Char.code bytes.[char_start + 2] in
        let cod3 = Char.code bytes.[char_start + 3]
        in ((cod0 land 0b00000_111) lsl 18)
           lor ((cod1 land 0b00_111111) lsl 12)
           lor ((cod2 land 0b00_111111) lsl 6)
           lor (cod3 land 0b00_111111)
    | _ -> assert false
  in
  match unicode with
  | cod when cod >= 0xD800 && cod <= 0xDFFF ->
      (* The definition of UTF-8 prohibits encoding character numbers between
       * U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
       * form (as surrogate pairs) and do not directly represent characters. *)
      raise (Bad_utf8 ("prohibited code point", (bytes, start, size, pos)))
  | cod when cod > 0x10FFFF ->
      raise (Bad_utf8 ("invalid code point", (bytes, start, size, pos)))
  | _ -> (unicode, (char_size - offset))

exception Bad_unicode of string * int
(* raised with an error description and an integer
 * which is either a prohibited or an invalid unicode code point *)

let utf8_of_unicode : int -> string = function
  | cod when cod < 0x00 ->
      (* negative values are not code points *)
      raise (Bad_unicode ("invalid code point", cod))
  | cod when cod <= 0x7F -> (* 0zzzzzzz -> 0zzzzzzz *)
      String.make 1 (Char.chr cod)
  | cod when cod <= 0x07FF -> (* 00000yyy yyzzzzzz -> 110yyyyy 10zzzzzz *)
      let str = String.create 2 in
      str.[0] <- Char.chr (0b110_00000 lor (cod lsr 6));
      str.[1] <- Char.chr (0b10_000000 lor (cod land 0b00_111111));
      str
  | cod when cod >= 0xD800 && cod <= 0xDFFF ->
      (* The definition of UTF-8 prohibits encoding character numbers between
       * U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding
       * form (as surrogate pairs) and do not directly represent characters. *)
      raise (Bad_unicode ("prohibited code point", cod))
  | cod when cod <= 0xFFFF -> (* xxxxyyyy yyzzzzzz -> 1110xxxx 10yyyyyy 10zzzzzz *)
      let str = String.create 3 in
      str.[0] <- Char.chr (0b1110_0000 lor (cod lsr 12));
      str.[1] <- Char.chr (0b10_000000 lor ((cod lsr 6) land 0b00_111111));
      str.[2] <- Char.chr (0b10_000000 lor ( cod        land 0b00_111111));
      str
  | cod when cod <= 0x10FFFF -> (* 000wwwxx xxxxyyyy yyzzzzzz -> 11110www 10xxxxxx 10yyyyyy 10zzzzzz *)
      let str = String.create 4 in
      str.[0] <- Char.chr (0b11110_000 lor ( cod lsr 18));
      str.[1] <- Char.chr (0b10_000000 lor ((cod lsr 12) land 0b00_111111));
      str.[2] <- Char.chr (0b10_000000 lor ((cod lsr 6)  land 0b00_111111));
      str.[3] <- Char.chr (0b10_000000 lor ( cod         land 0b00_111111));
      str
  | cod -> raise (Bad_unicode ("invalid code point", cod))
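
For comparison with the permissive parse_utf8, replaying the earlier
overlong-form example against this stricter pair should give something like
the following (a hypothetical toplevel session, assuming the definitions
above are loaded):

# utf8_of_unicode 0x40;;
- : string = "@"
# unicode_of_utf8 "@" 0;;
- : int * int = (64, 1)
# unicode_of_utf8 "\xC0\x80" 0;;             (* overlong encoding of U+0000 *)
Exception: Bad_utf8 ("overlong form", ("\192\128", 0, 2, 0)).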

Julien Moutinho

unread,
Oct 16, 2007, 2:50:14 PM10/16/07
to caml...@inria.fr
On Tue, Oct 16, 2007 at 08:46:21PM +0200, Julien Moutinho wrote:

> exception Bad_utf8 of string * (string * int * int * int)
> (* raised with an error description and its location:
> * bytes
> * start (0 < start <= String.length bytes)
> * size (0 < size <= String.length bytes)
> * position (0 <= position <= size) *)

Typo in a last-minute comment; it should read:
 * start (0 <= start <= String.length bytes)
 * size (0 <= size <= String.length bytes)
 * position (0 <= position <= size) *)

_______________________________________________

skaller

unread,
Oct 16, 2007, 10:56:51 PM10/16/07
to Julien Moutinho, caml...@inria.fr

On Tue, 2007-10-16 at 20:46 +0200, Julien Moutinho wrote:
> Here I have reused some old code of mine to secure and extend J. Skaller's:
> unicode_of_utf8 ~ parse_utf8
> utf8_of_unicode ~ utf8_of_int
> May it help, and may it not be too buggy.

The UTF-8 to UCS4 string conversion probably needs an option NOT
to throw exceptions. Almost all uses are not well served by
throwing exceptions.

* Exceptions are a very bad idea in the first place :)

* An alternative strategy using a replacement code may be
worth considering, possibly selected by an argument, or
by another technique such as a wrapper function which
continues on after inserting the replacement.

I think you will find that covering all the real uses isn't
so easy .. one reason for a whole framework like Camomile.

In particular, most Standards will say things like "behaviour
is undefined if such and such" for example an invalid code.

This does NOT mean using such a code is an error, it means
the Standard leaves the behaviour open in such cases.
In particular, open to vendor extensions.

It's perfectly legitimate, for example, for an application
to encode colour and font information in the 31-21=10 remaining
bits**, but your codec will throw an exception here, instead
of just translating the codes. What the application is doing
isn't portable -- but that doesn't make it incorrect if the
application has control over the context.

In particular, the application can even define an extension
to the Standard and send such codes over file systems and
networks. Other applications may decide to support the
extensions.

In fact this is how Standards are made: the ISO mantra is that
standards should encode existing practice, and that specifically
implies ALLOWING practice outside the existing Standards.

So my routine wasn't quite so 'home grown' .. I've actually
been a participant in ISO Standardisation processes with a
special National Body interest in I18n issues (since Australia
is highly multi-cultural and has people speaking many languages).
I18n is a real quagmire of complexity... for example there are
no known commercial text rendering routines that actually
comply with the Standard -- bidirectional rendering is
extremely difficult to do efficiently and it isn't clear that
the Standard requirements are all that useful if you happen
to be mixing English, Arabic, and Chinese in the same document.

** there is a real use of the extra bits by some Egyptologists
encoding hieroglyphics.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________
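
One way to get the 'replacement code' behaviour suggested above, without
giving up the stricter checks, is a small wrapper over Julien's
unicode_of_utf8 that substitutes U+FFFD (REPLACEMENT CHARACTER) wherever
decoding fails and then carries on. This is only a sketch of the idea, not
code from the list; the names decode_lossy and replacement, and the
one-replacement-per-bad-byte policy, are choices made for the example:

(* Decode a whole string into a list of code points, replacing every byte
   that does not decode cleanly with U+FFFD and resynchronising on the next
   byte.  Relies on unicode_of_utf8, Bad_utf8 and Insufficient from Julien's
   post above. *)
let replacement = 0xFFFD

let decode_lossy (s : string) : int list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      let code, width =
        try unicode_of_utf8 s i
        with Bad_utf8 (_, _) | Insufficient _ -> (replacement, 1)
      in
      go (i + width) (code :: acc)
  in
  go 0 []

Whether the replacement should be emitted per bad byte or per maximal bad
subsequence, and whether the policy should be an argument as suggested, are
open questions; the point is only that the strict and the forgiving
behaviours can be layered, rather than baked into the codec itself.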
