[Caml-list] ocaml+twt v0.90

Mike Lin

Jan 16, 2007, 3:53:51 PM

to caml...@inria.fr

I just posted a new version of ocaml+twt, a preprocessor that lets you use
indentation to avoid multi-line parenthesization (like Python or Haskell).

http://people.csail.mit.edu/mikelin/ocaml+twt

This version introduces a major backwards-incompatible change: the
eradication of "in" from let expressions, and the need to indent the let
body (as suggested by the F# lightweight syntax). This reduces the
familiar phenomenon of long function bodies getting progressively more
indented as they go along. That is, where you previously had:

let x = 5 in
  printf "%d\n" x
  let y = x+1 in
    printf "%d\n" y

You'd now just write:

let x = 5
printf "%d\n" x
let y = x+1
printf "%d\n" y
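
For comparison, here is a sketch of the same two bindings in plain OCaml, without the preprocessor. The function name `run` and the returned pair are not part of the original example; they are only there to make the snippet self-contained.

```ocaml
(* Plain OCaml: each local binding needs "in", and the statements
   are sequenced explicitly with ";". *)
let run () =
  let x = 5 in
  Printf.printf "%d\n" x;
  let y = x + 1 in
  Printf.printf "%d\n" y;
  (x, y)

let () = ignore (run ())
```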

I was hesitant to introduce this feature because it's extra hackish in
implementation (even more so than the rest of this house of cards). It also
removes some programmer freedom, because you cannot have the let body on the
same line as the let, and you cannot have a statement sequentially following
the let, outside the scope of the binding. But after playing with it, I
think it is worthwhile. Please let me know what you think. I am still not
completely sure that I haven't broken something profound that will force me
to totally backtrack this change, but let's give it a try. I will obviously
keep the 0.8x versions around for those who prefer them and for existing code
(including a lot of my own).

Standard disclaimer: ocaml+twt is a flagrant, stupendous,
borderline-ridiculous hack, but it works quite well, I write all my new code
using it, and I recommend it if you like this style. On the other hand, if
someone with more free time and knowledge of camlp4 wants to step up, I have
a couple ideas about how you might do it right...

Mike

Sebastien Ferre

Jan 17, 2007, 4:19:32 AM

to caml...@inria.fr

Hi,

I get a segmentation fault when marshalling
a large data structure. I could produce a file
of ~30MB, but for a larger data structure of
the same kind, I get a seg fault.

Do you know of any limit in the marshalling
functions w.r.t. size ?

Some parts of my data structure are big doubly linked
graphs.

---
Sébastien Ferré

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Olivier Andrieu

Jan 17, 2007, 4:41:02 AM

to Sébastien Ferre

On 1/17/07, Sebastien Ferre <fe...@irisa.fr> wrote:
> Hi,
>
> I get a segmentation fault when marshalling
> a large data structure. I could produce a file
> of ~30MB, but for a larger data structure of
> the same kind, I get a seg fault.
>
> Do you know of any limit in the marshalling
> functions w.r.t. size ?

Indeed, the marshalling/unmarshalling functions can overflow the
execution stack. You could try to increase maximum stack size for your
process (ulimit -s with a Unix shell).

--
Olivier

Frédéric Gava

Jan 17, 2007, 10:38:11 AM

to Sébastien Ferre, caml...@yquem.inria.fr

Hi,

this comes from the fact that you are going through marshalling, that is,
you are turning your data into a character string. Strings have a
maximum size (see the Sys module for the exact value), hence the seg fault.

In my opinion, try writing your value directly to the file with
output_value, or else use "ocaml xml" to read/write the data
in XML format (it is much slower, but it will certainly get past the
30 MB limit).

Best regards,
Frédéric Gava

Sebastien Ferre wrote:

Sebastien Ferre

Jan 17, 2007, 10:46:22 AM

to caml...@yquem.inria.fr

And yet I do go through a call to output_value
on a file, without going through an intermediate string.

Best regards,
Sebastien

Daniel Bünzli

Jan 17, 2007, 11:16:36 AM

to Sebastien Ferre

On 17 Jan 07, at 16:41, Sebastien Ferre wrote:

> And yet I do go through a call to output_value
> on a file, without going through an intermediate string.

Maybe output_value uses a string internally. Try with a bytecode
version of your executable, an exception should be raised (or have a
look at the implementation of output_value).

Best,

Daniel

Olivier Andrieu

Jan 17, 2007, 11:36:45 AM

to Daniel Bünzli

On 1/17/07, Daniel Bünzli <daniel....@epfl.ch> wrote:
>
> On 17 Jan 07, at 16:41, Sebastien Ferre wrote:
>
> > And yet I do go through a call to output_value
> > on a file, without going through an intermediate string.
>
> Maybe output_value uses a string internally. Try with a bytecode
> version of your executable, an exception should be raised (or have a
> look at the implementation of output_value).

output_value doesn't use a string internally, it uses malloc. Anyway,
if the marshalling function runs out of memory (whether because malloc
returns NULL or because the caml string is too large), an
Out_of_memory exception is raised.

If it segfaults, that's most probably because the marshalling runs out
of executable stack (because of too much recursion). I've seen it do
this before. The "fix" is to increase the maximum size of the
executable stack.

The behavior is the same with bytecode or native code since it's not
the interpreter's stack that overflows, it's the C one.
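
To experiment with how nesting depth relates to this, here is a minimal sketch. The type `chain` and the helper names are invented for illustration. Whether a given depth overflows depends on the OCaml version and on the process stack limit, so no particular outcome is claimed here.

```ocaml
(* A value nested [depth] levels deep through its first field.
   Type and function names are hypothetical, chosen for this sketch. *)
type chain = Nil | Link of chain * int

(* Build the chain with a tail-recursive loop, so that construction
   itself does not consume much stack. *)
let make_chain depth =
  let rec go acc i = if i = 0 then acc else go (Link (acc, i)) (i - 1) in
  go Nil depth

(* Write the value to a temporary file with output_value, read it back. *)
let roundtrip (v : chain) : chain =
  let path = Filename.temp_file "marshal" ".bin" in
  let oc = open_out_bin path in
  output_value oc v;
  close_out oc;
  let ic = open_in_bin path in
  let v' : chain = input_value ic in
  close_in ic;
  Sys.remove path;
  v'

let () =
  let depth = 10_000 in  (* raise this to probe the limit on a given setup *)
  let v = make_chain depth in
  Printf.printf "depth %d round-trips: %b\n" depth (roundtrip v = v)
```

Running it under different `ulimit -s` settings while increasing `depth` gives a rough empirical answer for a particular installation.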

Sebastien Ferre

Jan 17, 2007, 11:39:01 AM

to caml...@yquem.inria.fr

Daniel Bünzli wrote:

>> And yet I do go through a call to output_value
>> on a file, without going through an intermediate string.
>
> Maybe output_value uses a string internally. Try with a bytecode
> version of your executable, an exception should be raised (or have a
> look at the implementation of output_value).

I used a bytecode version.

I checked the code of output_value, and it uses an internal
string. So it won't work.

Anyway, I knew I would have to go for a more serious
solution as soon as the data gets really large. I am thinking of
using something like GDBM.

Thanks for the help.
Sebastien

Jonathan Roewen

Jan 17, 2007, 2:42:26 PM

to Sebastien Ferre

I'm sure one of the marshalling functions uses malloc internally. Have
you tried Marshal.to_channel? That _should_ avoid using ocaml strings.

Jonathan

Yaron Minsky

Jan 17, 2007, 2:54:56 PM

to Sebastien Ferre

Don't quote me on this, but I believe that marshal uses a string in bytecode
with threads, uses straight malloc with bytecode and no threads, and never
uses strings in native code. I'm /very/ unsure about that last one, but I
am pretty confident that in some cases, whether it uses strings depends on
whether threads are involved.

y

On 1/17/07, Sebastien Ferre <fe...@irisa.fr> wrote:
>
>

Markus Mottl

Jan 17, 2007, 5:57:43 PM

to Yaron Minsky

On 1/17/07, Yaron Minsky <ymi...@cs.cornell.edu> wrote:
>
> Don't quote me on this, but I believe that marshal uses a string in
> bytecode with threads, uses straight malloc with bytecode and no threads,
> and never uses strings in native code. I'm /very/ unsure about that last
> one, but I am pretty confident that in some cases, whether it uses strings
> depends on whether threads are involved.
>

I think the question is more along the lines "byte code threads" vs. native
(e.g. POSIX) threads rather than "byte vs. native code". It's true that
byte code threads, which can naturally only be used with byte code, require
an intermediate copy step to OCaml-strings if you want to write to
channels. That's bad on 32bit platforms due to the size limitations on
strings (< 16MB).

I'd recommend using Bigarrays of characters to marshal out data in cases
where OCaml-strings don't suffice. The code for this is extremely simple:

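#include <caml/mlvalues.h>
#include <caml/intext.h>    /* declares caml_output_value_to_malloc */
#include <caml/bigarray.h>

CAMLprim value bigstring_marshal_stub(value v, value v_flags)
{
  char *buf;
  intnat len;
  /* BIGARRAY_MANAGED: the bigarray owns buf and frees it when collected. */
  int alloc_flags = BIGARRAY_UINT8 | BIGARRAY_C_LAYOUT | BIGARRAY_MANAGED;
  /* Marshal v into a freshly malloc'ed buffer; buf and len are outputs. */
  caml_output_value_to_malloc(v, v_flags, &buf, &len);
  return alloc_bigarray(alloc_flags, 1, buf, &len);
}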

The signature of the OCaml-function is:

external marshal : 'a -> Marshal.extern_flags list -> t =
  "bigstring_marshal_stub"

Where type "t" is a bigarray of characters with C-layout.

You can even do without the intermediate copying if you know the maximum
size of the marshalled data and preallocate a bigarray for that. Use
"caml_output_value_to_block" for that purpose. It's defined in
"byterun/extern.c" of the OCaml-distribution.

Regards,
Markus
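
The preallocation idea mentioned above also has a counterpart at the OCaml level: `Marshal.to_buffer` writes into a buffer supplied by the caller. The sketch below uses an ordinary `Bytes.t` rather than a bigarray, so it does not lift the string size limit; it only illustrates the "known maximum size" pattern. The helper name `marshal_into` and the 4096-byte bound are assumptions made for this example.

```ocaml
(* Marshal [v] into a caller-supplied buffer; None if it does not fit.
   Marshal.to_buffer raises Failure when the data exceeds the given
   length, which is turned into an option here. *)
let marshal_into (buf : Bytes.t) (v : 'a) : int option =
  try Some (Marshal.to_buffer buf 0 (Bytes.length buf) v [])
  with Failure _ -> None

let () =
  let buf = Bytes.create 4096 in  (* assumed upper bound for this value *)
  match marshal_into buf [1; 2; 3] with
  | Some n ->
      let (back : int list) = Marshal.from_bytes buf 0 in
      Printf.printf "wrote %d bytes, round-trip ok: %b\n" n (back = [1; 2; 3])
  | None -> print_endline "buffer too small"
```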

--
Markus Mottl http://www.ocaml.info markus...@gmail.com

Sebastien Ferre

Jan 18, 2007, 3:19:59 AM

to caml...@yquem.inria.fr

Olivier Andrieu wrote:
> On 1/17/07, Daniel Bünzli <daniel....@epfl.ch> wrote:
>
>> On 17 Jan 07, at 16:41, Sebastien Ferre wrote:
>>
>> > And yet I do go through a call to output_value
>> > on a file, without going through an intermediate string.
>>
>> Maybe output_value uses a string internally. Try with a bytecode
>> version of your executable, an exception should be raised (or have a
>> look at the implementation of output_value).

> If it segfaults, that's most probably because the marshalling runs out
> of executable stack (because of too much recursion). I've seen it do
> this before. The "fix" is to increase the maximum size of the
> executable stack.

Indeed, you're right.
I could solve the problem by using the 'ulimit -s' command.

> The behavior is the same with bytecode or native code since it's not
> the interpreter's stack that overflows, it's the C one.

I didn't know about this C stack.
How can I estimate the necessary size?
Is it related to the depth of the data structures
being marshalled?

Thanks !

Sébastien

Ingo Bormuth

Jan 23, 2007, 4:50:16 PM

to caml...@yquem.inria.fr, mik...@mit.edu

On 2007-01-16 15:48, Mike Lin wrote:
> This version introduces a major backwards-incompatible change: the
> eradication of "in" from let expressions, and the need to indent the let
> body (as suggested by the F# lightweight syntax).

I downloaded the new version a few days ago and immediately fell in love
with the compact syntax. In my opinion it feels much more natural.
In particular, I found that it took me more effort to convert old
ocaml+twt code (lots of semantically relevant indentation changes) than
it did to convert vanilla ocaml code (essentially s/ *$ in\|;$$//g
plus some optional parentheses removal).

> I was hesitant to introduce this feature because it's extra hackish in
> implementation (even moreso than the rest of this house of cards). It also
> removes some programmer freedom, because you cannot have the let body on the
> same line as the let, and you cannot have a statement sequentially following
> the let, outside the scope of the binding.

A let body beginning in the first line is no problem if you add an
additional semicolon:

let print x y = print_string x ; (* <-- note the semicolon *)
  print_string " "
  print_string y
print "Hello" "World"

If you need a function in private scope you can easily declare and call
it inside a 'let _ =' block:

let x = 5
printf "%d\n" x

let _ =

  let y = x+1
  printf "%d\n" y

printf "no y here"

I ran into some minor problems due to ocaml+twt not recognizing the
object related syntax. As I personally use it only in rare cases, I
ended up with just putting the critical section in one long line.

I suggest implementing the '#light' pragma (as in F#), which would allow
switching indentation awareness on and off on the fly. This would also
enable me to replace all ocaml compilers by wrappers calling ocaml+twt
implicitly. If you want, I can prepare a little patch.

Thanks for your effort -- keep going on

Ingo

--
Ingo Bormuth, voicebox & fax: +49-(0)-12125-10226517
public key 86326EC9, http://ibormuth.efil.de/contact

Ingo Bormuth

Jan 24, 2007, 11:19:54 AM

to Mike Lin

On 2007-01-23 16:22, Mike Lin wrote:
> Do you have any examples of this lying around? Objects are "supposed" to
> work, although I have not tested it in any project of appreciable size. I
> definitely want to fix it where it is broken.

You're right. I isolated the problem to the following piece of code:

let _x = ref 0
_x := 1

ocaml+twt complains: 'syntax error at line 2'

I think you should add a '_' to the regular expression for identifiers
in line 218 of ocaml+twt.ml.
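
As an illustration of the rule in question, here is a hand-written check for OCaml lowercase identifiers that accepts a leading underscore. This is a hypothetical sketch, not the regular expression actually used in ocaml+twt.ml; the function names are made up.

```ocaml
let is_lower c = c >= 'a' && c <= 'z'
let is_upper c = c >= 'A' && c <= 'Z'
let is_digit c = c >= '0' && c <= '9'

(* A lowercase identifier starts with a lowercase letter or '_' and
   continues with letters, digits, '_' or '\''. A lone "_" is the
   wildcard pattern, so it is rejected here. *)
let is_lowercase_ident (s : string) : bool =
  let n = String.length s in
  let rec rest i =
    i >= n
    || ((let c = s.[i] in
         is_lower c || is_upper c || is_digit c || c = '_' || c = '\'')
        && rest (i + 1))
  in
  n > 0 && s <> "_" && (is_lower s.[0] || s.[0] = '_') && rest 1
```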

Sorry for the false alarm about object orientation (in my code I
had 'val __dbg' inside a class definition).

Anyway I'd regard the #light pragma as very desirable.