[Caml-list] mboxlib reloaded ;-)

Oliver Bandel

unread,

Apr 27, 2007, 9:56:26 AM4/27/07

to caml...@inria.fr

Hello,

after two years of doing nothing on it,
I today found my mboxlib, I started to
write in 2005.

I have put the mli-file on the web and
maybe the library itself will follow
during the next time.

Any feedback, questions and suggestions are welcome.

http://me.in-berlin.de/~first/software/libraries/mboxlib/

TIA,
Oliver Bandel

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs

Richard Jones

unread,

Apr 27, 2007, 12:30:37 PM4/27/07

to Oliver Bandel

On Fri, Apr 27, 2007 at 03:54:25PM +0200, Oliver Bandel wrote:
> Hello,
>
> after two years of doing nothing on it,
> I today found my mboxlib, I started to
> write in 2005.
>
> I have put the mli-file on the web and
> maybe the library itself will follow
> during the next time.
>
> Any feedback, questions and suggestions are welcome.
>
> http://me.in-berlin.de/~first/software/libraries/mboxlib/

The source for COCANWIKI[1] contains extensive support for threading
of mail messages, based on JWZ's algorithm:

http://www.jwz.org/doc/threading.html

You are of course welcome to copy this. If there are any license
issues let me know & I can fix them.

I'd also like to point you to another useful JWZ doc:

http://www.jwz.org/doc/mailsum.html

Rich.

[1] http://sandbox.merjis.com/

--
Richard Jones
Red Hat

Oliver Bandel

unread,

Apr 27, 2007, 7:14:13 PM4/27/07

to caml...@inria.fr

Hi,

only a short note, because I tonight will not explore it in detail...

On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote:
> On Fri, Apr 27, 2007 at 03:54:25PM +0200, Oliver Bandel wrote:
> > Hello,
> >
> > after two years of doing nothing on it,
> > I today found my mboxlib, I started to
> > write in 2005.
> >
> > I have put the mli-file on the web and
> > maybe the library itself will follow
> > during the next time.
> >
> > Any feedback, questions and suggestions are welcome.
> >
> > http://me.in-berlin.de/~first/software/libraries/mboxlib/
>
> The source for COCANWIKI[1] contains extensive support for threading
> of mail messages, based on JWZ's algorithm:
>
> http://www.jwz.org/doc/threading.html

Nice... you speak of an optimized algorithm for threading.
I didn't explored your solution nor did I explored your
paper in detail (tomorrow I think I have the time to do it),
but IMHO the best thing for handling message-threads
is to use tries-datastructure with messgae-id's
as identifers (instead of char's, as they are used normally).

So: did you reimplemented the tries-datastructure
as abstraction on message ID's, or did you
made it different?

>
> You are of course welcome to copy this. If there are any license
> issues let me know & I can fix them.
>
> I'd also like to point you to another useful JWZ doc:
>
> http://www.jwz.org/doc/mailsum.html

Well, the same here: tomorrow I can look at itin more detail;
but the problem of fast mbox-usage I today also found out as
a problem, as I first time used a test-mbox of about 100 MB.
Normally I would use some MB's of size, because I think
ths is the normal size; but I had some dscussions on the
berlin Linux user group, and some people were anbnoyed that
mutt needs some seconds to read in mbox-files of about
80 MB's.

So, I then checked my mboxlib and saw that it is quite slow,
compared to what I expected ( expect! I did not tried it
on my development machine because I have nomutt installed there)
and even if native-code smuch faster, it's nevertheless slow...
..so I thought I have to redesign my scanner-stage.
(I use Str-module and ocamnllex mixed together; maybe
using a plain selfwritten OCaml-scanner might be better here).

Ciao,
Oliver

P.S.: 12 seconds for 100 MB seems tobe quite slow...
I very often call the lexer, and that might be done
smarter.
Maybe your pages will show some useful attempts.

skaller

unread,

Apr 27, 2007, 8:55:29 PM4/27/07

to Oliver Bandel

On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote:

> So, I then checked my mboxlib and saw that it is quite slow,
> compared to what I expected ( expect! I did not tried it
> on my development machine because I have nomutt installed there)
> and even if native-code smuch faster, it's nevertheless slow...

> ...so I thought I have to redesign my scanner-stage.

> (I use Str-module and ocamnllex mixed together; maybe
> using a plain selfwritten OCaml-scanner might be better here).

Ocamllex generates very fast scanner: it is using
a very high-tech tagged deterministic finite state automaton
with a driver written in C (so no boxing etc processing
text buffers). I doubt you can hand code anything as
fast as Ocamllex in C, let alone in Ocaml.

You should check the size (number of states) of the generated
lexer. It will run faster with small number of states where
the matrix fits easily in the cache.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

Richard Jones

unread,

Apr 28, 2007, 3:57:48 AM4/28/07

to Oliver Bandel

On Sat, Apr 28, 2007 at 01:12:20AM +0200, Oliver Bandel wrote:
> On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote:
> > The source for COCANWIKI[1] contains extensive support for threading
> > of mail messages, based on JWZ's algorithm:
> >
> > http://www.jwz.org/doc/threading.html
>
> Nice... you speak of an optimized algorithm for threading.
> I didn't explored your solution nor did I explored your
> paper in detail (tomorrow I think I have the time to do it),

I should point out that the algorithm is due to esteemed hacker[1]
Jamie Zawinski (http://jwz.org) who used it in Netscape versions 1
through 3. They got C++/OO group-think disease from Netscape 4 and
above (the rot continues to this day).

Rich.

[1] Perhaps not as esteemed as Xavier L though.

--
Richard Jones
Red Hat

_______________________________________________

Oliver Bandel

unread,

Apr 28, 2007, 6:49:13 AM4/28/07

to caml...@inria.fr

On Sat, Apr 28, 2007 at 10:54:06AM +1000, skaller wrote:
> On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote:
>
> > So, I then checked my mboxlib and saw that it is quite slow,
> > compared to what I expected ( expect! I did not tried it
> > on my development machine because I have nomutt installed there)
> > and even if native-code smuch faster, it's nevertheless slow...
> > ...so I thought I have to redesign my scanner-stage.
> > (I use Str-module and ocamnllex mixed together; maybe
> > using a plain selfwritten OCaml-scanner might be better here).
>
> Ocamllex generates very fast scanner: it is using
> a very high-tech tagged deterministic finite state automaton
> with a driver written in C (so no boxing etc processing
> text buffers). I doubt you can hand code anything as
> fast as Ocamllex in C, let alone in Ocaml.

I know that ocamllexis fast.

But I call ocamllex many many times from my
own functions, and this maybe could be done
more elegant / with less calls toocamllex,
or maybe I should not lex directly from the channel
and better read in a bigger chunk of data
into memory and then lex on that.
Or maybe I should first scan the whole header and
then the body for each mail, and only afterwards
scan again the header into seperated lines,
when it is already in the RAM.

>
> You should check the size (number of states) of the generated
> lexer.

How?

> It will run faster with small number of states where
> the matrix fits easily in the cache.

I think that tehere are not so much states, but so many calls.

And maybe creating a list of header-entreies is faster than
creating strings with buffer module, because I always call
Buffer.add_string and so on and so on, instead of puttng
the line onto alist.

For the about 100MB mbox there are 2.5 * 10^6 calls to
to Buffer.add_string for the header and 1.6 * 10^6 calls
to Buffer.add_string for the body, 2.6*10^6 calls to the
function lexing.engine, ...

I better should not read linewise, it seems.

And there are maybe other problems, why it might be slow.
I let the lexer read in linewise and count the line-number.
That is, because I throw an exception, when I detect a
broken mbox file (when a mbox-file ends in the middle
of a header).

So maybe I do too much and to often.
I think there are tooo many calls, not too much
states of the lexer.

(But you could argue that both things are closely related).

Ciao,
Oliver

Oliver Bandel

unread,

Apr 28, 2007, 6:59:58 AM4/28/07

to caml...@inria.fr

On Sat, Apr 28, 2007 at 08:56:28AM +0100, Richard Jones wrote:
> On Sat, Apr 28, 2007 at 01:12:20AM +0200, Oliver Bandel wrote:
> > On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote:
> > > The source for COCANWIKI[1] contains extensive support for threading
> > > of mail messages, based on JWZ's algorithm:
> > >
> > > http://www.jwz.org/doc/threading.html
> >
> > Nice... you speak of an optimized algorithm for threading.
> > I didn't explored your solution nor did I explored your
> > paper in detail (tomorrow I think I have the time to do it),
>
> I should point out that the algorithm is due to esteemed hacker[1]
> Jamie Zawinski (http://jwz.org) who used it in Netscape versions 1
> through 3. They got C++/OO group-think disease from Netscape 4 and
> above (the rot continues to this day).

When I think about netscape, I think about the terrible way,
how it handles newsgroups....
..when doing a resort of the newsgroup-entries it's extremely slow.

Maybe it again checks data on the newsserver again and again.

And when I think about Netscape, I se that it reloads data via network,
when you want to save it, even if it' already on the screen.

So, theese things are the reason, why I don't see why this algorithm
is something special, until I have seen that it really is fast. ;-)

I have thought a while about the threading, and thought I have some good ideas.
A while later, I saw, there already is a datastructure that is useful
for threading, and it is the Tries-datastructure:
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Tree/Trie/

If you use the message ID's instead of the char's of a word,
then this datastructure is well suited for news-/mail-threading, I think.

So, other people already had done, what can be used here.

But maybe thze algorithm you show on your pages is really good.

But I would trust your algorithms more than algorithms of
netscape-programmers ;-)
But maybe it's really good and only the overall-design of Netscape was bad. ;-)

Ciao,
Oliver

Gabriel Kerneis

unread,

Apr 28, 2007, 7:26:50 AM4/28/07

to caml...@yquem.inria.fr

Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel
<oli...@first.in-berlin.de> a écrit :

> > You should check the size (number of states) of the generated
> > lexer.
>
> How?

It's printed out by ocamllex when you run it on you .mll file.
Regards,
--
Gabriel

signature.asc

Oliver Bandel

unread,

Apr 28, 2007, 7:45:59 AM4/28/07

to caml...@yquem.inria.fr

Ah, ok. :)

18 states, 261 transitions, table size 1152 bytes.

Does not loooks very huge ;-)

skaller

unread,

Apr 28, 2007, 9:51:36 AM4/28/07

to Oliver Bandel

On Sat, 2007-04-28 at 13:44 +0200, Oliver Bandel wrote:
> On Sat, Apr 28, 2007 at 12:54:53PM +0200, Gabriel Kerneis wrote:
> > Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel
> > <oli...@first.in-berlin.de> a écrit :
> > > > You should check the size (number of states) of the generated
> > > > lexer.
> > >
> > > How?
> >
> > It's printed out by ocamllex when you run it on you .mll file.
> > Regards,
>
> Ah, ok. :)
>
>
> 18 states, 261 transitions, table size 1152 bytes.
>
> Does not loooks very huge ;-)

Lol, no it is tiny. You are probably right, too many calls,
and too much copying data around. AFAIK Ocaml channels also
add an extra buffer layer (is that right?) so there's even
more copying.

Still, although Ocaml may generate more code than C,
if your code is reasonably tight it should be cached
and be fast: function calls are actually quite cheap.

Here's an idea: you said:

"For the about 100MB mbox there are 2.5 * 10^6 calls to
to Buffer.add_string for the header and 1.6 * 10^6 calls
to Buffer.add_string for the body, 2.6*10^6 calls to the
function lexing.engine, ..."

How about NOT storing the body text. Instead, just store
the integer file offset of the first byte and the length?
Not sure what you application is doing;
perhaps that would work for you?

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________

Oliver Bandel

unread,

Apr 28, 2007, 10:19:53 AM4/28/07

to caml...@yquem.inria.fr

[...]

This is what I have foreseen for different kinds of
application. In the parts I have implemented already,
I use completely reading of the data, because
this would be the most general way of handling.
I will be able to do fulltext-research and possibly
very complex analysis of the mailtext, and this is the reason
why I read all things into memory.

To use file-oiffsets would make sense, when the main
functionality would be that of a mailreader, where
you might look into the one or the other mail.
Then reading from disk is not a problem.

But when you want to be able to do a lot of different things,
then always again and again reading from disk might
be much slower than once reading data into memory
and then working on it.

But maybe I provide such a functionality also,
and only read to memory, when I need it.

The main intend of the lib (ys, a library, not a certain
application) was to have a tool at hand, that makes
complex things achievable easy.

Example:

=========================================================
open Mbox

let filename = "./MB1"

let _ =
print_endline "OK, let's start!";
let sm = read filename in
let mymatch = match_header_regexp ".*richard" NoCase
in
let result = filter_rename mymatch sm "./MB1.ready"
in
write_force result;

print_endline "OK, ready! :)"

=========================================================

Opens the mbox-file "MB1" and writes
all mails with the "richard"-string in the header
to the mbox-file "MB1.ready".

It could also be used for spam-detection,
word-counts, textanalysis, automatic rearranging of
mbox-files, throwing away multiple mails,
or saving the same mail in multiple folders,
if they belong to more than one theme.
Or doing complex data analysis on the mails.
E.g. for statistical reasons of text-analysis,
maybe similarity-calculations of texts, ...

ocamlc: 31.4 seconds
ocamopt: 7.7 seconds

Doing an exit after reading-stage:
ocamlc: 7.0 seconds
ocamlopt: 2.5 seconds

So, the pattern-matching (using Str-module)
takes much more time than the reading-/scanning.

It's about 100 MB mbox with mostly short mails.
(I could provide more statistical infos, using this lib :))

Doing a
$ time cat MB1 > MB2
needs 0.2 seconds

$ time grep -i richard MB1 > MB2
needs 0.148 seconds.

OK, such flat-file applications are always faster,
as they do not read the data to memory and they do no
checks on the validity of the files (no exception on
broken mbox-files).

But maybe it could be done better.

Parsing-stage:
Str-module not seems to be the fastest, and ocamllex
creats ml-files that first have to be compiled....

Is there no certain library in OCaml, which can be used?
Or do it have to be developed?

I think on such a kind of thing:
http://swtch.com/~rsc/regexp/regexp1.html

A thing like a runtime-ocamllex, that creates datastructures
instead of ocaml-code, would make sense, IMHO.

IS No such thing available already?

Ciao,
Oliver

Richard Jones

unread,

Apr 29, 2007, 6:47:03 AM4/29/07

to Oliver Bandel

On Sat, Apr 28, 2007 at 04:18:12PM +0200, Oliver Bandel wrote:
> The main intend of the lib (ys, a library, not a certain
> application) was to have a tool at hand, that makes
> complex things achievable easy.

[...]

> let mymatch = match_header_regexp ".*richard" NoCase

[...]

> Opens the mbox-file "MB1" and writes
> all mails with the "richard"-string in the header
> to the mbox-file "MB1.ready".

There's a whole lots of issues that this doesn't cover. For example,
what about internationalised headers using the (rather crazy) syntax
defined in RFC 2047?

>From the RFC:

From: =?US-ASCII?Q?Keith_Moore?= <mo...@cs.utk.edu>
To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <ke...@dkuug.dk>
CC: =?ISO-8859-1?Q?Andr=E9?= Pirard <PIR...@vm1.ulg.ac.be>
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

Rich.

--
Richard Jones
Red Hat

_______________________________________________

Oliver Bandel

unread,

Apr 29, 2007, 11:43:04 AM4/29/07

to caml...@yquem.inria.fr

On Sun, Apr 29, 2007 at 11:45:11AM +0100, Richard Jones wrote:
> On Sat, Apr 28, 2007 at 04:18:12PM +0200, Oliver Bandel wrote:
> > The main intend of the lib (ys, a library, not a certain
> > application) was to have a tool at hand, that makes
> > complex things achievable easy.
> [...]
> > let mymatch = match_header_regexp ".*richard" NoCase
> [...]
> > Opens the mbox-file "MB1" and writes
> > all mails with the "richard"-string in the header
> > to the mbox-file "MB1.ready".
>
> There's a whole lots of issues that this doesn't cover. For example,
> what about internationalised headers using the (rather crazy) syntax
> defined in RFC 2047?

Aha, ok, you say there must be UTF-8 compatibility....
..well I have not thought about this... well...
..good hint!

..any bindings for UTF-8-libs for OCaml avialable somewhere?
Or other necessary things?

Maybe using already available C-libs and only add an
OCaml-binding might make sense?

Any idea on that?

Ciao,
Oliver

Oliver Bandel

unread,

Apr 29, 2007, 11:44:59 AM4/29/07

to caml...@inria.fr

On Sun, Apr 29, 2007 at 11:39:11AM +0100, Richard Jones wrote:

> On Sat, Apr 28, 2007 at 12:58:16PM +0200, Oliver Bandel wrote:
> > On Sat, Apr 28, 2007 at 08:56:28AM +0100, Richard Jones wrote:
> > > On Sat, Apr 28, 2007 at 01:12:20AM +0200, Oliver Bandel wrote:
> > > > On Fri, Apr 27, 2007 at 05:29:11PM +0100, Richard Jones wrote:
> > > > > The source for COCANWIKI[1] contains extensive support for threading
> > > > > of mail messages, based on JWZ's algorithm:
> > > > >
> > > > > http://www.jwz.org/doc/threading.html
> > > >
> > > > Nice... you speak of an optimized algorithm for threading.
> > > > I didn't explored your solution nor did I explored your
> > > > paper in detail (tomorrow I think I have the time to do it),
> > >
> > > I should point out that the algorithm is due to esteemed hacker[1]
> > > Jamie Zawinski (http://jwz.org) who used it in Netscape versions 1
> > > through 3. They got C++/OO group-think disease from Netscape 4 and
> > > above (the rot continues to this day).
> >
> > When I think about netscape, I think about the terrible way,
> > how it handles newsgroups....

> > ...when doing a resort of the newsgroup-entries it's extremely slow.
>
> That's my point - Netscape <= 3 was very fast.
>

Ah, oh,OK, I see.... you wrote versions 1 through 3 were fast,
and these versions used the algorithm you talked about...
right?

Robert Roessler

unread,

Apr 29, 2007, 2:59:14 PM4/29/07

to Caml-list

Oliver Bandel wrote:
> ...

> Aha, ok, you say there must be UTF-8 compatibility....

> ...well I have not thought about this... well...
> ...good hint!
>
>
> ...any bindings for UTF-8-libs for OCaml avialable somewhere?
> Or other necessary things?

There is always the [LGPL]

http://camomile.sourceforge.net/

"Camomile is a comprehensive Unicode library for objective caml (a. k.
a. OCaml or O'Caml) language. Camomile provides Unicode character
type, UTF-8, UTF-16, UTF-32 strings, conversion to/from about 200
encodings, collation and locale-sensitive case mappings, and more."

Robert Roessler
roes...@rftp.com
http://www.rftp.com

Oliver Bandel

unread,

May 1, 2007, 7:02:13 AM5/1/07

to Caml-list

On Sun, Apr 29, 2007 at 11:51:00AM -0700, Robert Roessler wrote:
> Oliver Bandel wrote:
> >...
> >Aha, ok, you say there must be UTF-8 compatibility....
> >...well I have not thought about this... well...
> >...good hint!
> >
> >
> >...any bindings for UTF-8-libs for OCaml avialable somewhere?
> >Or other necessary things?
>
> There is always the [LGPL]
>
> http://camomile.sourceforge.net/
>
> "Camomile is a comprehensive Unicode library for objective caml (a. k.
> a. OCaml or O'Caml) language. Camomile provides Unicode character
> type, UTF-8, UTF-16, UTF-32 strings, conversion to/from about 200
> encodings, collation and locale-sensitive case mappings, and more."

Ah, fine. :)

But not so fine is:

============================================================
[...]
public/uText.mli:50:38: missing terminating ' character
public/uText.mli:51:35: missing terminating ' character
public/uText.mli:67:29: missing terminating ' character
In file included from public/main.mlp:46:
public/xString.mli:33:29: missing terminating ' character
make: *** [public/main.ml] Error 1
rm internal/uReStrLexer.ml internal/uReStrParser.ml
first:~/Desktop/Camino-DOWNLOADS/camomile-0.7.1 oliver$
============================================================

I use ocaml 3.09.3.

Ciao,
Oliver

Oliver Bandel

unread,

May 1, 2007, 7:15:44 AM5/1/07

to caml...@inria.fr

On Sun, Apr 29, 2007 at 11:45:11AM +0100, Richard Jones wrote:
> On Sat, Apr 28, 2007 at 04:18:12PM +0200, Oliver Bandel wrote:
> > The main intend of the lib (ys, a library, not a certain
> > application) was to have a tool at hand, that makes
> > complex things achievable easy.
> [...]
> > let mymatch = match_header_regexp ".*richard" NoCase
> [...]
> > Opens the mbox-file "MB1" and writes
> > all mails with the "richard"-string in the header
> > to the mbox-file "MB1.ready".
>
> There's a whole lots of issues that this doesn't cover. For example,

I want to provide an nterface, that the above things will be possible.

> what about internationalised headers using the (rather crazy) syntax
> defined in RFC 2047?

This should be done transparently.

So, the above should work,
but I also had to add the things you mentioned (nternationalized headers).

Ciao,
Oliver

Bruno De Fraine

unread,

Sep 24, 2007, 2:23:16 PM9/24/07

to Caml-list ml, Oliver Bandel

Hello,

On 28 Apr 2007, at 01:12, Oliver Bandel wrote:
> So, I then checked my mboxlib and saw that it is quite slow,
> compared to what I expected ( expect! I did not tried it
> on my development machine because I have nomutt installed there)
> and even if native-code smuch faster, it's nevertheless slow...

> ...so I thought I have to redesign my scanner-stage.

> (I use Str-module and ocamnllex mixed together; maybe
> using a plain selfwritten OCaml-scanner might be better here).

I don't know if Oliver ever got to the bottom of this speed problem,
but, I also noticed ocamllex can be quite slow for simple scanning.
For example, I used this ocamllex source:

{ }
rule translate = parse
| "current_directory" { print_endline (Sys.getcwd ()); translate
lexbuf }
| _ { translate lexbuf }
| eof { () }
{
for i = 1 to (Array.length Sys.argv - 1); do
translate (Lexing.from_channel (open_in Sys.argv.(i)))
done ;;
}

And compared it against this version using the Str module:

let re = Str.regexp_string "current_directory" ;;
for i = 1 to (Array.length Sys.argv - 1); do
let ch = open_in Sys.argv.(i) in
try
while true; do
let line = input_line ch in
try
let _ = Str.search_forward re line 0 in
print_endline (Sys.getcwd ())
with Not_found -> ()
done
with End_of_file -> close_in ch
done ;;

Neither version does anything useful, except print the current
directory when it encounters the string "current_directory". I tested
this on a 57M text file (that has only a few "current_directory"
occurrences). The ocamllex-version takes about 3.5s, while the Str-
version takes only 0.35s. What causes this difference? Perhaps there
is a high overhead in calling the translate function for every input
character in such big input files, but I don't know how this can be
avoided.

Thanks,
Bruno

skaller

unread,

Sep 24, 2007, 6:08:24 PM9/24/07

to Bruno De Fraine, Oliver Bandel, Caml-list ml

On Mon, 2007-09-24 at 20:22 +0200, Bruno De Fraine wrote:

> { }
> rule translate = parse
> | "current_directory" { print_endline (Sys.getcwd ()); translate
> lexbuf }
> | _ { translate lexbuf }
> | eof { () }
> {
> for i = 1 to (Array.length Sys.argv - 1); do
> translate (Lexing.from_channel (open_in Sys.argv.(i)))
> done ;;
> }
>

> What causes this difference? Perhaps there
> is a high overhead in calling the translate function for every input
> character in such big input files, but I don't know how this can be
> avoided.

Easily. Rewrite the lexer to

| [a-zA-z_]+ { if lexeme = "current_directory" ... }
| _ { translate lexbuf }

Then it will eat whole words instead of characters,
which should make it about 5 times faster.

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net

_______________________________________________

Alain Frisch

unread,

Sep 24, 2007, 7:13:30 PM9/24/07

to Bruno De Fraine, Caml-list ml

Bruno De Fraine wrote:
> Neither version does anything useful, except print the current directory
> when it encounters the string "current_directory". I tested this on a
> 57M text file (that has only a few "current_directory" occurrences). The

> ocamllex-version takes about 3.5s, while the Str-version takes only

> 0.35s. What causes this difference? Perhaps there is a high overhead in
> calling the translate function for every input character in such big
> input files, but I don't know how this can be avoided.

Did you try to compile the same programs with the native compiler?

-- Alain

Bruno De Fraine

unread,

Sep 25, 2007, 4:55:13 AM9/25/07

to Alain Frisch, Caml-list ml

On 24 Sep 2007, at 21:54, Alain Frisch wrote:

> Bruno De Fraine wrote:
>> Neither version does anything useful, except print the current
>> directory when it encounters the string "current_directory". I
>> tested this on a 57M text file (that has only a few
>> "current_directory" occurrences). The ocamllex-version takes about
>> 3.5s, while the Str-version takes only 0.35s. What causes this
>> difference? Perhaps there is a high overhead in calling the
>> translate function for every input character in such big input
>> files, but I don't know how this can be avoided.
>
> Did you try to compile the same programs with the native compiler?

Yes, the timings were for the ocamlopt compiler. Following skaller's
suggestion, I also tried:

rule translate = parse
| [^ ' ' '\n' '\t']+ as word
{ if word = "current_directory" then print_endline (Sys.getcwd ());

translate lexbuf }
| _ { translate lexbuf }
| eof { () }

This brings it down to 1.87s. Of course, only occurrences of
"current_directory" that are white-space separated are picked up.

Regards,
Bruno

Chris King

unread,

Sep 27, 2007, 1:26:41 AM9/27/07

to Bruno De Fraine, Oliver Bandel, Caml-list ml

On 9/24/07, Bruno De Fraine <Bruno.D...@vub.ac.be> wrote:
> I don't know if Oliver ever got to the bottom of this speed problem,
> but, I also noticed ocamllex can be quite slow for simple scanning.

FWIW, regexps in the Str module are interpreted in C.

- Chris