I wish to see if a short (<= 10 character) string is an
exact match for any one of a fairly long list of similarly
short strings. On my sample list of a few dozen strings,
I got the following:
Method 1: about 20us per iteration
set RE [join $list_of_words |]; # just once, of course!
time {regexp $RE $specimen} 10000
Method 2: about 3us
time {lsearch -exact $list_of_words $specimen} 10000
Method 3: about 2us
foreach w $list_of_words {set ::words($w) {} }
time {info exists words($specimen)} 10000
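For completeness, a fourth variant using a dict (Tcl 8.5+) — a sketch with a made-up word list standing in for the real one, since dicts hash the same way arrays do:

```tcl
# Sketch only: sample data stands in for the real word list.
set list_of_words {alpha beta gamma delta}
set specimen gamma

set d [dict create]
foreach w $list_of_words { dict set d $w {} }

time {dict exists $d $specimen} 10000
puts [dict exists $d $specimen]   ;# 1 when the word is in the list
```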
So, perhaps not surprisingly, the hash lookup is fastest,
but maybe it's rather extravagant for a long list?
As I say, I'd be interested to hear opinions on how these
would scale as the list gets bigger (many thousands of items),
and whether there's a better way that I'm too dim to think of.
I'm familiar with many of the classical search techniques and
related data structures, but it seems a bit silly to code them
myself when so much of this is already built in to Tcl.
Thanks
--
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services
Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan...@MYCOMPANY.com
http://www.MYCOMPANY.com
The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Not really, but there are more options you could have a go at. You
could also try comparing with [lsearch -sorted], which does a binary
search of the search-space (though of course you'll need to sort the
list of words first), and with dictionaries. Regular expressions
aren't fast; they're powerful instead.
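A minimal sketch of the sorted binary search (hypothetical word list; the list must be sorted with the same ordering [lsearch -sorted] assumes, which is the [lsort] default):

```tcl
set list_of_words {delta alpha gamma beta}
set sorted [lsort $list_of_words]      ;# -> alpha beta delta gamma

# -sorted switches lsearch to a binary search over the sorted list.
puts [lsearch -sorted $sorted gamma]   ;# 3 (its index), or -1 if absent
puts [lsearch -sorted $sorted omega]   ;# -1
```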
Donal.
>could also try comparing with [lsearch -sorted] which does a binary
>search of the search-space (though of course you'll need to sort the
>list of words first)
Yes, that's excellent; thanks for the tip.
O(log N) good, O(N) bad...
>Regular expressions aren't fast; they're powerful instead.
Fast enough for many things, but not this time.
Thanks again.
You could try a trie (http://wiki.tcl.tk/17669), but I doubt it'd beat
the performance of a hash lookup.
-- Neil
>You could try a trie (http://wiki.tcl.tk/17669), but I doubt it'd beat
>the performance of a hash lookup.
Sure. In fact Donal's suggestion of [lsearch -sorted] is plenty
good enough for the application I have, and it's easy, so I'm done :-)
Thanks to you both for the suggestions.
For what it's worth, you asked about how they scale.
regexp - For *this* particular case, I expect O(M), where M is the
length of the input string (not the word list). If, as you say,
the length of the longest word is bounded, then O(1). But the
constant factor is pretty big. Incidentally, you needed ^ and
$ framing the RE, or you'd be looking for whether the specimen
*contains* one of the target words, and that goes up to O(N).
lsearch - O(N).
lsearch -sorted - O(log N).
[info exists] - O(1). (Again, assuming that the length of the longest
word is bounded.)
[dict exists] - O(1). (Same assumption.)
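Kevin's point about anchoring can be sketched like this (toy word list; note the parentheses, without which ^ and $ would bind to only the first and last alternatives):

```tcl
set list_of_words {cat cart car}
set RE ^([join $list_of_words |])$   ;# ^(cat|cart|car)$

puts [regexp $RE car]     ;# 1: exact match
puts [regexp $RE carts]   ;# 0: unanchored it would match the "cart" inside
```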
--
73 de ke9tv/2, Kevin
> For what it's worth, you asked about how they scale.
>
> regexp - For *this* particular case, I expect O(M), where M is the
> length of the input string (not the word list). If, as you say,
> the length of the longest word is bounded, then O(1). But the
> constant factor is pretty big. Incidentally, you needed ^ and
> $ framing the RE, or you'd be looking for whether the specimen
> *contains* one of the target words, and that goes up to O(N).
>
> lsearch - O(N).
>
> lsearch -sorted - O(log N).
>
> [info exists] - O(1). (Again, assuming that the length of the longest
> word is bounded.)
>
> [dict exists] - O(1). (Same assumption.)
>
> --
> 73 de ke9tv/2, Kevin
Arrays now have a statistics subcommand. Here's the output for my
English dictionary, one word per array element, over 87k words. Not a
bad average search length :)
% array stat rdict
87271 entries in table, 65536 buckets
number of buckets with 0 entries: 17585
number of buckets with 1 entries: 22881
number of buckets with 2 entries: 15042
number of buckets with 3 entries: 6862
number of buckets with 4 entries: 2333
number of buckets with 5 entries: 646
number of buckets with 6 entries: 156
number of buckets with 7 entries: 26
number of buckets with 8 entries: 5
number of buckets with 9 entries: 0
number of buckets with 10 or more entries: 0
average search distance for entry: 1.7
Converting to a list and sorting it takes some time, however:
time {lsort [array names rdict]} 100
394,456 microseconds per iteration
Here's a wildcard lookup that is fast enough to use interactively (it
can populate a listbox with all matches as I type in a word). I ran
this on an average-speed laptop. Here are the 244 words found
beginning with "the":
% llength [array names rdict the*]
244
% time {llength [array names rdict the*]} 1000
64,432 microseconds per iteration
In Tcl you can't beat a hash lookup, because any trie written in Tcl
would itself be built on top of another hash (an array or dict) for
its implementation.
AOLserver uses a variation of a trie that splits on fixed boundaries
(the directory structure), whereas a classic trie branches on every
character.
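To illustrate (a pure-Tcl sketch with invented helper names, not the AOLserver code): a character-at-a-time trie can be built from nested dicts, where each level is itself a hash lookup — which is exactly why it can't beat one flat lookup:

```tcl
# Build: one nested dict level per character; "" marks end-of-word.
proc trie-add {trieVar word} {
    upvar 1 $trieVar trie
    set path [split $word ""]
    lappend path ""              ;# terminal marker
    dict set trie {*}$path leaf
}

proc trie-has {trie word} {
    set path [split $word ""]
    lappend path ""
    dict exists $trie {*}$path
}

set trie [dict create]
foreach w {car cart cat} { trie-add trie $w }
puts [trie-has $trie car]   ;# 1
puts [trie-has $trie ca]    ;# 0 (a prefix only, not a stored word)
```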
The best performance, probably equal to or better than a hash, comes
from a critical-bit (crit-bit) tree. Unfortunately you can't do this
in Tcl; you have to do it in C and provide an interface to Tcl.
I recently wrote a Tcl equivalent of the AOLserver version of a trie,
I'd like to generalize it but if anyone is interested in the current
code, let me know.