I wish to see if a short (<= 10 character) string is an
exact match for any one of a fairly long list of similarly
short strings. On my sample list of a few dozen strings,
I got the following:
Method 1: about 20us per iteration
set RE [join $list_of_words |]; # just once, of course!
time {regexp $RE $specimen} 10000
Method 2: about 3us
time {lsearch -exact $list_of_words $specimen} 10000
Method 3: about 2us
foreach w $list_of_words {set ::words($w) {} }
time {info exists words($specimen)} 10000
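For completeness, a fourth variant using a dict (Tcl 8.5+) — a sketch with a made-up word list standing in for the real one, since dicts hash the same way arrays do:

```tcl
# Sketch only: sample data stands in for the real word list.
set list_of_words {alpha beta gamma delta}
set specimen gamma

set d [dict create]
foreach w $list_of_words { dict set d $w {} }

time {dict exists $d $specimen} 10000
puts [dict exists $d $specimen]   ;# 1 when the word is in the list
```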
So, perhaps not surprisingly, the hash lookup is fastest,
but maybe it's rather extravagant for a long list?
As I say, I'd be interested to hear opinions on how these
would scale as the list gets bigger (many thousands of items),
and whether there's a better way that I'm too dim to think of.
I'm familiar with many of the classical search techniques and
related data structures, but it seems a bit silly to code them
myself when so much of this is already built in to Tcl.
Thanks
--
Jonathan Bromley, Consultant
DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services
Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan...@MYCOMPANY.com
http://www.MYCOMPANY.com
The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
Not really, but there are more options you could have a go at. You
could also try comparing with [lsearch -sorted], which does a binary
search of the search-space (though of course you'll need to sort the
list of words first), and with dictionaries. Regular expressions
aren't fast; they're powerful instead.
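A minimal sketch of the sorted binary search (hypothetical word list; the list must be sorted with the same ordering [lsearch -sorted] assumes, which is the [lsort] default):

```tcl
set list_of_words {delta alpha gamma beta}
set sorted [lsort $list_of_words]      ;# -> alpha beta delta gamma

# -sorted switches lsearch to a binary search over the sorted list.
puts [lsearch -sorted $sorted gamma]   ;# 3 (its index), or -1 if absent
puts [lsearch -sorted $sorted omega]   ;# -1
```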
Donal.
>could also try comparing with [lsearch -sorted] which does a binary
>search of the search-space (though of course you'll need to sort the
>list of words first)
Yes, that's excellent; thanks for the tip.
O(log N) good, O(N) bad...
>Regular expressions aren't fast; they're powerful instead.
Fast enough for many things, but not this time.
Thanks again.
You could try a trie (http://wiki.tcl.tk/17669), but I doubt it'd beat
the performance of a hash lookup.
-- Neil
>You could try a trie (http://wiki.tcl.tk/17669), but I doubt it'd beat
>the performance of a hash lookup.
Sure. In fact Donal's suggestion of [lsearch -sorted] is plenty
good enough for the application I have, and it's easy, so I'm done :-)
Thanks to you both for the suggestions.
For what it's worth, you asked about how they scale.
regexp - For *this* particular case, I expect O(M), where M is the
length of the input string (not the word list). If, as you say,
the length of the longest word is bounded, then O(1). But the
constant factor is pretty big. Incidentally, you needed ^ and
$ framing the RE, or you'd be looking for whether the specimen
*contains* one of the target words, and that goes up to O(N).
lsearch - O(N).
lsearch -sorted - O(log N).
[info exists] - O(1). (Again, assuming that the length of the longest
word is bounded.)
[dict exists] - O(1). (Same assumption.)
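Kevin's point about anchoring can be sketched like this (toy word list; note the parentheses, without which ^ and $ would bind to only the first and last alternatives):

```tcl
set list_of_words {cat cart car}
set RE ^([join $list_of_words |])$   ;# ^(cat|cart|car)$

puts [regexp $RE car]     ;# 1: exact match
puts [regexp $RE carts]   ;# 0: unanchored it would match the "cart" inside
```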
--
73 de ke9tv/2, Kevin
> For what it's worth, you asked about how they scale.
>
> regexp - For *this* particular case, I expect O(M), where M is the
> length of the input string (not the word list). If, as you say,
> the length of the longest word is bounded, then O(1). But the
> constant factor is pretty big. Incidentally, you needed ^ and
> $ framing the RE, or you'd be looking for whether the specimen
> *contains* one of the target words, and that goes up to O(N).
>
> lsearch - O(N).
>
> lsearch -sorted - O(log N).
>
> [info exists] - O(1). (Again, assuming that the length of the longest
> word is bounded.)
>
> [dict exists] - O(1). (Same assumption.)
>
> --
> 73 de ke9tv/2, Kevin
Arrays now have a statistics subcommand. Here's the output for my
English dictionary, one word per array element, over 87k words. Not a
bad average search length :)
% array stat rdict
87271 entries in table, 65536 buckets
number of buckets with 0 entries: 17585
number of buckets with 1 entries: 22881
number of buckets with 2 entries: 15042
number of buckets with 3 entries: 6862
number of buckets with 4 entries: 2333
number of buckets with 5 entries: 646
number of buckets with 6 entries: 156
number of buckets with 7 entries: 26
number of buckets with 8 entries: 5
number of buckets with 9 entries: 0
number of buckets with 10 or more entries: 0
average search distance for entry: 1.7
Converting to a list and sorting it takes some time, however:
time {lsort [array names rdict]} 100
394,456 microseconds per iteration
Here's a wildcard lookup that is fast enough to use interactively (it
can populate a listbox with all matches as I type in a word). I ran
this on an average-speed laptop. Here are the 244 words found
beginning with "the":
% llength [array names rdict the*]
244
% time {llength [array names rdict the*]} 1000
64,432 microseconds per iteration
In Tcl you can't beat a hash lookup, because any trie written in Tcl
would itself be built on top of another hash (an array or dict) for
its implementation.
AOLserver uses a variation of a trie that splits on fixed boundaries
(the directory structure), whereas a classic trie branches on every
character.
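To illustrate (a pure-Tcl sketch with invented helper names, not the AOLserver code): a character-at-a-time trie can be built from nested dicts, where each level is itself a hash lookup — which is exactly why it can't beat one flat lookup:

```tcl
# Build: one nested dict level per character; "" marks end-of-word.
proc trie-add {trieVar word} {
    upvar 1 $trieVar trie
    set path [split $word ""]
    lappend path ""              ;# terminal marker
    dict set trie {*}$path leaf
}

proc trie-has {trie word} {
    set path [split $word ""]
    lappend path ""
    dict exists $trie {*}$path
}

set trie [dict create]
foreach w {car cart cat} { trie-add trie $w }
puts [trie-has $trie car]   ;# 1
puts [trie-has $trie ca]    ;# 0 (a prefix only, not a stored word)
```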
The best performance, probably equal to or better than a hash, comes
from a critical-bit (crit-bit) tree. Unfortunately you can't do this
in Tcl; you have to do it in C and provide an interface to Tcl.
I recently wrote a Tcl equivalent of the AOLserver version of a trie,
I'd like to generalize it but if anyone is interested in the current
code, let me know.