Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Optimisation help needed

4 views
Skip to first unread message

Martin DeMello

unread,
Feb 23, 2005, 1:10:07 PM2/23/05
to
I'm developing a set of ruby utilities to work with DAWG (directed
acyclic word graph) files, as produced by Graham Toal's program at
http://www.gtoal.com/wordgames/dawgutils/

Essentially a dawg is a variant of a trie, with a fixed-length (4 byte)
record containing a letter, a pointer to another node (the root of a
subtree of words starting with the current prefix), and bits flagging
the end of node and end of word markers for the current record
(see http://www.wutka.com/dawg.html)

The problem is that this is a very C-oriented data structure - a node is
stored as a consecutive series of records, which you read until you find
the letter you're looking for or you come to the end of node marker. The
pointer is likewise just an offset into the contiguous memory block the
dawg occupies (or into the file, but slurping the whole thing into
memory is one of the first optimisations I tried). All this ends up
being horribly inefficient in ruby - indeed, the bottleneck is the
get_children function, which I've implemented thus:

def children(index)
eon = eil
a = nil
until(eon)
a = memoized_read_record(index)
yield a
eon = a.eon?
index += 1
end
end

The entire code, and the dawg file I'm using to test it are up on
http://rubyforge.org/frs/?group_id=562 - could people please take a look
at it and see what I can do without dropping into C? (It could be that
my anagram algorithm is naive too - suggestions welcomed)

martin

linus sellberg

unread,
Feb 23, 2005, 5:10:41 PM2/23/05
to
Martin DeMello wrote:
> at it and see what I can do without dropping into C? (It could be that
> my anagram algorithm is naive too - suggestions welcomed)

One of the most efficient algorithms for finding anagrams sorts both the
grammar and the word you want to find anagrams for before it starts to
compare anything.

(that is, bar => abr)

Robert Klemme

unread,
Feb 23, 2005, 6:35:35 PM2/23/05
to

"Martin DeMello" <martin...@yahoo.com> schrieb im Newsbeitrag
news:3G3Td.487016$6l.245954@pd7tw2no...

This is what I came up with so far - it's cleaner but unfortunately not
faster. Interestingly enough the StrinIO variant is slower...

Have to go to bed now. HTH

robert

dawg-3.rb

Robert Klemme

unread,
Feb 23, 2005, 7:00:34 PM2/23/05
to

"Robert Klemme" <bob....@gmx.net> schrieb im Newsbeitrag
news:384iaaF...@individual.net...

Ha, I must've been totally blind. Try this version instead!

You probably can make it even faster if you marshal the dictionary once
you've read it.

Good night

robert

dawg-4.rb

Martin DeMello

unread,
Feb 23, 2005, 11:15:53 PM2/23/05
to

That doesn't extend itself well to anagrams with wildcards, though.
DAWGs are nice for those.

martin

Martin DeMello

unread,
Feb 23, 2005, 11:35:15 PM2/23/05
to
Robert Klemme <bob....@gmx.net> wrote:
>
> Ha, I must've been totally blind. Try this version instead!
>
> You probably can make it even faster if you marshal the dictionary once
> you've read it.

Thanks - looks a lot more rubyish than mine, too! Will test it out when
I get home.

martin

Robert Klemme

unread,
Feb 24, 2005, 3:03:58 AM2/24/05
to

"Martin DeMello" <martin...@yahoo.com> schrieb im Newsbeitrag
news:7QcTd.490774$8l.96770@pd7tw1no...

There we some errors still. I did a bit of fixing and tidying up. Most
interesting I find the simplification achieved by "EOW =
DawgNode.new.freeze".

Have fun

robert

dawg-5.rb

Martin DeMello

unread,
Feb 24, 2005, 2:10:24 PM2/24/05
to
Robert Klemme <bob....@gmx.net> wrote:
>
> There we some errors still. I did a bit of fixing and tidying up. Most
> interesting I find the simplification achieved by "EOW =
> DawgNode.new.freeze".

Subtle problem:
current.children[char] = eoword ? DawgNode::EOW : nodes[ptr]

doesn't work because a node could be both EOW *and* have children. But
looks like a very nice clean approach on the whole - will happily
replace my old code with this.

martin

Martin DeMello

unread,
Feb 24, 2005, 3:04:39 PM2/24/05
to

(And yes, it turned out to be a very small fix - merely adding the eow
attribute back to every node and saying

if char != 0
current.children[char] = nodes[ptr]
current.eow = eoword
end

plus a few supporting changes elsewhere.)

martin

Martin DeMello

unread,
Feb 24, 2005, 3:20:34 PM2/24/05
to
Martin DeMello <martin...@yahoo.com> wrote:
>
> (And yes, it turned out to be a very small fix - merely adding the eow
> attribute back to every node and saying
>
> if char != 0
> current.children[char] = nodes[ptr]
> current.eow = eoword
> end

My turn to be too sleepy for this :) Of course it's a bit more tricky
than that.

current.children[char].eow = eoword if ptr > 1

seems to work; I'll bang on it some more in the morning.

martin

0 new messages