Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

The Judy algorithm

5 views

Skip to first unread message

Tim Bunce

unread,

Mar 10, 2003, 5:37:30 AM3/10/03

to perl6-i...@perl.org

I think this might be interesting to some of you...

"Judy is a general purpose dynamic array implemented as a C callable
library. Judy's speed and memory usage are typically better than
other data storage models and improves with very large data sets."

http://judy.sourceforge.net/application/10minutes.htm
http://judy.sourceforge.net/application/
http://sourceforge.net/projects/judy

I've appended a few extracts from the "10minutes.htm" url given above.

Tim.

A (CPU) cache-line fill is additional time required to do a read
reference from RAM when a word is not found in cache. In today's
computers the time for a cache-line fill is in the range of 50..2000
machine instructions. Therefore a cache-line fill should be avoided
when fewer than 50 instructions can do the same job. (Modern machines
tend to pipeline writes to RAM. They often take no additional time
in the Judy design.)

Some of the reasons Judy outperforms binary trees, b-trees, and skip-lists:

* Judy rarely compromises speed/space performance for simplicity
(Judy will never be called simple except at the API).

* Judy is designed to avoid cache-line fills wherever possible.
(This is the main design criteria for Judy.)

* A b-tree requires a search of each node (branch), resulting
in more cache-line fills.

* A binary-tree has many more levels (about 8X), resulting in
more cache-line fills.

* A skip-list is roughly equivalent to a degree-4 (4-ary) tree,
resulting in more cache-line fills

From a speed point of view Judy is chiefly a 256-ary digital tree
or trie (per D. Knuth Volume 3 definitions). A degree of 256-ary
is a somewhat "magic" N-ary for a variety of reasons -- but mostly
because a byte (the least addressable memory unit) is 8 bits. Also
a higher degree means reduced cache-line fills per access. You see
the theme here -- avoid cache-line fills like the plague.

The presence of a CPU cache in modern machines has changed many of
the ways to write a performance algorithm. To take advantage of a
cache, it is important to leverage as much as possible. In a Judy
tree, the presence of a cache results in 1..3 (or more) fewer
cache-line fills per lookup than would be possible without a cache.

Judy adapts efficiently to a wide range of populations and data set
densities. Since the Judy data structure is a tree of trees, each
sub-tree is a static expanse that is optimized to match the "character"
or density of the keys it contains.

Notice that to insert or delete a key is almost as simple as setting
or clearing a bit.

Leopold Toetsch

unread,

Mar 10, 2003, 8:54:35 AM3/10/03

to Tim Bunce, perl6-i...@perl.org

Tim Bunce wrote:

> I think this might be interesting to some of you...
>

> http://sourceforge.net/projects/judy

Indeed. Maybe someone can wrap up list.c and/or hash.c and compare
performance for typical usage patterns - however they look like - but
mainly linear in arrays I presume.
t/src/intlist and t/pmc/intlist have some stress tests.

leo

Elizabeth Mattijsen

unread,

Mar 10, 2003, 9:33:38 AM3/10/03

to Tim Bunce, perl6-i...@perl.org

At 10:37 +0000 3/10/03, Tim Bunce wrote:
>I think this might be interesting to some of you...
> "Judy is a general purpose dynamic array implemented as a C callable
> library. Judy's speed and memory usage are typically better than
> other data storage models and improves with very large data sets."
>
>http://judy.sourceforge.net/application/10minutes.htm
>http://judy.sourceforge.net/application/
>http://sourceforge.net/projects/judy
>I've appended a few extracts from the "10minutes.htm" url given above.

This looks very interesting (particularly for a project I'm working
on now, which was the reason I looked into this right now), but the
project really seems quite silent if not dead.

Some more info:
Only HP-UX and Linux seem to be supported out of the box (only tried
Linux and Mac OS X).

I adapted the indexSL program to just be a filter and piped
/usr/share/dict/words through it. Then let it run with Valgrind.
That reports:

==11948== LEAK SUMMARY:
==11948== definitely lost: 11 bytes in 1 blocks.
==11948== possibly lost: 26 bytes in 2 blocks.
==11948== still reachable: 0 bytes in 0 blocks.

Not a whole lot of leakage, but still.

I got the configure script into believing that MacOS X is really
Linux. Compilation then halts on
byteswap.h being missing. I didn't look any further then.

The forum seems to be missing answers from the primary (only)
developer. Bug reports with patches have not been applied (such as
trivial bashisms in the configure script).

The application directory contains some nice examples that might be
applicable to Parrot: especially the "best of both worlds" approach
in which Judy arrays are used to handle hash value collisions on a
rather small (256 or 64K keys) hash.

Just my 2 eurocents worth (which appear to be worth more than 2.1 US$
cents nowadays ;-)

Liz

Tim Bunce

unread,

Mar 11, 2003, 4:44:29 AM3/11/03

to Elizabeth Mattijsen, Tim Bunce, perl6-i...@perl.org, do...@sourcejudy.com, dougb...@frii.com

On Mon, Mar 10, 2003 at 03:33:38PM +0100, Elizabeth Mattijsen wrote:
> At 10:37 +0000 3/10/03, Tim Bunce wrote:
> >I think this might be interesting to some of you...
> > "Judy is a general purpose dynamic array implemented as a C callable
> > library. Judy's speed and memory usage are typically better than
> > other data storage models and improves with very large data sets."
> >
> >http://judy.sourceforge.net/application/10minutes.htm
> >http://judy.sourceforge.net/application/
> >http://sourceforge.net/projects/judy
> >I've appended a few extracts from the "10minutes.htm" url given above.
>
> This looks very interesting (particularly for a project I'm working
> on now, which was the reason I looked into this right now), but the
> project really seems quite silent if not dead.

I emailed Doug [CC'd] with your concerns. Here's his reply.

Tim.

From: dougb...@frii.com
Subject: Re: (Fwd) Re: The Judy algorithm
Date: Mon, 10 Mar 2003 22:40:25

Tim:

I will try to reply to your concerns below.

> Doug,
>
> I've just flagged Judy to the perl6-internals list (who are developing
> the 'parrot' virtual machine) and they've raised some concerns (below).
>
> Could you tell me about the status of the Judy code.
> Is it being maintained?

Yes, but I am not much of a process person. The versions that install and
compile other platforms are prelimary and need more testing. They are
available in <http://judy.sourceforge.net/downloads> (see README).

>
> Tim.
>
> ----- Forwarded message from Elizabeth Mattijsen <l...@dijkmat.nl> -----
>
> Delivered-To: tim....@pobox.com
> In-Reply-To: <20030310103...@dansat.data-plan.com>
> Date: Mon, 10 Mar 2003 15:33:38 +0100
> To: Tim Bunce <Tim....@pobox.com>, perl6-i...@perl.org
> From: Elizabeth Mattijsen <l...@dijkmat.nl>
> Subject: Re: The Judy algorithm

>
> At 10:37 +0000 3/10/03, Tim Bunce wrote:

> >I think this might be interesting to some of you...
> > "Judy is a general purpose dynamic array implemented as a C callable
> > library. Judy's speed and memory usage are typically better than
> > other data storage models and improves with very large data sets."
> >
> >http://judy.sourceforge.net/application/10minutes.htm
> >http://judy.sourceforge.net/application/
> >http://sourceforge.net/projects/judy
> >I've appended a few extracts from the "10minutes.htm" url given above.

>
> This looks very interesting (particularly for a project I'm working
> on now, which was the reason I looked into this right now), but the
> project really seems quite silent if not dead.
>
>
> Some more info:
> Only HP-UX and Linux seem to be supported out of the box (only tried
> Linux and Mac OS X).
>
> I adapted the indexSL program to just be a filter and piped
> /usr/share/dict/words through it. Then let it run with Valgrind.
> That reports:
>
> ==11948== LEAK SUMMARY:
> ==11948== definitely lost: 11 bytes in 1 blocks.
> ==11948== possibly lost: 26 bytes in 2 blocks.
> ==11948== still reachable: 0 bytes in 0 blocks.
>
> Not a whole lot of leakage, but still.

I agree any leakage is unacceptable. However, Judy is tested carefully
to not have leakage. Memory usage (from malloc()) is kept internally to the
structure and must subtract (free()) to exactly zero when the last element is
deleted from the array. Perhaps there is a problem in the measurement.
I would like to know more about the measurement to be certain that the
problem is not in Judy. JudyL and Judy1 only allocate blocks in multiples
of 4 bytes.

>
>
> I got the configure script into believing that MacOS X is really
> Linux. Compilation then halts on
> byteswap.h being missing. I didn't look any further then.

The versions in the download directory (mentioned above) should solve
this problem. However, I think it requires gmake.

>
>
> The forum seems to be missing answers from the primary (only)
> developer. Bug reports with patches have not been applied (such as
> trivial bashisms in the configure script).
>
>
> The application directory contains some nice examples that might be
> applicable to Parrot: especially the "best of both worlds" approach
> in which Judy arrays are used to handle hash value collisions on a
> rather small (256 or 64K keys) hash.

If a hashing scheme (of strings) is able to solve your problem (just store and
retrieves) then I suggest using JudyL inplace of your normal hash table. This
makes a very scalable hashing method. The performance is better than any known
tree method (including JudySL).

>
>
>
> Just my 2 eurocents worth (which appear to be worth more than 2.1 US$
> cents nowadays ;-)
>
>
> Liz
>

> ----- End forwarded message -----
>

I will be available for your questions and comments.

Doug Baskins <do...@sourcejudy.com>

0 new messages