Low-level, Efficient Ternary Search Trees for Storing Strings?

217 views
Skip to first unread message

dpapathanasiou

unread,
Mar 29, 2007, 5:10:17 PM3/29/07
to
After reading Bentley and Sedgewick's paper on using Ternary Search
Trees for storing strings (http://www.ddj.com/dept/windows/184410528),
I wrote an implementation in CL, using a simple char array as the trie
structure.

It works, and searching is fast, but the resulting tries (arrays) are
massive, especially when saved to disk.

Then I saw this post about packing multiple datapoints into a single
object -- fixnum, in his case, since he's dealing with integers
(http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/
f5b364a601c86987), and I was wondering if there's any similar low-
level, efficient structure that can be used to replace the char array
in my code?

Here's my code:

(defpackage :trie
(:use
:common-lisp)
(:export
:make-trie
:insert-trie
:search-trie)
(:documentation
"Library of ternary trees, inspired by Bentley & Sedgewick
article."))

(in-package :trie)

(defun make-trie ()
"Return a trie object, an array, defined as:
0 = split-char
1 = lo-kid
2 = eq-kid
3 = hi-kid"
(make-array '(4) :element-type 'char :initial-contents '(nil nil nil
nil)))

(defun get-trie-val (trie index)
"Access the trie value defined by index:
0 = split-char
1 = lo-kid
2 = eq-kid
3 = hi-kid"
(aref trie index))

(defun get-split-char (trie)
(get-trie-val trie 0))

(defun get-lo-kid (trie)
(get-trie-val trie 1))

(defun get-eq-kid (trie)
(get-trie-val trie 2))

(defun get-hi-kid (trie)
(get-trie-val trie 3))

(defun set-trie-val (trie index val)
"Set the trie value defined by index:
0 = split-char
1 = lo-kid
2 = eq-kid
3 = hi-kid"
(setf (aref trie index) val))

(defun set-split-char (trie val)
(set-trie-val trie 0 val))

(defun set-lo-kid (trie val)
(set-trie-val trie 1 val))

(defun set-eq-kid (trie val)
(set-trie-val trie 2 val))

(defun set-hi-kid (trie val)
(set-trie-val trie 3 val))

(defun trie-empty (trie)
"A trie is empty if its split-char value is undefined."
(null (get-split-char trie)))

(defun insert-trie (trie str)
"Add the string to the trie."
(when (> (length str) 0)
(let ((c (char-downcase (char str 0))))
(when (trie-empty trie)
(setf trie (make-trie))
(set-split-char trie c)
(set-lo-kid trie (make-trie))
(set-eq-kid trie (make-trie))
(set-hi-kid trie (make-trie)))
(cond ((char< c (get-split-char trie))
(set-lo-kid trie (insert-trie (get-lo-kid trie) str)))
((char= c (get-split-char trie))
(set-eq-kid trie (insert-trie (get-eq-kid trie) (subseq
str 1))))
(t
(set-hi-kid trie (insert-trie (get-hi-kid trie) str))))))
trie)

(defun search-trie (trie str)
"Return t if the string exists in the trie, nil if it does not."
(if (> (length str) 0)
(let ((c (char-downcase (char str 0))))
(cond ((null (get-split-char trie))
nil) ; run off end w/o match
((char< c (get-split-char trie))
(search-trie (get-lo-kid trie) str))
((char> c (get-split-char trie))
(search-trie (get-hi-kid trie) str))
(t
(search-trie (get-eq-kid trie) (subseq str 1)))))
t))

Rainer Joswig

unread,
Mar 29, 2007, 6:34:57 PM3/29/07
to
In article <1175202616.1...@o5g2000hsb.googlegroups.com>,
"dpapathanasiou" <denis.pap...@gmail.com> wrote:

That's not right. You have an array of characters.
NIL is not a character. You can say element type is
character and then initialize the elements to NIL
(which is not a character).

Plus:

Element 0 is a character. What are elements 1, 2 and 3?
They are others TRIEs. Not characters. You
can use a general ARRAY, but not a character ARRAY.

Use a structure instead of an array. See below.


>
> (defun get-trie-val (trie index)
> "Access the trie value defined by index:
> 0 = split-char
> 1 = lo-kid
> 2 = eq-kid
> 3 = hi-kid"
> (aref trie index))
>
> (defun get-split-char (trie)
> (get-trie-val trie 0))
>
> (defun get-lo-kid (trie)
> (get-trie-val trie 1))
>
> (defun get-eq-kid (trie)
> (get-trie-val trie 2))
>
> (defun get-hi-kid (trie)
> (get-trie-val trie 3))
>
> (defun set-trie-val (trie index val)
> "Set the trie value defined by index:
> 0 = split-char
> 1 = lo-kid
> 2 = eq-kid
> 3 = hi-kid"
> (setf (aref trie index) val))

SETF is better used to update data. You would
also want to write your interfaces to provide SETF
functions. You can do it with (defun (setf ... )...).
But usully you get it with DEFSTRUCT for free. See below.

>
> (defun set-split-char (trie val)
> (set-trie-val trie 0 val))
>
> (defun set-lo-kid (trie val)
> (set-trie-val trie 1 val))
>
> (defun set-eq-kid (trie val)
> (set-trie-val trie 2 val))
>
> (defun set-hi-kid (trie val)
> (set-trie-val trie 3 val))

A remark: all this boilerplate code could be replaced
with a single DEFSCTRUCT

(defstruct trie ...)

Then you'd get a structure object when you call MAKE-TREE.
Note that you can also use DEFSTRUCT when you want
an array as the underlying datastructure. There is a :TYPE
option to DEFSTRUCT where you can specify that.

You could also use DEFCLASS if you'd like the flexibility of CLOS
classes.

--
http://lispm.dyndns.org

Rainer Joswig

unread,
Mar 29, 2007, 6:37:53 PM3/29/07
to
In article <joswig-1E03F4....@news-europe.giganews.com>,
Rainer Joswig <jos...@lisp.de> wrote:

I mean "you can't say.."

dpapathanasiou

unread,
Mar 30, 2007, 8:56:14 AM3/30/07
to
Point taken about the array definition; I should *not* use :element-
type 'char since only :split-char is a char and all the :*-kid
components are themselves trie objects, not chars.

But even if I switch from an array to a defstruct, I still wind up
with a very bulky S-expression.

Using this (which, as you say is better, since it offers built-in
make, access and setting functions):

(defstruct trie
split-char
lo-kid
eq-kid
hi-kid)

If I create a trie to hold a two-letter string, e.g. "as", I get this
S-expression:

#S(TRIE
:SPLIT-CHAR #\a
:LO-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
NIL)
:EQ-KID #S(TRIE
:SPLIT-CHAR #\s
:LO-KID #S(TRIE
:SPLIT-CHAR NIL
:LO-KID NIL
:EQ-KID NIL
:HI-KID NIL)
:EQ-KID #S(TRIE
:SPLIT-CHAR NIL
:LO-KID NIL
:EQ-KID NIL
:HI-KID NIL)
:HI-KID #S(TRIE
:SPLIT-CHAR NIL
:LO-KID NIL
:EQ-KID NIL
:HI-KID NIL))
:HI-KID #S(TRIE :SPLIT-CHAR NIL :LO-KID NIL :EQ-KID NIL :HI-KID
NIL))

Serialized to disk, that's 866 bytes.

Versus just 2 bytes (4 counting the double-quotes to denote it as a
string object) if I write "as" just like that.

So what I was getting at was: is there any way of packing the trie
data into a low-level, compact, efficient data structure, similar to
what the other post did for multiple integers into fixnums?

I'd like to be able to use tries for searching, but if I'm stuck
serializing such large S-expressions, I'd be better off saving them as
raw strings and using a fast library like cl-ppcre instead.

truls....@gmail.com

unread,
Mar 30, 2007, 9:33:25 AM3/30/07
to
On Mar 30, 2:56 pm, "dpapathanasiou" <denis.papathanas...@gmail.com>
wrote:

How about storing it as #S(as) and make the reader turn that into a
trie structure?

Also, are the NIL-only parts really necessary? Could your example be
reduced to the following?

#S(TRIE
:SPLIT-CHAR #\a
:LO-KID NIL


:EQ-KID #S(TRIE
:SPLIT-CHAR #\s

:LO-KID NIL
:EQ-KID NIL
:HI-KID NIL)

:HI-KID NIL)

dpapathanasiou

unread,
Mar 30, 2007, 9:40:12 AM3/30/07
to
> How about storing it as #S(as) and make the reader turn that into a
> trie structure?

Yes, I thought of that, but counting the time to generate the trie
plus execute the search, I wonder if it wouldn't just be faster to run
a pattern scan on the raw string with cl-ppcre (cl-ppcre is incredibly
fast).

> Also, are the NIL-only parts really necessary? Could your example be
> reduced to the following?
>
> #S(TRIE
> :SPLIT-CHAR #\a
> :LO-KID NIL
> :EQ-KID #S(TRIE
> :SPLIT-CHAR #\s
> :LO-KID NIL
> :EQ-KID NIL
> :HI-KID NIL)
> :HI-KID NIL)

It's a good point, but even that S-expression is >> than the amount of
space need to contain the raw string.

Thomas F. Burdick

unread,
Mar 30, 2007, 10:50:39 AM3/30/07
to
On Mar 30, 2:56 pm, "dpapathanasiou" <denis.papathanas...@gmail.com>
wrote:

Sure, use an array of (unsigned-byte 32) for your trie. Each 32-bit
entry is a node, (byte 8 0) is the char-code of your split-char, (byte
8 8) is lo-kid, (byte 8 16) eq-kid, (byte 8 24) hi-kid. The kid
values are indexes into trie-array. Since you don't have cycles, re-
use position 0 to mean NIL. You can serialize by writing an 8-bit
byte to give the length of the trie, then the series of 32-bit words
which are the contents of the array.

Now, obviously that limits you to 8-bit characters and 256-node tries,
but it should be obvious how to extend it.

Alex Mizrahi

unread,
Apr 1, 2007, 9:28:23 AM4/1/07
to
(message (Hello 'dpapathanasiou)
(you :wrote :on '(30 Mar 2007 05:56:14 -0700))
(

d> So what I was getting at was: is there any way of packing the trie
d> data into a low-level, compact, efficient data structure, similar to
d> what the other post did for multiple integers into fixnums?

d> I'd like to be able to use tries for searching, but if I'm stuck
d> serializing such large S-expressions, I'd be better off saving them as
d> raw strings and using a fast library like cl-ppcre instead.

i still have no clue what would you like to do, and i suspect you don't know
it for sure too, since a person who perfectly understands the process should
have no problems implementing it.

S-expressions are on-disk, structures are in-memory.
but you want to save raw strings on-disk, and ..cl-ppcre in memory??

how does cl-ppcre relate to s-expressions on disk??

why can't you have raw strings in file, then read them into structures, and
use fast search algorithms?

actually what's optimal depends on your usage patterns. i some cases, when
you have large string sets, it's not feasible to read whole thing into
memory -- you can mmap a file, and operate on it.

)
(With-best-regards '(Alex Mizrahi) :aka 'killer_storm)
"?? ???? ??????? ?????")


dpapathanasiou

unread,
Apr 1, 2007, 12:29:34 PM4/1/07
to

> i still have no clue what would you like to do, and i suspect you don't know
> it for sure too, since a person who perfectly understands the process should
> have no problems implementing it.

The post by Thomas F. Burdick answered my question.


Holger Schauer

unread,
Apr 2, 2007, 11:34:38 AM4/2/07
to
On 4958 September 1993, dpapathanasiou wrote:
> Point taken about the array definition; I should *not* use :element-
> type 'char since only :split-char is a char and all the :*-kid
> components are themselves trie objects, not chars.

I have a toy implementation of ternary trees in CLOS, just in case you're
interested. Actually, I have two, another one for the general case
i.e., in which it's not assumed that you're comparing characters.

I stopped playing around with it when I ran into the exact same
problem of storing the data to disc and when I discovered that I
couldn't beat hash tables at all (on a 200MB word list) -- I checked
on at least three CL implementations. And it's unbelievable slower
than the C code of B&S. IIRC, this is mainly due to their use of
unions to implement the data structure, but it might be that I mix
that up with the amount of memory used (hmm, I think I had put in some
optimization settings that resulted in at least one CL implementation
in slots not being created per default in fresh objects, which
minimized the memory footprint, but in turn meant I had to check
whether a slot really existed, slowing down the code considerably).

> But even if I switch from an array to a defstruct, I still wind up
> with a very bulky S-expression.

Yup.

> I'd like to be able to use tries for searching, but if I'm stuck
> serializing such large S-expressions, I'd be better off saving them as
> raw strings and using a fast library like cl-ppcre instead.

Exactly.

Holger

--
--- http://hillview.bugwriter.net/ ---
"You see, when you upload, the computer has to push the bits upward,
and let me tell you, when you're talking millions of bits, it gets
heavy. When you download, the bits just fall by gravity, so it's much
faster (we provided you with some remarkable cushioning functions that
prevent damage to your data). Non-ADSL? They use helium. That's why
it's more expensive." -- seen on WTF

dpapathanasiou

unread,
Apr 3, 2007, 9:35:07 AM4/3/07
to

> I have a toy implementation of ternary trees in CLOS, just in case you're
> interested. Actually, I have two, another one for the general case
> i.e., in which it's not assumed that you're comparing characters.

Actually, I would be interested in seeing that; I did a "Most Obvious
Way" (tm) implementation after reading the article, but I didn't take
it any further, because of the reasons explained above.


Tim Bradshaw

unread,
Apr 3, 2007, 11:31:21 AM4/3/07
to
On Mar 29, 10:10 pm, "dpapathanasiou" <denis.papathanas...@gmail.com>
wrote:

> After reading Bentley and Sedgewick's paper on using Ternary Search
> Trees for storing strings (http://www.ddj.com/dept/windows/184410528),
> I wrote an implementation in CL, using a simple char array as the trie
> structure.
>
> It works, and searching is fast, but the resulting tries (arrays) are
> massive, especially when saved to disk.
>

A couple of related points about this:
1. The performance of these sorts of structures is heavily influenced
by locality issues. I haven't thought about the structure you've
described but I suspect it may not be very good in this regard. In
particular superficially awful approaches (linear search) do really
rather well because the memory system loves them. Databases learnt
this lesson long ago since it's mattered for disks for longer than
it's mattered for memory.
2. Why store the structure on disk at all? Is it faster or slower to
store the strings and "intern" them as you read them from the disk.

--tim

Szymon

unread,
Apr 17, 2007, 6:36:19 PM4/17/07
to
dpapathanasiou wrote:

> [...]

> [...]

Why not ordinary tree (not byte-level but shorter than yours):

((*) (A (* (S (T)))))

Regards, Szymon.

Reply all
Reply to author
Forward
0 new messages