Tree data structure for URLs

1,214 views
Skip to first unread message

Herbert Fischer

unread,
Jan 27, 2011, 4:02:13 PM1/27/11
to redi...@googlegroups.com
Hi,

I need to store a tree structure like the example bellow. I searched
the web and found nothing really clear to solve this.
Building the tree can be slow, but searching must be really fast.
The initial idea is to store URLs as trees and give a set of values
for each level

http://example.com/level1/level2/level3/
http://example.com/level1/level2/level31/

example.com: x=10, y=20
 |- level1: x=22, y=41
|- level2: x=44, y=65
|- level3: x=44, y=11
|- level31: x=4, y=14

And if I do a query for:

http://example.com/level1/level2/level3/anythingelse/somefile.html

It must return me x=44 and y=11, because it's the ideal path for this URL.

Do you have any ideas on how to use Redis to do this?

I saw something about using sorted lists, but I'm not understanding
how to do it.

thanks in advance!

Josiah Carlson

unread,
Jan 27, 2011, 4:13:21 PM1/27/11
to redi...@googlegroups.com
https://groups.google.com/forum/#!topic/redis-db/IsLJ4PlBo9E

Regards,
 - Josiah


--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.


Herbert Fischer

unread,
Jan 27, 2011, 4:38:51 PM1/27/11
to redi...@googlegroups.com
Thanks Josiah,

I already read this thread, but I'm worried about searching the tree.

For example, If I store a tree by using simple KV and some kind of
relationship by IDs I will probably need to do lot's of commands in
Redis to match the URL, right?

I'm new to Redis and don't know if some of it's features can help
doing a fast "tree" search and what's the best way to store a tree to
enable a fast search.

Derek Williams

unread,
Jan 27, 2011, 5:42:23 PM1/27/11
to redi...@googlegroups.com
This is just off the top of my head / crazy / no guarantee of
correctness or performance, just my attempt at minimizing roundtrips:

1) Store each path with a score, doesn't really matter how. Lets go
with a hash with the url as the key, and the fields x and y with the
score, like this:
path:example.com {x: 10, y: 20}
path:example.com/level1 {x: 22, y: 41}
... etc

2) Store an index of each path returning a score in a sorted set
(score 1 so it is sorted alphanumericly), lets call it "paths":
example.com
example.com/level1
example.com/level1/level2
example.com/level1/level2/level3
example.com/level1/level2/level31

3) Each request will need a unique id, a uuid would work. Split the
requested url into paths of different depths. for example:

split http://example.com/level1/level2/level3/anythingelse/somefile.html into:
example.com
example.com/level1
example.com/level1/level2
example.com/level1/level2/level3
example.com/level1/level2/level3/anythingelse
example.com/level1/level2/level3/anythingelse/somefile.html

4) Now send the request to the redis server:

MULTI

// add each path to a unique sorted set
ZADD req:uuid 1 example.com
ZADD req:uuid 1 example.com/level1
ZADD req:uuid 1 example.com/level1/level2
ZADD req:uuid 1 example.com/level1/level2/level3
ZADD req:uuid 1 example.com/level1/level2/level3/anythingelse
ZADD req:uuid 1 example.com/level1/level2/level3/anythingelse/somefile.html

// store all the intersecting paths in another sorted set
ZINTERSTORE res:uuid 2 paths req:uuid

// get the last matching path from the intersection (this is the only
value you need)
ZREVRANK res:uuid 0

//clean up
DEL req:uuid res:uuid

EXEC

The return value for the ZREVRANK command should be the deepest path
that matches the request. From there you can query for the scores. It
might be possible to do this all in one request by doing this instead
of the ZREVRANK:

SORT res:uuid LIMIT 0 1 GET # GET path:*->x GET path:*->y ALPHA DESC

Using SORT might also allow you to just use plain sets instead of
sorted sets since it will be sorting them anyways, which would save on
memory.

--
Derek

Herbert Fischer

unread,
Jan 28, 2011, 11:50:53 AM1/28/11
to redi...@googlegroups.com
Hi Derek!

I learned a lot of Redis by studying your ideas!

I'll try all them (including the ones using SORT) and after that post
the final conclusions here.

Thanks again!

Herbert Fischer

unread,
Feb 2, 2011, 8:45:50 AM2/2/11
to redi...@googlegroups.com
Hello again,

This is the final and tested idea. It worked very well! Thanks Derek!

** Storing the URL tree in Redis:

SET path:domain.com { 'x': 100, 'y': 200 }
SET path:domain.com/a { 'x': 100, 'y': 201 }
SET path:domain.com/b { 'x': 100, 'y': 202 }
SET path:domain.com/b/c { 'x': 100, 'y': 203 }

ZADD paths:domain.com 1 "domain.com"
ZADD paths:domain.com 1 "domain.com/a"
ZADD paths:domain.com 1 "domain.com/b"
ZADD paths:domain.com 1 "domain.com/b/c"

** Querying the URL tree (Ex.: domain.com/b/c/page.html):

MULTI
ZADD req:HASH 1 "domain.com"
ZADD req:HASH 1 "domain.com/b"
ZADD req:HASH 1 "domain.com/b/c"
ZADD req:HASH 1 "domain.com/b/c/page.html"
ZINTERSTORE req:HASH 2 paths:domain.com req:HASH
SORT req:HASH LIMIT 0 1 GET path:* ALPHA DESC
DEL req:HASH
EXEC


I've implemented this in Python and this "transaction" took about 54
secs to query 100k URLs (~1852 tps) in a dataset of about 1k different
"paths", in a desktop (Core 2 Duo E75000 2.93Ghz).

Thanks!

Derek Williams

unread,
Feb 2, 2011, 8:52:21 AM2/2/11
to redi...@googlegroups.com

Oh good, it worked! I'll have to actually give it a try at some point too.

Santiago Perez

unread,
Feb 2, 2011, 11:52:41 PM2/2/11
to redi...@googlegroups.com
Isn't it easier to just have a key for each path, do an MGET with all the searched subpaths, and pick the longest match client side?

SET path:domain.com { 'x': 100, 'y': 200 }
SET path:domain.com/a { 'x': 100, 'y': 201 }
SET path:domain.com/b { 'x': 100, 'y': 202 }
SET path:domain.com/b/c { 'x': 100, 'y': 203 }

1) { 'x': 100, 'y': 200 }
2) { 'x': 100, 'y': 202 }
3) { 'x': 100, 'y': 203 }
4) (nil)

Herbert Fischer

unread,
Feb 3, 2011, 8:02:55 AM2/3/11
to redi...@googlegroups.com
This is actually the fastest idea!! Thanks Santiago!
These are the times I've got in my benchmark with 100k URLs

ZSET: 48.7541201115 secs
MGET: 13.725041151 secs

Reply all
Reply to author
Forward
0 new messages