redis database design help (Geolocation + Expiration)

Susan Lin

May 7, 2015, 5:18:19 PM
to redi...@googlegroups.com

I am in the process of learning redis and am building a geo program for learning purposes. I would like to only use redis to store the data and am trying to avoid any relational databases. My question is how best to design the database for the program. This is how the program works:

1) I will create millions of random robots around the world which wander so they can have different geo coordinates (some robots can be in the exact same space).

2) Each robot will randomly send a post to the server (every few hours possibly on average) which will contain: a) the location of where the robot sent this data from (in either coordinates or geohash depending on the best implementation idea) b) some small text

3) I will have a map with all the robots and would like to be able to click on a robot and get this information: a) all the posts which were posted nearby the robot I just clicked

4) Due to the fact I will be hosting this on AWS I will need to delete the posts every couple of hours to keep the memory usage low so some type of expiration is mandatory.

My main concern is performance and I am interested in how to design the redis database.

In a single day (I will work out the math for random posts to do this) about ~500,000,000 posts will be generated.

My Incomplete ideas so far:

Idea 1

1) a post would be stored as such:

HSET [Geohash of location] [timestamp] [small text] (<-- the value will be used in a later feature to increment the number of manual modifications I make to a post).

2) I then would be able to get all the posts near a robot by sending the geohash location he is in. The downfall here is I would also need to include his 8 geohash neighbors, which would require 8 more queries. That is why I am also looking into the concept of spatial proximity for this feature.

HGETALL [GeoHash Location of robot] 

This would then return the field ([timestamp]) and value ("0");

3) Expiration of old posts. Since I can't use the EXPIRE command to delete fields from a hashset, I would need to scan through all the hashset fields periodically, find old timestamps, and remove them. Since Redis only allows pattern searching, this will be difficult when all the timestamps are different.

Idea 2:

Use Redis-geo (https://matt.sh/redis-geo).

1) To store the posts I would run:

geoadd globalSet [posts_long] [posts_lat] "small text";

2) To get all the post information for a robot nearby:

georadius nyc [robots_long] [robots_lat] [X] km

This would return all posts near the robot within X kms.

3) Then I am stuck on how to remove old posts.

Josiah Carlson

May 8, 2015, 2:18:20 AM
to redi...@googlegroups.com
Replies inline.

On Thu, May 7, 2015 at 10:57 AM, Susan Lin <susan...@gmail.com> wrote:

I am in the process of learning redis and am building a geo program for learning purposes. I would like to only use redis to store the data and am trying to avoid any relational databases. My question is how best to design the database for the program. This is how the program works:

Reading through your description, you don't specify a few important pieces that significantly affect the way this could be implemented. I'll try to point them out and ask questions.

1) I will create millions of random robots around the world which wander so they can have different geo coordinates (some robots can be in the exact same space).

Do you know how you are going to store information about these millions of random robots? Where they are? Who was last updated? Who should be updated next?

2) Each robot will randomly send a post to the server (every few hours possibly on average) which will contain: a) the location of where the robot sent this data from (in either coordinates or geohash depending on the best implementation idea) b) some small text

 How do you know which robot will send a message next? Is this different or the same as how you store which robot next gets updated?

3) I will have a map with all the robots and would like to be able to click on a robot and get this information: a) all the posts which were posted nearby the robot I just clicked

I hope you're not thinking of showing your millions of robots on a map at the same time, as that has a good chance of killing your browser (if you're using Google Maps or Open Street Map overlays). Where are your robots going to be shown? Web? Desktop app? Mobile app?

4) Due to the fact I will be hosting this on AWS I will need to delete the posts every couple of hours to keep the memory usage low so some type of expiration is mandatory.

Manual expiration is the right answer here; I can explain this momentarily.

My main concern is performance and I am interested in how to design the redis database.

In a single day (I will work out the math for random posts to do this) about ~500,000,000 posts will be generated.

That's just under 5800 posts/second on average. Are you trying to simulate Twitter from 5 years ago? 
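
(500,000,000 posts / 86,400 seconds in a day is roughly 5,787 posts/second.)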

My Incomplete ideas so far:

Idea 1

1) a post would be stored as such:

HSET [Geohash of location] [timestamp] [small text] (<-- the value will be used in a later feature to increment the number of manual modifications I make to a post).

You can't rely on timestamps being unique, even in a given geohash location. This is the case in reality and simulation, unless you arbitrarily constrain your simulation to prevent these kinds of things, in which case it might not be representative of what you are trying to simulate. You will have to explore this.

But either way, if this is the only information you are storing about your messages, then you will have to periodically scan all of your geohash cells. Even if your geohash is .1 by .1 degrees (roughly 11km x 11km at the equator), which is pretty coarse for a geohash, you're looking at potentially 500k-1 million active geohash cells if you limit yourself to only populated cells (in reality, which you'd have to constrain your robots to fit in).

As an alternative, consider this:

def post_message(conn, geohash, msg):
    # a counter gives a unique id; timestamps aren't guaranteed unique
    msg_id = conn.incr('message:id')
    # adjacent_cells() returns the cell plus its 8 neighbors (9 cells total)
    for cell in adjacent_cells(geohash):
        conn.hset(cell, msg_id, msg)
    # secondary index for cleanup: member "geohash:id", scored by id
    conn.zadd('message:idx', {'%s:%s' % (geohash, msg_id): msg_id})

Then cleanup becomes...

def clean_messages(conn, limit=1000000):
    # oldest entries first; removes all but roughly the newest `limit` entries
    # (assumes a connection created with decode_responses=True)
    messages = conn.zrange('message:idx', 0, -limit, withscores=True)
    for lid, score in messages:
        msg_id = str(int(score))  # scores come back as floats
        # get the plain geohash back from the "geohash:id" member
        geohash = lid.rpartition(':')[0]
        for cell in adjacent_cells(geohash):
            conn.hdel(cell, msg_id)
        conn.zrem('message:idx', lid)
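
One way to handle the manual expiration mentioned above would be to call clean_messages() from a cron job or a background worker every hour or two.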

2) I then would be able to get all the posts near a robot by sending the geohash location he is in. The downfall here is I would also need to include his 8 geohash neighbors, which would require 8 more queries. That is why I am also looking into the concept of spatial proximity for this feature.

HGETALL [GeoHash Location of robot] 

This would then return the field ([timestamp]) and value ("0");

The variant above puts the message in all 9 geo cells to make queries a simple HGETALL. If you need a timestamp (or other data), you can embed it in your message.

Now, if you don't want to duplicate data (understandably), you can do the 9 geocell queries. But here's the trick: you can do all 9 in a single network round trip to Redis using pipelining. You can also do much of the same to both the post_message() and clean_messages() functions above, which you can drop to 2 round trips each, or to 1 round trip each if you use Lua scripting.
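
For example, here's a rough sketch of the pipelined read with redis-py (adjacent_cells() is the same assumed helper as above, returning the robot's cell plus its 8 neighbors):

def get_nearby_messages(conn, geohash):
    # queue an HGETALL for the robot's cell and its 8 neighbors,
    # then send them all in one network round trip
    pipe = conn.pipeline(transaction=False)
    for cell in adjacent_cells(geohash):
        pipe.hgetall(cell)
    results = pipe.execute()
    # merge the nine {message_id: text} dicts into one
    messages = {}
    for cell_messages in results:
        messages.update(cell_messages)
    return messages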

There's also an alternate method that can cut this down to 4 cells, but it's a little strange to describe and implement.

3) Expiration of old posts. Since I can't use the EXPIRE command to delete fields from a hashset, I would need to scan through all the hashset fields periodically, find old timestamps, and remove them. Since Redis only allows pattern searching, this will be difficult when all the timestamps are different.

I talked about and addressed this part earlier.


And here is an alternate idea for limited-count geo messaging:

I gave a talk about this a couple of years ago. Slides and video (the link jumps to where I talk about this partitioning method):

This variant can also be improved.

Idea 2:

Use Redis-geo (https://matt.sh/redis-geo).

1) To store the posts I would run:

geoadd globalSet [posts_long] [posts_lat] "small text";

As a first step, yes.

You would also probably want to keep a record of that "small text", and/or record a separate identifier (like I did above), then use that identifier to store additional information, as well as use it as a key for later cleanups/deletion (like I did above).
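
A rough sketch of what that could look like with redis-py (the key names 'message:id', 'message:text', and 'message:by_time' are just placeholders I'm making up here; execute_command is used since the client may not have a GEOADD helper):

import time

def post_message_geo(conn, lng, lat, text):
    # a counter-generated id becomes the GEO set member instead of the raw text
    msg_id = conn.incr('message:id')
    pipe = conn.pipeline(transaction=False)
    # GEOADD key longitude latitude member
    pipe.execute_command('GEOADD', 'globalSet', lng, lat, msg_id)
    pipe.hset('message:text', msg_id, text)              # id -> small text
    pipe.zadd('message:by_time', {msg_id: time.time()})  # id indexed by post time
    pipe.execute()
    return msg_id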

2) To get all the post information for a robot nearby:

georadius nyc [robots_long] [robots_lat] [X] km

This would return all posts near the robot within X kms.

I think you mean:
georadius globalSet [robots_long] [robots_lat] [X] km

3) Then I am stuck on how to remove old posts.

You can use the same secondary index method I described above.
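
For instance, continuing the placeholder keys from the sketch above: since GEOADD members live in an ordinary sorted set, old posts can be dropped from the geo set with ZREM once their ids fall behind a time cutoff.

import time

def clean_old_messages(conn, max_age=2 * 3600):
    cutoff = time.time() - max_age
    # ids of posts older than the cutoff (scores are post timestamps)
    old_ids = conn.zrangebyscore('message:by_time', 0, cutoff)
    if not old_ids:
        return 0
    pipe = conn.pipeline(transaction=False)
    pipe.zrem('globalSet', *old_ids)        # remove the GEO members
    pipe.hdel('message:text', *old_ids)     # drop the stored text
    pipe.zrem('message:by_time', *old_ids)  # clear the time index
    pipe.execute()                          # batch in chunks if the backlog is huge
    return len(old_ids)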

Long story short, you have options. And these are only a few.

 - Josiah

Susan Lin

May 8, 2015, 1:44:26 PM
to redi...@googlegroups.com
Thank you for such a great and thorough reply!

1) I will create millions of random robots around the world which wander so they can have different geo coordinates (some robots can be in the exact same space).

Do you know how you are going to store information about these millions of random robots? Where they are? Who was last updated? Who should be updated next? 
Ans) Each robot (robot_id) is just a number from 1 to ~5,000,000. I will be using Java (with vertx to create many client instances). The program will randomly generate 5000-6000 posts every second from X robots. The robots that post will be randomly selected. The robots will exist in memory, and I will not be storing who was last updated.

2) Each robot will randomly send a post to the server (every few hours possibly on average) which will contain: a) the location of where the robot sent this data from (in either coordinates or geohash depending on the best implementation idea) b) some small text

 How do you know which robot will send a message next? Is this different or the same as how you store which robot next gets updated?
Ans) As mentioned in my answer above, random numbers will be generated to tell the program which robots should send posts next.

3) I will have a map with all the robots and would like to be able to click on a robot and get this information: a) all the posts which were posted nearby the robot I just clicked

I hope you're not thinking of showing your millions of robots on a map at the same time, as that has a good chance of killing your browser (if you're using Google Maps or Open Street Map overlays). Where are your robots going to be shown? Web? Desktop app? Mobile app?
Ans) This will be a web app. This is something I have thought about, and I will now just need to type in some coordinates and receive all the posts near them.

4) Due to the fact I will be hosting this on AWS I will need to delete the posts every couple of hours to keep the memory usage low so some type of expiration is mandatory.

Manual expiration is the right answer here, I can explain this momentarily.
My main concern is performance and I am interested in how to design the redis database.

In a single day (I will work out the math for random posts to do this) about ~500,000,000 posts will be generated.

That's just under 5800 posts/second on average. Are you trying to simulate Twitter from 5 years ago? 
Ans) I am in the process of learning redis, vertx 3.0, and Amazon cloud services. I am very interested in how well they will scale and how they perform under very heavy loads. 

Thank you very much for the links; I will read them right now! If you have any other ideas based on these answers, I would love to hear them!