How to keep and manipulate data in memory with web2py


cyan

May 7, 2012, 11:27:30 AM5/7/12
to web...@googlegroups.com

Hi group,

Is it possible for one to store incoming user data (e.g. data submitted by users on a page) in memory, manipulate it in memory, and send it back with web2py? Or do I need external modules/libraries to do that? So far, it seems that by default all user-submitted data are written to the database in web2py.

More specifically, I want to implement this sort of logic on the server side: the server waits for a pre-defined number of pieces of data from different users, and once all the data are in, the server processes the set of data, saves the results in the db, and sends them back to the respective users.

The most obvious way I can think of is to keep (and track) the incoming data in memory and only write to the db after they're processed. Is this something that we can do using the existing functionalities of web2py? Thanks!

Anthony

May 7, 2012, 11:35:49 AM5/7/12
to web...@googlegroups.com
If you just need to store data for a single user across requests, you can store it in the session (each user has a separate session). But if you need to store data from multiple users and then process it all together, you can store it in the cache: http://web2py.com/books/default/chapter/29/4#cache.
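To make the distinction concrete, here is a minimal stand-in (plain dicts with illustrative names, not web2py's actual session/cache objects) showing why per-user data fits the session while cross-user data needs the shared cache:

```python
# Plain dicts stand in for web2py's objects here: `sessions` mimics the
# per-user session, `shared_cache` mimics the process-wide cache.ram store.
sessions = {}       # one private dict per user, like web2py's session
shared_cache = {}   # one dict shared by all requests, like cache.ram

def handle_request(user_id, submitted):
    # session: scratch space invisible to other users
    sessions.setdefault(user_id, {})['last_value'] = submitted
    # cache: shared accumulator visible to every request
    shared_cache.setdefault('user_data', {})[user_id] = submitted
    return len(shared_cache['user_data'])

handle_request('alice', 1)
count = handle_request('bob', 2)
assert count == 2                      # the cache sees both users' data
assert 'bob' not in sessions['alice']  # sessions stay per-user
```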

Anthony

Jonathan Lundell

May 7, 2012, 11:41:30 AM5/7/12
to web...@googlegroups.com
On May 7, 2012, at 10:35 AM, Anthony wrote:
If you just need to store data for a single user across requests, you can store it in the session (each user has a separate session). But if you need to store data from multiple users and then process it all together, you can store it in the cache: http://web2py.com/books/default/chapter/29/4#cache.

That might be just slightly unsafe, don't you think, if a piece of data gets evicted before it's needed? I'd prefer to limit caching to data that can be recreated when it's not found in the cache.

An alternative might be a memory-based SQLite database, taking care not to let it leak (a consideration regardless of the implementation).
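For reference, the in-memory SQLite idea looks like this with the stdlib sqlite3 module (web2py itself would go through the DAL instead; table and column names here are made up):

```python
import sqlite3

# An in-memory database lives only as long as this connection, so the
# connection must be kept open across requests and purged explicitly.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE pending (user_id TEXT PRIMARY KEY, payload TEXT)')
conn.execute("INSERT INTO pending VALUES ('alice', 'data-1')")
conn.execute("INSERT INTO pending VALUES ('bob', 'data-2')")
(count,) = conn.execute('SELECT COUNT(*) FROM pending').fetchone()

# explicit purge once a batch has been processed, so the table can't leak
conn.execute('DELETE FROM pending')
(left,) = conn.execute('SELECT COUNT(*) FROM pending').fetchone()
```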

cyan

May 7, 2012, 11:59:13 AM5/7/12
to web...@googlegroups.com

That might be just slightly unsafe, don't you think, if a piece of data gets evicted before it's needed? I'd prefer to limit caching to data that can be recreated when it's not found in the cache.

An alternative might be a memory-based SQLite database, taking care not to let it leak (a consideration regardless of the implementation)

Thanks for the suggestions. But would a piece of data get removed from the cache arbitrarily/automatically? In other words, if we're sure of the size of data coming in and make sufficiently large cache for storing them, then would it be 100% safe to use cache to keep the data for my case here? Or am I missing something else?

Your second suggestion sounds similar to using Redis, in which case would there still be a risk of leaking? Thank you. 

Anthony

May 7, 2012, 12:05:01 PM5/7/12
to web...@googlegroups.com
If you just need to store data for a single user across requests, you can store it in the session (each user has a separate session). But if you need to store data from multiple users and then process it all together, you can store it in the cache: http://web2py.com/books/default/chapter/29/4#cache.

That might be just slightly unsafe, don't you think, if a piece of data gets evicted before it's needed? I'd prefer to limit caching to data that can be recreated when it's not found in the cache.

An alternative might be a memory-based SQLite database, taking care not to let it leak (a consideration regardless of the implementation).

Doesn't a memory-based db have the same problem -- you have to purge at some point and could therefore "evict" data before it's needed in that case as well?

Also, is there a way to persist a memory-based SQLite db in web2py? I thought they would only persist for a single request.

Anthony 

Jonathan Lundell

May 7, 2012, 12:31:05 PM5/7/12
to web...@googlegroups.com
Maybe so; that would be a problem, wouldn't it? Be nice to have a persistent memory-based db.

WRT cache persistence, it's one of those things that should normally work just fine, but I tend to be suspicious of my intuitions of this kind of thing. Nasty if it works great 99.9% of the time.

 My thinking re db is that you'd be able to implement an explicit purge policy, and only purge stuff you were willing to do without. Presumably that'd be necessary; there's no guarantee that every transaction would complete gracefully and in a bounded time, I suppose.

villas

May 7, 2012, 1:01:30 PM5/7/12
to web...@googlegroups.com
Hi Cyan, 

Databases are specially designed for keeping persistent data - so there is your answer!   :)

I suggest:
  1. Write the data initially to a transitional Sqlite DB on disk.
  2. Once all the data pieces have arrived,  migrate the completed data to your main DB and delete all the transitional records.
  3. It is much safer than what you have proposed.  Web2py can easily handle the two DBs.
  4. If you need any housekeeping, set up a scheduled job to purge all the old, incomplete stuff once in a while.
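The steps above can be sketched with sqlite3 directly (web2py's DAL would normally manage both databases; EXPECTED and the table/column names are invented for the example):

```python
import sqlite3

staging = sqlite3.connect(':memory:')  # the transitional db (on disk in practice)
main = sqlite3.connect(':memory:')     # the main db
staging.execute('CREATE TABLE pending (user_id TEXT, payload TEXT)')
main.execute('CREATE TABLE results (user_id TEXT, result TEXT)')

EXPECTED = 3  # hypothetical number of pieces we are waiting for

def receive(user_id, payload):
    # step 1: write each piece to the transitional db as it arrives
    staging.execute('INSERT INTO pending VALUES (?, ?)', (user_id, payload))
    (n,) = staging.execute('SELECT COUNT(*) FROM pending').fetchone()
    if n == EXPECTED:                  # step 2: all pieces are in
        rows = staging.execute('SELECT user_id, payload FROM pending').fetchall()
        for uid, data in rows:         # step 3: process and migrate to the main db
            main.execute('INSERT INTO results VALUES (?, ?)', (uid, data.upper()))
        staging.execute('DELETE FROM pending')  # step 4: clear the transitional db

for i in range(EXPECTED):
    receive('user%d' % i, 'piece-%d' % i)
```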
Regards, David

cyan

May 7, 2012, 5:03:38 PM5/7/12
to web...@googlegroups.com

If you just need to store data for a single user across requests, you can store it in the session (each user has a separate session). But if you need to store data from multiple users and then process it all together, you can store it in the cache: http://web2py.com/books/default/chapter/29/4#cache.

Anthony

Thanks Anthony. The data come from multiple users in my case.

After looking at the caching mechanism provided by web2py, I have a rough design in pseudo-code as below.

in controller:

user_data = cache.ram('user_data', lambda: dict(), time_expire=None)

# add the data from this user; this should also update the cached dict?
user_data[this_user_id] = submitted_data

if len(user_data) == some_pre_defined_number:
    # get data out of the dictionary 'user_data' by user ids
    ...
    # process them and persist results to the db
    ...
    # reset the 'user_data' dict by removing it from cache
    cache.ram('user_data', None)

The above implies that the 'if' check is made upon every request by individual users, which is somewhat inefficient, but I am not sure if there are better ways to implement this logic within web2py. If so, please enlighten me!

In addition, I have some specific questions regarding the cache mechanism in web2py:
1. when exactly does an object expire in cache? My understanding is that, as long as the 'time_expire' value of the current request is greater than the gap between the time of current request and the time at which the requested object was last updated in cache, then the object behaves as if it never expires in cache. This is illustrated by an example in the book, if I understand it correctly. Therefore, the only perceivable way (to me) for an object to expire is to supply a 'time_expire' value in a request that is smaller than the gap between the time of this request and the time when the requested object was updated in cache. In other words, if I do:

something = cache.ram('something', lambda:'blah blah', time_expire=1)

effectively, 'something' never expires in cache until I call something like the following after t seconds (with t > make_it_expire):

something = cache.ram('something', lambda:'blah blah', time_expire= make_it_expire)

2. A follow-up question: when an object eventually expires in the cache, does that mean it is completely removed from the cache, i.e. equivalent to:

cache.ram(key, None)

or something else happens?

3. The manual says that setting 'time_expire = 0' forces the cached object to be refreshed. Am I correct in understanding that the cached object will be immediately set to whatever is supplied by the second argument of the 'cache.ram' call? Thus 'refreshed' here means 'updated'.

Your thoughts and comments are much appreciated. Thanks!

Anthony

May 7, 2012, 6:07:39 PM5/7/12
to web...@googlegroups.com
In general, the approach David suggests (https://groups.google.com/d/msg/web2py/4cBbis_B1i0/KkKnNwUw8lcJ) is probably preferable, but below are answers to your questions...

user_data = cache.ram('user_data', lambda:dict(), time_expire=None)

# add the data from this user, this should also update the cached dict?
user_data[this_user_id] = submitted_data

The above would not update the dict in the cache -- you'd have to do that explicitly:

cache.ram('user_data', lambda: user_data, time_expire=[some positive value])

The above implies that the 'if' check is made upon every request by individual users, which is somewhat inefficient, but I am not sure if there is other better way(s) to implement this logic within web2py.

You're already reading the object from the cache on each request -- checking the length should not be a big deal. Note, it doesn't necessarily have to be on every request -- just requests that deal with this aspect of the application (i.e., only in the relevant controller function).
 
1. when exactly does an object expire in cache? My understanding is that, as long as the 'time_expire' value of the current request is greater than the gap between the time of current request and the time at which the requested object was last updated in cache, then the object behaves as if it never expires in cache. This is illustrated by an example in the book, if I understand it correctly. Therefore, the only perceivable way (to me) for an object to expire is to supply a 'time_expire' value in a request that is smaller than the gap between the time of this request and the time when the requested object was updated in cache. In other words, if I do:

something = cache.ram('something', lambda:'blah blah', time_expire=1)

effectively, 'something' never expires in cache until I call something like the following after t seconds (with t > make_it_expire):

something = cache.ram('something', lambda:'blah blah', time_expire= make_it_expire)

Yes, that's how it works.
 
2. A follow-up question: when an object eventually expires in the cache, does that mean it is completely removed from the cache, i.e. equivalent to:

cache.ram(key, None)

or something else happens?

No, expiring simply means that it gets updated with the new value instead of the existing value being retrieved. So, in your first example, if make_it_expire is less than the elapsed time since the object was last saved, the "something" key would not be deleted from the cache -- rather, the value associated with the "something" key would be changed to "blah blah" (which in this case happens to be the same as the old value, but you get the idea).

Perhaps "time_expire" isn't the best name, as it may imply that the object will be expunged after some elapsed expiration time. Maybe something like "update_if_elapsed_time" would be more descriptive.
 
3. The manual says that by setting 'time_expire = 0', it forces the cached object to be refreshed. Am I correct in understanding this as the cached object will be immediately set to whatever is supplied by the second argument of the 'cache.ram' call? Thus 'refreshed' here means 'updated'.

Yes. 

Anthony

cyan

May 7, 2012, 8:16:45 PM5/7/12
to web...@googlegroups.com
Thanks David. This approach ensures better data consistency, but I have two concerns:

1. If we store everything in the database, how should we track whether all the pieces of data have arrived? For example, if I am expecting 10 pieces of data from 10 users, do I then need to constantly poll the db to check this? Is there any other tracking/trigger mechanism available?

2. It seems to me that the following database operations are needed: 1. write to the transitional db; 2. check for data arrival; 3. once all data have arrived, read from the transitional db so we can process them; 4. write the results to the main db; 5. delete all data from the transitional db. Added together, all these db operations may be substantial, not to mention that this process may need to be repeated many times. In this respect, the cached solution seems to shine performance-wise.

Would love to hear your thoughts on the above. Thanks! 

cyan

May 7, 2012, 8:31:05 PM5/7/12
to web...@googlegroups.com

In general, the approach David suggests (https://groups.google.com/d/msg/web2py/4cBbis_B1i0/KkKnNwUw8lcJ) is probably preferable, but below are answers to your questions...

user_data = cache.ram('user_data', lambda:dict(), time_expire=None)

# add the data from this user, this should also update the cached dict?
user_data[this_user_id] = submitted_data

The above would not update the dict in the cache -- you'd have to do that explicitly:

cache.ram('user_data', lambda: user_data, time_expire=[some positive value])

Thank you for your insights, Anthony.

Regarding the pros and cons of this approach (in comparison to David's database approach), I wonder what the potential pitfalls/risks of the cache approach are. For example, but not limited to:

1. Is there any consistency issue for the data stored in web2py cache?

2. Is there any size limit on the data stored in web2py cache?

3. Is it thread-safe? For instance, if I have two threads A and B (two requests from different users) trying to access the same object (e.g. 'user_data' dict) stored in the cache at the same time, would that cause any problem? This especially concerns the corner case where A and B bear the very last two pieces of data expected to meet 'some_pre_defined_number'.

Thanks!

Anthony

May 7, 2012, 8:44:05 PM5/7/12
to web...@googlegroups.com
Regarding the pros and cons of this approach (in comparison to David's database approach), I wonder what are the potential pitfalls/risks of the cache approach. For examples, but not limited to,

1. Is there any consistency issue for the data stored in web2py cache?

Yes, this could be a problem. The nice thing about using a db with transactions is that you can ensure any operations get rolled back if an error occurs. In web2py, each request is wrapped in a transaction, so if there is an error during the request, any db operations during that request are rolled back. The other issue is volatility -- if your server goes down, you lose the contents of RAM, but not what's stored in the db (or written to a file).
 
2. Is there any size limit on the data stored in web2py cache?

I think it's just limited to the amount of RAM available.
 
3. Is it thread-safe? For instance, if I have two threads A and B (two requests from different users) trying to access the same object (e.g. 'user_data' dict) stored in the cache at the same time, would that cause any problem? This especially concerns the corner case where A and B bear the very last two pieces of data expected to meet 'some_pre_defined_number'.

That's a good point -- your current design would introduce a potential race condition problem (I was originally thinking each request would add separate entries to the cache, not update a single object). Of course, a db could have a similar problem if you were just repeatedly updating a single record (rather than inserting new records each time).

Anthony

villas

May 8, 2012, 5:45:21 AM5/8/12
to web...@googlegroups.com
In my opinion,  using a DB is easier, more logical, transparent and makes your data more accessible. 

The function that is used to process the data on arrival can check whether all pieces of data are present.  If so,  then after saving the last piece it can then go on to migrate them to your main DB and delete them from the transitional DB.  It could,  if required,  also go on to scan the DB and expunge any old incomplete records afterwards.  In my description I broke down the logic into steps,  but really the one function receiving the data can do the whole thing.  

This code should not be too difficult to write and there are literally hundreds of examples of using DBs from which you can learn.  This community is more experienced in using DBs to store data rather than cache so your questions will be more easily answered.

I recommend that you write down the logic of what you would like the function to do and start writing some code.  Come back here when you get stuck.

Regards,  D

cyan

May 8, 2012, 6:22:03 PM5/8/12
to web...@googlegroups.com

Regarding the pros and cons of this approach (in comparison to David's database approach), I wonder what are the potential pitfalls/risks of the cache approach. For examples, but not limited to,

1. Is there any consistency issue for the data stored in web2py cache?

Yes, this could be a problem. The nice thing about using a db with transactions is that you can ensure any operations get rolled back if an error occurs. In web2py, each request is wrapped in a transaction, so if there is an error during the request, any db operations during that request are rolled back. The other issue is volatility -- if your server goes down, you lose the contents of RAM, but not what's stored in the db (or written to a file).

For each request, as long as it reaches the action without problems, I suppose the subsequent update to the cache should be fairly straightforward. For the request which carries the last piece of data to meet the trigger value, it will perform the processing as well as the write to the db, but as you said, this would be wrapped in a transaction. Regarding volatility, in my case it doesn't matter whether the data have been written to the db at the point of crash -- if it crashes, the whole process has to start from scratch.


3. Is it thread-safe? For instance, if I have two threads A and B (two requests from different users) trying to access the same object (e.g. 'user_data' dict) stored in the cache at the same time, would that cause any problem? This especially concerns the corner case where A and B bear the very last two pieces of data expected to meet 'some_pre_defined_number'.

That's a good point -- your current design would introduce a potential race condition problem (I was originally thinking each request would add separate entries to the cache, not update a single object). Of course, a db could have a similar problem if you were just repeatedly updating a single record (rather than inserting new records each time).

I've looked deeper into this, and in web2py doc: http://www.web2py.com/examples/static/epydoc/web2py.gluon.cache.CacheInRam-class.html, it mentions: This is implemented as global (per process, shared by all threads) dictionary. A mutex-lock mechanism avoid conflicts.

Does this mean that when a request thread is accessing and modifying the content (e.g. a dictionary in my case) of the cache, every other thread is blocked and has to wait till the current request thread finishes with it? If so, it seems to me that the race condition we feared above should not happen. Please correct me if I get this wrong. Thanks.
 

Anthony

May 8, 2012, 10:23:11 PM5/8/12
to web...@googlegroups.com
user_data = cache.ram('user_data', lambda:dict(), time_expire=None)

# add the data from this user, this should also update the cached dict?
user_data[this_user_id] = submitted_data

The above would not update the dict in the cache -- you'd have to do that explicitly:

Scratch that. Apparently cache.ram returns a reference rather than a copy, so you can directly update the dict, as you have above.

Anthony 

Anthony

May 8, 2012, 10:36:02 PM5/8/12
to web...@googlegroups.com
I've looked deeper into this, and in web2py doc: http://www.web2py.com/examples/static/epydoc/web2py.gluon.cache.CacheInRam-class.html, it mentions: This is implemented as global (per process, shared by all threads) dictionary. A mutex-lock mechanism avoid conflicts.

Does this mean that when a request thread is accessing and modifying the content (e.g. a dictionary in my case) of the cache, every other thread is blocked and has to wait till the current request thread finishes with it? If so, it seems to me that the race condition we feared above should not happen. Please correct me if I get this wrong. Thanks.

Well, because cache.ram returns a reference to the object rather than a copy, I think you're OK in terms of avoiding conflicts when updating the dict (if you're just adding new keys). However, you might have to think about what happens when you hit the required number of entries. What if one request comes in with the final entry, but while that request is still processing (before it has cleared the cache), another request comes in -- what do you do with that new item? It may be workable, but you'll have to think carefully about how it should work.
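One way to make that final-entry check atomic (an assumption about application-level design, not something web2py does for you) is to guard the shared dict with your own lock, so exactly one request can observe the dict full and reset it:

```python
import threading

user_data = {}                 # the shared accumulator (stands in for the cached dict)
lock = threading.Lock()
EXPECTED = 3                   # hypothetical batch size
batches = []                   # completed batches, kept here for demonstration

def submit(user_id, payload):
    with lock:                 # add-and-check happens atomically
        user_data[user_id] = payload
        if len(user_data) == EXPECTED:
            batches.append(dict(user_data))  # process this complete batch
            user_data.clear()                # reset for the next round

# six distinct users submitting concurrently -> exactly two complete batches
threads = [threading.Thread(target=submit, args=('u%d' % i, i)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```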

Anthony

cyan

May 9, 2012, 9:55:59 AM5/9/12
to web...@googlegroups.com

I've looked deeper into this, and in web2py doc: http://www.web2py.com/examples/static/epydoc/web2py.gluon.cache.CacheInRam-class.html, it mentions: This is implemented as global (per process, shared by all threads) dictionary. A mutex-lock mechanism avoid conflicts.

Does this mean that when a request thread is accessing and modifying the content (e.g. a dictionary in my case) of the cache, every other thread is blocked and has to wait till the current request thread finishes with it? If so, it seems to me that the race condition we feared above should not happen. Please correct me if I get this wrong. Thanks.

Well, because cache.ram returns a reference to the object rather than a copy, I think you're OK in terms of avoiding conflicts when updating the dict (if you're just adding new keys). However, you might have to think about what happens when you hit the required number of entries. What if one request comes in with the final entry, but while that request is still processing (before it has cleared the cache), another request comes in -- what do you do with that new item? It may be workable, but you'll have to think carefully about how it should work.

Thanks for confirming this. The only kind of updates would be adding new key-value pairs. Having confirmed that this sort of updates will be consistent on the cache, I am now a little worried about the performance of doing so, given the mention of the mutex-lock design of the cache.

If I understand this correctly, each thread (from a request) will lock the cache so that all other threads (requests) will have to wait. I intend to store multiple dictionaries (say 10) in the cache, and each dictionary will handle the data from a fixed set of users (say 30 of them) for a given period of time. If the cache truly behaves as above, then when one thread is updating the cache, all the other 10 * 30 - 1 = 299 threads will be blocked and will have to wait. This might drag down the efficiency of the server side.

Anthony

May 9, 2012, 2:13:09 PM5/9/12
to web...@googlegroups.com
Thanks for confirming this. The only kind of updates would be adding new key-value pairs. Having confirmed that this sort of updates will be consistent on the cache, I am now a little worried about the performance of doing so, given the mention of the mutex-lock design of the cache.

If I understand this correctly, each thread (from a request) will lock the cache so that all other threads (requests) will have to wait. I intend to store multiple dictionaries (say 10) in the cache, and each dictionary will handle the data from a fixed set of users (say 30 of them) for a given period of time. If the cache truly behaves as above, then when one thread is updating the cache, all the other 10 * 30 - 1 = 299 threads will be blocked and will have to wait. This might drag down the efficiency of the server side.

As far as I can tell, the ram cache is not locked for the entire duration of the request -- it is only locked very briefly to delete keys, update access statistics, etc. So, I don't necessarily think this will pose a performance problem.

Anthony

cyan

May 9, 2012, 2:39:26 PM5/9/12
to web...@googlegroups.com
That's great news! May I ask where you found these details in the source code? I just want to double-check and make sure it is the case, as this is important to my design and implementation. Many thanks again!

Anthony

May 9, 2012, 3:03:16 PM5/9/12
to web...@googlegroups.com
Here are the __init__ and __call__ functions for the CacheInRam class:

Looks like they both acquire and release locks during the course of the function call, but they don't acquire and hold the lock beyond the duration of the function call (which should be very short). I also did a quick test involving two Ajax calls from a page. One waited a second before accessing the cache, and the other waited several seconds after accessing the cache. The function that waited to access the cache was still able to return first, so it did not have to wait for the other function to complete.
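A simplified sketch of the pattern described (not the actual gluon.cache source): the lock is held only around the dictionary bookkeeping, never across the whole request, which is why one slow request doesn't serialize all the others.

```python
import threading
import time

_lock = threading.Lock()
_storage = {}  # key -> (stored_at, value)

def cache_ram(key, f, time_expire=300):
    now = time.time()
    with _lock:                         # held only for a quick dict lookup
        item = _storage.get(key)
    stale = item is None or (time_expire is not None and now - item[0] > time_expire)
    if not stale:
        return item[1]
    value = f()                         # possibly slow; runs without the lock
    with _lock:                         # held only for a quick dict store
        _storage[key] = (now, value)
    return value

calls = []
def compute():
    calls.append(1)
    return 42

assert cache_ram('x', compute) == 42    # computed once
assert cache_ram('x', compute) == 42    # served from the cached entry
assert len(calls) == 1
```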

Anthony

cyan

May 9, 2012, 4:21:26 PM5/9/12
to web...@googlegroups.com

Very nice. Thank you!