Redis caching

Arnon Marcus

Jan 16, 2014, 4:17:31 AM
to web...@googlegroups.com
I noticed that the current implementation for web2py uses pickles.
That is a design choice. There are pros and cons.
Right off the top of my head, the biggest cons may be restricting cache use to python, and performance penalties.
When I think of all that redis can do, I cannot help imagining a better solution - especially for caching query results.
All result-sets are flat and simple in nature - before the DAL steps in and converts them to Row objects. This makes them an ideal candidate for redis.
Has anyone thought of this already?
A simplistic (naive) solution would be to store every result in a hash, and store all the ids in a sorted set. This way, the result-set in the cache may be queried by redis, and not necessarily pulled in an all-or-nothing fashion, improving read performance and resource usage dramatically, while opening the possibility for external non-python processes to access the cache by talking to redis directly.
It may not be desirable for all use-cases, as there are obvious security concerns, but for IPC stuff and/or intranet applications (which are a common use-case in the web2py world), this can be most beneficial.
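Something along these lines, as a very rough sketch (assuming redis-py with the newer 3.x-style API; the "person" table and its fields are made up just for illustration):

# Rough sketch of the hash + sorted-set idea, using redis-py (3.x-style API assumed).
# The 'person' table and its fields are made up for illustration.
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

def cache_resultset(table, rows):
    # store each record as a hash, and index the ids in a sorted set
    pipe = r.pipeline()
    for row in rows:                                   # rows: flat dicts from the DB driver
        key = '%s:%s' % (table, row['id'])
        pipe.hset(key, mapping=row)                    # one hash per record
        pipe.zadd('%s:ids' % table, {key: row['id']})  # score = id
    pipe.execute()

def fetch_slice(table, start=0, stop=-1):
    # pull only a slice of the cached result-set, not all-or-nothing
    keys = r.zrange('%s:ids' % table, start, stop)
    pipe = r.pipeline()
    for key in keys:
        pipe.hgetall(key)
    return pipe.execute()

cache_resultset('person', [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}])
print(fetch_slice('person', 0, 0))   # only the first record is pulled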
What do you say?

Derek

Jan 16, 2014, 4:11:04 PM
to web...@googlegroups.com
Nope, nobody thought of it already. You should patent the idea!

I say you should implement it. And while you are at it, add in an ORM and ZeroMQ.

Niphlod

Jan 16, 2014, 4:16:20 PM
to web...@googlegroups.com


On Thursday, January 16, 2014 10:17:31 AM UTC+1, Arnon Marcus wrote:
I noticed that the current implementation for web2py uses pickles.
That is a design choice. There are pros and cons.
Right off the top of my head, the biggest cons may be restricting cache use to python, and performance penalties.

cache doesn't cache only resultsets, hence pickle is the only possible choice.
 
When I think of all that redis can do, I cannot help imagining a better solution - especially for caching query results.
All result-sets are flat and simple in nature - before the DAL steps in and converts them to Row objects. This makes them an ideal candidate for redis.
Has anyone thought of this already?
A simplistic (naive) solution would be to store every result in a hash, and store all the ids in a sorted set. This way, the result-set in the cache may be queried by redis, and not necessarily pulled in an all-or-nothing fashion, improving read performance and resource usage dramatically, while opening the possibility for external non-python processes to access the cache by talking to redis directly.
It may not be desirable for all use-cases, as there are obvious security concerns, but for IPC stuff and/or intranet applications (which are a common use-case in the web2py world), this can be most beneficial.
What do you say?

It's cool. Actually, I started developing something like that using DAL callbacks, but as soon as multiple tables are involved, with FKs and such, it starts to lose "speed". Also, your whole app needs to be coded à la "ActiveRecord", i.e. fetch only by PK.
BTW, I'm not entirely sure that fetching 100 records with 100 calls to redis is faster than pulling a pickle of 1000 records in a single call and discarding what you don't need.

BTW2: ORMs are already there: redisco and redis-lympid

Arnon Marcus

Jan 16, 2014, 5:57:05 PM
to web...@googlegroups.com
Derek: Are you being sarcastic and mean?

 
cache doesn't cache only resultsets, hence pickle is the only possible choice.
 

Well, not if you only need flat and basic objects - there the benefit of pickle is moot and its overhead is obvious - take a look at this project:
 
It's cool. Actually, I started developing something like that using DAL callbacks, but as soon as multiple tables are involved, with FKs and such, it starts to lose "speed". Also, your whole app needs to be coded à la "ActiveRecord", i.e. fetch only by PK.

Hmmm... Haven't thought of that... Well, you can't search/query for specific records by their hashed values, but that's not the use-case I was thinking about - I am not suggesting "replacing" the DAL... Plus, that restriction would also exist when using pickles for such a use-case...
What I had in mind is simpler than that - just take a bunch of simple queries that you would cache in cache.ram anyway, and instead cache their "raw" result-set (before it is parsed into a "Rows" object) as-is (almost...) - that would be faster to load into the cache than into cache.ram, and also faster to retrieve.
 
BTW, I'm not entirely sure that fetching 100 records with 100 calls to redis is faster than pulling a pickle of 1000 records in a single call and discarding what you don't need.

Hmmm... I don't know, redis is famous for crunching somewhere in the order of 500K requests per second - have you tested it?
 
BTW2: ORMs are already there: redisco and redis-lympid

Thanks, I'll take a look - though I think an ORM would defeat the purpose (in terms of speed) and would be overkill...

Anthony

Jan 16, 2014, 6:10:33 PM
to web...@googlegroups.com
What I had in mind is simpler than that - just take a bunch of simple queries that you would cache in cache.ram anyway, and instead cache their "raw" result-set (before it is parsed into a "Rows" object) as-is (almost...)

Note, when you do .select(..., cache=...), it does in fact just cache the raw result set from the database -- it does not parse it into a Rows object and pickle/cache the Rows object. You can do that as well, if you instead do .select(..., cache=..., cacheable=True), though the Row objects will be missing some functionality in that case.
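For example (a minimal sketch - db.product is just a placeholder table):

# caches only the raw result set coming back from the DB adapter (for 60 seconds);
# it is re-parsed into a Rows object on every request:
rows = db(db.product.price > 0).select(cache=(cache.ram, 60))

# also caches the parsed Rows object (faster on cache hits), but the returned
# rows lose some functionality, e.g. update_record/delete_record:
rows = db(db.product.price > 0).select(cache=(cache.ram, 60), cacheable=True)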

Anthony

Derek

Jan 27, 2014, 2:44:30 PM
to web...@googlegroups.com
Not at all. I'm simply suggesting you get a patent for your idea right away and start implementing it, so we can see just how awesome the idea really is. I can only imagine!

Niphlod

Jan 27, 2014, 3:16:03 PM
to web...@googlegroups.com

Hmmm... I don't know, redis is famous for crunching somewhere in the order of 500K requests per second - have you tested it?
 
I thought it was clear; anyway, I reiterate: yep, it was clearly losing performance with each step of complexity added. I'm a big fan of NoSQL, but please do mind that relational engines are backed by decades of development and millions of testing users.
NoSQL strips out some of the major complexities of relational engines to work at warp speed. If you need only a really small subset of operations then, by all means, NoSQL engines are a cool addition to the toolbelt. However, when you start adding the cool features of relational engines back into NoSQL (basically reinventing the wheel), performance drops considerably.

Jufsa Lagom

Feb 6, 2014, 5:44:37 PM
to web...@googlegroups.com
Hello Arnon.

I just made a quick search of your posts in the other groups on groups.google.com...

In many (almost all) of the groups where you have posted, you run into arguments with longtime members/contributors who have put a huge amount of time into the projects.

You say yourself in many posts that you are inexperienced in the subjects being discussed.
Then perhaps it's good to take a more humble approach when presenting your questions/statements?
I can only speak for myself, but I would at least take that approach if I had a question for the community...

Don't misunderstand me, new ideas and fresh insights are always good...
But when an idea meets massive resistance in a community and doesn't seem to get any traction, then perhaps it shouldn't be forced with endless arguments just to "win"?

Sorry for the OT, and this is just a friendly hint from an old newsgroup user :)

--
Kind Regards
Jufsa Lagom

Derek

Feb 18, 2014, 2:27:05 PM
to web...@googlegroups.com
>endless arguments just to "win"? 

I don't think it's that. I think that people who consider themselves "idea men" are generally lazy people who don't want to do any of the work, but want to take credit for it. They discount the amount of time that developers put into a project and state that they could do it better (if they could just be bothered to implement their idea, which happens to be too simple for them to bother with). I was merely suggesting that the best way to handle such people is to say 'it is a wonderful idea! people might steal it! better be the first to implement it yourself and then patent it!' What I've seen is that they usually shut up about their great 'new idea', and maybe they learn that programming isn't as easy or 'simple' as they thought it was.

Arnon Marcus

Feb 18, 2014, 4:15:09 PM
to web...@googlegroups.com
I like how in this medium people can talk behind your back and in your face at the same time! :P

I actually invested about 2 weeks (both at work AND at home) experimenting with MANY different options for storing and retrieving data in redis, using all the structure types, with both generic (procedurally-generated) data and our own real-world data. It started out as a pet project, but it mushroomed into a very detailed and flexible "py-redis-benchmarking tool", which I have every intention of sharing on github - I think it's over 1k LOC already... You basically tell it which benchmark combination(s) you wish to run, and it prints the results in a nicely organized table. If you choose to use the procedurally-generated data (for synthetic benchmarking) you can define each of the 3 dimensions it has (keys, records, fields), to see how each affects each redis storage option (lists, sets, hashes, etc.). So you can get a feel for how "scale" behaves as a factor of influence on the benefits/trade-offs of each storage option. I think I will add graph plotting for IPython, just for the fun of it...

In conclusion:
A major performance factor is the number of round-trips to redis, so I made heavy use of "pipeline". But it turns out that the next major performance factor after that is the manipulation that needs to happen to the data in python, pre-store and post-retrieval, in order to fit the data into the redis structures. It turns out that - at least for bulk store/retrieval (pipeline usage) - the overhead of fitting a data structure into redis outweighs the benefits, sometimes by orders of magnitude. Perhaps if an application is written to use redis as a database it would be worth it, as updating a specific value "nested" inside a redis structure "may" be faster than having to pull an entire "key" of serialized data - but that's not the use-case we're talking about for "caching" in web2py.

So, the *tl;dr;*  version of it, is:
"Flat key-value store of serialized data is fastest for bulk-store/retrieval"

* Especially when using "hiredis" (a python wrapper around a "c-compiled" redis client - that's orders of magnitude faster...)
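In code terms, the winning pattern boils down to something like this (a simplified sketch, assuming redis-py; installing the hiredis package is enough for redis-py to pick up its C parser):

# Simplified sketch of the flat key/value + pipeline pattern (redis-py assumed).
try:
    import cPickle as pickle    # python 2: the C implementation discussed above
except ImportError:
    import pickle               # python 3: the C implementation is the default

import redis

r = redis.StrictRedis()

def bulk_store(values):
    # values: dict of cache-key -> flat python structure
    pipe = r.pipeline()
    for key, value in values.items():
        pipe.set(key, pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
    pipe.execute()              # one round-trip for all the SETs

def bulk_fetch(keys):
    # single MGET round-trip; missing keys come back as None
    blobs = r.mget(keys)
    return [pickle.loads(b) if b is not None else None for b in blobs]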

Then I went to testing many serialization formats/libraries:
- JSON (pure-python)
- simplejson (with "c-compiled" optimizations)
- cjson (a "c-compiled" library w/Python-wrapper)
- ujson (a "c-compiled" library w/Python-wrapper)
- pickle (pure-python)
- cPickle (a "c-compiled" library w/Python-wrapper)
- msgpack (with "c-compiled" optimizations)
- u-msgpack (pure-python)
- marshal

Results:
- all pure-python options are slowest (unsurprising)
- simplejson is almost as fast as cjson when used with c-compiled-optimization, and is more maintained, so no use for cjson.
- cPickle is almost as fast as marshal, and is platform/version agnostic, so no use for marshal.
- ujson is only faster than simplejson for very long (and flat) lists, and is less maintained/popular/mature.

So, that leaves us with:
- simplejson
- cPickle
- msgpack

- cPickle is actually "slowest", AND is python-only.
- With either simplejson or msgpack, you can read the data from redis from non-python clients AND they both (surprisingly) handle unicode really well..
- msgpack is roughly x2 faster than simplejson, but is less-readable in a redis-GUI.
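For the curious, the core of the comparison boils down to a loop like this - a stripped-down sketch on synthetic data, not the full tool:

# Stripped-down sketch of the serializer comparison (synthetic data only;
# the real tool also varies the keys/records/fields dimensions).
import timeit

try:
    import cPickle as pickle
except ImportError:
    import pickle
import simplejson
import msgpack

records = [{'id': i, 'name': 'user%d' % i, 'score': i * 1.5} for i in range(1000)]

candidates = {
    'cPickle': lambda: pickle.dumps(records, pickle.HIGHEST_PROTOCOL),
    'simplejson': lambda: simplejson.dumps(records),
    'msgpack': lambda: msgpack.packb(records),
}

for name, dump in candidates.items():
    print('%-10s %.3fs' % (name, timeit.timeit(dump, number=1000)))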

However:
When using simplejson or msgpack, once you introduce "DateTime" values you need to process the results in python by hooking into the parsers... Once you do that, all the performance gain is nullified...
So cPickle becomes the fastest, as it generates the python "DateTime" objects at the C level...
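For reference, this is the kind of hook I mean (a sketch for simplejson; msgpack needs an equivalent default/object_hook pair):

# Sketch of the datetime round-tripping hooks that eat the performance gain.
import datetime
import simplejson

ISO = '%Y-%m-%dT%H:%M:%S'

def encode_default(obj):
    # called from python for every value the encoder can't handle natively
    if isinstance(obj, datetime.datetime):
        return {'__dt__': obj.strftime(ISO)}
    raise TypeError(repr(obj))

def decode_hook(d):
    # called from python for every decoded dict - this is where the speed goes
    if '__dt__' in d:
        return datetime.datetime.strptime(d['__dt__'], ISO)
    return d

blob = simplejson.dumps({'created': datetime.datetime.now()}, default=encode_default)
data = simplejson.loads(blob, object_hook=decode_hook)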

So I ended up where I started, coming full circle back to flat keys with cPickle...

The only benefit I ended up gaining is from re-factoring our high-level cache data-structure, on top of redis_cache.py, so that it does bulk retrieval and smart refreshes - but I'm not sure I can share that code...

We are now doing a bulk-get of our entire redis-cache on every request.
It has over 100 keys, some very small and some with hundreds of nested records. We narrowed it down to 16 ms per request (best case), which is good enough for me.

We basically have a class in a module which instantiates a non-thread-local singleton, once per process. It has an ordered dictionary of "keys" mapped to "lambdas". We call it the "cache-catalog". The results are stored in a regular dictionary (which is thread-local) that maps the keys to their respective resulting values. On each request, a bulk-get is issued with a list of all the keys (which we already have - it's the keys of the catalog plus a "w2py:<app>:" prefix, so we don't even need to store them in redis in a separate set... and we still don't have to use the infamous "KEYS" redis command...), and since the catalog is an ordered dictionary, we know which value in the result maps to which key. So we know the "None" values represent the keys that are currently "missing" in redis, due to a deletion triggered by a cache-update on another request/thread/process. So we get a list of "missing keys" that we just run through in a regular for-loop, generating new values using the regular cache mechanism (which triggers the lambdas) - so we only update what's missing.
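To give a feel for the pattern, here is a heavily simplified sketch (not our actual code - the names, the prefix and the catalog entries are made up):

# Heavily simplified sketch of the "cache-catalog" pattern described above.
from collections import OrderedDict

try:
    import cPickle as pickle    # python 2: the C implementation
except ImportError:
    import pickle               # python 3

import redis

r = redis.StrictRedis()
PREFIX = 'w2p:myapp:'           # per-application key prefix (made up here)

# keys mapped to lambdas that can (re)generate the values; in the real thing
# these would be DAL selects, trivial stand-ins are used here.
catalog = OrderedDict([
    ('countries', lambda: ['US', 'IL', 'IT']),
    ('settings', lambda: {'theme': 'dark', 'page_size': 25}),
])

def load_cache():
    keys = [PREFIX + k for k in catalog]     # we already know every key
    blobs = r.mget(keys)                     # single bulk round-trip, no KEYS command
    results = {}
    for name, blob in zip(catalog, blobs):
        if blob is None:                     # key was deleted/expired elsewhere
            value = catalog[name]()          # regenerate only what's missing
            r.set(PREFIX + name, pickle.dumps(value, pickle.HIGHEST_PROTOCOL))
        else:
            value = pickle.loads(blob)
        results[name] = value
    return results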

This turns out to be extremely efficient, fast and resilient.
I suggest this approach be factored into the redis_cache.py file itself somehow...
Not sure I can share that code though... (legally...)

Anyway, I hope this sums up the topic, and I hope some people learned something from this summary of my experience.
If not, hey, what do I know, I'm just an "idea guy" after all, right? :P

I'll be posting a link to the git-repo of the benchmark-code in a few days, after I clean it up a bit...

