Problem saving dictionaries in session

189 views
Skip to first unread message

Anthony

unread,
Mar 15, 2016, 3:55:51 PM3/15/16
to web2py-developers
Please see https://groups.google.com/forum/#!topic/web2py/EHDQrNI1-ok? Anyone have any ideas?

Anthony

Leonel Câmara

unread,
Mar 15, 2016, 7:33:22 PM3/15/16
to web2py-developers
I'm guessing this has something to do with the order of keys in a dictionary given that dictionaries are unordered data structures. So sometimes the session will load itself with a different order triggering the overwrite. In fact in globals.py we have a sorting_dumps function defined but it doesn't seem to be used anywhere as Session uses the regular pickle.dumps and not sorting_dumps.

I think this is fixable but I'm inclined to think we should actually do a breaking change here. Our sessions are incredibly slow even if they don't write to disk just on account of doing a pickle and a hash every single time. 

Massimo DiPierro

unread,
Mar 15, 2016, 7:47:38 PM3/15/16
to web2py-d...@googlegroups.com
How would you change them?

On Mar 15, 2016, at 6:33 PM, Leonel Câmara <leonel...@gmail.com> wrote:

I'm guessing this has something to do with the order of keys in a dictionary given that dictionaries are unordered data structures. So sometimes the session will load itself with a different order triggering the overwrite. In fact in globals.py we have a sorting_dumps function defined but it doesn't seem to be used anywhere as Session uses the regular pickle.dumps and not sorting_dumps.

I think this is fixable but I'm inclined to think we should actually do a breaking change here. Our sessions are incredibly slow even if they don't write to disk just on account of doing a pickle and a hash every single time. 

--
-- mail from:GoogleGroups "web2py-developers" mailing list
make speech: web2py-d...@googlegroups.com
unsubscribe: web2py-develop...@googlegroups.com
details : http://groups.google.com/group/web2py-developers
the project: http://code.google.com/p/web2py/
official : http://www.web2py.com/
---
You received this message because you are subscribed to the Google Groups "web2py-developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to web2py-develop...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Leonel Câmara

unread,
Mar 15, 2016, 8:24:20 PM3/15/16
to web2py-d...@googlegroups.com
I would add something like session.dirty and unless that's set to True nothing in the session gets written. The only way I see to make this backwards compatible would be to have a kind of session.autodirty which would default to True. In which case it would keep using the slow way of pickling and hashing, which we could fix with the aforementioned sorting (which would make it even slower), to determine if it needed to write.  

Anthony

unread,
Mar 16, 2016, 1:06:56 AM3/16/16
to web2py-developers
On Tuesday, March 15, 2016 at 7:33:22 PM UTC-4, Leonel Câmara wrote:
I'm guessing this has something to do with the order of keys in a dictionary given that dictionaries are unordered data structures. So sometimes the session will load itself with a different order triggering the overwrite.

I was thinking it might have something to do with that, but note that the problem occurs only if you actually access dictionary items as part of a variable assignment in the global scope. Presumably this access should not change the underlying data structure, and whether accessed or not, the same dictionary is written back to the session at the end of the request. I have also observed the effect with a dictionary containing only a single key.

Anthony

Leonel Câmara

unread,
Mar 16, 2016, 6:14:42 AM3/16/16
to web2py-developers
Weird, maybe it has to do with dictionary internals caching the hashes or something like that I'll have to investigate this more carefully.

Leonel Câmara

unread,
Mar 16, 2016, 6:26:14 AM3/16/16
to web2py-developers
Do note that the Session itself still has a __dict__ that will have more than one key so I'm still betting on the order thing.

Anthony

unread,
Mar 16, 2016, 11:12:25 AM3/16/16
to web2py-developers
On Wednesday, March 16, 2016 at 6:26:14 AM UTC-4, Leonel Câmara wrote:
Do note that the Session itself still has a __dict__ that will have more than one key so I'm still betting on the order thing.

But the problem does not occur in the general case (i.e., just by virtue of the fact that the session has a dictionary with multiple keys). It only happens if you access (but do not mutate) items within a dictionary stored in the session.

Anthony

Ricardo Pedroso

unread,
Mar 18, 2016, 2:36:03 PM3/18/16
to web2py-d...@googlegroups.com
The explanation and solution:

http://stackoverflow.com/questions/7501577/cpickle-different-results-pickling-the-same-object
https://github.com/shrubberysoft/django-picklefield/blob/master/src/picklefield/tests.py#L58

In very short:

This issue is due to cPickle depends on the objects refcount.
django-picklefield has the deepcopy as a solution.

I tryied in globals.py, Session.connect:

session_pickled = pickle.dumps(copy.deepcopy(self), pickle.HIGHEST_PROTOCOL)

and in globals.py Session._unchanged:

session_pickled = pickle.dumps(copy.deepcopy(self), pickle.HIGHEST_PROTOCOL)

and it seems to solve this issue.

I don't know how performance is impacted, didnt test it.

Ricardo

Massimo DiPierro

unread,
Mar 18, 2016, 2:37:34 PM3/18/16
to web2py-d...@googlegroups.com

On Mar 18, 2016, at 1:36 PM, Ricardo Pedroso <rmdpe...@gmail.com> wrote:

> On 3/16/16, Anthony <abas...@gmail.com> wrote:
>> On Wednesday, March 16, 2016 at 6:26:14 AM UTC-4, Leonel Câmara wrote:
>>>
>>> Do note that the Session itself still has a __dict__ that will have more
>>> than one key so I'm still betting on the order thing.
>>>
>>
>> But the problem does not occur in the general case (i.e., just by virtue of
>>
>> the fact that the session has a dictionary with multiple keys). It only
>> happens if you access (but do not mutate) items within a dictionary stored
>> in the session.
>
> The explanation and solution:
>
> http://stackoverflow.com/questions/7501577/cpickle-different-results-pickling-the-same-object
> https://github.com/shrubberysoft/django-picklefield/blob/master/src/picklefield/tests.py#L58
>
> In very short:
>
> This issue is due to cPickle depends on the objects refcount.

Damn, this is stupid!

> django-picklefield has the deepcopy as a solution.
>
> I tryied in globals.py, Session.connect:
>
> session_pickled = pickle.dumps(copy.deepcopy(self), pickle.HIGHEST_PROTOCOL)
>
> and in globals.py Session._unchanged:
>
> session_pickled = pickle.dumps(copy.deepcopy(self), pickle.HIGHEST_PROTOCOL)
>
> and it seems to solve this issue.
>
> I don't know how performance is impacted, didnt test it.
>
> Ricardo
>

Massimo DiPierro

unread,
Mar 19, 2016, 3:36:58 AM3/19/16
to web2py-d...@googlegroups.com
What do people this about this? I am keen to include the change. I do not think the deepcopy is more expensive than the pickling.


On Mar 18, 2016, at 1:36 PM, Ricardo Pedroso <rmdpe...@gmail.com> wrote:

Anthony

unread,
Mar 19, 2016, 1:58:39 PM3/19/16
to web2py-developers
On Saturday, March 19, 2016 at 3:36:58 AM UTC-4, Massimo Di Pierro wrote:
What do people this about this? I am keen to include the change. I do not think the deepcopy is more expensive than the pickling.

It's not deepcopy instead of pickling, but deepcopy in addition to pickling. According to some quick testing, seems like this takes about 10 times as long.

Anthony

Anthony

unread,
Mar 19, 2016, 2:10:26 PM3/19/16
to web2py-developers

> In very short:
>
>  This issue is due to cPickle depends on the objects refcount.

Damn, this is stupid!

Seems like it is due to the fact that pickle avoids duplicating objects with multiple references. As a result, you always get back the same object, but there is no guarantee that the result of pickle.dumps will always be the same. In short, pickle wasn't designed to be used as a means of detecting changes in Python objects.

Anthony

Alex

unread,
Mar 22, 2016, 8:27:04 AM3/22/16
to web2py-developers
That sounds like the best solution to me. It's backward compatible and if you care about performance you can manually set the dirty Flag to avoid pickling and hashing.

Massimo DiPierro

unread,
Mar 22, 2016, 5:01:39 PM3/22/16
to web2py-d...@googlegroups.com
It sounds too complicated to me. This should be automatic and not decided by the code. There must be a better solution. Let’s all think about it.

Niphlod

unread,
Mar 24, 2016, 4:42:33 AM3/24/16
to web2py-developers
Dunno what could be a real solution (hey, it was in the roadmap because we discussed it before) but if you check ANY other framework you'd see two ways:
- forcing users to use session only with plain dictionaries (serialized to strings, or json, the difference can be computed with a simple compare)
- classes that record when a top-level key has been added/removed/changed but FORCE you to signal any change deeper than the first level

I like the first better just for the "promoting sane defaults" argument: I don't like apps (or coders) that use session heavily with god-knows-what serialized in and out at each and every request.
The second one is good too: the core code certainly knows when the session needs to be saved (i.e. a formkey has been added as a CSRF prevention) and the user certainly knows when he alters the session.


One thing is sure: I wouldn't want the current session code in web3py nor accept a "web3py won't have session turned on by default" . we just need a simpler interface 
Yes, I DO like reading django's, werzeug's and pyramid's, I see what they're doing and I joy at the resulting cleanliness and "leanliness" ...  and I see myself writing a backend or overriding the core code in less than a day.
It took actually several days worth to hook web2py up when I wrote the redis-backed one and changes to the underlying module broke what I wrote. Actually there's still a small issue with redis-backed used together with locking because there's no way in web2py to know when a request finished without touching the session.

Massimo DiPierro

unread,
Mar 24, 2016, 4:45:11 AM3/24/16
to web2py-d...@googlegroups.com
On Mar 24, 2016, at 3:42 AM, Niphlod <nip...@gmail.com> wrote:

Dunno what could be a real solution (hey, it was in the roadmap because we discussed it before) but if you check ANY other framework you'd see two ways:
- forcing users to use session only with plain dictionaries (serialized to strings, or json, the difference can be computed with a simple compare)
- classes that record when a top-level key has been added/removed/changed but FORCE you to signal any change deeper than the first level

I like the first better just for the "promoting sane defaults" argument: I don't like apps (or coders) that use session heavily with god-knows-what serialized in and out at each and every request.
The second one is good too: the core code certainly knows when the session needs to be saved (i.e. a formkey has been added as a CSRF prevention) and the user certainly knows when he alters the session.


One thing is sure: I wouldn't want the current session code in web3py nor accept a "web3py won't have session turned on by default" . we just need a simpler interface 
Yes, I DO like reading django's, werzeug's and pyramid's, I see what they're doing and I joy at the resulting cleanliness and "leanliness" ...  and I see myself writing a backend or overriding the core code in less than a day.

Tell us more and it may go on web3py.

It took actually several days worth to hook web2py up when I wrote the redis-backed one and changes to the underlying module broke what I wrote. Actually there's still a small issue with redis-backed used together with locking because there's no way in web2py to know when a request finished without touching the session.

On Tuesday, March 22, 2016 at 10:01:39 PM UTC+1, Massimo Di Pierro wrote:
It sounds too complicated to me. This should be automatic and not decided by the code. There must be a better solution. Let’s all think about it.

On Mar 22, 2016, at 7:27 AM, Alex <mrau...@gmail.com> wrote:

That sounds like the best solution to me. It's backward compatible and if you care about performance you can manually set the dirty Flag to avoid pickling and hashing.



Niphlod

unread,
Mar 24, 2016, 5:21:06 AM3/24/16
to web2py-developers
"more" what ?

did you take a look at the projects I mentioned (which also have a good documentation both for usage and its API)? Can't you see the "gain" in adopting ?

IMHO the first one is better but it involves a mind-shift web2py's users are not accustomed to: certainly the second one is more "likable" by the current userbase.

Leonel Câmara

unread,
Mar 24, 2016, 5:36:43 AM3/24/16
to web2py-developers
I don't find the way the sessions currently work to be likable at all, it's only good for demo apps because anything more than that and it causes serious performance issues.

Massimo DiPierro

unread,
Mar 24, 2016, 11:22:27 AM3/24/16
to web2py-d...@googlegroups.com
I am familiar with the way Django sessions work but I do not know what you particularly like about them. They have a gazillion options and are overcomplicated and they do not support the syntax of out Storage object. I know they have a .save() API but I cannot figure out 

But it is not clear to me what rule they use for determining when to save a session. Do you look into the source for that?

I would rather have a session object that can only be serialized in JSON and use that with (sort_keys) to determine when it has changed.

Anthony

unread,
Mar 24, 2016, 12:35:52 PM3/24/16
to web2py-developers
IMHO the first one is better but it involves a mind-shift web2py's users are not accustomed to: certainly the second one is more "likable" by the current userbase.

I'd say go with the first option (limit to what can be converted to JSON). If someone really wants to save a more complicated object, they can always manually pickle/unpickle it.

Anthony

Leonel Câmara

unread,
Mar 24, 2016, 12:45:46 PM3/24/16
to web2py-developers
Humm if we only allow JSON serializable data we could be even smarter than that, we could have our own Dict and List class that would notify their containing Dict or List object if they were changed, so it would bubble up until the last session Dict would then be dirty (we could override __setitem__ to do this). In the end we could know if the session was dirty or not without making any calculation. This way, there would be no need to check for differences in the end.  

Massimo DiPierro

unread,
Mar 24, 2016, 12:55:42 PM3/24/16
to web2py-d...@googlegroups.com
That can be more computationally expensive then checking the serialized version.

On Mar 24, 2016, at 11:45 AM, Leonel Câmara <leonel...@gmail.com> wrote:

Humm if we only allow JSON serializable data we could be even smarter than that, we could have our own Dict and List class that would notify their containing Dict or List object if they were changed, so it would bubble up until the last session Dict would then be dirty (we could override __setitem__ to do this). In the end we could know if the session was dirty or not without making any calculation. This way, there would be no need to check for differences in the end.  

Niphlod

unread,
Mar 24, 2016, 1:38:02 PM3/24/16
to web2py-developers
@massimo: what I like about django sessions (even the gazillion options and hooks, which are clearly explained) because they have a "dirty-like" flag, which is "modified". 
ATM we're talking about serialization and "heavyness" issues .... https://docs.djangoproject.com/ja/1.9/topics/http/sessions/#when-sessions-are-saved

@leonel: incidentally, see the gotchas about your idea at the same link (the tl;dr is it works, but not for nested changes) 

Leonel Câmara

unread,
Mar 24, 2016, 1:52:28 PM3/24/16
to web2py-developers
Ahh but my technique would still work for that as I would also change inner dicts and lists to my overriden types and they would warn they've been changed.

Niphlod

unread,
Mar 24, 2016, 4:20:46 PM3/24/16
to web2py-developers
well, that remains to be benchmarked then, but with recursion estimates are not easy... the real dealbreaker is that if session is actually changed, then the serialization step would need to happen but hey, that's what benchmarks are for.

again, I'd be happy either way: IMHO a "flag" that checks for a pre-intelligeble change is better to support users accustomed to save into session anything that can be pickleable, and it'd be a minor "shift" in usage. Plain json objects are more limited but probably less computationally-intensive.  

Michele Comitini

unread,
Mar 25, 2016, 11:46:50 AM3/25/16
to web2py-developers
I am for json serialization  even if it could be limited in scope.  One benefit would be leveraging on storage engines supporting that format.

For change detection what is the performance footprint of visiting  the tree, cpu and memory wise?  In deep structures of that kind how complex is detecting and avoiding cycles?  


--

Niphlod

unread,
Mar 25, 2016, 12:14:56 PM3/25/16
to web2py-developers
don't know (a benchmark shouldn't be hard) but IMHO there has to be a limit in recursion.... at that point, we may as well just track top-level changes and have the beloved flag shifted (if not yet shifted), and warn users about changing deep nested values.

Leonel Câmara

unread,
Mar 25, 2016, 1:07:54 PM3/25/16
to web2py-developers
I'll try to make a very very basic prototype of a Session that would only accept JSON and that would track the changes to its contents next week. So we can do some benchmarking.

Massimo DiPierro

unread,
Mar 25, 2016, 8:01:53 PM3/25/16
to web2py-d...@googlegroups.com
I am curious about the strategy. 

On Mar 25, 2016, at 12:07 PM, Leonel Câmara <leonel...@gmail.com> wrote:

I'll try to make a very very basic prototype of a Session that would only accept JSON and that would track the changes to its contents next week. So we can do some benchmarking.

Leonel Câmara

unread,
Mar 26, 2016, 11:01:20 AM3/26/16
to web2py-developers
I made a very rough prototype of what I meant you can look at it here:


It seems to work pretty well according to my tests, but obviously we will have to benchmark this with a relatively big nesting to see if it's indeed faster as it does incur an overhead parsing the json and converting the types.

Next step would be someone reviewing/optimizing this and contributing an implementation that just serializes with sort_keys and compares in the end to check for changes so we could timeit.
Reply all
Reply to author
Forward
0 new messages