References (py/ref) not correct


Kieran Darcy

Oct 9, 2009, 12:07:30 PM
to jsonpickle
Great project. I'm in awe, and hope to use jsonpickle soon.

I've found a problem with references. I haven't looked into a patch
yet, but I just thought I'd post an example before I start. Sorry if
I'm reporting an already known issue. I had a search and couldn't find
it.

Anyway, here's an example. Run the following and compare the output
of "list1" with "list2"; you will see that the two lists (list2 is
the decoded list1) do not match. The object "o4" is referencing the
wrong object.

import jsonpickle

class Shared(object):
    def __init__(self, text):
        self.text = text

if __name__ == '__main__':
    # o1/o2 are the same object, and so are o3/o4.
    o1 = Shared('Object 1, shared with o2')
    o2 = o1
    o3 = Shared('Object 3, shared with o4')
    o4 = o3

    list1 = [o1, o2, o3, o4]
    print('list1')
    for o in list1:
        print(o.text)
    print('')

    # Round-trip through jsonpickle and print the result for comparison.
    list2 = jsonpickle.decode(jsonpickle.encode(list1))
    print('list2')
    for o in list2:
        print(o.text)

John Paulett

Oct 9, 2009, 2:51:28 PM
to jsonp...@googlegroups.com
Kieran,

Interesting problem, thanks for bringing it up!

I think this issue brings up a philosophical question about what
jsonpickle is, and I think there are two possible answers:
1) A drop in replacement for pickle, that generates JSON instead of
the pickle data stream format.
2) An extension to json (or simplejson/cjson/demjson) that attempts to
serialize more complex object graphs than the json modules can
natively handle.

I think the use case so far has been closer to #2 (if anyone thinks
differently, please chime in).

Let me show a simple example which demonstrates roughly the same
problem as your excellent example.

First, let's create an object and store it twice in a list. We see
that the references point to the same object using the built-in id()
function.

    >>> a = 'hello'
    >>> b = a
    >>> container = [a, a]
    >>> id(container[0]) == id(container[1])
    True

Next, we serialize the list using pickle.

    >>> import pickle
    >>> pickleContainer = pickle.dumps(container)

The pickle data stream format indicates to us that the two list
elements likely reference the same object.

    >>> pickleContainer
    "(lp0\nS'hello'\np1\nag1\na."

We confirm this fact when we check the ids of the two deserialized list objects.

    >>> unPickleContainer = pickle.loads(pickleContainer)
    >>> unPickleContainer
    ['hello', 'hello']
    >>> id(unPickleContainer[0]) == id(unPickleContainer[1])
    True

The json module (and therefore the jsonpickle module) behaves
differently, as we can see when we serialize using json.

    >>> import json
    >>> jsonContainer = json.dumps(container)

JSON clearly does not have any notion of references:

    >>> jsonContainer
    '["hello", "hello"]'

When we deserialize, we get two separate copies of the original object.

    >>> unJsonContainer = json.loads(jsonContainer)
    >>> unJsonContainer
    [u'hello', u'hello']
    >>> id(unJsonContainer[0]) == id(unJsonContainer[1])
    False

I am more than happy to incorporate any patch that would add support
for preserving references.  However (without thinking about this too
much yet), I think there may be some issues incorporating such a
change.

I think the Python side of such a change would be practical, but I
would be more concerned about the Javascript side and the JSON
portability.  While we might be able to maintain references in Python,
we would lose all such references in Javascript.

I may be way off--please correct me if I am missing something obvious.
If anyone has any thoughts or proposed implementations, feel free to
share.  I'm open to anything, but we may want to approach any solution
as an optional part of jsonpickle (i.e. turn on "reference
preservation" via a flag).

Kieran, can you share with us the use case in which you need these
references preserved?  In some cases, redefining what equality
(__eq__, __ne__, and __hash__) means might be an easy workaround
(i.e. use "smart" equality, rather than instance equality).
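For illustration, here is a minimal sketch of what such "smart" equality could look like. The Text class is hypothetical and not part of jsonpickle; it only shows value-based comparison standing in for identity after a round trip produces separate copies.

```python
class Text(object):
    """Hypothetical value object: equal when the content is equal."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if not isinstance(other, Text):
            return NotImplemented
        return self.text == other.text

    def __ne__(self, other):
        # Needed on Python 2; Python 3 derives this from __eq__.
        result = self.__eq__(other)
        if result is NotImplemented:
            return result
        return not result

    def __hash__(self):
        # Must agree with __eq__: equal objects hash equally.
        return hash(self.text)


a = Text('hello')
b = Text('hello')      # a separate copy, as after a JSON round trip
print(a is b)          # False
print(a == b)          # True
print(len(set([a, b])))  # 1
```

Note that this only makes the copies compare equal. A later change to one copy is still not visible through the other, and hashing on a mutable attribute is fragile if the object is changed while it sits in a set or dict.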

John

Kieran Darcy

Oct 12, 2009, 7:51:11 AM
to jsonpickle
Thanks for the reply John, your response is really appreciated.

Sorry if this post is a bit long, but hopefully it's in some way
useful. As for the deep philosophical question about what jsonpickle
is: 1) a replacement for pickle; or 2) an extension to json, I'd kinda
have to go with 3) a bit of both, with more than a fair share of
emphasis on 2; jsonpickle could extend and then improve on json by
adding a bit more pickle.

To my mind, the most important thing json or jsonpickle should do is
maintain data across languages. In your json example, data is not
lost: ['hello', 'hello'] arrives in JSON almost intact. All we're
missing is the meta data which describes the relationship between the
two strings in the list (i.e. being the same object), but at least we
know what those strings should contain.

In my original example, jsonpickle loses data and meta data. The
object o4 leaves python with the data "Object 3, shared with o4" and
returns as "Object 1, shared with o2". Now, not only have we lost our
meta data, but we've also lost our data.

Sure, I could extend json's JSONEncoder and JSONDecoder classes
(http://docs.python.org/library/json.html#json.JSONEncoder) to describe
my objects and how they interact with each other, but where's the fun
in that? jsonpickle is doing so much of the hard work for me already
(ah, laziness, the foundation of all good programming); wouldn't it be
great if jsonpickle could do the rest of the hard work too?

You mention your concern about...

> the Javascript side and the JSON
> portability. While we might be able to maintain references in Python,
> we would lose all such references in Javascript.

I don't think we need to be concerned about JSON or Javascript. All we
have to be concerned about is giving enough data, and meta data, to
allow the other language (Javascript or otherwise) to do whatever it
has to do to put the objects back together again. If we've done our
job right in the encoding, we should trust that the target language
can do its thing in the decoding.

As you say, we need a use case to see why this is useful. Here's mine.
I'm working on a module to allow "users" to create a document. This
document happens to be a questionnaire. The questionnaire can be made
up of a collection of question objects. Questions can have unique data
(like name) and can also share some data (like answers). Here's an
example:

Q1) What operating system(s) do you use for your development?
Ubuntu
Mac OS X
Windows XP

Q2) When developing, which operating system do you enjoy using the
most?
Ubuntu
Mac OS X
Windows XP

Let's say you want to define 100 questions with the same answer list.
You'd make your answer list an object (or a collection of Text
objects) and make it an attribute of your Question object. Some time
in the future, you decide to extend your answer list with a new
answer, or rename one of the existing answers to correct a spelling,
or reflect some other change. You don't want to do it 100 times, you
want to do it once and have that change automatically updated
everywhere. Easy when you're dealing with shared objects.

Here's my original example extended a bit to show such a use case.
Copy it to a file and run it, you'll see what I mean.

import jsonpickle

class Shared(object):
    # A bit like a question
    def __init__(self, name, text):
        self.name = name
        self.text = text

class Text(object):
    # A bit like an answer
    def __init__(self, text):
        self.text = text

    def shout(self):
        # Do something with a text
        return '%s!!' % self.text

    def update(self, text):
        # Change a text
        self.text = text

def report(l):
    for o in l:
        print('%s = %s' % (o.name, o.text.shout()))

if __name__ == '__main__':
    # o1 and o2 share one Text object; o3 and o4 share another.
    o1 = Shared('o1', Text('Object 1, shared with o2'))
    o2 = Shared('o2', o1.text)
    o3 = Shared('o3', Text('Object 3, shared with o4'))
    o4 = Shared('o4', o3.text)

    list1 = [o1, o2, o3, o4]

    print('list1, before changes')
    report(list1)

    # Now, let's change some text.
    o2.text.update('Object 2, shared with o1')

    print('list1, after changes')
    report(list1)

    list2 = jsonpickle.decode(jsonpickle.encode(list1))
    print('list2, after jsonpickle')
    report(list2)

Phew, I told you this was a long post. I won't post the code, but here
http://remote.ronin.com/jsonpickle/ is the Javascript solution for the
same problem. View the source on the link, you'll see the similarities
between the Python and the Javascript objects. Javascript works just
like Python. Javascript is capable of keeping references in the same
way Python does. We can trust Javascript to do its job, if it has the
data to do it.

Now, coming to a conclusion at last (inhales deeply, then drinks tea).
Without working out the details, here's how I might go about
maintaining references in jsonpickle. Firstly, let's look at the JSON
produced by jsonpickle in my new example above. We'll only look at
half of it:

[
    {
        "py/object": "__main__.Shared",
        "text": {
            "py/object": "__main__.Text",  # Here's jsonpickle's meta data to say "This is a Text object"
            "text": "Object 2, shared with o1"
        },
        "name": "o1"
    },
    {
        "py/object": "__main__.Shared",
        "text": {
            "py/ref": "/text"  # Here's jsonpickle's meta data to say "This is a reference to an object"
        },
        "name": "o2"
    },
    .......
]

Now, if we added more meta data (i.e. the dictionary keys beginning
'py/'), we could allow for the references to be rebuilt in our target
language. Here I'm adding the output of Python's built-in id() to two
new keys: "py/object_id" and "py/ref_id". When this data arrives in
Javascript, I can have my Javascript look up the "py/ref" object via
its "py/ref_id" and remake the references. In Python, our jsonpickle
Unpickler class can do the same. Make sense?

[
    {
        "py/object": "__main__.Shared",
        "text": {
            "py/object": "__main__.Text",
            "py/object_id": 123456,  # Or whatever unique int id() assigns. It doesn't matter, so long as it's unique.
            "text": "Object 2, shared with o1"
        },
        "name": "o1"
    },
    {
        "py/object": "__main__.Shared",
        "text": {
            "py/ref": "/text",
            "py/ref_id": 123456  # Our unique int as assigned by id() above.
        },
        "name": "o2"
    },
    ....
]
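To make the decoding side concrete, here is a rough sketch of how a decoder might reconnect the references. It is not jsonpickle's Unpickler: restore() is a hypothetical helper, the "py/object_id" and "py/ref_id" keys are only the ones proposed in this post, and the result is plain dicts rather than rebuilt Shared/Text instances.

```python
import json

def restore(node, seen=None):
    """Rebuild a parsed JSON tree, mapping "py/ref_id" back to the object
    that carried the matching "py/object_id" (hypothetical keys)."""
    if seen is None:
        seen = {}
    if isinstance(node, list):
        return [restore(item, seen) for item in node]
    if isinstance(node, dict):
        if 'py/ref_id' in node:
            # Assumes the referenced object appeared earlier in the stream.
            return seen[node['py/ref_id']]
        result = {}
        if 'py/object_id' in node:
            # Register before descending so nested back-references resolve.
            seen[node['py/object_id']] = result
        for key, value in node.items():
            if key != 'py/object_id':
                result[key] = restore(value, seen)
        return result
    return node

doc = json.loads('''
[
  {"py/object": "__main__.Shared", "name": "o1",
   "text": {"py/object": "__main__.Text", "py/object_id": 123456,
            "text": "Object 2, shared with o1"}},
  {"py/object": "__main__.Shared", "name": "o2",
   "text": {"py/ref": "/text", "py/ref_id": 123456}}
]
''')

o1, o2 = restore(doc)
print(o1['text'] is o2['text'])  # True
```

The same lookup table would work in Javascript, since all it needs is a map from id to the object already built.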

I think this goes some way towards maintaining object references in
jsonpickle. It's certainly something I could do with.


Thanks again for replying John, hopefully you were able to read to the
end :)

Kieran.

David Aguilar

Oct 15, 2009, 6:38:59 AM
to jsonp...@googlegroups.com

Hi,

On Mon, Oct 12, 2009 at 04:51:11AM -0700, Kieran Darcy wrote:
>
> Thanks for the reply John, your response is really appreciated.
>
> Sorry if this post is a bit long, but hopefully it's in some way
> useful. As for the deep philosophical question about what jsonpickle
> is: 1) a replacement for pickle; or 2) an extension to json, I'd kinda
> have to go with 3) a bit of both, with more than a fair share of
> emphasis on 2; jsonpickle could extend and then improve on json by
> adding a bit more pickle.


Preserving identity for list items is something that would
be a great feature.

jsonpickle preserves references for dictionary values and
objects that reference other objects. When I wrote the
referencing support I was well aware that it wouldn't
work across list boundaries due to the way we were
encoding the object id/"namespace".

We use a path-like key for the object IDs;
something like /foo/bar would refer to 42 in:

{"foo": {"bar": 42}}

There are a range of possibilities for how we could do it.
Your suggestion to use id() is an interesting one and could
probably work in practice. I could also imagine something like
"/foo/bar[index]" for referring to an element inside of a list.

e.g. /foo/bar[2] refers to 42:

{"foo": {"bar": [10, 21, 42]}}

I never tried doing it but in theory it should be doable.
The encoder is recursive and currently maintains a name
stack which is used to construct the object ref path.

Extending that to know that it's inside a list shouldn't
be too hard and could probably be done in a way that is
backwards compatible.
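As a rough sketch of the decoding half of that idea, here is how a path with list indexes might be resolved against already-decoded data. The resolve() helper and the exact "/key[index]" syntax are assumptions for illustration only, not what jsonpickle currently does.

```python
import re

# One path segment: a key, optionally followed by one or more [index] parts.
_SEGMENT = re.compile(r'^([^\[\]]*)((?:\[\d+\])*)$')

def resolve(root, path):
    """Follow a hypothetical path such as '/foo/bar[2]' through dicts/lists."""
    node = root
    for segment in path.strip('/').split('/'):
        if not segment:
            continue
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError('bad path segment: %r' % segment)
        key, indexes = match.groups()
        if key:
            node = node[key]
        for index in re.findall(r'\[(\d+)\]', indexes):
            node = node[int(index)]
    return node

print(resolve({"foo": {"bar": 42}}, '/foo/bar'))               # 42
print(resolve({"foo": {"bar": [10, 21, 42]}}, '/foo/bar[2]'))  # 42
```

One catch with any path syntax like this: dictionary keys that themselves contain "/" or "[" become ambiguous unless they are escaped somehow.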

Aside from lists/sets/tuples, have you found other cases where
the object references are not preserved?

My specific test case was a DAG-like structure where child
nodes hold references back to their parents, so I focused
specifically on making that work; having it work for lists
as well would be pretty cool.

Now... would I want it to preserve references for
something simple like:

foo = 'some string'
obj = [foo, foo]
?

I would argue that doing that is going too far and not
very useful in practice. Nonetheless, doing the right thing
for non-primitive types would be very nice; it's just
something that no one's gotten around to writing yet.

It might be worth logging a feature request for handling
object identity across sequence types:
http://code.google.com/p/jsonpickle/issues/list


An interesting aside:

We already do better than the pickle module ;-)
The pickle module doesn't handle recursive DAG-like
structures. I would be interested in knowing about
any cases that pickle handles better than jsonpickle
if you run across them.


--
David

David Aguilar

Nov 28, 2009, 9:51:06 PM
to jsonp...@googlegroups.com, kieran...@gmail.com, eoghan...@gmail.com
On Mon, Oct 12, 2009 at 04:51:11AM -0700, Kieran Darcy wrote:
>
Makes perfect sense.

Using an id()-like thing to identify objects will make for a
powerful format and solve a number of problems.

I'll see if I can give it a stab soon unless someone beats me to
it. http://github.com/davvid/jsonpickle

It actually means a lot of things can be simplified in the
format. For example, see below:

>
> [
> {
> "py/object": "__main__.Shared",
> "text": {
> "py/object": "__main__.Text",
> "py/object_id": 123456, # Or whatever unique int id()
> assigns. It doesn't matter, so long as it's unique.
> "text": "Object 2, shared with o1"
> },
> "name": "o1"},
>
> {
> "py/object": "__main__.Shared",
> "text": {
> "py/ref": "/text",

Here. If the decoder has to worry about parsing this xpath-like
string, then it is no more work for it to store a dict of
object ids instead. Does this mean we can eliminate the
py/ref "/text" thing completely and replace it with ref_id?
I think so.

We then wouldn't need the xpath-like syntax for object refs
at all since it'd all just be simple dict lookups. I like that.
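Roughly what I have in mind on the encoding side, as an untested-in-jsonpickle sketch: flatten() is a made-up helper, not the current Pickler, and the "py/object_id" / "py/ref_id" keys are just the ones proposed in this thread.

```python
import json

def flatten(obj, seen=None):
    """Turn objects into JSON-ready dicts, emitting a ref_id on repeats."""
    if seen is None:
        seen = {}
    if isinstance(obj, (list, tuple)):
        return [flatten(item, seen) for item in obj]
    if isinstance(obj, dict):
        return dict((key, flatten(value, seen)) for key, value in obj.items())
    if hasattr(obj, '__dict__'):
        oid = id(obj)
        if oid in seen:
            # Already encoded once: emit only the reference.
            return {'py/ref_id': oid}
        # Holding obj in the dict keeps it alive, so its id() stays unique.
        seen[oid] = obj
        cls = obj.__class__
        data = {
            'py/object': '%s.%s' % (cls.__module__, cls.__name__),
            'py/object_id': oid,
        }
        for key, value in vars(obj).items():
            data[key] = flatten(value, seen)
        return data
    return obj  # primitives pass through untouched


class Text(object):
    def __init__(self, text):
        self.text = text

class Shared(object):
    def __init__(self, name, text):
        self.name = name
        self.text = text

answers = Text('Object 1, shared with o2')
encoded = flatten([Shared('o1', answers), Shared('o2', answers)])
print(encoded[1]['text'] == {'py/ref_id': encoded[0]['text']['py/object_id']})  # True
print(json.dumps(encoded, indent=2))
```

Because the lookup is keyed on identity rather than on position in the graph, it works the same whether the repeat shows up inside a list, a dict value, or an attribute. Strings and numbers are passed through as-is, so [foo, foo] stays two plain strings.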


BTW did anyone see http://www.mikealrogers.com/archives/695 ?

Since I just removed cjson from jsonpickle's built-in list of
known backends maybe it's worth replacing it with py-yajl?
(see the comments)

--
David