Diff between object graphs?


Cem Karan

Apr 22, 2015, 6:37:35 AM
to pytho...@python.org
Hi all, I need some help. I'm working on a simple event-based simulator for my dissertation research. The simulator has state information that I want to analyze as a post-simulation step, so I currently save (pickle) the entire simulator every time an event occurs; this lets me analyze the simulation at any moment in time, and ask questions that I haven't thought of yet. The problem is that pickling this amount of data is both time-consuming and a space hog. This is true even when using bz2.open() to create a compressed file on the fly.
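
For concreteness, the save step looks roughly like this (a simplified sketch; the simulator object and the file naming are placeholders):

    import bz2
    import pickle

    def save_snapshot(simulator, event_number):
        # Pickle the whole simulator into a bz2-compressed file, one
        # file per event.  The names here are placeholders.
        with bz2.open("snapshot_%06d.pkl.bz2" % event_number, "wb") as f:
            pickle.dump(simulator, f, protocol=pickle.HIGHEST_PROTOCOL)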

This leaves me with two choices: first, pick the data I want to save, and second, find a way of generating diffs between object graphs. Since I don't yet know all the questions I want to ask, I don't want to throw away information prematurely, which is why I would prefer to avoid scenario 1.

So that brings up possibility two: generating diffs between object graphs. I've searched around in the standard library and on pypi, but I haven't yet found a library that does what I want. Does anyone know of something that does?

Basically, I want something with the following ability:

Object_graph_2 - Object_graph_1 = diff_2_1
Object_graph_1 + diff_2_1 = Object_graph_2

The object graphs are already pickleable, and the diffs must be, or this won't work. I can use deepcopy to ensure the two object graphs are completely separate, so the diffing engine doesn't need to worry about that part.
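
To make the desired behaviour concrete, here is a minimal sketch of the two operations for the special case of flat dictionaries (a real object-graph differ would also need to recurse, follow references, and handle cycles):

    def graph_diff(old, new):
        # Record keys that changed or were added, plus keys that were removed.
        changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
        removed = [k for k in old if k not in new]
        return changed, removed

    def graph_patch(old, delta):
        # Invert graph_diff: graph_patch(g1, graph_diff(g1, g2)) == g2.
        changed, removed = delta
        result = dict(old)
        result.update(changed)
        for k in removed:
            del result[k]
        return result

Both the diff and the patched result stay pickleable as long as the graph's own values are pickleable, which matches the requirement above.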

Anyone know of such a thing?

Thanks,
Cem Karan

Rustom Mody

Apr 22, 2015, 8:11:47 AM
to
On Wednesday, April 22, 2015 at 4:07:35 PM UTC+5:30, Cem Karan wrote:
> Hi all, I need some help. I'm working on a simple event-based simulator for my dissertation research. The simulator has state information that I want to analyze as a post-simulation step, so I currently save (pickle) the entire simulator every time an event occurs; this lets me analyze the simulation at any moment in time, and ask questions that I haven't thought of yet. The problem is that pickling this amount of data is both time-consuming and a space hog. This is true even when using bz2.open() to create a compressed file on the fly.

No answer to your questions...
But you do know that bzip2 is rather worse than gzip in time
and not really so much better in space, don't you?
http://tukaani.org/lzma/benchmarks.html

CFK

Apr 22, 2015, 8:34:36 AM
to Rustom Mody, pytho...@python.org
I had no idea; I'll try my tests using gzip as well, just to see.
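
Something like this quick comparison should settle it for my data (a rough sketch; 'state' stands in for one snapshot object):

    import bz2
    import gzip
    import lzma
    import pickle
    import time

    def compare_compressors(state):
        # Time each stdlib compressor on the same pickled snapshot.
        raw = pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL)
        for name, mod in (("gzip", gzip), ("bz2", bz2), ("lzma", lzma)):
            start = time.perf_counter()
            packed = mod.compress(raw)
            elapsed = time.perf_counter() - start
            print(name, len(packed), "bytes,", elapsed, "seconds")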

That said, I could still use the diff between object graphs; saving less state is definitely going to be a speed/space improvement over saving everything!

Thanks,
Cem Karan

Peter Otten

Apr 22, 2015, 8:54:08 AM
to pytho...@python.org
Cem Karan wrote:

> Hi all, I need some help. I'm working on a simple event-based simulator
> for my dissertation research. The simulator has state information that I
> want to analyze as a post-simulation step, so I currently save (pickle)
> the entire simulator every time an event occurs; this lets me analyze the
> simulation at any moment in time, and ask questions that I haven't thought
> of yet. The problem is that pickling this amount of data is both
> time-consuming and a space hog. This is true even when using bz2.open()
> to create a compressed file on the fly.
>
> This leaves me with two choices; first, pick the data I want to save, and
> second, find a way of generating diffs between object graphs. Since I
> don't yet know all the questions I want to ask, I don't want to throw away
> information prematurely, which is why I would prefer to avoid scenario 1.
>
> So that brings up possibility two; generating diffs between object graphs.
> I've searched around in the standard library and on pypi, but I haven't
> yet found a library that does what I want. Does anyone know of something
> that does?
>
> Basically, I want something with the following ability:
>
> Object_graph_2 - Object_graph_1 = diff_2_1
> Object_graph_1 + diff_2_1 = Object_graph_2
>
> The object graphs are already pickleable, and the diffs must be, or this
> won't work. I can use deepcopy to ensure the two object graphs are
> completely separate, so the diffing engine doesn't need to worry about
> that part.
>
> Anyone know of such a thing?

A poor man's approach:

Do not compress the pickled data; instead, check it into version control. Getting
the n-th state then becomes checking out the n-th revision of the file.

I have no idea how much space you save that way, but it's simple enough to
give it a try.
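
A minimal sketch of that idea, shelling out to git (the file name and commit message are illustrative, and the working directory is assumed to already be a git repository):

    import pickle
    import subprocess

    def commit_state(simulator, event_number):
        # Overwrite a single pickle file and commit it; the n-th state
        # can later be recovered with "git show <commit-n>:state.pkl".
        with open("state.pkl", "wb") as f:
            pickle.dump(simulator, f)
        subprocess.check_call(["git", "add", "state.pkl"])
        subprocess.check_call(["git", "commit", "-m",
                               "event %d" % event_number])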


Another slightly more involved idea:

Make the events pickleable, and save the simulator only for every 100th (for
example) event. To restore the 7531st state, load pickle 7500 and apply
events 7501 to 7531.
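
In code, the restore step might look like this (a sketch; apply_event() and the checkpoint/event-log file layout are assumptions):

    import pickle

    def restore(n, interval=100):
        # Load the nearest earlier checkpoint, then replay the logged
        # events up to event n.  apply_event() is a placeholder for
        # however the simulator consumes a single event.
        base = (n // interval) * interval
        with open("checkpoint_%d.pkl" % base, "rb") as f:
            simulator = pickle.load(f)
        with open("events.pkl", "rb") as f:
            events = pickle.load(f)
        for event in events[base:n]:
            simulator.apply_event(event)
        return simulator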

Cem Karan

Apr 22, 2015, 9:40:17 PM
to Peter Otten, pytho...@python.org
Sounds like a good approach; I'll give it a shot in the morning.

> Another slightly more involved idea:
>
> Make the events pickleable, and save the simulator only for every 100th (for
> example) event. To restore the 7531st state, load pickle 7500 and apply
> events 7501 to 7531.

I was hoping to avoid doing this as I lose information. BUT, it's likely that this will be the best approach regardless of what other methods I use; there is just too much data.

Thanks,
Cem Karan

Dave Angel

Apr 22, 2015, 9:40:24 PM
to pytho...@python.org
Why would that lose any information???


--
DaveA

Chris Angelico

Apr 22, 2015, 9:47:16 PM
to pytho...@python.org
It loses information if event processing isn't perfectly deterministic.

ChrisA

Cem Karan

Apr 22, 2015, 9:54:42 PM
to Chris Angelico, pytho...@python.org
Precisely. In order to make my simulations more realistic, I use a lot of random numbers. I can fake things by keeping the seed to the generator, but if I want to do any sort of hardware in the loop simulations, then that approach won't work.

Thanks,
Cem Karan

Dave Angel

Apr 22, 2015, 9:58:03 PM
to pytho...@python.org
On 04/22/2015 09:46 PM, Chris Angelico wrote:
> On Thu, Apr 23, 2015 at 11:37 AM, Dave Angel <da...@davea.name> wrote:
>> On 04/22/2015 09:30 PM, Cem Karan wrote:
>>>
>>>
>>> On Apr 22, 2015, at 8:53 AM, Peter Otten <__pet...@web.de> wrote:
>>>
>>>> Another slightly more involved idea:
>>>>
>>>> Make the events pickleable, and save the simulator only for every 100th
>>>> (for
>>>> example) event. To restore the 7531st state, load pickle 7500 and apply
>>>> events 7501 to 7531.
>>>
>>>
>>> I was hoping to avoid doing this as I lose information. BUT, it's likely
>>> that this will be the best approach regardless of what other methods I use;
>>> there is just too much data.
>>>
>>
>> Why would that lose any information???
>
> It loses information if event processing isn't perfectly deterministic.

Quite right. But I hadn't seen anything in this thread to imply that.

I used an approach like that on the Game of Life, in 1976. I saved
every 10th or so state, and was able to run the simulation backwards by
going forward from the previous saved state. In this case, the analogue
of the "event" is determined from the previous state. But it's quite
similar, and quite deterministic.



--
DaveA

Cem Karan

Apr 22, 2015, 10:16:10 PM
to Dave Angel, pytho...@python.org

On Apr 22, 2015, at 9:56 PM, Dave Angel <da...@davea.name> wrote:

> On 04/22/2015 09:46 PM, Chris Angelico wrote:
>> On Thu, Apr 23, 2015 at 11:37 AM, Dave Angel <da...@davea.name> wrote:
>>> On 04/22/2015 09:30 PM, Cem Karan wrote:
>>>>
>>>>
>>>> On Apr 22, 2015, at 8:53 AM, Peter Otten <__pet...@web.de> wrote:
>>>>
>>>>> Another slightly more involved idea:
>>>>>
>>>>> Make the events pickleable, and save the simulator only for every 100th
>>>>> (for
>>>>> example) event. To restore the 7531st state, load pickle 7500 and apply
>>>>> events 7501 to 7531.
>>>>
>>>>
>>>> I was hoping to avoid doing this as I lose information. BUT, it's likely
>>>> that this will be the best approach regardless of what other methods I use;
>>>> there is just too much data.
>>>>
>>>
>>> Why would that lose any information???
>>
>> It loses information if event processing isn't perfectly deterministic.
>
> Quite right. But I hadn't seen anything in this thread to imply that.

My apologies, that's my fault. I should have mentioned that in the first place.

Thanks,
Cem Karan

Steven D'Aprano

Apr 23, 2015, 1:59:29 AM
to
On Thursday 23 April 2015 11:53, Cem Karan wrote:

> Precisely. In order to make my simulations more realistic, I use a lot of
> random numbers. I can fake things by keeping the seed to the generator,
> but if I want to do any sort of hardware in the loop simulations, then
> that approach won't work.

That's exactly why we have *pseudo* random number generators. They are
statistically indistinguishable from "real" randomness, but repeatable when
needed.

Obviously you need a high-quality PRNG like the Mersenne Twister, as used by
Python, and you need to ensure that the distribution of values matches that
of the real-life events. If you are truly paranoid, you might even run the
simulation twice, using independent PRNGs (e.g. Mersenne Twister for one
run, Marsaglia xorshift generator for another), and compare the results. But
given that you are using a high-quality generator in the first place, that
is unlikely to gain you anything.

(MT is uniformly distributed with no correlations in up to 623 dimensions. I
suppose it is possible that, if your simulation involves a phase space with
more than 623 dimensions, it may inadvertently find correlations in the
random numbers.)

There's no benefit (except maybe speed, and probably not that) to using
unrepeatable "real" random numbers. Using real randomness for simulations is
a bad idea because it means you can never run the same simulation twice, and
you are forced to store large amounts of data instead of just storing the
seed and running the simulation again.
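
The repeatability in question is easy to demonstrate with nothing but the standard library:

    import random

    # Two generators seeded identically produce identical streams, so
    # storing the seed is enough to re-run a whole simulation exactly.
    a = random.Random(12345)
    b = random.Random(12345)
    assert [a.random() for _ in range(1000)] == [b.random() for _ in range(1000)]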



--
Steve

Cem Karan

Apr 23, 2015, 6:34:28 AM
to Steven D'Aprano, pytho...@python.org

On Apr 23, 2015, at 1:59 AM, Steven D'Aprano <steve+comp....@pearwood.info> wrote:

> On Thursday 23 April 2015 11:53, Cem Karan wrote:
>
>> Precisely. In order to make my simulations more realistic, I use a lot of
>> random numbers. I can fake things by keeping the seed to the generator,
>> but if I want to do any sort of hardware in the loop simulations, then
>> that approach won't work.
>
> That's exactly why we have *pseudo* random number generators. They are
> statistically indistinguishable from "real" randomness, but repeatable when
> needed.

Which is why I mentioned keeping the seed above. The problem is that I eventually want to do hardware in the loop, which will involve IO between the simulation machine and the actual robots, and IO timing is imprecise and uncontrollable. That is where not recording something becomes lossy. That said, the mere act of trying to record everything is going to cause timing issues, so I guess I'm overthinking things yet again.

Thanks for the help, everyone; it's helped me clarify what I need to do in my mind.

Thanks,
Cem Karan

Steve Smaldone

Apr 23, 2015, 11:05:58 AM
to Cem Karan, pytho...@python.org
On Thu, Apr 23, 2015 at 6:34 AM, Cem Karan <cfka...@gmail.com> wrote:

> On Apr 23, 2015, at 1:59 AM, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>
> > On Thursday 23 April 2015 11:53, Cem Karan wrote:
> >
> >> Precisely. In order to make my simulations more realistic, I use a lot of
> >> random numbers. I can fake things by keeping the seed to the generator,
> >> but if I want to do any sort of hardware in the loop simulations, then
> >> that approach won't work.
> >
> > That's exactly why we have *pseudo* random number generators. They are
> > statistically indistinguishable from "real" randomness, but repeatable when
> > needed.
>
> Which is why I mentioned keeping the seed above. The problem is that I eventually want to do hardware in the loop, which will involve IO between the simulation machine and the actual robots, and IO timing is imprecise and uncontrollable. That is where not recording something becomes lossy. That said, the mere act of trying to record everything is going to cause timing issues, so I guess I'm overthinking things yet again.
>
> Thanks for the help, everyone; it's helped me clarify what I need to do in my mind.

Well, you could achieve this on Linux by using the rdiff library.  Not exactly a purely Python solution, but it would give you file-based diffs.  

Basically, what you could do is write the first file. Then for each subsequent save, write out the file (as a temp file) and issue shell commands (via the Python script) to calculate the diff of the new file against the first (basis) file. Once you remove the temp files, you'd have a full first save and a set of diffs against that file. You could rehydrate any save you want by applying the diff to the basis.

If you work on it a bit, you might even be able to avoid the temp file saves by using pipes in the shell command.

Of course, I haven't tested this so there may be non-obvious issues with diffing between subsequent pickled saves, but it seems that it should work on the surface.
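
For example, the shell side might look like this (a sketch using the rdiff tool from librsync; the file names are illustrative):

    import subprocess

    def make_delta(basis, new, delta):
        # rdiff first summarizes the basis file into a signature, then
        # computes a delta that rebuilds 'new' from 'basis'.
        subprocess.check_call(["rdiff", "signature", basis, basis + ".sig"])
        subprocess.check_call(["rdiff", "delta", basis + ".sig", new, delta])

    def rehydrate(basis, delta, out):
        # Apply the delta to the basis to recover the saved state.
        subprocess.check_call(["rdiff", "patch", basis, delta, out])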

Good luck!

SS

Cem Karan

Apr 24, 2015, 8:42:49 AM
to Steve Smaldone, pytho...@python.org

On Apr 23, 2015, at 11:05 AM, Steve Smaldone <smal...@gmail.com> wrote:

> On Thu, Apr 23, 2015 at 6:34 AM, Cem Karan <cfka...@gmail.com> wrote:
>
> On Apr 23, 2015, at 1:59 AM, Steven D'Aprano <steve+comp....@pearwood.info> wrote:
>
> > On Thursday 23 April 2015 11:53, Cem Karan wrote:
> >
> >> Precisely. In order to make my simulations more realistic, I use a lot of
> >> random numbers. I can fake things by keeping the seed to the generator,
> >> but if I want to do any sort of hardware in the loop simulations, then
> >> that approach won't work.
> >
> > That's exactly why we have *pseudo* random number generators. They are
> > statistically indistinguishable from "real" randomness, but repeatable when
> > needed.
>
> Which is why I mentioned keeping the seed above. The problem is that I eventually want to do hardware in the loop, which will involve IO between the simulation machine and the actual robots, and IO timing is imprecise and uncontrollable. That is where not recording something becomes lossy. That said, the mere act of trying to record everything is going to cause timing issues, so I guess I'm overthinking things yet again.
>
> Thanks for the help, everyone; it's helped me clarify what I need to do in my mind.
>
>
> Well, you could achieve this on Linux by using the rdiff library. Not exactly a purely Python solution, but it would give you file-based diffs.
>
> Basically, what you could do is write the first file. Then for each subsequent save, write out the file (as a temp file) and issue shell commands (via the Python script) to calculate the diff of the new file against the first (basis) file. Once you remove the temp files, you'd have a full first save and a set of diffs against that file. You could rehydrate any save you want by applying the diff to the basis.
>
> If you work on it a bit, you might even be able to avoid the temp file saves by using pipes in the shell command.
>
> Of course, I haven't tested this so there may be non-obvious issues with diffing between subsequent pickled saves, but it seems that it should work on the surface.

That might work... I'm running on OS X right now, but once I get to the hardware in the loop part, it's all going to be some flavor of Linux. I'll look into it... thanks!

Thanks,
Cem Karan