Improvements to the existing versioning model


Rufus Pollock

Aug 30, 2007, 7:07:06 AM
to sqle...@googlegroups.com
## Per-object vs. global versioning

The current versioning implementation of elixir is per-object, that is,
all versioning is done on an individual domain object/entity independent
of any other domain object/entity. This is simple and is often
sufficient; however, it does have various problems:

1. No way to support relationships. Relationships, whether many-to-many
or a simple has_many involve changes on multiple objects. With
per-object versioning it is impossible to link a change on object A to
object B.

2. No way to change multiple objects as part of one change. For example
consider a wiki with many Page entities. You might want to rename a Page
and update links in other Pages all as part of one update (this is
obviously very similar to atomic multi-file commits in subversion as
well as transactions in a db).

The natural way to solve this is to move to an explicit central
'Revision' object to which version objects hold a reference. I've
already had a go at modifying the existing versioning code to do this:

http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/complex.py
http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/complex_test.py

This leads to code that looks very similar to the original but has
explicit Revisions (from complex_test.py):

...
rev1 = cx.Revision()
rev1.log_message = 'Revision 1'

gilliam = Director(name='Terry Gilliam')
session = sqlalchemy.object_session(gilliam)
monkeys = Movie(id=1, title='12 Monkeys',
                description='draft description', director=gilliam)
bruce = Actor(name='Bruce Willis', movies=[monkeys])

rev1.commit()
...

Note for those looking at the complex_test code: while the tests are
passing, the full overhaul is not yet complete:

* e.g. get_as_of should be deprecated in favour of using an explicit
revision.
* no example of versioning multiple objects
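
To illustrate the central 'Revision' idea with more than one object, here is a minimal sketch in plain Python. It is not the code in complex.py; the `Revision` and `Version` classes and their fields are assumptions made up for illustration, with no database behind them.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Revision:
    """One logical change that may touch many objects (hypothetical)."""
    number: int
    log_message: str = ""
    # Excluded from repr/compare because Version points back at Revision.
    versions: List["Version"] = field(
        default_factory=list, repr=False, compare=False)


@dataclass
class Version:
    """Snapshot of one object's attributes, tied to a Revision."""
    entity_name: str
    entity_id: int
    data: Dict[str, Any]
    revision: Revision

    def __post_init__(self) -> None:
        # Register with the revision so it lists every change made in it.
        self.revision.versions.append(self)


# Renaming a wiki page and fixing a link to it, recorded as one change.
rev = Revision(number=1, log_message="Rename Home to Start")
Version("Page", 1, {"name": "Start"}, rev)
Version("Page", 2, {"body": "see [[Start]]"}, rev)

changed = sorted((v.entity_name, v.entity_id) for v in rev.versions)
```

Because every version holds a reference to its revision, the change to Page 1 and the change to Page 2 can be linked to each other, which per-object versioning cannot do.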


## Miscellaneous Other Improvements

Finally, I point out some other improvements to the existing
versioning model that could be made:

1. Versioning models distinguish between the continuity object and the
versions of that object. However, one would normally aim to keep the
'version' objects away from clients (unless they explicitly need
access). When you request an object at a particular version you still
get that object back, but it transparently sources attributes from the
correct version.[^1] Elixir differs subtly from this in that when you
get an old version you actually get the version object (e.g.
MovieVersion rather than Movie). For 'normal' attributes this is not a
problem, but for relationships it leads to unusual behaviour in that the
has_many and has_and_belongs_to_many attributes do not exist on
versions, e.g. (from a minor mod of an existing test):

oldest_version = movie.get_as_of(after_create)
middle_version = movie.get_as_of(after_update_one)
latest_version = movie.get_as_of(after_update_two)

initial_timestamp = oldest_version.timestamp

assert oldest_version.version == 1
assert oldest_version.description == 'draft description'
# this fails with
# 'MovieVersion' object has no attribute 'director'
# assert oldest_version.director.name == 'Terry Gilliam'
assert oldest_version.director_id == 1
# fails with ...
# 'MovieVersion' object has no attribute 'actors'
# assert len(oldest_version.actors) == 1

I think it would be good to ensure that the version and continuity
objects display the same behaviour. This can be achieved either by
duplicating the continuity mapper more closely or by always using the
continuity object and making attribute calls depend on the currently
set version/revision (more like the approach detailed in [^1]).

[^1]: see e.g. http://www.martinfowler.com/ap2/temporalObject.html and
my implementation in:

http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/base.py
http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/base_test.py
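
The second option (always hand back the continuity object and source attributes from the selected version) might look roughly like the sketch below. This is plain Python, not elixir and not the code in base.py; the `save` and `at_version` methods are hypothetical names, and the director is kept as a plain string to keep the sketch short.

```python
class Movie:
    """Continuity object; attribute reads come from the selected version."""

    def __init__(self):
        self._versions = []   # attribute dicts, oldest first
        self._current = None  # index of the version being viewed

    def save(self, **attrs):
        # Each save copies the latest state and applies the changes on top.
        state = dict(self._versions[-1]) if self._versions else {}
        state.update(attrs)
        self._versions.append(state)
        self._current = len(self._versions) - 1

    def at_version(self, number):
        # Versions are numbered from 1. The view shares the version list
        # but reads from an older entry; it is still a Movie.
        if not 1 <= number <= len(self._versions):
            raise ValueError("no such version: %r" % number)
        view = Movie.__new__(Movie)
        view._versions = self._versions
        view._current = number - 1
        return view

    def __getattr__(self, name):
        # Only called when normal lookup fails. Private names are never
        # looked up in the version data, which avoids infinite recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._versions[self._current][name]
        except (KeyError, IndexError, TypeError):
            raise AttributeError(name)


movie = Movie()
movie.save(title="12 Monkeys", description="draft description",
           director="Terry Gilliam")
movie.save(description="final description")
old = movie.at_version(1)
```

Here `old` is a `Movie`, not a separate version class, so every attribute available on the current object is also available on the old view.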

2. Suppose you have two versioned objects (say both Movie and Director),
and that there are changes to a related Movie and Director, M1 and D1
say, before time T. Suppose you get the old M1 (before T). Then when you
get its director you get the current version of D1, not the old one.

3. State. Once one has versioning it is natural to distinguish between
deleting and purging (or perhaps hibernating and deleting). If I delete
the current object I might still want to be able to go back to earlier
versions of that object where it was not deleted. This is even more
important once we have versioned relationships. Consider the Movie
example with versioned Directors. There I might want to delete a
Director object (and existing dependent Movies) but not have to mess
with old Movie versions that still link to that Director.

The natural way, I would suggest, to deal with this is to introduce a
'state' attribute which can take the values of 'active' (normal) or
'deleted/hibernated'. This then allows us to delete/hibernate objects
without breaking relationships in old versions. The cost of this change
is either an increased burden on clients when doing selects (they need
to explicitly exclude hibernated/deleted items) or messing around with
select-type actions on versioned objects so they ignore
deleted/hibernated objects.
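
As a rough sketch of the 'state' idea, and of the second cost option (selects that hide deleted items by default), consider the following. `DirectorStore` is a hypothetical in-memory stand-in for a table; none of these names come from elixir.

```python
from dataclasses import dataclass

ACTIVE = "active"
DELETED = "deleted"


@dataclass
class Director:
    id: int
    name: str
    state: str = ACTIVE


class DirectorStore:
    """Hypothetical in-memory stand-in for a table of Directors."""

    def __init__(self):
        self._rows = {}

    def add(self, director):
        self._rows[director.id] = director

    def delete(self, director_id):
        # Soft delete: the row stays, so old versions can still point at it.
        self._rows[director_id].state = DELETED

    def purge(self, director_id):
        # Hard delete: the row is really gone.
        del self._rows[director_id]

    def select(self, include_deleted=False):
        # Default selects hide deleted rows so clients need not filter.
        return [d for d in self._rows.values()
                if include_deleted or d.state == ACTIVE]

    def get(self, director_id):
        # Lookup by id ignores state; this is what old versions rely on.
        return self._rows[director_id]


store = DirectorStore()
store.add(Director(1, "Terry Gilliam"))
store.add(Director(2, "Another Director"))
store.delete(1)

visible = [d.name for d in store.select()]
still_there = store.get(1).name
```

After `delete(1)` the director no longer shows up in a normal select, but an old Movie version holding `director_id == 1` can still resolve it through `get`.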

Regards,

Rufus Pollock

Gaetan de Menten

Sep 10, 2007, 6:08:47 AM
to sqle...@googlegroups.com
First, sorry for the slow answer. I hoped Jonathan would answer this
one since he wrote the versioning extension, but it seems like it's
not the case...

On 8/30/07, Rufus Pollock <ru...@rufuspollock.org> wrote:
>
> ## Per-object vs. global versioning
>
> The current versioning implementation of elixir is per-object that is
> all versioning is done on an individual domain object/entity independent
> of any other domain object/entity. This is simple and is often
> sufficient for many situations however it does have various problems:
>
> 1. No way to support relationships. Relationships, whether many-to-many
> or a simple has_many involve changes on multiple objects. With
> per-object versioning it is impossible to link a change on object A to
> object B.

Indeed, though many-to-one relationships should work fine.

> 2. No way to change multiple objects as part of one change. For example
> consider a wiki with many Page entities. You might want to rename a Page
> and update links in other Pages all as part of one update (this is
> obviously very similar to atomic multi-file commits in subversion as
> well as transactions in a db).
>
> The natural way to solve this is move to having an explicit central
> 'Revision' object which version objects hold a reference to.

Seems like a good idea.

> This leads to code that looks very similar to the original but has
> explicit Revisions (from complex_test.py):
>
> ...
> rev1 = cx.Revision()
> rev1.log_message = 'Revision 1'
>
> gilliam = Director(name='Terry Gilliam')
> session = sqlalchemy.object_session(gilliam)
> monkeys = Movie(id=1, title='12 Monkeys', description='draft
> description', director=gilliam)
> bruce = Actor(name='Bruce Willis', movies=[monkeys])
>
> rev1.commit()
> ...

The advantage of the versioned ext is that once it's set up, you do
not need to do anything special. In what you show, you have to
manually create revisions. I think it would be much better to somehow
intercept flushes and do your stuff automatically there. I've seen
that SQLAlchemy 0.4 provides a new SessionExtension mechanism which I
think could be used for that.
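
To make the idea concrete without depending on any SQLAlchemy API, here is a toy sketch: a stand-in session that groups everything changed since the last flush into one automatically numbered revision. `ToySession`, `mark_dirty` and `Page` are all hypothetical names; a real implementation would hook into the session's flush mechanism instead.

```python
import itertools


class ToySession:
    """Toy unit of work; this is not SQLAlchemy's Session."""

    def __init__(self):
        self._dirty = []                    # objects changed since last flush
        self._numbers = itertools.count(1)  # revision numbers
        self.revisions = []                 # list of (number, snapshots)

    def mark_dirty(self, obj):
        # Identity check so that each object appears once per flush.
        if not any(obj is seen for seen in self._dirty):
            self._dirty.append(obj)

    def flush(self):
        # Hook point: everything dirty at flush time shares one revision.
        if not self._dirty:
            return None
        number = next(self._numbers)
        snapshots = [(type(o).__name__, dict(vars(o))) for o in self._dirty]
        self.revisions.append((number, snapshots))
        self._dirty = []
        return number


class Page:
    def __init__(self, name, body=""):
        self.name = name
        self.body = body


session = ToySession()
home = Page("Home")
index = Page("Index", "see [[Home]]")

home.name = "Start"
session.mark_dirty(home)
index.body = "see [[Start]]"
session.mark_dirty(index)

number = session.flush()
```

The user never creates a revision by hand; the rename and the link update end up in the same revision simply because they were flushed together.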

> Note for those looking at the complex_test code that while the tests are
> passing the full overhaul is not yet complete:
>
> * e.g. get_as_of should be deprecated in favour of using explicit
> revision.

That's not the same thing. Being able to search with a timestamp is
something that should be kept, in my opinion.

> * no example of versioning multiple objects
>
>
> ## Miscellaneous Other Improvements
>
> Finally in addition I point out some other improvements to the existing
> versioning model that could be made:
>
> 1. Versioning models distinguish between the continuity object and the
> versions of those object. However one would normally aim to keep the
> 'version' objects away from clients (unless they explicitly need
> access). When you request an object a particular version you still get
> that object back but it transparently sources attributes from the
> correct version.[^1] Elixir differs subtly from this in that when you
> get an old version you actually get the version object (e.g. MovieVersion
> rather than Movie). For 'normal' attributes this is not a problem but
> for relationships it leads to unusual behaviour in that the has_many and
> has_and_belongs_to_many attributes do not exist on versions

Yes, that's definitely something which should be corrected IMO.

> I think it would be good to ensure that the version and continuity
> object display the same behaviour. This can either be achieved by
> duplicating the continuity mapper more closely or by always using the
> continuity object all the time and making attribute calls depend on
> current set version/revision (more like the approach detailed in [^1]).

I'd rather keep the original object as clean as possible not to slow
it down for a normal usage. So I'd go for the first option.

> [^1]: see e.g. http://www.martinfowler.com/ap2/temporalObject.html and
> my implementation in:
>
> http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/base.py
> http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/base_test.py
>
> 2. If you have two versioned objects (say both Movie and Director).
> Suppose there are changes to a related Movie and Director M1 and D1 say
> before time T. Suppose you get the old M1 (before T). Then when you get
> its director you get the current version of D1 not the old one.

I'm not sure here. Depending on the situation, you could want one or
the other. Do you think it would be possible (or rather practical) to
support both behaviors and let the user choose which one to use with an
option?

> 3. State. Once one has versioning it is natural to distinguish between
> deleting and purging (or perhaps hibernating and deleting). If I delete
> the current object I might still want to be able to go back to earlier
> versions of that object where it was not deleted. This is even more
> important once we have versioned relationships. Consider the Movie
> example with versioned Directors. There I might want to delete a
> Director object (and existing dependent Movies) but not have to mess
> with old Movie versions that still link to that Director.

Indeed. At first, I thought this would be easily fixable since we
could have this as an option (delete => delete history). But then I
thought it's not so easy since we'd have no way to access the old
versions, since our current way of doing things (get_as_of/revert_to)
needs an instance. So we'd need to add a class method to somehow
resurrect an instance from the versioned data.
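
Such a class method could look roughly like this sketch. It assumes a hypothetical in-memory version store (`_history`) filled in time order by a `record` method; neither exists in the versioned extension, and integer timestamps are used only to keep the example short.

```python
class Movie:
    # Hypothetical version store: id -> list of (timestamp, attrs),
    # appended in time order.
    _history = {}

    def __init__(self, id, title):
        self.id = id
        self.title = title

    def record(self, timestamp):
        # Store a copy of the current attributes as a new version.
        versions = Movie._history.setdefault(self.id, [])
        versions.append((timestamp, dict(vars(self))))

    @classmethod
    def resurrect(cls, id, as_of):
        # Pick the last version recorded at or before `as_of`. No live
        # instance is needed, only the id and the versioned data.
        candidates = [attrs for ts, attrs in cls._history.get(id, [])
                      if ts <= as_of]
        if not candidates:
            raise LookupError("no version of %r at %r" % (id, as_of))
        return cls(**candidates[-1])


movie = Movie(1, "12 Monkeys")
movie.record(timestamp=10)
movie.title = "Twelve Monkeys"
movie.record(timestamp=20)
del movie  # the "main table" row is gone

old = Movie.resurrect(1, as_of=15)
```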

> The natural way, I would suggest, to deal with this is to introduce a
> 'state' attribute which can take the values of 'active' (normal) or
> 'deleted/hibernated'. This then allows us to delete/hibernate object
> without breaking relationships in old versions. The cost of this change
> is either increased burden on clients when doing selects (they need to

> explicitly exclude hibernated/deleted items) or messing around with


> select type actions on versioned objects so they ignore
> deleted/hibernated objects.

I'd have to think more about this, but my initial reaction is that I
don't like the "state" column idea. Wouldn't it be possible to say
that if the row is not in the main table but still in the versioned
table, it's hibernated? This way what happens to other objects
pointing to the deleted object seems more logical: if there is a
cascade rule, it'll cascade properly. And if there is a set null rule,
the relationship will change as expected, the old relationship being
still present in the versioned table of the other entity (if it has
one).

--
Gaëtan de Menten
http://openhex.org

Jonathan LaCour

Sep 10, 2007, 9:07:07 AM
to sqle...@googlegroups.com
Gaetan de Menten wrote:

> First, sorry for the slow answer. I hoped Jonathan would answer this
> one since he wrote the versioning extension, but it seems like it's
> not the case...

Sorry about that. The initial message went to my Junk Mail, but I did
receive your response, so I'll respond here :)

>> The current versioning implementation of elixir is per-object that
>> is all versioning is done on an individual domain object/entity
>> independent of any other domain object/entity. This is simple and is
>> often sufficient for many situations however it does have various
>> problems:
>>
>> 1. No way to support relationships. Relationships, whether
>> many-to-many or a simple has_many involve changes on multiple
>> objects. With per-object versioning it is impossible to link a
>> change on object A to object B.
>
> Indeed, though many-to-one relationships should work fine.

Agreed, apart from the caveat that Gaetan pointed out.

Yes, I am in 100% agreement with Gaetan here. It should be reasonable
to trace through the object graph to detect changes across relations,
which could then be versioned in a simple way.

>> Note for those looking at the complex_test code that while the tests
>> are passing the full overhaul is not yet complete:
>>
>> * e.g. get_as_of should be deprecated in favour of using explicit
>> revision.
>
> That's not the same thing. Being able to search with a timestamp is
> something that should be kept, in my opinion.

Yes, this must be kept, as it was in the requirements that I was
initially provided by the company I did this work for. I'd also like it
if the API would stay basically the same, with only additions, or minor
changes.

>> ## Miscellaneous Other Improvements
>>
>> Finally in addition I point out some other improvements to the
>> existing versioning model that could be made:
>>
>> 1. Versioning models distinguish between the continuity object and
>> the versions of those object. However one would normally aim to keep
>> the 'version' objects away from clients (unless they explicitly
>> need access). When you request an object a particular version you
>> still get that object back but it transparently sources attributes
>> from the correct version.[^1] Elixir differs subtly from this in
>> that when you get an old version you actually get the version object
>> (e.g. MovieVersion rather than Movie). For 'normal' attributes this
>> is not a problem but for relationships it leads to unusual behaviour
>> in that the has_many and has_and_belongs_to_many attributes do not
>> exist on versions
>
> Yes, that's definitely something which should be corrected IMO.

Agreed. It's the way it is largely because I ran out of time, but I
think that it would be good if the versions could look exactly like the
original objects. However, we'll probably need to make it so that the
Version object inherits from the original model object, where possible,
so that you get any properties as well.

>> I think it would be good to ensure that the version and continuity
>> object display the same behaviour. This can either be achieved by
>> duplicating the continuity mapper more closely or by always using
>> the continuity object all the time and making attribute calls depend
>> on current set version/revision (more like the approach detailed in
>> [^1]).
>
> I'd rather keep the original object as clean as possible not to slow
> it down for a normal usage. So I'd go for the first option.

Yes, see my comment above.

>> [^1]: see e.g. http://www.martinfowler.com/ap2/temporalObject.html
>> and my implementation in:
>>
>> http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/base.py
>> http://knowledgeforge.net/ckan/svn/vdm/trunk/vdm/elixir/
>> base_test.py
>>
>> 2. If you have two versioned objects (say both Movie and Director).
>> Suppose there are changes to a related Movie and Director M1 and D1
>> say before time T. Suppose you get the old M1 (before T). Then when
>> you get its director you get the current version of D1 not the old
>> one.
>
> I'm not sure here. Depending on the situation, you could want one or
> the other. Do you think it would be possible (or rather practical) to
> support both behavior and let the user choose which one to use with an
> option?

Eek. This seems like a hairy problem. I'd also think that the desired
behavior would depend on the problem, but I'd suggest starting by making
it so that you implement the case that you think is most optimal, and
then maybe we can add in support for the other later.

In projects of my own where I have versioning and history, I almost
always end up going with a state column as described above. It makes
for a lot of manual tweaking of selects to make sure that you don't get
back deleted objects in your queries, which could lead to subtle bugs
in your code, but if you are diligent and test well, it ends up feeling
quite natural, and it's also very safe.

That being said, I think the simplest approach is the best in this case,
which leads me to believe that we might want to go with the classmethod
approach, as it imposes less upon the user's code.

--
Jonathan LaCour
http://cleverdevil.org

Rufus Pollock

Nov 7, 2007, 4:38:15 AM
to sqle...@googlegroups.com
Apologies for the huge delay in responding on this thread ...

Jonathan LaCour wrote:
[snip]

>>> The current versioning implementation of elixir is per-object that
>>> is all versioning is done on an individual domain object/entity
>>> independent of any other domain object/entity. This is simple and is
>>> often sufficient for many situations however it does have various
>>> problems:
>>>
>>> 1. No way to support relationships. Relationships, whether
>>> many-to-many or a simple has_many involve changes on multiple
>>> objects. With per-object versioning it is impossible to link a
>>> change on object A to object B.
>>
>> Indeed, though many-to-one relationships should work fine.

They'll only work one way. I.e. you'll see the change on the object with
the Foreign Key but not see changes from the other end.

I agree it should be simple, but once you want to track changes to
multiple objects as part of one single 'change/revision' you'll need
some kind of revision object. Of course this can be taken care of behind
the scenes as Gaetan suggests. I'd be happy to code something like this
up given some pointers on how to walk the session to get the set of
objects that have changed.

>>> Note for those looking at the complex_test code that while the tests
>>> are passing the full overhaul is not yet complete:
>>>
>>> * e.g. get_as_of should be deprecated in favour of using explicit
>>> revision.
>> That's not the same thing. Being able to search with a timestamp is
>> something that should be kept, in my opinion.
>
> Yes, this must be kept, as it was in the requirements that I was
> initially provided by the company I did this work for. I'd also like it
> if the API would stay basically the same, with only additions, or minor
> changes.

Sure, this would be trivial to implement via a revision object in any case.
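
For instance, a timestamp lookup can be layered on top of revisions along these lines. This is a sketch under assumptions: revision numbers start at 1, `revision_times` is a sorted list holding one timestamp per revision, and `versions` maps a revision number to an object's attributes at that revision. Both function names are hypothetical helpers, and integer timestamps are used for brevity.

```python
from bisect import bisect_right


def revision_as_of(revision_times, when):
    """Return the number of the latest revision made at or before `when`.

    Returns None if `when` predates every revision.
    """
    index = bisect_right(revision_times, when)
    return index or None


def get_as_of(versions, revision_times, when):
    """Return one object's attributes as they stood at time `when`."""
    revision = revision_as_of(revision_times, when)
    if revision is None:
        return None
    # The object may not have changed in every revision, so take its
    # most recent version at or before the revision found.
    usable = [number for number in versions if number <= revision]
    return versions[max(usable)] if usable else None


revision_times = [100, 200, 300]
movie_versions = {
    1: {"description": "draft description"},
    3: {"description": "final description"},
}

middle = get_as_of(movie_versions, revision_times, when=250)
```

Here `when=250` falls in revision 2, in which the movie did not change, so the lookup falls back to the movie's version from revision 1.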

>>> ## Miscellaneous Other Improvements
>>>
>>> Finally in addition I point out some other improvements to the
>>> existing versioning model that could be made:
>>>
>>> 1. Versioning models distinguish between the continuity object and
>>> the versions of those object. However one would normally aim to keep
>>> the 'version' objects away from clients (unless they explicitly
>>> need access). When you request an object a particular version you
>>> still get that object back but it transparently sources attributes
>>> from the correct version.[^1] Elixir differs subtly from this in
>>> that when you get an old version you actually get the version object
>>> (e.g. MovieVersion rather than Movie). For 'normal' attributes this
>>> is not a problem but for relationships it leads to unusual behaviour
>>> in that the has_many and has_and_belongs_to_many attributes do not
>>> exist on versions
>> Yes, that's definitely something which should be corrected IMO.
>
> Agreed. It's the way it is largely because I ran out of time, but I
> think that it would be good if the versions could look exactly like the
> original objects. However, we'll probably need to make it so that the
> Version object inherits from the original model object, where possible,
> so that you get any properties as well.

Yes, I think that is about right, though one might need to be careful
regarding circularity. It's more like:

    MyEntityBase                 # written as normal
     A        A
     |        |
 MyEntity     |                  # Continuity (versioned object)
              |
              |
       MyEntityVersion           # Versions of MyEntity
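
In plain Python the arrangement could be sketched as follows. The class names and the `snapshot` method are made up for illustration, and no mapper is involved.

```python
class MovieBase:
    """Written as normal: plain attributes plus any properties."""

    def __init__(self, title, year):
        self.title = title
        self.year = year

    @property
    def label(self):
        return "%s (%d)" % (self.title, self.year)


class Movie(MovieBase):
    """Continuity object: the current state, plus access to versions."""

    def __init__(self, title, year):
        super().__init__(title, year)
        self.versions = []

    def snapshot(self):
        # Copy the current state into a new, numbered version.
        version = MovieVersion(self.title, self.year, len(self.versions) + 1)
        self.versions.append(version)
        return version


class MovieVersion(MovieBase):
    """A frozen copy; inherits from the base, not from Movie."""

    def __init__(self, title, year, version):
        super().__init__(title, year)
        self.version = version


movie = Movie("12 Monkeys", 1995)
first = movie.snapshot()
```

The version gets user-defined properties such as `label` from the shared base, while versioning machinery such as `snapshot` stays on the continuity class only, so a version cannot itself be versioned.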

>>> I think it would be good to ensure that the version and continuity
>>> object display the same behaviour. This can either be achieved by
>>> duplicating the continuity mapper more closely or by always using
>>> the continuity object all the time and making attribute calls depend
>>> on current set version/revision (more like the approach detailed in
>>> [^1]).
>> I'd rather keep the original object as clean as possible not to slow
>> it down for a normal usage. So I'd go for the first option.
>
> Yes, see my comment above.

Agreed.

[snip]

>>> 2. If you have two versioned objects (say both Movie and Director).
>>> Suppose there are changes to a related Movie and Director M1 and D1
>>> say before time T. Suppose you get the old M1 (before T). Then when
>>> you get its director you get the current version of D1 not the old
>>> one.
>> I'm not sure here. Depending on the situation, you could want one or
>> the other. Do you think it would be possible (or rather practical) to
>> support both behavior and let the user choose which one to use with an
>> option?
>
> Eek. This seems like a hairy problem. I'd also think that the desired
> behavior would depend on the problem, but I'd suggest starting by making
> it so that you implement the case that you think is most optimal, and
> then maybe we can add in support for the other later.

I think normally people will expect to be traversing the object tree as
it existed at the time -- not as it exists now. However I can see this
might not always be the case (e.g. perhaps for a wiki?)
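
A small sketch of the two behaviours, with a flag to choose between them. `Versioned`, `director_of` and `follow_time` are hypothetical names, timestamps are plain integers, and the history is assumed to be recorded in time order.

```python
class Versioned:
    """Hypothetical helper: keeps (timestamp, attrs) pairs, oldest first."""

    def __init__(self):
        self.history = []

    def record(self, timestamp, **attrs):
        self.history.append((timestamp, attrs))

    def as_of(self, when):
        # Last recorded state at or before `when`, or None if there is none.
        state = None
        for timestamp, attrs in self.history:
            if timestamp <= when:
                state = attrs
        return state


def director_of(movie, directors, when, follow_time=True):
    """Resolve movie -> director either as of `when` or as of now."""
    movie_state = movie.as_of(when)
    director = directors[movie_state["director_id"]]
    if follow_time:
        return director.as_of(when)   # the tree as it existed at the time
    return director.history[-1][1]    # the director as it exists now


d1 = Versioned()
m1 = Versioned()
d1.record(1, name="Terry Gilliam")
m1.record(1, title="12 Monkeys", director_id=1)
d1.record(5, name="T. Gilliam")

then = director_of(m1, {1: d1}, when=3)
now = director_of(m1, {1: d1}, when=3, follow_time=False)
```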

>>> 3. State. Once one has versioning it is natural to distinguish
>>> between deleting and purging (or perhaps hibernating and
>>> deleting). If I delete the current object I might still want
>>> to be able to go back to earlier versions of that object where
>>> it was not deleted. This is even more important once we have
>>> versioned relationships. Consider the Movie example with versioned
>>> Directors. There I might want to delete a Director object (and
>>> existing dependent Movies) but not have to mess with old Movie
>>> versions that still link to that Director.
>>
>> Indeed. At first, I thought this would be easily fixable since we
>> could have this as an option (delete => delete history). But then I
>> thought it's not so easy since we'd have no way to access the old
>> versions, since our current way of doing things (get_as_of/revert_to)
>> needs an instance. So we'd need to add a class method to somehow
>> resurrect an instance from the versioned data.

Yup. I don't think that is the best way to go.

>>> The natural way, I would suggest, to deal with this is to introduce
>>> a 'state' attribute which can take the values of 'active' (normal)
>>> or 'deleted/hibernated'. This then allows us to delete/hibernate
>>> object without breaking relationships in old versions. The cost
>>> of this change is either increased burden on clients when doing
>>> selects (they need to explicitly exclude hibernated/deleted items)
>>> or messing around with select type actions on versioned objects so
>>> they ignore deleted/hibernated objects.
>>
>> I'd have to think more about this, but my initial reaction is that I
>> don't like the "state" column idea. Wouldn't it be possible to say
>> that if the row is not in the main table but still in the versioned
>> table, it's hibernated? This way what happens to other objects
>> pointing to the deleted object seem more logical: if there is a
>> cascade rule, it'll cascade properly. And if there is a set null rule,
>> the relationship will change as expected, the old relationship being
>> still present in the versioned table of the other entity (if it has
>> one).

This isn't possible since versioned objects' foreign keys point to
continuity objects (not versions of those objects), and hence if you
delete the item from the main table you will mess up all foreign keys
(has_many, etc). To make it clear, suppose you are deleting a particular
movie. While it may not be referenced by current directors, it may be
referenced by old director versions, and deleting will break them
(unless you are happy to update all of them with NULL, which starts to
make reverting *really* complicated if not impossible).

> In projects of my own where I have versioning and history, I almost
> always end up going with a state column as described above. It makes
> for a lot of manual tweaking of selects to make sure that you don't get
> back deleted objects in your queries, which could lead to subtle bugs
> in your code, but if you are diligent and test well, it ends up feeling
> quite natural, and it's also very safe.
>
> That being said, I think the simplest approach is the best in this case,
> which leads me to believe that we might want to go with the classmethod
> approach, as it imposes less upon the user's code.

I'm not sure how this would deal with the issue of foreign keys pointing
to the continuity object. One might also want to be cautious about
having too much behind the scenes 'magic' in an area that already has
quite a lot. However I take the point that one would want to keep this
simple.

~rufus

Gaetan de Menten

Nov 7, 2007, 6:11:22 AM
to sqle...@googlegroups.com
On 11/7/07, Rufus Pollock <ru...@rufuspollock.org> wrote:
> >
> >> The advantage of the versioned ext is that once it's set up, you
> >> do not need to do anything special. In what you show, you have to
> >> manually create revisions. I think it would be much better to somehow
> >> intercept flushes and do your stuff automatically there. I've seen
> >> that SQLAlchemy 0.4 provides a new SessionExtension mechanism which I
> >> think could be used for that.
>
> I agree it should be simple but once you want to track changes to
> multiple objects as part of one single 'change/revision' you'll need
> some kind of revision object. Of course this can be taken care of behind
> the scenes as Gaetan suggests. I'd be happy to code something like this
> up given some pointers on how to walk the session to get the set of
> objects that have changed.

See the attached proof-of-concept script. Note that in a real version,
we'd probably prefer to insert the records in the version table by
using the SQL layer rather than through the ORM as I did in that
example.

> > I think that it would be good if the versions could look exactly like the
> > original objects. However, we'll probably need to make it so that the
> > Version object inherits from the original model object, where possible,
> > so that you get any properties as well.
>
> Yes, I think that is about right though one might need to be careful
> regarding circularity its more like:
>
>     MyEntityBase                 # written as normal
>      A        A
>      |        |
>  MyEntity     |                  # Continuity (versioned object)
>               |
>               |
>        MyEntityVersion           # Versions of MyEntity

I don't understand why you need three objects.


> >>> 2. If you have two versioned objects (say both Movie and Director).
> >>> Suppose there are changes to a related Movie and Director M1 and D1
> >>> say before time T. Suppose you get the old M1 (before T). Then when
> >>> you get its director you get the current version of D1 not the old
> >>> one.
> >> I'm not sure here. Depending on the situation, you could want one or
> >> the other. Do you think it would be possible (or rather practical) to
> >> support both behavior and let the user choose which one to use with an
> >> option?
> >
> > Eek. This seems like a hairy problem. I'd also think that the desired
> > behavior would depend on the problem, but I'd suggest starting by making
> > it so that you implement the case that you think is most optimal, and
> > then maybe we can add in support for the other later.
>
> I think normally people will expect to be traversing the object tree as
> it existed at the time -- not as it exists now. However I can see this
> might not always be the case (e.g. perhaps for a wiki?)

Let's go for that and we'll do the other case if the need arises.

I might be missing something obvious, but I don't see the problem as
long as both objects are versioned (if it isn't the case, reverting
will be impossible but that's to be expected). The trick is to make
the history tables' foreign keys point to the other history tables and
not the main table. In the example below, m1 is my first movie, d1 a
director, m1r1 is movie 1 at revision 1, and "->" means "has a foreign
key pointing to".

Movie           MovieHistory
m1->d1          (empty)

Director        DirectorHistory
d1              (empty)


after director delete:

Movie           MovieHistory
m1->NULL        m1r1 -> d1r1

Director        DirectorHistory
(empty)         d1r1


create new director and link Movie to him:

Movie           MovieHistory
m1->d2          m1r1 -> d1r1
                m1r2 -> NULL

Director        DirectorHistory
d2              d1r1
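
The same three stages can be replayed against an in-memory SQLite database with Python's sqlite3 module. This is only a sketch: the table and column names (including the `rev` and `director_rev` columns) are assumptions, and the history rows are inserted by hand where the extension would do it automatically.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
db.executescript("""
CREATE TABLE director (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movie (
    id INTEGER PRIMARY KEY,
    title TEXT,
    director_id INTEGER REFERENCES director(id) ON DELETE SET NULL
);
CREATE TABLE director_history (
    id INTEGER, rev INTEGER, name TEXT,
    PRIMARY KEY (id, rev)
);
-- History rows reference director_history, never the main director table.
CREATE TABLE movie_history (
    id INTEGER, rev INTEGER, title TEXT,
    director_id INTEGER, director_rev INTEGER,
    PRIMARY KEY (id, rev),
    FOREIGN KEY (director_id, director_rev)
        REFERENCES director_history(id, rev)
);
""")

# Initial state: m1 -> d1, both history tables empty.
db.execute("INSERT INTO director VALUES (1, 'd1')")
db.execute("INSERT INTO movie VALUES (1, 'm1', 1)")

# Director delete: archive d1r1 and m1r1 (-> d1r1), then delete d1.
db.execute("INSERT INTO director_history VALUES (1, 1, 'd1')")
db.execute("INSERT INTO movie_history VALUES (1, 1, 'm1', 1, 1)")
db.execute("DELETE FROM director WHERE id = 1")
after_delete = db.execute(
    "SELECT director_id FROM movie WHERE id = 1").fetchone()

# New director: archive m1r2 (-> NULL), then link m1 to d2.
db.execute("INSERT INTO movie_history VALUES (1, 2, 'm1', NULL, NULL)")
db.execute("INSERT INTO director VALUES (2, 'd2')")
db.execute("UPDATE movie SET director_id = 2 WHERE id = 1")

current = db.execute(
    "SELECT director_id FROM movie WHERE id = 1").fetchone()
old = db.execute("""
    SELECT dh.name FROM movie_history mh
    JOIN director_history dh
      ON dh.id = mh.director_id AND dh.rev = mh.director_rev
    WHERE mh.id = 1 AND mh.rev = 1
""").fetchone()
```

Deleting d1 sets the live movie's director to NULL through the set null rule, while m1r1 still resolves to d1r1 because it never pointed at the main director table.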

Gaetan de Menten

Nov 7, 2007, 6:13:24 AM
to sqle...@googlegroups.com
> See the attached proof-of-concept script. Note that in a real version,
> we'd probably prefer to insert the records in the version table by
> using the SQL layer rather than through the ORM as I did in that
> example.

Attachment included this time.

test_session_ext.py