RFC: Django history tracking

Uros Trebec

Jun 14, 2006, 7:11:42 AM
to Django developers
Hi, everyone!

First: introduction. My name is Uros Trebec and I was lucky enough to be
selected to implement my idea of "history tracking" in Django. I guess
at least some of you think this is a very nice feature to have in a web
framework, so I would like to thank all of you who voted for my Summer of
Code proposal! Thank you!

Ok, to get right to the point: this is a Request For Comments. I would
like to know what you think about my idea for the implementation and how
I can make it better. Here is what I have in mind so far...

(Just for reference: http://zabica.org/~uros/soc/ . Here you can find
my initial project proposal and some diagrams.)


1. PROPOSAL:
The main idea is to create a way to keep a content history for every
change in a Model. Current change tracking is very limited, to say the
least, so I will extend/replace it so that one can actually see how
something was changed.

1.1 SCOPE:
Changes will have to be made in different parts of Django. Most of
the things should be taken care of inside django.db, except diff-ing
and merging.


USAGE

2. MODELS:
The easiest way to imagine how stuff will work is to have an actual
use case. So, let's see how Bob would use this feature.

2.1. Basic models:
To enable history tracking Bob has to create a sub-class for those
models that he would like to track:

    from django.db import models

    class Post(models.Model):
        author = models.CharField(maxlength=100)
        title = models.CharField(maxlength=100)
        content = models.TextField()
        date = models.DateField()

        # The presence of this class switches history tracking on.
        class History:
            pass

This works much like using the Admin subclass. The difference is that if
the subclass is present, the database will change to include two
tables for this class:

(the main table - not changed):

    CREATE TABLE app_post (
        "id" serial NOT NULL PRIMARY KEY,
        "author" varchar(100) NOT NULL,
        "title" varchar(100) NOT NULL,
        "content" text NOT NULL,
        "date" date NOT NULL
    );


(and the history table):

    CREATE TABLE app_post_history (
        "id" serial NOT NULL PRIMARY KEY,
        "change_date" date NOT NULL,      -- required for datetime revert
        "parent_id" integer NOT NULL REFERENCES app_post (id),
        "author" varchar(100) NOT NULL,   -- data from app_post
        "title" varchar(100) NOT NULL,    -- data from app_post
        "content" text NOT NULL,          -- data from app_post
        "date" date NOT NULL              -- data from app_post
    );

I think this would be enough to be able to save a "basic full" version of
a changed record. "parent_id" is a ForeignKey to app_post.id, so Bob can
find the saved revisions for a record from app_post, and when he
selects a record from _history he knows which record it belongs to.
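
A minimal sketch of the "full copy" idea, using Python's sqlite3 module
instead of Django so that it runs on its own. The helper names make_db and
update_post are hypothetical, and the schema is the one above adapted to
SQLite types; none of this is part of the proposal's actual code.

```python
import sqlite3

def make_db():
    """Create the main table and its history table in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE app_post (
            id INTEGER PRIMARY KEY,
            author TEXT NOT NULL,
            title TEXT NOT NULL,
            content TEXT NOT NULL,
            date TEXT NOT NULL
        );
        CREATE TABLE app_post_history (
            id INTEGER PRIMARY KEY,
            change_date TEXT NOT NULL,
            parent_id INTEGER NOT NULL REFERENCES app_post (id),
            author TEXT NOT NULL,
            title TEXT NOT NULL,
            content TEXT NOT NULL,
            date TEXT NOT NULL
        );
    """)
    return db

def update_post(db, post_id, change_date, **fields):
    """Copy the current row into the history table, then apply the update."""
    db.execute(
        "INSERT INTO app_post_history "
        "(change_date, parent_id, author, title, content, date) "
        "SELECT ?, id, author, title, content, date FROM app_post WHERE id = ?",
        (change_date, post_id),
    )
    # Column names are interpolated, so they must come from trusted code.
    assignments = ", ".join(f"{name} = ?" for name in fields)
    db.execute(
        f"UPDATE app_post SET {assignments} WHERE id = ?",
        (*fields.values(), post_id),
    )

db = make_db()
db.execute(
    "INSERT INTO app_post VALUES (1, 'bob', 'Hello', 'first draft', '2006-06-14')"
)
update_post(db, 1, "2006-06-15 10:00:00", content="second draft")

history = db.execute(
    "SELECT parent_id, content FROM app_post_history ORDER BY id"
).fetchall()
current = db.execute("SELECT content FROM app_post WHERE id = 1").fetchone()[0]
```

In this sketch the history row holds the values as they were before the
edit, and parent_id links it back to the record in app_post.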


2.2. Selective models:
But what if Bob doesn't want to keep all the information in history (why
someone would like to keep an incomplete track of a record is beyond
me, but never mind)? Maybe the 'author' and 'date' of a post can't
change, so he would like to leave those out. At the same time, he would
like to know who made the change, even though he does not need that
information when using Post.

Again, this works like the Admin subclass when defining which fields to
use:

    class Post(models.Model):
        author = models.CharField(maxlength=100)
        title = models.CharField(maxlength=100)
        content = models.TextField()
        date = models.DateField()

        class History:
            # Only these fields are copied into the history table.
            track = ('title', 'content')
            # Extra columns that exist only in the history table.
            additional = {
                "changed_by": "models.CharField(maxlength=100)",
            }


In this case "app_post_history" would look like this:

    CREATE TABLE app_post_history (
        "id" serial NOT NULL PRIMARY KEY,
        "change_date" date NOT NULL,        -- required for datetime revert
        "parent_id" integer NOT NULL REFERENCES app_post (id),
        "title" varchar(100) NOT NULL,      -- data from app_post
        "content" text NOT NULL,            -- data from app_post
        "changed_by" varchar(100) NOT NULL  -- new field
    );

3. VIEWS
3.1. Listing the change-sets:
Ok, so after a few edits Bob would like to see when and what was
added/changed in one specific record.

A view should probably work something like this:

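    from django.shortcuts import render_to_response
    from django.history import list_changes  # proposed module, not in Django yet

    def show_changes(request, post_id):
        # Use the same name that is passed to the template below.
        post_changes_list = list_changes(post_id)
        return render_to_response('app/changes.html',
                                  {'post_changes_list': post_changes_list})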

And a template:

    <h1>{{ post.title }}</h1>
    <ul>
    {% for change in post_changes_list %}
        <li><div>
            <b>{{ change.id }}</b>
            <h3>{{ change.title }}</h3>
            <p>{{ change.content }}</p>
            <b>{{ change.change_date }}</b>
        </div></li>
    {% endfor %}
    </ul>

So, this lists all changes for a record from "app_post" table. It's
just what a developer would use.

4. MERGING/REVERTING/ROLLBACK
I'm not sure if there is a best way to do this, but I imagine it
should be done something like this:

4.1 Full revert

object = get_object_or_404(Post, pk=id)
object2 = object.version(-1)


4.2 Merge only selected "changes"

object.content = object.content.version(-1)

4.3 Version selection

    object.version(-i)              # go back "i" versions
    object.version(d)               # "d" is a datetime: find the version
                                    # with "d" as "change_date"
    object.version(object.content)  # find the last version in which a
                                    # change was made to the "content" field

The above functions would all return an object of the same type as
"object" if "object.version()" is used. If "object.field.version()" is
used it would return the object corresponding to the "field".

The problem with this is that you don't get a direct access to
"history_table" specific fields, like "change_date" or additional
fields.
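
One way to picture the three selectors is a small stand-alone function
over a list of history rows. This is only a sketch under assumptions: the
list of dicts stands in for app_post_history, the function name version
and the use of a field name (rather than object.content) as the third
selector are hypothetical. Returning the whole history row is one possible
answer to the problem above, since "change_date" stays reachable.

```python
from datetime import datetime

# Hypothetical stand-in for rows of app_post_history, oldest first.
history = [
    {"change_date": datetime(2006, 6, 14, 9, 0), "title": "Hello", "content": "v1"},
    {"change_date": datetime(2006, 6, 15, 9, 0), "title": "Hello", "content": "v2"},
    {"change_date": datetime(2006, 6, 16, 9, 0), "title": "Hello!", "content": "v2"},
]

def version(history, selector):
    """Return one saved revision, chosen by offset, change date or field name."""
    if isinstance(selector, int):
        # Negative offsets count back from the most recent saved revision.
        return history[selector]
    if isinstance(selector, datetime):
        # Exact match on change_date.
        for row in history:
            if row["change_date"] == selector:
                return row
        raise LookupError(selector)
    # Otherwise treat the selector as a field name and walk backwards,
    # comparing each revision with the one saved just before it.
    for older, newer in zip(reversed(history[:-1]), reversed(history[1:])):
        if older[selector] != newer[selector]:
            return newer
    return history[0]

latest = version(history, -1)
by_date = version(history, datetime(2006, 6, 15, 9, 0))
last_content_change = version(history, "content")
```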

IMPLEMENTATION

5. CHANGES
A question that pops up here is how changes will be stored.

The answer is not so straightforward because there are a lot of different
field types. For most of them nothing is possible other than a
full copy. But there are some of them (TextField, CharField, ...)
that will work better if we store just the difference between the current
version (the edited one, the one to be saved in the main "app_post")
and the one that was retrieved from the database before it was edited.

For the latter it would be wise to use Python's "difflib" [0] to
calculate the difference and to merge it back when a comparison is needed.

For this one I'm not too sure how it should work.
- When saving, should it retrieve the original version _again_ and then
apply the 'diff' over current and original one? Or should the original
be already available somehow?
- (more questions to come)
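
For reference, a small sketch of what difflib offers for this: ndiff()
builds a delta from two versions and restore() recovers either version
from that delta. The sample texts are made up. Note that an ndiff delta
contains both versions, so it is not smaller than a full copy; a compact
format such as unified_diff() would need separate code to apply it,
because difflib itself has no patch function.

```python
import difflib

old = "Django is a web framework.\nIt has models.\n".splitlines(keepends=True)
new = "Django is a web framework.\nIt has models and views.\n".splitlines(keepends=True)

# The delta marks every line as common, removed, added or a hint line.
delta = list(difflib.ndiff(old, new))

# restore() rebuilds one side of the comparison: 1 = old text, 2 = new text.
restored_old = "".join(difflib.restore(delta, 1))
restored_new = "".join(difflib.restore(delta, 2))
```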

The preliminary diagram of my original idea is here:
http://zabica.org/~uros/soc/Soc_django1.png
What do you think?

PS: Current version of this RFC can be found at [1]. And I do have a
category on my blog [2], where I'll post about the progress and such.

[0] http://www.python.org/doc/current/lib/module-difflib.html
[1] http://zabica.org/~uros/soc/rfc.txt
[2] http://zabica.org/uros/category/soc/

Tom Tobin

Jun 14, 2006, 11:31:02 AM
to django-d...@googlegroups.com
On 6/14/06, Uros Trebec <uros....@gmail.com> wrote:
>
> 2.1. Basic models:
> To enable history tracking Bob has to create a sub-class for those
> models that he will like to track:
>
> class Post(models.Model):
>     author = models.CharField(maxlength=100)
>     title = models.CharField(maxlength=100)
>     content = models.TextField()
>     date = models.DateField()
>
>     class History:
>         pass

A minor quibble: "History" would be called an "inner class" in this
case; a "subclass" would be a class that inherits from another class,
like "Post" in this example (being a subclass of Model).

Other than that, neat stuff! :-)

DavidA

Jun 15, 2006, 9:09:43 AM
to Django developers
There was a similar thread on this earlier where I commented about a
slightly different way to store the changes:
http://groups.google.com/group/django-users/browse_thread/thread/f36f4e48f9579fff/0d3d64b25f3fd506?q=time_from&rnum=1

To summarize, in the past I've used a time_from/time_thru pair of
date/time columns to make it more efficient to retrieve the version of
a row as it looked at a particular point in time. Your design of just
using change_date makes this more difficult.

I can also think of use cases where I want the versioning to track both
date and time since I would expect multiple changes on the same day.

Maybe these could also be options?

william

Jun 17, 2006, 3:27:36 AM
to Django developers

Sounds nice, this is a feature I'm currently looking for... but I've
already started my own implementation.

I would just like to share it with you.

I've built a single History table with:
- "change": a text field which will contain a Python pickled
  dictionary, {field: old_value}, in case you update a record.
- "type": type of modification (update, delete, insert).
- "obj": the table object. This can come from ContentType.
- "obj_id": the id of the impacted object.
- "create_date": a timestamp automatically set.

I'm using it by overriding the save method in each model whose history
I want to see.
This is quite flexible, because you can decide which fields you want to
track.
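
A rough sketch of what one row of such a generic table could hold, in
plain Python rather than Django. The function record_update, the list
history_rows and the sample values are hypothetical; the keys follow the
column list above.

```python
import pickle
from datetime import datetime, timezone

history_rows = []  # stand-in for the single History table

def record_update(obj_type, obj_id, old, new, tracked):
    """Append one row holding the old values of the tracked fields that changed."""
    change = {f: old[f] for f in tracked if old[f] != new[f]}
    if change:
        history_rows.append({
            "change": pickle.dumps(change),   # pickled {field: old_value}
            "type": "update",
            "obj": obj_type,
            "obj_id": obj_id,
            "create_date": datetime.now(timezone.utc),
        })
    return change

old = {"title": "A", "content": "x", "date": "2006-06-14"}
new = {"title": "B", "content": "x", "date": "2006-06-17"}

# Only "title" and "content" are tracked, so the date change is ignored.
recorded = record_update("post", 1, old, new, tracked=("title", "content"))
```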

To make this one step easier, it would be nice to have a
PickledField in Django's models module.

Feedback is welcome.

Vitaliy Fuks

Jun 17, 2006, 7:14:37 PM
to Django developers
Hi Uros,

Great to see that your RFC is pretty much exactly what I was thinking
(feature and implementation-wise) when I posted
http://groups.google.com/group/django-developers/browse_thread/thread/d90001b1d043253e/77d36caaf8cfb071

It would be nice to record "who" made the change (optionally when there
is a user with an id available).

I thought that storing complete row copies on both inserts and updates
to the original object isn't that bad - it certainly simplifies the
machinery. Because I was planning to read the history tables very
infrequently, their size wasn't a big factor in my mind.

An admin view that shows change history as colored "diff" output and can
revert to an arbitrary previous version would be an obvious future addition.

Jeremy Dunck

Jun 17, 2006, 7:26:46 PM
to django-d...@googlegroups.com
On 6/17/06, Vitaliy Fuks <vita...@gmail.com> wrote:
> It would be nice to record "who" made the change (optionally when there
> is a user with an id available).

+1

> I thought that storing complete row copies on both inserts and updates
> to original object isn't that bad - it certainly simplifies the
> machinery. Because the way I was considering using this feature would
> read history tables very infrequent their size wasn't a big factor in
> my mind.

Wikipedia, probably one of the largest text diff histories, still
doesn't do compression of previous revisions. It can wait. ;-)

Malcolm Tredinnick

Jun 17, 2006, 8:39:06 PM
to django-d...@googlegroups.com
On Sat, 2006-06-17 at 07:27 +0000, william wrote:
>
[...]

> Sounds nice, this is a feature I'm currently looking for... but I've
> already started my own implementation.
>
> I would just share it with you.
>
> I've build a single table History with :
> - "change"; a text field which will contain a python pickled
> dictionary: { field: old_value} in case you update a record.

A drawback of this is that you pay the price when searching for "all
changes to field X since date D" or "show me a change history for field
X". You have to read and unpickle every single row before you can know
whether to discard it or not.
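
To illustrate the cost, here is a small sketch with made-up rows. The
object id is an ordinary column, so a database could filter on it, but
which fields changed is only known after unpickling. The names rows and
changes_to_field are hypothetical.

```python
import pickle

# Hypothetical rows of a generic history table: (obj_id, pickled {field: old_value}).
rows = [
    (1, pickle.dumps({"title": "Hello"})),
    (1, pickle.dumps({"content": "first draft"})),
    (2, pickle.dumps({"title": "Other"})),
    (1, pickle.dumps({"title": "Hello again", "content": "second draft"})),
]

def changes_to_field(rows, obj_id, field):
    """Collect the old values of one field and count how many rows were unpickled."""
    unpickled = 0
    found = []
    for row_obj_id, blob in rows:
        if row_obj_id != obj_id:
            continue  # filtering on a real column needs no unpickling
        change = pickle.loads(blob)  # required just to see which fields changed
        unpickled += 1
        if field in change:
            found.append(change[field])
    return found, unpickled

titles, cost = changes_to_field(rows, 1, "title")
```

Here three rows are unpickled to find the two that actually touch "title".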

Another thing that occurs to me -- may not be relevant in your
particular situation, but does have general application -- unless you
are setting your timestamps manually (and it has sufficiently fine
granularity), you don't get a concept of a "changeset" of changes that
happen all at once, which does fall out of Uros's implementation.

> - type: type of modification (update, delete, insert).
> - "obj": the table object. This can come from ContentType
> - "obj_id": the id of the impacted object.
> - create_date: a timestamp automatically set.

Regards,
Malcolm

Gábor Farkas

Jun 19, 2006, 10:57:34 AM
to django-d...@googlegroups.com
Uros Trebec wrote:
> class Post(models.Model):
>     author = models.CharField(maxlength=100)
>     title = models.CharField(maxlength=100)
>     content = models.TextField()
>     date = models.DateField()

hi,

sorry to jump in so late into the discussion, but right now i'm in a
situation where maybe model-history is the answer.

a question:

what about ForeignKeys and ManyToManyFields? how do you propose to
version those?

what if a ForeignKey-relation changes?

gabor

IanSparks

Jun 19, 2006, 3:52:05 PM
to Django developers
Uros Trebec wrote:

Although you have a date field in your example model it might not hurt
to add an automatic timestamp to a model that uses versioning in this
way. One that relies on a "data" field that could be changed by a user
doesn't seem safe to me.

I'd also like an automatic userid stamp on there over and above the
"author" which again is a data field not a hidden system field.

You might also consider some automatic "revision number" system which
increments every time the record is changed. This makes it easier to
"roll back" to the previous entry and can be a lifesaver if something
happens to whatever is providing the dates.


> 2.2. Selective models:
> But what if Bob doesn't want to have every information in history (why
> would someone like to keep an incomplete track of a record is beyond
> me, but never mind)?

Me either. Suggests that this may be a non-feature?

I think the framework suggested is a great start. I would be interested
in seeing a feature that tied changes not just to the user who made the
change but also to the "session" that they made the change in. i.e. if
my system allows "Dave" to have two active sessions at different
computers I'd like to track what he did in each session not just what
date the changes occurred. This is very helpful for user complaints and
fraud detection.

Uros Trebec

Jun 19, 2006, 6:01:23 PM
to django-d...@googlegroups.com
Hi!

> There was a similar thread on this earlier where I commented about a
> slightly different way to store the changes:
> http://groups.google.com/group/django-users/browse_thread/thread/f36f4e48f9579fff/0d3d64b25f3fd506?q=time_from&rnum=1

Thanks for this one, I already found something useful.

> To summarize, in the past I've used a time_from/time_thru pair of
> date/time columns to make it more efficient to retrieve the version of
> a row as it looked at a particular point in time. Your design of just
> using change_date makes this more difficult.

I don't know what you mean exactly, but I'm not using just
change_date. The ID in *_history table defines the "revision/version
number", so you don't have to use "change_date" to get the exact
revision.

> I can also think of use cases where I want the versioning to track both
> date and time since I would expect multiple changes on the same day.

This one is my fault. What I meant was using datetime for that field,
for said reasons exactly. Good catch!

> Maybe these could also be options?

Such ideas are always welcome. I will try and make it as versatile as possible.

Regards,
Uros

Uros Trebec

Jun 19, 2006, 6:09:28 PM
to django-d...@googlegroups.com
> Sounds nice, this is a feature I'm currently looking for... but I've
> already started my own implementation.

Nice! Do you have anything in code yet? Any bottlenecks?


> I would just share it with you.
>
> I've build a single table History with :
> - "change"; a text field which will contain a python pickled
> dictionary: { field: old_value} in case you update a record.

How does this help/work? Why dictionary? Can you explain?

> - type: type of modification (update, delete, insert).

Is this really necessary? How do you make use of it?

> - "obj": the table object. This can come from ContentType

I don't understand...

> - "obj_id": the id of the impacted object.
> - create_date: a timestamp automatically set.


> I'm using it by sub-classing the save methods in each model I want to
> see the history.
> This is quite flexible, because you can decide which field you want to
> track.

I agree. But I fail to see the need for not versioning the whole record/row.


> To facilitate, yet one step further, it would be nice to have a
> PickledField within Model.models of django.

Can you elaborate on that?


> Feedbacks are welcome.

Same here! :) And thanks for your feedback!

Regards,
Uros

Uros Trebec

Jun 19, 2006, 6:17:27 PM
to django-d...@googlegroups.com
Hi!

> Great to see that your RFC is pretty much exactly what I was thinking
> (feature and implementation-wise) when I posted

> http://groups.google.com/group/django-developers/browse_thread/thread/d90001b1d043253e/77d36caaf8cfb071

I'm glad! Thanks for the link too.

> It would be nice to record "who" made the change (optionally when there
> is a user with an id available).

I was thinking of not pushing the use of such fields, because there is
no easy way to figure out how each application handles
accounts/users. But it's something that should be made possible
with additional/custom fields, IMHO.


> I thought that storing complete row copies on both inserts and updates
> to original object isn't that bad - it certainly simplifies the
> machinery.

This is true.

> Because the way I was considering using this feature would
> read history tables very infrequent their size wasn't a big factor in
> my mind.

I'm sort-of undecided about this. On one hand you can potentially have
a lot more data to handle, but on the other, you don't need multiple
SELECTs and merging when you want a version from way back.

What do others think about this?


> An admin to view change history "diff" colored output and to revert to
> arbitrary previous version would be an obvious future addition.

I agree. And I do have it on my todo list, but it's not "feature
critical", so it will have to wait until the machinery is done. Or
maybe not... hmm...

Regards,
Uros

Uros Trebec

Jun 19, 2006, 6:33:45 PM
to django-d...@googlegroups.com
On 6/19/06, IanSparks <IanJS...@gmail.com> wrote:
> Although you have a date field in your example model it might not hurt
> to add an automatic timestamp to a model that uses versioning in this
> way.

Changing the versioned model because of use of versioning is something
I would like to avoid. Forcing such things might not be a good idea.
But if there is no other way...

> One that relies on a "data" field that could be changed by a user
> doesn't seem safe to me.

I don't know what you mean by this?

> I'd also like an automatic userid stamp on there over and above the
> "author" which again is a data field not a hidden system field.

As I said before, this is not something that can be easily done,
because of the various ways of handling users/accounts. Or am I missing
the point here?

> You might also consider some automatic "revision number" system which
> increments every time the record is changed. This makes it easier to
> "roll back" to the previous entry and can be a lifesaver if something
> happens to whatever is providing the dates.

Every record in the *_history table has its own ID, which I was going to
use as the "revision number". And to make it easier to find the "previous
revision" I was thinking of adding a "prev_rev" column to the table.
What do you think? Would this be enough?

> I think the framework suggested is a great start. I would be interested
> in seeing a feature that tied changes not just to the user who made the
> change but also to the "session" that they made the change in. i.e. if
> my system allows "Dave" to have two active sessions at different
> computers I'd like to track what he did in each session not just what
> date the changes occurred. This is very helpful for user complaints and
> fraud detection.

Hmm, very interesting idea! Do you have any suggestions on how this
would best be implemented? I must admit, I don't have much knowledge
of how Django works internally, so any help that I can get would be much
appreciated!

Thanks!

Regards,
Uros

Vitaliy Fuks

Jun 20, 2006, 12:18:21 AM
to Django developers
> > It would be nice to record "who" made the change (optionally when there
> > is a user with an id available).
>
> I was thinking of not pushing the use of such fields, because there is
> no easy way to figure out how each applications handles
> accounts/users.

and

> > I'd also like an automatic userid stamp on there over and above the
> > "author" which again is a data field not a hidden system field.
>
> As I said before, this is not something that can be easily done,
> because various ways of user/account handling. Or am I missing the
> point here?

I think using django.contrib.auth's user id (when available) would be
satisfactory for most people.

DavidA

Jun 20, 2006, 8:01:58 AM
to Django developers

Uros Trebec wrote:
>
> > To summarize, in the past I've used a time_from/time_thru pair of
> > date/time columns to make it more efficient to retrieve the version of
> > a row as it looked at a particular point in time. Your design of just
> > using change_date makes this more difficult.
>
> I don't know what you mean exactly, but I'm not using just
> change_date. The ID in *_history table defines the "revision/version
> number", so you don't have to use "change_date" to get the exact
> revision.

Let me clarify. What I meant was that your design makes it hard to
directly query the row that was in effect at a certain point in time,
i.e. given a date/time, how do I find the record that was current at
that instant? In your model I would have to use a query like
this to find the active record for 1/1/06:

    select * from FooHist
    where change_date = (select max(change_date) from FooHist
                         where change_date < '2006-01-01')

So you find the most recent change that occurred *before* the date in
question, which requires a subselect. That is a bit ugly, inefficient,
and I think very difficult to map to the Django DB API.

With a time_from/time_thru model such a query looks like this:

    select * from FooHist
    where time_from <= '2006-01-01'
      and (time_thru > '2006-01-01' or time_thru is null)

So here we are looking for the row whose *active interval* contains the
date in question, which is a simple, direct query (no subselect). The
test for null is a special case for the version of the row that is
current (has no end date). I've seen other people use a sentinel value
like '9999-12-31' to make the query a little simpler (but then you get
that magic date all over the place).

I know some people might say this smells of premature optimization, but
in my experience - where I have had to make a lot of applications work
correctly for a past date - you may end up joining many tables with such
an expression and the subselects will kill you. You are simply adding
one more date/time field to allow joining the table via time more
easily. Since this is a *history* table, joining based on time is a
very common use case.
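
A runnable sketch of the interval lookup, using Python's sqlite3 module.
The table name foo_hist, its columns other than time_from/time_thru, the
helper as_of and the sample rows are all made up for illustration.
Timestamps are stored as ISO strings, which compare correctly as text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE foo_hist (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL,
        title TEXT NOT NULL,
        time_from TEXT NOT NULL,
        time_thru TEXT            -- NULL marks the current version
    );
    INSERT INTO foo_hist (parent_id, title, time_from, time_thru) VALUES
        (1, 'draft',  '2005-11-01 00:00:00', '2005-12-15 00:00:00'),
        (1, 'review', '2005-12-15 00:00:00', '2006-02-01 00:00:00'),
        (1, 'final',  '2006-02-01 00:00:00', NULL);
""")

def as_of(db, parent_id, moment):
    """Return the title that was in effect at the given moment, or None."""
    row = db.execute(
        "SELECT title FROM foo_hist "
        "WHERE parent_id = ? AND time_from <= ? "
        "AND (time_thru > ? OR time_thru IS NULL)",
        (parent_id, moment, moment),
    ).fetchone()
    return row[0] if row else None

on_new_year = as_of(db, 1, "2006-01-01 00:00:00")
today = as_of(db, 1, "2006-03-01 00:00:00")
before_creation = as_of(db, 1, "2005-01-01 00:00:00")
```

The price of the simpler read is on the write side: saving a new version
also has to set time_thru on the previously current row.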

william

Jun 25, 2006, 11:04:27 AM
to Django developers
Uros Trebec wrote:
> > Sounds nice, this is a feature I'm currently looking for... but I've
> > already started my own implementation.
>
> Nice! Do you have anyting in code yet? Any bottlenecks?
>

Sorry, not yet. But it will come; I need it for my current development.

>
> > I would just share it with you.
> >
> > I've build a single table History with :
> > - "change"; a text field which will contain a python pickled
> > dictionary: { field: old_value} in case you update a record.
>
> How does this help/work? Why dictionary? Can you explain?
>

I should add first that my history table can contain the history of any
other table. In fact I don't have a history table per "master" table.

This is a kind of "generic" history, thus I don't know the type of data
I will save in it. I solve this by pickling the data and saving it into
a TextField; that's the goal of the "change" field.
For flexibility reasons, I let people select the fields they
want to have in their history table.

Is this explained better?


> > - type: type of modification (update, delete, insert).
>
> Is this really necesary? How do you make use of it?
>

I take inspiration from svn log files. Sure, this is necessary!
History is a kind of trace of a record from its creation up to its
deletion (if that occurs). But I agree that in most cases only "insert"
and "update" are really useful. As for "delete", it would be
useful if you want to have "undo" functionality.

> > - "obj": the table object. This can come from ContentType
>
> I don't understand...
>

As I've explained, my "history table" can record history data of any
table I'm interested in. Thus "obj" and "obj_id" give me the possibility
to build a link between the history and the "master table".


> > - "obj_id": the id of the impacted object.
> > - create_date: a timestamp automatically set.
>
>
> > I'm using it by sub-classing the save methods in each model I want to
> > see the history.
> > This is quite flexible, because you can decide which field you want to
> > track.
>
> I agree. But I fail to see the need for not versioning the whole record/row.
>
>
> > To facilitate, yet one step further, it would be nice to have a
> > PickledField within Model.models of django.
>
> Can you elaborate on that?

Behind the scenes, Django would unpickle and pickle when you access the
data in a PickledField. This would avoid having a model like this:

    from pickle import dumps, loads

    class History(models.Model):
        change = models.TextField()

        def get_change(self):
            # Unpickle the stored text on every read.
            return loads(self.change)

        def set_change(self, data):
            self.change = dumps(data)

Imagine you have 3 fields with pickling functionality in the same
table ...
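
As an illustration of the idea only, here is a plain-Python descriptor
that pickles on assignment and unpickles on access. It is not a Django
field: the class name PickledAttribute is hypothetical, and a real
PickledField would also have to deal with database storage. Keep in mind
that unpickling data from an untrusted source is unsafe.

```python
import pickle

class PickledAttribute:
    """Descriptor that stores a pickled copy and returns the unpickled object."""

    def __set_name__(self, owner, name):
        # The pickled bytes live on the instance under a private name.
        self.storage_name = "_" + name + "_pickled"

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        raw = getattr(instance, self.storage_name, None)
        return None if raw is None else pickle.loads(raw)

    def __set__(self, instance, value):
        setattr(instance, self.storage_name, pickle.dumps(value))

class History:
    # Several pickled attributes need no extra get_/set_ methods.
    change = PickledAttribute()
    extra = PickledAttribute()

entry = History()
entry.change = {"title": "old title"}
entry.extra = [1, 2, 3]
```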

>
>
> > Feedbacks are welcome.
>
> Same here! :) And thanks for your feedback!
>

Thanks for the questions. I hope I'm a bit clearer ;-).
I hope I'll have a bit of time this week to tackle this problem and come
back with real code.

William

enki

Jun 27, 2006, 11:08:18 PM
to Django developers
Hey Uros,

Just two weeks ago, I've been trying to make my own generic history
implementation, but then decided for lack of time, to just make a
one-shot implementation and wait for someone else to write a clean
implementation. Awesome you're already working on it.

To help you find an optimal design, i'd like to show you the code i've
used for my new-code-for-every-history implementation. What i like
about my approach, is that there is no duplication between the master
and the history. Basically all the data is stored in revisions, and the
master only points to the current revision using a ForeignKey. The user
possibly not wanting to track all data in Revisions, can be easily
handled by putting those fields in the master model.

The biggest problem with the current implementation is that you need to
rewrite all the glue code for every class that you want to revision
track like this. Something like inheritance (but without any of the
special magic that is being discussed for model inheritance), would be
really useful to take care of that.

But on to the code:

    from django.contrib.auth.models import User
    from django.db import models

    class Profile(models.Model):
        user = models.ForeignKey(User, edit_inline=models.STACKED,
                                 max_num_in_admin=1, min_num_in_admin=1,
                                 num_in_admin=1)
        # The master row only points at its current revision.
        current = models.ForeignKey("Revision", null=True, core=True)

        def getAll(self):
            # All revisions of this profile, newest first.
            return Revision.objects.filter(instanceOf=self.id).order_by('-version')

        def getLatest(self):
            # Raises IndexError when no revision exists yet.
            return Revision.objects.filter(instanceOf=self.id).order_by('-version')[0]

        def getSpecific(self, version):
            return Revision.objects.get(instanceOf=self.id, version=version)

        def newRevision(self, **kwargs):
            try:
                nextVersion = self.getLatest().version + 1
            except IndexError:
                nextVersion = 1
            revision = Revision(instanceOf=self, version=nextVersion, **kwargs)
            revision.save()
            return revision

        def __str__(self):
            return "Profile: %s" % self.user.username

        class Admin:
            pass

    class Revision(models.Model):
        instanceOf = models.ForeignKey(Profile, null=False)
        version = models.IntegerField()
        description = models.TextField(maxlength=400, core=True)
        telNr = models.CharField(maxlength=50)
        street = models.CharField(maxlength=255)

        def __str__(self):
            return "%s/%d" % (self.instanceOf.user.username, self.version)

        class Admin:
            pass

tell me what you think

regards
paul
