I’ve ported my schema evolution work from my SoC project last summer to
Django v0.96. To use it, download the patch below, and run the following:
$ cd /<path_to_python_dir>/site-packages/django/
$ patch -p1 < ~/<download_dir>/django_schema_evolution-v096patch.txt
It should output the following:
patching file core/management.py
patching file db/backends/mysql/base.py
patching file db/backends/mysql/introspection.py
patching file db/backends/postgresql/base.py
patching file db/backends/postgresql/introspection.py
patching file db/backends/sqlite3/base.py
patching file db/backends/sqlite3/introspection.py
patching file db/models/fields/__init__.py
patching file db/models/options.py
To use it:
$ cd /<path_to_project_dir>/
$ ./manage.py sqlevolve <app_name>
It should output something like this:
BEGIN;
ALTER TABLE `main_query` CHANGE COLUMN `accuracy` `accuracynew`
numeric(10, 6) NULL;
ALTER TABLE `main_query` ADD COLUMN `price` varchar(256) NULL;
COMMIT;
Assuming you have a model such as this:
class Query(models.Model):
    query = models.CharField(maxlength=256, blank=False)
    accuracynew = models.FloatField(max_digits=10, decimal_places=6,
                                    null=True, blank=True, aka='accuracy')
    price = models.CharField(maxlength=256, null=True, blank=True)  # new column
Note the aka attribute, where I changed the name of “accuracy” to “accuracynew”.
Let me know if you find any bugs.
http://kered.org/blog/2007-07-19/django-schema-evolution/
http://kered.org/blog/wp-content/uploads/2007/07/django_schema_evolution-v096patch.txt
-- Derek
Sorry for the double-post, but I've written up some examples /
documentation:
http://kered.org/blog/wp-content/uploads/2007/07/django_schema_evolution_documentation.html
Also, I've ported the changes to SVN. I would like to solicit testers,
for potential inclusion in django-proper.
Thanks,
Derek
Sebastian Macias
On Jul 19, 5:33 pm, Derek Anderson <pub...@kered.org> wrote:
> Hey all,
>
> Sorry for the double-post, but I've written up some examples /
> documentation:
>
> http://kered.org/blog/wp-content/uploads/2007/07/django_schema_evolut...
> >http://kered.org/blog/wp-content/uploads/2007/07/django_schema_evolut...
>
> > -- Derek
Derek, nice job! Does your solution support all of the db backends or
just sqlite/mysql/postgres?
(we use a hacked version of dbmigration (http://www.aswmc.com/dbmigration/)
at work - works like a charm.)
Thanks,
Peter
On Jul 20, 11:22 am, Sebastian Macias <sebast...@sebastianmacias.com>
wrote:
nope, sqlite/mysql/postgres is all she wrote right now, as that was all
that existed last summer when i did the bulk of the work on this. but
i'll add oracle and mssql, if someone can provide me access to a test
db. (well, maybe...i don't have any windows boxes around - can you
access mssql from a linux machine?)
I wouldn't worry about mssql; that backend is pretty much dead at the moment.
http://kered.org/blog/wp-content/uploads/2007/07/django_schema_evolution-svn20070719patch.txt
I think the aka-method has some shortcomings:
* You have to change all your code at once to match the new name
(evolution is a slow process in nature ;-)
* The old name cannot change the type of the field
(for example ForeignKey -> ManyToMany)
* The change-info can only contain a name, nothing more
(goes hand in hand with the last point)
So after thinking about this for some time I came up with this idea:

Old:

class SomeModel(models.Model):
    new_fieldname = models.FooField(..., aka='old_fieldname')

New:

class SomeModel(models.Model):
    new_fieldname = models.FooField(...)
    # Not sure if new_fieldname as a string or the
    # new_fieldname object is better here.
    # But I think the string might be easier.
    old_fieldname = evolution.RenamedField('new_fieldname')
For Models:

class NewModel(models.Model):
    pass

# May use models from other applications.
OldModel = evolution.MovedModel(NewModel)
In the following text I will only use fields as examples, but I think
models will be similar.
Because the old name is still an attribute on the model, it can set and
return the new attribute when it is accessed. So all code will keep working
without changes. Now - because this is nonsense in the long term - it will
write logs/print warnings/... (you name it). This can be extended by using
parameters:

evolution.RenamedField(..., obsolete=True) -> writes a warning
evolution.RenamedField(..., deprecated=True) -> raises an exception

(Perhaps I mixed up the meanings of "deprecated" and "obsolete" here... not
sure about them.)
...you could even give it a date after which the old name gets rejected. I
think that would give users a great advantage over the current solution.
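A rough sketch of how such a RenamedField could behave, written as a plain
Python descriptor (nothing of this exists yet - the obsolete/deprecated
handling is just the idea from above):

import warnings

class RenamedField(object):
    # Sketch only: proxies the old attribute name to the new one.
    def __init__(self, new_name, obsolete=False, deprecated=False):
        self.new_name = new_name
        self.obsolete = obsolete
        self.deprecated = deprecated

    def _check(self):
        if self.deprecated:
            raise AttributeError("use %r instead" % self.new_name)
        if self.obsolete:
            warnings.warn("old field name used, switch to %r" % self.new_name)

    def __get__(self, instance, owner):
        # Reads through the old name return the new attribute.
        if instance is None:
            return self
        self._check()
        return getattr(instance, self.new_name)

    def __set__(self, instance, value):
        # Writes through the old name update the new attribute.
        self._check()
        setattr(instance, self.new_name, value)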
So the new use-case:
Users can change the model and update all the code they know about directly,
but they get warnings when they have missed something (and the code keeps
working). After some time, once they think they have caught all the pieces,
they can mark the old name as deprecated so that access to it is rejected.
If the code still works, the old name can be removed from the code (and
should be removed... I think that goes for "aka" as well); if not, the
exception tells them what is missing.
The next thing you could do would be things like "replace a ForeignKey with
a ManyToMany" (which I think might be a common use-case for db-evolution),
for example:

class SomeModel(models.Model):
    some_others = models.ManyToManyField(...)
    some_other = evolution.ForeignKeyToManyToMany('some_others')
This can first do some nice translation to fill the m2m tables with the old
data. "some_other" might then just return the first object in the m2m
relation, or None if the relation is empty. If this behaviour doesn't fit
the user's idea of the change, deprecated/obsolete could be used to reject
access from the start (or some callback could be given to do it "nicer").
Of course this is only one example of how this idea may help with more
advanced stuff. Another use-case I can think of is replacing a BBCode field
with some XML/Textile/ReST/... field where the content needs to be
converted. Or a DateTimeField that gets replaced by a DateField and a
TimeField, so the data needs to be split up (or the other way round).
Of course this means that the translation must be done inside Python,
which is bad for performance, but at least it's possible (and the
translation is only done once).
What do you think about this? In my opinion it could help big sites make
model changes in a production environment without having to fear that
they will easily screw the whole thing up. In a professional environment
this might also provide a nice way to deprecate things.
Greetings, David Danier
> * You have to change all your code at once to match the new name
> (evolution is a slow process in nature ;-)
excellent point. for large apps, optionally/temporarily maintaining api
compatibility through a special field type (that maps to the new field)
is a great idea. but i see that as augmenting the "aka" attribute, not
replacing it. you will (hopefully) eventually remove the old field from
the class interface, likely long before you lose interest in being able
to migrate older schemas. plus, i think the "aka" syntax is cleaner in
the probably much more common case where you don't care about keeping
the old api intact.
> * The old name cannot change the type of the field
> * The change-info can only contain a name, nothing more
> (goes hand in hand with the last point)
yes, these two are the same - how to keep the previous field metadata
when updating, without mucking up your model code with lots of
"old_maxlength=256" style attributes or duplicating the entire field
definition every change. but beautifully, you don't have to keep any of
it (sans the name changes), because the rest can be pulled from the
existing schema. don't duplicate information, i say. granted, my code
doesn't support all type changes yet, but this isn't a limitation of the
method. think 9-digit numeric SSN field changing to a 12-char (to
support the new 10-digit SSNs plus dashes - yes, it's coming folks. run
in fear) or even splitting/combining certain field types - the two new
fields both aka the original. all possible.
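for example, a split might look something like this syntax-wise (hypothetical
model - i'm not claiming the current code handles this case yet):

class Person(models.Model):
    # both new fields point back at the single old column via aka
    first_name = models.CharField(maxlength=60, aka='name')
    last_name = models.CharField(maxlength=60, aka='name')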
now as far as changing the inter-model relationship fields...let me
break that into another email later.
derek
i have finished writing unit tests covering the mysql, postgresql and
sqlite3 backends, as well as notably expanding what
introspections/migrations the sqlite3 backend can handle. also, full
documentation (including installation and usage instructions) is
available both in the branch and here:
http://code.djangoproject.com/wiki/SchemaEvolutionDocumentation
additionally, the latest changes from the trunk (r5787) have been merged
into the schema-evolution branch, so you won't lose any recent updates
by giving it a shot.
obviously, this is an area with a lot of corner cases. if you can,
please give it a shot ("./manage sqlevolve app_name" after you've made a
model change) and send me whatever bugs you find [pub...@kered.org].
please include if you can:
* which database you're using
* your models.py file
* your previous models.py and/or your existing database schema
thanks!
derek anderson
Hi Derek,
As I noted in your last thread on schema-evolution, I'm _really_ not
happy with this design, and I am a _strong_ -1 on merging this branch.
- The 'aka' approach has some serious flaws. It is ambiguous for all
but trivial use cases. It also doesn't capture the idea that database
changes occur in bulk, in sequence. For example, on Monday I add two
fields, remove one field, and rename a table. That creates v2 of the
database. On Tuesday, I bring back the deleted field, and remove one
of the added fields, creating v3 of the database. At each stage of the
migration, the DB is in a stable state; this approach doesn't track which
state a given database is in, and doesn't apply changes in blocks
appropriate to versioned changes.
- The unit tests are not complete by a long way. The fact that
something is difficult to test doesn't mean it shouldn't be tested.
Your tests evaluate whether the right SQL is generated for any given
task, but they don't check whether the correct tasks are identified.
- The implementation doesn't address use cases 2 and 3 from the
original design document. It's ok if implementing these use cases is
to be deferred as later work, but there needs to be a plan for how
these features will integrate into the overall schema-evolution
structure. As it stands, I don't see how that can happen.
- On a housekeeping issue, I'm really not happy with all the migration
code sitting in management.py. management.py is already getting pretty
bloated; if we're adding a huge feature like evolution, I'd rather see
it abstracted into a module of its own, with only the invocation hooks
in management.py.
In short, IMHO this is not yet ready for general user testing, and I
would be strongly against this becoming the final design of schema
evolution.
That said, there is some good code in the patch - in particular, it
looks like you've done some good work on the database introspection
code. I just think we need to have a bit more of a think about how the
good code gets used.
I've got a few ideas on what could be done; I'll try to put my
thoughts together coherently and write up these ideas for comment.
Yours,
Russ Magee %-)
for the scenario you give the aka approach (or any formatting of
model+existing_schema+rename_metadata) works just fine. i detail how
(with working examples and sql output) here:
http://kered.org/blog/2007-08-03/schema-evolution-confusion-example-case/
i'm not saying we don't need additional metadata, but i haven't seen or
thought up a use case for it yet.
> Your tests evaluate whether the right SQL is generated for any given
> task, but they don't check whether the correct tasks are identified. [...] The fact that
> something is difficult to test doesn't mean it shouldn't be tested.
i don't see how this criticism makes sense. both *are* tested. if the
evolution code doesn't identify the correct task, the wrong migration
sql is assured, therefore the test fails. give it a scenario, and see
if introspection + sql generation works. if you have a different
suggestion for testing methodology i'm all ears, but this comment seems
to be based on only a superficial look at the unit test code.
btw, i'm not suggesting that the unit tests are comprehensive. we need
a LOT more cases. (and i plan to add them one by one as either anyone
thinks of them, or bug reports come in) but i certainly hope i haven't
suggested i was *unwilling* to test something.
> - On a housekeeping issue, I'm really not happy with all the migration
> code sitting in management.py. management.py is already getting pretty
> bloated; if we're adding a huge feature like evolution, I'd rather see
> it abstracted into a module of its own, with only the invocation hooks
> in management.py.
sure thing...done.
> In short, IMHO this is not yet ready for general user testing
i hope you don't mind if i solicit it anyway. ;) but seriously, i
don't get this. i think it's of great value to seek testing early and
often. if you don't want to, that's fine, but don't discourage others.
> That said, there is some good code in the patch - in particular, it
> looks like you've done some good work on the database introspection
> code. I just think we need to have a bit more of a think about how the
> good code gets used.
>
> I've got a few ideas on what could be done; I'll try to put my
> thoughts together coherently and write up these ideas for comment.
thanks. the introspection is where prob. 1/2 of the work is. (the
other 1/2 being the sql generation for discrete db modification types,
esp. in sqlite) the user-facing-interface is a minor part... but
again, i am anxious to see recommendations on how it could be better.
thanks,
derek
But I don't think he's necessarily incorrect. Basically, the trouble
is that your solution doesn't really solve the bigger problems. Your
philosophy seems to be, "Those problems are spiky edge cases", but I
think the hope was they would be solved as well, or at least put on
track for a solution later on.
Surely you are correct in the assertion that v1 => v2 => v3 is
structurally equivalent to v1 => v3. But the data can be different,
by your own concession, especially if v1 => v2 has some data
processing involved. For instance, if a user wanted to take a field
or group of fields from a Model and place them into a separate,
related Model. Correct me if I'm wrong, but I don't even think your
solution supports any sort of data processing like this. You would
have to manually save the data, migrate, and then import the altered
data into the new schema, which all but precludes a distributed
solution.
If we go back to Jacob's Schema Evolution page
(http://code.djangoproject.com/wiki/SchemaEvolution) you can see that
he identifies 3 different tactics, but really there are just two with
the third being the combination of them, introspection, and applied
migration code. Your solution seems to be square in the introspection
camp. But the "applied migration code" side of it is severely
lacking, and that's exactly what everyone wants to see.
As for the "aka" bit, I also am weary of putting even more meta-data
into the model. But the practicality of it is admirable, and I'm
impressed by the discovery that it's the only thing you need for
proper introspection.
And finally I will say that if you just had a plan of action for the
migration code bit, then there probably wouldn't be much resistance.
I think Django severely needs this introspection part as soon as
possible (I've been saying that since its release). It would be
especially helpful for development, where introspection alone works
fine, and where it would be used the most.
ps. Did you ever see my long-ago proposal?
(http://code.djangoproject.com/wiki/SchemaEvolutionProposal)
pps. Russell, you can't be '_strong_ -1', what's next, '-2'?
Hi Brantley, you have written a very nice proposal, I like it!
Your evolutions are a Python equivalent of Rails migrations, but with a
better syntax, aren't they?
The problem is that you lose the possibility of autodetecting database
changes. I want to have autodetection!
Python flexibility is great, but autodetection is much better!
Also I want to be able to perform db downgrades (I hope I'm not asking too much? :) ).
The most challenging problem with evolutions for django is how to
store previous versions' metadata while allowing autodetection. If you
solve this problem, all the others are easy as pie.
So that is why I'm thinking of two files for keeping evolution metadata:
one that is created automatically and is destined to be auto-appended to
before each new evolutionary syncdb, and a second containing pre_ and
post_ conditions, pre_ and post_ actions and any other python code.
The first file may look exactly like the one in your proposal (i.e. it
contains only "evolve" commands and database tables with unique names).
The idea is that you don't add commands manually to the first file, only
links to actions and constraints written in the second file.
How do you like this idea?
Also it seems you forgot about a very important thing - the current
tables' version-number metadata. Syncdb can't deduce what action you
are likely to want to perform on your tables, and stored version numbers
will decrease the risk of executing wrong actions on sensitive data.
I'm +1 for storing this data in a separate database table.
Now the naming proposal:
Creating an upgrade plan - "manage.py evolve plan".
Upgrades and downgrades - "manage.py evolve <target-version>".
Initializing evolutions in a legacy database - "manage.py evolve init
<current-version>".
Initializing evolutions on a clean database - "manage.py evolve init".
p.s.
Derek chose the best possible approach, because (citing the Zen of Python):
Simple is better than complex.
Now is better than never.
While we are debating the overall problem, he has done a first version.
And I'm already happily using his code in my real-world projects.
[Wish it were like that with django tickets - I now have about 5 bugs
fixed locally in both my checked-out latest revision of trunk and the
latest revision of the svn-admin branch, because months fly by and the
tickets are still pending :)]
p.p.s.
One more hard problem to solve in future versions: non-linear evolution graphs.
--
Best regards, Yuri V. Baburov, ICQ# 99934676, Skype: yuri.baburov,
MSN: bu...@live.com
more like "low-hanging-most-useful-fruit first" :) everyone agrees
that there will never be a 100% introspection solution, so why not
release what is currently useful, and then grow the scope as we go? (as
long as it doesn't break anything, of course...and i'm not there yet for
the record)
> For instance, if a user wanted to take a field
> or group of fields from a Model and place them into a separate,
> related Model. Correct me if I'm wrong, but I don't even think your
> solution supports any sort of data processing like this. You would
> have to manually save the data, migrate, and then import the altered
> data into the new schema, which all but precludes a distributed
> solution.
no, it doesn't support moving fields from one model to another, but
that's also on the todo list. (just as key support is still
non-existent), but as long as the proper relation exists between the
models...syntax-wise:
class Article(models.Model):
    name = models.CharField(maxlength=40, default="woot")
    #tag = models.CharField(maxlength=40)

class Tag(models.Model):
    article = models.ForeignKey(Article)
    tag_text = models.CharField(maxlength=40, aka='Article.tag')
the following sql can be auto-generated:
1) add table tag (w/ fk)
2) create one tag record for each article using the old tag data
3) drop the tag column in article
or if Tag already existed, it would simply add the column and update all
Tag records with their respective values from article.
some data migration is already supported btw...look at the workarounds i
had to do for sqlite. proof-of-concept-worthy, at least.
> But the "applied migration code" side of it is severely
> lacking, and that's exactly what everyone wants to see.
really? but that's the easy part of the two. here's the key:
there is no need to store schema versioning information in the
db. first, for reliability, you can't trust any version in the
db, so you have to scan the entire schema regardless. and if
you're doing that, just dump it to a string and run it through a
hash algorithm. this hash becomes the schema fingerprint. then
store your migration scripts mapped from pairs of hashes
(from_hash, to_hash) and your hashes in a list of all known hash
values. then on startup (or manually triggered) a tool scans
the existing schema, uses the fingerprint/hash to figure out
where it is, then applies any scripts found matching its from
value. loop(scan, apply). when you run out of scripts, you're
at current. or, give the tool a fingerprint to stop at.
works wonders. i wrote it a few years ago in java to manage the 200+
table schema the air force uses to track all their students. (they have
their own university system) great thing is that it works both
ways....downgrade scripts are stored right next to upgrade scripts.
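in python, the core of that loop would look roughly like this (a sketch only -
scan_schema and run_script stand in for the real introspection and execution
code, and nothing here is copied from the branch):

import md5

def fingerprint(connection):
    # hash a dump of the current schema into a short id
    return md5.new(scan_schema(connection)).hexdigest()

def evolve(connection, migrations, stop_at=None):
    # migrations maps (from_hash, to_hash) -> a list of sql statements
    while True:
        current = fingerprint(connection)
        if current == stop_at:
            break
        matches = [(pair, script) for pair, script in migrations.items()
                   if pair[0] == current]
        if not matches:
            break  # nothing starts here, so we're at the newest known schema
        (from_hash, to_hash), script = matches[0]
        run_script(connection, script)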
but anyway... i agree they are two different problems; one that feeds
into the other. it just seemed natural to write the introspection
first, and it is useful by itself. (i would argue the more useful)
> ps. Did you ever see my long-ago proposal?
> (http://code.djangoproject.com/wiki/SchemaEvolutionProposal)
yes... i used your process description. :)
derek
Hi Derek;
First off - let me apologize - my comments were not intended to be
harsh or derogatory.
You have done some great work here, and I don't want to give you the
impression that your efforts are not appreciated and valued.
However, your call for testing left me with the impression that you
thought the branch was almost ready for merging. IMHO, we are still a
long way from this being the case. I raised some issues a few days
ago. These concerns still exist, and unless Adrian invokes his BDFL
powers or the other core developers step in, you will need to either
convince me I am wrong or address these issues before I will put my
weight behind a merge.
> Russell Keith-Magee wrote:
> > - The 'aka' approach has some serious flaws.
>
> i'm not saying we don't need additional metadata, but i haven't seen or
> thought up a use case for it yet.
Your approach is predicated on the assumption that migrating
v1->v2->v3 is equivalent to v1->v3. This is true if you only look at
renaming, deleting or adding the fields themselves.
However, it doesn't follow if you consider the semantic significance
of the data itself. Here's one class of migration - migrate the
contents of the field 'distance' to be stored in kilometres rather
than miles. The field is exactly the same, but the contents isn't.
Here's another example:
v1: 'name' contains the last name of the user.
v2: add field 'firstname', containing the first name of a user.
v3: rename field 'name' to 'lastname'.
v4: merge field 'firstname' and 'lastname' into 'name'.
From a purely model perspective, the model hasn't changed. However,
the data _has_ changed, and managing those changes is the hard part of
the migration process. If there weren't any data changes, you would
just drop the tables and resync. We are migrating because we don't
want to lose data - and we don't want to lose the semantic
interpretation of the data, either. Under your approach, v1->v4
requires no change. I contest that you _must_ apply the changes in
sequence to maintain semantic consistency.
One argument against this would be to say "munging data during
migration is a difficult case, and I'm not handling that at the
moment". This is fine, but it must be at least on the radar as an end
goal, and some thought must be given to how the difficult case fits
into the framework that the (implemented) simple case provides. I
don't see how adding 'aka' as metadata on a field or model can be
expanded to encompass sophisticated, sequential transformations of
data at a semantic level.
I would contest that you can't define these changes in metadata. If
each change is atomic, it should be defined in an atomic form, not
inserted and spread across a model definition. The model should be a
clear statement of what the model should look like and behave like
right now. Migrations are there to define how an old database table
gets converted into the model we now have. Mixing migrations into
metadata just seems dirty to me.
> i don't see how this criticism makes sense. both *are* tested. if the
> evolution code doesn't identify the correct task, the wrong migration
> sql is assured, therefore the test fails. give it a scenario, and see
> if introspection + sql generation works. if you have a different
> suggestion for testing methodology i'm all ears, but this comment seems
> to be based on only a superficial look at the unit test code.
Ok; I've had a closer look now, and you are correct. I wasn't making
the distinction between the SQL that was being executed to set up the
test conditions, and the SQL that was being tested. Apologies for
casting your good name into disrepute :-)
My initial reaction to testing this sort of thing was that we would
need to have two model definitions, and dynamically swap them over.
Your approach will be much faster, and requires less model loading
magic. My only concern is that your approach uses the backend that is
under test to produce the code that is used to set up the test
conditions. The pathological case for this problem would be 'delete
column' and 'add column' both producing no SQL statements. If a test
ran the empty 'delete' code to set up the test, then ran the empty
'add' code, the test would pass even though the backend was in fact
broken.
You have worked around this by testing that the output of the backend
is as expected. However, this output is database specific, which means
that you have to duplicate the tests for each backend. Duplication
like this rubs me the wrong way - since you are using doctests, it
should be possible to write a single 'template' docstring that is
filled in with the appropriate backend specific statements.
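Something like this is what I have in mind - a sketch only; none of these
names exist in the branch, and the expected SQL is just the rename example
from earlier in the thread:

TEST_TEMPLATE = '''
>>> # set up the before/after models and run the evolution code here
>>> print sql
%(rename_column)s
'''

EXPECTED = {
    'mysql': {
        'rename_column': "ALTER TABLE `main_query` CHANGE COLUMN "
                         "`accuracy` `accuracynew` numeric(10, 6) NULL;",
    },
    'postgresql': {
        'rename_column': 'ALTER TABLE "main_query" RENAME COLUMN '
                         '"accuracy" TO "accuracynew";',
    },
}

def doctest_for(backend):
    # Fill the shared template with the backend-specific expected SQL.
    return TEST_TEMPLATE % EXPECTED[backend]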
> > In short, IMHO this is not yet ready for general user testing
>
> i hope you don't mind if i solicit it anyway. ;) but seriously, i
> don't get this. i think it's of great value to seek testing early and
> often. if you don't want to, that's fine, but don't discourage others.
True, but my experience has been that you run the risk of test
fatigue. When you have a big set of changes that require serious
evaluation, it takes a lot to get people interested. Serious
evaluation of new features (especially on this sort of scale) takes a
lot of effort, and you don't want to burn up all your goodwill early
in the process - especially if we're still sorting out some major
design issues.
I don't want to discourage anybody that wants to help test from
putting in the effort. I just want to give them fair warning that some
of the design issues still aren't completely sorted out, so the code
in the branch is still subject to significant flux.
Yours,
Russ Magee %-)
Looking at this, we share a few similar ideas. Here goes my attempt at
a description of what I was thinking.
My goals:
* Keep the model file as a canonical representation of what the tables
should look like right now.
* Keep each migration as an atomic unit.
* Provide an entry point for raw SQL migrations
* Provide some validation that the evolutions that are defined are
sufficient before they are applied, or at least some validation that
post-evolution, the database is correct.
* Provide a way to implement the simple cases now, but make the
expansion path for the future obvious.
* Provide a clean interface against which to test.
The approach:
1) We add a contrib.evolution package. This application must be added
to INSTALLED_APPS for your project. In contrib.evolution, we define an
'Evolution' model:
class Evolution(Model):
    model = ForeignKey(ContentType)
    label = CharField(maxlength=20)
    signature = TextField()
In this model, the 'label' is used to identify a particular version of
a database. It could be as simple as 'v1', or you could use any
helpful text description (e.g., 'pre-christmas 2006')
The 'signature' is a pickled description of a model. A utility
function exists in the contrib package to create signatures from model
definitions. For example, the model:
class Author(Model):
    name = CharField(maxlength=20)
    age = IntegerField()
    books = ManyToManyField(Book)
would generate a signature like:
{
    'name': 'Author',
    'fields': [
        ('name', 'CharField', { 'maxlength': 20 }),
        ('age', 'IntegerField', {}),
        ('books', 'ManyToManyField', { 'to': 'myapp.Book' }),
    ],
}
which could be pickled into a text representation. Introspection code
is used to generate signatures from the database representation of a
model. Utility methods also exist to compare two signatures for
equality, and to identify the changes required to make signature1
equivalent to signature2.
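As a sketch of the sort of comparison utility I mean (the name and return
format here are illustrative only):

def diff_signatures(sig1, sig2):
    # Return the field changes needed to turn sig1 into sig2.
    fields1 = dict((name, (ftype, opts)) for name, ftype, opts in sig1['fields'])
    fields2 = dict((name, (ftype, opts)) for name, ftype, opts in sig2['fields'])
    changes = []
    for name in fields2:
        if name not in fields1:
            changes.append(('add', name, fields2[name]))
        elif fields1[name] != fields2[name]:
            changes.append(('alter', name, fields2[name]))
    for name in fields1:
        if name not in fields2:
            changes.append(('delete', name))
    return changes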
2) The contrib.evolution package defines a listener on the post_syncdb
signal to handle evolutions.
3) When syncdb is run and identifies a new model, the listener adds an
initial entry in the evolutions table (label=None), with a signature
for the original model definition.
4) When syncdb identifies an existing model, the listener generates a
signature for the model as defined in models.py, and introspects the
database to generate a signature for what is currently in the
database. If the generated signatures match, no changes are required.
5) If the signatures don't match, the listener raises a warning saying
'evolutions are required'. The end goal of this process is to run
./manage.py syncdb and get no messages.
6) The user can define evolutions. The exact sequence is defined in
the Meta property of the model:

class Author(Model):
    ...
    class Meta:
        evolution = [
            'v1',
            'v2',
            'pre_christmas_2006',
            'valentines_day_2007',
            'v5',
        ]
Then they define their evolutions. An 'evolution' directory is created
in the application, and each evolution is a file in the evolution
directory, corresponding to a name in the sequence.
For example:
pre_christmas_2006.py:
from django.contrib.evolution.mutation import *
from django.db import models

mutations = [
    AddColumn('Author', 'dateofbirth', models.DateField, initial_value=None),
    DeleteColumn('Author', 'age'),
]
Initially, the evolution.mutation package will only include some
simple transforms (Add, Delete, Rename, etc). However, the built in
set can be easily expanded later on (say, adding mutations to do
column merges, or unit transformations on a column). Also, end users
could define their own mutations for their own specific needs. Testing
is also simplified because you can test the behavior of each
individual mutation class.
7) The user then runs ./manage.py syncdb --evolve. The evolution
listener takes the signature generated by introspecting the database,
and does a lookup on the evolution table to determine which label
corresponds to the current state of the database. This becomes the
starting point in the evolution sequence, and each mutation in each
evolution in the sequence can be applied.
8) A mutation can run in two modes - simulation and real. When run as
a simulation, you call the 'simulate' method on the mutation. This
method takes the input initial signature, and returns the signature
that would exist if the mutation were applied. After simulating all
the mutations, you should have a signature that corresponds to the
signature for the currently defined model. If this isn't the case, an
error is raised to say that more (or different) evolutions are
required.
9) If the simulation endpoint corresponds to the current model
signature, the mutations are run for real. Each mutation knows how to
generate the SQL to affect the mutation (using backends as required);
the SQL is collated and executed.
10) Finally, an updated entry is put into the evolutions table. This
entry contains the new signature and the last label in the evolution
sequence.
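To make the mutation interface a little more concrete, a single mutation
might look something like this (a sketch only; the method names are not
fixed, and table-name/column-type resolution is glossed over):

class AddColumn(object):
    def __init__(self, model, column, field_type, initial_value=None):
        self.model = model
        self.column = column
        self.field_type = field_type
        self.initial_value = initial_value

    def simulate(self, signature):
        # Return the signature that would exist after this mutation.
        new_sig = dict(signature)
        new_sig['fields'] = signature['fields'] + [
            (self.column, self.field_type.__name__, {}),
        ]
        return new_sig

    def sql(self, table_name, column_type):
        # Return the SQL that applies the mutation for real; mapping the
        # model to its table and the field class to a column type is the
        # backend's job and is left out of this sketch.
        return ['ALTER TABLE %s ADD COLUMN %s %s;'
                % (table_name, self.column, column_type)]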
Now, some variations to cover other use cases:
The simple case of an evolution is a python file describing some
changes. However, to cover the "Carol" use case, an evolution can also
be a .sql file in the evolution directory. When you come to apply
migration 'pre_christmas_2007', the migration listener looks for
pre_christmas_2007.sql; if it doesn't exist, it looks for
pre_christmas_2007.py. If the sql version exists, the SQL in the file
is executed in preference to the python instructions.
The reason for the sql over py priority is that you also allow
./manage.py syncdb --sqlhint, for the "Ben" use case. This is the same
as normal evolution, except that the SQL isn't run on the database -
it is output to a file. So, if pre_christmas_2007.py exists,
./manage.py syncdb --sqlhint will generate pre_christmas_2007.sql -
the SQL corresponding to the instructions implied by the python
definition. Ben can then tweak the SQL, and because .sql takes
priority over .py, the tweaked SQL will take priority when he runs
./manage.py syncdb --evolve.
We can also use hinting to generate the initial attempt at a python
evolution file (This is essentially the case that Derek's code
covers). Using the model signatures and introspection, syncdb --hint
attempts to guess the transformations that are required, and produces
a first pass .py migration. If Alice is trusting, she just runs
./manage.py syncdb --hint --evolve; this will generate the hint and
evolve the database, all in one step.
Some caveats:
At stage 8, if there isn't a match in the evolutions table for the
current signature, someone must have been tweaking the database
outside of the evolution framework. I'm happy to put this outside the
scope for schema evolution, and raise an error to the effect of
'you'll need to get the db back to a known state by yourself'.
When using SQL transformations: obviously, you lose the ability to
simulate changes. If there is an SQL step in the evolution sequence,
simulation is abandoned (with a warning to that effect). However, at
the end, if the introspected signature doesn't correspond to the
actual model signature, you can generate a warning saying 'more
migrations are required'.
So - there you have it. Now, I'm not set in stone on this plan - I'm
happy to entertain any plan that meets my originally stated goals.
> pps. Russell, you can't be '_strong_ -' , what's next '-2'?
Of course. And then, -3. :-)
Yours,
Russ Magee %-)
This part of the plan looks quite similar to the dbmigration project at
http://www.aswmc.com/dbmigration/ - hope that it's useful in some way :)
If not, ah well, it works for me and a few others until the master plan
is complete ;)
Cheers,
MikeH
> This part of the plan looks quite similar to the dbmigration project at http://www.aswmc.com/dbmigration/ - hope that it's useful in some way :)
so i think i have a cross-over solution. :)
what it boils down to is there are four basic categories of developers:
1. users who trust introspection and never want to touch/see SQL
2. users who mostly trust introspection but want the option of
auto-applied upgrades for specific situations (Wash)
3. users who use introspection-generated SQL, but don't trust it (they
want it generated at development and stored for use in production)
4. users who hate introspection and just want auto-application of their
own scripts
who wish to perform different combinations of the two basic subtasks of
schema evolution:
1. generation of SQL via magical introspection
2. storage and auto-application of upgrade SQL
the first i've detailed before, so i won't cover here. it works the
same way.
the second works through auto-identification of known schema types, and
the application of pre-rendered scripts. now this is very similar to what
other code out there does (for instance mikeh's quite capable
dbmigration project [http://www.aswmc.com/dbmigration/]), but i feel in
a simultaneously simpler and more flexible way. specifically, the
following i consider key goals:
* there is no need for schema identification metadata to be stored in
the database itself
* you should be able to swap between introspection and custom scripts
freely within the same project, using as much or as little of each as
you want (with little or no transition costs)
* full migration graphs should be supported, not just linear steppings
* it should be butt easy to use (because if it's not simpler than
keeping your own text file, no one is going to use it)
note: all this is covered in greater depth here:
http://code.djangoproject.com/wiki/SchemaEvolutionDocumentation
for the developer's part, all they have to do is define a new module
named 'schema_evolution' in their app directory, next to 'models'. this
contains two data structures:
* a list of known schema fingerprints
* a dict mapping (from_fprint,to_fprint) pairs to scripts (lists of
sql statements and/or python function calls, to be run in order...think
pre_script/sql/post_script)
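to give a rough idea of the shape of that module (the variable names and
fingerprint values here are just placeholders - see the documentation linked
above for the real format):

# schema_evolution.py -- rough sketch only
fingerprints = [
    'fp_v1',   # dummy values; real ones come from ./manage.py sqlfingerprint
    'fp_v2',
]

evolutions = {
    ('fp_v1', 'fp_v2'): [
        "ALTER TABLE `main_query` ADD COLUMN `price` varchar(256) NULL;",
    ],
}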
the fingerprints act as both version numbers and ids, and are
automatically generated. (either by you, running './manage
sqlfingerprint my_app' or by sqlevolve or syncdb when they're looking
for what scripts are available and comparing them to the existing schema)
when schema evolution is applied, specified scripts ALWAYS take priority
over introspection generated sql...so if you never want to see an aka
field or generated sql, you never have to.
different users i imagine will use it in different ways, but i imagine
that most people will use introspection the majority of the time,
ignoring fingerprints until they hit a bump that introspection can't
handle. they'll simply code that one transition and continue on their
way storing only that one tricky transition.
a different subset of developers i imagine will always use the stored
scripts during deployment, but will use introspection to generate the
majority of them. but they will store all transitions, having a full
version history of every state of their db stored in svn.
anyway, an example of the evolution file is available in the
documentation. let me know what you think. :)
oh yeah, it's mysql only at this point....i want to get some feedback on
the design before i write the others.
thanks!
derek
p.s. mike i like your stuff but i feel that the whole-hog approach
requiring developers to either start their db with your scripts or face
a complicated setup is overly burdensome. plus users are lazy....i
don't want their hand-written scripts to be the only check for whether
or not it's safe to apply a particular script. but maybe there are
parts where we can merge our projects together? pm me.
Again - IMHO, the 'aka' syntax is a non-starter (for all the reasons I
have stated previously). Unless you get a BDFL or majority core-dev
pronouncement to the contrary, or you can provide a compelling reason
why 'aka' is desirable, it's probably safe to assume that
schema-evolution with 'aka'-style syntax as currently proposed won't
be merged to trunk.
> * full migration graphs should be supported, not just linear steppings
...
> * a list of known schema fingerprints
> * a dict mapping (from_fprint,to_fprint) pairs to scripts (lists of
> sql statements and/or python function calls, to be run in order...think
> pre_script/sql/post_script)
This is an interesting idea. Some comments/questions:
1) Isn't the list of known schema fingerprints just the list of keys
in the migration graph/dictionary?
2) From your code, a schema fingerprint is a hash of the output of
SHOW TABLES; this is an interesting approach, but there will be a major
problem going cross-platform. If I remove a field from a model, the
rough description is platform independent, but the SQL required isn't.
As a result, I would need to manage multiple sets of migrations - one
for each database I have to support. Wouldn't the model itself be a
better source for the fingerprint (or, at least, an abstracted
representation that has been derived from the database)?
3) If you are using fingerprints to identify nodes in the graph, how
do you handle cyclic states (e.g., v1, I add a field, v2 delete it, v3
add it again)? As I read your proposal, the fingerprint for v1 and v3
should be the same - which will cause some major problems determining
resolution order for applying migrations.
4) If you are using model fingerprints to identify changes, how do you
propose to keep parallel changes synchronized? For example, Model 1
goes from v3->v4, but that change relies upon Model 2 going from
v6->v7. If the pair of changes share a common label, the label can be
used as a point of synchronization; however, the fingerprint (by
itself) doesn't provide such a reference point.
5) What is the role of the post-state here? Is it your intention to
use the post-state for validation that a migration has been
successful? If so, how do I populate the post-state value before I
have applied the migration for the first time?
Yours,
Russ Magee %-)
I have some *long* feedback I'm trying to finish on this work, but in
the meantime:
On 8/7/07, Russell Keith-Magee <freakb...@gmail.com> wrote:
> Again - IMHO, the 'aka' syntax is a non-starter (for all the reasons I
> have stated previously). Unless you get a BDFL or majority core-dev
> pronouncement to the contrary, or you can provide a compelling reason
> why 'aka' is desirable, it's probably safe to assume that
> schema-evolution with 'aka'-style syntax as currently proposed won't
> be merged to trunk.
Russ is doing a great job speaking for me here, too.
We're currently trying to move non-required metadata *out* of the
model (c.f. the moving of the admin declarations in newforms-admin),
and that process is the right one. Django's ORM is currently too
coupled to supposedly-not-required pieces, and adding core code to
deal with extra field attributes ("aka" in this case) is a
non-starter.
For schema evolution to get trunkified it's going to need to be almost
completely self-contained.
Jacob
Derek, I really appreciate your hard work on this. You're getting a
large amount of pushback, but that's because this is a critical
feature and we want to get it right.
So please don't confuse criticism with a lack of gratitude!
Jacob
Situations like this are the reason I have generated linear steppings
which are sequenced and run in order, in bulk.
I may have misunderstood how this is proposed to work, so please correct
me if I have.
Deleting the column and adding it again has the effect of resetting the
data to the default value for the column.
Say I have two clients, X and Y.
I roll out revision 1 to both clients.
I roll out revision 2 to client X. This deletes a column.
I roll out revision 3 to client X. This adds the column again.
I roll out revision 3 to client Y. This does nothing to the column as
the fingerprints are the same.
Now I have the same application, at the same revision, rolled out to two
clients, but they have not treated the data in the same way. This is not
acceptable - making sure you have the correct schema structure at the
end of a migration is vital, but data integrity must not be compromised
in any way.
If I've misunderstood, please tell me how this situation would be dealt
with :)
Cheers,
Mike
as i said before i'm not wed to it. (actually i feel it's a rather
minor point) propose a different location. i just want it somewhere
simple and obvious. (or if you two just hate the keyword, what would
you like to call it?)
the only thing i'm against is completely duplicating the entire model
structure somewhere else, just to store one tiny extra bit of metadata.
that seems to be an unreasonable extra burden for the developer. i'm
a big proponent of DRY, as i thought i've read others here are too.
(one of the reasons i like django so much better than all the java web
and ORM frameworks)
i'm a yankee...don't worry so much about hurting my feelings! :-P
(but thanks for the thought)
derek
no. it adds ordering. (the keys are a set remember) we wouldn't need
it if python had something similar to java's LinkedHashMap, but alas...
(besides, it's easier to read, don't you think?)
> 2) From your code, a schema fingerprint is a hash of the output of
> SHOW TABLES; this is an interesting approach, but there will be a major
> problem going cross-platform. If I remove a field from a model, the
> rough description is platform independent, but the SQL required isn't.
> As a result, I would need to manage multiple sets of migrations - one
> for each database I have to support. Wouldn't the model itself be a
> better source for the fingerprint (or, at least, an abstracted
> representation that has been derived from the database)?
well, show tables + show index. and yes, it will be different for
different backends. but as you said, the SQL is different too. if
you're writing migration scripts by hand (or even generating them via
introspection), you have to store one copy for each backend you support
anyway... so you want the fingerprints to be different.
> 3) If you are using fingerprints to identify nodes in the graph, how
> do you handle cyclic states (e.g., v1, I add a field, v2 delete it, v3
> add it again)? As I read your proposal, the fingerprint for v1 and v3
> should be the same - which will cause some major problems determining
> resolution order for applying migrations.
no problem. migrations to later versions take priority, which follows
the django "do the right thing" philosophy. on the rare occasion you want
to upgrade to v2, comment out the to_v3 migration temporarily. (or more
likely, the only time you would want to upgrade to v2 would be when you
checked out an older version, which wouldn't have the v3 script anyway)
alternatively, a "migrate-to" argument to the sqlevolve action can be
set. this is how i intend to support downgrades anyway.
> 4) If you are using model fingerprints to identify changes, how do you
> propose to keep parallel changes synchronized? For example, Model 1
> goes from v3->v4, but that change relies upon Model 2 going from
> v6->v7. If the pair of changes share a common label, the label can be
> used as a point of synchronization; however, the fingerprint (by
> itself) doesn't provide such a reference point.
the fingerprint is for the whole app schema, not individual models.
this scenario couldn't exist.
> 5) What is the role of the post-state here? Is it your intention to
> use the post-state for validation that a migration has been
> successful?
yes, post-validation i believe is important. plus, it allows us to pick
the right script when you have a cyclical schema change. plus, it lets
you store reversal scripts, so you can have automatic downgrades as well
as upgrades. plus, it lets you store arbitrary graphs of upgrades, so
you can write a v1->v3 script along with v1->v2 and v2->v3. (important
for the larger databases)
> If so, how do I populate the post-state value before I
> have applied the migration for the first time?
you don't have to...just stick a dummy value in there (or leave it
blank) and run your script. (either manually or w/ the tool) then call
fingerprint, and copy-paste the new value over your dummy value.
derek
your schema integrity is assured. the data integrity is also assured
(meaning no invalid data). but whether every value in every field is
exactly the same between two clients? that's a whole other ballgame.
but really, when have two clients EVER had exactly the same data?
Here are my questions.
1) How does one write an operation to reset all field values to their defaults? Is
it a migration at all?
2) How about python-written migrations?
That doesn't mean that other people haven't had a client want to drop a
feature - including all its data, and start again at a later point. I
have seen situations where a client has said "The data the system is
calculating for feature X is wrong. Please remove the feature X. We will
start again on it when we figure out exactly how it should be
calculated, assuming there is time in the project to revisit it."
> your schema integrity is assured.
In this case, correct.
> the data integrity is also assured
> (meaning no invalid data).
Sorry, no. In my simple example, the data is not in the state that the
developer expected. He asked for the column to be deleted, then
recreated. Bear in mind there may be many months between the two
requests. During this time there may or may not have been another
release to the production environment. So, when the dev adds the column
again, he expects the data now to be in the default state. This might
not be the case when it goes to production. The data that was old might
have been calculated incorrectly, or otherwise wrong - the cause of the
deletion request. Therefore you have invalid data - not in a database
sense (there will be no sql errors) but from the point of view of the
client and the application.
Now let's say that I have a varchar column.
I drop the column.
I recreate it as a tinyint with a default value (for whatever reason,
perhaps I'm taking out char constants and replacing them with int
constants, the labels of which are editable in another part of the app).
I am making the decision to do this, aware that I am dropping the old
data and starting new with a default value of a different type.
Depending on how often it gets released, it might work. If
schema-evolution decides that I didn't really want to drop the column,
and it's just going to alter it for me, it will fail. The rollout will
fail with type errors because it will be trying to convert varchar data
to tinyint. And it will be trying to do that because it did not do what
the developer thought it would do - drop the column and add it again.
You'd get something like the following example (postgres 8.2):
notnowlewis=# create table test_table(
notnowlewis(# my_column varchar(12)
notnowlewis(# );
CREATE TABLE
notnowlewis=# insert into test_table values('wooyay');
INSERT 0 1
notnowlewis=# alter table test_table alter column my_column type int
using my_column::int;
ERROR: invalid input syntax for integer: "wooyay"
The more I think about it, the more I come to the conclusion that any
migration that modifies data MUST be carried out in the sequence that
the developer specifies. It's the only way to make sure that the data
validates both from a database point of view and from an application
point of view. And deletion is as much a modification as merging or
splitting columns.
As I think I've said before (possibly not on this list) my ideal way of
solving this would be to run :
./manage.py generate-evolutions app_name
which would generate [sql / python code representing the changes] and
put it into a migration file. I could then review this to put in any
data manipulation I needed, if I needed to. For 95% of changes, I
probably wouldn't need to. Then I'd apply it by running
./manage.py syncdb
That way everything gets run in sequence and there is no possibility for
it surprising the developers. Yes, you might get a few more migrations
run and you'd have to run all non-applied migrations in sequence, but
you have the assurance that no matter where you are rolling out to, you
know exactly what is going to get run.
Don't get me wrong, I'm very excited about the prospect of
schema-evolution getting merged :) But only if I can be sure that some
poor release manager somewhere isn't going to have a failed deployment,
and end up scratching his head thinking "I'm sure we dropped that
column, there should be no data there!" ;)
Cheers,
Mike
this was brought up when i first designed this for the af. my answer
was: it's pretty obvious when you're returning to an identical copy of
a previous schema. if the developer _really_ needs to force the user to
go through the drop/add, all he needs to do is make any trivial change
to the schema elsewhere. for instance, adding an empty table titled
"PLEASE_DELETE_ME_AFTER_V2". ugly, yes, but it calmed the "what if"
concerns. (and we never had to)
but wearing my developer hat, i would have seriously reprimanded one of
my guys who had written a feature that would fail on "i wasn't expecting
data here". if the schema can't verify it (such as is the case when
you're changing how a number is calculated), the app needs to.
> As I think I've said before (possibly not on this list) my ideal way of
> solving this would be to run :
>
> ./manage.py generate-evolutions app_name
>
> which would generate [sql / python code representing the changes] and
> put it into a migration file. I could then review this to put in any
> data manipulation I needed, if I needed to. For 95% of changes, I
> probably wouldn't need to. Then I'd apply it by running
>
> ./manage.py syncdb
>
> That way everything gets run in sequence and there is no possibility for
> it surprising the developers. Yes, you might get a few more migrations
yep. that's exactly the scenario i described for the "kaylee" developer
here: http://code.djangoproject.com/wiki/SchemaEvolutionDocumentation
and probably the ideal for large/serious development work. but i don't
think we need to force everyone to do it this way.
> Don't get me wrong, I'm very excited about the prospect of
> schema-evolution getting merged :)
danke :)
just stick an "update mytable set col=value;" in your upgrade script.
(but if you're doing this outside of any schema changes, yes i would
argue that this is not a migration issue)
> 2) How about python-written migrations?
in the list of sql commands, pass in a function reference instead. (not
implemented yet, but that's how i plan on it working)
i originally thought of just adding "pre/post" function support, but
then thought "why limit it to that?" what if you wanted
sql/python/sql/python?
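so an entry could eventually look something like this (again, the
python-callable support isn't implemented yet, and the names/fingerprints
here are just placeholders):

def fill_in_missing_prices():
    # arbitrary python run between the sql statements
    from django.db import connection
    cursor = connection.cursor()
    cursor.execute("UPDATE main_query SET price = '0' WHERE price IS NULL;")

evolutions = {
    ('fp_v2', 'fp_v3'): [
        fill_in_missing_prices,   # fill the NULLs first...
        "ALTER TABLE `main_query` MODIFY COLUMN `price` varchar(256) NOT NULL;",
    ],
}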
1) Would the app_name/schema_evolution.py file be automatically generated/updated?
I don't want to be a monkey copying the output of sqlevolve and
sqlfingerprint into that file. I'd rather remove all the parts of the file
that I won't use.
2) There are a lot of cases of information loss with your proposed solution.
Automatic v1 => v3 wouldn't work exactly as v1 => v2 => v3 if you add
column in v2 and change default value of this column from null to same
value in v3. Or maybe someone doesn't want to keep old information about
very old field-name changes in aka fields. Or a model was deleted and
another created later with the same name (a name clash, and no
references to the old table are left). Or a temporary model was created to
pull information out of an m2m field, copied to another table with
schema-evolution, and then removed from models.py.
My position (and not only mine): Updates shall be consistent!
Semantically different actions for the same start and finish points
should not be possible at all! It can't be The Right Thing if it
breaks the update semantics: updates shouldn't influence each other
when applied together, the changes should always be semantically the
same (exactly the same actions are not required - optimizations are
possible!).
In my first example, the data about the v2 default value is lost unless it's
stored in sql scripts. So that's why I want automatic collection of
upgrade and downgrade scripts in a special folder (if the user has allowed
it/not denied it). But I don't want them to be SQL, because they would
be db specific and you lost information again!
So, I want to keep all history of my schema changes in
schema-evolution store, but not in models or backend-specific sql
scripts.
A good metaphor is a Version Control System (VCS). The two most common
operations a VCS allows are updating from any previous version to the
latest one (with your current version of models.py, which depends on
the latest db version) and committing a new version. Also, sometimes,
looking at the changes before an update or commit. All other VCS features
are rarely needed.
no. i think that those who don't trust introspection (who are most
likely to be the heavy users of schema_evolution.py) aren't going to
trust the introspection code to auto-place itself in their deployment
scripts. (they might not notice an introspected change, and it could get
accidentally leaked to development wrecking the lives of millions) a
copy-paste requirement is your protection from my introspected evil. :)
> 2) There are a lot of cases of information loss with your proposed solution.
> Automatic v1 => v3 wouldn't work exactly as v1 => v2 => v3 if you add
> column in v2 and change default value of this column from null to same
> value in v3.
> My position (and not only mine): Updates shall be consistent!
then don't use the full graph support. only write/approve v1=>v2,
v2=>v3, v3=>v4 scripts and you always have a guaranteed always-equal
migration for your apps. it's only there if you want it.
> But I don't want them to be SQL, because they would
> be db-specific and you lose information again!
> ... keep the whole history of my schema changes ... not in the models or
> backend-specific sql scripts.
"db-specific == losing information" i don't understand...but i've no
interest in writing a new database access language. we've already got
a ton (including the ORM already built into django), and you can use
them in the function call handling i've already proposed.
> > 2) There are a lot of cases of information loss with your proposed solution.
> > Automatic v1 => v3 wouldn't work exactly as v1 => v2 => v3 if you add
> > column in v2 and change default value of this column from null to same
> > value in v3.
> > My position (and not only mine): Updates shall be consistent!
>
> then don't use the full graph support. only write/approve v1=>v2,
> v2=>v3, v3=>v4 scripts and you always have a guaranteed always-equal
> migration for your apps. it's only there if you want it.
Why should I write these scripts by hand? Or copy them into the
schema_evolution.py file by hand?
I'm offering to create such scripts automatically.
> > But I don't want them to be SQL, because they would
> > be db-specific and you lose information again!
> > ... keep the whole history of my schema changes ... not in the models or
> > backend-specific sql scripts.
>
> "db-specific == losing information" i don't understand...but i've no
> interest in writing a new database access language. we've already got
> a ton (including the ORM already built into django), and you can use
> them in the function call handling i've already proposed.
Ok, let me explain. You create an application and distribute it. Then you
create a new version and distribute it again. You don't want to write
evolution scripts for each backend (and each database version of that
backend) yourself. Instead, you want to write one script (at the moment
you do your own upgrade) that describes the actions to be taken on other
systems. Do you agree with this use case?
If you don't want to create such a language, then let me do it.
The language is only about 10 commands, corresponding to the commands
executed on the database:
Rename the Comment field "maker" to "author".
Change the Comment field "body" to a TextField().
Add the Comment field "email" as a CharField(maxlength=128, blank=True).
Drop the Comment field "address".
(command examples from
http://code.djangoproject.com/wiki/SchemaEvolutionProposal)
You should already have implementations of these commands.
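Purely as a sketch (none of these classes exist; the names just mirror the
four commands above), such a backend-independent description might read:

# illustrative only: tiny stand-ins for the proposed commands. each one just
# records the intended change; each backend would render its own SQL later.
class RenameField(object):
    def __init__(self, model, old_name, new_name):
        self.model, self.old_name, self.new_name = model, old_name, new_name

class ChangeField(object):
    def __init__(self, model, name, new_type):
        self.model, self.name, self.new_type = model, name, new_type

class AddField(object):
    def __init__(self, model, name, field_type):
        self.model, self.name, self.field_type = model, name, field_type

class DropField(object):
    def __init__(self, model, name):
        self.model, self.name = model, name

changes = [
    RenameField('Comment', 'maker', 'author'),
    ChangeField('Comment', 'body', 'TextField()'),
    AddField('Comment', 'email', 'CharField(maxlength=128, blank=True)'),
    DropField('Comment', 'address'),
]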
For one of my projects I'm using a system that collects and manages
upgrade/downgrade scripts. It's dbvcs:
https://launchpad.net/dbvcs/ ; another similar system is
http://www.aswmc.com/dbmigration/
The most important thing missing from those systems is the automatic
generation of SQL that your schema evolution provides. It would
be great if we merged the systems into one.
i have had this exact situation happen to me before. (rolling out a
change, then another change reversing it) actually several times. i
have NEVER had a scenario where client Y says "can you please make sure
i lose my data?"
your schema integrity is assured. the data integrity is also assured
(meaning no invalid data). but whether every value in every field is
exactly the same between two clients? that's a whole other ballgame.
but really, when have two clients EVER had exactly the same data?
It's not the name that is the problem. There are two problems with the syntax:
1) it links the description of changes to the model definition, and
2) it is inadequate for describing the relationships between fields
during sequences of atomic changes.
> the only thing i'm against is completely duplicating the entire model
> structure somewhere else, just to store one tiny extra bit of metadata.
There is no need to duplicate the entire model structure. You just
need to describe the changes - in an atomic fashion that can be
applied sequentially. i.e.,
v1->v2: rename field 1, delete field 2
v2->v3: rename field 3, add field 4
and so on. I'm open to discussions on syntax and structure of these
descriptions, and the manner in which this manual description can be
supplemented or supplanted by introspection and/or SQL scripts.
However the broad idea - an annotation of the changes, in an atomic
form external to the model itself - is pretty much essential.
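As a rough illustration only (the file location, the structure and the
operation names are all open for discussion), such an annotation could live
outside models.py and look something like:

# hypothetical evolution/__init__.py -- nothing here is an existing API.
# each entry lists the atomic changes between two app versions, applied in order.
CHANGES = {
    ('v1', 'v2'): [('rename', 'field1', 'field1_new'),
                   ('delete', 'field2')],
    ('v2', 'v3'): [('rename', 'field3', 'field3_new'),
                   ('add',    'field4', 'CharField(maxlength=32)')],
}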
Yours,
Russ Magee %-)
I thought you were proposing moving away from ordered sequences? Besides:
migrations = {
'f1': stuff1,
'f2': stuff2,
'f3': stuff3,
}
The current state of the database has fingerprint f1, so you apply
'stuff1'. This gives the database fingerprint f2, so you apply stuff2,
and so on. It reads linearly, even though technically (internally, at
the dictionary level) it isn't. If you need non-linear upgrade
paths, you can introduce them by adding the new fingerprints wherever
they fit logically.
> (besides, it's easier to read, don't you think?)
Arguable. As shown above - if you describe a dictionary in the order
in which the states occur, it will essentially look (and behave) like
a list.
> well, show tables + show index. and yes, it will be different for
> different backends. but as you said, the SQL is different too. if
> you're writing migration scripts by hand (or even generating them via
> introspection), you have to store one copy for each backend you support
> anyway... so you want the fingerprints to be different.
SQL migrations will require multiple scripts, but the base migration
definitions should be platform independent.
> no problem. migrations to later versions take priority, which follows
> the django "do the right thing" philosophy. on the rare occasion you want
> to upgrade to v2, comment out the to_v3 migration temporarily. (or more
> likely, the only time you would want to upgrade to v2 would be when you
> checked out an older version, which wouldn't have the v3 script anyway)
I think you have missed my point. Here's a test case:
1) Create a base model at v1
2) Synchronize 2 databases (A & B)
3) Update model to v2 (delete field x)
4) Migrate database A to v2
5) Update model to v3 (add field x again)
6) Migrate database A to v3
7) Update model to v4 (some other change)
8) Migrate database A and B to v4
Database A is my development machine. It gets synced regularly, and so
migrations are written and applied as I work.
Database B is my production machine. It gets synced when features are stable.
v1 and v3 have the same fingerprint.
In step 8, I need to apply v3->v4 to Database A. However, on database
B I need to apply v1->v2, v2->v3, v3->v4.
Your suggestion is that I _could_ work around this by writing a v1->v4
script. However, that still won't help identify which database is in
which state, because their fingerprints are the same. A v1->v4 script
would work for B, but not for A.
Writing a v1->v4 script is also extra development work. I have already
written and validated that the sequence of v1->v2, v2->v3, v3->v4
scripts work on my test machine - not only is it less work to just use
the scripts that already exist, it could be argued that this is the
only responsible thing to do - running untested scripts on the live
server is a lot more risky than running scripts that have been tested
on the dev server.
This is one of the reasons that I suggested a human assigned 'tag' for
tracking changes. If the tag is human assigned, you can enforce
uniqueness. You can't do that with a fingerprint. The tag also
provides a clean name that can be used to store and retrieve sql
files, migration descriptions, etc.
> > If so, how do I populate the post-state value before I
> > have applied the migration for the first time?
>
> you don't have to...just stick a dummy value in there (or leave it
> blank) and run your script. (either manually or w/ the tool) then call
> fingerprint, and copy-paste the new value over your dummy value.
That was what I thought you would suggest. Not a fan, I'm afraid - if
this is a manual step, it's some pretty messy housekeeping; automating
it would require some gnarly rewriting of python code. Plus, it leaves
the initial run unvalidated, and subsequent runs are only validated
against that first run - whether the endpoint is actually correct is
never validated.
It's also a redundant validation, to an extent. You already have a
point of validation for the end of a migration path - a migration is
complete and correct when the models in models.py match the schema in
the database. Why not use that?
Yours,
Russ Magee %-)
I like all of these requirements. Here's some more I'd like to see:
* Identify a primary use-case, and make it dead simple for that case.
And simple for newbs.
* Provide, not only raw SQL migration entry, but also django ORM
migration entry.
* Add no extra configuration to the models.py.
I propose that in-development migration be treated as the primary
use-case. At least personally, I find myself
constantly evolving my models.py during development. In this case, I
barely ever care about past evolutions, and introspection works 95% of
the time. I like the '--hint --evolve' idea a lot. However, I think
the standard suggestion should be './manage.py syncdb --hint', which
would generate the evolution and then list a summary of changes, and
then './manage.py syncdb --evolve' would finally enact them. This way
the user must see that summary and mistakes will be avoided. Also
"hint" seems like the wrong word. "suggest", "guide", "auto",
"build"? Hrm, none very good. I can't seem think of a better word;
so I'll ponder that a while...
Part of the allure of my proposal, to me anyway, is the idea that you
can define presync and postsync functions that share the same module
namespace. And each takes a "models" argument. The idea being that
you can use the previous ORM models.py in the presync, and then in the
postsync use the new ORM and the previously generated globals to
change up your data. Meanwhile, you could also use the "cursor"
argument to run your own custom sql commands. Although I also like the
idea of being able to have a simple sql file.
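To make that concrete, here is a bare-bones sketch of the shape I have in
mind -- the hook names, the (models, cursor) signature and the Entry model
are all assumptions, not an existing API:

stash = {}   # shared module-level namespace between the two hooks

def presync(models, cursor):
    # 'models' would be the ORM built from the *old* models.py
    stash['ids'] = [e.id for e in models.Entry.objects.all()]

def postsync(models, cursor):
    # 'models' is now the ORM for the *new* models.py; raw sql via the
    # cursor is still available for anything the ORM can't express
    cursor.execute("UPDATE app_entry SET status = 'migrated'")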
I don't like the bit about the Meta section. That adds configuration
to the models.py; even if it's not in the Model proper, it's not that
far off from 'aka', in my view. And, although I like being able to
label snapshots of the schema, I don't like it being the primary
listing device. Each snapshot should work like an SVN version, in my
opinion; that way you can easily see the order without the need of the
Meta section. Perhaps evolution files could be listed thusly: "%s -
%s.py" % (version, label). Also it would be cool if the admin had a
special view for the evolutions of a model.
Nitpickery:
> mutations = [
> AddColumn('Author', 'dateofbirth', models.DateField, initial_value=None),
> DeleteColumn('Author', 'age')
> ]
>
> Initially, the evolution.mutation package will only include some
> simple transforms (Add, Delete, Rename, etc). However, the built in
> set can be easily expanded later on...
Since you have the signature of the past and present schemas, all you
really need to do is specify which fields are added, dropped, and
renamed. Take this structure as an alternative:
evolve( version = 2, drop = ['Author.age'], create = ['Author.dateofbirth'] )
The order of applied transformations shouldn't matter at this point if
you lock your data at the beginning and the end. This is acceptable
because it is easier, and just as robust, to define presync and
postsync functions that share a namespace, rather than to allow
arbitrary functions at any point in your transformation sequence.
And testing in this case is simpler, I'd imagine, as you don't have to
define all sorts of tests for all sorts of mutations.
> ... So, if pre_christmas_2007.py exists,
> ./manage.py syncdb --sqlhint will generate pre_christmas_2007.sql ...
What about "./manage.py syncdb --sqlevolve"? that fits the ./manage.py
patterns a bit more it seems to me.
Generally, I really like the whole idea, and I wish it already existed,
so start building it.
- Brantley Harris
re: re: pps
> > pps. Russell, you can't be '_strong_ -' , what's next '-2'?
>
> Of course. And then, -3. :-)
You crazy fool. You have no idea what you've unleashed upon the world!
> "hint" seems like the wrong word. "suggest", "guide", "auto",
> "build"?
preview?
--
regards
kg
http://lawgon.livejournal.com
http://nrcfosshelpline.in/web/
I think that is the 'Alice' use case from the original spec. Although
Alice only does a field addition, I'd be happy to expand this use case
to include all additions, deletions, or property changes to fields.
> * Provide, not only raw SQL migration entry, but also django ORM
> migration entry.
Agreed. I thought this was a given, so I neglected to mention it.
> * Add no extra configuration to the models.py.
No particular disagreement here.
> In this case, I
> barely ever care about past evolutions, and introspection works 95% of
> the time. I like the '--hint --evolve' idea a lot. However, I think
> the standard suggestion should be './manage.py syncdb --hint', which
> would generate the evolution and then list a summary of changes, and
> then './manage.py syncdb --evolve' would finally enact them.
That's pretty much what I was driving at. Allowing '--hint --evolve'
as a single command might be a convenience, but there is something to
be said for putting a safety catch on these things.
> This way
> the user must see that summary and mistakes will be avoided. Also
> "hint" seems like the wrong word. "suggest", "guide", "auto",
> "build"? Hrm, none very good. I can't seem think of a better word;
> so I'll ponder that a while...
Meh. I think the bike shed should be green. I'm happy to go with any
suggestion. He who writes the code gets the command he wants :-)
> Part of the allure of my proposal, to me anyway, is the idea that you
> can define presync and postsync functions that share the same module
I'll come back to this idea in a little bit.
> I don't like the bit about the Meta section.
That was actually an afterthought on my part, and in retrospect, not a
very good one. My original idea was to keep the migration definition
in evolution/__init__.py. Thinking about the problem some more, a
location external to individual models is a much better place for
defining migration sequences (or graphs, if you take Derek's
suggestion). This is because migration is an app-wide task, not just a
model-based activity. The rename of Model1 to Model2 could also affect
Model3; migration tags really describe the evolution of an entire app,
not just a single model.
> Nitpickery:
> > mutations = [
> > AddColumn('Author', 'dateofbirth', models.DateField, initial_value=None),
> > DeleteColumn('Author', 'age')
> > ]
> >
> > Initially, the evolution.mutation package will only include some
> > simple transforms (Add, Delete, Rename, etc). However, the built in
> > set can be easily expanded later on...
>
> Since you have the signature of the past and present schemas, all you
> really need to do is specify which fields are added, dropped, and
> renamed. Take this structure as an alternative:
>
> evolve( version = 2, drop = ['Author.age'], create = ['Author.dateofbirth'] )
Your approach allows for three types of change - add, drop, and rename.
Then there is a specific pre-sync and post-sync. This covers a lot of
use cases; however:
* Every time I have written an API with a pre and post phase, I have
ended up needing a post-pre and pre-post, and post-post, and pre-pre,
and really-no-I-mean-it-pre....
* It doesn't allow for changes that must be performed in sequence -
first_name and last_name must be merged into name _before_ the search
vector field is added and populated.
* This approach doesn't allow for complex definitions that require
'pre/post-like' operations to be performed midway through a sequence:
'do this sequence of behaviour after the rename of X, but before the
deletion of Y'
A user-defined migration could be:
- a migration in the classical sense (add, delete, rename)
- some raw SQL instructions that need to be run on the cursor
- a set of Python instructions to modify data (such as your pre/post processing)
Essentially, anything you may need to migrate from model
state A to model state B. You want a pre and post step? You can write
a 'PythonMutation' class and install it as the first and last
mutation in an evolution. If you make it generic enough, we can add it to
contrib.evolution.mutations for all to enjoy.
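A rough sketch of what I mean -- 'BaseMutation' and 'PythonMutation' are
proposal-level names only, not existing Django code:

class BaseMutation(object):
    def get_sql(self):
        return []                    # the SQL this mutation would emit, if any
    def apply(self, cursor):
        for statement in self.get_sql():
            cursor.execute(statement)

class PythonMutation(BaseMutation):
    # wraps an arbitrary callable so it can sit anywhere in an evolution
    def __init__(self, func):
        self.func = func
    def apply(self, cursor):
        self.func(cursor)            # capturing its SQL is the open problem below

def remember_names(cursor):
    cursor.execute("SELECT id, first_name, last_name FROM app_author")

mutations = [
    PythonMutation(remember_names),       # acts as a "pre" step
    # AddColumn('Author', 'name', ...), DeleteColumn('Author', 'first_name'), ...
    PythonMutation(lambda cursor: None),  # placeholder "post" step
]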
Now - I fully admit that there is some finessing to do here,
especially with regards to the capture of SQL generated by the
PythonMutation class. However I think this is a problem worth solving.
A separate thread on django-dev recently suggested that all
./manage.py commands should be able to output the SQL they would run
(e.g. - you can't easily capture the SQL output by syncdb); this would
also fit in nicely with some database mocking thoughts I have been
having recently.
Worst case (or for first pass), we could disable SQL generation for
evolutions with PythonMutations in them.
As far as development goes, we can build the framework incrementally.
First pass, we just have add and delete. Then we add rename. Then
change_properties. We set up a base framework, and let it expand (and
evolve :-) over time.
> The order of applied transformations shouldn't matter at this point if
> you lock your data at the beginning and the end. This is acceptable
> because it is easier, and just as robust, to define presync and
> postsync functions that share a namespace, rather than to allow
> arbitrary functions at any point in your transformation sequence.
I don't believe this is true. Even if your data is locked, there can
still be legitimate reasons for performing operations in sequence. I
gave a simple example previously; the more complex the changes, the
more likely that ordering will matter.
> And testing in this case is simpler, I'd imagine, as you don't have to
> define all sorts of tests for all sorts of mutations.
Ah - but the fact that it is modularized makes it easier to test,
because we have a clean interface against which to test. Every
mutation must define get_sql - we can test that the output of this
method is correct. Test cases get smaller and faster.
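For instance, with a stub AddColumn standing in for the real mutation class,
a test against that interface stays tiny:

import unittest

# stand-in for the proposed mutation class; the real one would build its SQL
# from the backend's quoting rules and column types
class AddColumn(object):
    def __init__(self, model, column, db_type):
        self.model, self.column, self.db_type = model, column, db_type
    def get_sql(self):
        return ['ALTER TABLE `app_%s` ADD COLUMN `%s` %s NULL;'
                % (self.model.lower(), self.column, self.db_type)]

class AddColumnSQLTest(unittest.TestCase):
    def test_add_column(self):
        self.assertEqual(
            AddColumn('Author', 'dateofbirth', 'date').get_sql(),
            ['ALTER TABLE `app_author` ADD COLUMN `dateofbirth` date NULL;'])

if __name__ == '__main__':
    unittest.main()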
Yours,
Russ Magee %-)
> I thought you were proposing moving away from ordered sequences? Besides:
i am moving towards graphs of schema revisions. but while lines are
always graphs (and graphs can have order), only the simplest graph can
be represented as a line. and graph edges are always represented as
pairs of nodes. there is no way of representing a directed graph
without specifying the start node and end node of each edge:
http://en.wikipedia.org/wiki/Directed_graph#Directed_graph
the post validation is just free icing.
> migrations = {
> 'f1': stuff1,
> 'f2': stuff2,
> 'f3': stuff3,
> }
i.e., this^^^ will never support anything but linear, one-way migrations.
which brings us to the greater point. no graph implementation can
guarantee you can't have multiple paths. it's up to the developer whether
that's a constraint s/he wishes to impose. and multiple paths means you
can't have guaranteed data equivalence, as you pointed out. (even
though you can still have data validity, and schema consistency, which i
argue is good enough) but multiple paths is a feature, not a bug, of my
design. i'm not going to "fix" my implementation by removing support
for it.
yet you can easily still create single-path migrations. just define
your graph that way.
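for example (fingerprint values and sql invented for illustration), a graph
with one optional shortcut edge might be written as:

# each key is a (from_fingerprint, to_fingerprint) edge in the graph.
# leave out the shortcut edge and you have a plain single-path migration.
migrations = {
    ('fp_a', 'fp_b'): ["ALTER TABLE `app_author` ADD COLUMN `dateofbirth` date NULL;"],
    ('fp_b', 'fp_c'): ["ALTER TABLE `app_author` DROP COLUMN `age`;"],
    # optional direct path -- only exists if the developer chooses to write it:
    ('fp_a', 'fp_c'): ["ALTER TABLE `app_author` ADD COLUMN `dateofbirth` date NULL;",
                       "ALTER TABLE `app_author` DROP COLUMN `age`;"],
}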
but on a larger "forest for the trees" note, russ, i'm hearing that:
1) no schema migration that supports multiple paths will ever be allowed
into django-proper. (not atomic, as you put it, even though i think
that's not what atomic means)
2) no introspection that works using my inadequate metadata syntax (or
any syntax similar to it, inside or outside models.py) will ever be
allowed in.
which means you're looking for something designed substantially
different from what i originally proposed last summer, what i've
developed, and what i have any interest in developing. should i go back
to just maintaining this for my own personal development use? can i
recommend dbmigration, which seems much more up your alley design-wise?
or does anyone have an interest in continued public development of
automated introspection + migration w/ multi-path support? (i promise
FK/M2M introspection is coming soon if there's a yes vote out
there....but it's mysql only right now and i won't bother finishing
postgres support unless i'm releasing publicly)
derek
p.s. not taking anything personally here, before anyone gets worried
about that again...it's just that i think we might have irreconcilable differences
on the basic design elements, and with your "strong -3" on design i'm
not holding much hope that i'm doing more than pissing in the wind. :)
Well, since the user defines the pre and post phases, he can append or
prepend code to either function. These aren't supposed to be stock
functions, so a post-post-sync is really just something appended to
the post-sync function.
> * It doesn't allow for changes that must be performed in sequence -
> first_name and last_name must be merged into name _before_ the search
> vector field is added and populated.
I just don't see why changes, within a unit-migration, need to be made
in any sort of ordered sequence, and your example doesn't really
change my mind. Can you suggest another example? I'm all ears on
this. It seems to me that everything productive can be done by
offering pre and post functions, and having to write my own class to
do something as simple as moving some data seems way more complex than
needed.
For completeness: this can be done easily with a pre_sync that gathers
the "first_name" and "last_name" fields, and then a post_sync that
populates the "full_name" field, and populates/updates the search
vector field.
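A sketch of exactly that, under the same caveats as before (the hook names,
the (models, cursor) signature and the Person model are assumptions):

gathered = {}

def pre_sync(models, cursor):
    # old schema: Person still has first_name / last_name
    for person in models.Person.objects.all():
        gathered[person.id] = '%s %s' % (person.first_name, person.last_name)

def post_sync(models, cursor):
    # new schema: Person now has full_name and the search vector column
    for pk, full_name in gathered.items():
        person = models.Person.objects.get(pk=pk)
        person.full_name = full_name
        person.save()
    # deliberately simplistic; real search-vector population would be db-specific
    cursor.execute("UPDATE app_person SET search_vector = full_name")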
> Now - I fully admit that there is some finessing to do here,
> especially with regards to the capture of SQL generated by the
> PythonMutation class.
I wonder if the Django ORM could enter a sort of "theoretical" mode
that would just return SQL output.
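There's no such mode today, as far as I know; the closest existing thing (it
records the SQL *after* running it, so it's not a true dry-run) is the query
log Django keeps when DEBUG is on:

# requires settings.DEBUG = True; Django then records every executed query
from django.db import connection

# ... run some ORM calls here ...
for q in connection.queries:
    print q['sql']     # each entry is a dict with 'sql' and 'time' keys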
> > The order of applied transformations shouldn't matter at this point if
> > you lock your data at the beginning and the end. This is acceptable
> > because it is easier, and just as robust, to define presync and
> > postsync functions that share a namespace, rather than to allow
> > arbitrary functions at any point in your transformation sequence.
>
> I don't believe this is true. Even if your data is locked, there can
> still be legitimate reasons for performing operations in sequence. I
> gave a simple example previously; the more complex the changes, the
> more likely that ordering will matter.
Again, provide examples or I will destroy Django.
Django exclaims, "Do what he says! He's crazy!"
i considered this. but i have my doubts that i could get the necessary
hooks pushed through to allow this to function as an external app.
if i have even a minor interest base, however, i'm willing to continue
publicly maintaining a patch against HEAD and/or the latest release. i
just want to know...
Hrm - I'd think again there. Personally, I'd *much* rather see
migration done as a third-party bit since it seems difficult to find
something we'd all want to see in core. I'd certainly push *strongly*
for any hooks you'd need to make S-E a sustainable external project.
One huge benefit of an external project is that we can gauge
acceptance *before* including it in Django. If it appears to be
popular, we might reconsider its inclusion.
> if i have even a minor interest base, however, i'm willing to continue
> publicly maintaining a patch against HEAD and/or the latest release. i
> just want to know...
There's a good deal of interest from me, at least. I like the project,
I just have doubts about it as a built-in part of Django.
So what hooks do you think you'd need? Would the sql-level hooks being
discussed in other threads be sufficient?
Jacob
would you still allow a hook for annotating in models.py?
derek