data migrations for large tables on Postgres


Vernon D. Cole

May 14, 2015, 5:43:38 PM
to django...@googlegroups.com
I have learned the hard way this week that data migrations in Django 1.8, as wonderful as they are, do not scale.

My test data table is now sitting at about 90,000,000 rows.  I was able to add a "null=True" field in an instant, as documented.  Then came my attempt to fill it -- I tried using RunSQL, as suggested, and the migration ran for more than a day before crashing. The entire migration is treated as a single transaction, so none of the work was committed. My next attempt was to use RunPython with manual transaction control, but that cannot be done because the "atomic=False" argument to RunPython does not work on a Postgres database.
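
To make the failure concrete, the first attempt looked roughly like this (the model and column names here are invented for illustration):

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [('myapp', '0007_add_checksum_field')]

    operations = [
        # One UPDATE over all ninety million rows, executed inside the
        # single transaction that wraps the whole migration.  It ran for
        # over a day and crashed, so nothing was committed.
        migrations.RunSQL(
            "UPDATE myapp_bigtable SET checksum = md5(payload) "
            "WHERE checksum IS NULL;"
        ),
    ]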

My final solution (which should finish propagating my new field about three days from now) was to have the migration module spawn a copy of itself as a no-wait subprocess. Running as a main program, it opens its own connection to the database and does the conversion a chunk at a time with manual transaction control.
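
In outline, the pattern is something like this -- a sketch, assuming the usual project layout (the settings module, app, and migration names are placeholders):

import os
import subprocess
import sys

from django.db import migrations

def spawn_worker(apps, schema_editor):
    # Launch this same file as a detached main program and return
    # immediately, so the migration itself finishes in an instant.
    subprocess.Popen([sys.executable, os.path.abspath(__file__)])

class Migration(migrations.Migration):

    dependencies = [('myapp', '0007_add_checksum_field')]

    operations = [
        migrations.RunPython(spawn_worker, migrations.RunPython.noop),
    ]

if __name__ == '__main__':
    # Running as a main program, outside the migration's transaction:
    # set up Django, then do the chunked conversion (shown further down
    # the thread) over a private database connection.
    import django
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'mysite.settings')
    django.setup()
    # run_conversion()  # the chunked update loop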

I think that my solution may be of benefit to others who could adapt the code to their own situation -- but I am not sure how to publish it. It is not a module, so publication on PyPI or the like would be wrong. What would be effective?

john

May 14, 2015, 6:16:43 PM
to django...@googlegroups.com
As a newbie: would it be better to use pgAdmin (or psql) to make the changes and then run migrate --fake so that Django would be happy?
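
Something like this, maybe (the app and migration names below are just guesses):

# 1. Make the schema change yourself in psql or pgAdmin, e.g.:
#      ALTER TABLE myapp_bigtable ADD COLUMN checksum text;
# 2. Then record the migration as applied without actually running it:
from django.core.management import call_command

call_command('migrate', 'myapp', '0007', fake=True)

(or equivalently, python manage.py migrate myapp 0007 --fake from the shell)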

Johnf

Vernon D. Cole

May 14, 2015, 6:35:22 PM
to django...@googlegroups.com
I was thinking of something like that ... but I am not a good enough PL/pgSQL programmer to figure out how to do the chunking in SQL. I needed a way to do the data calculations a thousand rows at a time, rather than all ninety million in one gulp. So I have Python do the chunking and SQL do the update.

There was also the documentation aspect: "where do I store the SQL code and the instructions for how to do the migration manually?" This way, everything is contained in the migration module. The SQL is stored in Python triple-quoted string literals, along with the code to run it, and the database connection information comes right out of the Django settings module.
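
The worker loop looks roughly like this -- a sketch, assuming psycopg2, an integer primary key, and placeholder table and column names:

import psycopg2
from django.conf import settings

UPDATE_CHUNK = """
    UPDATE myapp_bigtable
       SET checksum = md5(payload)
     WHERE id > %(last_id)s
       AND id <= %(last_id)s + %(chunk)s
       AND checksum IS NULL;
"""

def run_conversion(chunk=1000):
    # Connection details come straight out of the Django settings module.
    db = settings.DATABASES['default']
    conn = psycopg2.connect(
        dbname=db['NAME'], user=db['USER'], password=db['PASSWORD'],
        host=db['HOST'] or None, port=db['PORT'] or None,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT coalesce(max(id), 0) FROM myapp_bigtable;")
            max_id = cur.fetchone()[0]
            last_id = 0
            while last_id < max_id:
                cur.execute(UPDATE_CHUNK, {'last_id': last_id, 'chunk': chunk})
                conn.commit()  # each thousand-row chunk commits on its own
                last_id += chunk
    finally:
        conn.close()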