Re: advice on running migrations on a large database


Andrew Godwin

May 19, 2013, 3:36:10 AM
to south...@googlegroups.com
Hi Jonathan,

It sounds like you were adding a column with a default value, since you were able to cancel the migration and the column was still there. Adding a column with a default means the database must rewrite every single row; in fact, MySQL rewrites every row even when you add a NULLable column, which is why it "normally" takes 10-15 minutes.

MySQL is also just this slow, and there's no way to clean up once such a migration dies, because MySQL's support for schema-altering statements (DDL) is very poor. PostgreSQL, by contrast, guarantees to add a NULLable column without locking the table, in a single fast operation (it usually takes a millisecond or two), and it also allows DDL statements to be wrapped in transactions: if you had been using it when you killed the command, the database would have rolled back to its previous state. (It might also have been faster, but if you're adding a default value PostgreSQL still has to rewrite every row, which takes some time.)
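As an aside, transactional DDL is easy to see in action even without a PostgreSQL server. A minimal sketch using SQLite (which, like PostgreSQL and unlike MySQL 5.x, can roll back an ALTER TABLE); the table and column names here are made up:

```python
# Demonstrate transactional DDL: an ALTER TABLE inside a transaction
# disappears on ROLLBACK, which is what would have happened to the
# killed migration on PostgreSQL.
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual transactions
conn.execute("CREATE TABLE t (id INTEGER)")
conn.execute("BEGIN")
conn.execute("ALTER TABLE t ADD COLUMN extra TEXT")
conn.execute("ROLLBACK")  # the killed-migration scenario

# The new column is gone after the rollback.
cols = [row[1] for row in conn.execute("PRAGMA table_info(t)")]
print(cols)  # → ['id']
```

On MySQL 5.x the ALTER would have been committed implicitly the moment it ran, leaving the half-applied schema the original poster saw.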

Don't worry about memory usage - only a single query is being run, so South won't be getting slow because of that.

If you want my honest advice for changing schema on large databases:

 - Use PostgreSQL. We moved Lanyrd to it because we had a ~10 million row table which we needed to add columns to, and it was taking 20 minutes per column under MySQL. It took less time to move the data across than to add our six new columns.

 - If you must use MySQL, consider using OneToOneFields rather than adding columns. It's incredibly inefficient to store things this way and slow to query, but it won't lock the entire old table while you add new data.

 - Generally, never, ever add columns with default values, especially not ones like NOW(). South does its best to stop you doing this: Django applies defaults in its own ORM before anything hits the database, so defaults aren't needed on the database side. But if you add a null=False column (the default), South has to add the column with a default so that all existing rows get data, which is very, very slow, and then remove the default afterwards (which is fast).
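As a sketch of the faster alternative (table, column, and function names here are invented for illustration), the way to end up with a NOT NULL column on PostgreSQL without a long lock or a DB-side default is to split the single ALTER into three short steps:

```python
# Build the three-step DDL/DML that replaces one slow
# "ALTER TABLE ... ADD COLUMN ... NOT NULL DEFAULT ..." statement.
def safe_add_column_sql(table, column, coltype, default_literal):
    """Return the SQL statements for the three-step approach, as strings."""
    return [
        # 1. Fast on PostgreSQL: nullable column, no table rewrite.
        "ALTER TABLE %s ADD COLUMN %s %s NULL;" % (table, column, coltype),
        # 2. Backfill existing rows (can be batched on huge tables).
        "UPDATE %s SET %s = %s WHERE %s IS NULL;"
        % (table, column, default_literal, column),
        # 3. Cheap metadata-only change once every row has a value.
        "ALTER TABLE %s ALTER COLUMN %s SET NOT NULL;" % (table, column),
    ]

for stmt in safe_add_column_sql("myapp_article", "summary", "text", "''"):
    print(stmt)
```

Step 2 is the only one that touches every row, and because it is a plain UPDATE it can be run in batches rather than holding one long lock.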

Large datasets are generally a pain to handle, but these are my top tips. Most large MySQL installations aren't changing their schemas at all; they're usually implementing a pseudo-database on top of MySQL, with EAV or something like that. Your dataset may also be better suited to a different storage architecture than plain relational - perhaps JSON blobs, if all you're doing is adding lots of sparse, optional attributes - but without deeper knowledge of what you're trying to do it's hard to say more.
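For the JSON-blob idea, a minimal sketch (the attribute names are invented): serialize only the attributes a row actually has into one text column, so new attributes never need an ALTER TABLE:

```python
# Pack sparse, optional attributes into a single JSON text column
# instead of adding one real column per attribute.
import json

def pack_attrs(**attrs):
    """Serialize only the attributes that are actually present."""
    return json.dumps({k: v for k, v in attrs.items() if v is not None})

row_blob = pack_attrs(is_news=True, domain_rank=None, lang="en")
print(row_blob)  # domain_rank is absent, not stored as null
```

The trade-off is that you give up indexing and querying on those attributes at the SQL level, which is fine for write-mostly sparse metadata but not for columns you filter on.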

Andrew


On Sun, May 19, 2013 at 3:40 AM, Jonathan Morgan <jonathan....@gmail.com> wrote:
Hello,

I am using Django's ORM for a research project in which I am working with MySQL InnoDB database tables in excess of 2 million rows.  I have started using South for this project to manage database changes as I go (never used it before, pledged to the kickstarter for integrating it into django, decided I should get to know it - I like it so far!).

Today, I added two fields (one varchar, one text) to the model for the database table that has 2.5 million rows.  I generated the migration, then applied it.

Adding a column on this database normally takes a while - 20, 30 minutes.  The south migration ran for 4 hours before I killed it (I would have let it go longer out of curiosity, but I am on a pretty tight deadline).  Once I killed it, the columns were present in the information_schema for the database, but I wasn't sure what might have been screwed up by my killing the migration, so I dropped them, re-added them by hand, then did a fake migration to get south up to current.

So, a few questions:
- is this normal?  Might I have my database configured incorrectly?  If it would help and you want more information, I'd be happy to profile or test if you can give guidance as to how I'd do that.
- if you kill one of these migrations in the middle, what are your thoughts on cleaning up?  I usually just undo everything by hand, then do it over, to be safe.
- do you have any tips for using south with relatively large databases (I sometimes work with much larger)?
- are there ways to check on the progress of the migration?  It would be nice if there was a verbose flag that output progress as it progressed through the steps of a multi-step migration.

For context, the machine is running ubuntu 13.04, latest python (2.7.4), django (1.5.1), and mysql (5.5.31-00) for the distribution, has a 4-core i7 2600K, 20 GB of RAM, the database is only 3 GB, and I have mysql configured so that it caches up to 8GB of data in RAM, so effectively the whole database is loaded in RAM, and changes are only written to disk once a second.  So, a reasonably competent machine.

Also, in terms of my use of django in long-run situations, whenever I use django with large datasets and processes that run long, I usually have to periodically call django.db.reset_queries() and manually garbage collect, else memory gets chewed up and things slow down substantially.  Not sure if this might have some relevance to what is going on here, but figured I'd throw it out there, just in case.
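That periodic reset_queries()-plus-gc pattern can be sketched framework-agnostically; the reset function is passed in as a parameter here so the sketch doesn't assume a configured Django project (in a real project it would be django.db.reset_queries, and the query log only grows like this when DEBUG is on):

```python
# Long-running batch loop that periodically clears the query log and
# forces a garbage-collection pass, as described above.
import gc

def process_in_batches(rows, handle, reset_queries, batch_size=10000):
    """Run handle() over rows, calling reset_queries() and gc.collect()
    once every batch_size rows."""
    for i, row in enumerate(rows, start=1):
        handle(row)
        if i % batch_size == 0:
            reset_queries()  # e.g. django.db.reset_queries in a real project
            gc.collect()     # reclaim reference cycles built up by the loop

# Tiny demo with stand-ins for the Django pieces:
resets = []
process_in_batches(range(25000), handle=lambda r: None,
                   reset_queries=lambda: resets.append(1), batch_size=10000)
print(len(resets))  # → 2
```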

Thanks for any help or advice you can provide!

Jonathan Morgan


Shai Berger

May 19, 2013, 5:13:51 AM
to south...@googlegroups.com
http://code.google.com/p/openarkkit/ gives some support for schema changes on
large MySQL databases. I don't have personal experience with it, but I've
heard a lot of praise, and rumor has it that Facebook are using a fork of it
for their own schema changes. The author, Shlomi Noach
(http://code.openark.org/blog/), is a well-known MySQL expert.

But yes, my basic advice would be in line with Andrew's: just steer clear of
MySQL, if you can.

Shai.

Jonathan Morgan

May 28, 2013, 9:51:26 AM
to south...@googlegroups.com
Thanks for the help (and sorry for the late response - I forgot to set up email notifications on the group, so I didn't see this until now). "No defaults" is a good one: I didn't realize it was the default value, not simply the addition of a column, that caused the rewrite of every row. I'm ready to switch to PostgreSQL instead of MySQL - MySQL's handling of UTF-8 is annoying at best, too - but for this project I'm on a pretty tight deadline, so I'm going to muddle through.

I think I need the relational aspect of some kind of RDBMS because I have relatively complicated relations in my data set (reddit posts tied to users, comments, domains of links, and time-series data with slices of varying lengths), and I fairly frequently have to walk the data to generate different flattened data sets for analysis (I use Django so I can do that in Python instead of R or SPSS). I also do a combination of hand-coding and automated coding and filtering of the data (which domains are "news", for example), so I want to stay as normalized as I can, to keep this coding single-source and consistent across the different generated data tables. Again, because of the tight deadline, in this case I've been adding columns to one particular data table so I can keep the data in a single table for ease of updating in statistics packages (sigh).

Anyway, thanks for advice, and keep up the good work on south.  The more I use it, the more I like it.  Looking forward to the kickstarter enhancements!

Jonathan Morgan

Dec 17, 2013, 12:49:24 PM
to south...@googlegroups.com
By the way, I finally made time to switch to PostgreSQL, and for large databases the difference in many things is night and day. Reads and writes are both faster and stay quick as tables grow. In particular, in a table of time-series data I was building that grew from 0 rows to over 20 million, the time to write a new row increased linearly with table size, with a slope only slightly above 0. With MySQL the growth was closer to exponential once I got past a few million rows: processing a given time period went from 25 minutes at the start to over 4 hours by the 27th period, at which point I killed the process and switched databases to see how much PostgreSQL would help. And that's with no performance tuning at all on the PostgreSQL side.

Migrating the Django app's database from MySQL using https://github.com/philipsoutham/py-mysql2pgsql was also quick and efficient, and has caused no problems I've found yet. Very impressive. To anyone reading this: if you are doing big-data research and want or need a relational database, use PostgreSQL, not MySQL.