index_together...shouldn't indexes be created AFTER populating the table?

174 views
Skip to first unread message

Chris Pagnutti

unread,
Nov 14, 2012, 3:14:48 PM11/14/12
to django...@googlegroups.com
Hi.  The new index_together feature is great, but I think it's best to create an index on a table AFTER the table is filled, and assuming there won't be many new inserts.  But in django, syncdb creates the index at the same time the table is created, and of course, the table is initially empty.  How does django deal with this?  Does it know to drop the index before inserting new records and re-create the index afterwards?

If not, can I syncdb, drop the index manually in mysql, populate the table, then re-create the index manually (as long as I use the same index name that django chose)?  Will django be none-the-wiser if I do this?

Javier Guerra Giraldez

unread,
Nov 14, 2012, 3:42:55 PM11/14/12
to django...@googlegroups.com
On Wed, Nov 14, 2012 at 3:14 PM, Chris Pagnutti
<chris.p...@gmail.com> wrote:
> The new index_together feature is great, but I think it's best to create an
> index on a table AFTER the table is filled, and assuming there won't be many
> new inserts.

AFAIK, this is an optimization advice applicable only for mostly-static tables.

> But in django, syncdb creates the index at the same time the
> table is created, and of course, the table is initially empty. How does
> django deal with this?

nothing to deal with, this is normal behaviour in all SQL-based databases.

> Does it know to drop the index before inserting new
> records and re-create the index afterwards?

hell, no!, that would be hideously inefficient!

what could be workable is to do that only on really big batch inserts,
only if the indexes don't carry a restriction (like UNIQUE indexes),
and really part of an off-line operation, not in the normal HTTP
request processing.

--
Javier

Chris Pagnutti

unread,
Nov 14, 2012, 3:49:23 PM11/14/12
to django...@googlegroups.com
Thanks for your reply.  The tables I'm dealing with are entirely static, but some have many millions of records, which is why I want my indexes to work as good as possible.  If I create my indexes manually AFTER populating the tables, will queries made via the django db api on those tables use those indexes properly?

I'm thinking the answer is yes, since the use of the indexes happens at the mysql level, and django just has to issue the select statement.


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django...@googlegroups.com.
To unsubscribe from this group, send email to django-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.


Javier Guerra Giraldez

unread,
Nov 14, 2012, 4:36:00 PM11/14/12
to django...@googlegroups.com
On Wed, Nov 14, 2012 at 3:49 PM, Chris Pagnutti
<chris.p...@gmail.com> wrote:
> Thanks for your reply. The tables I'm dealing with are entirely static, but
> some have many millions of records, which is why I want my indexes to work
> as good as possible. If I create my indexes manually AFTER populating the
> tables, will queries made via the django db api on those tables use those
> indexes properly?

of course. although in your case what i'd do is:

- create the table with indexes
- suspend them
- bulk-insert the data
- resume (and rebuild) indexes

it's mostly the same as you propose, but i just like having as much of
the description as possible in the Django model declaration.

OTOH, if you insert millions of records offline, you might not want to
do it with Django. MySQL has the LOAD DATA INFILE command and
mysqlimport utility, for exactly this scenario. The docs also advice
using ALTER TABLE tbl_name DISABLE/ENABLE KEYS commands to
suspend/resume indexing.


> I'm thinking the answer is yes, since the use of the indexes happens at the
> mysql level, and django just has to issue the select statement.

exactly. the choice of indexing strategy is done by the RDMS on a
per-statement basis.

in fact, since the index_together wasn't available, i have done
exactly that (create without indexes, add them manually) more than
once. usually i intend to add the ALTER TABLE commands to
sql/modelname.sql files as described in
https://docs.djangoproject.com/en/1.4/howto/initial-data/#providing-initial-sql-data


finally, the top authority in bulk-insertion of data to Django is Cal
Leeming (a regular on this list); i haven't been able to see his
webinar yet, but i'm sure he has lots of insight to add to this.

--
Javier

Chris Pagnutti

unread,
Nov 14, 2012, 7:55:22 PM11/14/12
to django...@googlegroups.com
Awesome.  So I don't even have to declare the "index_together" in the django models?  I can just create them afterwards using pure mysql?

LOAD DATA INFILE is wicked fast!!! Thanks a ton for that one.
Message has been deleted

Jamie Lawrence

unread,
Nov 15, 2012, 1:05:54 AM11/15/12
to django...@googlegroups.com, django...@googlegroups.com
Keep in mind that whatever RDBMS you are using is completely separate from your application. You can alter it out from under the app in any way you see fit. The code offers some convenience methods for generating things, but that doesn't tie the two together more than an expectation on the application side.

You could generate a DB completely separately (in fact, I've built Django around existing DBs in the past).

If you will be operating at volume, and have these sorts of questions, I'd suggest consulting a local database person.

-j

--
Sent from a phone, please excuse typos and terseness.

On Nov 14, 2012, at 9:53 PM, in Dennis Lee Bieber <wlf...@ix.netcom.com> wrote:

> On Wed, 14 Nov 2012 15:42:55 -0500, Javier Guerra Giraldez
> <jav...@guerrag.com> declaimed the following in
> gmane.comp.python.django.user:
>
>> On Wed, Nov 14, 2012 at 3:14 PM, Chris Pagnutti
>> <chris.p...@gmail.com> wrote:
>>> The new index_together feature is great, but I think it's best to create an
>>> index on a table AFTER the table is filled, and assuming there won't be many
>>> new inserts.
>>
>> AFAIK, this is an optimization advice applicable only for mostly-static tables.
> Most of my books consider it an optimization for bulk-loading of
> table data (say, recreating the tables from an SQL dump file (something
> like the MySQL backup format -- a la:
>
> --
> -- Definition of table `conventions`
> --
>
> DROP TABLE IF EXISTS `conventions`;
> CREATE TABLE `conventions` (
> `ID` int(11) NOT NULL auto_increment,
> `name` char(100) NOT NULL default '',
> `URL` char(75) default NULL,
> `sortdate` date NOT NULL default '0000-00-00',
> `dates` char(50) NOT NULL default 'TBD',
> `site` char(75) NOT NULL default 'TBD',
> `notes` char(125) default NULL,
> PRIMARY KEY (`ID`),
> KEY `sortdate` (`sortdate`)
> ) ENGINE=MyISAM AUTO_INCREMENT=42 DEFAULT CHARSET=latin1;
>
> --
> -- Dumping data for table `conventions`
> --
>
> /*!40000 ALTER TABLE `conventions` DISABLE KEYS */;
> INSERT INTO `conventions`
> (`ID`,`name`,`URL`,`sortdate`,`dates`,`site`,`notes`) VALUES
> (28,'OklaCon X<br><i>It Came From Planet
> X!/i>','http://www.oklacon.com/','2012-10-22','October 17 - 22,
> 2012','Watonga, OK','Roman Nose State Park'),
> (2,'BayCon 2012','http://www.baycon.org/','2012-05-28','May 25 - 28,
> 2012','Santa Clara, CA',NULL),
> (3,'AnthroCon 2012<br><i>A Midsummer Night\'s
> Dream</i>','http://www.anthrocon.org/','2012-06-14','June 14 - 17,
> 2012','Pittsburgh, PA',NULL),
>
> <snip>
>
> (37,'Furry Connection
> North','http://www.furryconnectionnorth.com/','2012-04-15','April 13 -
> 15, 2012','',NULL),
> (41,'Wild Nights','http://www.wildnights.org/','2012-04-30','April 26 -
> 30, 2012','Robber\'s Cave State Park, OK','Another Camp Convention');
> /*!40000 ALTER TABLE `conventions` ENABLE KEYS */;
>
> (in which the /*!40000 ...*/ lines are conditional based upon the
> version of the MySQL server executing the statements, and are used to
> disable the indices at the start of the data load, then enable them at
> the end)
> --
> Wulfraed Dennis Lee Bieber AF6VN
> wlf...@ix.netcom.com HTTP://wlfraed.home.netcom.com/

Cal Leeming [Simplicity Media Ltd]

unread,
Nov 18, 2012, 4:27:42 PM11/18/12
to django...@googlegroups.com
Hello Chris,

You are correct that creating indexes AFTER importing data is generally the fastest approach, but there is currently no automated tool that will do this all for you.

I actually did a blog post about this not so long ago, and it goes into some detail;

One way is to dump the table structures, create a clone of the table, use SQLYog to compare the clone to the original and copy out only the INDEX statements, then dump the data too.

But, if you're worried about insert performance then you'd want to think about using a multi-threaded import approach, using tools such as mydumper (slightly unstable) or splitting the dump into X number of files based on line count. You'd also need to tinker with the innodb settings to ensure the disks are being flushed to all the time, increasing the checkpoint age, flush method etc.

I am by no means an expert on InnoDB (it's just been recently that I've started to dive deep into it), and Google will almost certainly offer infinitely better explanations on which parameters to tune etc.

Generally speaking, you only need to worry about this sort of thing once you have passed a couple of million rows or approx 20GB datadir size.. The amount of time this sort of thing consumes, not to mention the increased risk of error or missing data, really is a pain.. 

The end conclusion I came to is that mydumper/mysqldump/mysqlimport all suck, and there is a lot of room for improvement.. I'm hoping to build such a tool one day :)

Cal

On Wed, Nov 14, 2012 at 8:14 PM, Chris Pagnutti <chris.p...@gmail.com> wrote:
Hi.  The new index_together feature is great, but I think it's best to create an index on a table AFTER the table is filled, and assuming there won't be many new inserts.  But in django, syncdb creates the index at the same time the table is created, and of course, the table is initially empty.  How does django deal with this?  Does it know to drop the index before inserting new records and re-create the index afterwards?

If not, can I syncdb, drop the index manually in mysql, populate the table, then re-create the index manually (as long as I use the same index name that django chose)?  Will django be none-the-wiser if I do this?

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To view this discussion on the web visit https://groups.google.com/d/msg/django-users/-/gBx54Nzn5X8J.
Reply all
Reply to author
Forward
0 new messages