Bulk import of data


Fabio Natali

Nov 25, 2011, 9:03:42 AM
to django...@googlegroups.com
Hi everybody!

I have a CSV file that I have to import into my Django website.

Say I have three models involved: category, sub_category and product.

### models.py ###
class Category(models.Model):
    name = models.CharField(max_length=100)

class SubCategory(models.Model):
    name = models.CharField(max_length=100)
    parent_category = models.ForeignKey(Category)

class Product(models.Model):
    name = models.CharField(max_length=200)
    sub_category = models.ForeignKey(SubCategory)
##########


Say this is my CSV file (category, sub_category, product):

### file.csv ###
clothing man;trousers;levis 501
clothing woman;shirt;nice shirt
[...]
##########


I am not sure which way to go. Do you have any hints or web
references I could look at?

Thanks, Fabio.

--
Fabio Natali

Tom Evans

Nov 25, 2011, 9:12:31 AM
to django...@googlegroups.com

It's not that tricky, is it?

Read the CSV file, split out the fields.
Get or create the category
Get or create the subcategory
Get or create the product

in code:

import csv

data = csv.reader(open('/path/to/csv', 'r'), delimiter=';')
for row in data:
    # get_or_create returns an (object, created) tuple, so unpack it
    category, _ = Category.objects.get_or_create(name=row[0])
    sub_category, _ = SubCategory.objects.get_or_create(
        name=row[1], defaults={'parent_category': category})
    product, _ = Product.objects.get_or_create(
        name=row[2], defaults={'sub_category': sub_category})

http://docs.python.org/library/csv.html

Cheers

Tom

Fabio Natali

Nov 25, 2011, 11:23:39 AM
to django...@googlegroups.com
On 11/25/2011 03:12 PM, Tom Evans wrote:
[...]

> It's not that tricky, is it?
>
> Read the CSV file, split out the fields.
> Get or create the category
> Get or create the subcategory
> Get or create the product
>
> in code:
>
> import csv
> data = csv.reader(open('/path/to/csv', 'r'), delimiter=';')
> for row in data:
> category = Category.objects.get_or_create(name=row[0])
> sub_category = SubCategory.objects.get_or_create(name=row[1],
> defaults={'parent_category': category})
> product = Product.objects.get_or_create(name=row[2],
> defaults={'sub_category': sub_category})
>
> http://docs.python.org/library/csv.html

Hey Tom, that's very kind of you, so helpful and fast!

I'll use that in my real scenario (which is a bit more complicated).
I'll be back here soon, reporting success :-) or asking for more help!

Cheers!

--
Fabio Natali

Fabio Natali

Nov 26, 2011, 6:55:42 AM
to django...@googlegroups.com

It works like a charm! Thanks again Tom.

--
Fabio Natali

Petr Přikryl

Nov 26, 2011, 5:44:50 PM
to django...@googlegroups.com

>>> import csv
>>> data = csv.reader(open('/path/to/csv', 'r'), delimiter=';')
>>> for row in data:
>>> category = Category.objects.get_or_create(name=row[0])
>>> sub_category = SubCategory.objects.get_or_create(name=row[1],
>>> defaults={'parent_category': category})
>>> product = Product.objects.get_or_create(name=row[2],
>>> defaults={'sub_category': sub_category})

There are a few potential problems with the csv code as used here.

Firstly, the file should be opened in binary mode. On Unix-based
systems binary mode is technically the same as text mode, but you may
run into problems once you move the code to another environment
(Windows).

Secondly, the opened file should always be closed -- especially when
building a (web) application that may run for a long time.
You can do it like this:

...
f = open('/path/to/csv', 'rb')
data = csv.reader(f, delimiter=';')
for ...
...
f.close()

Or you can use the new Python construct "with".
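
For example, a minimal sketch that reuses Tom's loop (binary mode as
discussed above; the file is closed automatically, even if the loop raises):

import csv

with open('/path/to/csv', 'rb') as f:
    for row in csv.reader(f, delimiter=';'):
        category, _ = Category.objects.get_or_create(name=row[0])
        sub_category, _ = SubCategory.objects.get_or_create(
            name=row[1], defaults={'parent_category': category})
        product, _ = Product.objects.get_or_create(
            name=row[2], defaults={'sub_category': sub_category})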

P.

Andre Terra

Nov 27, 2011, 1:56:11 PM
to django...@googlegroups.com
This should be run asynchronously (e.g. with Celery) when importing large files.

If you have a lot of categories/subcategories, you will want to bulk insert them instead of looping through the data and calling get_or_create for every row. Wrapping the whole import in a single, long transaction will also bring a great improvement in speed.
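
Roughly something like this (just a sketch -- commit_on_success is the transaction context manager in Django 1.3/1.4; newer versions use transaction.atomic instead):

import csv
from django.db import transaction

# Sketch: one long transaction around the whole import, so each
# get_or_create call does not commit on its own.
with open('/path/to/csv', 'rb') as f:
    with transaction.commit_on_success():
        for row in csv.reader(f, delimiter=';'):
            category, _ = Category.objects.get_or_create(name=row[0])
            # ... same sub-category/product get_or_create calls as in Tom's loop ...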

One tool is DSE, which I've mentioned before.

Good luck!


Cheers,
AT




Anler Hernandez Peral

Nov 28, 2011, 3:28:40 AM
to django...@googlegroups.com
Hi, this is probably not your case, but in case it is, here is my story: creating a script to import CSV files is the best solution as long as there are only a few of them. In my case, though, I needed to import nearly 40 VERY BIG CSV files, each one mapping a database table, and I needed to do it quickly. I thought the best way was to use MySQL's "LOAD DATA LOCAL INFILE ..." functionality, since it works very fast and I could write a single function to import all the files. The problem was that my CSV files were pretty big, so my database server was eating huge amounts of memory and crashing my site, and I ended up slicing each file into smaller chunks.

Again, this is a very specific need, but in case you find yourself in such a situation, here's my base code, which you can extend ;)

https://gist.github.com/1dc28cd496d52ad67b29
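
The core of the approach is roughly this (a sketch, not the actual gist; the table and column names are made up, and MySQL's local_infile option must be enabled on both the client and the server):

from django.db import connection

def load_chunk(path):
    # Sketch only: hand one CSV chunk straight to MySQL via a raw cursor.
    # Table and column names below are hypothetical; adjust to your schema.
    cursor = connection.cursor()
    cursor.execute(
        "LOAD DATA LOCAL INFILE %s INTO TABLE myapp_product "
        "FIELDS TERMINATED BY ';' LINES TERMINATED BY '\\n' "
        "(name, sub_category_id)",
        [path],
    )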


--
anler

Fabio Natali

Nov 29, 2011, 12:49:52 PM
to django...@googlegroups.com

Hey Petr! Thank you so much, I immediately followed your advice. The
file is now closed at the end of the story.

Cheers, Fabio.

--
Fabio Natali

Fabio Natali

Nov 29, 2011, 12:54:02 PM
to django...@googlegroups.com
On 11/27/2011 07:56 PM, Andre Terra wrote:
> This should be run asynchronously (i.e. celery) when importing large files.
>
> If you have a lot of categories/subcategories, you will need to bulk
> insert them instead of looping through the data and just using
> get_or_create. A single, long transaction will definitely bring great
> improvements to speed.
>
> One tool is DSE, which I've mentioned before.

Hi Andre, I didn't forget the tips you gave me in a previous thread about
Celery and DSE. Actually, I've been studying them for a while. At the
moment things don't seem too heavy. I think I'll use them in future
development.

Thank you very much, Fabio.

--
Fabio Natali

Fabio Natali

Nov 29, 2011, 12:57:51 PM
to django...@googlegroups.com
On 11/28/2011 09:28 AM, Anler Hernandez Peral wrote:
> Hi, this is probably not your case, but in case it is, here is my story:
> Creating a script for import CSV files is the best solution as long as
> they are few, but in my case, the problem was that I need to import
> nearly 40 VERY BIG CSV files, each one mapping a database table, and I
> needed to do it quickly. I thought that the best way was to use MySQL's
> "load data in local..." functionality since it works very fast and I
> could create only one function to import all the files. The problem was
> that my CSV files were pretty big and my database server were eating big
> amounts of memory and crashing my site so I ended up slicing each file
> in smaller chunks.
> Again, this is a very specific need, but in case you find yourself in
> such situation, here's my base code from which you can extend ;)
>
> https://gist.github.com/1dc28cd496d52ad67b29

Dear Anler, thank you for sharing your experience and your code. That's
very kind of you. I'll study it and come back to you with any questions.

Nathan McCorkle

Dec 2, 2011, 12:21:21 AM
to django...@googlegroups.com
Would interfacing with SQL via C or C++ be faster for parsing and loading
data in bulk? I have files that are only a few MB worth of text, but they
can take hours to load due to the amount of parsing I do and the
number of database entries each item in a file creates.

--
Nathan McCorkle
Rochester Institute of Technology
College of Science, Biotechnology/Bioinformatics

Cal Leeming [Simplicity Media Ltd]

Dec 2, 2011, 6:22:43 AM
to django...@googlegroups.com
Faster in what sense? Prototyping/development time, or run time?

If it's only a few MB, I see little reason to go as far as writing it in C, unless you are performing the same import tens of thousands of times and the Python overhead adds up so much that you get problems.

But, quite frankly, you'll max out MySQL INSERT performance before you max out Python's performance lol - as long as you don't use the ORM for inserts :)

Cal

Nathan McCorkle

Dec 3, 2011, 9:37:14 PM
to django...@googlegroups.com
On Fri, Dec 2, 2011 at 6:22 AM, Cal Leeming [Simplicity Media Ltd]
<cal.l...@simplicitymedialtd.co.uk> wrote:
> Faster in what sense? Prototyping/development time, or run time?

Well, I can count the lines in each file in a few seconds, so I think
the SQL stuff is slowing everything down (using Postgres through
psycopg2).

>
> If it's only a few MB, I see little reason to go as far as to writing it in
> C. Unless you are performing the same import tens of thousands of times, and
> the overhead in Python adds up so much that you get problems.
>
> But, quite frankly, you'll max out MySQL INSERT performance before you max
> out Pythons performance lol - as long as you don't use the ORM for inserts
> :)

When you say 'as long as you don't use the ORM for inserts', do you
mean don't do:

currentDataset.comment = "blah"
currentDataset.name = "abc12"
currentDataset.relatedObject = otherCurrentObject.id

etc., etc.?

Are you saying I should be doing all that in Python, but using raw SQL
instead of the fancy Python object-like way? Like this:
https://docs.djangoproject.com/en/dev/topics/db/sql/#executing-custom-sql-directly

Karen Tracey

Dec 3, 2011, 9:53:39 PM
to django...@googlegroups.com
On Sat, Dec 3, 2011 at 9:37 PM, Nathan McCorkle <nmz...@gmail.com> wrote:
> when you say 'as long as you don't use the ORM for inserts',

You do not want to be building and saving objects individually with the ORM. You want to be using some form of bulk insert. Django 1.4 will add bulk create capability (see https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create). Prior to 1.4 you can use the development trunk level code, or you can find other libraries that provide bulk insert (see for example http://ole-laursen.blogspot.com/2010/11/bulk-inserting-django-objects.html). http://www.caktusgroup.com/blog/2011/09/20/bulk-inserts-django/ mentions the kinds of speedups you can expect with using a bulk insert technique vs. creating objects individually with the ORM.
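
As a rough sketch (assuming the CSV rows are already parsed into a list called rows and the related SubCategory objects already exist in a name-to-instance dict called sub_cats -- both names are hypothetical):

# Build the unsaved instances in memory, then insert them in one query
# with bulk_create (Django 1.4+).
Product.objects.bulk_create([
    Product(name=product_name, sub_category=sub_cats[sub_name])
    for category_name, sub_name, product_name in rows
])

Note that bulk_create does not call each object's save() method and does not send the pre_save/post_save signals.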

Karen
--
http://tracey.org/kmt/

Thomas Weholt

Dec 4, 2011, 5:51:07 AM
to django...@googlegroups.com
I think it's been mentioned before in this thread, but DSE was made to
help solve problems like this. The other solutions I've seen for this
problem, like the one mentioned by Karen, lack several of DSE's
features, such as handling of default values defined in your model.

https://bitbucket.org/weholt/dse2

And yes, I'm the author of DSE ;-)

Regards,
Thomas


--
Mvh/Best regards,
Thomas Weholt
http://www.weholt.org

Tom Evans

Dec 5, 2011, 6:41:17 AM
to django...@googlegroups.com

OTOH, if (like the OP) you have a few small (<1k) CSV files to import,
then going through the ORM is easy, simple and quick to code. The fact
that you can get it to run 1 minute quicker by not using the ORM is
not relevant, unless importing CSV files is the main task of your
server, and it will be doing it 24x7.

Premature optimization is the root of all evil.

Cheers

Tom

Victor Hooi

Dec 5, 2011, 2:24:34 PM
to django...@googlegroups.com
heya,

Hmm, I was previously under the impression that for these sorts of things (importing and instantiating models from CSV), the recommended way was to create a ModelForm, pass each line of the CSV through it, and have it handle model validation for you.
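
Something along these lines, that is (just a sketch; ProductForm, sub_cat_id and the row-to-field mapping are made up for illustration):

import csv

errors = []
# Run every CSV row through a (hypothetical) ModelForm so the form does
# the validation; keep the errors so the bad rows can be reported while
# the good ones are still imported. "f" is the open CSV file and
# "sub_cat_id" stands in for however the sub-category pk is resolved.
for lineno, row in enumerate(csv.reader(f, delimiter=';'), start=1):
    form = ProductForm(data={'name': row[2], 'sub_category': sub_cat_id})
    if form.is_valid():
        form.save()
    else:
        errors.append((lineno, form.errors))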

In our case, we have a CSV file, but quite a few of its lines are probably invalid/erroneous - we need to provide useful messages for those and import the rest of the file.

Is the recommended way now to use bulk_create (https://docs.djangoproject.com/en/dev/ref/models/querysets/#bulk-create), process all the lines, and hit that in one call?

How would you handle invalid lines or validation errors for individual rows?

Cheers,
Victor