Massive import in Django database


John Carlo

Jun 11, 2014, 9:14:02 AM
to django...@googlegroups.com
Hello everybody,

I fell in love with Django two years ago and I've been using it for my work projects. In the past I have found very useful information in this group, so a big thank you, guys!

I have a question.
I need to import about 1,000,000 XML documents into a Django database (SQLite for local development, MySQL on the server).

The model class is the following:

from django.db import models

class Doc(models.Model):
    doc_code = models.CharField(max_length=20, unique=True, primary_key=True, db_index=True)
    doc_text = models.TextField(null=True, blank=True)
    related_doc = models.ManyToManyField('self', null=True, blank=True, db_index=True)

From what I know, bulk insertion is not possible because I have a ManyToManyField relation.

So I have this simple loop (in pseudo code):

for each xml:
    extract from the xml data -> mydoc_code, mydoc_text, myRelated_doc_codes

    myDoc = Doc.object.get_or_create(doc_code=mydoc_code)[0]
    myDoc.doc_text = mydoc_text

    for reldoc_code in myRelated_doc_codes:
        myRelDoc = Doc.object.get_or_create(doc_code=reldoc_code)[0]
        myDoc.related_doc.add(myRelDoc)

    myDoc.save()


Am I doing it right? Do you have any suggestions or recommendations? Since I have 1,000,000 docs to import, I fear it will take a lot of time, especially in the get_or_create calls.

thank you in advance everybody!

John




             

moqia...@gmail.com

Jun 11, 2014, 11:48:17 AM
to django-users
Hi, John:
I think your code is right, except that "Doc.object" should be "Doc.objects".
 
The following pseudo code may be faster than what you wrote:
 
doc_map = {}
for each xml:
    extract from the xml data -> mydoc_code, mydoc_text, myRelated_doc_codes
    doc = Doc.objects.create(doc_code=mydoc_code, doc_text=mydoc_text)
    doc_map[mydoc_code] = (doc, myRelated_doc_codes)
for (doc, rcodes) in doc_map.values():
    for rcode in rcodes:
        # doc_map values are (doc, related_codes) tuples, so take the doc
        doc.related_doc.add(doc_map[rcode][0])
    doc.save()
 
I have checked it, and it's okay.
The objects are cached in doc_map, so there is no need to query the database again for each related doc, which should speed things up.
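
A minimal sketch of the caching idea on its own, in case it helps. The helper name cached_lookup is made up for illustration, and fetch stands for whatever callable returns the object for a code (for example a wrapper around Doc.objects.get_or_create). It assumes all the cached objects fit in memory.

```python
def cached_lookup(fetch):
    """Wrap fetch(code) so that each distinct code is fetched at most once.

    `fetch` is a hypothetical callable, e.g.
    lambda code: Doc.objects.get_or_create(doc_code=code)[0]
    """
    cache = {}  # doc_code -> object returned by fetch

    def get(code):
        if code not in cache:
            cache[code] = fetch(code)  # only the first request reaches fetch
        return cache[code]

    return get
```

With this, a document that is referenced by many others costs one database round trip instead of one per reference.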
 
With Regards.
 


moqia...@gmail.com

Jun 11, 2014, 8:12:00 PM
to django-users
Hi, John:
Sorry! The pseudo code I wrote is not correct, and it's slow. I will come back tonight.
 
With Regards,
Qiancong,Mo

Erik Cederstrand

Jun 12, 2014, 2:06:54 AM
to Django Users
On 11/06/2014 at 15:14, John Carlo <johncar...@gmail.com> wrote:

> Hello everybody,
>
> I fell in love with Django two years ago and I've been using it for my work projects. In the past I have found very useful information in this group, so a big thank you, guys!
>
> I have a question.
> I need to import about 1,000,000 XML documents into a Django database (SQLite for local development, MySQL on the server).
>
> The model class is the following:
>
> class Doc(models.Model):
>     doc_code = models.CharField(max_length=20, unique=True, primary_key=True, db_index=True)
>     doc_text = models.TextField(null=True, blank=True)
>     related_doc = models.ManyToManyField('self', null=True, blank=True, db_index=True)
>
> From what I know, bulk insertion is not possible because I have a ManyToManyField relation.

Actually, you *can* bulk insert. You just have to extract the m2m relation into an intermediate model (https://docs.djangoproject.com/en/dev/topics/db/models/#extra-fields-on-many-to-many-relationships). Bulk insert Doc instances first, then the related_doc relations. But if it's a one-time import job, then just start it Friday afternoon and skip the extra complexity.
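
A rough sketch of that two-phase approach, not a definitive implementation. The names here are assumptions: bulk_import and chunked are made-up helpers, parsed_docs is an iterable of (doc_code, doc_text, related_codes) tuples produced by your XML parsing, and relation_model is a hypothetical intermediate model with two foreign keys to Doc called from_doc and to_doc. It also assumes the parsed data fits in memory and that a symmetrical relation should be stored as one row per direction.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_import(parsed_docs, doc_model, relation_model, batch_size=1000):
    """Insert all Doc rows first, then all relation rows.

    doc_model is the Doc model class. relation_model is the assumed
    intermediate model with `from_doc` and `to_doc` foreign keys to Doc.
    Returns (number of docs, number of relation rows).
    """
    texts = {}     # doc_code -> doc_text (None for docs that are only referenced)
    pairs = set()  # (from_code, to_code), de-duplicated
    for code, text, related_codes in parsed_docs:
        texts[code] = text
        for rel in related_codes:
            texts.setdefault(rel, None)  # stub row for a doc not parsed (yet)
            pairs.add((code, rel))       # one row per direction
            pairs.add((rel, code))

    # Phase 1: create every Doc without touching the relation table.
    docs = (doc_model(doc_code=c, doc_text=t) for c, t in texts.items())
    for batch in chunked(docs, batch_size):
        doc_model.objects.bulk_create(batch)

    # Phase 2: create relation rows by foreign-key value, with no Doc lookups.
    rows = (relation_model(from_doc_id=a, to_doc_id=b) for a, b in sorted(pairs))
    for batch in chunked(rows, batch_size):
        relation_model.objects.bulk_create(batch)

    return len(texts), len(pairs)
```

Because doc_code is the primary key, the relation rows can be built directly from the codes, so no get_or_create calls are needed. Running the whole thing inside a single transaction would likely help further, but that is left out here.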

Erik