bulk add m2m relationship for multiple instances


yakka...@gmail.com

Aug 11, 2015, 10:47:39 PM
to Django users
My app is getting killed by adding m2m relationships. I can bulk-create the objects, but for the m2m relations I have to iterate over them one at a time, which is super slow. I'm looking for a way to bulk-add the m2m relationships.

Here is an example:

    class Sensor(models.Model):
        Name  = models.CharField(max_length=200)
        Value = models.FloatField()

    class DataPoint(BaseModel):
        Taken_datetime = models.DateTimeField(blank=True, null=True)
        # null=True has no effect on a ManyToManyField, so it is omitted here
        Sensors        = models.ManyToManyField(Sensor, blank=True)

    for row in rows:
        dp = DataPoint.objects.get(Taken_datetime=row['date'])

        sensorToAdd = []
        for sensor in sensors:
            s = Sensor.objects.get(Name=sensor.name, Value=sensor.value)
            sensorToAdd.append(s)

        # add() takes individual objects, hence the argument unpacking
        dp.Sensors.add(*sensorToAdd)

In the actual app I bulk-create all the DataPoint and Sensor instances. The problem is that dp.Sensors.add(*sensorToAdd) hits the database a lot. I want a way to bulk-add all the sensors.

Can I bulk-add the m2m relationships in Django? If not, can this be done in SQL, and can you give me some pointers about where to get started? I'm a beginner with SQL, but it seems like there should be a way to group all these calls together instead of creating one relation at a time.


I'm using SQLite for development and Postgres on the server.

Here is the Stack Overflow question I posted a few days ago: http://stackoverflow.com/questions/31907116/bulk-add-m2m-relationship-for-multiple-instances

Erik Cederstrand

Aug 12, 2015, 3:04:00 AM
to Django Users
> On 12/08/2015 at 04.47, yakka...@gmail.com wrote:
>
> for row in rows:
>     dp = DataPoint.objects.get(Taken_datetime=row['date'])
>
>     sensorToAdd = []
>     for sensor in sensors:
>         s = Sensor.objects.get(Name=sensor.name, Value=sensor.value)
>         sensorToAdd.append(s)
>
>     dp.Sensors.add(*sensorToAdd)
>
> In the actual app I bulk-create all the DataPoint and Sensor instances. The problem is that dp.Sensors.add(*sensorToAdd) hits the database a lot. I want a way to bulk-add all the sensors.

Try fetching all the data you need from the database up-front and place the objects in dicts. Depending on the volume of your data, you may want to split this into reasonable batches:

    from django.db.models import Q

    datapoints = DataPoint.objects.filter(Taken_datetime__in={r['date'] for r in rows})
    datapoints_map = {d.Taken_datetime: d for d in datapoints}

    # To generate efficient SQL, make sure the (name, value) pairs are unique
    unique_sensor_values = {(s.name, s.value) for s in my_list_of_sensors}
    sensors_q = Q()
    for name, value in unique_sensor_values:
        sensors_q |= Q(Name=name, Value=value)
    sensors = Sensor.objects.filter(sensors_q)
    sensors_map = {(s.Name, s.Value): s for s in sensors}

This reduces your queries to only two. You can then bulk-insert m2m relations per-object like this:

    for some_date, some_sensors in my_data:
        dp = datapoints_map[some_date]
        dp.Sensors.add(*[sensors_map[(s.name, s.value)] for s in some_sensors])


If you need better bulk-insert performance than this, you can convert the m2m relation between Sensor and DataPoint to an explicit m2m model. You can then bulk-insert all m2m relations in one go instead of per-object.

You should probably add indexes on DataPoint.Taken_datetime and on Sensor (Name, Value) to increase query performance.
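
For reference, a minimal sketch of those indexes in the Django 1.8 syntax used elsewhere in this thread, assuming the model definitions from the original post:

    class Sensor(models.Model):
        Name  = models.CharField(max_length=200)
        Value = models.FloatField()

        class Meta:
            # composite index covering the (Name, Value) lookups above
            index_together = [('Name', 'Value')]

    class DataPoint(BaseModel):
        # db_index speeds up the Taken_datetime__in filter
        Taken_datetime = models.DateTimeField(blank=True, null=True, db_index=True)
        Sensors        = models.ManyToManyField(Sensor, blank=True)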

Erik

yakka...@gmail.com

Aug 12, 2015, 2:00:08 PM
to Django users
Hi Erik,

In the actual code I create and preload all the DataPoints and Sensors outside the loop. I found a dict was too slow for DataPoints. I ended up sorting the DataPoints query by date and using the fact that they were in the same order as the CSV to speed things up.

Looping through the datapoints and calling dp.Sensors.add each time is too slow.

Can you explain what you mean by “If you need better bulk-insert performance than this, you can convert the m2m relation between Sensor and DataPoint to an explicit m2m model. You can then bulk-insert all m2m relations in one go instead of per-object”?   I’m not sure how to implement this. 

Here is what the data will look like:

    time  sensorA  sensorB
    -----------------------
    1     4        5
    2     6        4
    3     9        2
    4     3        7

The problem is that I'm calling dp.Sensors.add a lot. It seems like there should be a more efficient way to add the relations instead of looping through each datapoint and calling .add. I'd like to do something like bulk_create.

Brian

Erik Cederstrand

Aug 12, 2015, 4:26:46 PM
to Django Users

> On 12/08/2015 at 20.00, yakka...@gmail.com wrote:
>
> In the actual code I create and preload all the DataPoints and Sensors outside the loop. I found a dict was too slow for DataPoints.

That's suspicious. Compared to loading data from the database, Python dicts are not slow, for any reasonable value of slow.

> Can you explain what you mean by “If you need better bulk-insert performance than this, you can convert the m2m relation between Sensor and DataPoint to an explicit m2m model. You can then bulk-insert all m2m relations in one go instead of per-object”? I’m not sure how to implement this.

Create an intermediate model for the m2m relation as described in https://docs.djangoproject.com/en/1.8/topics/db/models/#extra-fields-on-many-to-many-relationships

You can then call bulk_create() on this model instead of using add(). Something like:


    class Datapoint(models.Model):
        sensors = models.ManyToManyField(Sensor, through='DatapointSensorRel')


    class DatapointSensorRel(models.Model):
        datapoint = models.ForeignKey(Datapoint)
        sensor = models.ForeignKey(Sensor)

Used like this:

    relations = [DatapointSensorRel(datapoint=d, sensor=s) for d, s in my_collected_relations]
    DatapointSensorRel.objects.bulk_create(relations)
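
If the relation list is very large, bulk_create() also accepts a batch_size argument in this Django era, which splits the work into manageable INSERT statements:

    DatapointSensorRel.objects.bulk_create(relations, batch_size=1000)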


Erik

yakka...@gmail.com

Aug 12, 2015, 10:09:04 PM
to Django users

I'll run a test with the dict vs list+position counter. I know I saw a speed improvement but I can't remember if that was the only thing I changed.


I'd have to change a lot of code if I change the DB schema, so I don't want to create an intermediate table. I'm going to go down the SQL path. Let me know if you have any suggestions; I'm still a beginner with SQL.


Thanks for your help!!!


Brian



Derek

Aug 13, 2015, 2:47:38 AM
to Django users

* Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple.

This may or may not apply to your use case. There is also more discussion in this thread:
http://bytes.com/topic/python/answers/587277-how-much-slower-dict-indexing-vs-list-indexing
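
To make the difference concrete, a quick timeit comparison (a sketch; the numbers will vary by machine) shows the gap between list and set membership tests:

    import timeit

    setup = "data_list = list(range(100000)); data_set = set(data_list)"
    # list membership scans elements one by one: O(n)
    print(timeit.timeit("99999 in data_list", setup=setup, number=1000))
    # set membership is a single hash lookup: O(1)
    print(timeit.timeit("99999 in data_set", setup=setup, number=1000))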

I have also found in some of my own cases (not involving m2m models though) that dropping down to raw SQL for bulk uploads is fast enough to justify doing it; but I am trading off against all the model checks/balances that Django provides.
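
A minimal sketch of that raw-SQL route, assuming the auto-created m2m table from the models above (the table and column names here are hypothetical; Django normally names the table <app>_<model>_<field>):

    from django.db import connection

    # hypothetical list of (DataPoint pk, Sensor pk) pairs collected earlier
    pairs = [(dp.pk, s.pk) for dp, s in my_collected_relations]
    with connection.cursor() as cursor:
        cursor.executemany(
            "INSERT INTO myapp_datapoint_sensors (datapoint_id, sensor_id) "
            "VALUES (%s, %s)",
            pairs,
        )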

Erik Cederstrand

Aug 13, 2015, 4:55:37 AM
to Django Users

> On 13/08/2015 at 04.09, yakka...@gmail.com wrote:
>
> I'll run a test with the dict vs list+position counter. I know I saw a speed improvement but I can't remember if that was the only thing I changed.
>
> I'd have to change a lot of code if I change the DB schema, so I don't want to create an intermediate table. I'm going to go down the SQL path.

The intermediate model doesn't change the DB schema in your case. A models.ManyToManyField already implicitly creates a table in the DB to hold the m2m relation. The intermediate model just makes this explicit.

The only thing this changes in your code is that you can't do "my_datapoint.sensors.add(my_sensor)" anymore. You need to always create (and delete) a DatapointSensorRel explicitly.
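
A sketch of the shortcut this implies: even without declaring the intermediate model, Django exposes the auto-created through model on the m2m descriptor, and bulk_create() works on it. The datapoint_id/sensor_id attribute names below follow Django's auto-generated naming for such tables:

    # the auto-created intermediate model behind the m2m field
    Through = DataPoint.Sensors.through

    relations = [
        Through(datapoint_id=dp.pk, sensor_id=s.pk)
        for dp, s in my_collected_relations  # hypothetical (DataPoint, Sensor) pairs
    ]
    Through.objects.bulk_create(relations)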

Erik

yakka...@gmail.com

Sep 8, 2015, 1:17:18 PM
to Django users
I reviewed my code, and the slow speed I was talking about came from an `if a in myList` test. I switched to a dict, and I haven't noticed a performance hit from keeping a pointer into a list.

I ended up switching to Postgres COPY for the importing.  It's a lot faster. 
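
For anyone landing here later, a minimal sketch of that COPY approach via psycopg2, with hypothetical connection details and table/column names:

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical credentials
    # COPY expects a file-like object; build a tab-separated buffer of id pairs
    buf = io.StringIO(u"".join(u"%d\t%d\n" % (d, s) for d, s in id_pairs))
    with conn.cursor() as cur:
        cur.copy_from(buf, "myapp_datapoint_sensors",
                      columns=("datapoint_id", "sensor_id"))
    conn.commit()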

Brian

 

