Feature idea: Add support for PostgreSQL's COPY command in contrib.postgres

Ben Welsh

Jul 18, 2015, 6:19:49 PM7/18/15
to django-d...@googlegroups.com
Hello,

I am a big fan of PostgreSQL's excellent bulk loader COPY. It can rapidly load a CSV file directly into the database, a tedious task I have to do frequently as part of my job.

I am not a big fan of having to write my COPY commands in raw SQL. I'd much rather use Django's ORM. 
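Just to illustrate the kind of thing I mean, here is a minimal sketch of the raw-SQL route using psycopg2's copy_expert through Django's database connection (the myapp_person table, its columns and the CSV path are all made up for the example):

from django.db import connection

# Stream a local CSV into the table over the existing Django connection.
# Table name, column list and file path are placeholders.
with open("/path/to/import.csv") as csv_file:
    with connection.cursor() as cursor:
        cursor.copy_expert(
            'COPY myapp_person (name, number) FROM STDIN WITH CSV HEADER',
            csv_file,
        )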

So last week I put together an app called django-postgres-copy that attempts to integrate COPY into Django, modeling its design on the excellent LayerMapping utility in contrib.gis, which I also use frequently.
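The basic pattern looks something like this (a sketch based on the docs; the model, CSV path and column names are placeholders, so treat the exact names as illustrative):

from myapp.models import Person  # placeholder model
from postgres_copy import CopyMapping

# Map model field names to the CSV's header names, then run the COPY.
c = CopyMapping(
    Person,
    "/path/to/import.csv",
    dict(name="NAME", number="NUMBER", dt="DATE"),
)
c.save()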

I wrote a blog post about the approach here.

You can find more complete technical documentation here.

And all of the code is up here on GitHub.

Since Django has already begun to integrate other PostgreSQL-specific features in contrib.postgres, I'm curious whether the core developers would be interested in adding COPY support as well.

I'm not attached to the style of my library and I'd welcome a different approach if it got the job done. I'd be thrilled to have the opportunity to carry the torch and do whatever refactoring and additional coding is necessary to qualify it for a merge.

Please let me know what you think. And if I've overlooked some previous discussion or superior third-party library in this area, please forgive me. I searched around but was unable to find anything.

Sincerely,

Ben Welsh

thinkwel...@gmail.com

Jul 23, 2015, 8:05:18 AM7/23/15
to Django developers (Contributions to Django itself), ben....@gmail.com
That looks interesting - I might be able to use your module for a project I'll be working on soon. I have two questions about transforming data from the temp columns to the model columns.

1. Can you create datetime stamps from the text data?
2. Can you skip rows where a column == "unwanted data"?

Pardon the simple questions - I'm not too much of a Django guru. Looks like an interesting project!

Ben Welsh

Jul 27, 2015, 6:14:58 PM7/27/15
to Django developers (Contributions to Django itself), thinkwel...@gmail.com
Glad you're interested!

1. Yes, you can. You can use the field or model-method transformations to provide any SQL you'd like; PostgreSQL's to_timestamp function should do what you want (see the sketch after these answers).

2. There isn't a trick to exclude rows based on a test. Part of what makes this method fast is that it doesn't evaluate every row in the CSV. I suspect there could be some way to achieve it by fiddling with the bulk insert from the temporary table to the model table. You could subclass that portion and do it yourself now, and I'm open to ideas about how that part of the code could be surfaced to make configuring it less work.
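For question 1, here is a rough sketch of a per-field transformation. The hook name and the %(name)s placeholder follow the docs, but treat the details as illustrative; the CSV value arrives as text in the temporary table, and the SQL you return is applied when rows are inserted into the model table:

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=500)
    dt = models.DateField(null=True)

    # Illustrative transformation hook: return a SQL snippet applied to the
    # raw text column when rows move from the temporary table to this table.
    # "%(name)s" stands in for the temporary column being copied.
    def copy_dt_template(self):
        return """
            CASE
                WHEN "%(name)s" = '' THEN NULL
                ELSE to_date("%(name)s", 'MM/DD/YYYY')
            END
        """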

Aaron Williams

Jul 28, 2015, 7:59:11 PM7/28/15
to Django developers (Contributions to Django itself), ben....@gmail.com
+1 for this feature addition.

I work with a lot of public data, and I almost always go through the steps of loading it into PostgreSQL and building from there. COPY reduces data load time significantly, so a core load command for Django would be welcome.

I've used LOAD DATA INFILE for MySQL on occasion too, but I'd rather stick to Django's ORM (and PostgreSQL) instead of writing raw SQL, as Ben mentioned.

Another neat idea is to borrow from GeoDjango's ogrinspect and do the same thing for CSVs (I wrote a simple implementation here). Together, these two tools would be a huge leap in productivity for the data journalism / open data communities.
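To give a flavor of the idea, a purely hypothetical csvinspect could read a CSV's header row and print a starter model (this is not Ben's library or my linked implementation, just an illustration; a real tool would sample rows and guess better field types, the way ogrinspect does):

import csv

def csvinspect(csv_path, model_name):
    # Guess a rough Django model from the CSV's header row.
    # Every column becomes a CharField in this naive sketch.
    with open(csv_path) as f:
        headers = next(csv.reader(f))
    lines = ["class %s(models.Model):" % model_name]
    for header in headers:
        field = header.strip().lower().replace(" ", "_")
        lines.append("    %s = models.CharField(max_length=500)" % field)
    return "\n".join(lines)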

I should add that I just used django-postgres-copy for a project I'm working on and it worked like a charm.

William Chambers

Aug 8, 2015, 4:13:36 PM8/8/15
to Django developers (Contributions to Django itself)
+1 for this integration. This is a great feature that I've been using on projects as well. I'd love to be able to do it within Django rather than hacking my own SQL scripts around it. The efficiency/speed-up is very real.

It also solves a replication problem: when people need the same data across machines (because of time/space/security/cost issues), sending a CSV via Box/Dropbox and then loading it with PostgreSQL's COPY is a great way to make sure everyone has the information they need in a form they can query efficiently and build on using Django!

