Best practices for large files, e.g. 80 million lines through awk into a DB

P Jakobsen

Mar 11, 2016, 5:53:24 PM
to drake-workflow
I have directories with hundreds of CSV files, about 3 GB of data. I need to import this into a database, but without writing a massive file as an intermediary step. Can you recommend a good way to do this, e.g. an in-memory process where you can monitor the progress of the work in real time? I'm not fond of looking at my current Drake step, which just stalls as awk does its work on 80 million lines of text data.

 I love the concept of Drake, really nice effort, but the documentation is a bit of a slog, so excuse me for taking the lazy way out here. ;)

Thanks, 

Peder J. 
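
[Editor's note: the complaint above is really about a long-running step giving no feedback while it runs. One generic way to get that feedback, independent of Drake or awk, is to put a small pass-through filter in the pipeline that counts lines and reports to stderr. The sketch below is a hypothetical Python helper, not something from this thread; the file name, the reporting interval, and the transform.awk stand-in are made up for illustration.]

```python
#!/usr/bin/env python3
# progress_tee.py (hypothetical name): pass stdin through to stdout unchanged,
# printing a running line count to stderr so a long pipeline isn't silent.
#
# Example use (transform.awk is a stand-in for whatever the real step does):
#   cat data/*.csv | python3 progress_tee.py | awk -f transform.awk > out.csv
import sys

REPORT_EVERY = 1000000  # report once per million input lines

count = 0
for line in sys.stdin:
    sys.stdout.write(line)   # forward the line untouched
    count += 1
    if count % REPORT_EVERY == 0:
        sys.stderr.write("processed %d lines\n" % count)
sys.stderr.write("done: %d lines total\n" % count)
```

Because the progress goes to stderr, it never contaminates whatever the downstream tool reads on stdout.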

Aaron Crow

Mar 17, 2016, 9:08:17 AM
to P Jakobsen, drake-workflow
Hi Peder, thanks for your kind words about Drake.

Without knowing more details, your task sounds like a more general data import challenge than anything else; not really something Drake was built for specifically. You might be best off creating a custom standalone script that handles the import for you, while outputting whatever progress info you want. 
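
[Editor's note: as a rough illustration of the kind of standalone script suggested above, here is a minimal Python sketch that streams rows from a directory of CSV files straight into a database in batches, printing progress as it goes, with no combined intermediate file. It assumes SQLite, a placeholder three-column table, and identical column layouts across files; none of those details come from the thread, so treat it as a pattern rather than a drop-in solution.]

```python
#!/usr/bin/env python3
# import_csvs.py (hypothetical): stream many CSV files into a database in
# batches, reporting progress on stderr as rows are inserted.
#
# Usage sketch: python3 import_csvs.py /path/to/csv_dir output.db
import csv
import glob
import os
import sqlite3
import sys

BATCH_SIZE = 10000  # rows per executemany() call; tune for your setup

def iter_rows(csv_dir):
    """Yield rows from every *.csv file under csv_dir, one at a time."""
    for path in sorted(glob.glob(os.path.join(csv_dir, "*.csv"))):
        with open(path, newline="") as fh:
            for row in csv.reader(fh):
                yield row

def main(csv_dir, db_path):
    conn = sqlite3.connect(db_path)
    # Placeholder schema: three text columns, purely for illustration.
    conn.execute("CREATE TABLE IF NOT EXISTS records (a TEXT, b TEXT, c TEXT)")
    batch, total = [], 0
    for row in iter_rows(csv_dir):
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", batch)
            conn.commit()
            total += len(batch)
            batch = []
            sys.stderr.write("inserted %d rows\r" % total)
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", batch)
        conn.commit()
        total += len(batch)
    conn.close()
    sys.stderr.write("\ndone: %d rows\n" % total)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Batching with executemany keeps memory use flat no matter how many files there are, and writing progress to stderr keeps the status output out of any downstream pipe.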


Peder Jakobsen | gmail

Mar 18, 2016, 1:12:15 PM
to Aaron Crow, drake-workflow
Hi Aaron, OK, good to know that Drake isn't designed for this use case. I ended up writing an awk script that does the trick.

Peder :)