read_csv :: I/O or CPU bound

Andrew D Mann

Aug 6, 2019, 9:14:00 AM

to modi...@googlegroups.com

Hey -

Super impressed at the active development efforts in Modin.

I have a quick question: how does Modin speed up read_csv? I am interested in the actual mechanics of this, so a low-level explanation would be amazing. I am super interested in this and how it compares to Dask's lazy eval, which I have found to be very useful.

I had thought that reading a CSV was I/O bound, not CPU bound - is the idea that by utilising all cores of a machine we can increase the throughput of the I/O operation from disk?

Keep up the great work! Once some time opens up for me later in the year, I'm going to dive into the source, as I'm keen to contribute!

Andrew

Devin Petersohn

Aug 6, 2019, 1:24:32 PM

to Andrew D Mann, modin-dev

Hi Andrew, thanks for the questions!

> I am super interested in this and how it compares to Dask's lazy eval, which I have found to be very useful.

Here is a high-level overview of the differences with Dask Dataframe: https://github.com/modin-project/modin/issues/515. In summary, Dask Dataframe is meant to work alongside pandas (not to scale it as a drop-in replacement), and it scales a limited subset of the API. We do have ongoing work to bring Modin to Dask Futures: https://github.com/modin-project/modin/pull/732, so Dask itself can still use the things we have built in Modin.

> I had thought that reading a CSV was I/O bound, not CPU bound - is the idea that by utilising all cores of a machine we can increase the throughput of the I/O operation from disk?

This is a really common misconception. Reading a CSV into a tabular format is CPU bound because of the parsing. You can test this yourself (in IPython or Jupyter) by running `%time _ = open("file.csv", "r").read()` and comparing the time to `%time _ = pandas.read_csv("file.csv")`. Parallelism lets us spread the parsing of the file across processes, and each worker reads only a subset of the data.
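To make the "each worker only reads a subset" idea concrete, here is a minimal sketch using only the Python standard library. It is not Modin's actual implementation: the function names (`chunk_offsets`, `parse_chunk`, `parallel_read`) are hypothetical, the stdlib `csv` parser stands in for the pandas parser, and it assumes a UTF-8 file with no newlines inside quoted fields.

```python
import csv
import io
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor


def chunk_offsets(path, n_chunks):
    """Split a file into byte ranges that begin and end on line boundaries."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(size * i // n_chunks)
            f.readline()  # advance to the start of the next full line
            pos = f.tell()
            if offsets[-1] < pos < size:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets[:-1], offsets[1:]))


def parse_chunk(path, start, end):
    """Read only bytes [start, end) and parse them (the CPU-heavy step)."""
    with open(path, "rb") as f:
        f.seek(start)
        raw = f.read(end - start)
    text = io.StringIO(raw.decode("utf-8"), newline="")
    return list(csv.reader(text))


def parallel_read(path, n_workers=4):
    """Parse each byte range in its own process, then stitch rows in order."""
    ranges = chunk_offsets(path, n_workers)
    starts, ends = zip(*ranges)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(parse_chunk, [path] * len(ranges), starts, ends)
        return [row for part in parts for row in part]


if __name__ == "__main__":
    # Build a small throwaway CSV so the sketch runs end to end.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".csv", delete=False, newline=""
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(["a", "b"])
        writer.writerows([i, i * 2] for i in range(1000))
    try:
        rows = parallel_read(tmp.name)
        print(rows[0], len(rows) - 1)  # header, then number of data rows
    finally:
        os.remove(tmp.name)
```

The disk read inside each worker is cheap; the time goes into turning bytes into rows, which is why splitting that work across processes helps.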

> Once some time opens up for me later in the year, I'm going to dive into the source, as I'm keen to contribute!

Sounds great! You can reach out on the mailing list or on our Discourse page: https://discuss.modin.org. We'd love to have you! Currently, we are in the midst of a large internal refactor to improve performance and limit communication, and to start supporting some lazy evaluation.

Devin

Andrew D Mann

Aug 6, 2019, 1:45:28 PM

to Devin Petersohn, modin-dev

Hey Devin - thanks so much for the quick response - super interesting stuff! I will definitely be reaching out in the coming months to find out where an extra pair of hands is needed. With lazy eval incorporated into the Modin framework, this becomes very exciting.