MemoryError when calling pd.read_csv

darek...@gmail.com

Jul 16, 2018, 9:22:14 PM
to modin-dev
When I read a 400 MB compressed file (4 GB uncompressed) using:

import modin.pandas as pd
df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

I get a MemoryError: 

Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 872, in _process_task
    function_name, args)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 801, in _get_arguments_for_execution
    argument = self.get_object([arg])[0]
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 462, in get_object
    final_results = self.retrieve_and_deserialize(plain_object_ids, 0)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 395, in retrieve_and_deserialize
    timeout, self.serialization_context)
  File "pyarrow/_plasma.pyx", line 433, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/serialization.pxi", line 441, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 404, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 257, in pyarrow.lib.SerializedPyObject.deserialize
  File "pyarrow/serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/serialization.py", line 101, in _deserialize_pandas_dataframe
    return pdcompat.serialized_dict_to_dataframe(data)
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/pandas_compat.py", line 457, in serialized_dict_to_dataframe
    for block in data['blocks']]
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/pandas_compat.py", line 457, in <listcomp>
    for block in data['blocks']]
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/pandas_compat.py", line 482, in _reconstruct_block
    block = _int.make_block(builtin_pickle.loads(block_arr),
MemoryError

It works fine on the same machine using:

import pandas as pd
df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

Thanks

pschaf...@berkeley.edu

Jul 16, 2018, 9:51:01 PM
to modin-dev
Hi,

I couldn't reproduce this issue on a smaller compressed CSV file. Is the file you're using available somewhere so I can take a closer look? Also, could you let me know what compression the file is using and how much RAM your system has?

Best,
Peter

darek...@gmail.com

Jul 17, 2018, 1:14:57 AM
to modin-dev
While looking for a multi-GB public data set, I downloaded https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-07.csv

and ran this simple script:

import modin.pandas as pd

file = '/home/user/yellow_tripdata_2017-07.csv.gz'

df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

print( f'Sample:\n {df.iloc[0:5,:]}')

and I end up with the error below, but it works fine in plain pandas. Thanks!

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:65021 to respond...
Waiting for redis server at 127.0.0.1:44721 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.

======================================================================
======================================================================

Len: 9710124
Traceback (most recent call last):
  File "bakeoffModin.py", line 15, in <module>
    print( f'Sam:\n{df.iloc[0:5,:]}')
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 230, in __str__
    return repr(self)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 325, in __repr__
    return repr(self._repr_helper_())
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 235, in _repr_helper_
    return to_pandas(self)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 227, in to_pandas
    pandas_df = pandas.concat(ray.get(df._row_partitions), copy=False)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 198, in _get_row_partitions
    empty_rows_mask = self._row_metadata._lengths > 0
TypeError: '>' not supported between instances of 'list' and 'int'

pschaf...@berkeley.edu

Jul 17, 2018, 3:54:52 AM
to modin-dev
I wasn't able to reproduce the MemoryError with 8 GB of RAM. However, it looks like there's a bug with slicing in iloc. Would you like to open an issue? If not, let me know and I can take it from here. We'll try to have this resolved by the end of the week.

Thanks for your help!

darek...@gmail.com

Jul 17, 2018, 2:27:16 PM
to modin-dev
I tried data sets like https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv. Smaller data sets work in both pandas and modin.pandas; larger data sets crash with a MemoryError using both.

darek...@gmail.com

Jul 17, 2018, 2:27:47 PM
to modin-dev
I will open an issue. Thanks!!!

darek...@gmail.com

Jul 17, 2018, 4:44:33 PM
to modin-dev
I was hoping to use modin.pandas to read large files that raise a MemoryError in plain pandas. Is there a way of handling those large files with Modin? Thanks

Devin Petersohn

Jul 17, 2018, 5:49:28 PM
to darek...@gmail.com, modin-dev
There is, though we don't yet expose this to the user.

We can expose something like this in the next release as a potential setting. Essentially, it would use the hard disk as backing memory. There would be a slight performance cost, and it would still be experimental. It doesn't always work, and we are working with the Ray and Arrow teams on a more consistent way to exceed available memory.

For now, what you can do is the following:

# Do this before you import modin
import ray

# object_store_memory is specified in bytes; 2**34 bytes = 16 GiB, roughly 4x your 4 GB uncompressed file
memory_needs = 2**34
ray.init(plasma_directory="/tmp", object_store_memory=memory_needs)

import modin.pandas as pd

df = pd.read_csv(...)

This doesn't always work, as sometimes plasma has trouble allocating that much space in /tmp. Let me know if this works for you; otherwise we can keep trying different Ray flags to get it working.
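
If /tmp itself is the problem, one rough and untested option (the candidate paths here are only examples) is to check free space first and point plasma at whichever directory has enough room:

import os
import shutil

memory_needs = 2**34  # object_store_memory is in bytes; 2**34 bytes = 16 GiB

# Candidate plasma directories (example paths only).
candidates = ["/tmp", os.path.expanduser("~")]
plasma_dir = next((d for d in candidates
                   if shutil.disk_usage(d).free > memory_needs), None)
if plasma_dir is None:
    raise RuntimeError("no candidate directory has enough free space for the object store")

import ray
ray.init(plasma_directory=plasma_dir, object_store_memory=memory_needs)

import modin.pandas as pd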

Also, feel free to open a feature request on the GitHub issues page!

Devin

pschaf...@berkeley.edu

Jul 17, 2018, 7:50:58 PM
to modin-dev
In addition, you might have better luck with uncompressed files. The current version falls back to the pandas read_csv implementation for compressed files, which causes significant memory overhead.
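
If keeping a temporary uncompressed copy on disk is acceptable, one rough, untested workaround (the paths below are just examples) is to decompress once with the standard library and point Modin at the plain CSV so the parallel reader is used:

import gzip
import shutil

src = '/home/user/large_file.csv.gz'
dst = '/home/user/large_file.csv'

# Stream-decompress to disk without loading the whole file into memory.
with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

import modin.pandas as pd
df = pd.read_csv(dst, quotechar='"', encoding='latin-1', index_col=False, sep='\t')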

I've created an issue, so we'll add fixes to improve memory use:

darek...@gmail.com

Jul 17, 2018, 9:08:21 PM
to modin-dev
Having large uncompressed files around is a waste of disk space, so it's not practical. Thanks for the suggestion.

Devin Petersohn

Jul 18, 2018, 4:55:32 PM
to darek...@gmail.com, modin-dev
Have you tried Parquet? Our Parquet reader is quite efficient, and Parquet also gets very good on-disk compression ratios.

The file reader for compressed files does not work in parallel yet. If the uncompressed file doesn't fit in memory, Parquet is probably the way to go (along with the additional Ray configuration earlier in the thread).
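
To get the data into Parquet in the first place, here is a rough, untested sketch of a one-time conversion outside Modin: plain pandas plus pyarrow, reading the .csv.gz in chunks so the whole file never has to fit in memory (the paths and chunk size are placeholders):

import pandas
import pyarrow as pa
import pyarrow.parquet as pq

src = '/home/user/large_file.csv.gz'
dst = '/home/user/large_file.parquet'

writer = None
for chunk in pandas.read_csv(src, quotechar='"', encoding='latin-1',
                             index_col=False, sep='\t', chunksize=500_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # The first chunk defines the Parquet schema; later chunks must match it.
        writer = pq.ParquetWriter(dst, table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()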


darek...@gmail.com

Jul 18, 2018, 5:19:17 PM
to modin-dev
That's what I am trying to do: read many large .csv.gz files and write them out as one Parquet file. But if I can't read the .csv, then I can't write the Parquet. :(
I currently use Spark, and it works fine, but Spark is very slow and complex to set up and run. I am looking for a way to do this in pure Python.
I have tried Dask, but it has the same problem with .gz files: it tries to load them entirely into memory.
Thanks!!

darek...@gmail.com

Jul 18, 2018, 5:53:48 PM
to modin-dev
This does NOT seem to work either; I get a lot of these errors:

Failed to get objectid ObjectID(9f03bd2003accb5d8bfe70dac580ce5b11f80e0d) as argument 0 for remote function modin.pandas.utils._build_coord_df. It was created by remote function modin.pandas.utils._build_col_widths which failed with:
Remote function modin.pandas.utils._build_col_widths failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 329, in _build_col_widths
    for d in df_col]))
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2768, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(787923d9e6e087d260077442b618fbfdfec6577c). It was created by remote function modin.pandas.utils._deploy_func which failed with:

Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(a0d6418eaf61c47379706ff69d0dbd66a3b91994) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined


Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(de4cd4caad80d84c154dee565eebb3f8bfcaa19a) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(65c7a0ac299a683ac94f38aae9e96f8dddda2dd8) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Devin Petersohn

Jul 18, 2018, 5:56:51 PM
to darek...@gmail.com, modin-dev
Interesting, this shouldn't be happening. Did you install from pip? What command caused this?

darek...@gmail.com

Jul 18, 2018, 6:54:22 PM
to modin-dev
I installed it through pip. I am running:

import ray

# This is ~4x your 4GB file
memory_needs = 2**34
ray.init(plasma_directory="/tmp", object_store_memory=memory_needs)

# Pandas on Ray import
import modin.pandas as pd

file = '/home/user/large_file.csv.zip'

df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

print( f'Len: {len(df)}')
print( f'Sam:\n{df.iloc[0:5,:]}')


Devin Petersohn

Jul 18, 2018, 7:03:56 PM
to darek...@gmail.com, modin-dev
This is a strange error; somehow the pip install must be truncating some of the files. I can't reproduce this for some reason, but I will keep trying to.

Could you try installing from pip this way: pip install -U git+git://github.com/modin-project/modin



darek...@gmail.com

Jul 19, 2018, 1:28:38 PM
to modin-dev
After installing from pip this way (pip install -U git+git://github.com/modin-project/modin), it died with this error:

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:62871 to respond...
Waiting for redis server at 127.0.0.1:37440 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.

======================================================================
======================================================================

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:18813 to respond...
Waiting for redis server at 127.0.0.1:52432 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.

======================================================================
======================================================================

/ray/src/local_scheduler/local_scheduler.cc:1186: process_message of type 1 took 8522 milliseconds.
/ray/src/local_scheduler/local_scheduler.cc:177: Killed worker pid 127743 which hadn't started yet.
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Suppressing duplicate error message.
Traceback (most recent call last):
  File "bakeoffModin.py", line 21, in <module>
    print( f'Len: {len(df)}')
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 5047, in __len__
/ray/src/local_scheduler/local_scheduler_algorithm.cc:713: Lost connection to the plasma manager, local scheduler is exiting. Error: IOError: Connection reset by peer
    return len(self._row_metadata)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/index_metadata.py", line 199, in __len__
    return int(sum(self._lengths))
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/index_metadata.py", line 66, in _get__lengths
    self._lengths_cache = ray.get(self._lengths_cache)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2771, in get
    value = worker.get_object([object_ids])[0]
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 490, in get_object
    i + ray._config.worker_fetch_request_size())])
  File "pyarrow/_plasma.pyx", line 553, in pyarrow._plasma.PlasmaClient.fetch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Broken pipe
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler[0x442535]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(_ZN3ray8internal7CerrLogD1Ev+0x8e)[0x44f17e]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(_Z28fetch_object_timeout_handlerP11aeEventLoopxPv+0x261)[0x453b41]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(aeProcessEvents+0x2b4)[0x4827a4]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(aeMain+0x2b)[0x4829bb]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(main+0x5a7)[0x442eb7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f15c7e36b97]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler[0x444e91]

Devin Petersohn

Jul 19, 2018, 4:29:55 PM
to darek...@gmail.com, modin-dev
This is an odd error. Does it fail this way when you run from an interpreter? Does the error also occur when you run from different directories? I still haven't been able to reproduce the error. Do you get the same issue with python -m pip as you do with pip?


Robert Nishihara

Jul 19, 2018, 5:55:43 PM
to Devin Petersohn, darek...@gmail.com, modin-dev
When you do

    import modin
    print(modin.__file__)

does that print the file path you're expecting? Sometimes this kind of error can happen if you have multiple versions present.
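
For what it's worth, a couple of extra, purely illustrative checks along the same lines:

    import sys
    import modin
    import ray

    print(modin.__file__)  # which copy of Modin is actually imported
    print(ray.__file__)    # which copy of Ray is actually imported
    print([p for p in sys.path if 'packages' in p])  # package directories on the path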

darek...@gmail.com

Jul 19, 2018, 8:10:59 PM
to modin-dev
It prints /usr/local/lib/python3.6/dist-packages/modin/__init__.py. Since the packages are installed globally, that makes sense.

darek...@gmail.com

Jul 19, 2018, 8:18:00 PM
to modin-dev
Using the Python 3.6.5 interpreter on an Ubuntu 18.04 box with 16 GB of RAM, I get:

Remote function modin.pandas.utils._build_row_lengths failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 338, in _build_row_lengths
    for d in df_row]))
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2768, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(0f84356ff12c6af967c11a4c0026429062c3e18b). It was created by remote function modin.pandas.utils._deploy_func which failed with:

Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(805e64cf42b95d099fc422e79f110392bb151841) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined


darek...@gmail.com

Jul 20, 2018, 1:22:09 PM
to modin-dev
Using:

####################################
import ray

# This is ~4x your 4GB file
memory_needs = 2**34
ray.init(plasma_directory="/tmp", object_store_memory=memory_needs)

####################################

on a smaller file does NOT work either; it throws:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Removing the import ray / ray.init lines, it works on a small file but not on the 4 GB file.

Devin Petersohn

Jul 25, 2018, 9:59:49 PM
to darek...@gmail.com, modin-dev
I was finally able to reproduce the error you had and have submitted a patch. https://github.com/modin-project/modin/pull/60

We are releasing a point release later this week, so it should be resolved for you then. Thanks for your patience!
