MemoryError when calling pd.read_csv

darek...@gmail.com

Jul 16, 2018, 9:22:14 PM
to modin-dev
When I read a 400 MB compressed file (4 GB uncompressed) using:

import modin.pandas as pd
df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

I get a MemoryError: 

Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 872, in _process_task
    function_name, args)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 801, in _get_arguments_for_execution
    argument = self.get_object([arg])[0]
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 462, in get_object
    final_results = self.retrieve_and_deserialize(plain_object_ids, 0)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 395, in retrieve_and_deserialize
    timeout, self.serialization_context)
  File "pyarrow/_plasma.pyx", line 433, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/serialization.pxi", line 441, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 404, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 257, in pyarrow.lib.SerializedPyObject.deserialize
  File "pyarrow/serialization.pxi", line 174, in pyarrow.lib.SerializationContext._deserialize_callback
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/serialization.py", line 101, in _deserialize_pandas_dataframe
    return pdcompat.serialized_dict_to_dataframe(data)
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/pandas_compat.py", line 457, in serialized_dict_to_dataframe
    for block in data['blocks']]
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/pandas_compat.py", line 457, in <listcomp>
    for block in data['blocks']]
  File "/usr/local/lib/python3.6/dist-packages/ray/pyarrow_files/pyarrow/pandas_compat.py", line 482, in _reconstruct_block
    block = _int.make_block(builtin_pickle.loads(block_arr),
MemoryError

It works fine on the same machine using:

import pandas as pd
df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

Thanks

pschaf...@berkeley.edu

Jul 16, 2018, 9:51:01 PM
to modin-dev
Hi,

I couldn't reproduce this issue on a smaller compressed CSV file. Is the file you're using available somewhere so I can take a closer look? Also, could you let me know what compression the file is using and how much RAM your system has?

Best,
Peter

darek...@gmail.com

Jul 17, 2018, 1:14:57 AM
to modin-dev
While looking for a multi-GB public data set, I downloaded https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-07.csv

and ran this simple script:

import modin.pandas as pd

file = '/home/user/yellow_tripdata_2017-07.csv.gz'

df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

print( f'Sample:\n {df.iloc[0:5,:]}')

and I end up with the error below, but it works fine in plain pandas. Thanks!

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:65021 to respond...
Waiting for redis server at 127.0.0.1:44721 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.

======================================================================
======================================================================

Len: 9710124
Traceback (most recent call last):
  File "bakeoffModin.py", line 15, in <module>
    print( f'Sam:\n{df.iloc[0:5,:]}')
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 230, in __str__
    return repr(self)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 325, in __repr__
    return repr(self._repr_helper_())
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 235, in _repr_helper_
    return to_pandas(self)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 227, in to_pandas
    pandas_df = pandas.concat(ray.get(df._row_partitions), copy=False)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 198, in _get_row_partitions
    empty_rows_mask = self._row_metadata._lengths > 0
TypeError: '>' not supported between instances of 'list' and 'int'

pschaf...@berkeley.edu

Jul 17, 2018, 3:54:52 AM
to modin-dev
I wasn't able to reproduce the MemoryError with 8 GB of RAM. However, it looks like there's a bug with slicing in iloc. Would you like to open an issue? If not, let me know and I can take it from here. We'll try to have this resolved by the end of the week.

Thanks for your help!

darek...@gmail.com

Jul 17, 2018, 2:27:16 PM
to modin-dev
I tried data sets like https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv. Smaller data sets work in both pandas and modin.pandas; larger data sets crash with a MemoryError using both.

darek...@gmail.com

Jul 17, 2018, 2:27:47 PM
to modin-dev
I will open an issue. Thanks!!!

darek...@gmail.com

Jul 17, 2018, 4:44:33 PM
to modin-dev
I was hoping to use modin.pandas to read large files that raise a MemoryError in plain pandas. Is there a way of handling those large files with Modin? Thanks

Devin Petersohn

Jul 17, 2018, 5:49:28 PM
to darek...@gmail.com, modin-dev
There is, though we don't yet expose this to the user.

We can expose something like this in the next release as a potential setting. Essentially, it would use the hard disk as backing memory. There would be a slight performance cost, and it would still be experimental. It doesn't always work, and we are working with the Ray and Arrow teams on a more consistent way to exceed available memory.

For now, what you can do is the following:

# Do this before you import modin
import ray

# object_store_memory is specified in bytes; 2**34 bytes = 16 GiB, roughly 4x your 4 GB uncompressed file
memory_needs = 2**34
ray.init(plasma_directory="/tmp", object_store_memory=memory_needs)

import modin.pandas as pd

df = pd.read_csv(...)

This doesn't always work, as sometimes plasma has trouble allocating that much space in /tmp. Let me know if this works for you; otherwise we can keep trying different Ray flags to get it working.
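
If /tmp itself is the problem, one rough and untested option (the candidate paths here are only examples) is to check free space first and point plasma at whichever directory has enough room:

import os
import shutil

memory_needs = 2**34  # object_store_memory is in bytes; 2**34 bytes = 16 GiB

# Candidate plasma directories (example paths only).
candidates = ["/tmp", os.path.expanduser("~")]
plasma_dir = next((d for d in candidates
                   if shutil.disk_usage(d).free > memory_needs), None)
if plasma_dir is None:
    raise RuntimeError("no candidate directory has enough free space for the object store")

import ray
ray.init(plasma_directory=plasma_dir, object_store_memory=memory_needs)

import modin.pandas as pd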

Also, feel free to open a feature request on the GitHub issues page!

Devin

pschaf...@berkeley.edu

Jul 17, 2018, 7:50:58 PM
to modin-dev
In addition, you might have better luck with uncompressed files. The current version falls back to the pandas read_csv implementation for compressed files, which causes significant memory overhead.
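
If keeping a temporary uncompressed copy on disk is acceptable, one rough, untested workaround (the paths below are just examples) is to decompress once with the standard library and point Modin at the plain CSV so the parallel reader is used:

import gzip
import shutil

src = '/home/user/large_file.csv.gz'
dst = '/home/user/large_file.csv'

# Stream-decompress to disk without loading the whole file into memory.
with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

import modin.pandas as pd
df = pd.read_csv(dst, quotechar='"', encoding='latin-1', index_col=False, sep='\t')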

I've created an issue, so we'll add fixes to improve memory use:

darek...@gmail.com

Jul 17, 2018, 9:08:21 PM
to modin-dev
Having large uncompressed files around is a waste of disk space, so it's not practical. Thanks for the suggestion.

Devin Petersohn

Jul 18, 2018, 4:55:32 PM
to darek...@gmail.com, modin-dev
Have you tried Parquet? Our Parquet reader is quite efficient, and Parquet also gets very good on-disk compression ratios.

The file reader for compressed files does not work in parallel yet. If the uncompressed file doesn't fit in memory, Parquet is probably the way to go (along with the additional Ray configuration earlier in the thread).
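
To get the data into Parquet in the first place, here is a rough, untested sketch of a one-time conversion outside Modin: plain pandas plus pyarrow, reading the .csv.gz in chunks so the whole file never has to fit in memory (the paths and chunk size are placeholders):

import pandas
import pyarrow as pa
import pyarrow.parquet as pq

src = '/home/user/large_file.csv.gz'
dst = '/home/user/large_file.parquet'

writer = None
for chunk in pandas.read_csv(src, quotechar='"', encoding='latin-1',
                             index_col=False, sep='\t', chunksize=500_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # The first chunk defines the Parquet schema; later chunks must match it.
        writer = pq.ParquetWriter(dst, table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()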


darek...@gmail.com

Jul 18, 2018, 5:19:17 PM
to modin-dev
That's what I am trying to do: read many large .csv.gz files and write them out as one Parquet file. But if I can't read the .csv, then I can't write the Parquet. :(
I currently use Spark, and it works fine, but Spark is very slow and complex to set up and run. I am looking for a way to do this in pure Python.
I have tried Dask, but it has the same problem with .gz files: it tries to load them entirely into memory.
Thanks!!

darek...@gmail.com

Jul 18, 2018, 5:53:48 PM
to modin-dev
This does NOT seem to work either; I get a lot of these errors:

Failed to get objectid ObjectID(9f03bd2003accb5d8bfe70dac580ce5b11f80e0d) as argument 0 for remote function modin.pandas.utils._build_coord_df. It was created by remote function modin.pandas.utils._build_col_widths which failed with:
Remote function modin.pandas.utils._build_col_widths failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 329, in _build_col_widths
    for d in df_col]))
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2768, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(787923d9e6e087d260077442b618fbfdfec6577c). It was created by remote function modin.pandas.utils._deploy_func which failed with:

Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(a0d6418eaf61c47379706ff69d0dbd66a3b91994) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined


Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(de4cd4caad80d84c154dee565eebb3f8bfcaa19a) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(65c7a0ac299a683ac94f38aae9e96f8dddda2dd8) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Devin Petersohn

Jul 18, 2018, 5:56:51 PM
to darek...@gmail.com, modin-dev
Interesting, this shouldn't be happening. Did you install from pip? What command caused this?

darek...@gmail.com

Jul 18, 2018, 6:54:22 PM
to modin-dev
I installed it through pip. I am running:

import ray

# This is ~4x your 4GB file
memory_needs = 2**34
ray.init(plasma_directory="/tmp", object_store_memory=memory_needs)

# Pandas on Ray import
import modin.pandas as pd

file = '/home/user/large_file.csv.zip'

df = pd.read_csv( file, quotechar='"', encoding='latin-1', index_col=False, sep='\t' )

print( f'Len: {len(df)}')
print( f'Sam:\n{df.iloc[0:5,:]}')


Devin Petersohn

Jul 18, 2018, 7:03:56 PM
to darek...@gmail.com, modin-dev
This is a strange error; somehow the pip install must be truncating some of the files. I can't reproduce this for some reason, but I will keep trying to.

Could you try installing from pip this way: pip install -U git+git://github.com/modin-project/modin



darek...@gmail.com

Jul 19, 2018, 1:28:38 PM
to modin-dev
After installing from pip this way (pip install -U git+git://github.com/modin-project/modin), it died with this error:

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:62871 to respond...
Waiting for redis server at 127.0.0.1:37440 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.

======================================================================
======================================================================

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:18813 to respond...
Waiting for redis server at 127.0.0.1:52432 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.

======================================================================
======================================================================

/ray/src/local_scheduler/local_scheduler.cc:1186: process_message of type 1 took 8522 milliseconds.
/ray/src/local_scheduler/local_scheduler.cc:177: Killed worker pid 127743 which hadn't started yet.
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Suppressing duplicate error message.
Traceback (most recent call last):
  File "bakeoffModin.py", line 21, in <module>
    print( f'Len: {len(df)}')
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/dataframe.py", line 5047, in __len__
/ray/src/local_scheduler/local_scheduler_algorithm.cc:713: Lost connection to the plasma manager, local scheduler is exiting. Error: IOError: Connection reset by peer
    return len(self._row_metadata)
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/index_metadata.py", line 199, in __len__
    return int(sum(self._lengths))
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/index_metadata.py", line 66, in _get__lengths
    self._lengths_cache = ray.get(self._lengths_cache)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2771, in get
    value = worker.get_object([object_ids])[0]
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 490, in get_object
    i + ray._config.worker_fetch_request_size())])
  File "pyarrow/_plasma.pyx", line 553, in pyarrow._plasma.PlasmaClient.fetch
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Broken pipe
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler[0x442535]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(_ZN3ray8internal7CerrLogD1Ev+0x8e)[0x44f17e]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(_Z28fetch_object_timeout_handlerP11aeEventLoopxPv+0x261)[0x453b41]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(aeProcessEvents+0x2b4)[0x4827a4]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(aeMain+0x2b)[0x4829bb]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler(main+0x5a7)[0x442eb7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f15c7e36b97]
/usr/local/lib/python3.6/dist-packages/ray/local_scheduler/../core/src/local_scheduler/local_scheduler[0x444e91]

Devin Petersohn

Jul 19, 2018, 4:29:55 PM
to darek...@gmail.com, modin-dev
This is an odd error. Does it fail this way when you run from an interpreter? Does the error also occur when you run from different directories? I still haven't been able to reproduce the error. Do you get the same issue with python -m pip as you do with pip?


Robert Nishihara

Jul 19, 2018, 5:55:43 PM
to Devin Petersohn, darek...@gmail.com, modin-dev
When you do

    import modin
    print(modin.__file__)

does that print the file path you're expecting? Sometimes this kind of error can happen if you have multiple versions present.
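
For what it's worth, a couple of extra, purely illustrative checks along the same lines:

    import sys
    import modin
    import ray

    print(modin.__file__)  # which copy of Modin is actually imported
    print(ray.__file__)    # which copy of Ray is actually imported
    print([p for p in sys.path if 'packages' in p])  # package directories on the path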

darek...@gmail.com

Jul 19, 2018, 8:10:59 PM
to modin-dev
It prints /usr/local/lib/python3.6/dist-packages/modin/__init__.py. Since the packages are installed globally, that makes sense.

darek...@gmail.com

Jul 19, 2018, 8:18:00 PM
to modin-dev
Using the Python 3.6.5 interpreter on an Ubuntu 18.04 box with 16 GB of RAM, I get:

Remote function modin.pandas.utils._build_row_lengths failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 338, in _build_row_lengths
    for d in df_row]))
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2768, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(0f84356ff12c6af967c11a4c0026429062c3e18b). It was created by remote function modin.pandas.utils._deploy_func which failed with:

Remote function modin.pandas.utils._deploy_func failed with:

Failed to get objectid ObjectID(805e64cf42b95d099fc422e79f110392bb151841) as argument 1 for remote function modin.pandas.utils._deploy_func. It was created by remote function modin.pandas.utils.create_blocks which failed with:
Remote function modin.pandas.utils.create_blocks failed with:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined


darek...@gmail.com

Jul 20, 2018, 1:22:09 PM
to modin-dev
Using:

####################################
import ray

# This is ~4x your 4GB file
memory_needs = 2**34
ray.init(plasma_directory="/tmp", object_store_memory=memory_needs)

####################################

on a smaller file does NOT work either; it throws:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/modin/pandas/utils.py", line 380, in create_blocks
    return _create_blocks_helper(df, npartitions, axis)
NameError: name '_create_blocks_helper' is not defined

Removing the import ray / ray.init lines, it works on a small file but not on the 4 GB file.

Devin Petersohn

Jul 25, 2018, 9:59:49 PM
to darek...@gmail.com, modin-dev
I was finally able to reproduce the error you had and have submitted a patch. https://github.com/modin-project/modin/pull/60

We are releasing a point release later this week, so it should be resolved for you then. Thanks for your patience!
