big slowdown on some loop/iteration ops

Apr 30, 2020, 4:48:31 PM
to modin-dev

trying to work with modin/ray on AWS SageMaker

any sort of loop iteration seems exceptionally slow - much slower than pandas
seems to run on 1 or 2 processes only, and even so, takes a lot of time

instance is a ml.m5.12xlarge
configured w swap 1x memory, but using none

I load modin as:

import ray
import modin.pandas as pd

successfully spawns 48 ray procs (it looks like, using top in a terminal)

I had to massively reduce the size of my example dataframe just to work with the replicated problem:

import numpy as np

NCOLS = 5
NROWS = 5000
# generate random data for columns

data_ndarray = np.random.rand(NCOLS,NROWS)

df = pd.DataFrame(data= {
    'col1': data_ndarray[0],
    'col2': data_ndarray[1],
    'col3': data_ndarray[2],
    'col4': data_ndarray[3],
    'col5': data_ndarray[4],
})

this gives:

(5, 5000)
<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    5000 non-null   float64
 1   col2    5000 non-null   float64
 2   col3    5000 non-null   float64
 3   col4    5000 non-null   float64
 4   col5    5000 non-null   float64
dtypes: float64(5)


for r in df.itertuples(index=True):


26.5 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


df['test'] = False

# standard loop

def f_std_loop(_df):
    #_df['test'] = False
    for row in range(0, len(_df)):
        if _df['col1'].iloc[row] == 1:
            _df['test'].iloc[row] = True

%%timeit -r 2 -n 1


38.5 s ± 9.28 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

then itertuples:

# itertuples loop

def f_row_iter(col0):
    if col0 == 1:
        return True
    return False
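def f_itertuples_loop(_df):
    result_series = []
    for row in _df.itertuples(index=False,name="TestName"):
        # loop body was missing from the post; assumed to apply f_row_iter to col1
        result_series.append(f_row_iter(row.col1))
    _df['test'] = result_series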

%%timeit -r 2 -n 1


28.7 s ± 203 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

apply is faster, but not sure if it will scale:

%%timeit -r 2 -n 1
# apply, row-wise
# this is like a for loop. not good?
df['test'] = df.apply(lambda r: True if r['col1'] == 1 else False,axis=1)


248 ms ± 5.35 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

# vectorization

# vectorize
def f_vec_iter(_df,col0):
    _df.loc[col0 >= 0.85,'test'] = True

df['test'] = False

%%timeit -r 7 -n 2 f_vec_iter(df,df['col1'])


141 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

test    742
dtype: int64

Python Environment is as follows:

     active environment : run-nsf
    active env location : /home/ec2-user/anaconda3/envs/run-nsf
            shell level : 2
       user config file : /home/ec2-user/.condarc
 populated config files : /home/ec2-user/.condarc
          conda version : 4.5.12
    conda-build version : 3.10.5
         python version :
       base environment : /home/ec2-user/anaconda3  (writable)
           channel URLs :
          package cache : /home/ec2-user/anaconda3/pkgs
       envs directories : /home/ec2-user/anaconda3/envs
               platform : linux-64
             user-agent : conda/4.5.12 requests/2.20.0 CPython/3.6.6 Linux/4.14.171-105.231.amzn1.x86_64 amzn/2018.03 glibc/2.17
                UID:GID : 500:500
             netrc file : None
           offline mode : False

# conda environments:
base                     /home/ec2-user/anaconda3
JupyterSystemEnv         /home/ec2-user/anaconda3/envs/JupyterSystemEnv
R                        /home/ec2-user/anaconda3/envs/R
amazonei_mxnet_p27       /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27
amazonei_mxnet_p36       /home/ec2-user/anaconda3/envs/amazonei_mxnet_p36
amazonei_tensorflow_p27     /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p27
amazonei_tensorflow_p36     /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p36
chainer_p27              /home/ec2-user/anaconda3/envs/chainer_p27
chainer_p36              /home/ec2-user/anaconda3/envs/chainer_p36
mxnet_p27                /home/ec2-user/anaconda3/envs/mxnet_p27
mxnet_p36                /home/ec2-user/anaconda3/envs/mxnet_p36
python2                  /home/ec2-user/anaconda3/envs/python2
python3                  /home/ec2-user/anaconda3/envs/python3
pytorch_p27              /home/ec2-user/anaconda3/envs/pytorch_p27
pytorch_p36              /home/ec2-user/anaconda3/envs/pytorch_p36
run-nsf               *  /home/ec2-user/anaconda3/envs/run-nsf
tensorflow_p27           /home/ec2-user/anaconda3/envs/tensorflow_p27
tensorflow_p36           /home/ec2-user/anaconda3/envs/tensorflow_p36

sys.version: 3.6.6 |Anaconda, Inc.| (default, Oct  9 ...
sys.prefix: /home/ec2-user/anaconda3
sys.executable: /home/ec2-user/anaconda3/bin/python
conda location: /home/ec2-user/anaconda3/lib/python3.6/site-packages/conda
conda-build: /home/ec2-user/anaconda3/bin/conda-build
conda-convert: /home/ec2-user/anaconda3/bin/conda-convert
conda-develop: /home/ec2-user/anaconda3/bin/conda-develop
conda-env: /home/ec2-user/anaconda3/bin/conda-env
conda-index: /home/ec2-user/anaconda3/bin/conda-index
conda-inspect: /home/ec2-user/anaconda3/bin/conda-inspect
conda-metapackage: /home/ec2-user/anaconda3/bin/conda-metapackage
conda-render: /home/ec2-user/anaconda3/bin/conda-render
conda-server: /home/ec2-user/anaconda3/bin/conda-server
conda-skeleton: /home/ec2-user/anaconda3/bin/conda-skeleton
conda-verify: /home/ec2-user/anaconda3/bin/conda-verify
user site dirs: 

AWS_PATH: /opt/aws
CIO_TEST: <not set>
CONDA_BACKUP_JAVA_HOME: /usr/lib/jvm/java
CONDA_EXE: /home/ec2-user/anaconda3/bin/conda
CONDA_PREFIX: /home/ec2-user/anaconda3/envs/run-nsf
CONDA_PREFIX_1: /home/ec2-user/anaconda3/envs/JupyterSystemEnv
CONDA_PYTHON_EXE: /home/ec2-user/anaconda3/bin/python
CONDA_ROOT: /home/ec2-user/anaconda3
CUDA_PATH: /usr/local/cuda-10.0
JAVA_LD_LIBRARY_PATH: /home/ec2-user/anaconda3/envs/run-nsf/lib/server
LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/efa/lib:/opt/amazon/efa/lib:/opt/amazon/efa/lib64:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/::/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow
MANPATH: /opt/aws/neuron/share/man:
MODULEPATH: /usr/share/Modules/modulefiles:/etc/modulefiles
PATH: /home/ec2-user/anaconda3/envs/run-nsf/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/home/ec2-user/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ec2-user/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin:/home/ec2-user/anaconda3/bin/:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/opt/aws/bin
PKG_CONFIG_PATH: /usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:
SSL_CERT_FILE: <not set>

WARNING: could not import _license.show_info
# try:
# $ conda install -n root _license

Devin Petersohn

Apr 30, 2020, 6:15:14 PM
to Nerdromancer, modin-dev
Hi Nerdromancer, thanks for the detailed question!

The short answer is twofold: we cannot speed up Python for loops, and we cannot do any naive parallelism because subsequent steps in the loop may rely on some previous state.

The only way we can parallelize iteritems and itertuples is to introduce purely lazy computation. We are planning to do this as soon as this summer, but even if laziness is introduced, there is still a challenge with loops that depend on previous row state. What you are seeing with the slow runtime is that you are paying the distributed overheads each time you check a value and set another.

In your case, the data manipulation is embarrassingly parallel, which means that each iteration does not rely on the state of the previous one. In that case I recommend using apply or the vectorized solution, as you did. `apply` is run in parallel in your case, so the overheads are paid only once.

In short, looping is something we will be iterating on (pun intended). It is a difficult operation to make run fast in a distributed environment, and hopefully laying out these challenges at a high level here will help a bit.
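
A minimal sketch of the two recommended alternatives, using the same shape of data and the 0.85 cutoff from the vectorized example in the original post. It assumes plain pandas so it runs anywhere; modin.pandas is intended as a drop-in replacement, but the timings above are not reproduced here.

```python
import numpy as np
import pandas as pd  # assumption: plain pandas; swap in modin.pandas to try it on Modin

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((5000, 5)),
    columns=["col1", "col2", "col3", "col4", "col5"],
)
THRESHOLD = 0.85  # cutoff taken from the vectorized example in the post

# Row-wise apply: still one Python call per row, but submitted as one operation.
flag_apply = df.apply(lambda r: r["col1"] >= THRESHOLD, axis=1)

# Vectorized: a single column comparison with no per-row Python code.
flag_vec = df["col1"] >= THRESHOLD

df["test"] = flag_vec
print(int(df["test"].sum()), "rows flagged")
```

Both forms produce the same flags; the difference is that the whole operation is handed over at once instead of one row at a time.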



Devin Petersohn

Apr 30, 2020, 6:41:54 PM
to Marc Cooperman, modin-dev
has anyone reported an issue with loop iteration being much slower than pandas?

Looking through the issue tracker, we don't have one specific to itertuples or iterrows or looping. Feel free to create one if you'd like! As I mentioned before, there are overheads to running modin, so the more you can give it at one time the better. Since in a loop it is only possible to interact with one row, you are paying a cost to retrieve that row from somewhere else, then paying to put it back. Normally we can leave the rows where they are and operate on them in parallel, but with loops we cannot inspect the entire Python code block to see what is happening. That is why you are seeing the slowdown compared to pandas.
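
The retrieve-then-put-back cost described above can be illustrated with a toy model. This is only a sketch: `ToyRemoteColumn` is a hypothetical class invented for illustration, not part of Modin or Ray, and a "round trip" here is just a counter standing in for serialization and transfer between processes.

```python
class ToyRemoteColumn:
    """Hypothetical stand-in for a column held by worker processes."""

    def __init__(self, values):
        self._values = list(values)
        self.round_trips = 0  # each one models a fetch or store across processes

    def get(self, i):
        self.round_trips += 1
        return self._values[i]

    def set(self, i, value):
        self.round_trips += 1
        self._values[i] = value

    def map_all(self, func):
        # One request: the function is sent to the data and applied in place.
        self.round_trips += 1
        self._values = [func(v) for v in self._values]

    def snapshot(self):
        return list(self._values)


values = [i / 1000 for i in range(1000)]

# Loop style: every row costs one fetch and one store.
looped = ToyRemoteColumn(values)
for i in range(len(values)):
    looped.set(i, looped.get(i) >= 0.85)

# Batch style: the whole operation is submitted once.
batched = ToyRemoteColumn(values)
batched.map_all(lambda v: v >= 0.85)

print(looped.round_trips, batched.round_trips)  # 2000 and 1
```

The results are identical, but the loop pays the per-request overhead 2000 times while the batch call pays it once.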

I've tried restarting my instance and putting some defensive coding about how modin/ray get loaded/unloaded per notebook

It sounds interesting, feel free to share more about your experience as you go. 


On Thu, Apr 30, 2020 at 3:19 PM Marc Cooperman wrote:
thank you.

that's pretty clear.

one thing is that it's not just 'as slow as pandas' it's massively slower, in iterating loops.

i've tried restarting my instance and putting some defensive coding about how modin/ray get loaded/unloaded per notebook
and am now attempting to use modin and pandas 'in parallel' (pun intended) with different module aliases for parts of the work that depend on either model of execution, where I haven't figured out how to convert.

has anyone reported an issue with loop iteration being much slower than pandas?


Yours truly,


Marc S. Cooperman


May 6, 2020, 6:01:40 PM
to modin-dev
with a bit more experience and using %%prun it seems a pretty common and widespread 'thread' (pun intended) is large delays / time spent in cloudpickle and ray getobjects - so clearly overhead related to passing object copies among the subworkers?

although some cases seem to concentrate CPU time in the master or a single process, while others distribute it out.
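
For anyone wanting to repeat this kind of measurement outside a notebook, here is a minimal sketch using the standard-library cProfile module, which is what %%prun wraps. The functions `slow_step` and `workload` are hypothetical placeholders for the real dataframe code.

```python
import cProfile
import io
import pstats


def slow_step():
    # Placeholder for an expensive call such as a per-row dataframe access.
    return sum(i * i for i in range(200_000))


def workload():
    for _ in range(5):
        slow_step()


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by cumulative time so the functions that dominate the run come first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time makes it easy to see whether serialization or object retrieval dominates, as opposed to the computation itself.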

there ought to be some caching used for something like - that's slow! should only recalculate if dataframe is 'dirty'.
Although there still seems to be a bug with datatypes after an astype, where categories are incorrectly displayed as object (string) in info, vs the .dtype on the series.

It's almost an inversion - everything cheap in pandas is expensive in modin/ray and vice versa. I know that's probably a gross simplification...
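
The "only recalculate if dirty" idea suggested above can be sketched as follows. This is a generic dirty-flag cache under stated assumptions, not how Modin works internally; `CachedSummary` and its methods are hypothetical names.

```python
class CachedSummary:
    """Hypothetical wrapper that caches an expensive summary until data changes."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._cache = None
        self._dirty = True
        self.recomputations = 0

    def set_value(self, i, value):
        self._rows[i] = value
        self._dirty = True  # any write invalidates the cached summary

    def summary(self):
        if self._dirty:
            # Recompute only when something has changed since the last call.
            self.recomputations += 1
            self._cache = {"count": len(self._rows), "total": sum(self._rows)}
            self._dirty = False
        return self._cache


data = CachedSummary([1.0, 2.0, 3.0])
data.summary()            # computed
data.summary()            # served from the cache
data.set_value(0, 10.0)   # marks the data dirty
print(data.summary(), data.recomputations)
```

The trade-off is that every write path has to remember to set the flag, which is harder to guarantee when the data is spread across many worker processes.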

