big slowdown on some loop/iteration ops

Nerdromancer

Apr 30, 2020, 4:48:31 PM
to modin-dev
hi.


I'm trying to work with modin/ray on AWS SageMaker.

Any sort of loop iteration seems exceptionally slow, much slower than pandas.
It seems to run on only 1 or 2 processes, and even so it takes a lot of time.

The instance is an ml.m5.12xlarge,
configured with swap at 1x memory, but none of it is in use.

I load modin as:

import ray
ray.init()
import modin.pandas as pd

This successfully spawns 48 ray procs (as far as I can tell from top in a terminal).

I had to massively reduce the size of my example dataframe just to be able to work with a reproduction of the problem:

import numpy as np

NROWS = 5000
NCOLS = 5
# generate random data for columns

data_ndarray = np.random.rand(NCOLS,NROWS)
print(data_ndarray.shape)

df = pd.DataFrame(data= {
    'col1': data_ndarray[0],
    'col2': data_ndarray[1],
    'col3': data_ndarray[2],
    'col4': data_ndarray[3],
    'col5': data_ndarray[4],
})
df.info()

this gives:

(5, 5000)
<class 'modin.pandas.dataframe.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    5000 non-null   float64
 1   col2    5000 non-null   float64
 2   col3    5000 non-null   float64
 3   col4    5000 non-null   float64
 4   col5    5000 non-null   float64
dtypes: float64(5)


then:

%%timeit
for r in df.itertuples(index=True):
    #print(r)
    pass

gives:

26.5 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

next:

df['test'] = False

# standard loop

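def f_std_loop(_df):
    #_df['test'] = False
    test_pos = _df.columns.get_loc('test')  # position of 'test' for .iloc
    for row in range(len(_df)):
        if _df['col1'].iloc[row] == 1:
            # single indexer call; chained _df['test'].iloc[row] = ... may write to a copy
            _df.iloc[row, test_pos] = True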

%%timeit -r 2 -n 1
f_std_loop(df)

gives:

38.5 s ± 9.28 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

then itertuples:

# itertuples loop

def f_row_iter(col0):
    if col0 == 1:
        return True
    return False
    
def f_itertuples_loop(_df):
    result_series = []
    for row in _df.itertuples(index=False,name="TestName"):
        #print(row)
        result_series.append(f_row_iter(row.col1))
    _df['test'] = result_series

%%timeit -r 2 -n 1
f_itertuples_loop(df)

gives:

28.7 s ± 203 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

apply is faster, but not sure if it will scale:

%%timeit -r 2 -n 1
# apply, row-wise
# this is like a for loop. not good?
df['test'] = df.apply(lambda r: True if r['col1'] == 1 else False,axis=1)

gives:

248 ms ± 5.35 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

then vectorization:

# vectorize
# note: the condition here is >= 0.85, not == 1 as in the loops above
def f_vec_iter(_df, col0):
    _df.loc[col0 >= 0.85, 'test'] = True

df['test'] = False

%timeit -r 7 -n 2 f_vec_iter(df, df['col1'])

gives:

141 ms ± 5.09 ms per loop (mean ± std. dev. of 7 runs, 2 loops each)

df[df['test']][['test']].sum()
Output:
test    742
dtype: int64

Python Environment is as follows:

     active environment : run-nsf
    active env location : /home/ec2-user/anaconda3/envs/run-nsf
            shell level : 2
       user config file : /home/ec2-user/.condarc
 populated config files : /home/ec2-user/.condarc
          conda version : 4.5.12
    conda-build version : 3.10.5
         python version : 3.6.6.final.0
       base environment : /home/ec2-user/anaconda3  (writable)
           channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/free/linux-64
                          https://repo.anaconda.com/pkgs/free/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://repo.anaconda.com/pkgs/pro/linux-64
                          https://repo.anaconda.com/pkgs/pro/noarch
          package cache : /home/ec2-user/anaconda3/pkgs
       envs directories : /home/ec2-user/anaconda3/envs
                          /home/ec2-user/.conda/envs
               platform : linux-64
             user-agent : conda/4.5.12 requests/2.20.0 CPython/3.6.6 Linux/4.14.171-105.231.amzn1.x86_64 amzn/2018.03 glibc/2.17
                UID:GID : 500:500
             netrc file : None
           offline mode : False

# conda environments:
#
base                     /home/ec2-user/anaconda3
JupyterSystemEnv         /home/ec2-user/anaconda3/envs/JupyterSystemEnv
R                        /home/ec2-user/anaconda3/envs/R
amazonei_mxnet_p27       /home/ec2-user/anaconda3/envs/amazonei_mxnet_p27
amazonei_mxnet_p36       /home/ec2-user/anaconda3/envs/amazonei_mxnet_p36
amazonei_tensorflow_p27     /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p27
amazonei_tensorflow_p36     /home/ec2-user/anaconda3/envs/amazonei_tensorflow_p36
chainer_p27              /home/ec2-user/anaconda3/envs/chainer_p27
chainer_p36              /home/ec2-user/anaconda3/envs/chainer_p36
mxnet_p27                /home/ec2-user/anaconda3/envs/mxnet_p27
mxnet_p36                /home/ec2-user/anaconda3/envs/mxnet_p36
python2                  /home/ec2-user/anaconda3/envs/python2
python3                  /home/ec2-user/anaconda3/envs/python3
pytorch_p27              /home/ec2-user/anaconda3/envs/pytorch_p27
pytorch_p36              /home/ec2-user/anaconda3/envs/pytorch_p36
run-nsf               *  /home/ec2-user/anaconda3/envs/run-nsf
tensorflow_p27           /home/ec2-user/anaconda3/envs/tensorflow_p27
tensorflow_p36           /home/ec2-user/anaconda3/envs/tensorflow_p36

sys.version: 3.6.6 |Anaconda, Inc.| (default, Oct  9 ...
sys.prefix: /home/ec2-user/anaconda3
sys.executable: /home/ec2-user/anaconda3/bin/python
conda location: /home/ec2-user/anaconda3/lib/python3.6/site-packages/conda
conda-build: /home/ec2-user/anaconda3/bin/conda-build
conda-convert: /home/ec2-user/anaconda3/bin/conda-convert
conda-develop: /home/ec2-user/anaconda3/bin/conda-develop
conda-env: /home/ec2-user/anaconda3/bin/conda-env
conda-index: /home/ec2-user/anaconda3/bin/conda-index
conda-inspect: /home/ec2-user/anaconda3/bin/conda-inspect
conda-metapackage: /home/ec2-user/anaconda3/bin/conda-metapackage
conda-render: /home/ec2-user/anaconda3/bin/conda-render
conda-server: /home/ec2-user/anaconda3/bin/conda-server
conda-skeleton: /home/ec2-user/anaconda3/bin/conda-skeleton
conda-verify: /home/ec2-user/anaconda3/bin/conda-verify
user site dirs: 

AWS_PATH: /opt/aws
CIO_TEST: <not set>
CONDA_BACKUP_JAVA_HOME: /usr/lib/jvm/java
CONDA_BACKUP_JAVA_LD_LIBRARY_PATH: 
CONDA_DEFAULT_ENV: run-nsf
CONDA_EXE: /home/ec2-user/anaconda3/bin/conda
CONDA_MKL_INTERFACE_LAYER_BACKUP: 
CONDA_PREFIX: /home/ec2-user/anaconda3/envs/run-nsf
CONDA_PREFIX_1: /home/ec2-user/anaconda3/envs/JupyterSystemEnv
CONDA_PROMPT_MODIFIER: (run-nsf) 
CONDA_PYTHON_EXE: /home/ec2-user/anaconda3/bin/python
CONDA_ROOT: /home/ec2-user/anaconda3
CONDA_SHLVL: 2
CUDA_PATH: /usr/local/cuda-10.0
JAVA_LD_LIBRARY_PATH: /home/ec2-user/anaconda3/envs/run-nsf/lib/server
LD_LIBRARY_PATH: /usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.0/lib:/usr/local/cuda-10.0/efa/lib:/opt/amazon/efa/lib:/opt/amazon/efa/lib64:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/:/usr/lib64/openmpi/lib/:/usr/local/lib:/usr/lib:/usr/local/mpi/lib:/lib/::/home/ec2-user/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow
MANPATH: /opt/aws/neuron/share/man:
MODULEPATH: /usr/share/Modules/modulefiles:/etc/modulefiles
PATH: /home/ec2-user/anaconda3/envs/run-nsf/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/home/ec2-user/anaconda3/bin/:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/home/ec2-user/src/cntk/bin:/usr/local/mpi/bin:/opt/aws/neuron/bin:/home/ec2-user/anaconda3/envs/JupyterSystemEnv/bin:/home/ec2-user/anaconda3/bin/:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/usr/local/cuda/bin:/usr/local/bin:/opt/aws/bin:/usr/local/mpi/bin:/usr/libexec/gcc/x86_64-amazon-linux/4.8.5:/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/aws/bin:/opt/aws/bin
PKG_CONFIG_PATH: /usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:/usr/local/lib/pkgconfig:
PYTHON_INSTALL_LAYOUT: amzn
PYTHON_VERSION: 3.6
REQUESTS_CA_BUNDLE: <not set>
SSL_CERT_FILE: <not set>


WARNING: could not import _license.show_info
# try:
# $ conda install -n root _license


# packages in environment at /home/ec2-user/anaconda3/envs/run-nsf:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
absl-py                   0.9.0                     <pip>
aiohttp                   3.6.2                     <pip>
alabaster                 0.7.10           py36h306e16b_0  
anaconda-client           1.6.14                   py36_0  
anaconda-project          0.8.2            py36h44fb852_0  
argparse                  1.4.0                     <pip>
asn1crypto                0.24.0                   py36_0  
astor                     0.8.1                     <pip>
astroid                   1.6.3                    py36_0  
astropy                   3.0.2            py36h3010b51_1  
async-timeout             3.0.1                     <pip>
attrs                     18.1.0                   py36_0  
Automat                   0.3.0                     <pip>
autovizwidget             0.15.0                    <pip>
awscli                    1.18.24                   <pip>
babel                     2.5.3                    py36_0  
backcall                  0.1.0                    py36_0  
backports                 1.0              py36hfa02d7e_1  
backports.shutil_get_terminal_size 1.0.0            py36hfea85ff_2  
bazel                     0.15.0                        0    conda-forge
bcrypt                    3.1.7                     <pip>
beautifulsoup4            4.6.0            py36h49b8c8c_1  
bitarray                  0.8.1            py36h14c3975_1  
bkcharts                  0.2              py36h735825a_0  
blas                      1.0                         mkl  
blaze                     0.11.3           py36h4e06776_0  
bleach                    2.1.3                    py36_0  
blist                     1.3.6            py36h8c4c3a4_1    conda-forge
blosc                     1.14.3               hdbcaa40_0  
bokeh                     1.0.4                     <pip>
bokeh                     1.4.0                    py36_0  
boto                      2.48.0           py36h6e4cd66_1  
boto3                     1.12.41            pyh9f0ad1d_0    conda-forge
botocore                  1.15.41            pyh9f0ad1d_0    conda-forge
bottleneck                1.2.1            py36haac1ea0_0  
bzip2                     1.0.6                h14c3975_5  
ca-certificates           2020.1.1                      0  
cached-property           1.5.1                     <pip>
cairo                     1.14.12              h8948797_3  
certifi                   2019.11.28               py36_0  
cffi                      1.11.5           py36h9745a5d_0  
characteristic            14.3.0                    <pip>
chardet                   3.0.4            py36h0f667ec_1  
click                     6.7              py36h5253387_0  
cloudpickle               0.5.3                    py36_0  
clyent                    1.2.2            py36h7e57e65_1  
colorama                  0.3.9            py36h489cec4_0  
contextlib2               0.5.5            py36h6c84a62_0  
cryptography              2.8                       <pip>
cryptography              2.3.1            py36hc365091_0  
curl                      7.61.0               h84994c4_0  
cycler                    0.10.0           py36h93f1223_0  
cython                    0.29.16          py36h831f99a_0    conda-forge
cytoolz                   0.9.0.1          py36h14c3975_0  
dask                      1.2.2                      py_0  
dask-core                 1.2.2                      py_0  
datashape                 0.5.4            py36h3ad6b5c_0  
dbus                      1.13.2               h714fa37_1  
decorator                 4.3.0                    py36_0  
defusedxml                0.6.0                      py_0  
dill                      0.3.1.1          py36h9f0ad1d_1    conda-forge
distributed               1.28.1                   py36_0  
docker                    4.2.0                     <pip>
docker-compose            1.25.4                    <pip>
dockerpty                 0.4.1                     <pip>
docopt                    0.6.2                     <pip>
docutils                  0.14             py36hb0f60f5_0  
entrypoints               0.2.3            py36h1aec115_2  
enum34                    1.1.9                     <pip>
environment-kernels       1.1.1                     <pip>
et_xmlfile                1.0.1            py36hd6bccc3_0  
expat                     2.2.9                he1b5a44_2    conda-forge
fastcache                 1.0.2            py36h14c3975_2  
filelock                  3.0.4                    py36_0  
flask                     1.0.2                    py36_1  
flask-cors                3.0.4                    py36_0  
fontconfig                2.13.0               h9420a91_0  
freetype                  2.10.0               he983fc9_1    conda-forge
fribidi                   1.0.5             h516909a_1002    conda-forge
fsspec                    0.7.2                      py_0    conda-forge
gast                      0.2.2                     <pip>
get_terminal_size         1.0.0                haa9412d_0  
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
gevent                    1.3.0            py36h14c3975_0  
glib                      2.56.2            had28632_1001    conda-forge
glob2                     0.6              py36he249c77_0  
gmp                       6.1.2                h6c8ec71_1  
gmpy2                     2.0.8            py36hc8893dd_2  
google                    2.0.3                     <pip>
google-pasta              0.1.8                     <pip>
graphite2                 1.3.11               h16798f4_2  
graphviz                  2.40.1               h21bd128_2  
greenlet                  0.4.13           py36h14c3975_0  
grpcio                    1.10.1                    <pip>
gst-plugins-base          1.14.0               hbbd80ab_1  
gstreamer                 1.14.0               hb453b48_1  
h5py                      2.8.0            py36h989c5e5_3  
harfbuzz                  1.9.0             he243708_1001    conda-forge
hdf5                      1.10.2               hba1933b_1  
hdijupyterutils           0.15.0                    <pip>
heapdict                  1.0.0                    py36_2  
horovod                   0.19.0                    <pip>
html5lib                  1.0.1            py36h2f9c1c0_0  
icu                       58.2                 h9c2bf20_1  
idna                      2.6              py36h82fb2a8_1  
idna-ssl                  1.1.0                     <pip>
imageio                   2.3.0                    py36_0  
imagesize                 1.0.0                    py36_0  
importlib-metadata        1.5.0                     <pip>
intel-openmp              2018.0.0                      8  
ipykernel                 5.2.0            py36h95af2a2_1    conda-forge
ipyparallel               6.2.4            py36h9f0ad1d_0    conda-forge
ipython                   7.13.0           py36h9f0ad1d_2    conda-forge
ipython_genutils          0.2.0            py36hb52b0d5_0  
ipywidgets                7.5.1                      py_0    conda-forge
isort                     4.3.4                    py36_0  
itsdangerous              0.24             py36h93cc618_1  
jbig                      2.1                  hdba287a_0  
jdcal                     1.4                      py36_0  
jedi                      0.12.0                   py36_1  
jinja2                    2.10             py36ha16c418_0  
jmespath                  0.9.4                      py_0  
joblib                    0.14.1                     py_0  
jpeg                      9b                   h024ee3a_2  
jsonschema                2.6.0            py36h006f8b5_0  
jupyter_client            6.1.3                      py_0    conda-forge
jupyter_console           5.2.0                    py36_1    conda-forge
jupyter_core              4.6.3            py36h9f0ad1d_1    conda-forge
jupyterlab                0.32.1                   py36_0  
jupyterlab_launcher       0.10.5                   py36_0  
Keras                     2.2.4                     <pip>
Keras-Applications        1.0.8                     <pip>
Keras-Preprocessing       1.1.0                     <pip>
kiwisolver                1.0.1            py36h764f252_0  
krb5                      1.14.2               hcdc1b81_6  
lazy-object-proxy         1.3.1            py36h10fcdad_0  
ld_impl_linux-64          2.33.1               h53a641e_7  
libblas                   3.8.0                    15_mkl    conda-forge
libcblas                  3.8.0                    15_mkl    conda-forge
libcurl                   7.61.0               h1ad7b7a_0  
libedit                   3.1.20170329         h6b74fdf_2  
libffi                    3.2.1                hd88cf55_4  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran               3.0.0                         1    conda-forge
libgfortran-ng            7.2.0                hdf63c60_3  
libiconv                  1.15              h516909a_1005    conda-forge
liblapack                 3.8.0                    15_mkl    conda-forge
libllvm8                  8.0.1                hc9558a2_0    conda-forge
libpng                    1.6.37               hed695b0_0    conda-forge
libsodium                 1.0.16               h1bed415_0  
libssh2                   1.8.0                h9cfc8f7_4  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.0.9                he85c1e1_1  
libtool                   2.4.6                h544aabb_3  
libuuid                   1.0.3                h1bed415_2  
libxcb                    1.13                 h1bed415_1  
libxml2                   2.9.8                h26e45fe_1  
libxslt                   1.1.32               h1312cb7_0  
llvmlite                  0.31.0           py36hfa65bc7_1    conda-forge
locket                    0.2.0            py36h787c0ad_1  
lxml                      4.2.1            py36h23eabaa_0  
lzo                       2.10                 h49e0be7_2  
Markdown                  3.2.1                     <pip>
markupsafe                1.0              py36hd9260cd_1  
matplotlib                3.1.1                    py36_0    conda-forge
matplotlib-base           3.1.1            py36hfd891ef_0    conda-forge
mccabe                    0.6.1            py36h5ad9710_1  
memory_profiler           0.57.0                     py_0    conda-forge
missingno                 0.4.2                      py_1    conda-forge
mistune                   0.8.3            py36h14c3975_1  
mkl                       2020.0                      166  
mkl-service               2.3.0            py36he904b0f_0  
mkl_fft                   1.0.15           py36ha843d7b_0  
mkl_random                1.1.0            py36hd6b4f25_0  
mock                      4.0.1                     <pip>
modin                     0.7.3                     <pip>
more-itertools            4.1.0                    py36_0  
mpc                       1.0.3                hec55b23_5  
mpfr                      3.1.5                h11a74b3_2  
mpi                       1.0                     openmpi    conda-forge
mpmath                    1.0.0            py36hfeacd6b_2  
msgpack                   0.6.0                     <pip>
msgpack-python            0.5.6            py36h6bb024c_0  
multidict                 4.7.5                     <pip>
multipledispatch          0.5.0                    py36_0  
nb_conda                  2.2.1                    py36_0  
nb_conda_kernels          2.2.2                    py36_0  
nbconvert                 5.4.1                    py36_3  
nbformat                  4.4.0            py36h31c9010_0  
ncurses                   6.1                  hf484d3e_0  
networkx                  2.1                      py36_0  
nltk                      3.3.0                    py36_0  
nodejs                    12.4.0               he1b5a44_0    conda-forge
nose                      1.3.7            py36hcdf7029_2  
notebook                  6.0.3                    py36_0    conda-forge
numba                     0.48.0           py36hb3f55d8_0    conda-forge
numexpr                   2.7.1            py36h423224d_0  
numpy                     1.16.4           py36h7e9f1db_0  
numpy-base                1.16.4           py36hde5b4d6_0  
numpydoc                  0.8.0                    py36_0  
odo                       0.5.1            py36h90ed295_0  
olefile                   0.45.1                   py36_0  
opencv-python             3.4.2.17                  <pip>
openjdk                   11.0.1            h516909a_1016    conda-forge
openmpi                   4.0.1                hc99cbb1_2    conda-forge
openpyxl                  2.5.3                    py36_0  
openssl                   1.0.2u               h7b6447c_0  
opt-einsum                3.1.0                     <pip>
packaging                 20.1                      <pip>
packaging                 17.1                     py36_0  
pandas                    1.0.3            py36h830a2c2_0    conda-forge
pandoc                    1.19.2.1             hea2e7c5_1  
pandocfilters             1.4.2            py36ha6701b7_1  
pango                     1.42.3               h8589676_0  
paramiko                  2.7.1                     <pip>
parso                     0.2.0                    py36_0  
partd                     0.3.8            py36h36fd896_0  
patchelf                  0.9                  hf79760b_2  
path.py                   11.0.1                   py36_0  
pathlib2                  2.3.2                    py36_0  
patsy                     0.5.0                    py36_0  
pcre                      8.42                 h439df22_0  
pep8                      1.7.1                    py36_0  
pexpect                   4.5.0                    py36_0  
pickleshare               0.7.4            py36h63277f8_0  
pillow                    5.4.1            py36h34e0f95_0  
pip                       19.3.1                    <pip>
pip                       19.3.1                   py36_0  
pixman                    0.34.0               hceecf20_3  
pkginfo                   1.4.2                    py36_1  
plotly                    4.5.2                     <pip>
pluggy                    0.6.0            py36hb689045_0  
ply                       3.11                     py36_0  
prometheus_client         0.7.1                      py_0    conda-forge
prompt-toolkit            3.0.5                      py_0    conda-forge
prompt_toolkit            1.0.15           py36h17d85b1_0  
protobuf                  3.8.0                     <pip>
protobuf3-to-dict         0.1.5                     <pip>
psutil                    5.4.5            py36h14c3975_0  
psycopg2                  2.7.5                     <pip>
ptyprocess                0.5.2            py36h69acd42_0  
py                        1.5.3                    py36_0  
py-spy                    0.3.3                     <pip>
py4j                      0.10.7                    <pip>
pyarrow                   0.17.0                    <pip>
pyasn1                    0.4.8                     <pip>
pycodestyle               2.4.0                    py36_0  
pycosat                   0.6.3            py36h0a5515d_0  
pycparser                 2.18             py36hf9f622e_1  
pycrypto                  2.6.1            py36h14c3975_8  
pycurl                    7.43.0.2         py36hb7f436b_0  
pyflakes                  1.6.0            py36h7bd6a15_0  
pygal                     2.4.0                     <pip>
pygments                  2.2.0            py36h0d3125c_0  
pykerberos                1.2.1            py36h14c3975_0  
pylint                    1.8.4                    py36_0  
pympler                   0.8                        py_0    conda-forge
PyNaCl                    1.3.0                     <pip>
pyodbc                    4.0.23           py36hf484d3e_0  
pyopenssl                 18.0.0                   py36_0  
pyparsing                 2.2.0            py36hee85983_1  
pyqt                      5.9.2            py36h751905a_0  
pysocks                   1.6.8                    py36_0  
pyspark                   2.3.2                     <pip>
pytables                  3.4.3            py36h02b9ad4_2  
pytest                    3.5.1                    py36_0  
pytest-arraydiff          0.2                      py36_0  
pytest-astropy            0.3.0                    py36_0  
pytest-doctestplus        0.1.3                    py36_0  
pytest-openfiles          0.3.0                    py36_0  
pytest-remotedata         0.2.1                    py36_0  
python                    3.6.6                h6e4f718_2  
python-dateutil           2.7.3                    py36_0  
python_abi                3.6                     1_cp36m    conda-forge
pytz                      2018.4                   py36_0  
pywavelets                0.5.2            py36he602eb0_0  
pyyaml                    3.12             py36hafb9ca4_1  
PyYAML                    5.3.1                     <pip>
pyzmq                     17.0.0           py36h14c3975_0  
qt                        5.9.6                h8703b6f_2  
qtawesome                 0.4.4            py36h609ed8c_0  
qtconsole                 4.3.1            py36h8f73b5b_0  
qtpy                      1.4.1                    py36_0  
ray                       0.8.4                     <pip>
readline                  7.0                  ha6073c6_4  
redis                     3.4.1                     <pip>
regex                     2020.4.4         py36h8c4c3a4_0    conda-forge
requests                  2.20.0                   py36_0  
requests-kerberos         0.12.0                    <pip>
retrying                  1.3.3                     <pip>
rope                      0.10.7           py36h147e2ec_0  
rsa                       3.4.2                     <pip>
ruamel_yaml               0.15.35          py36h14c3975_1  
s3fs                      0.4.2                      py_0    conda-forge
s3transfer                0.3.3            py36h9f0ad1d_1    conda-forge
sagemaker                 1.51.3                    <pip>
sagemaker-pyspark         1.2.8                     <pip>
scikit-image              0.16.2           py36hb3f55d8_0    conda-forge
scikit-learn              0.20.3                    <pip>
scikit-learn              0.22.2.post1     py36hcdab131_0    conda-forge
scipy                     1.4.1            py36h2d22cac_3    conda-forge
seaborn                   0.10.0                     py_1    conda-forge
send2trash                1.5.0                    py36_0  
setuptools                45.2.0                    <pip>
setuptools                39.1.0                   py36_0  
simplegeneric             0.8.1                    py36_2  
singledispatch            3.4.0.3          py36h7a266c3_0  
sip                       4.19.8           py36hf484d3e_0  
six                       1.11.0           py36h372c433_1  
smdebug-rulesconfig       0.1.2                     <pip>
snappy                    1.1.7                hbae5bb6_3  
snowballstemmer           1.2.1            py36h6febd40_0  
sortedcollections         0.6.1                    py36_0  
sortedcontainers          1.5.10                   py36_0  
sparkmagic                0.12.5                    <pip>
sphinx                    1.7.4                    py36_0  
sphinxcontrib             1.0              py36h6d0f590_1  
sphinxcontrib-websupport  1.0.1            py36hb5cb234_1  
spyder                    3.2.8                    py36_0  
SQLAlchemy                1.2.11                    <pip>
sqlalchemy                1.2.7            py36h6b74fdf_0  
sqlite                    3.28.0               h8b20d00_0    conda-forge
statsmodels               0.10.2           py36hc1659b7_0    conda-forge
sympy                     1.1.1            py36hc6d1c1c_0  
tbb                       2020.1               hc9558a2_0    conda-forge
tbb4py                    2020.1           py36hc9558a2_0    conda-forge
tblib                     1.3.2            py36h34cf8b6_0  
tensorboard               1.15.0                    <pip>
tensorflow                1.15.2                    <pip>
tensorflow-estimator      1.15.1                    <pip>
tensorflow-serving-api    1.15.0                    <pip>
termcolor                 1.1.0                     <pip>
terminado                 0.8.1                    py36_1  
testpath                  0.3.1            py36h8cadb63_0  
texttable                 1.6.2                     <pip>
tk                        8.6.10               hed695b0_0    conda-forge
toolz                     0.9.0                    py36_0  
tornado                   5.0.2                    py36_0  
tqdm                      4.45.0             pyh9f0ad1d_0    conda-forge
traitlets                 4.3.2            py36h674d592_0  
typing                    3.6.4                    py36_0  
typing-extensions         3.7.4.2                   <pip>
unicodecsv                0.14.1           py36ha668878_0  
unixodbc                  2.3.6                h1bed415_0  
urllib3                   1.23                     py36_0  
wcwidth                   0.1.7            py36hdf4376a_0  
webencodings              0.5.1            py36h800622e_1  
websocket-client          0.57.0                    <pip>
werkzeug                  0.14.1                   py36_0  
wheel                     0.31.1                   py36_0  
widgetsnbextension        3.5.1                    py36_0    conda-forge
wrapt                     1.12.1           py36h8c4c3a4_1    conda-forge
xlrd                      1.1.0            py36h1db9f0c_1  
xlsxwriter                1.0.4                    py36_0  
xlwt                      1.3.0            py36h7b00a1f_0  
xz                        5.2.4                h14c3975_4  
yaml                      0.1.7                had09818_2  
yarl                      1.4.2                     <pip>
zeromq                    4.2.5                h439df22_0  
zict                      0.1.3            py36h3a3bf81_0  
zipp                      3.0.0                     <pip>
zlib                      1.2.11               ha838bed_2  

Devin Petersohn

Apr 30, 2020, 6:15:14 PM
to Nerdromancer, modin-dev
Hi Nerdromancer, thanks for the detailed question!

The short answer is twofold: we cannot speed up Python for loops, and we cannot do any naive parallelism because subsequent steps in the loop may rely on some previous state.
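
As a minimal sketch of the second point (the function names are hypothetical, and this is plain Python rather than anything from Modin's internals): the first function below computes each flag from its own row only, so splitting the rows into chunks and processing them separately gives the same answer. The second reads the flag computed for the previous row, so the same naive split changes the result.

```python
def flag_independent(values, threshold):
    # Each flag depends only on its own value.
    return [v >= threshold for v in values]


def flag_dependent(values, threshold):
    # Each flag also depends on the flag computed for the previous row.
    flags = []
    previous = False
    for v in values:
        current = v >= threshold and not previous
        flags.append(current)
        previous = current
    return flags


values = [0.9, 0.95, 0.1, 0.99, 0.9]
first, second = values[:1], values[1:]  # a naive two-way split of the rows

# Independent rows: processing the chunks separately matches the full run.
independent_ok = (
    flag_independent(first, 0.85) + flag_independent(second, 0.85)
    == flag_independent(values, 0.85)
)

# Dependent rows: the second chunk starts without the first chunk's state.
dependent_ok = (
    flag_dependent(first, 0.85) + flag_dependent(second, 0.85)
    == flag_dependent(values, 0.85)
)

print(independent_ok, dependent_ok)  # True False
```

A library that only sees rows being requested one at a time cannot tell which of these two cases a given loop body falls into.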

The only way we can parallelize iteritems and itertuples is to introduce purely lazy computation. We are planning to do this as soon as this summer, but even if laziness is introduced, there is still a challenge with loops that depend on previous row state. What you are seeing with the slow runtime is that you are paying the distributed overheads each time you check a value and set another.
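
For a rough sense of scale, the sketch below divides the timings reported in the original post by its 5000 rows. It is only arithmetic on those reported numbers, not a profile of where the time actually goes; the dictionary labels are made up for readability.

```python
NROWS = 5000  # rows in the example DataFrame from the original post

# Wall-clock timings reported in the original post, in seconds.
reported_seconds = {
    "itertuples (pass)": 26.5,
    "iloc loop": 38.5,
    "itertuples + append": 28.7,
    "apply(axis=1)": 0.248,
    "vectorized .loc": 0.141,
}

# Average cost per row in milliseconds for each approach.
per_row_ms = {
    name: seconds / NROWS * 1000 for name, seconds in reported_seconds.items()
}

for name, ms in per_row_ms.items():
    print(f"{name:>20}: {ms:.4f} ms per row")
```

The three loop variants come out at roughly 5 to 8 ms per row, while `apply` and the vectorized version average well under 0.1 ms per row.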

In your case, the data manipulation is embarrassingly parallel, which means that each iteration does not rely on the state of the previous one. In that case I recommend using apply or the vectorized solution, as you did. `apply` runs in parallel in your case, so the overheads are paid only once.
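
As a rough illustration of that advice, here is a minimal sketch of the flagging loop from the original post rewritten without a Python-level loop. It uses plain pandas so that it runs anywhere; the assumption is that swapping the import for `import modin.pandas as pd` leaves the rest unchanged. The `> 0.5` condition and the column names `test_vec` and `test_apply` are hypothetical, chosen only so that some rows match.

```python
import numpy as np
import pandas as pd  # assumption: `import modin.pandas as pd` is a drop-in swap

rng = np.random.default_rng(0)
df = pd.DataFrame({"col1": rng.random(5000), "col2": rng.random(5000)})

# Vectorized: one whole-column comparison instead of 5000 separate iloc calls.
df["test_vec"] = df["col1"] > 0.5

# apply: the per-value function is handed over in a single call, so any
# per-operation overhead is paid once rather than once per row.
df["test_apply"] = df["col1"].apply(lambda value: value > 0.5)

# Both approaches flag exactly the same rows.
print(bool((df["test_vec"] == df["test_apply"]).all()))
```

Either form expresses the whole operation in one call, which is what lets the library decide how to split the work.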

In short, looping is something we will be iterating on (pun intended). It is a difficult operation to make run fast in a distributed environment, and hopefully laying out these challenges at a high level here will help a bit.

Devin

--
You received this message because you are subscribed to the Google Groups "modin-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modin-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/modin-dev/f90c3054-7349-4fdd-b406-da66d73dcf8b%40googlegroups.com.

Devin Petersohn

Apr 30, 2020, 6:41:54 PM4/30/20
to Marc Cooperman, modin-dev
> has anyone reported an issue with loop iteration being much slower than pandas?

Looking through the issue tracker (https://github.com/modin-project/modin/issues), we don't have one specific to itertuples or iterrows or looping. Feel free to create one if you'd like! As I mentioned before, there are overheads to running Modin, so the more you can give it at one time, the better. Since a loop can only interact with one row at a time, you are paying a cost to retrieve that row from somewhere else, then paying again to put it back. Normally we can leave the rows where they are and operate on them in parallel, but with loops we cannot inspect the entire Python code block to see what is happening. That is why you are seeing the slowdown compared to pandas.
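
To make the cost argument concrete, here is a toy model, not Modin's actual implementation. `ToyRemoteColumn` is a hypothetical class that only counts how many times data has to cross a process boundary, on the assumption that each crossing carries a fixed overhead.

```python
class ToyRemoteColumn:
    """Hypothetical stand-in for a column stored in another process."""

    def __init__(self, values):
        self._values = list(values)
        self.round_trips = 0  # each one models a fixed transfer overhead

    def __len__(self):
        return len(self._values)

    def get_one(self, position):
        # Fetch a single value: one round trip.
        self.round_trips += 1
        return self._values[position]

    def set_one(self, position, value):
        # Write a single value back: one more round trip.
        self.round_trips += 1
        self._values[position] = value

    def map_all(self, func):
        # The function travels to the data once; the rows stay where they are.
        self.round_trips += 1
        return ToyRemoteColumn(func(v) for v in self._values)


# Loop style: read one value, then write one value, for every row.
looped = ToyRemoteColumn(range(5000))
for i in range(len(looped)):
    looped.set_one(i, looped.get_one(i) * 2)

# Whole-column style: a single request covers every row.
source = ToyRemoteColumn(range(5000))
batched = source.map_all(lambda v: v * 2)

print(looped.round_trips)  # 10000
print(source.round_trips)  # 1
```

The results are identical; only the number of crossings differs, which is the part a distributed backend pays for.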

> I've tried restarting my instance and putting some defensive coding about how modin/ray get loaded/unloaded per notebook

That sounds interesting; feel free to share more about your experience as you go.

Devin

On Thu, Apr 30, 2020 at 3:19 PM Marc Cooperman <mscoo...@gmail.com> wrote:
thank you.

that's pretty clear.

one thing is that it's not just 'as slow as pandas' it's massively slower, in iterating loops.

i've tried restarting my instance and putting some defensive coding about how modin/ray get loaded/unloaded per notebook
and am now attempting to use modin and pandas 'in parallel' (pun intended) with different module aliases for parts of the work that depend on either model of execution, where I haven't figured out how to convert.

has anyone reported an issue with loop iteration being much slower than pandas?

thanks




Yours truly,

Marc

Marc S. Cooperman
NOTE: My UPenn alumni email address changed to marc.s.coo...@alumni.upenn.edu (from marc.s.coo...@wharton.upenn.edu)


Nerdromancer

May 6, 2020, 6:01:40 PM5/6/20
to modin-dev
with a bit more experience and using %%prun, it seems a pretty common and widespread 'thread' (pun intended) is large delays / time spent in cloudpickle and ray getobjects - so clearly overhead related to passing object copies among the subworkers?
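
A minimal sketch of how that kind of overhead shows up in a profile, using only the standard library. The assumptions here are that stdlib `pickle` stands in for cloudpickle, and that `cProfile` gives a comparable view to what %%prun reports; `transfer_per_row`, `transfer_batched` and `profile_report` are hypothetical helper names, and nothing below touches Modin or Ray.

```python
import cProfile
import io
import pickle
import pstats


def transfer_per_row(rows):
    # One serialize/deserialize round trip per row, as a row loop would cause.
    return [pickle.loads(pickle.dumps(row)) for row in rows]


def transfer_batched(rows):
    # A single round trip for the whole table.
    return pickle.loads(pickle.dumps(rows))


def profile_report(func, rows, top=5):
    """Run func(rows) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(rows)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


rows = [(i, i * 0.5, str(i)) for i in range(5000)]
per_row_result, per_row_report = profile_report(transfer_per_row, rows)
batched_result, batched_report = profile_report(transfer_batched, rows)

# Compare the call counts for the pickle functions in the two reports.
print(per_row_report)
print(batched_report)
```

In the per-row report the serialization functions are called once per row, while in the batched report they are called once in total, which matches the pattern of time concentrating in pickling calls.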

although some cases seem to concentrate CPU time in the master or a single process, while others distribute it out.

there ought to be some caching used for something like DataFrame.info() - that's slow! It should only recalculate if the dataframe is 'dirty'.
Although there still seems to be a bug with datatypes after an astype, where categories are incorrectly displayed as object (string) in info, vs the .dtype on the series.

It's almost an inversion - everything cheap in pandas is expensive in modin/ray and vice versa. I know that's probably a gross simplification...


