Freeze Support in apply function


var...@kaleyra.com

Dec 11, 2019, 9:24:27 AM
to modin-dev

Hey there, I am trying to use Google's phonenumbers library (libphonenumber) to validate phone numbers from a CSV. The CSV looks like:

Mobile
919100010000
919100010001
919100010002
919100010003

This is the code I'm using to validate the numbers.


import time

import modin.pandas as pd
import phonenumbers

start_time = time.time()

data = pd.read_csv('5lac.csv')
data['Mobile'] = data['Mobile'].astype(str).apply(lambda x: '+' + x)
data['valid1'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, 'IN')))
data.to_csv("5lac1.csv")

I think there is a bug in the multiprocessing handling: instead of doing a freeze support, all the cores are trying to access the same DataFrame simultaneously.
Here is the link to freeze support: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.freeze_support
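For context, the spawn and forkserver start methods re-import the main module in every child process, which is why the guard is required; a minimal standalone sketch of the idiom (not Modin-specific, just plain multiprocessing):

```python
import multiprocessing as mp


def square(x):
    # Work executed in a child process.
    return x * x


if __name__ == '__main__':
    # freeze_support() only matters for frozen executables, but the
    # __main__ guard itself is required whenever the start method is
    # "spawn" or "forkserver", because children re-import this module.
    mp.freeze_support()
    with mp.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

Without the guard, the child's re-import of the module would itself try to create the pool again, which is exactly the "bootstrapping phase" RuntimeError below.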

Below is the error it raises:
Task exception was never retrieved
future: <Task finished name='Task-7' coro=<_wrap_awaitable() done, defined at /usr/local/lib/python3.8/asyncio/tasks.py:677> exception=RuntimeError(...)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/home/user/.local/lib/python3.8/site-packages/distributed/nanny.py", line 251, in start
    response = await self.instantiate()
  File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/nanny.py", line 334, in instantiate
    result = await self.process.start()
  File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/nanny.py", line 522, in start
    await self.process.start()
  File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/process.py", line 34, in _call_and_set_future
    res = func(*args, **kwargs)
  File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/process.py", line 202, in _start
    process.start()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/usr/local/lib/python3.8/multiprocessing/context.py", line 290, in _Popen
    return Popen(process_obj)
  File "/usr/local/lib/python3.8/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.8/multiprocessing/popen_forkserver.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

I also tried adding the `if __name__ == '__main__': freeze_support()` idiom and then calling the code, but it gives the same error.

Devin Petersohn

Dec 11, 2019, 1:36:53 PM
to var...@kaleyra.com, modin-dev
Thanks for reaching out!

This issue has been raised on GitHub and is a consequence of the way that Windows handles multiprocessing in Python scripts. Here is a link to the issue: https://github.com/modin-project/modin/issues/843

Would you be able to try the fix in the issue and let me know if it works? It's a matter of just wrapping the entire script in an `if __name__ == '__main__'` block. I tested it locally for that simple example, so please let me know if it works for you!

Devin




This E-mail is confidential. It may also be legally privileged. If you are not the addressee you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete all copies of it from your system and notify the sender immediately by return E-mail. The sender does not accept liability for any errors or omissions.

--
You received this message because you are subscribed to the Google Groups "modin-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modin-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/modin-dev/bc5485ad-1164-4dac-855d-2c6d255a5e1c%40googlegroups.com.

var...@kaleyra.com

Dec 12, 2019, 1:21:58 AM
to modin-dev
Hey @Devin Petersohn!

Thanks for helping out.

I was using CentOS, not Windows, but wrapping the code in `if __name__ == '__main__'` did work fine. Although I think there are some performance issues.

Here is the link to the sequential numbers I generated (https://send.firefox.com/download/195d144af1ac717d/#1Q4o8dFXFFuwFmsBTjGJJw) for benchmarking.
The code that I am using is as follows:

import phonenumbers
import time

if __name__ == '__main__':
    import modin.pandas as pd

    start_time = time.time()

    data = pd.read_csv('5lac.csv')
    data['Mobile'] = data['Mobile'].astype(str).apply(lambda x: '+' + x)
    data['valid1'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, 'IN')))
    data.to_csv("5lac1.csv")

    print("--- %s seconds ---" % (time.time() - start_time))

The time taken on a server (CentOS) with 8 cores and 12 GB RAM was --- 49.38035869598389 seconds ---.
The time taken on Windows with 8 cores and 16 GB RAM was --- 113.19851970672607 seconds ---.
The time taken on Google Colab with 12 GB RAM was:
CPU times: user 5.6 s, sys: 335 ms, total: 5.93 s Wall time: 50.4 s

The output for all was as follows:

UserWarning: The Dask Engine for Modin is experimental.
UserWarning: Large object of size 8.00 MB detected in task graph:
  (0, <function PandasQueryCompiler.setitem.<locals> ... 7b7734d5f26bb')
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
UserWarning: `DataFrame.to_csv` defaulting to pandas implementation.
To request implementation, send an email to feature_...@modin.org.
--- 49.38035869598389 seconds ---
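For what it's worth, that scatter warning is Dask telling you a large object was captured directly in the task graph; the fix it suggests looks like this in plain dask.distributed (a sketch, assuming dask.distributed is installed; `total` is a stand-in for the real per-partition work):

```python
from distributed import Client  # dask.distributed


def total(arr):
    # Placeholder for the real computation on the large object.
    return float(sum(arr))


if __name__ == '__main__':
    client = Client(processes=False)  # in-process scheduler, demo only

    big_data = list(range(1000))

    # Bad: big_data gets serialized into the task graph on every submit.
    # future = client.submit(total, big_data)

    # Good: ship the data to the workers once, then pass the handle.
    big_future = client.scatter(big_data)
    future = client.submit(total, big_future)
    print(future.result())  # 499500.0

    client.close()
```

Modin manages this internally, so the warning is informational here, but the pattern is useful when driving Dask directly.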



I think this was significantly slower than the time taken by doing the multiprocessing manually, so I tried that. The code for the same is below.

from multiprocessing import Pool
import pandas as pd
import phonenumbers
import time

start_time = time.time()


def fn(num):
    return phonenumbers.is_valid_number(phonenumbers.parse(num))


def fn1(num):
    return phonenumbers.parse(num).country_code


df = pd.read_csv('5lac.csv')

with Pool(16) as pool:
    df['Mobile'] = df['Mobile'].astype(str).apply(lambda x: '+' + x)
    df['Valid'] = pool.map(fn, df['Mobile'])
    df['Country Code'] = pool.map(fn1, df['Mobile'])

print("--- %s seconds ---" % (time.time() - start_time))

The slight change being that pool.map applies the function to the DataFrame column (as an array) in a parallel way.
The time taken on a server (CentOS) with 8 cores and 12 GB RAM: --- 10.573974847793579 seconds ---
The time taken on Windows with 8 cores and 16 GB RAM: --- 18.407839059829712 seconds ---
The time taken on Google Colab with a single core: --- 35.12605667114258 seconds ---
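One further micro-optimization worth sketching (this is not from the thread; the phonenumbers call is replaced by a hypothetical stub check so the example is self-contained): batching the column into slices so each Pool task validates a whole chunk, paying the inter-process serialization cost once per chunk instead of once per number.

```python
from multiprocessing import Pool


def validate_chunk(nums):
    # Validate a whole slice of numbers in one task. The startswith/len
    # check is a stub standing in for phonenumbers.is_valid_number.
    return [n.startswith('+91') and len(n) == 13 for n in nums]


def chunked(seq, n):
    # Split seq into n roughly equal contiguous slices.
    k, m = divmod(len(seq), n)
    return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]


if __name__ == '__main__':
    numbers = ['+919100010000', '+919100010001', '919100010002']
    with Pool(2) as pool:
        parts = pool.map(validate_chunk, chunked(numbers, 2))
    results = [r for part in parts for r in part]
    print(results)  # [True, True, False]
```

Pool.map already does some chunking internally (and accepts a `chunksize` argument), but explicit batching also lets each worker amortize per-call setup inside the function.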

I have a few additional queries beyond the problem above:
  1. Do you think Modin would be able to perform better than manual multiprocessing?
  2. Do you think adding a GPU would increase the performance of manual multiprocessing, or even Modin, in this case?

Looking forward to a faster programming solution. I would love to know your views on all the above things.

Vraj Shah

var...@kaleyra.com

Dec 19, 2019, 12:26:37 AM
to modin-dev

Devin Petersohn

Dec 19, 2019, 12:33:24 PM
to var...@kaleyra.com, modin-dev
Sorry for the late reply, the end of the semester can get busy for me.

Do you think the Modin would be able to perform better than manual multiprocessing?

Yes, but there are some inefficiencies in how we currently handle some of the cases you presented (e.g. re-assignment of a column). The reason I am confident Modin performs better is because of the memory management and optimization work we've been doing. The cost of the multiprocessing is evident if you assign the result to a new variable instead of re-assigning the column. Obviously we need to improve the efficiency of re-assigning the columns, but this shows that entering/exiting the Multiprocessing pool incurs overheads that Modin does not have. With local testing on 8 cores and 1GB of data, Modin is ~3x better than multiprocessing pools on the apply functionality.
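A sketch of how one might measure the re-assignment cost described above (hypothetical harness: it uses plain pandas with a stub predicate instead of phonenumbers; swapping the import for `modin.pandas` would compare the two engines):

```python
import time

import pandas as pd  # swap for `import modin.pandas as pd` to compare


def benchmark(df):
    # Time the apply alone versus apply plus column re-assignment; with
    # Modin, the gap between the two exposes the re-assignment overhead.
    t0 = time.time()
    result = df['Mobile'].apply(lambda x: x.startswith('+'))  # new variable
    t_apply = time.time() - t0

    t0 = time.time()
    df['valid'] = df['Mobile'].apply(lambda x: x.startswith('+'))  # re-assign
    t_assign = time.time() - t0
    return t_apply, t_assign


if __name__ == '__main__':
    df = pd.DataFrame({'Mobile': ['+91910001%04d' % i for i in range(100_000)]})
    t_apply, t_assign = benchmark(df)
    print('apply only: %.3fs  apply + assign: %.3fs' % (t_apply, t_assign))
```

With a lazy or distributed engine, note that the "apply only" timing may not include materializing the result, so forcing the computation (e.g. via `len(result)` or a `repr`) gives a fairer comparison.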

Do you think adding GPU would increase the performance of manual multiprocessing or even modin in this case?

This is a bit more complicated. Modin could hypothetically use GPUs if we integrated cuDF kernels into Modin, but currently we do not. GPUs are good at accelerating certain computations, but they are not likely to speed up your existing libraries unless those libraries already support GPUs (the phonenumbers library, for example, does not).

Let me know if this helps! I created an issue on GitHub about the re-assignment performance: https://github.com/modin-project/modin/issues/913. I thought we already had one, but I couldn't find it. Let me know if you have any other questions!

Devin




