Hey there, I am trying to use Google's libphonenumber library (the Python phonenumbers port) to validate the phone numbers in a CSV file.
The CSV looks like this:
Mobile
919100010000
919100010001
919100010002
919100010003
This is the code I'm using for validating the numbers.
import time
import modin.pandas as pd
import phonenumbers
start_time = time.time()
data = pd.read_csv('5lac.csv')
data['Mobile'] = data['Mobile'].astype(str).apply(lambda x: '+' + x)
data['valid1'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, 'IN')))
data.to_csv("5lac1.csv")
I think there is a bug in the multiprocessing module: instead of using freeze_support, all the cores seem to be trying to access the same DataFrame simultaneously.
Here is the link to freeze_support: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.freeze_support
Below is the error that the code raises.
Task exception was never retrieved
future: <Task finished name='Task-7' coro=<_wrap_awaitable() done, defined at /usr/local/lib/python3.8/asyncio/tasks.py:677> exception=RuntimeError('\n An attempt has been made to start a new process before the\n current process has finished its bootstrapping phase.\n\n This probably means that you are not using fork to start your\n child processes and you have forgotten to use the proper idiom\n in the main module:\n\n if __name__ == \'__main__\':\n freeze_support()\n ...\n\n The "freeze_support()" line can be omitted if the program\n is not going to be frozen to produce an executable.')>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/asyncio/tasks.py", line 684, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/home/user/.local/lib/python3.8/site-packages/distributed/nanny.py", line 251, in start
response = await self.instantiate()
File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/nanny.py", line 334, in instantiate
result = await self.process.start()
File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/nanny.py", line 522, in start
await self.process.start()
File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/process.py", line 34, in _call_and_set_future
res = func(*args, **kwargs)
File "/home/varshaj/.local/lib/python3.8/site-packages/distributed/process.py", line 202, in _start
process.start()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/usr/local/lib/python3.8/multiprocessing/context.py", line 290, in _Popen
return Popen(process_obj)
File "/usr/local/lib/python3.8/multiprocessing/popen_forkserver.py", line 35, in __init__
super().__init__(process_obj)
File "/usr/local/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/usr/local/lib/python3.8/multiprocessing/popen_forkserver.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 154, in get_preparation_data
_check_not_importing_main()
File "/usr/local/lib/python3.8/multiprocessing/spawn.py", line 134, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

This probably means that you are not using fork to start your
child processes and you have forgotten to use the proper idiom
in the main module:

    if __name__ == '__main__':
        freeze_support()
        ...

The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce an executable.
--
This E-mail is confidential. It may also be legally privileged. If you are not the addressee you may not copy, forward, disclose or use any part of it. If you have received this message in error, please delete all copies of it from your system and notify the sender immediately by return E-mail. The sender does not accept liability for any errors or omissions.
You received this message because you are subscribed to the Google Groups "modin-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modin-dev+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/modin-dev/bc5485ad-1164-4dac-855d-2c6d255a5e1c%40googlegroups.com.
import phonenumbers
import time

# The guard keeps worker processes that re-import this file from re-running the code below.
if __name__ == '__main__':
    import modin.pandas as pd
    start_time = time.time()
    data = pd.read_csv('5lac.csv')
    # Prefix '+' so every number carries its country code
    data['Mobile'] = data['Mobile'].astype(str).apply(lambda x: '+' + x)
    data['valid1'] = data['Mobile'].apply(lambda x: phonenumbers.is_valid_number(phonenumbers.parse(x, 'IN')))
    data.to_csv("5lac1.csv")
    print("--- %s seconds ---" % (time.time() - start_time))
CPU times: user 5.6 s, sys: 335 ms, total: 5.93 s
Wall time: 50.4 s
UserWarning: The Dask Engine for Modin is experimental.
UserWarning: Large object of size 8.00 MB detected in task graph:
(0, <function PandasQueryCompiler.setitem.<locals> ... 7b7734d5f26bb')
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
UserWarning: `DataFrame.to_csv` defaulting to pandas implementation.
To request implementation, send an email to feature_...@modin.org.
--- 49.38035869598389 seconds ---
from multiprocessing import Pool
import pandas as pd
import phonenumbers
import time

def fn(num):
    # True if the parsed number is valid
    return phonenumbers.is_valid_number(phonenumbers.parse(num))

def fn1(num):
    # Country calling code of the parsed number
    return phonenumbers.parse(num).country_code

# Guard the entry point so worker processes do not re-run it
if __name__ == '__main__':
    start_time = time.time()
    df = pd.read_csv('5lac.csv')
    # Prefix '+' so each number parses without a default region
    df['Mobile'] = df['Mobile'].astype(str).apply(lambda x: '+' + x)
    with Pool(16) as pool:
        df['Valid'] = pool.map(fn, df['Mobile'])
        df['Country Code'] = pool.map(fn1, df['Mobile'])
    print("--- %s seconds ---" % (time.time() - start_time))
The slight change is that pool.map applies the function to each element of the DataFrame column in parallel across the worker processes.
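Since fn and fn1 each parse every number, one possible variation is to parse each number once and return both results from a single pool.map call. This is only a rough sketch, not something measured in this thread: add_plus, check_number, and check_all are hypothetical names, the chunksize and processes defaults are arbitrary guesses, and phonenumbers is assumed to be installed.

```python
from multiprocessing import Pool


def add_plus(raw):
    """Return the number as a string with a leading '+' (hypothetical helper)."""
    text = str(raw).strip()
    return text if text.startswith("+") else "+" + text


def check_number(number):
    """Parse one '+'-prefixed number once; return (is_valid, country_code)."""
    import phonenumbers  # third-party package, imported on first call

    try:
        parsed = phonenumbers.parse(number)
    except phonenumbers.NumberParseException:
        # Unparseable input is reported as invalid instead of raising
        return (False, None)
    return (phonenumbers.is_valid_number(parsed), parsed.country_code)


def check_all(raw_numbers, processes=4, chunksize=1000):
    """Check every number in parallel; returns one (is_valid, country_code) pair per input."""
    numbers = [add_plus(n) for n in raw_numbers]
    with Pool(processes) as pool:
        # chunksize batches the work so each task sent to a worker covers many numbers
        return pool.map(check_number, numbers, chunksize=chunksize)
```

It would be called as check_all(df['Mobile']) from inside an if __name__ == '__main__': block, and the returned pairs can then be split into the Valid and Country Code columns. Whether this is actually faster on the 5lac.csv file would have to be timed.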
Do you think Modin would be able to perform better than manual multiprocessing?
Do you think adding a GPU would increase the performance of manual multiprocessing, or even of Modin, in this case?