ga@ga-HP-Z820:/mnt/fastssd/nzv$ python FindNZV.py --dataFileMask='HIGGS.csv' --xcolsStart=0 --xcolsEnd=29 --numProcesses=16
Unit tests...
91
98
998
Starting reading CSV file to find nzv positive columns at 2017-01-27 12:09:19.874483...
Finished reading CSV file.
Time started / ended / elapsed: 2017-01-27 12:09:19.874483 / 2017-01-27 12:11:10.975666 / 0:01:51.101183
99008.83773667828 rows/s were processed.
Starting actually finding nzv positive columns at 2017-01-27 12:11:10.975901...
Finished finding nzv positive columns.
Time started / ended / elapsed: 2017-01-27 12:11:10.975901 / 2017-01-27 12:11:20.476087 / 0:00:09.500186
1157871.9616647507 rows/s were processed.
0 columns are nzv positive out of 29
Columns that are nzv positive:
[]
The NZV pass took 9.5 seconds, versus 66 seconds with a single worker: 66/9.5 = 6.9x parallel speedup. This was on a dual-Xeon host with 16 physical cores. Over 1 million rows per second is not too shabby!
The answer was always “no nzv positive columns”, which is correct in all cases. Nice.
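As a quick sanity check on the speedup figure, the arithmetic can be sketched in a few lines of Python. The timings are the ones quoted here; the "parallel efficiency" number (speedup divided by worker count) is an extra derived figure, not something the script reports.

```python
# Timings quoted in the text: 66 s single worker, 9.5 s with 16 processes.
single_worker_s = 66.0
parallel_s = 9.5
num_processes = 16

# Speedup is the ratio of serial time to parallel time.
speedup = single_worker_s / parallel_s

# Efficiency is how much of the ideal 16x speedup was actually achieved.
efficiency = speedup / num_processes

print(f"speedup: {speedup:.1f}x")
print(f"parallel efficiency: {efficiency:.0%}")
```

An efficiency well below 100% is common for this kind of job, since process startup and passing data between workers are not free.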
Checking correctness using adult.csv, we can see the results are identical to those from R. Of the numeric columns, capitalgain and capitalloss are the NZV columns:
ga@ga-HP-Z820:/mnt/fastssd/nzv$ python FindNZV.py --dataFileMask=adult.csv --xcolsList 0 2 4 10 11 12 --numProcesses=16
Unit tests...
91
98
998
xcolsStart: 0
xcolsEnd: 29
xcolsList: [0, 2, 4, 10, 11, 12]
Starting reading CSV file to find nzv positive columns at 2017-01-27 13:22:42.320275...
Finished reading CSV file.
Time started / ended / elapsed: 2017-01-27 13:22:42.320275 / 2017-01-27 13:22:42.387539 / 0:00:00.067264
484077.66412940057 rows/s were processed.
Starting actually finding nzv positive columns at 2017-01-27 13:22:42.387647...
Finished finding nzv positive columns.
Time started / ended / elapsed: 2017-01-27 13:22:42.387647 / 2017-01-27 13:22:42.432991 / 0:00:00.045344
718088.3909668312 rows/s were processed.
xcols: [0, 2, 4, 10, 11, 12]
results: [False, False, False, True, True, False]
2 columns are nzv positive out of 6
Columns that are nzv positive:
[10 11]
Index([' capitalgain', ' capitalloss'], dtype='object')
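For readers who have not met the near-zero-variance test before, here is a minimal single-column sketch. It is not the code from FindNZV.py. It assumes the same rule as the default settings of R's caret `nearZeroVar` (frequency ratio above 95/5 and at most 10% unique values), and the name `is_nzv` and its parameters are made up for illustration.

```python
from collections import Counter


def is_nzv(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a column looks near-zero-variance (hypothetical helper).

    Assumed rule, modelled on caret's nearZeroVar defaults:
      - frequency ratio (most common count / second most common count) > freq_cut
      - percentage of distinct values <= unique_cut
    A column with a single distinct value (zero variance) also counts as NZV.
    """
    if len(values) == 0:
        raise ValueError("column is empty")

    top_two = Counter(values).most_common(2)
    if len(top_two) == 1:
        return True  # only one distinct value: zero variance

    freq_ratio = top_two[0][1] / top_two[1][1]
    percent_unique = 100.0 * len(set(values)) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


# A column that is almost always 0, with a few scattered non-zero values.
mostly_zero = [0] * 97 + [5, 7, 9]
# A column where every value is different.
varied = list(range(100))

print(is_nzv(mostly_zero))
print(is_nzv(varied))
```

A column dominated by one value with only a few scattered others is the kind of shape this rule is designed to flag, which fits capitalgain and capitalloss being the two flagged columns above.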