I have nearZeroVar implemented in python+numpy+multiprocessing


Geoffrey Anderson

Jan 27, 2017, 2:09:29 PM
to H2O Open Source Scalable Machine Learning - h2ostream
Hello @ErinLeDell @Arno:

Backstory: 
I needed a really fast nearZeroVar (NZV) for a project, but I could not find one.
R's caret::nearZeroVar is slow in my tests, much too slow for my current project with a 2MM x 4K dataset.
I thank and appreciate "topepo" (Max) for being the trailblazing innovator with caret's original nearZeroVar.
R+GBM supervised learning accuracy was greatly improved when I used caret::nearZeroVar on a previous project, versus plain GBM.
Stated another way, GBM was merely mediocre when I used it all by itself.
I definitely want to use NZV in H2O. 
I saw that Arno and Erin are listed on the NZV JIRA entry.

What's new:
Using python+numpy+multiprocessing, I have recently finished developing a correct and fast parallel nearZeroVariance implementation. 
Unit tests show the results match R's caret::nearZeroVar, which I use as the reference.
On the Higgs dataset I am seeing about 1.1 million rows per second for the NZV computation on a 16-core Xeon shared-memory host.
It is 6x to 13x faster than R+caret on the same host.
I did the research to ensure this new implementation works with the two threshold parameters (freqCut and uniqueCut in caret), like caret does.
Numeric and missing values are handled identically to caret.
Categorical data does not work yet, but I expect it only needs a categorical-to-integer pre-processing step before the data is passed to my routines.
Higgs and Adult are the UCI datasets I have developed on.
I am basically using a parallel map. There is no need for a reduce step.
Every column of the dataset is processed independently by the parallel map.
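
To make the description above concrete, here is a minimal sketch of a per-column NZV test driven by a parallel map over columns. It is not the FindNZV.py code from this post. The names is_nzv and find_nzv_columns are hypothetical, and the rule (frequency ratio above freqCut, percent unique at or below uniqueCut, with caret's documented defaults of 95/5 and 10) is written from a reading of caret's documentation, so details such as the treatment of all-missing columns are assumptions.

```python
import numpy as np
from multiprocessing import Pool

FREQ_CUT = 95 / 5   # caret's documented default for freqCut
UNIQUE_CUT = 10     # caret's documented default for uniqueCut (a percentage)


def is_nzv(col, freq_cut=FREQ_CUT, unique_cut=UNIQUE_CUT):
    """Return True if a 1-D float column is zero- or near-zero-variance."""
    n_rows = col.shape[0]
    values = col[~np.isnan(col)]            # drop missing values before counting
    _, counts = np.unique(values, return_counts=True)
    if counts.size <= 1:
        # Constant column, or (assumption) a column that is entirely missing.
        return True
    second, top = np.sort(counts)[-2:]      # two largest value counts
    freq_ratio = top / second               # most common vs. second most common
    # Assumption: percent unique is taken over all rows, missing ones included.
    percent_unique = 100.0 * counts.size / n_rows
    return bool(freq_ratio > freq_cut and percent_unique <= unique_cut)


def find_nzv_columns(data, processes=1):
    """Map is_nzv over the columns of a 2-D array; no reduce step is needed."""
    columns = [data[:, j] for j in range(data.shape[1])]
    if processes == 1:
        flags = [is_nzv(c) for c in columns]        # serial fallback
    else:
        with Pool(processes) as pool:
            flags = pool.map(is_nzv, columns)       # one task per column
    return [j for j, flag in enumerate(flags) if flag]
```

With processes greater than 1, call find_nzv_columns from under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly. Each column is pickled and sent to a worker in this sketch, which is simple but not necessarily the fastest way to share a large array.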

What's next:
I hope to attempt an H2O submission for the first time by rewriting my nearZeroVariance code in Java on H2O using MapReduce.
I am not a Java person per se, but I have used Java on the job (a long time ago) for one or two production apps in finance.
I would like to try my hand on H2O with an NZV submission.
Soon I intend to watch some videos by Clif on how to contribute to H2O.
Question: Is there a VirtualBox image I can download that has everything a developer needs, so that I can start developing this for H2O quickly? I have Linux 16.04 LTS.


Thank you

Geoffrey Anderson


P.S. Below are screen captures using the HIGGS and Adult datasets.

ga@ga-HP-Z820:/mnt/fastssd/nzv$ python FindNZV.py --dataFileMask='HIGGS.csv' --xcolsStart=0 --xcolsEnd=29 --numProcesses=16

Unit tests...

91

98

998

Starting reading CSV file to find nzv positive columns at 2017-01-27 12:09:19.874483...

Finished reading CSV file.

Time started / ended / elapsed: 2017-01-27 12:09:19.874483 / 2017-01-27 12:11:10.975666 / 0:01:51.101183

99008.83773667828 rows/s were processed.


Starting actually finding nzv positive columns at 2017-01-27 12:11:10.975901...

Finished finding nzv positive columns.

Time started / ended / elapsed: 2017-01-27 12:11:10.975901 / 2017-01-27 12:11:20.476087 / 0:00:09.500186

1157871.9616647507 rows/s were processed.

0 columns are nzv positive out of 29

Columns that are nzv positive:

[]


The parallel run took 9.5 seconds, versus 66 seconds with a single worker: 66/9.5 = 6.9x parallel speedup. This was on a dual-Xeon host with 16 physical cores. 1 million rows per second is not shabby!
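
The speedup arithmetic above can be checked in a few lines, along with the parallel efficiency it implies. This is only a sketch: the helper names are hypothetical, the timings are the ones quoted in this post, and the worker count of 16 is taken from the --numProcesses=16 flag in the command above.

```python
def parallel_speedup(serial_seconds, parallel_seconds):
    """Ratio of single-worker time to multi-worker time."""
    return serial_seconds / parallel_seconds


def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / workers


speedup = parallel_speedup(66.0, 9.5)          # timings quoted in the post
efficiency = parallel_efficiency(speedup, 16)  # 16 worker processes
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
# prints "speedup: 6.9x, efficiency: 43%"
```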


The answer was always “no nzv positive columns”, which is correct in all cases. Nice.



Checking correctness using adult.csv, we can see the results are identical to those from R. Of the numeric columns, capitalgain and capitalloss are the NZV columns:


ga@ga-HP-Z820:/mnt/fastssd/nzv$ python FindNZV.py --dataFileMask=adult.csv --xcolsList 0 2 4 10 11 12 --numProcesses=16

Unit tests...

91

98

998

xcolsStart: 0

xcolsEnd: 29

xcolsList: [0, 2, 4, 10, 11, 12]

Starting reading CSV file to find nzv positive columns at 2017-01-27 13:22:42.320275...

Finished reading CSV file.

Time started / ended / elapsed: 2017-01-27 13:22:42.320275 / 2017-01-27 13:22:42.387539 / 0:00:00.067264

484077.66412940057 rows/s were processed.


Starting actually finding nzv positive columns at 2017-01-27 13:22:42.387647...

Finished finding nzv positive columns.

Time started / ended / elapsed: 2017-01-27 13:22:42.387647 / 2017-01-27 13:22:42.432991 / 0:00:00.045344

718088.3909668312 rows/s were processed.

xcols: [0, 2, 4, 10, 11, 12]

results: [False, False, False, True, True, False]

2 columns are nzv positive out of 6

Columns that are nzv positive:

[10 11]

Index([' capitalgain', ' capitalloss'], dtype='object')
