Unable to train the model in python


Anshul Gupta

Jun 5, 2018, 1:52:07 AM
to H2O Open Source Scalable Machine Learning - h2ostream
I am implementing a random forest classifier in Python and am stuck training my model.
modelRF = h2o4gpu.solvers.xgboost.RandomForestClassifier()
modelRF.fit(X = featureNames, y = dependentVar)
# the last line throws the error below:

ValueError                                Traceback (most recent call last)
<ipython-input-20-740c890799b1> in <module>()
     19     )
     20     print("herer")
---> 21     modelRF.fit(X = featureNames, y = dependentVar)#, sample_weight=None) #training_frame = train, validation_frame = valid)
     22 
     23     # Variable Importance

/usr/local/lib/python3.6/dist-packages/h2o4gpu/solvers/xgboost.py in fit(self, X, y, sample_weight)
    315 
    316     def fit(self, X, y=None, sample_weight=None):
--> 317         res = self.model.fit(X, y, sample_weight)
    318         self.set_attributes()
    319         return res

/usr/local/lib/python3.6/dist-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, early_stopping_threshold, early_stopping_limit, verbose, xgb_model, sample_weight_eval_set)
    516                 xgb_options.update({"eval_metric": eval_metric})
    517 
--> 518         self._le = XGBLabelEncoder().fit(y)
    519         training_labels = self._le.transform(y)
    520 

/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py in fit(self, y)
     93         self : returns an instance of self.
     94         """
---> 95         y = column_or_1d(y, warn=True)
     96         self.classes_ = np.unique(y)
     97         return self

/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    612         return np.ravel(y)
    613 
--> 614     raise ValueError("bad input shape {0}".format(shape))
    615 
    616 

ValueError: bad input shape ()
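For reference, `column_or_1d` raises `bad input shape ()` here because `y = dependentVar` is the column *name* (a plain string), which NumPy sees as a 0-d array; `fit()` expects the actual label values, one per row. A minimal sketch of the difference (the label values below are hypothetical):

```python
import numpy as np

# Passing the column *name* gives a 0-d array -> "bad input shape ()"
dependentVar = 'linked'
print(np.asarray(dependentVar).shape)   # ()

# fit() wants the label values themselves, one per row
y = np.array(['yes', 'no', 'yes'])      # hypothetical labels
print(np.asarray(y).shape)              # (3,)
```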

Please help.


Anshul Gupta

Jun 6, 2018, 1:10:37 PM
to H2O Open Source Scalable Machine Learning - h2ostream
The errors are all sorted out.
The final code that I was able to implement is given below:

import h2o4gpu
import pickle
import os
import numpy as np
from hitRatio import hitRatio  # my own Python package; you don't need to import this
from math import sqrt as sqrt, ceil

dependentVar = 'linked'  # declared according to my code
numberOfVars = 40  # declared according to my code
import pandas as pd
trainData = pd.read_csv('Data.csv')

import time
train_X = np.array(trainData[featureNames])  # featureNames is also declared according to my code
train_y = np.array(trainData[dependentVar])
t1 = time.time()
modelRF = h2o4gpu.solvers.xgboost.RandomForestClassifier()
modelRF.fit(X = train_X, y = train_y)
t2 = time.time()
print('Time taken to train the model is {}'.format(round(t2-t1,4)))  

t3=time.time()
modelRF.predict_proba(train_X)
#print(modelRF.predict_proba(train_X))
t4=time.time()
print('Time taken to predict the model is {}'.format(round(t4-t3,4)))

This is how the model is trained and how it predicts on the data.

This model sped up my task by about 180x.

Thanks 
Anshul Gupta

Lauren DiPerna

Jun 6, 2018, 1:38:59 PM
to Anshul Gupta, H2O Open Source Scalable Machine Learning - h2ostream
Thanks so much for posting your solution!

- lauren

--
You received this message because you are subscribed to the Google Groups "H2O Open Source Scalable Machine Learning - h2ostream" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2ostream+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Anshul Gupta

Jun 7, 2018, 1:18:22 PM
to H2O Open Source Scalable Machine Learning - h2ostream
The above code runs fine on a 23 GB dataset, but when I tried running it on 35 GB of data it breaks partway through. I think the reason is that a pandas DataFrame cannot load datasets that are many GBs in size.
Please kindly suggest an alternative for the same.
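One common workaround is pandas' `chunksize` option on `read_csv`, which reads the file one block at a time instead of all at once. A sketch, not tested on the real data: a small demo file stands in for `Data.csv`, and downcasting float64 to float32 is an assumed memory saver you may not need.

```python
import numpy as np
import pandas as pd

# Demo file standing in for the real Data.csv from the earlier post
pd.DataFrame({'f1': np.arange(10.0), 'linked': [0, 1] * 5}).to_csv('demo.csv', index=False)

total_rows = 0
# Read a fixed number of rows per iteration; for a 35 GB file you might use ~1_000_000
for chunk in pd.read_csv('demo.csv', chunksize=3):
    # Downcast floats to float32 to roughly halve per-chunk memory
    float_cols = chunk.select_dtypes('float64').columns
    chunk[float_cols] = chunk[float_cols].astype('float32')
    total_rows += len(chunk)

print(total_rows)  # 10
```

Each chunk can be processed (or written back out in a compact format) before the next is read, so peak memory stays at one chunk rather than the whole file.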

Thanks 
-Anshul  

Darren Cook

Jun 7, 2018, 1:41:39 PM
to h2os...@googlegroups.com
> The above code runs fine on a 23 GB dataset, but when I tried running it on
> 35 GB of data it breaks partway through. I think the reason is that a pandas
> DataFrame cannot load datasets that are many GBs in size.

How much GPU memory do you have?

(Does H2O4GPU automatically chunk training data to match the GPU memory
size?)

How much main memory do you have?

Darren

Anshul Gupta

Jun 11, 2018, 8:20:31 AM
to H2O Open Source Scalable Machine Learning - h2ostream
I have 16 GB of GPU memory.
So, how do I run my model on large datasets?
I think the answer is slicing small blocks out of the whole dataset, then predicting on each block and storing the results.
So, how do I do these steps?
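The slicing idea above can be sketched with NumPy's `array_split`: predict on one block at a time and stitch the results back together. The `DummyModel` below is a stand-in for the trained `modelRF`, just to show the mechanics.

```python
import numpy as np

def predict_in_batches(model, X, n_batches):
    """Run predict_proba on X one slice at a time and concatenate the results."""
    parts = [model.predict_proba(block) for block in np.array_split(X, n_batches)]
    return np.concatenate(parts)

# Stand-in for modelRF, only so the example runs end to end
class DummyModel:
    def predict_proba(self, X):
        return np.column_stack([1 - X[:, 0], X[:, 0]])

X = np.random.rand(1000, 5)
proba = predict_in_batches(DummyModel(), X, n_batches=10)
print(proba.shape)  # (1000, 2)
```

With a real model, `n_batches` would be chosen so that one block (plus the model) fits in the 16 GB of GPU memory.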