Unable to train the model in python


Anshul Gupta

Jun 5, 2018, 1:52:07 AM
to H2O Open Source Scalable Machine Learning - h2ostream
I am implementing a random forest classifier in Python and am stuck training my model.
modelRF = h2o4gpu.solvers.xgboost.RandomForestClassifier()
modelRF.fit(X = featureNames, y = dependentVar)
# the last line throws the error below:

ValueError                                Traceback (most recent call last)
<ipython-input-20-740c890799b1> in <module>()
     19     )
     20     print("herer")
---> 21     modelRF.fit(X = featureNames, y = dependentVar)#, sample_weight=None) #training_frame = train, validation_frame = valid)
     22 
     23     # Variable Importance

/usr/local/lib/python3.6/dist-packages/h2o4gpu/solvers/xgboost.py in fit(self, X, y, sample_weight)
    315 
    316     def fit(self, X, y=None, sample_weight=None):
--> 317         res = self.model.fit(X, y, sample_weight)
    318         self.set_attributes()
    319         return res

/usr/local/lib/python3.6/dist-packages/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, early_stopping_threshold, early_stopping_limit, verbose, xgb_model, sample_weight_eval_set)
    516                 xgb_options.update({"eval_metric": eval_metric})
    517 
--> 518         self._le = XGBLabelEncoder().fit(y)
    519         training_labels = self._le.transform(y)
    520 

/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/label.py in fit(self, y)
     93         self : returns an instance of self.
     94         """
---> 95         y = column_or_1d(y, warn=True)
     96         self.classes_ = np.unique(y)
     97         return self

/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    612         return np.ravel(y)
    613 
--> 614     raise ValueError("bad input shape {0}".format(shape))
    615 
    616 

ValueError: bad input shape ()
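For reference, `column_or_1d` raises `bad input shape ()` here because `y = dependentVar` is the column *name* (a plain string), which NumPy sees as a 0-d array; `fit()` expects the actual label values, one per row. A minimal sketch of the difference (the label values below are hypothetical):

```python
import numpy as np

# Passing the column *name* gives a 0-d array -> "bad input shape ()"
dependentVar = 'linked'
print(np.asarray(dependentVar).shape)   # ()

# fit() wants the label values themselves, one per row
y = np.array(['yes', 'no', 'yes'])      # hypothetical labels
print(np.asarray(y).shape)              # (3,)
```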

Please help.


Anshul Gupta

Jun 6, 2018, 1:10:37 PM
to H2O Open Source Scalable Machine Learning - h2ostream
The errors are all sorted out.
The final code that I was able to implement is given below:

import h2o4gpu
import pickle
import os
import numpy as np
from hitRatio import hitRatio  # my own Python package; you don't need to import this
from math import sqrt as sqrt, ceil

dependentVar = 'linked'  # declared according to my code
numberOfVars = 40  # declared according to my code
import pandas as pd
trainData = pd.read_csv('Data.csv')

import time
train_X = np.array(trainData[featureNames])  # featureNames is also declared according to my code
train_y = np.array(trainData[dependentVar])
t1 = time.time()
modelRF = h2o4gpu.solvers.xgboost.RandomForestClassifier()
modelRF.fit(X = train_X, y = train_y)
t2 = time.time()
print('Time taken to train the model is {}'.format(round(t2-t1,4)))  

t3=time.time()
modelRF.predict_proba(train_X)
#print(modelRF.predict_proba(train_X))
t4=time.time()
print('Time taken to predict the model is {}'.format(round(t4-t3,4)))

This is how the model is trained and how it predicts on the data.

This model sped up my task by about 180x.

Thanks 
Anshul Gupta

Lauren DiPerna

Jun 6, 2018, 1:38:59 PM
to Anshul Gupta, H2O Open Source Scalable Machine Learning - h2ostream
Thanks so much for posting your solution!

- lauren

--
You received this message because you are subscribed to the Google Groups "H2O Open Source Scalable Machine Learning - h2ostream" group.
To unsubscribe from this group and stop receiving emails from it, send an email to h2ostream+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Anshul Gupta

Jun 7, 2018, 1:18:22 PM
to H2O Open Source Scalable Machine Learning - h2ostream
The above code runs fine on a 23 GB dataset, but when I tried running it on 35 GB of data it breaks partway through. I think the reason is that a pandas DataFrame cannot load datasets that are many GBs in size.
Please kindly suggest an alternative for the same.
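One common workaround is pandas' `chunksize` option on `read_csv`, which reads the file one block at a time instead of all at once. A sketch, not tested on the real data: a small demo file stands in for `Data.csv`, and downcasting float64 to float32 is an assumed memory saver you may not need.

```python
import numpy as np
import pandas as pd

# Demo file standing in for the real Data.csv from the earlier post
pd.DataFrame({'f1': np.arange(10.0), 'linked': [0, 1] * 5}).to_csv('demo.csv', index=False)

total_rows = 0
# Read a fixed number of rows per iteration; for a 35 GB file you might use ~1_000_000
for chunk in pd.read_csv('demo.csv', chunksize=3):
    # Downcast floats to float32 to roughly halve per-chunk memory
    float_cols = chunk.select_dtypes('float64').columns
    chunk[float_cols] = chunk[float_cols].astype('float32')
    total_rows += len(chunk)

print(total_rows)  # 10
```

Each chunk can be processed (or written back out in a compact format) before the next is read, so peak memory stays at one chunk rather than the whole file.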

Thanks 
-Anshul  

Darren Cook

Jun 7, 2018, 1:41:39 PM
to h2os...@googlegroups.com
> The above code runs fine on a 23 GB dataset, but when I tried running it on
> 35 GB of data it breaks partway through. I think the reason is that a pandas
> DataFrame cannot load datasets that are many GBs in size.

How much GPU memory do you have?

(Does H2O4GPU automatically chunk training data to match the GPU memory
size?)

How much main memory do you have?

Darren

Anshul Gupta

Jun 11, 2018, 8:20:31 AM
to H2O Open Source Scalable Machine Learning - h2ostream
I have 16 GB of GPU memory.
So, how do I run my model on large datasets?
I think the answer is slicing small blocks out of the whole dataset, then predicting on each block and storing the results.
So, how do I do these steps?
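The slicing idea above can be sketched with NumPy's `array_split`: predict on one block at a time and stitch the results back together. The `DummyModel` below is a stand-in for the trained `modelRF`, just to show the mechanics.

```python
import numpy as np

def predict_in_batches(model, X, n_batches):
    """Run predict_proba on X one slice at a time and concatenate the results."""
    parts = [model.predict_proba(block) for block in np.array_split(X, n_batches)]
    return np.concatenate(parts)

# Stand-in for modelRF, only so the example runs end to end
class DummyModel:
    def predict_proba(self, X):
        return np.column_stack([1 - X[:, 0], X[:, 0]])

X = np.random.rand(1000, 5)
proba = predict_in_batches(DummyModel(), X, n_batches=10)
print(proba.shape)  # (1000, 2)
```

With a real model, `n_batches` would be chosen so that one block (plus the model) fits in the 16 GB of GPU memory.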