Difference in Model Metrics on different Operating Systems

Aditya Kumar

Oct 6, 2020, 6:03:04 AM

to h2os...@googlegroups.com

Hi,

I have been using H2O on different operating systems to train the same model. Given the same input data and the same model parameters (options), I see a few differences in the model metrics, mostly in the AUC, PRAUC, AIC, and Gini values. Often these differences appear at the third decimal place or beyond, but sometimes they appear at the first or second decimal place.

I did provide seed values while building these models and made sure that I use the same h2o version. Still, I see these differences.

For example, for the same model, I can see the following different values on Windows 10 and Windows 2016.

1. Training Metrics differences:

Windows 10
AUC: 0.826815
pr_auc: 0.666271
Gini: 0.653631

Windows 2016
AUC: 0.826828
pr_auc: 0.666263
Gini: 0.653656

These are examples of minor differences, but sometimes I do see differences at the first or second decimal place.
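To put the training metrics above in perspective, here is a minimal Java sketch that computes the absolute gap between the two runs and checks the standard relation Gini = 2 * AUC - 1 against the reported values. The class and method names are illustrative only and are not part of the H2O API; the numbers are the ones quoted in this message.

```java
final class MetricGap {

    /** Standard relation between the Gini coefficient and AUC for a binary classifier. */
    static double giniFromAuc(double auc) {
        return 2.0 * auc - 1.0;
    }

    /** Absolute difference between the same metric reported by two runs. */
    static double gap(double runA, double runB) {
        return Math.abs(runA - runB);
    }

    public static void main(String[] args) {
        // Training metrics quoted above (Windows 10 vs Windows 2016).
        double aucWin10 = 0.826815, aucWin2016 = 0.826828;
        double prAucWin10 = 0.666271, prAucWin2016 = 0.666263;
        double giniWin10 = 0.653631, giniWin2016 = 0.653656;

        System.out.printf("AUC gap:    %.2e%n", gap(aucWin10, aucWin2016));
        System.out.printf("pr_auc gap: %.2e%n", gap(prAucWin10, prAucWin2016));
        System.out.printf("Gini gap:   %.2e%n", gap(giniWin10, giniWin2016));

        // The reported Gini values should agree with 2 * AUC - 1 up to rounding.
        System.out.printf("Gini from AUC (Windows 10):   %.6f%n", giniFromAuc(aucWin10));
        System.out.printf("Gini from AUC (Windows 2016): %.6f%n", giniFromAuc(aucWin2016));
    }
}
```

Since Gini is derived from AUC in this way, a gap in AUC shows up as a gap of twice the size in Gini, so the two are not independent discrepancies.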

I have a few questions about this:

1. Are some differences in model metrics expected between different operating systems (Windows 10, 2016, 2019, Linux flavours)? If yes, what is the reason? If no, why am I getting these differences?

2. If we provide the same seed value, the same input data, and the same model parameters, I expect to get exactly the same model and the same model metrics. If that is not the case, what could be the reason for the differences?

3. Can Java version or Java vendor impact the model metrics?

Thanks,

Aditya

Darren Cook

Oct 6, 2020, 6:24:03 AM

to h2os...@googlegroups.com

> I have been using h2o on different operating systems to train the same
> model....
>
> I did provide seed values while building these models and made sure that I
> use the same h2o version. Still, I see these differences.

Hello Aditya, what type of model were you building?

For deep learning you must only use one thread (see the reproducible flag).

For GBM there is a FAQ entry:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/reproducibility.html

Search the docs for the other algorithms.

If using grids and autoML, they also use a random seed, so
reproducibility gets even more complicated.

Generally, reproducibility is not a goal worth pursuing, as it slows
down model building.

Darren

aditya...@gmail.com

Oct 6, 2020, 1:13:18 PM

to H2O Open Source Scalable Machine Learning - h2ostream

Hi Darren,

Thanks for the reply.

I am mostly using GLM (Logistic Regression); however, I sometimes see the same issue with other models as well. I have not used Deep Learning or GBM so far. Also, no AutoML or grid search is being used.

I checked the shared link. According to it, if the input data is the same, the parameters are the same, the same seed is used when sampling is involved, and no early stopping is used, then it should give us the same result.

I understand that from a model-building perspective this does not look like a big issue, but to some extent it tells me that the built models are different. Do these small differences have an impact on the model?

Also, I would like to know what causes these differences during model building?

Thanks,

Aditya

Darren Cook

Oct 6, 2020, 2:23:35 PM

to h2os...@googlegroups.com

> I am mostly using GLM(Logistic Regression)...

It might be worth posting some code and sample data (that shows the
difference you see), either here or on StackOverflow. I think of GLM as
deterministic, but maybe some solver/family combinations are not.

Darren

aditya...@gmail.com

Oct 9, 2020, 2:42:57 AM

to H2O Open Source Scalable Machine Learning - h2ostream

Hi Darren,

I have created sample code to reproduce the issue and am sharing it here. It's a code snippet for Logistic Regression with and without CV. I am also attaching the console output on Windows 10 and Windows 2016, with and without CV, along with my sample dataset.

Here, I am seeing small differences in the AUC and PRAUC values, for training as well as for the n-fold (CV) metrics. Looking more closely, I also see differences in Gini (n-fold) and AIC (n-fold, test, and train).

I have provided the model output differences for Windows 10 and Windows 2016, but I have seen these differences on Windows 2019 and Linux as well.

// Imports assumed for the h2o-bindings Java client:
import java.util.UUID;

import water.bindings.H2oApi;
import water.bindings.pojos.*;

void trainGLMLogisticModel(String modelName, String url) throws Exception {
    H2oApi h2o = url != null ? new H2oApi(url) : new H2oApi();

    // Utility var:
    JobV3 job = null;

    // STEP 0: init a session
    String sessionId = h2o.newSession().sessionKey;

    // STEP 1: import raw file
    ImportFilesV3 importBody = h2o.importFiles("C:\\SampleDataSets\\DirectBankUSAx.csv");
    System.out.println("import: " + importBody);

    // STEP 2: parse setup
    ParseSetupV3 parseSetupBody = h2o.guessParseSetup(
            H2oApi.stringArrayToKeyArray(importBody.destinationFrames, FrameKeyV3.class));
    System.out.println("parseSetupBody: " + parseSetupBody);

    String completeFrameName = modelName + ".data";
    String trainingFrameName = modelName + ".train";
    String validationFrameName = modelName + ".validation";

    // STEP 3: parse into columnar Frame
    ParseV3 parseParms = new ParseV3();
    H2oApi.copyFields(parseParms, parseSetupBody);
    parseParms.destinationFrame = H2oApi.stringToFrameKey(completeFrameName);
    parseParms.blocking = true; // alternately, call h2o.waitForJobCompletion(parseSetupBody.job)
    ParseV3 parseBody = h2o.parse(parseParms);
    System.out.println("parseBody: " + parseBody);

    String seed = "17706";
    String ratio = "0.70";

    // STEP 4: Split into test and train datasets using a seeded uniform random column
    String tmpVec = "tmp_" + UUID.randomUUID().toString();
    String splitExpr = "(, " +
            " (tmp= " + tmpVec + " (h2o.runif " + completeFrameName + " " + seed + "))" +
            " (assign " + trainingFrameName +
            " (rows " + completeFrameName + " (<= " + tmpVec + " " + ratio + ")))" +
            " (assign " + validationFrameName +
            " (rows " + completeFrameName + " (> " + tmpVec + " " + ratio + ")))" +
            " (rm " + tmpVec + "))";

    RapidsSchemaV3 rapidsParms = new RapidsSchemaV3();
    rapidsParms.sessionId = sessionId;
    rapidsParms.ast = splitExpr;
    h2o.rapidsExec(rapidsParms);
    System.out.println("Split data into train and test");

    // STEP 5: Train the model (NOTE: no polling was needed for the parse because we specified blocking above)
    GLMParametersV3 glmParms = new GLMParametersV3();

    // Comment out the next 3 lines if you don't want to provide a model name; a default name will be used.
    ModelKeyV3 modelKey = new ModelKeyV3();
    modelKey.name = modelName; // Provide the model name.
    glmParms.modelId = modelKey;

    glmParms.seed = 15341;
    glmParms.family = GLMFamily.binomial;
    glmParms.trainingFrame = H2oApi.stringToFrameKey(trainingFrameName);
    glmParms.validationFrame = H2oApi.stringToFrameKey(validationFrameName);

    ColSpecifierV3 responseColumn = new ColSpecifierV3();
    responseColumn.columnName = "TrialRespond";
    glmParms.responseColumn = responseColumn;

    glmParms.solver = GLMSolver.IRLSM;
    glmParms.alpha = new double[] {1.0};
    glmParms.lambda = new double[] {0.01};
    glmParms.lambdaSearch = false;
    glmParms.earlyStopping = true;
    glmParms.nlambdas = -1;
    glmParms.standardize = true;
    glmParms.missingValuesHandling = GLMMissingValuesHandling.MeanImputation;
    glmParms.maxIterations = 700;
    glmParms.objectiveEpsilon = 0.43;
    glmParms.gradientEpsilon = -1.0;
    glmParms.link = GLMLink.logit;
    glmParms.maxActivePredictors = 500;
    glmParms.nfolds = 5; // set to 0 for the run without cross-validation

    System.out.println("About to train GLM. . .");
    GLMV3 glmBody = h2o.train_glm(glmParms);
    System.out.println("glmBody: " + glmBody);

    // STEP 6: poll for completion
    job = h2o.waitForJobCompletion(glmBody.job.key);
    System.out.println("GLM build done.");

    // Delete all the frames (deleteFrame is my own helper, not shown here)
    deleteFrame(h2o, completeFrameName);
    deleteFrame(h2o, trainingFrameName);
    deleteFrame(h2o, validationFrameName);

    // STEP 99: end the session
    h2o.endSession();
}

Some output differences which I have observed:

No Cross Validation - Training Values

frame id: AdityaGLMLogistic.train

Windows 10

AUC: 0.82425916

pr_auc: 0.08179572

Windows 2016

AUC: 0.8242178

pr_auc: 0.081791736

With Cross Validation - Training Values

frame id: AdityaGLMLogistic.train

Windows 10

AUC: 0.82425916

pr_auc: 0.08179572

Windows 2016

AUC: 0.8242178

pr_auc: 0.081791736

With Cross Validation - Cross Validation Training Values

frame id: AdityaGLMLogistic_cv_1_train

Windows 10

AUC: 0.82000774

pr_auc: 0.079940684

Windows 2016

AUC: 0.8200086

pr_auc: 0.07994653

And similarly, we have differences for all the other CV training values.

Thanks,

Aditya

Aditya Kumar

Oct 9, 2020, 2:47:46 AM

to H2O Open Source Scalable Machine Learning - h2ostream

Attaching the datasets and console outputs

Thanks,

Aditya

LR_Model_Windows10_ConsoleLog2_nocv.txt

LR_Model_Windows2016_ConsoleLog2_nocv.txt

LR_Model_Windows2016_ConsoleLog1.txt

LR_Model_Windows10_ConsoleLog1.txt

DirectBankUSAx.zip

Darren Cook

Oct 9, 2020, 10:07:43 AM

to h2os...@googlegroups.com

> I have created a sample code to reproduce the issue and I am sharing this
> over here. Its a code snippet for Logistic Regression with CV and without
> CV. I am also attaching the console output on Windows 10 and Windows 2016
> with and without CV. Here, I am also attaching my sample dataset.

Hello Aditya,
I hope someone here can help - I've not used the Java API, so cannot
comment if something is strange. I *think* you should get the same
results between Windows 10 and Windows 2016 (assuming they are both
64-bit versions), so it will be interesting to hear why, if that is not
a valid assumption.

Darren

--
Darren Cook, Software Researcher/Developer

Tom Kraljevic

Oct 9, 2020, 10:35:32 AM

to Darren Cook, h2os...@googlegroups.com

one thing i can think of is, if the hardware is different (for example, a different number of cores), then the parallelism can change, and the internal in-memory mapreduce behavior can change.
this can result in subtle numerical differences.
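The effect described above can be illustrated with a small, self-contained Java sketch. It is not H2O code and does not reproduce H2O's internal map-reduce; it only assumes that the work is split into a number of chunks tied to the core count and that the partial results are then combined. Because floating-point addition is not associative, changing the number of chunks can change the last few digits of the result. The class and method names are hypothetical.

```java
import java.util.Random;

final class SummationOrderDemo {

    /**
     * Sums the array by splitting it into `chunks` contiguous parts,
     * summing each part separately, then adding the partial sums in order.
     * This mimics a reduction whose grouping depends on the degree of parallelism.
     */
    static double chunkedSum(double[] values, int chunks) {
        if (chunks < 1) {
            throw new IllegalArgumentException("chunks must be >= 1");
        }
        int n = values.length;
        double total = 0.0;
        for (int c = 0; c < chunks; c++) {
            int start = (int) ((long) n * c / chunks);
            int end = (int) ((long) n * (c + 1) / chunks);
            double partial = 0.0;
            for (int i = start; i < end; i++) {
                partial += values[i];
            }
            total += partial;
        }
        return total;
    }

    /** Reproducible pseudo-random data in [0, 1), so the input is identical on every run. */
    static double[] sampleData(int n, long seed) {
        Random rng = new Random(seed);
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = rng.nextDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        // Classic three-value case: (0.1 + 0.2) + 0.3 versus 0.1 + (0.2 + 0.3).
        double[] small = {0.1, 0.2, 0.3};
        System.out.println(chunkedSum(small, 1)); // 0.6000000000000001
        System.out.println(chunkedSum(small, 2)); // 0.6

        // Same seeded data, different chunk counts: the sums may differ in the last digits.
        double[] values = sampleData(1_000_000, 17706L);
        for (int chunks : new int[] {1, 2, 4, 8}) {
            System.out.printf("chunks=%d sum=%.17g%n", chunks, chunkedSum(values, chunks));
        }
    }
}
```

The input data and the seed are identical in every call; only the grouping of the additions changes. An iterative solver that consumes such sums at every step can carry these tiny differences forward into the final coefficients and metrics.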

i would not expect the java api to be different from R or python, since those just call into java via a socket anyway.

tom

aditya...@gmail.com

Oct 9, 2020, 10:59:21 AM

to H2O Open Source Scalable Machine Learning - h2ostream

Many Thanks Tom, Darren.

It seems hardware does play a role. My Windows 10 machine has an i7 processor with 4 cores and 8 logical processors, whereas my Windows 2016 machine has a Xeon processor with 2 processors and 4 virtual processors (it's a virtual machine). Apart from hardware, what else could cause these subtle differences? Will the operating system (Windows, Linux) or the Java vendor (Azul, Oracle, or another) make any difference?

Is it possible to be more deterministic?
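When comparing runs across machines, it may help to record the environment next to each set of metrics so that the factors asked about here (core count, operating system, Java vendor and version) can be ruled in or out one at a time. Below is a minimal sketch using only standard Java system properties. The class name is hypothetical, and the sketch reports the JVM it runs in, which is only relevant if that is the same machine and JVM that hosts the H2O node.

```java
final class RuntimeEnvironmentReport {

    /** Builds a multi-line description of the JVM and host this code runs on. */
    static String describe() {
        return String.join(System.lineSeparator(),
                "os.name=" + System.getProperty("os.name"),
                "os.version=" + System.getProperty("os.version"),
                "os.arch=" + System.getProperty("os.arch"),
                "java.vendor=" + System.getProperty("java.vendor"),
                "java.version=" + System.getProperty("java.version"),
                "availableProcessors=" + Runtime.getRuntime().availableProcessors());
    }

    public static void main(String[] args) {
        // Print the report so it can be saved alongside the console log of a model build.
        System.out.println(describe());
    }
}
```

Saving this output with each console log makes it easier to see which environment detail changed between two runs whose metrics differ.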

Darren,

I had tried the same thing with H2O Flow and got the same behavior. My initial question and numbers were from H2O Flow.