Inquiry about Y Prediction Calculation

Natchaphon Leungbootnak

Apr 28, 2025, 4:07:04 PM
to Biogeme
Dear Sir,

I have used Biogeme and RStudio to estimate an MNL model of the choice between ML and GPL. I estimated the model on the training set and then used the estimated coefficients to predict the choice (Y) in the test set, in order to calculate the accuracy percentage and the confusion matrix.

Biogeme and RStudio gave the same coefficient values but very different accuracy percentages and confusion matrices. I wonder whether my Biogeme code, which uses the coefficients to calculate the predicted Y, is correct. I have included my code here:

utilities = {1: utility_ML, 2: utility_GPL}
log_choice_probability = loglogit(utilities, None, choice)
biogeme_train = BIOGEME(database_train, log_choice_probability)
biogeme_train.modelName = 'RUMMNL'
results = biogeme_train.estimate()

outcome = results.getHtml(onlyRobust=False)
match = re.search(r'<strong>Optimization time</strong>: </td> <td>(.*?)</td>', outcome)

# Confusion matrix
prob_1 = logit(utilities, None, 1)
prob_2 = logit(utilities, None, 2)

simulate = {'Prob. 1': prob_1,
            'Prob. 2': prob_2}

biogeme_test = BIOGEME(database_test, simulate)
biogeme_test.modelName = "RUMMNL_test"
betaValues = results.getBetaValues()
simulatedValues = biogeme_test.simulate(betaValues)

prob_max = simulatedValues.idxmax(axis=1)
prob_max = prob_max.replace({'Prob. 1': 1, 'Prob. 2': 2})

compared_data = {'y_Actual': df_test['choice12'],
                 'y_Predicted': prob_max}

df_y = pd.DataFrame(compared_data, columns=['y_Actual','y_Predicted'])
confusion_matrix = pd.crosstab(df_y['y_Actual'], df_y['y_Predicted'], rownames=['Actual'], colnames=['Predicted'])
accuracy = np.trace(confusion_matrix) / confusion_matrix.sum().sum() * 100

Best regards,
Natchaphon Leungbootnak
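
A note on the accuracy calculation itself: two pandas behaviours can distort it independently of Biogeme. Building a DataFrame from two Series aligns them on index labels, not on row position, and pd.crosstab omits any alternative that is never predicted, so the table may not be square and np.trace may then sum cells where the actual and predicted labels differ. Whether either applies to the code above cannot be told from the post. The sketch below uses a hypothetical helper, confusion_and_accuracy, and made-up data. It compares the two columns by position (assuming both are in the same row order) and forces both alternatives onto both axes.

```python
import numpy as np
import pandas as pd


def confusion_and_accuracy(y_actual, y_predicted, labels=(1, 2)):
    """Hypothetical helper: confusion table and accuracy in percent."""
    # Drop index labels so rows are compared by position, not by index.
    actual = np.asarray(y_actual)
    predicted = np.asarray(y_predicted)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same length")

    table = pd.crosstab(actual, predicted,
                        rownames=['Actual'], colnames=['Predicted'])
    # Force every alternative onto both axes so the diagonal always
    # holds the correctly predicted observations.
    table = table.reindex(index=list(labels), columns=list(labels),
                          fill_value=0)

    accuracy = np.trace(table.to_numpy()) / len(actual) * 100
    return table, accuracy


# Made-up data: the two Series carry different index labels, and
# alternative 1 is never predicted.
y_actual = pd.Series([1, 1, 1, 2], index=[10, 11, 12, 13])
y_predicted = pd.Series([2, 2, 2, 2])

table, accuracy = confusion_and_accuracy(y_actual, y_predicted)
print(table)
print(f"accuracy = {accuracy:.1f}%")
```

In this made-up case only one of the four observations is predicted correctly. A crosstab without the reindex step would have a single column, and its "diagonal" would be the cell for actual 1 and predicted 2.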

Michel Bierlaire

Apr 28, 2025, 4:16:18 PM
to natcha...@gmail.com, Michel Bierlaire, Biogeme
Accuracy percentage and confusion matrices are just nonsense.
The only context where considering the highest-probability item makes sense is in classification, where you need to decide if the picture is a dog or not. It is a one-shot application of the model. And even in this context, the concept of a confusion matrix is completely misleading. Anyway, in discrete choice, it simply does not make any sense.
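
One way to see the objection, as a minimal sketch with made-up probabilities (nothing here comes from the thread's data): a logit model returns a probability for each alternative, and keeping only the most probable one discards that information. In the sketch, every individual has probability 0.6 for alternative 1, so the highest-probability rule assigns everyone to alternative 1, while averaging the probabilities gives predicted shares of 60% and 40%. The log-likelihood of the observed choices is included as one commonly used probabilistic measure of fit on a test set. That choice is an assumption of this sketch, not something stated in the reply.

```python
import numpy as np

# Hypothetical choice probabilities for five individuals
# (columns: alternative 1, alternative 2). Illustrative numbers only.
probs = np.array([[0.6, 0.4]] * 5)
# Hypothetical observed choices, coded 1 or 2.
observed = np.array([1, 1, 1, 2, 2])

# Highest-probability rule: everyone is assigned to alternative 1.
argmax_pred = probs.argmax(axis=1) + 1
argmax_shares = np.bincount(argmax_pred, minlength=3)[1:] / len(probs)

# Expected shares: average the probabilities over individuals.
expected_shares = probs.mean(axis=0)

# Log-likelihood of the observed choices under the model.
chosen = probs[np.arange(len(observed)), observed - 1]
log_likelihood = np.log(chosen).sum()

print("shares from highest-probability rule:", argmax_shares)
print("expected shares from probabilities:  ", expected_shares)
print("log-likelihood of observed choices:  ", log_likelihood)
```

With Biogeme, the same quantities could in principle be computed from the simulated 'Prob. 1' and 'Prob. 2' columns in place of the made-up array.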

Michel Bierlaire
Transport and Mobility Laboratory
School of Architecture, Civil and Environmental Engineering
EPFL - Ecole Polytechnique Fédérale de Lausanne
http://transp-or.epfl.ch
http://people.epfl.ch/michel.bierlaire
