Panel data error in RP/SP model

9 views
Skip to first unread message

Matthias

unread,
Oct 3, 2025, 2:14:29 AM (8 days ago) Oct 3
to Biogeme
I want to fit a mixed-logit for SP/RP data.
As the RP data misses all the socio-demographics and has no variation in the choice variables, I want to do this in two steps: Estimate an SP only model (this works fine), fix everything but the ASCs and the scale parameter and re-estimate with the RP data.

The SP data is panel data, but the RP data has only one observation per participant.

When running, I get the following error. It seems the database is flattened repeatedly until it fails. Can you tell what the problem is?

Thank you very much and best regards,
Matthias

Biogeme parameters read from biogeme.toml.
Flattening database [(14297, 23)].
Database flattened [(11414, 107)]
Flattening database [(14297, 23)].
Database flattened [(11414, 107)]
Flattening database [(14297, 23)].
Database flattened [(11414, 107)]
Flattening database [(14297, 23)].
Database flattened [(11414, 107)]
---------------------------------------------------------------------------
BiogemeError                              Traceback (most recent call last)
Cell In[5], line 114
    111 conditional_trajectory_probability = PanelLikelihoodTrajectory(choice_probability_one_observation)
    112 log_probability = log(MonteCarlo(conditional_trajectory_probability))
--> 114 the_biogeme = BIOGEME(
    115     database,
    116     log_probability,
    117     number_of_draws=2000,
    118     number_of_threads=1,
    119     seed=12345,
    120     use_jit=False,
    121 )
    122 the_biogeme.model_name = 'combined_panel_with_RPscale'
    124 # Estimate

File ~/Documents/Biogeme/.venv/lib64/python3.12/site-packages/biogeme/biogeme.py:210, in BIOGEME.__init__(self, database, formulas, random_number_generators, user_notes, parameters, **kwargs)
    206 self.null_loglikelihood = None  #: Log likelihood of the null model
    208 self.best_iteration = None  #: Store the best iteration found so far.
--> 210 self._model_elements = ModelElements(
    211     expressions=self.formulas,
    212     database=self.database,
    213     number_of_draws=self.biogeme_parameters.get_value(name='number_of_draws'),
    214     user_defined_draws=random_number_generators,
    215     use_jit=self.biogeme_parameters.get_value(name='use_jit'),
...
     24         logger.warning(warning)
     25 if audit_tuple.errors:
---> 26     raise BiogemeError(audit_tuple.errors)

BiogemeError: ['Variable "rp__panel__01__panel__01__panel__01__panel__01" not found in the database.


This is the full code:
# STEP 3 — COMBINED SP+RP (PANEL) WITH RP SCALE

import numpy as np
import pandas as pd

import biogeme.biogeme_logging as blog
from biogeme.biogeme import BIOGEME
from biogeme.database import Database
from biogeme.expressions import Beta, Draws, Variable, MonteCarlo, PanelLikelihoodTrajectory, log
from biogeme.models import logit
from biogeme.results_processing import EstimationResults, get_pandas_estimated_parameters
from IPython.display import display

# -------- 0) Logging / reproducibility (suppress verbose flatten logs) --------
logger = blog.get_screen_logger(level=blog.ERROR)
np.random.seed(seed=12345)

# -------- 1) Load raw SP+RP data --------
df = pd.read_csv('cov_df.csv', sep=';', na_values=['NA', 'NaN', ''])

needed = [
    'RID','pref1',
    'price_opt1','price_opt2','price_opt3',
    'avail_opt1','avail_opt2','avail_opt3',
    'opt3',
    'ptsub','gender','city_center',
    'category_1','category_2','category_3','category_4','category_5','category_6',
    'age_cat_18_29','age_cat_30_44','age_cat_45_64','age_cat_65',
    'rp',
]
for c in needed:
    if c not in df.columns:
        df[c] = 0

df['RID'] = pd.to_numeric(df['RID'], errors='coerce').astype(int)
num_cols = [c for c in needed if c != 'RID']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce').fillna(0)
df[['avail_opt1','avail_opt2','avail_opt3']] = df[['avail_opt1','avail_opt2','avail_opt3']].astype(int)
df['pref1'] = df['pref1'].astype(int)
df['rp']    = df['rp'].astype(int)

# -------- 2) Build panel database (RP panels have length 1) --------
database = Database('bike_subscriptions_combined_panel', df)
database.panel('RID')

# -------- 3) Declare variables--------
CHOICE = Variable('pref1')
AV1, AV2, AV3, AV4 = Variable('avail_opt1'), Variable('avail_opt2'), Variable('avail_opt3'), 1

P1 = Variable('price_opt1') / 10.0
P2 = Variable('price_opt2') / 10.0
P3 = Variable('price_opt3') / 10.0
OPT3MIN = Variable('opt3')

def CV(name):  # row-level covariates
    return Variable(name)

#   - drop category_6
#   - drop age_cat_65
bin_covars = ['ptsub','gender','city_center']
cat_covars = ['category_1','category_2','category_3','category_4','category_5']   # ref: category_6
age_covars = ['age_cat_18_29','age_cat_30_44','age_cat_45_64']                     # ref: age_cat_65
covars = bin_covars + cat_covars + age_covars

# -------- 4) Load Step 1 estimates, set free/fixed status --------
try:
    beta_vals = results_sp.get_beta_values()  # if Step 1 ran in this kernel
except Exception:
    # fallback: load from file saved in Step 1
    results_sp = EstimationResults.from_yaml_file(
        filename='saved_results/step2_sp_panel_mixedlogit_random_ASCs_refdummies.yaml'
    )
    beta_vals = results_sp.get_beta_values()

# ASC means
ASC1_mu = Beta('ASC1_mu', beta_vals['ASC1_mu'], None, None, 0)
ASC2_mu = Beta('ASC2_mu', beta_vals['ASC2_mu'], None, None, 0)
ASC3_mu = Beta('ASC3_mu', beta_vals['ASC3_mu'], None, None, 0)

# ASC sigmas (FIXED when adding RP + scale)
ASC1_sig = Beta('ASC1_sig', beta_vals['ASC1_sig'], 0, None, 1)
ASC2_sig = Beta('ASC2_sig', beta_vals['ASC2_sig'], 0, None, 1)
ASC3_sig = Beta('ASC3_sig', beta_vals['ASC3_sig'], 0, None, 1)

# Random ASCs (same draw type as Step 1)
ASC1_rnd = ASC1_mu + ASC1_sig * Draws('ASC1_draw', 'NORMAL_ANTI')
ASC2_rnd = ASC2_mu + ASC2_sig * Draws('ASC2_draw', 'NORMAL_ANTI')
ASC3_rnd = ASC3_mu + ASC3_sig * Draws('ASC3_draw', 'NORMAL_ANTI')

# All other parameters FIXED to Step 1 estimates
B_price = Beta('B_price', beta_vals['B_price'], None, None, 1)
betas_cov = {(cv,k): Beta(f'B_{cv}_opt{k}', beta_vals.get(f'B_{cv}_opt{k}', 0.0), None, None, 1)
             for k in [1,2,3] for cv in covars}
B_opt3_minutes = Beta('B_opt3_minutes', beta_vals['B_opt3_minutes'], None, None, 1)

# -------- 5) RP scale (SP scale = 1). --------
lambda_rp = Beta('lambda_rp', 1.0, 1e-6, None, 0)
Scale = 1 + Variable('rp') * (lambda_rp - 1)   # 1 for SP rows, λ_RP for RP rows

# -------- 6) Utilities (scaled) --------
V1 = ASC1_rnd + B_price * P1 + sum(betas_cov[(cv,1)] * CV(cv) for cv in covars)
V2 = ASC2_rnd + B_price * P2 + sum(betas_cov[(cv,2)] * CV(cv) for cv in covars)
V3 = ASC3_rnd + B_price * P3 + B_opt3_minutes * OPT3MIN + sum(betas_cov[(cv,3)] * CV(cv) for cv in covars)
V4 = 0

v  = {1: Scale * V1, 2: Scale * V2, 3: Scale * V3, 4: Scale * V4}
av = {1: AV1, 2: AV2, 3: AV3, 4: AV4}

# -------- 7) Panel likelihood trajectory + Monte Carlo--------
choice_probability_one_observation = logit(v, av, CHOICE)
conditional_trajectory_probability = PanelLikelihoodTrajectory(choice_probability_one_observation)
log_probability = log(MonteCarlo(conditional_trajectory_probability))

the_biogeme = BIOGEME(
    database,
    log_probability,
    number_of_draws=2000,
    number_of_threads=1,
    seed=12345,
    use_jit=False,
)
the_biogeme.model_name = 'combined_panel_with_RPscale'

# Estimate
try:
    results_combined = EstimationResults.from_yaml_file(
        filename='combined_panel_with_RPscale.yaml'
    )
except FileNotFoundError:
    results_combined = the_biogeme.estimate()

print(results_combined.short_summary())
display(get_pandas_estimated_parameters(estimation_results=results_combined))

Michel Bierlaire

unread,
Oct 5, 2025, 9:08:08 AM (5 days ago) Oct 5
to mabru...@gmail.com, Michel Bierlaire, Biogeme
The 4 messages are OK. I did not discover why they are displayed 4 times, but it works for me.

There is probably an inconsistency between the names of the variables in the model and the columns of the flatten database.

You can access the pandas dataframe here:
the_biogeme.model_elements.database.dataframe
The names of the columns are available as
the_biogeme.model_elements.database.dataframe.columns

The list of variables involved in the loglikelihood can be found here:
the_biogeme.model_elements.expressions_registry.variables

Biogeme adds a suffix __panel_0x for the different instances of the same variable in your original specification.

This should allow you to detect the problem.
> --
> You received this message because you are subscribed to the Google Groups "Biogeme" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to biogeme+u...@googlegroups.com.
> To view this discussion visit https://groups.google.com/d/msgid/biogeme/eebf8932-81fb-4702-b17f-bbb1cbd1cc8cn%40googlegroups.com.

Michel Bierlaire
Transport and Mobility Laboratory
School of Architecture, Civil and Environmental Engineering
EPFL - Ecole Polytechnique Fédérale de Lausanne
http://transp-or.epfl.ch
http://people.epfl.ch/michel.bierlaire

Reply all
Reply to author
Forward
0 new messages