running a 400GB matrix through the biogeme package

Gautham Bharathi

Dec 16, 2022, 9:54:43 AM
to Biogeme
Hello,
I hope you are well.
I am Gautham from Sydney.
I am trying to estimate a discrete choice model on a large dataset to obtain the choice coefficients. The estimation always fails, and the kernel reports 'Python kernel unresponsive'.
Can the biogeme package handle a dataset this large (around 400 GB)?
Also, can I pass a Spark DataFrame to biogeme?
Please keep me posted.

Looking forward to a positive conversation.
Regards.

Bierlaire Michel

Dec 16, 2022, 10:01:08 AM
to gautham....@gmail.com, Bierlaire Michel, Biogeme
First, I don’t think I have ever estimated a model on such a large database. So, although there is no theoretical limit imposed by the software itself, I don’t know about the “empirical” limitations.

Biogeme relies on pandas, which is supposed to handle large datasets. 
But I have never tested it with Biogeme. 

Note that the error may be caused by something other than the size.
Did you first try to estimate a model on a subset of the data of reasonable size?
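For instance, a minimal sketch of that subsetting step might look like the following. Everything here (file name, variables, and the toy logit specification) is a placeholder for your own setup:

import pandas as pd
import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
from biogeme.expressions import Beta, Variable

# Hypothetical file and column names, for illustration only.
df = pd.read_csv('my_choice_data.csv')

# Estimate first on a small random subset, e.g. 1% of the rows.
subset = df.sample(frac=0.01, random_state=42).reset_index(drop=True)
database = db.Database('subset', subset)

# Placeholder model specification: replace with your own utilities.
CHOICE = Variable('CHOICE')
TIME_1 = Variable('TIME_1')
TIME_2 = Variable('TIME_2')
ASC_1 = Beta('ASC_1', 0, None, None, 0)
B_TIME = Beta('B_TIME', 0, None, None, 0)
V = {1: ASC_1 + B_TIME * TIME_1, 2: B_TIME * TIME_2}
av = {1: 1, 2: 1}

logprob = models.loglogit(V, av, CHOICE)
the_biogeme = bio.BIOGEME(database, logprob)
results = the_biogeme.estimate()
print(results.getEstimatedParameters())

If that works, increase the fraction until you see where the estimation starts to struggle.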


Matthew Wigginton Bhagat-Conway

Dec 16, 2022, 10:16:39 AM
to michel.b...@epfl.ch, gautham....@gmail.com, Biogeme
This sounds like a memory issue to me. I wasn't using biogeme, but I worked with a similarly sized model in my dissertation (https://files.indicatrix.org/conway_dissertation_proquest.pdf; see ch. 4 and Appendix B), and the way I was able to estimate it was by reducing the problem size through sampling of the choice set. For MNL, at least, this yields consistent though not efficient parameter estimates, but with a dataset this large efficiency isn't really a concern. That may help reduce your problem to a tractable size.
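A rough sketch of that sampling step, assuming a wide-format dataset with one availability column per alternative (the column names, the number of alternatives, and the sample size below are purely illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_alts = 50       # hypothetical total number of alternatives
n_sampled = 10    # non-chosen alternatives kept per observation

def sample_choice_set(row):
    # Keep the chosen alternative plus a uniform random sample of the others;
    # switch off the availability of everything else.
    chosen = int(row['CHOICE'])
    others = [j for j in range(1, n_alts + 1) if j != chosen]
    kept = set(rng.choice(others, size=n_sampled, replace=False)) | {chosen}
    for j in range(1, n_alts + 1):
        if j not in kept:
            row[f'AV_{j}'] = 0
    return row

df = pd.read_csv('my_choice_data.csv')  # hypothetical file name
df = df.apply(sample_choice_set, axis=1)

The AV_* columns then serve as the availability conditions in the model, so the dropped alternatives no longer enter the estimation.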

For simulations, you still need to use all the alternatives, but you can break the dataset into smaller chunks for simulation to reduce memory requirements.
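And a rough sketch of the chunked simulation, assuming the model specification and the estimated betas already exist (all names and values below are illustrative):

import numpy as np
import pandas as pd
import biogeme.database as db
import biogeme.biogeme as bio
import biogeme.models as models
from biogeme.expressions import Beta, Variable

# Placeholder model: replace with your own utilities and estimated values.
TIME_1 = Variable('TIME_1')
TIME_2 = Variable('TIME_2')
ASC_1 = Beta('ASC_1', 0, None, None, 0)
B_TIME = Beta('B_TIME', 0, None, None, 0)
V = {1: ASC_1 + B_TIME * TIME_1, 2: B_TIME * TIME_2}
av = {1: 1, 2: 1}
simulate = {'Prob. alt 1': models.logit(V, av, 1)}
betas = {'ASC_1': 0.5, 'B_TIME': -0.1}  # use the values from your estimation

df = pd.read_csv('my_choice_data.csv')  # hypothetical file name
for i, chunk in enumerate(np.array_split(df, 100)):
    database = db.Database(f'chunk_{i}', chunk.reset_index(drop=True))
    sim = bio.BIOGEME(database, simulate)
    chunk_result = sim.simulate(betas)
    # Write each chunk to disk so only one chunk is in memory at a time.
    chunk_result.to_csv(f'simulated_{i}.csv', index=False)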

Matt
