Writing outputs to csv - quickly

59 views
Skip to first unread message

Joe Bosak

unread,
Nov 7, 2014, 12:39:09 PM11/7/14
to liam...@googlegroups.com
Hi there

I am seeking advice on how to go about exporting liam2 entities to csv format at the end of each iteration.  Exporting all variables for all entities would be fine.  In the following paragraphs I have set out some background and explained how far I have got.  I would be very grateful for some guidance on this!

Background
=========

We are looking at liam2 performance on models of the scale and complexity we ordinarily use.  The simulation code runs very fast, but we have hit issues with outputting the datasets produced to a format SAS can read - SAS being our product of choice for detailed and ad hoc analyses of microsimulation model outputs.  Howard Redway has posted on this in liam2-users.

SAS appears unable to read hdf5 files directly, so we are using csv as an intermediate format.

The current liam2 functionality for outputting individual-level records to csv files is accessed by using csv(dump()).  The time taken for a 99-year run of our test model to output the data this way is around 1 hour.  SAS only takes a few minutes to create comparably-sized csv files, and to read such files in.  We would like to explore whether there are faster ways to output to csv.


Attempts made
============

I have tested the csv routine on its own, on a simple array of comparable size to that we are using.  Python can export this to csv very quickly using the same csv writer as liam2.  I find this puzzling - the only explanation I can come up with is that it is due to the parsing that goes on within liam2 - but in that case I imagine the simulation itself ought to be similarly affected, yet it runs very fast.  So I am rather stumped.

I have also looked at using pandas HDFStore routines to translate the hdf5 files to csv format at the end of the simulation.  However for a 99 year run this appears to hit the memory limit on our system [20gb ram] so that it takes about 30 minutes to translate the data, compared to around 6 minutes for a 50-year run.   

Whilst we can look at expanding the memory further, I would also like to understand whether there is any direct way to access the data structures within liam2 during a run, to save them to csv format at each iteration. I do not mean to re-invent the dump() and csv() routines, but I am instead wondering whether there is something simple we can if we simply want to export all variables for all cases [and whether it might be faster to do so]?

Where I have got to with tracking down what to do is this:

- in the simulation.py module, entity.store_period_data is what calls the process for saving to the hdf5 file

- store_period_data is in the entities.py module and it's the term self.array.append_to_table(self.table) that is [presumably] moving the current period data to the hdf5 table

- append_to_table is in data.py but just seems to be an alternative name for append_carray_to_table

- append_carray_to_table seems to chop the data into chunks [based on the buffers] and append those to the table for storage

But I cannot see an appropriate point where I could add some code to output the array to csv - nor indeed how to refer to the array.


Regards


Joe

Gaëtan de Menten

unread,
Nov 17, 2014, 11:11:06 AM11/17/14
to liam...@googlegroups.com
On 2014-11-07 18:39, Joe Bosak wrote:
I am seeking advice on how to go about exporting liam2 entities to csv format at the end of each iteration.  Exporting all variables for all entities would be fine.  In the following paragraphs I have set out some background and explained how far I have got.  I would be very grateful for some guidance on this!

Background
=========

We are looking at liam2 performance on models of the scale and complexity we ordinarily use.  The simulation code runs very fast, but we have hit issues with outputting the datasets produced to a format SAS can read - SAS being our product of choice for detailed and ad hoc analyses of microsimulation model outputs.  Howard Redway has posted on this in liam2-users.

SAS appears unable to read hdf5 files directly, so we are using csv as an intermediate format.

The current liam2 functionality for outputting individual-level records to csv files is accessed by using csv(dump()).  The time taken for a 99-year run of our test model to output the data this way is around 1 hour.  SAS only takes a few minutes to create comparably-sized csv files, and to read such files in.  We would like to explore whether there are faster ways to output to csv.


Attempts made
============

I have tested the csv routine on its own, on a simple array of comparable size to that we are using.  Python can export this to csv very quickly using the same csv writer as liam2.
Just curious, how did you test this?


 I find this puzzling - the only explanation I can come up with is that it is due to the parsing that goes on within liam2 - but in that case I imagine the simulation itself ought to be similarly affected, yet it runs very fast.  So I am rather stumped.
If by parsing you mean the parsing of the string expressions (the "model" code), then no, that operation is negligible (and only happens once before the simulation is even started).


I have also looked at using pandas HDFStore routines to translate the hdf5 files to csv format at the end of the simulation.  However for a 99 year run this appears to hit the memory limit on our system [20gb ram] so that it takes about 30 minutes to translate the data, compared to around 6 minutes for a 50-year run.  
Something like this should do the trick (not tested):

firstchunk = pandas.read_hdf(h5path, key, stop=10000)
firstchunk.to_csv(csvpath)

for chunk in pandas.read_hdf(h5path, key, start=10000, chunksize=10000)
    chunk.to_csv(path, header=False, mode='a')


Whilst we can look at expanding the memory further, I would also like to understand whether there is any direct way to access the data structures within liam2 during a run, to save them to csv format at each iteration. I do not mean to re-invent the dump() and csv() routines, but I am instead wondering whether there is something simple we can if we simply want to export all variables for all cases [and whether it might be faster to do so]?

Where I have got to with tracking down what to do is this:

- in the simulation.py module, entity.store_period_data is what calls the process for saving to the hdf5 file
- store_period_data is in the entities.py module and it's the term self.array.append_to_table(self.table) that is [presumably] moving the current period data to the hdf5 table
- append_to_table is in data.py but just seems to be an alternative name for append_carray_to_table
- append_carray_to_table seems to chop the data into chunks [based on the buffers] and append those to the table for storage
This is all correct.

But I cannot see an appropriate point where I could add some code to output the array to csv - nor indeed how to refer to the array.
You could do that for example in store_period_data. The "fields" data is in self.array. It is an object of type ColumnArray (see data.py) which *acts* like a (numpy) ndarray with a "structured" dtype (ie an array with several fields). But it is actually a dict of (numpy) ndarrays (one ndarray for each field).

For example,

a = self.array
for name in a.dtype.names:
    print(a[name])

Would print each field in turn.

I think your best bet is to do:

import pandas as pd

then in store_period_data, something like (not tested):

     df = pd.DataFrame(self.array.columns)
     df.to_csv(csvpath)

Please report back your experience with this, and if it works the timings you get.

Hope it helps,
Gaëtan
Reply all
Reply to author
Forward
0 new messages