Performance of Iris (loading time) - NetCDF


Carwyn Pelley

Nov 6, 2012, 5:15:27 AM
to scitoo...@googlegroups.com
I have decided to start a discussion on the topic of performance of Iris (specifically the speed of loading).
This is driven by a number of Iris user requests for optimisation, where its current speed prevents Iris from being a suitable replacement for their existing tools.

NetCDF files: (NEMO data)
The script reads in data from 4 model runs and does an area weighted sum of one of the variables. To demonstrate the issue I've reduced it to 5 files for each model run. Normally it could be in excess of 1000 files.

In the code you will see that I've tried doing this in two ways: first, reading the variable I need into a masked array using the netCDF4 module, and second, using iris.load_strict (it makes little difference whether I use iris.load or load_strict). I have also added timings to the code:
 - Using the netCDF4 module takes ~0.27 seconds to loop over the 4 model runs with 5 files each and produce the timeseries of area weighted sums.
 - Using iris takes over 17 seconds to do the same thing (about 60 times as long).
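For illustration, the comparison looks roughly like the sketch below (this is not the original script: the paths, the variable name 'votemper', the cell-area variable 'area' and the standard name passed to the Iris load are all placeholders, and it assumes one field per file and the same grid in every file):

import glob
import time

import numpy as np
import netCDF4
import iris

files = sorted(glob.glob('run1/*.nc'))

# Read the grid-cell areas once; every file shares the same grid.
with netCDF4.Dataset(files[0]) as ds:
    weights = ds.variables['area'][:]

# Approach 1: pull just the variable we need with netCDF4 and sum it.
start = time.time()
series_nc = []
for fname in files:
    with netCDF4.Dataset(fname) as ds:
        field = ds.variables['votemper'][:]   # masked array
        series_nc.append(np.ma.sum(field * weights))
print('netCDF4: %.2f s' % (time.time() - start))

# Approach 2: the same calculation, but loading each file through Iris
# (iris.load_cube here; iris.load_strict in the Iris of this thread),
# which also translates all the metadata into a cube.
start = time.time()
series_iris = []
for fname in files:
    cube = iris.load_cube(fname, 'sea_water_potential_temperature')
    series_iris.append(np.ma.sum(cube.data * weights))
print('iris:    %.2f s' % (time.time() - start))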

Using the netCDF4 module, only the data has been loaded, whereas in Iris, interpretation (metadata translation) has also occurred, creating a datatype-agnostic cube, which has overwhelming benefits. However, a discussion should take place here as to the common uses of Iris and how its speed may or may not stop people from using it in place of what they are currently using.

bblay

Nov 6, 2012, 11:05:45 AM
to scitoo...@googlegroups.com
We should find out if it's Pyke that's slow.
If so, this issue might help.

Carwyn Pelley

Nov 13, 2012, 11:42:27 AM
to
Doesn't this warrant a development or investigative ticket on GitHub?

bblay

Nov 14, 2012, 10:30:31 AM
to
Raised in https://github.com/SciTools/iris/issues/198.
Where's the code and data?

TGraham

Nov 30, 2012, 4:56:31 AM
to scitoo...@googlegroups.com
As a follow-up to this, I wonder if Phil Bentley's point about reading in data using the variable name rather than the standard name could help here. I haven't seen the breakdown of what is taking up the time in loading the cube, but I wonder if time is wasted in parsing the attributes of unwanted variables in the file in order to find the required variable.
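For example, a constraint on the NetCDF variable name can be written along these lines ('votemper' and the file name are placeholders; note that a constraint only filters cubes after they have been built, so on its own it may not avoid the cost of translating the other variables):

import iris

# Match on the cube's var_name, which Iris takes from the NetCDF variable name.
by_var_name = iris.Constraint(cube_func=lambda cube: cube.var_name == 'votemper')
cube = iris.load_cube('nemo_file.nc', by_var_name)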

I have just tested this by taking a copy of a file containing just the variable I want to read (and its associated coordinate variables), and the speed-up when reading the file is much better (of course, this could just be because the file is now smaller).

Thanks,

Tim

RHattersley

Nov 30, 2012, 5:19:52 AM
to scitoo...@googlegroups.com
Hi Tim,

The vast majority of the time overhead was from repeated loading of the *large* auxiliary coordinate variables containing latitude/longitude values & bounds. Each data variable is 5MB, but the auxiliary coordinates amounted to approx 100MB! One of the pull-requests I've linked from the issue mentioned above deals with that by deferring the load of the auxiliary coordinate data until they're actually used. Along with a couple of other speed-ups I've added, the execution time for the sample job on my machine has dropped from several minutes to just five or six seconds.
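Very roughly, the idea is something like the sketch below (this is only an illustration of deferring the read, not the actual Iris implementation, and the file/variable names are placeholders):

import netCDF4
import numpy as np

class DeferredVariable(object):
    """Read a NetCDF variable's values only on first access."""
    def __init__(self, path, var_name):
        self.path = path
        self.var_name = var_name
        self._data = None

    def data(self):
        if self._data is None:
            with netCDF4.Dataset(self.path) as ds:
                self._data = np.ma.asanyarray(ds.variables[self.var_name][:])
        return self._data

# Building the coordinates now costs almost nothing...
lons = DeferredVariable('mesh.nc', 'nav_lon')
lats = DeferredVariable('mesh.nc', 'nav_lat')
# ...and the ~100MB read only happens if the values are actually needed,
# e.g. when computing area weights.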

Richard