Only loading a specific cube from a file's cube list for efficiency


Niall Robinson

Dec 6, 2012, 6:56:54 AM
to scitoo...@googlegroups.com
Hello - I'm trying to improve the loading efficiency of a routine I'm writing.
 
Here is my problem. Someone has given me a bunch of pp files of sea temperature as a function of lat, lon and level (i.e. depth). When I load the data from one of these files using iris.load, I get, for a certain date, a cube list of about 10 cubes: one cube for each depth level. Each of these cubes then has temperature as a function of lat and lon. All the cubes have the same name; I only know which depth level they correspond to from the order they appear in the list.
 
I would like to efficiently load one data point from one level from each file in sequence. Basically I want to construct a time series of a specific spatial point (i.e. lat, lon and depth), do some stats on that, save the stats to a list, and then move onto the next spatial point.
 
I can set up a lat/lon constraint so that I'm only loading one data point of the cube. However, using the load function gives me 10 cubes of one data point each. This seems to take a long time. Is there any way to load in one point from a specific cube in the list, avoiding loading all the cubes? Bear in mind that the cubes all have the same name, so this would need to be done via the ordinal number of the cube.
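For reference, the sort of constraint I'm setting up looks roughly like this (the filename and coordinate values are just placeholders):

point = iris.Constraint(latitude=50.0, longitude=0.0)
cubes = iris.load('some_file.pp', point)   # still returns ~10 single-point cubes, one per level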
 
Thanks in advance
 
Niall
 
PS. Incidentally, by way of feedback, is there any reason that constraints are logically combined using '&' rather than the standard Python 'and'? It would seem better to me to keep it consistent, but it's no big deal.

Ian Edwards

Dec 6, 2012, 7:08:33 AM
to scitoo...@googlegroups.com
Hi Niall
 
I'd recommend not relying on the order of the cubes in your cubelist being the same as the order in which they were loaded; I believe this behaviour is not guaranteed. Instead, we need to look at the PP data to find out why the levels are not being loaded as a coordinate. Fixing this will give you a single cube (instead of the cubelist).

Answering the final part of your question, the reason that we cannot use 'and' is due to how this boolean operator is defined:
(http://docs.python.org/2/library/stdtypes.html#boolean-operations-and-or-not)
 
   x and y       if x is false, then x, else y
 
When combining two constraints, the result would be one or the other and not the combination of both.
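To make that concrete, since Constraint objects are truthy (as most Python objects are), 'and' would simply hand back the second constraint:

>>> import iris
>>> a = iris.Constraint(forecast_period=6)
>>> b = iris.Constraint(model_level_number=10)
>>> a and b     # 'a' is truthy, so the expression evaluates to 'b'
Constraint(coord_values={'model_level_number': 10})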
 
For this reason, we define '&' to give the following behaviour:
 
>>> forecast_6 = iris.Constraint(forecast_period=6)
>>> forecast_6
Constraint(coord_values={'forecast_period': 6})
>>> level_10 = iris.Constraint(model_level_number=10)
>>> level_10
Constraint(coord_values={'model_level_number': 10})
>>> forecast_6 & level_10
ConstraintCombination(Constraint(coord_values={'forecast_period': 6}), Constraint(coord_values={'model_level_number': 10}), <built-in function __and__>)
 

Niall Robinson

Dec 6, 2012, 8:51:58 AM
to scitoo...@googlegroups.com
I've had a look at the pp files with ppfp and it seems that the level metadata is present in the pp files; I presume it's just not being picked up by Iris. If it picked up this metadata, would I then get one big cube which I could constrain in terms of lat, lon and depth? Any pointers for sorting out the depth metadata loading? It seems to be in a field called LEV.

bblay

Dec 10, 2012, 8:31:22 AM
to scitoo...@googlegroups.com
Hi Niall,

Regarding the level metadata, I suggest you send us a PP field to look at.
We should be able to get this loading correctly.
(Currently arranging this offline until we have a generic process for doing this)

However, as your requirement is for a specific point in a single level,
the most efficient way is to bypass the merge altogether (so you won't need the level metadata).

Using iris.load_raw() you can get the fields in the same order as in the file.
cubes = iris.load_raw("my_file.pp")
cube = cubes[5]
Note: Iris does not read PP data until it is accessed.
http://scitools.org.uk/iris/docs/latest/iris/iris.html#iris.load_raw

Also, for speed, I suggest you don't use constraints but rather direct indexing of the cube.
point_cube = cube[100, 200]
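Putting those two suggestions together, a rough sketch (the file names and index values below are just placeholders):

import iris

series = []
for fname in ['2156_01.pp', '2156_02.pp']:   # placeholder file names
    cubes = iris.load_raw(fname)        # fields come back in file order, data still deferred
    level_cube = cubes[5]               # pick the level of interest by its position
    series.append(level_cube[100, 200])  # index out the single grid point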

Hope this helps,
Byron

Niall Robinson

Dec 11, 2012, 5:42:46 AM
to scitoo...@googlegroups.com
Hi Byron, thanks for the help

So is there any advantage to applying the constraints at load time, as opposed to loading the cubes and then applying the constraints? I still find that the load line takes longer if the cube is bigger, which suggests to me that it is loading data it maybe doesn't need to. I am running the following code inside a loop over each of the lat/lon grid points and it's currently taking a very long time. The function's return value is appended to a list, giving me a list of one-point cubes which I can then calculate the RMS of (see the first question for the problem). It might be better to load in all the data and then do the operations on the matrices, but I have quite a lot of data, so I'm loath to load it all in at the same time.

When I call it with the latPnt and lonPnt arguments, the lines that load truthLevel and truthClimateLevel currently take significantly longer than the analAnomaly line, despite the grids being the same size. Am I accidentally forcing Iris to load prematurely or something?

def getErrorFields(level, month, year, yearOffset, stash, latPnt=None, lonPnt=None):
    """This loads the relevant data to return an error field.

    Positional arguments:
    level -- field level number
    month -- run month
    year -- run year
    yearOffset -- the '1860' initialisation date, i.e. the offset to convert from naming year to actual year
    stash -- the stash code: 101 for sea temp, 102 for salinity
    latPnt -- optional arg for the lat grid point number to load
    lonPnt -- optional arg for the lon grid point number to load
    """
    analPath = '/project/decadal2/nrobin/ocean_anal/temp/'  # path to data analysed with the correlation algorithm; expects an anomaly
    truthPath = '/project/decadal2/haddi/HadGEM2/control/ocean/2_5deg/'  # path to the 'truth' monthly averages (not an anomaly)
    truthClimatologyPath = '/project/decadal2/haddi/HadGEM2/control/ocean/climate/'  # path to the 'truth' climatology (used with the above to calculate the anomaly)

    truthLevel = I.load_raw(truthPath + str(stash) + '_' + str(year + yearOffset) + '%02d' % month + '.pp')[int(level)]  # at all levels

    if stash == 101:  # get the climatological average and analysed field for the correct variable
        analAnomaly = I.load_cube(analPath + 'pot_anal_' + str(year) + '%02d' % month + '%02d' % level + '.pp')  # also formats the relevant single-digit integers as two-digit strings
        truthClimateLevel = I.load_raw(truthClimatologyPath + 'pot_2334_2389_' + '%02d' % month + '.pp')[int(level)]
    elif stash == 102:
        analAnomaly = I.load_cube(analPath + 'sal_anal_' + str(year) + '%02d' % month + str(level) + '.pp')  # also formats the relevant single-digit integers as two-digit strings
        truthClimateLevel = I.load_raw(truthClimatologyPath + 'sal_2334_2389_' + '%02d' % month + '.pp')[int(level)]
    else:
        print "Not an appropriate stash number"

    if latPnt is not None and lonPnt is not None:
        thisAnalAnomaly = analAnomaly[latPnt][lonPnt]
        truthAnomaly = imaths.subtract(truthLevel[latPnt][lonPnt], truthClimateLevel[latPnt][lonPnt])  # the truth field expressed as an anomaly
        errorField = imaths.subtract(thisAnalAnomaly, truthAnomaly)  # the difference between the anomalies
        return [errorField, truthAnomaly, thisAnalAnomaly]  # returns the error field
    else:
        truthAnomaly = imaths.subtract(truthLevel, truthClimateLevel)  # the truth field expressed as an anomaly
        errorField = imaths.subtract(analAnomaly, truthAnomaly)  # the difference between the anomalies
        return [errorField, truthAnomaly, analAnomaly]  # returns the error field


bjlittle

Dec 11, 2012, 8:32:07 AM
to scitoo...@googlegroups.com
Niall,

If you're really, really, really concerned about performance, you can circumvent the overhead (and convenience) of creating Iris Cubes and deal purely with PPField instances.

For example, you can do the following, which might be a more performant option given your needs:

import iris.fileformats.pp as pp

for field in pp.load('101_215601.pp'):
    # Constrain to the field that you want ...
    if field.lbuser[3] == 101:
        # Show the PPField instance.
        print field
        # You can get to individual PPField attributes ...
        print field.lbrel, field.lbpack, field.blev, field.lblev, field.lbvc
        # Get and show the deferred loaded data payload associated with the PPField.
        print field.data
        # Do other stuff ...

Does this help?

Also, if you want to convert a PPField into an Iris Cube, let me know and I'll tell you how to do that.
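And if all you need is a single grid point, once you have the right field you can index its data payload directly (the indices here are just placeholders):

value = field.data[100, 200]  # field.data is a numpy array, so normal indexing applies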

Niall Robinson

Dec 11, 2012, 8:59:11 AM
to scitoo...@googlegroups.com
Thanks bjlittle - I've just written something using the brute-force method of loading all the cubes I need into a list and then operating on them once they are in memory. It's not as graceful as loading points in, one at a time, but cutting down on the I/O seems to have sped things up a lot. I might give your suggestion a go if I've got time. It's good to know what options are available, though.

RHattersley

Dec 11, 2012, 10:01:23 AM
to scitoo...@googlegroups.com
Hi Niall,

Thanks for bringing these issues to our attention - and I'm glad you're making good progress towards the result you need.

I'd just like to add that the techniques being suggested are effectively low-level workarounds. Our aim for Iris is to enable exactly these kinds of computations at a high level, allowing you to express the intent and leave Iris to handle the nitty-gritty detail. For example, Iris should let you load the metadata from all your files at once - giving you a 4-dimensional (t, z, y, x) Cube for each relevant quantity. The time-series extractions, or even a direct RMS calculation, should work on that 4D Cube, irrespective of the total data size. Sometimes this will be slower than a low-level approach, but sometimes it can actually be faster.
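As a rough illustration of the sort of usage we're aiming for (the path, name and indices below are purely illustrative):

import iris

# Load everything as one 4-dimensional (t, z, y, x) cube.
cube = iris.load_cube('/path/to/ocean/*.pp', 'sea_water_potential_temperature')

# Extract a time series at one depth level and one horizontal grid point.
point_series = cube[:, 5, 100, 200]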

Clearly there are still some bumps in the road. Some we know about, but some we don't. So please stick with the high-level where possible and let us know where you hit problems. And we'll do our best to sort them out. :-)

Thank you!
Richard

RHattersley

Dec 11, 2012, 10:04:29 AM
to scitoo...@googlegroups.com
Hi Niall,

It's slightly off-topic, but when building up text from pieces it can sometimes be clearer to supply multiple substitutions at once. So a fragment such as:

analPath+'pot_anal_'+str(year)+'%02d'%month+'%02d'%level+'.pp'

can also be written as:
'%spot_anal_%d%02d%02d.pp' % (analPath, year, month, level)

or using the `format` method (which can take some getting used to, but often gives more readable code):
'{}pot_anal_{}{:02d}{:02d}.pp'.format(analPath, year, month, level)
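For instance, with made-up values:

>>> '{}pot_anal_{}{:02d}{:02d}.pp'.format('/some/path/', 2156, 1, 5)
'/some/path/pot_anal_21560105.pp'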

Richard

Niall Robinson

Dec 11, 2012, 10:21:44 AM
to scitoo...@googlegroups.com
Thanks Richard. I'm sold on the concept for sure. Iris will save a lot of the reinventing of the wheel that scientific programmers seem to spend the bulk of their lives doing!

Quick question: you said "direct RMS calculation" - I couldn't find a way to do this, so I have been doing it manually. It's not too arduous, but is there an equivalent way of doing
thisAnalAnomaly.collapsed(['longitude', 'latitude'], analysis.MEAN)
but for RMS? I'm going to feel stupid if you say "change MEAN to RMS", but I'm pretty sure I tried that.
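For reference, what I've been doing by hand is roughly this (just a sketch, reusing my variable names from earlier):

from iris import analysis
import iris.analysis.maths as imaths

# Root-mean-square over the horizontal domain, done manually:
squared = imaths.multiply(thisAnalAnomaly, thisAnalAnomaly)
mean_sq = squared.collapsed(['longitude', 'latitude'], analysis.MEAN)
rms = imaths.exponentiate(mean_sq, 0.5)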

Thanks

Niall

bjlittle

Dec 11, 2012, 10:24:14 AM
to scitoo...@googlegroups.com
Niall,

The GitHub Iris master now has the capability to translate the appropriate PP depth metadata into an Iris Cube for your ocean dataset, e.g.:

Python 2.7.2 (default, Oct  1 2012, 15:56:20)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import iris
>>> cube = iris.load_cube('101_215601.pp')
>>> print cube
sea_water_potential_temperature     (model_level_number: 20; latitude: 72; longitude: 144)
     Dimension coordinates:
          model_level_number                           x             -              -
          latitude                                     -             x              -
          longitude                                    -             -              x
     Auxiliary coordinates:
          depth                                        x             -              -
     Scalar coordinates:
          forecast_period: 2566800 hours
          forecast_reference_time: -959040.0 hours since 1970-01-01 00:00:00
          time: 1607400.0 hours since 1970-01-01 00:00:00, bound=(1607040.0, 1607760.0) hours since 1970-01-01 00:00:00
     Attributes:
          STASH: m02s00i101
          source: Data from Met Office Unified Model 6.06
     Cell methods:
          mean: time (2 hour)
>>>

This means that you can now construct a high-level constraint that loads one or more specific ocean model levels/depths per Cube, e.g.:

>>> cube = iris.load_cube('101_215601.pp', iris.Constraint(model_level_number=[1,5]))
>>> print cube
sea_water_potential_temperature     (model_level_number: 2; latitude: 72; longitude: 144)
     Dimension coordinates:
          model_level_number                           x            -              -
          latitude                                     -            x              -
          longitude                                    -            -              x
     Auxiliary coordinates:
          depth                                        x            -              -
     Scalar coordinates:
          forecast_period: 2566800 hours
          forecast_reference_time: -959040.0 hours since 1970-01-01 00:00:00
          time: 1607400.0 hours since 1970-01-01 00:00:00, bound=(1607040.0, 1607760.0) hours since 1970-01-01 00:00:00
     Attributes:
          STASH: m02s00i101
          source: Data from Met Office Unified Model 6.06
     Cell methods:
          mean: time (2 hour)
>>> print cube.coord('model_level_number')
DimCoord(array([1, 5], dtype=int32), standard_name='model_level_number', units=Unit('1'), attributes={'positive': 'down'})
>>> print cube.coord('depth')
DimCoord(array([  5.        ,  47.84999847], dtype=float32), standard_name='depth', units=Unit('m'), attributes={'positive': 'down'})


Regards,
-Bill

Niall Robinson

Dec 11, 2012, 10:37:39 AM
to scitoo...@googlegroups.com
Brilliant - thanks AVD

RHattersley

Dec 11, 2012, 10:40:39 AM
to scitoo...@googlegroups.com
You're right - there's no analysis.RMS defined yet. I'm afraid "direct RMS calculation" currently falls into the "known bumps" category. I've created a GitHub issue to record it: https://github.com/SciTools/iris/issues/274

The same goes for "irrespective of the total data size". But we do have some technology in the wings. Can you give us some idea of dataset sizes you're dealing with?

Richard

Niall Robinson

Dec 11, 2012, 10:51:09 AM
to scitoo...@googlegroups.com
Hi Richard - currently it's just N48 files, but eventually it will be more data-rich output. I've only just started, so I haven't got my hands too dirty yet.

bjlittle

Dec 12, 2012, 6:18:53 AM
to scitoo...@googlegroups.com
Hi Niall,

Please note that as a user you should never need to use iris.load_raw().

This is a developer tool that has been exposed at the API level for convenience to assist with debugging the Iris load mechanism.
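Now that the depth metadata is loaded as a coordinate, a constrained iris.load_cube() call (as in the example above; the level number here is just a placeholder) is the supported route to a single level:

cube = iris.load_cube('101_215601.pp', iris.Constraint(model_level_number=6))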

Regards,
-Bill

bblay

Mar 5, 2013, 8:26:22 AM
to scitoo...@googlegroups.com
Please see here for a discussion on the future of custom loading and saving.

RHattersley

Mar 26, 2013, 5:27:37 AM
to scitoo...@googlegroups.com
For the record, thanks to bjlittle, iris.analysis.RMS made it into Iris 1.2.

Niall Robinson

Mar 26, 2013, 9:24:47 AM
to scitoo...@googlegroups.com
Thanks bjlittle!