Are there any ways to speed up reading PP data into Iris?


Simon Tett

Feb 1, 2016, 6:59:10 PM2/1/16
to Iris, simon...@ed.ac.uk

Hi,
 I am reading pp data from HadCM3 into iris cubes as follows:
t500=iris.load_cube(files,'air_temperature' & iris.Constraint(pressure=500.))
where files expands to 31 different files. Each file is 16 Mbytes and contains 594 fields, and my selection should extract 1 field per file, returning 31 fields in all.

The data are on high-speed disk but the read in Iris is painfully slow, taking > 1 minute. Subsequent reads of the files (for different fields) are also painfully slow, so no caching of metadata (or even of the data itself) appears to be taking place. I am running on our local cluster, which runs Scientific Linux 6, Python 2.7.6 and Iris version 1.7.0-dev (that is what iris.__version__ tells me). I'm using IPython for the test case, though I noticed the problem with a script that uses plain Python.

If I do the same analysis in IDL using the Met Office library (ppa to be exact), the first read takes a few seconds and tells me how many fields are in each file. Subsequent reads of the same thing take 1-2 seconds. ppa is of course caching the index data -- an optimisation Jonathan Gregory and I made way back in the early 1990s...

So I don't think my problem is due to local disk speed, network etc. 

So a couple of questions:
1) Am I doing something stupid? The loads really should not be taking so long.
2) If not, are there any plans to improve this?

Simon Tett

marqh

Feb 3, 2016, 11:40:16 AM2/3/16
to scitoo...@googlegroups.com, simon...@ed.ac.uk
Hello Simon

One optimisation you can make use of is STASH filtering.

If you provide only a constraint that is a STASH attribute constraint, then Iris can (and will) ignore all non-matching PP fields from the input file.

This can dramatically speed up the load process for constrained loading cases like this.

You can then apply your remaining constraints to an extract of the result of the load (note: a combined STASH and level constraint is not currently optimised).

For example, if your STASH code is atmosphere: 3004
you can:
import iris

stash_constraint = iris.AttributeConstraint(STASH='m01s03i004')
pressure_constraint = iris.Constraint(pressure=500.)
air_t = iris.load_cube(files, stash_constraint)
t500 = air_t.extract(pressure_constraint)



Simon Tett

Feb 5, 2016, 4:12:13 PM2/5/16
to Iris, simon...@ed.ac.uk
Hi,
 I tried that, and it makes no difference -- actually, specifying the STASH code slows things down. Times are slightly different from my last example as I am using a different directory with 41 files in it.

In [14]: %timeit t500=iris.load_cube(file,'air_temperature' & iris.Constraint(pressure=500.))
1 loops, best of 3: 2min 45s per loop

In [15]: %timeit atemp=iris.load_cube(file,iris.AttributeConstraint(STASH='m01s16i203') & iris.Constraint(pressure=500.))
1 loops, best of 3: 3min 2s per loop

In [16]: t500
Out[16]: <iris 'Cube' of air_temperature / (K) (time: 48; latitude: 73; longitude: 96)>

In [17]: atemp
Out[17]: <iris 'Cube' of air_temperature / (K) (time: 48; pressure: 15; latitude: 73; longitude: 96)>

One minor question -- why does iris not do the right thing with iris.AttributeConstraint(STASH='m01s16i203') & iris.Constraint(pressure=500.)?
That should return all fields with STASH code m01s16i203 and pressure level = 500, but it only selects on the STASH code.

And just to be on the safe side, turn off the pressure constraint...

 In [19]:  %timeit atemp=iris.load_cubes(file,iris.AttributeConstraint(STASH='m01s16i203'))
1 loops, best of 3: 2min 42s per loop

In [20]: atemp
Out[20]: <iris 'Cube' of air_temperature / (K) (time: 48; pressure: 15; latitude: 73; longitude: 96)>

The equivalent IDL version is much faster and takes 36 or so seconds (so 5 times faster than the python/iris solution):

MIDL64t> t=systime(/s) & file=findfile("/exports/work/geos_cesd_workspace/stett2/um/xlnrs/A/aps/*.pp") & pp=ppa(file,ss(stash=16203,blev=500),/all,/q) & print,systime(/s)-t
<snipped out IDL compilation>
       36.236768
Note that subsequent reads take ~0.1 seconds, even for different fields, as ppa caches information about where the PP files are and the metadata for them. Retrieval is then 1) match against the metadata, 2) read those fields straight out of the file.
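
In Python terms, a rough sketch of that index-and-seek idea (not what ppa actually does internally; it assumes standard big-endian PP files with unpacked 32-bit data, and it skips error handling and the real-valued header words) might look like:

import numpy as np

# Rough sketch only: index every field's header and data offset once, then serve
# later requests with a direct seek to the cached offset.

def build_index(filenames):
    index = []
    for fname in filenames:
        with open(fname, 'rb') as f:
            while True:
                reclen = np.fromfile(f, dtype='>i4', count=1)  # header record length (bytes)
                if len(reclen) == 0:
                    break
                header = np.fromfile(f, dtype='>i4', count=int(reclen[0]) // 4)
                f.seek(4, 1)                                   # skip trailing record length
                fldlen = np.fromfile(f, dtype='>i4', count=1)  # data record length (bytes)
                index.append((fname, header, f.tell()))        # remember where the data start
                f.seek(int(fldlen[0]) + 4, 1)                  # skip data + trailing length
    return index

def fetch(index, match):
    """Read the data of every indexed field whose header satisfies match(header)."""
    out = []
    for fname, header, offset in index:
        if match(header):
            with open(fname, 'rb') as f:
                f.seek(offset)
                nwords = int(header[14])                       # LBLREC: data length in words
                out.append((header, np.fromfile(f, dtype='>f4', count=nwords)))
    return out

# e.g. index = build_index(files)
#      t500 = fetch(index, lambda h: int(h[41]) == 16203)      # LBUSER4 (STASH item) is word 42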

As far as I can tell iris.load_cube doesn't do this -- otherwise 2nd data reads from a file would be really fast...

So is this something to do with the way we have Python and Iris installed and configured on our cluster at Edinburgh, or is it intrinsic to Iris?

If the former, what should we do to make it go faster?
If the latter -- are there plans to speed up the first read (and cache indices), or should I convert the PP data to netCDF and just read the netCDF, assuming that works well... or stick with IDL, which performs well...


Simon

Simon Tett

Feb 6, 2016, 9:51:28 AM2/6/16
to Iris, simon...@ed.ac.uk
Following up on this, I wrote some Python just to read the field records from the PP data. This took 1.3 seconds to read 28512 records from 48 files. Once you have the list of what is there, extracting a small sub-set of the fields is a trivial job with a few f.seek calls.

I don't think my poor performance is due to the hardware I'm running on, nor to Python, as raw Python performance is good. It must be something to do with how iris reads in the data and then converts it to cubes. I assume the Met Office still makes use of PP data -- is performance poor there?

Simon

Code below:

"""
test code to read PP data from PP files. Is iris pain due to python or iris?
Code based on pp routine search. Just reads the records and generates a list
To make it do useful things (as opposed to benchmark) then data will
need to be converted to structure and way of asking for specified fields done.
"""
import numpy as np
import time
import glob
import os
filename = '$SAGESWORK/um/xlnra/A/aps/xlnraa.*.pp' # list of files to process
littleEnd=True # PP data is big-endian, so byteswap on a little-endian platform
verbose = False # set True to get messages
record_lst=[] # list of records
files=glob.glob(os.path.expandvars(filename)) # expand filename
t0=time.clock() # get current time
t1=t0
for file in files: # iterate over files reading all records
    t1=time.clock()
    with open(file, 'rb') as f: # open file and automatically close when done.
        while True: # keep reading  records. Error will be triggered when fail to read
            recLen=np.fromfile(f,dtype=np.int32,count=1) # read record len in BYTES
            if littleEnd:  recLen.byteswap(True)
            if verbose: print "recLen is ", recLen
            if len(recLen) == 0 or recLen[0] <= 0: # end of file (or bad record) so stop
                break
            record=np.fromfile(f,dtype=np.int32,count=int(recLen[0])//4) # read header words
            if littleEnd:  record.byteswap(True)
            # should do something to FP values and pull array apart.
            # also store file position for start of f77 record for field data.
            # this is demo so not bothering
            record_lst.append(record) # add record to list
            if verbose: print "Record %i has len %i with element 0 = %i"%(len(record_lst),recLen,record[0])
            # no of words to skip for next record
            f.seek(4,1) # skip end of record count info
            fldLen=np.fromfile(f,dtype=np.int32,count=1) # read fld size in bytes
            if littleEnd: fldLen.byteswap(True) # byte swap it
            fldLen=int(fldLen[0])+4 # data payload plus the trailing 4-byte record length
            if verbose: print "Skipping %i bytes"%fldLen
            f.seek(fldLen,1) # skip fldLen bytes from current posn to reach the next record
        print "read data from file %s in %f secs "%(file,time.clock()-t1)


print "Read %i records from %i files in %f seconds "%(len(record_lst),len(files),(time.clock()-t0))

Phil Elson

Feb 8, 2016, 5:30:46 AM2/8/16
to Simon Tett, Iris, simon...@ed.ac.uk
Hi Simon,

If you want to work with PP structures and don't need the benefits of cubes, Iris does have the ability to load PP files directly:

import iris.fileformats.pp as pp

for field in pp.load(filename):
    print(field.stash, field.lbcode)


The docs for this can be found at http://scitools.org.uk/iris/docs/latest/iris/iris/fileformats/pp.html#iris.fileformats.pp.load. I would expect the performance of this to be comparable to that of the code you provided (untested), though to actually understand the structure fully, the Iris version is doing a little more work, so it wouldn't surprise me if there was a little more overhead.
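
For instance (an untested sketch, with a made-up path), you could pull out just the headers for one STASH code without ever touching the data payload:

import iris.fileformats.pp as pp

# field.data is deferred, so this only reads the headers of each field.
fields = [field for field in pp.load('/path/to/xlnraa_sample.pp')
          if str(field.stash) == 'm01s16i203']
print(len(fields))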


> I assume the Met Office still makes use of PP data -- is performance poor there?

There are no silver bullets in the Met Office - PP data is still widely used (though by no means exclusively), and the performance will be comparable to what you are observing. Inevitably, there are cases where the performance is a bottleneck and needs work. In the majority of cases I've seen, large PP files are typically generated from model dumps (fieldsfiles), which Iris can also read natively. It is often the case that these model dumps are predictably structured. This structure can be used to optimise the loading of fieldsfiles by using iris.experimental.fieldsfile.load (http://scitools.org.uk/iris/docs/latest/iris/iris/experimental/fieldsfile.html#iris.experimental.fieldsfile.load). Whilst we've not needed to implement it in earnest, such an optimisation can also be made for structured PP files, and with the right conditions I've observed 15x speed-ups in load time.
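
For example (an untested sketch with a made-up path), assuming the file really does have that regular structure:

from iris.experimental import fieldsfile

# The structured loader exploits the regular field ordering of a model dump to
# avoid the generic per-field merge work.
cubes = fieldsfile.load('/path/to/model_dump.ff')
print(cubes)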


Given the performance you are observing of loading raw, un-interpreted PP data, all the fingers are pointing at the poor performance of conversion from PP into cube form. That means that any Iris optimisation which is equivalent to the pph optimisation in the PP IDL routines would need to store the cube rather than an optimised PP-like structure. In actual fact, this kind of thing can be achieved in fairly short order using pickles. A degree of caution is needed, as pickles don't work particularly well across even minor versions of packages, but for short term re-usable type problems, it is an excellent option. I've thrown together an example which may help for your particular case:


from __future__ import print_function

from glob import glob
import hashlib
import os.path

import iris
import six
from six.moves import cPickle


def cached_load(filenames):
    """Load iris cubes given the filenames, caching the result as a pickle."""
    if isinstance(filenames, six.string_types):
        filenames = [filenames]
    # Turn the list of filename patterns into a sorted list of unique
    # filenames.
    filenames = sorted(set(fname for filename_pattern in filenames
                           for fname in glob(filename_pattern)))

    # Hash the files.
    hasher = hashlib.sha256()
    for fname in filenames:
        with open(fname, 'rb') as fh:
            buf = fh.read(65536)
            while len(buf) > 0:
                hasher.update(buf)
                buf = fh.read(65536)
    digest = hasher.hexdigest()

    cachefile = '.iris_cache_{}_{}'.format(iris.__version__, digest)

    if not os.path.exists(cachefile):
        print('Loading the cubes. This can take some time!')
        result = iris.load(filenames)
        with open(cachefile, 'wb') as fh:
            cPickle.dump(result, fh, cPickle.HIGHEST_PROTOCOL)
    else:
        with open(cachefile, 'rb') as fh:
            result = cPickle.load(fh)
    return result


# Load some actual data to try it out:
print(cached_load(['/path/to/data/engl_*.pp']))



With this code, and a pretty trivial PP file, I can observe a large speedup on second load:

$ time python memoize_load.py
Loading the cubes. This can take some time!
0: unknown / (unknown)                 (latitude: 481; longitude: 640)
1: surface_downwelling_longwave_flux_assuming_clear_sky / (W m-2) (latitude: 481; longitude: 640)
...
144: y_wind / (m s-1)                    (model_level_number: 70; latitude: 480; longitude: 1)

real        0m32.09s

$ time python memoize_load.py
0: unknown / (unknown)                 (latitude: 481; longitude: 640)
1: surface_downwelling_longwave_flux_assuming_clear_sky / (W m-2) (latitude: 481; longitude: 640)
...
144: y_wind / (m s-1)                    (model_level_number: 70; latitude: 480; longitude: 1)

real        0m2.29s


Just to point out, I've not actually triggered data loading at this point, so the pickle files are still referencing the data payload of the original (pp) files.
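
If you wanted the cache to stand alone from the original PP files, you could realise the data before dumping it, at the cost of a much larger cache file -- an untested sketch would be to add the following just after the iris.load call, before the cache file is written:

        # Touching cube.data realises the deferred data, so the pickle would then
        # contain the arrays themselves rather than references back to the PP files.
        for cube in result:
            cube.data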

Interested to hear your findings on this approach Simon,

Cheers,

Phil





Simon Tett

Feb 8, 2016, 10:23:41 AM2/8/16
to Iris, prof...@gmail.com, simon...@ed.ac.uk
Hi Phil,
 thanks for that -- I don't think my performance issue would be solved by pickling cubes. If the cube is only metadata (because it loads data on demand), then it takes too long to generate the cube in the first place. I wondered if iris was reading all the data, making sub-cubes and then extracting what I wanted, but loading all the cubes seems to be taking a very long time...

A solution that provides a different interface from netCDF for PP is *not* what I am looking for. I'm too old to want multiple systems to get at data. One of the attractions of iris was its ability to work with both PP and netCDF data.

I guess iris is doing something under the hood -- but very inefficiently. My normal approach to processing model data is to produce many diagnostics, many of which I never look at (I don't know which diagnostics I want to look at ahead of time), and so I need fast extraction of a sub-set of the data. I believe this is a fairly common mode for climate data analysis, though others might work in very different ways. But fast extraction from PP data is straightforward -- just read the header records and store the filename and where in the file the data is. If you want to optimise, then cache the metadata after reading. The tricky bit, I think, is mapping from PP 2D fields (and the various ways they exist) to multi-dimensional cubes, as I imagine that is complex. But it is pointer stuff so should be fairly fast. Then when you want the field data (or components of it), just do f.seek, f.read. Presumably iris must be doing some of the complex stuff.

In answer to my question -- is it the configuration at Edinburgh or intrinsic to Iris?

 It sounds like it is intrinsic to iris, which is a shame and means that, for my purposes, I think it is unusable for processing PP data.

Is its performance better if I read in netCDF data?

Are there any plans to improve iris so it is much faster with PP data?

Simon

marqh

Feb 16, 2016, 12:47:43 PM2/16/16
to Iris, prof...@gmail.com, simon...@ed.ac.uk
Hello Simon

I don't agree that Iris is 'doing something under the hood -- but very inefficiently'.

Iris is doing a lot of work in loading the PP field metadata individually, converting each field's metadata, constraining on the newly created metadata and putting these 2D cubes together into a logical set of nD cubes.
The PP field load is fast, as you have seen from Philip's post.

Supporting filtering on cube coordinates and attributes requires that most of this has taken place.

The STASH constraint I showed you can deliver very significant performance improvements where it is used in isolation (not combined with any other constraint) and the files in question contain a large number of different STASH codes, of which only a small fraction are required.
This seems to match your described use case:

> My normal approach to processing model data is to produce many  diagnostics many of which I never look at (I don't know which diagnostics I want to look at ahead of time). And so I need a fast extraction of a sub-set of the data.

This is exactly the case it was implemented for, and it has shown large performance improvements in particular cases; it is very case-dependent.

There are further activities taking place targeting field filtering for PP loading, which could provide further options for tuning. These may arrive in the next release of Iris.


> Is its performance better if I read in netCDF data?

Yes: there is a lot less work to do to interpret the metadata, and usually little work in understanding the structure (as the netCDF file can encode this directly).

If you are aiming to load data sets many times from disk in different sessions, converting the data to netCDF as a one-off can speed up lots of downstream processing tasks.

Iris supports streaming from format to format, so you can convert very large quantities of data with:

cs = iris.load(['fname', ...])
iris.save(cs, 'new_fname.nc')
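
Subsequent sessions can then read straight from the netCDF file with the same kind of constraint as before, e.g. (untested):

t500 = iris.load_cube('new_fname.nc',
                      'air_temperature' & iris.Constraint(pressure=500.))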

mark