Merging - again...


Chris Roberts

Sep 5, 2013, 2:31:56 PM
to scitoo...@googlegroups.com
I am having problems merging CMIP5 data in iris. I have two netcdf files, each containing 12 months of data:
/project/champ/data/cmip5/output1/MPI-M/MPI-ESM-MR/piControl/mon/ocean/Omon/r1i1p1/v20120625/thetao/thetao_Omon_MPI-ESM-MR_piControl_r1i1p1_284801-284812.nc
/project/champ/data/cmip5/output1/MPI-M/MPI-ESM-MR/piControl/mon/ocean/Omon/r1i1p1/v20120625/thetao/thetao_Omon_MPI-ESM-MR_piControl_r1i1p1_284901-284912.nc

If I try and load them, I get the following:
>>> import iris
>>> tcube=iris.load(ncfiles,callback=callback)
>>> print tcube
0: sea_water_potential_temperature / (K) (time: 12; depth: 40; cell index along second dimension: 404; cell index along first dimension: 802)
1: sea_water_potential_temperature / (K) (time: 12; depth: 40; cell index along second dimension: 404; cell index along first dimension: 802)

My callback kills all the attributes, so it isn't those. I vaguely remember reading that iris will only merge along scalar dimensions, not existing dimension coordinates. If each file contains 12 months of data, how can I merge along the time dimension on loading? Alternatively, how do I force iris to merge along an existing coordinate after I have already loaded it?

I seem to spend most of my time in iris trying to merge data along the time dimension before I can do anything useful. Should I just accept this isn't going to happen and deal with the data in cubelists rather than a single cube?

Thanks,
Chris

Niall Robinson

Sep 6, 2013, 4:56:13 AM
to scitoo...@googlegroups.com
Hi Chris - we'll see what AVD says but here's my take.

Should I just accept this
No. It's rubbish - but, as far as I know, people are thinking about how to make this less painful. It just depends on the dataset you are working with; lots are completely painless.


I vaguely remember reading that iris will only merge along scalar dimension

Correct. What you want is the iris.experimental.concatenate function. Beware though - this (currently) loads all the data into memory. You want to stick all the cubes together along a time dimension, which is concatenating, rather than making a new dimension from scalar coords, which is merging.
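
(For reference, a minimal sketch of that route, assuming the current experimental API - and note it realises the data in memory:)

import iris
from iris.experimental.concatenate import concatenate

cubes = iris.load(ncfiles, callback=callback)  # two cubes of 12 months each
tcube = concatenate(cubes)[0]                  # one cube spanning 24 months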

Niall

Niall Robinson

Sep 6, 2013, 4:57:04 AM
to scitoo...@googlegroups.com
P.S. Do that after loading, not in a callback.


On Thursday, 5 September 2013 19:31:56 UTC+1, Chris Roberts wrote:

Chris Roberts

Sep 6, 2013, 6:24:33 AM
to scitoo...@googlegroups.com
Thanks for the reply Niall. It is useful to know that there is a concatenate method that can do what I am requesting, but for the tasks I am performing it is not feasible for me to load all the data into memory. Merging the metadata into a single cube is so useful because it means I can cut the data however I want before loading it into memory.

pp-mo

Sep 6, 2013, 6:41:50 AM
to scitoo...@googlegroups.com
On Friday, 6 September 2013 11:24:33 UTC+1, Chris Roberts wrote:
Thanks for the reply Niall. It is useful to know that there is a concatenate method that can do what I am requesting, but for the tasks I am performing it is not feasible for me to load all the data into memory....
Merging the metadata into a single cube is so useful because it means I can cut the data however I want before loading it into memory.

Now, that really is interesting...
The obvious answer with the existing machinery is that, if you select data using a bunch of appropriate Constraints, these can be applied equally to all the individual data sources
-- i.e. you can load everything with the same constraints, getting only the bits you want (though that can need further post-merging, given the limitations of the current merge procedure).
So, can you identify what it is about that workflow that really doesn't work in your case?

And is it actually impossible, or just inflexible / ungainly / inefficient?
Whatever the details, it's clear that we still don't have the facilities we want here. There must be better ways, and all this is still up for grabs given some improved proposals.
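
(For illustration, the constrained-load pattern described above looks roughly like this - the constraint values are made up:)

import iris

depth_constraint = iris.Constraint(depth=lambda cell: cell <= 100.0)
cubes = iris.load(ncfiles, depth_constraint, callback=callback)
# 'cubes' still holds one cube per file, but each now only covers the
# depths of interest; any stitching along time happens afterwards.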

Richard Hattersley

Sep 6, 2013, 6:53:09 AM
to scitoo...@googlegroups.com
Hi Chris,

That's very useful feedback - thanks. As Niall says, these are definitely issues we're working to improve.

And specifically, knowing about your interest in concatenating without loading data will help make the case for the work required to fix that. It's something I've been progressing "in the margins", but this kind of feedback will help get it the attention it needs.



On Friday, 6 September 2013 11:24:33 UTC+1, Chris Roberts wrote:

Chris Roberts

Sep 6, 2013, 7:30:11 AM
to scitoo...@googlegroups.com
Yes, it really is impossible to read it all into memory. Hundreds of years of monthly 3D data is tens or hundreds of gigabytes, depending on resolution. I am currently trying to work with ocean model data from the Coupled Model Intercomparison Project (CMIP5). I am looking at data from ~10 different models, each of which has its data saved in slightly different netcdf formats. In order to process the data in the most useful/efficient way possible, my workflow typically looks something like this:

>>> all_cube = iris.load(files, callback=model_specific_callback)  # callback fixes attributes/dims etc. so they are consistent for all models
>>> cube = extract_between_dates(all_cube)
>>> dates = cube.coord('time').points
>>> data = []
>>> for date in dates:
...     sub_cube = extract_operation(cube, date=date, other_constraints=other_constraints)
...     data.append(data_reduction_operation(sub_cube))

This means I can do something useful like create a climatology from an arbitrary period within a much longer record without having to worry about how much data I am loading. This is the exact same way I previously worked with pp data using pp_assoc and ss_assoc. The work-flow always follows the same basic steps:

>>> READ ALL META-DATA
>>> SELECT DATA OF INTEREST USING META DATA
>>> LOOP THROUGH A META-DATA DIMENSION (e.g. TIME) AND LOAD SELECTIVELY TO MANAGE MEMORY
>>> CALCULATE SOMETHING INTERESTING AND APPEND TO OBJECT/ARRAY
>>> SAVE OBJECT/ARRAY

This is the way I have to work with large data sets because I do not have terabytes of memory to play with, and is probably true for many people in the office.
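
(For concreteness, a rough sketch of those steps in iris terms - the loop, coordinate names and the reduction are all placeholders:)

>>> import numpy as np
>>> cubes = iris.load(files, callback=model_specific_callback)  # reads metadata only; data stays deferred
>>> results = []
>>> for cube in cubes:
...     for i in range(cube.coord('time').points.size):
...         sub_cube = cube[i]                      # one timestep; only this slice's data is read
...         results.append(np.mean(sub_cube.data))  # placeholder reduction
>>> np.save('results.npy', np.array(results))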

Chris

On Friday, 6 September 2013 11:41:30 UTC+1, pp-mo wrote:
On Friday, 6 September 2013 11:24:33 UTC+1, Chris Roberts wrote:
Thanks for the reply Niall. It is useful to know that there is a concatenate method that can do what I am requesting, but for the tasks I am performing it is not feasible for me to load all the data into memory....
Merging the metadata into a single cube is so useful because it means I can cut the data however I want before loading it into memory.

Andy Hartley

Sep 16, 2014, 6:52:17 AM
to scitoo...@googlegroups.com
Hi,

I was wondering what the latest news on merging or concatenating cubes is. Having read the discussion here, I see I have exactly the same problem - I am running into memory issues when I try to concatenate >4 GB of data. I have a large amount of satellite data as individual cubes for time slices that are spaced approximately every 10 days between 1999 and 2012. I would like to concatenate them together into one cube so that I can subset and then calculate a multi-year average for particular dates. My workflow is therefore exactly as described by Chris:

- read all metadata (i.e. create a cubelist of all time slices - one cube for one timeslice)
- concatenate cubelist into one cube
- create a subset based on time metadata (only May to August)
- create a multi-year mean for each 10 day period within the subset
- save the resulting cubelist

Currently, I am running out of memory on the concatenation. It seems like a basic thing that most scientists would need to do, so I'm hoping some progress has been made, or that someone could suggest a more memory-safe workflow.

Kind regards,
Andy

Niall Robinson

Sep 16, 2014, 9:49:14 AM
to scitoo...@googlegroups.com
Ultimately, I think, biggus is meant to sort this, but it hasn't been implemented yet. I've got to admit, one of the number-one features I would like in iris is Biggus extended to something more than cube.collapsed(), i.e. aggregated_by, arithmetic, and concatenation. Do we have any idea when this might be, AVD?
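
(For context, the sort of deferred evaluation biggus already gives for collapsing looks roughly like this - a sketch from memory, so treat the exact biggus names as assumptions:)

import numpy as np
import biggus

lazy = biggus.NumpyArrayAdapter(np.arange(24.).reshape(4, 6))  # stand-in for a deferred array
mean = biggus.mean(lazy, axis=0)   # nothing is evaluated yet
result = mean.ndarray()            # evaluation happens here, chunk by chunk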

Thanks
Niall

Andy Hartley

Sep 17, 2014, 6:36:38 AM
to scitoo...@googlegroups.com
Thanks for the reply Niall - ultimately, I would like to do aggregating too, so a biggus implementation of this would be very welcome.
In the meantime, can anyone suggest a workaround in iris / python? I must admit that I'm on the verge of ditching iris for this task and doing the operation with CDO!

Cheers,
Andy

Andrew Dawson

Sep 17, 2014, 8:57:02 AM
to scitoo...@googlegroups.com
I think the situation is that merging can handle larger-than-memory data sets, but concatenate currently can't.

I have a large amount of satellite data as individual cubes for time slices that are spaced approximately every 10 days between 1999 and 2012

When you say individual cubes, do you mean you loaded each cube from a separate file? Also, what format are the files? You can give iris more than one file name during load and it will automatically merge into as few cubes as possible. If the file format you load from supports deferred loading (most of them do), then this will not actually load any data, and you could do the sub-setting before anything is read.

If this doesn't work or isn't practical for you, you could consider taking the May-August sub-set from each sub-cube before concatenating them, which would reduce the size by 2/3. Using a constraint based on month, this would work regardless of how much of that period each cube actually covers. I guess this is the "old fashioned" way I would have done this before tools like iris.
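
(A sketch of that idea, purely illustrative - the helper coordinate is something added just for the extraction:)

import iris
import iris.coord_categorisation as coord_cat

may_to_aug = iris.cube.CubeList()
for cube in cubes:
    coord_cat.add_month_number(cube, 'time', name='month_number')
    subset = cube.extract(iris.Constraint(month_number=lambda c: 5 <= c.point <= 8))
    if subset is not None:
        subset.remove_coord('month_number')  # keep the cubes consistent for concatenation
        may_to_aug.append(subset)
result = may_to_aug.concatenate()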

Andy Hartley

Sep 17, 2014, 9:44:58 AM
to scitoo...@googlegroups.com
Hi Andrew,

I have the data in individual netcdf files. The files were created in iris (although originally in HDF5 format), with one time step per netcdf file. When I load all the netcdf files into iris, I get a cubelist with one cube per time step. I would like to merge / concatenate this cubelist into a single cube, but I think I'm not able to merge because I have irregular timesteps (and/or inconsistent bounds). There is data for the 3rd, 13th and 23rd (or sometimes 24th) of each month. Here's an example with a small subset of the data ...

import iris
import glob

wildcard = '/data/local/hadhy/Projects/LAI_India/g2_BIOPAR_LAI_20121***0000_ASIA_VGT_V1.3_India.nc'  # data on eld121
files = glob.glob(wildcard)

cubelist = iris.load(files)
print cubelist                                                                                                                                           
0: unknown / (unknown)                 (grid_latitude: 1500; grid_longitude: 1500; time: 1)                                                                  
1: unknown / (unknown)                 (grid_latitude: 1500; grid_longitude: 1500; time: 1)                                                                  
...                                                                 
8: unknown / (unknown)                 (grid_latitude: 1500; grid_longitude: 1500; time: 1)   

new_cube = cubelist.merge_cube()                                                                                                                         
Traceback (most recent call last):                                                                                                                           
  File "<stdin>", line 1, in <module>                                                                                                                        
  File "/project/avd/iris/live/pre-release/iris/cube.py", line 343, in merge_cube                                                                            
    proto_cube.register(cube, error_on_mismatch=True)                                                                                                        
  File "/project/avd/iris/live/pre-release/iris/_merge.py", line 1237, in register                                                                           
    error_on_mismatch)                                                                                                                                       
  File "/project/avd/iris/live/pre-release/iris/_merge.py", line 259, in match_signature                                                                     
    raise iris.exceptions.MergeError(msgs)                                                                                                                   
iris.exceptions.MergeError: failed to merge into a single cube.                                                                                              
  Coordinates in cube.dim_coords differ: time.                                                                                                                                                                            
new_cube = cubelist.concatenate()[0]
print new_cube
unknown / (unknown)                 (grid_latitude: 1500; grid_longitude: 1500; time: 9)
     Dimension coordinates:
          grid_latitude                           x                     -           -
          grid_longitude                          -                     x           -
          time                                    -                     -           x
     Attributes:
          Conventions: CF-1.5
print new_cube[0].coord('time')
DimCoord([2012-10-03 00:00:00, 2012-10-13 00:00:00, 2012-10-24 00:00:00,
       2012-11-03 00:00:00, 2012-11-13 00:00:00, 2012-11-23 00:00:00,
       2012-12-03 00:00:00, 2012-12-13 00:00:00, 2012-12-24 00:00:00], standard_name=u'time', calendar=u'standard', var_name='time')


So, concatenate works fine on a subset of the data, but not when I scale up to the full dataset. Do you think merge would work if I clearly defined the coordinate bounds? Alternatively, I could always loop through each ~10 day period, creating the multi-year mean for that date (using numpy), and then concatenate each aggregation together at the end. Hmm ... I think I solved my own problem!
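
(A rough sketch of that per-date averaging idea - the grouping and names are illustrative only:)

import numpy as np
import iris

groups = {}
for cube in cubelist:
    tcoord = cube.coord('time')
    date = tcoord.units.num2date(tcoord.points[0])
    groups.setdefault((date.month, date.day), []).append(cube)

means = iris.cube.CubeList()
for (month, day), cubes in sorted(groups.items()):
    mean_data = np.mean([c.data for c in cubes], axis=0)   # multi-year mean for this ~10-day slot
    means.append(cubes[0].copy(data=mean_data))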

Cheers,
Andy

Andrew Dawson

Sep 17, 2014, 10:23:48 AM
to scitoo...@googlegroups.com
I'd suggest cleaning up the coordinates. It is likely to be a simple problem, such as the time coordinate having an attribute that differs on each cube, or the time units being referenced differently. Both of these can be solved fairly easily.

It would be helpful if you could print the time coordinate of two of those cubes; you will likely spot the difference immediately. If an attribute is present that differs on each, you can simply delete it at load time using a load callback (it will be as if it never existed). If the time base unit is different, you can use the function iris.util.unify_time_units, which will make sure all cubes in a cube list use the same time unit. One or both of these will likely solve your merge problem and allow you to get back to working the way you wanted to originally.
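
(A quick sketch of both fixes - the attribute name here is only an example, use whatever actually differs:)

import iris
from iris.util import unify_time_units

def callback(cube, field, filename):
    cube.attributes.pop('history', None)   # drop a per-file attribute that would block merging

cubes = iris.load(files, callback=callback)
unify_time_units(cubes)   # put every cube onto the same time base unit
merged = cubes.merge()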

Andy Hartley

Sep 17, 2014, 11:49:49 AM
to scitoo...@googlegroups.com
I can't see any difference in the coordinates of different cubes (there shouldn't be, because I created the cubes in a script) ...

>>> print cubelist[0].coords('time')
[DimCoord(array([ 376248.]), standard_name=u'time', units=Unit('hours since 1970-01-01 00:00:00', calendar='standard'), var_name='time')]
>>> print cubelist[1].coords('time')
[DimCoord(array([ 376752.]), standard_name=u'time', units=Unit('hours since 1970-01-01 00:00:00', calendar='standard'), var_name='time')]
>>> print cubelist[8].coords('time')
[DimCoord(array([ 375528.]), standard_name=u'time', units=Unit('hours since 1970-01-01 00:00:00', calendar='standard'), var_name='time')]

>>> print cubelist[0].coord('time').has_bounds()
False
>>> print cubelist[1].coord('time').has_bounds()
False
>>> print cubelist[8].coord('time').has_bounds()
False

Here's the code I used to create each cube in the cubelist. It's part of a function that I call as I loop through the input files ...

# The date comes from the metadata embedded in the dataset
mydate    = ds.GetMetadata_Dict()['TEMPORAL_NOMINAL']
mydate    = [int(i) for i in mydate.split('-')]
u         = unit.Unit('hours since 1970-01-01 00:00:00', calendar=unit.CALENDAR_STANDARD)
this_date = u.date2num(datetime.datetime(mydate[0], mydate[1], mydate[2], 0))

time      = DimCoord(this_date, standard_name='time', units=u)
cube      = Cube(data, dim_coords_and_dims=[(latitude, 0), (longitude, 1), (time, 2)])


So, the time coordinates of each cube seem to be consistent. I was wondering, though, whether I should be adding time as an aux_coord? It's not really clear to me what the difference between an aux_coord and a dim_coord is.

Thanks again for your help - I'm still a bit of an iris newbie.
Andy

Andrew Dawson

Sep 17, 2014, 2:37:09 PM
to scitoo...@googlegroups.com
OK, seems reasonable. Perhaps it is something to do with DimCoord vs AuxCoord vs scalar coordinate; I always forget how it works. Could you post an "ncdump -h" of one of these files? I'll then have a play and see if I can come up with anything useful for you.
...

Andy Hartley

Sep 18, 2014, 4:40:46 AM
to scitoo...@googlegroups.com
Sure, here you go ...
I've also put the file on /windows/m-drive/depot/AndyHartley/ in case that helps.

ncdump -h g2_BIOPAR_LAI_201212240000_ASIA_VGT_V1.3_India.nc
netcdf g2_BIOPAR_LAI_201212240000_ASIA_VGT_V1.3_India {
dimensions:
        grid_latitude = UNLIMITED ; // (1500 currently)
        grid_longitude = 1500 ;
        time = 1 ;
variables:
        float unknown(grid_latitude, grid_longitude, time) ;
                unknown:_FillValue = 1.e+20f ;
                unknown:grid_mapping = "rotated_latitude_longitude" ;
        int rotated_latitude_longitude ;
                rotated_latitude_longitude:grid_mapping_name = "rotated_latitude_longitude" ;
                rotated_latitude_longitude:longitude_of_prime_meridian = 0. ;
                rotated_latitude_longitude:earth_radius = 6371229. ;
                rotated_latitude_longitude:grid_north_pole_latitude = 90. ;
                rotated_latitude_longitude:grid_north_pole_longitude = 180. ;
                rotated_latitude_longitude:north_pole_grid_longitude = 0. ;
        float grid_latitude(grid_latitude) ;
                grid_latitude:axis = "Y" ;
                grid_latitude:units = "degrees" ;
                grid_latitude:standard_name = "grid_latitude" ;
        float grid_longitude(grid_longitude) ;
                grid_longitude:axis = "X" ;
                grid_longitude:units = "degrees" ;
                grid_longitude:standard_name = "grid_longitude" ;
        double time(time) ;
                time:axis = "T" ;
                time:units = "hours since 1970-01-01 00:00:00" ;
                time:standard_name = "time" ;
                time:calendar = "standard" ;

// global attributes:
                :Conventions = "CF-1.5" ;
}

Andrew Dawson

Sep 18, 2014, 6:04:24 AM
to scitoo...@googlegroups.com
It is quite unusual to have latitude as the unlimited dimension and time as the last dimension; I'm not sure if this would present a problem or not. I haven't managed to mock up a file that looks like that, so I haven't tested anything yet. I don't work at the Met Office, so I can't look at your sample file.

Andrew Dawson

Sep 18, 2014, 9:25:09 AM
to scitoo...@googlegroups.com
OK, I think I understand this now. In order to merge, the time coordinate must be a scalar coordinate rather than a length-1 dimension coordinate. I'd suggest you try the following (probably with a subset of your input files) and see if it meets your needs:

cubes = iris.load(files)
# Now remove the time dimension coordinate using slicing; time is the third
# dimension (number 2):
cubes = iris.cube.CubeList([cube[:, :, 0] for cube in cubes])
# Now merge them:
cube = cubes.merge_cube()

I'm not sure how well behaved this is with respect to deferred loading, which is why I suggest trying this on a smaller number of files first so you can work it out; it may not work in its current form. Also, if it does work, there may be a better way of doing this!
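
(One quick way to check, assuming your Iris version has has_lazy_data():)

print all(c.has_lazy_data() for c in cubes)   # True if the sliced cubes are still deferred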


On Thursday, 18 September 2014 09:40:46 UTC+1, Andy Hartley wrote:

Niall Robinson

Sep 18, 2014, 9:56:46 AM
to scitoo...@googlegroups.com
I know I'm always just glibly wishing for things to appear in iris and not submitting PRs, but...

we really could combine merge and concatenate, couldn't we? It confuses everyone, and the logic for choosing which one to use must be a pretty simple if statement, especially if we get the user to specify which coord it is over, which would save us getting unexpected merge shapes as well.

Andy Hartley

Sep 23, 2014, 4:11:58 AM
to scitoo...@googlegroups.com
Sorry Andy - my original reply to your post seems to not have been posted.
Just wanted to say that your suggestion worked (very quickly, so no memory problems), so many thanks for your help.
Andy

Benjamin Root

Sep 26, 2014, 4:25:51 PM
to scitoo...@googlegroups.com
I am running into a similar issue with my Oklahoma Mesonet data loader (see recent thread). The data can be organized by station for a whole day. I can load up a single station's data for a day just fine, and end up with a cube that has 288 time points (5-minute data), but loading up two days of station data results in two cubes.

I tried to do a merge_cube() on the cubelist, but that doesn't work because the time coordinate is a dimension rather than a scalar.

I could see improved cube loading along the time dimension also making it easy to stitch together tiled data.

Cheers!
Ben Root

Andrew Dawson

Sep 26, 2014, 4:31:00 PM
to scitoo...@googlegroups.com
Hi Ben

Just in case you weren't aware, there is also the option of concatenation, which does stitch data together as you describe: http://scitools.org.uk/iris/docs/latest/iris/iris/cube.html?highlight=concatenate#iris.cube.CubeList.concatenate

The downside is that currently this will load all deferred data. Not so bad for the dimensionality you describe, but can be a problem for large data sets. I think this is probably on the list of things to improve.
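
(A minimal sketch of that, with made-up file names:)

import iris

cubes = iris.load(['mesonet_day1.nc', 'mesonet_day2.nc'])
combined = cubes.concatenate()[0]   # one cube with 2 x 288 time points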

Richard Hattersley

Sep 29, 2014, 4:00:37 AM
to scitoo...@googlegroups.com
On Thursday, 18 September 2014 14:56:46 UTC+1, Niall Robinson wrote:
we really could combine merge and concatonate, couldn't we? It confuses everyone and the logic of choosing which one to use must be a pretty simple if statement, especially if we get the user to specify which coord it was over, which would save us getting unexpected merge shapes as well

As mentioned elsewhere, `merge` is currently really just `stack`. The intention always was for it to include `concatenate` as well. The important word there being "intention"... :-(

Letting the user specify a coordinate to stack/concatenate over is an interesting idea. Perhaps something like cube_list.merge_over('time'). Not sure how well user-control would sit with the existing algorithm though.

Niall Robinson

Sep 29, 2014, 9:19:19 AM
to scitoo...@googlegroups.com
We were just talking about this in my group - we load a lot of lagged hindcast sets, i.e. the conceptual cube looks like
realization x forecast_reference_time x forecast_period x lat x lon
with a 2D time coord spanning the ref time and period axes.

However, the chances of loading a dataset where there are no missing data points are small, e.g. one realization always fails halfway through. This forces the ref time, period, realization and time to get lumped into aux coords on the same axis. I wonder if we could sort this by letting the user do
sliced_cube.merge(['realization', 'forecast_reference_time', 'forecast_period', 'lat', 'lon'])
with iris inserting masked data to allow the desired orthogonality. Oh, by the way, some kind soul has written a languishing PR which would help with that:
https://github.com/SciTools/iris/pull/1068

Niall

Benjamin Root

Sep 29, 2014, 9:55:17 AM
to Niall Robinson, scitoo...@googlegroups.com
Perhaps this user expectation is something that could be encoded through the Constraints system? If I understand it correctly, each Constraint object represents a single cube. So, if the user specifies time as a constraint, iris would then know to do what it can to either merge or concatenate multiple sub-cubes into one.

