PP loading performance


RHattersley

Jul 10, 2013, 6:11:22 AM
to scitools...@googlegroups.com
To help frame the current discussion and pull requests relating to (mostly) PP loading performance, I'd like to get an idea of the size of performance improvement possible. Establishing upper bounds on performance, within certain assumptions, is a useful piece in that puzzle.

Along those lines I've written some simple code to emulate the process of creating a 2D Cube once you know which PP rules are relevant. I've then timed this code on the current master vs. a "maximum-speed" branch, which removes things like validity checking that offer no benefit in this controlled environment. Using the "master" branch, 2D Cubes are created at a rate of 1200 per second, whereas the "maximum-speed" branch gives 11000 per second.

For the "maximum-speed" branch simple statistical profiling suggests that almost all the elapsed time is spent just creating objects. So further speed-ups are expected to require a reduction in the number of objects created. This might be possible by simplifying any overly complex private data structures and/or identifying where instances can be shared.

It remains to be seen how much of the validation code, etc. can be bypassed in a real implementation without compromising the robustness of the file conversion process and the design of the public API.

And while I have some of this in my head, some observations:
 - Creating Unit instances is slow.
 - Using ABCMeta makes instance creation slow.
 - Making a NumPy array is slow. And p = np.array([v]) is slower than p = np.empty(1); p[0] = v
 - _OrderedHashable is slow.
 - __slots__ doesn't make that much difference (although I've not looked at the improvement in memory usage)
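For reference, the NumPy array-creation gap in the points above can be demonstrated with a quick timeit sketch (illustrative only; absolute numbers depend on the machine and NumPy version):

```python
import timeit
import numpy as np

v = 3.14

# Building a 1-element array from a Python list: np.array has to walk
# the list, infer a dtype, and copy the value across.
t_array = timeit.timeit(lambda: np.array([v]), number=10_000)

def empty_then_assign():
    # Allocate uninitialised storage and assign directly: no container
    # traversal or dtype inference from a list.
    p = np.empty(1)
    p[0] = v
    return p

t_empty = timeit.timeit(empty_then_assign, number=10_000)

print(f"np.array([v]):       {t_array:.4f}s")
print(f"np.empty(1); p[0]=v: {t_empty:.4f}s")
```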

RHattersley

Jul 11, 2013, 5:53:18 AM
to scitools...@googlegroups.com
Some random thoughts on the "maximum-speed" experiment.
 - The pickle module can quite happily create objects without invoking their __init__ method. Also, it doesn't even use the __new__ method unless you want it to!
 - It's still useful to apply validation checks during the cube creation process - we can't trust the data. But there are plenty of validation checks which can be ignored. For example, when creating a regular longitude coordinate, if BDX != 0 then the generated longitude values will always be strictly monotonic.
 - We could have a (hopefully small) collection of pre-approved factory functions which create coordinates, etc. for common patterns. For example, create_regular_coord(bdx, bzx, ...).
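The pickle behaviour mentioned in the first point can be illustrated with a small sketch (the Coord class here is a stand-in, not the real Iris class):

```python
import pickle

class Coord:
    def __init__(self, points):
        # Stand-in for expensive validation during normal construction.
        if sorted(points) != list(points):
            raise ValueError("points must be monotonic")
        self.points = points

c = Coord([0.0, 90.0, 180.0])

# pickle rebuilds the instance without re-running __init__ (and hence
# without re-running the validation): it restores __dict__ directly.
c2 = pickle.loads(pickle.dumps(c))
assert c2.points == c.points

# The same trick is available by hand: allocate without initialising.
c3 = Coord.__new__(Coord)
c3.__dict__.update(c.__dict__)
assert c3.points == c.points
```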

RHattersley

Jul 16, 2013, 4:47:05 AM
Motivated by the reduce-the-number-of-objects perspective, another experiment:
 - What happens if you convert the existing PP rules into a module containing a single function: convert(cube, field)?

By doing this we avoid creating all the intermediate rule result objects (e.g. CMAttribute), and we also avoid a whole bunch of function calls.
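As a rough illustration (not the actual rulegen output; the field attributes and lookup here are made up), the generated module might collapse the rules into plain conditionals like this:

```python
from types import SimpleNamespace

def convert(cube, field):
    # Each former rule becomes an if-statement writing straight onto the
    # cube, with no intermediate result objects (CMAttribute etc.).
    stash_to_name = {"m01s16i203": "air_temperature"}  # illustrative lookup
    if field.stash in stash_to_name:
        cube.standard_name = stash_to_name[field.stash]

    if field.units is not None:
        cube.units = field.units

    if getattr(field, "source", None):
        cube.attributes["source"] = field.source

# Hypothetical field/cube stand-ins to show the shape of the call:
field = SimpleNamespace(stash="m01s16i203", units="K", source="UM 8.2")
cube = SimpleNamespace(standard_name=None, units=None, attributes={})
convert(cube, field)
assert cube.standard_name == "air_temperature"
```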

Method:
Using a 54000 field PP file, capture timings for:
 - list(iris.fileformats.pp.load(path))
 - list(iris.fileformats.pp.load_cubes(path))
Do this for upstream/master @ 4ba1de8... and rulegen @ e83a40c

Results:
For master: pp.load = 39s, pp.load_cube = 185s (292 cubes per second).
For rulegen: pp.load = 39s, pp.load_cube = 171s (316 cubes per second).
That's a 7.5% reduction in the overall time for pp.load_cube, or a 9.5% reduction in the field-to-cube time.

Plus there's an added benefit: the rule processing system becomes a whole load easier to read and debug!

RHattersley

Jul 11, 2013, 7:04:08 AM
to scitools...@googlegroups.com
At the risk of muddying the waters somewhat, just adding a simple Unit caching scheme onto PPField gives a further significant speed boost to the rulegen branch: load_cube = 157s. That's a further 10% reduction in the field-to-cube time.

RHattersley

Jul 16, 2013, 4:47:42 AM
I tweaked the "cutdown" branch referenced above to allow it to work with the real PP rules. This required re-instating the np.array() calls in the coordinate points/bounds setters. See cutdown-ish @ d1260bd.

Then using the same 54000 field PP file as referenced elsewhere:


For master: pp.load = 39s, pp.load_cube = 185s (292 cubes per second).
For cutdown-ish: pp.load = 39s, pp.load_cube = 136s (397 cubes per second).
That's a 26% reduction in the overall time for pp.load_cube, or a 33% reduction in the field-to-cube time.

Obviously 397 cubes/second is a lot less than 11000 cubes/second, so there's a lot of *something* going on that isn't included in the idealised test described previously.

RHattersley

Jul 16, 2013, 4:47:55 AM
Combining all three sets of changes does indeed give the almost linear response expected:

For cutdown-ish + rulegen + unit-cache: pp.load = 39s, pp.load_cube = 113s (476 cubes/s)

RHattersley

Jul 15, 2013, 4:31:44 AM
to scitools...@googlegroups.com
pp-mo - I'm curious how a rules-as-a-function approach might fit with your memoization branch?

bblay

Jul 15, 2013, 5:39:27 AM
to scitools...@googlegroups.com
rulegen @ e83a40c

Yay! Python rules rules!

Also, as you'll know, many of those generated conditions don't need to be run.
It could be refactored so logic is not distributed across many rules, becoming even easier to read and debug!

pp-mo

Jul 15, 2013, 5:48:15 AM
to scitools...@googlegroups.com

On Monday, 15 July 2013 09:31:44 UTC+1, RHattersley wrote:
pp-mo - I'm curious how a rules-as-a-function approach might fit with your memoization branch?

Well obviously, I had to assume that the rule actions were purely functional -- or tweak them so they were.
The clever bit was to "automatically" identify the parts of the source metadata that each rule action is using (i.e. the arguments to the function), by intercepting the Field accesses when executing an action.

To correctly identify which 'fundamental' metadata elements each action result depends on (aka "is a function of"), I actually had to modify the PPField code as it provides some 'derived' properties which are actually cached.
With a rewrite, it would make a lot more sense for a specific action to be associated with a definitive list of the metadata elements it refers to.
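The interception trick might look something like this sketch (names and field attributes are illustrative, not the memoization branch code):

```python
from types import SimpleNamespace

class RecordingField:
    """Wraps a field and records every attribute an action reads."""

    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.accessed = set()

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself
        # (i.e. anything other than _wrapped/accessed): record the
        # access, then delegate to the real field.
        self.accessed.add(name)
        return getattr(self._wrapped, name)

field = SimpleNamespace(bdx=0.5, bzx=10.0, lbuser=7)

def regular_x_action(field):
    # This action only reads bzx and bdx; the recorder discovers that.
    return [field.bzx + i * field.bdx for i in range(3)]

recorder = RecordingField(field)
points = regular_x_action(recorder)
assert recorder.accessed == {"bzx", "bdx"}  # the action's true arguments
```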

Meanwhile, the really tough part was speeding up the creation of duplicate rules-result objects, such as coords and cube attributes (accelerating deepcopy operations).
In principle, as you say, this should be quicker than creating new objects, but in practice a simple deepcopy operation was often much slower. 
The problem is that a naive object deepcopy recurses to copy all the keys and values of the object dictionary (keys too because dictionary keys are only required to be hashable, not immutable). 
So in many cases, with knowledge of the object design and the properties of its attributes, it was substantially quicker to construct a new one.  The required logic is generally equivalent to the pickle interface.
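A toy comparison of the two copying strategies (a sketch only; real coord objects are more complex and the speed difference will vary):

```python
import copy
import timeit

class Coord:
    def __init__(self, name, points):
        self.name = name
        self.points = points

c = Coord("longitude", list(range(100)))

def naive_copy():
    # Recurses through __dict__, copying keys as well as values.
    return copy.deepcopy(c)

def informed_copy():
    # With knowledge of the class: the name is an immutable string and
    # can be shared; only the points list needs duplicating.
    return Coord(c.name, list(c.points))

t_deep = timeit.timeit(naive_copy, number=1000)
t_new = timeit.timeit(informed_copy, number=1000)
print(f"deepcopy: {t_deep:.4f}s  reconstruct: {t_new:.4f}s")
```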

Before I implemented the caching operation, I also found Unit construction to be especially slow.
As the design does not currently support any in-place operations on them, I allowed Units to be directly re-used (instead of copied) in the Coord.deepcopy operation. I think that's OK at present, but it would break if we ever added a self-modifying operation on Units.

pp-mo

Jul 15, 2013, 5:55:17 AM

But I could not have implemented the "memoization branch" approach if we had pure code rules.
It relies on the formal, visible separation of the rule conditions and the individual result actions. In fact, it would have been a lot safer + neater if each action independently 'registered' which pieces of basic metadata it uses, as I had to infer this from the behaviour and take a lot of rather nasty extra steps to make this work.

bblay

Jul 15, 2013, 6:00:32 AM
to scitools...@googlegroups.com
What, you can't do that in Python?

pp-mo

Jul 15, 2013, 6:39:00 AM
On Monday, 15 July 2013 11:00:32 UTC+1, bblay wrote:
What, you can't do that in Python?

Well you can, and we have.
But it requires rules to be implemented as objects with a specific common structure of organisation (conditions, actions, arguments, results..) and operations (evaluate, run actions, process results..).
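In other words, something like this skeletal shape (illustrative only, not the actual Iris rule classes):

```python
from types import SimpleNamespace

class Rule:
    """A rule with its condition and action held as separate pieces."""

    def __init__(self, condition, action):
        self.condition = condition  # field -> bool
        self.action = action        # field -> result

    def evaluate(self, field):
        # Separate evaluation phase: a framework could memoize, log or
        # time this step without reading any individual rule body.
        return self.condition(field)

def run_rules(rules, field):
    # Generic driver: evaluate every rule, run the actions that fire.
    return [rule.action(field) for rule in rules if rule.evaluate(field)]

# Hypothetical field and rules to show the shape of the machinery:
field = SimpleNamespace(lbcode=1, bdx=0.5)
rules = [
    Rule(lambda f: f.lbcode == 1, lambda f: ("grid", "regular")),
    Rule(lambda f: f.bdx == 0, lambda f: ("x", "irregular")),
]
results = run_rules(rules, field)
```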

By contrast, if we write "just code", we can implement operations in whatever different ways best suit individual cases.  But then we won't necessarily have overall generalisations to describe or manage things like testing, logging, execution or caching.  While we could add such structure within a more flexible coding scheme, none of it comes for free and you don't know what is needed in advance (like the results caching scheme).

For example, we might now consider whether we can accelerate the testing of rules firing using a memoization scheme similar to the rules-results one.
But that's only easy to consider because the rules are explicitly structured with a separate 'evaluation phase', so you can address that separately without needing to look at the implementation of all the actual existing rules.

(Note: this is really only an example: I actually think that bit takes too little overall time to be worth accelerating.)

bblay

Jul 16, 2013, 8:55:35 AM
to scitools...@googlegroups.com
Rule objects have always seemed reasonable to me.
Whatever we need, I'm confident we can do it well in Python.