The fileformats architecture pain points

Phil Elson

Jul 12, 2017, 1:17:48 PM

to Iris-dev

We recently talked over the move to split the GRIB component out of iris itself, and this highlighted a couple of shortcomings with the current Iris architecture.

As I recall, the shortcomings come to two pertinent points (using iris and iris-grib as a concrete example):

1. iris.load requires iris-grib functionality to load GRIB files, and iris-grib requires iris to construct Cubes. There is therefore a circular dependency.
2. iris currently has a number of integration tests that depend upon iris-grib. This makes testing hard, as we are effectively in lock step with respect to versioning (we can't change iris-grib without potentially breaking an iris test).

I'm starting this thread as I'd like us to be talking about these issues, and would like to document a few of my thoughts. I'm not necessarily advocating doing anything on this front for Iris 2.0.

Problem 1. The circular dependency issue

We've been skirting around the fact that iris.load depends upon iris.fileformats and that iris.fileformats depends upon iris.cube since day one of Iris. Initially it was just a case of being careful with our imports within iris, but when we moved the iris.fileformats.grib code out, the circularity became all the more apparent.

The extent of the circularity is:

iris-grib fundamentally depends on many iris CDM (common data model) concepts (cube, coordinates, cell_methods etc.), as well as some of the other public machinery (e.g. rules, load interfaces)
(excluding the integration tests, to be discussed in #2) iris optionally depends on iris-grib in just 2 places: it expects a function called load_cubes and a function called save_grib2 with particular signatures. The "optional" part is done in the form of imports that are both lazy and catch ImportErrors.

Using the notation "b → a" to mean "a depends upon b" the current iris-grib architecture looks like:

ecCodes/gribapi ↘

iris-grib → iris (load | save)

iris (cube | coords | *CDM) ↗

While bringing together the discussion about iris-grib, a number of options for addressing this circularity were raised.

Option 1: iris-grib has no dependency on iris

One option in particular was raised on a number of occasions: having packages such as iris-grib be entirely independent of iris, and having these packages express Cube creation as an abstract concept. Iris would then depend on the format package (e.g. iris-grib) and turn these abstract things into iris objects. This can be represented as:

ecCodes/gribapi ↘

iris-grib → iris

"abstract data model" ↗

I personally consider this to be a non-starter for the following reasons:

The "abstract thing" is essentially a data model - it must either be extremely tightly coupled to iris, or an implementation of a new kind of abstracted data model. Iris is that abstracted data model - having this kind of model would simply push the problem down the stack, not actually address the issue.
We are essentially trying to push iris-grib down the dependency chain to avoid the downstream impacts of CDM API changes, but in practice we need to be very clear with API changes anyway - there are already many packages that have very hard dependencies on the core iris data model API, and if we want to make changes to those APIs we need to have water-tight change documentation.

Option 2: iris-grib depends upon iris-core

Instead of having an "abstract data model" we should be making use of Iris' USP - the fact that it is a rich and highly capable data model. The biggest blocker to this has always been the baggage that comes along with Iris (e.g. dependencies on matplotlib, scipy, netcdf4, pyke, etc.). Essentially, we could take the core data model out of "iris", have iris-grib depend upon that, and then have "iris" depend on both.

ecCodes/gribapi ↘
iris-grib → iris
iris-core ↗

Here I mean that "iris" would actually have some functionality - it would have the rich collection of capabilities that today's iris has (e.g. dependencies on mpl, scipy etc.), and in particular the load/save mechanisms would belong there. iris-core would be the stripped down "just the data model" part of today's iris.

Personally, I believe this model addresses the circular dependency observations that we see with current iris-grib. Despite this, I don't feel like immediately going down this road is actually worth the development & maintenance cost compared to the value delivered from improving the existing (fairly limited) circular dependency issue.
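
As a rough way to see the difference between the current graph and Option 2, the two sets of dependencies can be written down and checked for cycles. This is only a sketch: the package names are the ones used in this post, the helper `install_order` is made up for the example, and it relies on the standard library `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Each key depends upon the packages in its value set.
current = {
    "iris": {"iris-grib"},              # iris load/save call into iris-grib
    "iris-grib": {"iris", "ecCodes"},   # iris-grib builds Cubes from the iris CDM
}

option_2 = {
    "iris": {"iris-grib", "iris-core"},     # load/save and the rich capabilities
    "iris-grib": {"iris-core", "ecCodes"},  # only needs the data model
}


def install_order(graph):
    """Return a valid install order, or None if the graph has a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None
```

With these definitions `install_order(current)` returns None because of the iris / iris-grib cycle, whereas `install_order(option_2)` yields an order in which iris-core comes before iris-grib, and iris-grib before iris.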

Problem 2. Iris integration testing of packages that depend on iris

Like them or not, integration tests are a necessary part of ensuring the tools we develop function together. Our unit tests are there to give us a separation between packages and to test them in (relative) isolation, but we do need to check that the assumptions we have made between the boundaries of those packages continue to hold true. For example, we absolutely MUST assert that we can load a GRIB file and get an iris cube - this means that no matter how we split Iris, we will be wanting to run an integration test at some place.

Most of the pain (I believe) that we have felt with the iris-grib split has come from the fact that iris-grib is predominantly dependent upon iris, yet iris is attempting to validate the integration of this downstream package within its own set of integration tests. It is clear that integration testing must happen at (or above) the top dependency (i.e. the end of the dependency graph); with our current architecture that should be in iris-grib, or in a third party integration test repo.

ecCodes/gribapi ↘

iris-grib →/and GRIB iris [load | save] integration tests

iris ↗

We should also be talking about how much we currently rely on integration tests - I'd personally like to set an aspirational goal of full test coverage from our unit tests alone. I'd like to see our integration tests as the safety net that we never need to use (but that is there just in case!). With that in mind, I'm inclined to suggest we move (and refine) our integration tests out of iris itself and into a dedicated repo that runs its integration testing on a regular (i.e. nightly) basis. This would fundamentally change how we see integration tests. Currently we run them with each and every commit (through CI), which means we are using that safety net all of the time. Instead, I'd like us to depend more upon quality unit testing to give us confidence in each PR/change, and upon nightly integration testing to catch any slip-ups.
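
One possible way of keeping such tests out of the per-commit run is to gate them on an environment variable that only the scheduled job sets. The sketch below uses plain `unittest`; the variable name `RUN_INTEGRATION`, the base class, and the import name `iris_grib` are all assumptions for illustration, not an existing iris convention.

```python
import os
import unittest


class IntegrationTestCase(unittest.TestCase):
    """Base class for tests that only run in the scheduled integration job.

    RUN_INTEGRATION is a hypothetical environment variable, set by the
    nightly job and left unset in per-commit CI.
    """

    def setUp(self):
        if os.environ.get("RUN_INTEGRATION") != "1":
            self.skipTest("integration tests only run in the nightly job")


class TestGribInterface(IntegrationTestCase):
    def test_iris_grib_provides_expected_functions(self):
        # Assumption: the package is importable as ``iris_grib``.
        import iris_grib

        # Check the boundary between the packages (the two function names
        # iris expects), not the GRIB functionality itself.
        self.assertTrue(callable(getattr(iris_grib, "load_cubes", None)))
        self.assertTrue(callable(getattr(iris_grib, "save_grib2", None)))
```

In a per-commit run the test is reported as skipped, so it costs nothing; the nightly job would export RUN_INTEGRATION=1 with both packages installed.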

---------------------------------------------------------------------------------------------------------------------------------------

Personal conclusions:

Architecting our way out of the iris-grib <> iris circular dependency issue is feasible without a major re-write, but despite that, I question the benefit vs the cost
iris' integration tests are costly (they slow down the rest of iris' tests, and currently add circular dependency complexity) and we should be reducing their number to genuinely test the connective tissue between components, not test the functionality itself
In addition, we should consider moving our integration tests out of iris completely to speed up iris testing/development and reduce our reliance on them as a means of testing functionality

Cheers,

Phil
