RFD: cumulative vs non-cumulative data

David Megginson

Mar 4, 2019, 3:15:51 PM
to HXL public mailing list
The HXL WG has been considering various ways that we might flag whether it's safe to add numbers in a column to get a total. The use case would be automatic visualisations and other analysis where the software creates a default view without human input.

An example of non-cumulative data is the number of new measles cases reported each month:

Surveillance month   New measles cases
#date +month         #affected +infected
2019-01              72
2019-02              31
2019-03              18

An example of cumulative data is the total number of measles cases reported as of each month (during the outbreak):

Surveillance month   Cases to date
#date +month         #affected +infected
2019-01              72
2019-02              103
2019-03              121

In the first case, it's reasonable to add the values in the #affected +infected column to get a total; in the second, it's not. Here are the questions:
  1. Should the assumed default be summable (non-cumulative unless otherwise indicated) or non-summable?
  2. What attribute(s) should we use, e.g. +nosum, +summable, +cumulative, etc.?
We're looking forward to your input.
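
To make the distinction concrete, here is a minimal Python sketch using the values from the two tables above. The `cumulative` flag is hypothetical: it stands in for whichever attribute (e.g. +cumulative) might eventually be chosen, and is not part of the HXL standard.

```python
# Values copied from the two example tables above, keyed by month.
new_cases = {"2019-01": 72, "2019-02": 31, "2019-03": 18}
cases_to_date = {"2019-01": 72, "2019-02": 103, "2019-03": 121}


def total(column, cumulative=False):
    """Return the overall total for a column keyed by ISO month.

    `cumulative` is a hypothetical flag, standing in for whatever
    attribute the standard might adopt to mark cumulative columns.
    """
    if cumulative:
        # For cumulative data the latest value already is the total.
        return column[max(column)]
    # Non-cumulative data can simply be added up.
    return sum(column.values())


print(total(new_cases))                        # 121
print(total(cases_to_date, cumulative=True))   # 121
print(total(cases_to_date))                    # 296: naive summing overcounts
```

Software that creates a default view without knowing which case applies would produce the third result for the second table, which is the problem the flag is meant to prevent.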


Thanks, and all the best,


David

Andrew Smith

Mar 6, 2019, 4:14:53 AM
to hxlpr...@googlegroups.com
Hi David,

I would advocate that the HXL standard should assume that data is disaggregated by default (i.e. reported non-cumulatively and hence summable) and that we should encourage data to be reported this way.

I note that your particular examples are epidemiological. In epidemiology and other domains there may be existing standard practices for aggregate data, in which case we ought to defer to them. My experience is with meteorological data, rainfall data in particular, which can be aggregated over time in a number of different ways for different purposes.

A couple of observations:

A) Non-cumulative data is much more likely to be self-evidently non-cumulative, without additional knowledge. With cumulative values it might be hard to tell whether they are cumulative or just show an accelerating growth rate. Your first table can *only* be non-cumulative; your second table could be ambiguous to someone who didn't read the metadata carefully.

B) "Cases to date" has an implicit start date. If the number of cases prior to the episode of interest is zero then the choice of start date may not affect analysis too much. But if you are dealing with a metric which has some ambient value (e.g. birth rate or mortality rate within a population) then the selection of the start date is important and ought to be made explicit. I believe that for an epidemic there is a formal process for declaring a start date (I'm sure someone will correct me if I'm wrong on this), but for other metrics a formal, agreed start date might not exist.

C) Another use case to consider is the potential need to aggregate over multiple overlapping time periods. Say you had available the number of new measles cases per day. There might be cases for aggregating by calendar month (perhaps for financial planning/reporting) and/or a rolling multi-day average (e.g. for understanding the progression of the outbreak). [Apologies for the clunky example].
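
The two aggregations in point C can be sketched in a few lines of Python. The daily counts below are made up purely for illustration, and the rolling mean assumes one value for every consecutive day with no gaps.

```python
from collections import defaultdict
from datetime import date, timedelta

# Made-up daily counts of new cases, for illustration only.
first_day = date(2019, 1, 29)
counts = [4, 2, 3, 5, 1, 0, 2]
daily = {first_day + timedelta(days=i): n for i, n in enumerate(counts)}


def monthly_totals(daily):
    """Sum daily (non-cumulative) counts into calendar-month totals."""
    totals = defaultdict(int)
    for day, n in daily.items():
        totals[day.strftime("%Y-%m")] += n
    return dict(totals)


def rolling_mean(daily, window=3):
    """Mean over each run of `window` days, keyed by the last day in the run.

    Assumes the days are consecutive with no missing dates.
    """
    days = sorted(daily)
    result = {}
    for i in range(window - 1, len(days)):
        span = days[i - window + 1 : i + 1]
        result[days[i]] = sum(daily[d] for d in span) / window
    return result


print(monthly_totals(daily))   # {'2019-01': 9, '2019-02': 8}
print(rolling_mean(daily)[date(2019, 2, 3)])   # 2.0
```

Both views are derived from the same disaggregated daily values, which is only possible because those values are summable in the first place.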

So in answer to your questions:
>>> 1. Should the assumed default be summable (non-cumulative unless otherwise indicated) or non-summable?
Summable unless indicated otherwise by the presence of a "start period" column.

>>> 2. What attribute(s) should we use, e.g. +nosum, +summable, +cumulative, etc.?
If the metric is cumulative then it is important to capture the start date and make it explicit. Hence I'd suggest a slightly different approach: tag the columns with the start and end datetime for the aggregation period:

Surveillance period start   Surveillance period end   Cases
#date +start                #date +end                #affected +infected
2019-01-01                  2019-02-01                72
2019-01-01                  2019-03-01                103
2019-01-01                  2019-04-01                121
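
As a rough sketch of how software might use explicit start and end columns, the Python below treats rows that share one start date as cumulative and converts them into non-overlapping per-period counts. The shared-start heuristic and both function names are assumptions made for illustration; they are not part of HXL or any existing library.

```python
# Rows from the table above: (period start, period end, cases).
rows = [
    ("2019-01-01", "2019-02-01", 72),
    ("2019-01-01", "2019-03-01", 103),
    ("2019-01-01", "2019-04-01", 121),
]


def is_cumulative(rows):
    """Heuristic (an assumption): several rows sharing a single start date
    cover overlapping periods, so their values are cumulative."""
    starts = {start for start, _end, _cases in rows}
    return len(rows) > 1 and len(starts) == 1


def per_period(rows):
    """Turn cumulative rows into non-overlapping (start, end, cases) rows."""
    # ISO 8601 date strings sort chronologically as plain strings.
    ordered = sorted(rows, key=lambda row: row[1])
    result = []
    prev_end, prev_total = ordered[0][0], 0
    for _start, end, running_total in ordered:
        result.append((prev_end, end, running_total - prev_total))
        prev_end, prev_total = end, running_total
    return result


if is_cumulative(rows):
    for row in per_period(rows):
        print(row)
# ('2019-01-01', '2019-02-01', 72)
# ('2019-02-01', '2019-03-01', 31)
# ('2019-03-01', '2019-04-01', 18)
```

The recovered counts (72, 31, 18) match the non-cumulative table at the top of the thread, and unlike the cumulative values they can safely be added.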

I hope that is helpful,
Andy


Andy Smith
Head of Technical Development
MapAction
Mapping for people in crisis

Douglas Court, 1-2 Seymour Business Park, Station Road, Chinnor, OX39 4HA
t: +44 (0)1494 568 899 | s: andrewphilipsmith
mapaction.org | asm...@mapaction.org

Please note my regular working days are Tuesday to Friday
For more information about the MapAction privacy policy see mapaction.org/privacy



minu limbu

Mar 7, 2019, 2:46:19 AM
to hxlpr...@googlegroups.com
Tks Andy, 

Fully support. 

Best Regards,

--- 
Minu Limbu 
|Humanitarian|Emergency|Innovation|Information|Semantics|Knowledge|
minu...@gmail.com|skype: minu.limbu
ph:+254702314432 ( new )


David Megginson

Mar 7, 2019, 2:58:54 PM
to hxlpr...@googlegroups.com
Thank you, Andy and Minu. The consensus so far, then, is that we should assume that data is summable by default, unless there's some indication that it's not. I think that matches how people have actually been using HXL-hashtagged data over the past few years. Andy's point about needing a start date for cumulative data to be meaningful is also well taken.

Cheers, David