At the SED-ML Editors' meeting, we discussed
https://github.com/SED-ML/sed-ml/issues/151 briefly, but the issues were a bit complicated, so I promised to come back and email everyone a clearer description of the problem.
So! Attached is an l1v4 SED-ML file, its referenced SBML file, and an Excel file with three possible reports.
The SED-ML file pulls in the 'hill' model as model0, then sets 'n' to 0 for model1.
There is one simulation: 'simulate from 0 to 1 in 5 steps'.
task0 runs that simulation on model0, task1 runs that simulation on model1.
task2 is a repeated task: run task0 and then task1 four times.
There are three data generators: the first tracks S2 from model0 in task2, and the second tracks S3 from model1 in task2.
Then, finally, there is a third data generator that calculates the difference between model0.S2 and model1.S3.
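To make the execution order concrete, here's a minimal Python sketch of the schedule implied by the setup above. This is not output from the attached files; the tuples and the placeholder time course are just illustrations, and I'm assuming the usual reading that 'from 0 to 1 in 5 steps' yields 6 output points.

```python
def uniform_time_course():
    """'Simulate from 0 to 1 in 5 steps' -> 6 output points (assumption)."""
    return [i / 5 for i in range(6)]

def run_repeated_task():
    """task2: run task0 (on model0), then task1 (on model1), four times."""
    schedule = []
    for repeat in range(4):
        for task, model in (("task0", "model0"), ("task1", "model1")):
            for t in uniform_time_course():
                schedule.append((repeat, task, model, t))
    return schedule

schedule = run_repeated_task()
# 4 repeats x 2 subtasks x 6 points = 48 collection opportunities in all,
# but only 24 of them actually update any one model.
```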
So! The question is: what do datagenerators do when the variable they've been asked to track is being changed by one subtask, but not another?
There are three possibilities, as I see it. All three are tracked on the attached Excel spreadsheet.
Option 1 (tab 1): Do nothing. As you are collecting data about model0.S2, if model0 isn't being updated by the current subtask (i.e., task1), you simply don't collect data at all. You end up with a 24-entry column of data for both S2 and S3, though you collected them at different times (that is, you first collect six values for S2, then six values for S3, then another six for S2, another six for S3, etc.). When you calculate S2 - S3, you are comparing data that wasn't actually collected at the same time, making it impossible to create your 'S2 - S3' data on the fly; you have to wait until the end of the entire repeat before you can perform the comparison. This works out nicely for this example, but it's easy to see it getting thrown off if, say, the two subtasks had used different simulations with different numbers of output points. You'd just have to declare 'this is illegal' or something.
Option 2 (tab 2): Collect NaNs. As you are collecting data about model0.S2, if model0 isn't being updated by the current subtask (i.e., task1), you collect NaNs instead. You end up with a 48-entry column of data, half of which are NaNs. Every single one of your calculations of 'S2 - S3' ends up as NaN because of this, but at least you can calculate them on the fly.
Option 3 (tab 3): Collect last known values. As you are collecting data about model0.S2, if model0 isn't being updated by the current subtask (i.e., task1), you collect the last known value of model0.S2 instead. You end up with a 48-entry column of data, half of which are repeats. Your calculations of 'S2 - S3' all end up as actual values, and you can calculate them on the fly. Note that you need model1 to be initialized before collecting any data, so that you know what model1.S3 is when you start the first subtask, which only operates on model0.
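The three options can be contrasted in one small sketch. Again, this is purely illustrative Python, not real simulator output: the formula for model0.S2 below is a made-up placeholder, and the initial value of 1.0 is an arbitrary stand-in for 'model0 was initialized before the repeat started'.

```python
import math

def collect(policy):
    """Collect model0.S2 under one of three policies while task2
    alternates between task0 (updates model0) and task1 (does not).
    policy: "skip" = Option 1, "nan" = Option 2, "last" = Option 3."""
    s2 = []
    last = 1.0  # placeholder initial value for model0.S2
    for repeat in range(4):
        for task in ("task0", "task1"):
            for step in range(6):
                if task == "task0":
                    # model0 is actually updated; placeholder dynamics
                    last = 1.0 + repeat + step / 10
                    s2.append(last)
                elif policy == "nan":     # Option 2: record a NaN
                    s2.append(math.nan)
                elif policy == "last":    # Option 3: repeat last known value
                    s2.append(last)
                # Option 1 ("skip"): record nothing at all
    return s2

# Option 1 yields 24 entries; Options 2 and 3 yield 48,
# half NaNs and half repeats respectively.
```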
As I have currently edited the spec, I claim that Option 3 is what is expected, simply because it made a certain amount of intuitive sense to me (I have not actually tried to implement it, however). I believe Jonathan Karr's Biosimulators would end up with Option 2. Option 1 is somewhat intuitive, but can easily get thrown off by uneven data.
To further complicate things, I wrote the spec such that if you want Option 1, you can accomplish it by using the 'remainingDimensions' construct. This is illustrated in the attached 'example_one_rt_many_models_remaining_dimensions.sedml' file. Basically, in the Variable object you explicitly state 'we are only collecting data for these dimensions' by listing only task2 and the appropriate subtask. This would allow people to accomplish either Option 1 or Option 3, depending on what they wanted.
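In plain terms, restricting the Variable to task2 plus one named subtask amounts to filtering the full set of collection events down to the ones that belong to that subtask. A hypothetical sketch of that idea (the tuples are placeholders, not the actual 'remainingDimensions' encoding, which is in the attached file):

```python
# 48 potential collection events: 4 repeats x 2 subtasks x 6 output points
events = [(repeat, subtask)
          for repeat in range(4)
          for subtask in ("task0", "task1")
          for _step in range(6)]

# Listing only task2 and the task0 subtask in the Variable amounts to
# keeping just the events generated by that subtask:
kept = [e for e in events if e[1] == "task0"]
# This recovers Option 1's 24-entry column for model0.S2.
```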
So! What do we want? It should be noted that none of this is because of new constructs in L1v4; it's been possible to define SED-ML files with these issues in them since L1v2 when repeated tasks were introduced, so this is entirely an issue of clarification for the new spec.
-Lucian