Repeated tasks with multiple models


Lucian Smith

Jun 18, 2021, 6:47:32 PM
to sed-ml-...@googlegroups.com
At the SED-ML Editor's meeting, we discussed https://github.com/SED-ML/sed-ml/issues/151 some, but the issues were a bit complicated, so I promised to come back and email everyone with a clearer description of the problem.

So!  Attached is an l1v4 SED-ML file, its referenced SBML file, and an Excel file with three possible reports.

The SED-ML file pulls in the 'hill' model as model0, then sets 'n' to 0 for model1.

There is one simulation:  'simulate from 0 to 1 in 5 steps'.

task0 runs that simulation on model0, task1 runs that simulation on model1.

task2 is a repeated task:  run task0 and then task1 four times.

There are three data generators: the first tracks S2 from model0 in task2, and the second tracks S3 from model1 in task2.

Then, finally, there is a third data generator that calculates the difference between model0.S2 and model1.S3.

So!  The question is:  what do data generators do when the variable they've been asked to track is being changed by one subtask, but not another?

There are three possibilities, as I see it.  All three are tracked on the attached Excel spreadsheet.

Option 1 (tab 1): Do nothing.  As you are collecting data about model0.S2, if model0 isn't being updated by the current subtask (i.e. task1), you simply don't collect data at all.  You end up with a 24-entry column of data for both S2 and S3, though you collected them at different times (that is, you first collect six values for S2, then six values for S3, then another six for S2, another six for S3, etc.).  When you calculate S2 - S3, you are comparing data that wasn't actually collected at the same time, making it impossible to create your 'S2 - S3' data on the fly; you have to wait until the end of the entire repeat before you can perform the comparison.  This turns out nicely for this example, but it's easy to see it getting thrown off if there had been two different simulations with different numbers of entries.  You'd just have to say 'this is illegal' or something.

Option 2 (tab 2):  Collect NaNs.  As you are collecting data about model0.S2, if model0 isn't being updated by the current subtask (i.e. task1), you collect NaNs instead.  You end up with a 48-entry column of data, half of which are NaNs.  Every single one of your calculations of 'S2 - S3' ends up as NaN because of this, but at least you can calculate them on the fly.

Option 3 (tab 3):  Collect last known values.  As you are collecting data about model0.S2, if model0 isn't being updated by the current subtask (i.e. task1), you collect the last known value for model0.S2.  You end up with a 48-entry column of data, half of which are repeats.  Your calculations of 'S2 - S3' all end up as actual values, and you can calculate them on the fly.  Note that you need model1 to be initialized before collecting any data, so you know what model1.S3 is when you start the first subtask, which only operates on model0.
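To make the three options concrete, here is a hypothetical sketch (not any real SED-ML implementation, and with placeholder numbers standing in for actual simulation output) of how a tool might fill the model0.S2 and model1.S3 columns during the repeated task:

```python
def collect(option, repeats=4, points=6):
    """Sketch of the three data-collection options for the example repeated task.

    Each repeat runs task0 (updates model0 only) and then task1 (updates
    model1 only), each producing `points` values.  The numbers below are
    placeholders, not real simulation output.
    """
    s2, s3 = [], []
    # Option 3 requires model1 to be initialized before any data is collected;
    # 0.0 here stands in for model1.S3's initial value.
    s2_last, s3_last = float("nan"), 0.0
    for _ in range(repeats):
        # Subtask 1: task0 updates model0 only.
        for i in range(points):
            s2_last = 1.0 + i            # placeholder value for model0.S2
            s2.append(s2_last)
            if option == 2:
                s3.append(float("nan"))  # model1 not updated: record NaN
            elif option == 3:
                s3.append(s3_last)       # model1 not updated: repeat last value
            # Option 1: record nothing at all for S3.
        # Subtask 2: task1 updates model1 only.
        for i in range(points):
            s3_last = 0.5 * i            # placeholder value for model1.S3
            s3.append(s3_last)
            if option == 2:
                s2.append(float("nan"))
            elif option == 3:
                s2.append(s2_last)
    return s2, s3
```

Under Option 1 both columns come out 24 entries long (but collected at different times); under Options 2 and 3 both come out 48 entries long, half NaNs or half repeats respectively.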

As I have currently edited the spec, I claim that Option 3 is what is expected, simply because it made a certain amount of intuitive sense to me (I have not actually tried to implement it, however).  I believe that Jonathan Karr's Biosimulators would end up with Option 2.  Option 1 is somewhat intuitive, but can easily get thrown off by uneven data.

To further complicate things, I wrote the spec such that if you want Option 1, you can accomplish this by using the 'remainingDimensions' construct.  This is illustrated in the attached 'example_one_rt_many_models_remaining_dimensions.sedml' file.  Basically, you explicitly state in the Variable object 'we are only collecting data for these dimensions' by listing only task2 and the appropriate subtask.  This would allow people to accomplish either Option 1 or Option 3, depending on what they wanted.

So!  What do we want?  It should be noted that none of this is because of new constructs in L1v4; it's been possible to define SED-ML files with these issues in them since L1v2 when repeated tasks were introduced, so this is entirely an issue of clarification for the new spec.

-Lucian 
example_one_rt_many_models.sedml
hill.xml
Three_reports.xlsx
example_one_rt_many_models_remaining_dimensions.sedml

Jonathan Karr

Jun 18, 2021, 7:33:16 PM
to sed-ml-...@googlegroups.com
Thanks for pushing this discussion. This gets at one of the biggest sources of divergence around SED-ML.

I don't follow option 1.

To me, option 3 requires a few jumps in logic:
  • It's not clear to me that model0 is defined for task0 and that model1 is defined for task2. Trying to work around this requires using model references where they haven't been used before.
  • It requires initializing models, and models having simulation state separate from simulations. This requires a sort of look ahead in the execution of workflows. This becomes particularly unclear if a workflow involves applying multiple simulation methods (which would involve potentially very different initial simulation states) to the same model. For example, a repeated task with subtasks for both CVODE and SSA on the same model. In this case, what is the implicit initial simulation state? What if those subtasks have the same order (meaning they could be executed in parallel)? What if the initial conditions are random? How does that impact the use of random streams? What is the initial state of a steady-state simulation? or a Reachability analysis? or FBA?
  • I think it's very important to clearly separate models, operations (simulations) on models, and simulation state of models. This option blurs all three.
  • By blurring models, simulations, and simulation state, this option would make it more difficult for simulation tools to support SED-ML. For one, this requires simulation tools to separate simulation initialization and temporal evolution in a particular way. It's not clear to me that this makes sense for all possible algorithms.
To me, option 2 is the best choice. Simulation state is only meaningful (i.e. non-NaN) where models are used with tasks. Where a model isn't used, its simulation "state" is ill-defined (i.e. NaN/null). This maintains a clear separation of models, simulations, and simulation state. The implication of this is that S2-S3 would always be undefined (NaN). Obviously, this isn't helpful.  One way to enable more mathematical flexibility would be to allow data generators to be inputs to calculations. 
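A quick sketch of why S2 - S3 is always undefined under this reading (the values are hypothetical; what matters is that at any given index exactly one of the two columns is NaN, because task0 and task1 never update model0 and model1 at the same time):

```python
nan = float("nan")

# Option 2 columns: each model's variable is NaN wherever the other
# model's task is the one running.
s2 = [1.0, 2.0, nan, nan]   # model0.S2: real during task0, NaN during task1
s3 = [nan, nan, 0.0, 0.5]   # model1.S3: NaN during task0, real during task1

# NaN propagates through arithmetic, so every difference is NaN.
diff = [a - b for a, b in zip(s2, s3)]
```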

Jonathan

--
You received this message because you are subscribed to the Google Groups "sed-ml-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sed-ml-discus...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/sed-ml-discuss/CAHLmBr0afEk6odN7ZnpJMj6MqAcDs6V5nmYi2-ybPQsHUNjPhg%40mail.gmail.com.

Lucian Smith

Jun 18, 2021, 8:19:00 PM
to sed-ml-...@googlegroups.com
On Fri, Jun 18, 2021 at 4:33 PM Jonathan Karr <jonr...@gmail.com> wrote:
Thanks for pushing this discussion. This gets at one of the biggest sources of divergence around SED-ML.

I don't follow option 1.

To me, option 3 requires a few jumps in logic:
  • It's not clear to me that model0 is defined for task0 and that model1 is defined for task2. Trying to work around this requires using model references where they haven't been used before.

This part is unavoidable, regardless of how we choose to interpret this situation.  If task2 involves both model0 and model1, you have to tell SED-ML which model you want 'S2' from.  I don't see any other way around this.
 
  • It requires initializing models, and models having simulation state separate from simulations.

Yes, this is how I understand and have always understood SED-ML to work.  Models have states that are set up through initialization, modified by simulations, and retained afterwards unless/until reset.  Model state that persists after simulation is absolutely a requirement of correct SED-ML interpretation, and I believe that along with that is an implied 'Model state is created through initialization after applying any model changes'.
 
  • This requires a sort of look ahead in the execution of workflows. This becomes particularly unclear if a workflow involves applying multiple simulation methods (which would involve potentially very different initial simulation states) to the same model. For example, a repeated task with subtasks for both CVODE and SSA on the same model. In this case, what is the implicit initial simulation state?

I can think of no simulation that changes the initial state of the model outside of a ModelChange.  Initialized models have a clear state:  initialized.  You don't ever have to look ahead; you just initialize the model.  It is the same for CVODE and SSA and any other simulation algorithm I can think of.
 
  • What if those subtasks have the same order (meaning they could be executed in parallel)?
This is a problem with both Option 2 and Option 3.  I think that it implies that if you want 'S2 - S3', you can't have the same order without getting non-reproducible results.  If we go with Option 1, this problem disappears!
 
  • What if the initial conditions are random?
This is part of the initialization of the model, and the randomization applies at that point.
 
  • How does that impact the use of random streams?
I don't know what a random stream is in this context.
 
  • What is the initial state of a steady-state simulation? or a Reachability analysis? or FBA?
They are all identical to each other, and entirely defined by the 'Model' construct.
 
  • I think it's very important to clearly separate models, operations (simulations) on models, and simulation state of models. This option blurs all three.
A model file and any ModelChanges define its initial state.
Simulations change that state.
That's it!

It is possible to separate model initialization from model definition, but SBML and CellML have never done this (nor have model languages derived from or associated with them, such as BNGL).  It's possible that we might encounter some other language that has a separate 'initialize' process, and if so, I would recommend adding a whole new 'Initialize' construct to SED-ML to handle it. (For now, I would use 'ModelChange' to handle it, if need be.)  But we haven't seen (or really thought about) this thus far with SED-ML.
 
  • By blurring models, simulations, and simulation state, this option would make it more difficult for simulation tools to support SED-ML. For one, this requires simulation tools to separate simulation initialization and temporal evolution in a particular way. It's not clear to me that this makes sense for all possible algorithms.
I don't really understand how you see any 'blurring' here, but more to the point, I don't see how you can *avoid* the role of the 'model state' as I understand it above.  
 
To me, option 2 is the best choice. Simulation state is only meaningful (i.e. non-NaN) where models are used with tasks. Where a model isn't used, its simulation "state" is ill-defined (i.e. NaN/null). This maintains a clear separation of models, simulations, and simulation state. The implication of this is that S2-S3 would always be undefined (NaN). Obviously, this isn't helpful.  One way to enable more mathematical flexibility would be to allow data generators to be inputs to calculations. 

To be clear:  while I think Option 3 or even Option 1 is preferable to Option 2, mostly I just want to define everything so that everyone is on the same page (an opinion I know you share!).

-Lucian

Jonathan Karr

Jun 18, 2021, 8:32:47 PM
to sed-ml-...@googlegroups.com
I agree with using model references in addition to task references to disambiguate. I think this is a separate idea that can be combined with all three options. 

Whatever we land on, we should add detailed cases for this to our test suite. We already have several cases for repeated tasks running with all 5 supported languages, but those cases weren't designed to make sure alternative interpretations aren't being used.

Jonathan


Matthias König

Jul 16, 2021, 12:45:34 PM
to sed-ml-discuss
For me, option 1 makes the most sense. Data should only be collected for a model (i.e. for model variables used in DataGenerators) if the model is involved in a subtask.
This requires that the simulations/tasks are well defined, and math is only possible if the data generator matrices have the same dimensions. I.e., S2 - S3 would give a ValueError if, as in this example, S2 and S3 are vectors of different lengths.

This problem is just a subset of the general problem that math on matrices with different dimensions (and different lengths of dimensions) is not well defined. The exact same issue would occur, for instance, if we had two repeated tasks and a data generator doing math between them, i.e. task_repeated1.S2 - task_repeated2.S3.
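The rule proposed here could be sketched like this (a hypothetical helper, not part of any SED-ML library): elementwise math between data-generator columns is only defined when both operands have the same length, and anything else is an error.

```python
def elementwise_sub(xs, ys):
    """Subtract two data-generator columns elementwise.

    Under option 1, an expression like S2 - S3 is only legal when both
    operands have the same shape; a mismatch is an error rather than
    being padded with NaNs or repeated values.
    """
    if len(xs) != len(ys):
        raise ValueError(f"shape mismatch: {len(xs)} vs {len(ys)}")
    return [x - y for x, y in zip(xs, ys)]
```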

Option 3 probably doesn't work due to parallelization, because I would write it as
<listOfSubTasks>
  <subTask order="0" task="task0"/>
  <subTask order="0" task="task1"/>
</listOfSubTasks>
Because the two subtasks are independent, they can be parallelized, and the identical order says that they should be. But in this case you don't have values to fill in for Option 3, as I understand it. There are similar issues with Option 2 and parallelization. You want to be able to execute the different subtasks independently; i.e., the data generators should have the same dimensions and data whether you do

<repeatedTask id="rtask1">
  <listOfSubTasks>
    <subTask order="0" task="task0"/>
    <subTask order="0" task="task1"/>
  </listOfSubTasks>
</repeatedTask>

with rtask1.task0.S2 - rtask1.task1.S3

or

<repeatedTask id="rtask2">
  <listOfSubTasks>
    <subTask order="0" task="task0"/>
  </listOfSubTasks>
</repeatedTask>
<repeatedTask id="rtask3">
  <listOfSubTasks>
    <subTask order="0" task="task1"/>
  </listOfSubTasks>
</repeatedTask>

with rtask2.task0.S2 - rtask3.task1.S3

Best, Matthias