Modeling and Parameter Estimates


Liang Meng Wee

Nov 17, 2020, 4:20:57 PM
to COPASI User Forum
Dear all,
I would appreciate it if you could help address some of my questions on parameter estimation.

Goal

My goal is to perform parameter estimation of the true rate constants that characterize step-wise transcription by RNA polymerase, using 3 independent datasets.

Data

The data are extracted from the quantification of radiolabeled transcripts as a proxy for transcription by RNA polymerase. Transcription is monitored, with single-nucleotide resolution, as an increase in transcript length with time. Each lane corresponds to a specific time point at which transcription is stopped, and every lane except the zero time point has bands of varying size that are separately quantified and normalized to the total intensity of that lane. The different-sized transcript bands per lane correspond to the positions of RNA polymerase on the DNA and thus track transcription. In summary, my input raw data have a column for time and 3 additional columns that track the normalized populations of transcript corresponding to RNA polymerase at the specific positions E, ES, E2S and E3S (see model below). These positions were picked because they constitute the rate-limiting steps in the overall scheme. I have attached my raw data as 3 separate tab-delimited files.

Model for fitting

The simplest model contains 3 rate-limiting steps along the main pathway, from E to ES, from ES to E2S, and finally from E2S to E3S, and includes 1 off-pathway state, ESi, reached from ES. All transitions are assumed to be reversible. Given that substrate is in excess, I assume it acts purely as a constant and is not explicitly defined in the model. Hence, the pseudo-first-order model (Model 1) is depicted as follows,

Model 1:


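While the scheme itself is in the attached image, the Model 1 topology described above can be sketched as a pseudo-first-order ODE system. The rate-constant names (k1f/k1r for E⇌ES, etc.) and the numeric values below are placeholders of mine, not the ones in Model 1.cps:

```python
# Hypothetical sketch of Model 1: E <-> ES <-> E2S <-> E3S, with the
# off-pathway state ES <-> ESi; substrate is absorbed into the forward
# rate constants (pseudo-first-order). Names and values are placeholders.
import numpy as np
from scipy.integrate import solve_ivp

def model1(t, y, k1f, k1r, k2f, k2r, k3f, k3r, kif, kir):
    E, ES, E2S, E3S, ESi = y
    dE   = -k1f*E + k1r*ES
    dES  =  k1f*E - (k1r + k2f + kif)*ES + k2r*E2S + kir*ESi
    dE2S =  k2f*ES - (k2r + k3f)*E2S + k3r*E3S
    dE3S =  k3f*E2S - k3r*E3S
    dESi =  kif*ES - kir*ESi
    return [dE, dES, dE2S, dE3S, dESi]

k = (1.0, 0.5, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1)    # placeholder rate constants
y0 = [0.64, 0.0, 0.0, 0.0, 0.0]                 # fractions of total enzyme
sol = solve_ivp(model1, (0, 20), y0, args=k, rtol=1e-8, atol=1e-10)
print(sol.y[:, -1])                             # populations at the last time
```

Note that the total over all five states stays at 0.64, since every transition only moves enzyme between states.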
Fitting results

I am using the Levenberg-Marquardt method, which gives me an objective value of 0.00388, a root mean square value of 0.0037 and a standard deviation of 0.0037. Of the 8 estimated values for the true rate constants, 6 have coefficients of variation below 10% and 2 consistently hover around 20% for good fits (i.e. fitted lines run through most of the experimental data points). Even when I randomized the initial start values for all parameters without updating the model, several consecutive fitting runs converged to essentially the same estimated values for all parameters, indicating a robust and reproducible fit for the specified kinetic scheme. I have included the Model 1.cps file with this post.
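The randomized-restart check described above can also be scripted as a toy sketch outside COPASI; here a single-exponential fit stands in for the real objective, just to illustrate the multi-start procedure:

```python
# Toy multi-start fit: refit from several random initial guesses and check
# that the estimates agree (a stand-in for re-running the COPASI estimation
# with randomized start values; the exponential model is illustrative only).
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 30)
y_obs = 0.64 * np.exp(-0.7 * t) + rng.normal(0, 0.005, t.size)

def resid(p):
    A, k = p
    return A * np.exp(-k * t) - y_obs

fits = np.array([least_squares(resid, rng.uniform(0.1, 2.0, 2)).x
                 for _ in range(10)])
print(fits.std(axis=0))   # near-zero spread -> runs reach the same minimum
```

A small standard deviation across restarts is exactly the reproducibility signature described above; large spread would suggest multiple local minima.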

Questions

1) I realize that a robust fit requires that I input an initial value for Ei greater than 0 (I am using a random value of 0.5), even though I know that this intermediate should not be present at the zero time point. Is this going to be an issue for the validity of the estimated rate constants?

2) Currently, I am looking at the coefficient of variation, standard deviation and sum of squares in the progress-of-fit curve as indicators of how good my fit is. Are there any other built-in statistical outputs that I should be aware of that can help me make the judgement call as to whether the parameter estimates are trustworthy?

3) In my bid to simplify the model further (to have the most parsimonious model without over-parameterizing), I have tried making the main pathway steps irreversible, either in their entirety or only for some steps. One of the models tested is as follows,

Irreversible model


In doing so, I noticed that the modified model does not fit my data well. Furthermore, the estimated reverse rate constants from the all-reversible model are comparable in magnitude to the forward rate constants for each step, indicating that the assumption of irreversible steps is not valid. My question is: can I conclude that the poor fit of the irreversible model, together with the comparable forward and reverse rates in the all-reversible model, justifies sticking with Model 1?

4) I know that a kinetic model is written based on a priori information about the underlying biological mechanism. Assuming I have no prior knowledge of the underlying mechanism, can I start with a model in which all steps are reversible (Model 1), and, if the estimated parameters indicate that for a particular step the reverse rate is far smaller than the forward rate, proceed to make that step irreversible?

5) Given prior information that RNA polymerase at the E and E2S positions can fall into the off-pathway states Ei and E2Si respectively, as indicated by the following model,

 

I proceeded to introduce four additional rate constants describing these new transitions in this second model. To simplify it, I assume that k2f = k6f and k2r = k6r by defining them as global quantities. The simplified model is now as follows,

Model 2

 

With these changes, I notice that the fitting process is more challenging despite having only 2 additional parameters, k2f and k2r. The final estimated parameters tend to fluctuate given the same set of initial parameter values. Likewise, the fitted parameters are less likely to converge to the same values from random initial guesses, and I suspect that the fitting gets trapped in numerous local minima. Only with fine-tuned guesses for Ei, ESi and E2Si can I achieve good parameter estimates whose coefficients of variation stay low (<10%) throughout. What is reassuring is that with a good fit (low coefficient of variation for all parameters), the estimated parameter values are similar to those obtained with Model 1. Can I assume this is a consequence of over-parameterizing with the more complicated model, i.e. there are many more possible solutions but not necessarily with a good fit? I have also included the Model 2.cps file for this second model here.

6) My last question is related to the previous one. By imposing constraints, I believe I should be able to limit the search space and reduce trapping in local minima. Is this assumption correct?

Best,

Wee

Data_1.txt
Model 1.cps
Model 2.cps
Data_2.txt
Data_3.txt

Liang Meng Wee

Nov 17, 2020, 4:25:21 PM
to COPASI User Forum
Here are the missing models. Sorry for that.
Models.jpg

Liang Meng Wee

Nov 17, 2020, 4:29:11 PM
to COPASI User Forum
Models.jpg

Mendes,Pedro

Nov 18, 2020, 10:30:59 AM
to copasi-u...@googlegroups.com
Hello Wee,

I looked at your model and data and I would say that you have a decent fit. I actually tried fitting with a lot more algorithms and never got a better objective value than 0.00388. The fit seems OK to me. You do have a few parameters that have large standard deviations and I think that is likely because of identifiability problems.


1) I realize that a robust fit requires that I input an initial value for Ei greater than 0 (I am using a random value of 0.5), even though I know that this intermediate should not be present at the zero time point. Is this going to be an issue for the validity of the estimated rate constants?

Yes, you need to have a finite amount of E in the system, otherwise there is no reaction. You should know from the experiments how much is the total [E] (the total concentration of RNA polymerase in the assay). If this is not a known parameter, then you need to either use a good guess or, if you don't have one, maybe a random value (but it really should be at least in the correct order of magnitude). It could be possible to fit this value too, but most likely the data are not sufficient to estimate it (i.e. it would become unidentifiable).

By not having the correct value of [E]_0 (the initial total concentration of E), the values of your estimated rate constants are not reliable in absolute terms. Note that the model really fits the concentrations of the intermediates, which appear with rates of [E]*k_on, so if you have the wrong [E] the k_on you estimate will not be the correct one (but because all k's are affected similarly, their relative values are trustworthy, since the [E] cancels out). The fit essentially gets [E]*k_on right, but since you don't know the value of [E], the k_on is also not correct.

This is even a little more complex than I wrote above, because it is the total amount [E]+[ES]+[E2S]+[E3S]+[ESi] that corresponds to the total RNA polymerase in the assay (because in the preparation you don't know how much of each form you have). Also, what I wrote for [E]*k1_on eventually applies to [ES]*k2_on etc.
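A quick way to see this point for normalized data: in a linear (pseudo-first-order) scheme, scaling the initial amount scales the entire trajectory by the same factor, so the shape of the curves carries no information about the absolute [E]_0 (toy two-state sketch, not the actual model):

```python
# In a linear scheme, multiplying the initial amounts by a constant
# multiplies every trajectory by the same constant; normalized data
# therefore cannot pin down the absolute enzyme concentration.
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y, kf, kr):                 # toy E <-> ES interconversion
    E, ES = y
    return [-kf*E + kr*ES, kf*E - kr*ES]

t_eval = np.linspace(0, 5, 50)
a = solve_ivp(rhs, (0, 5), [1.0, 0.0], args=(0.8, 0.3),
              t_eval=t_eval, rtol=1e-10, atol=1e-12)
b = solve_ivp(rhs, (0, 5), [50.0, 0.0], args=(0.8, 0.3),
              t_eval=t_eval, rtol=1e-10, atol=1e-12)
print(np.allclose(b.y, 50 * a.y, rtol=1e-6))   # trajectories differ only by scale
```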

Finally I also notice that you have [ESi]_0 = 0.5 which also affects the whole thing. Bear in mind that the system you have has a total mass conservation equation [E]+[ES]+[E2S]+[E3S]+[ESi]. You can check this under Tasks->Stoichiometric Analysis->Mass Conservation
That gives you the total RNAPol in your system.
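The conserved total that the Mass Conservation analysis reports corresponds to a left null vector of the stoichiometric matrix. A sketch with the assumed Model 1 topology (species order E, ES, E2S, E3S, ESi; my ordering of the four reversible transitions):

```python
# Conservation relations are left null vectors of the stoichiometric
# matrix N (rows = species, columns = reactions). For Model 1's topology
# the single relation is E + ES + E2S + E3S + ESi = constant.
import numpy as np
from scipy.linalg import null_space

N = np.array([
    # E<->ES  ES<->E2S  E2S<->E3S  ES<->ESi
    [-1,       0,        0,         0],   # E
    [ 1,      -1,        0,        -1],   # ES
    [ 0,       1,       -1,        0],   # E2S
    [ 0,       0,        1,        0],   # E3S
    [ 0,       0,        0,        1],   # ESi
])
L = null_space(N.T)        # basis of the left null space
print((L / L[0]).ravel())  # normalizes to [1, 1, 1, 1, 1]
```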



 2)    Currently, I am looking at the coefficient of variation, standard deviation and sum of squares in the progress of fit curve as indicators for how good my fit is. Are there any other built-in statistical outputs that I should be aware of that can help me make the judgement call as to whether the parameter estimates are trustworthy?

Those are good indicators of fit. Bear in mind that they are values that depend on your model and data, so the value 0.0038 seems to be good for your problem, but a different problem may have a very good fit with an objective function value of 3.4e6 ...

Another way of analyzing the quality of the fit is to look at the residuals (weighted errors). You can do that by just making visible the weighted errors on your plot (ie hide the measured and fitted values). The weighted errors should be small and randomly scattered around 0. When the fit is bad they will be larger and will reflect the data (ie will show a pattern like the measured data, or its mirror)
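The same residual check can be scripted directly; the numbers below are synthetic and the "bad" fit is a hypothetical wrong model, purely to show the contrast between random scatter and a leftover pattern:

```python
# Weighted residuals: small and randomly scattered around 0 for a good fit;
# large and patterned (echoing the data shape) for a bad one.
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 10, 40)
measured = 0.6 * np.exp(-0.5 * t) + rng.normal(0, 0.01, t.size)
fitted_good = 0.6 * np.exp(-0.5 * t)          # right model form
fitted_bad = 0.6 - 0.03 * t                   # hypothetical wrong model
sigma = 0.01                                  # measurement error estimate

for name, fit in [("good", fitted_good), ("bad", fitted_bad)]:
    r = (measured - fit) / sigma              # weighted residuals
    print(name, round(r.mean(), 2), round(np.abs(r).max(), 2))
```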


3) In my bid to simplify the model further (to have the most parsimonious model without over-parameterizing), I have tried making the main pathway steps irreversible, either in their entirety or only for some steps. One of the models tested is as follows,

Irreversible model


In doing so, I noticed that the modified model does not fit my data well. Furthermore, the estimated reverse rate constants from the all-reversible model are comparable in magnitude to the forward rate constants for each step, indicating that the assumption of irreversible steps is not valid. My question is: can I conclude that the poor fit of the irreversible model, together with the comparable forward and reverse rates in the all-reversible model, justifies sticking with Model 1?

I am not sure about the conclusions you can get from this. At face value what you write may be correct. But I think we need to analyze the model better. Your model only represents the states of the RNAPol, and it doesn't allow it to be free after full transcription (ie it ends in E3S state, but normally it would be released and become free again, ie E3S -> E).

4) I know that a kinetic model is written based on a priori information about the underlying biological mechanism. Assuming I have no prior knowledge of the underlying mechanism, can I start with a model in which all steps are reversible (Model 1), and, if the estimated parameters indicate that for a particular step the reverse rate is far smaller than the forward rate, proceed to make that step irreversible?


Well, when you have that situation you can make the step irreversible and see how that affects the result. That is one of the advantages of modeling, it allows you to try out hypotheses. Sometimes what seems obvious does not materialize as we think.. .so you can always try and compare.
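One way to "try and compare" is to fit both variants and compare their residual costs. A toy two-state sketch of that comparison (the real comparison would use the COPASI objective values for Model 1 versus the irreversible variant):

```python
# Fit a reversible and an irreversible variant to data generated with a
# substantial reverse rate; the irreversible fit is markedly worse.
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

t_eval = np.linspace(0, 5, 25)

def traj(kf, kr):                     # toy E <-> ES trajectories
    sol = solve_ivp(lambda t, y: [-kf*y[0] + kr*y[1], kf*y[0] - kr*y[1]],
                    (0, 5), [1.0, 0.0], t_eval=t_eval, rtol=1e-9)
    return sol.y

data = traj(0.8, 0.6)                 # "truth" has a large reverse rate

rev = least_squares(lambda p: (traj(p[0], p[1]) - data).ravel(), [0.5, 0.5])
irr = least_squares(lambda p: (traj(p[0], 0.0) - data).ravel(), [0.5])
print(rev.cost, irr.cost)             # irreversible cost is much larger
```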

5) Given prior information that RNA polymerase at the E and E2S positions can fall into the off-pathway states Ei and E2Si respectively, as indicated by the following model,

 

I proceeded to introduce four additional rate constants describing these new transitions in this second model. To simplify it, I assume that k2f = k6f and k2r = k6r by defining them as global quantities. The simplified model is now as follows,

Model 2

 

With these changes, I notice that the fitting process is more challenging despite having only 2 additional parameters, k2f and k2r. The final estimated parameters tend to fluctuate given the same set of initial parameter values. Likewise, the fitted parameters are less likely to converge to the same values from random initial guesses, and I suspect that the fitting gets trapped in numerous local minima. Only with fine-tuned guesses for Ei, ESi and E2Si can I achieve good parameter estimates whose coefficients of variation stay low (<10%) throughout. What is reassuring is that with a good fit (low coefficient of variation for all parameters), the estimated parameter values are similar to those obtained with Model 1. Can I assume this is a consequence of over-parameterizing with the more complicated model, i.e. there are many more possible solutions but not necessarily with a good fit? I have also included the Model 2.cps file for this second model here.

No, the problem with the more complex model is that it entails a number of nonlinear constraints that now make the solution harder to find. Indeed it may be generating local minima in the error function, or it could be adding regions that have little or no gradient (though this is not easy to demonstrate, as the error function is high-dimensional).


6) My last question is related to the previous one. By imposing constraints, I believe I should be able to limit the search space and reduce trapping in local minima. Is this assumption correct?

Yes, but... adding constraints makes finding the solution harder (because you are making parts of the parameter space "forbidden", which makes some algorithms not work well). In fact, you can add contradictory constraints, and then the problem is unsolvable. So if you want to add constraints, I would be careful: add them one by one and carefully study the behavior of the parameter estimation process.
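A toy illustration of both sides of this trade-off, using simple box bounds (the mildest kind of constraint): reasonable bounds keep the search in a plausible region, while bounds that exclude the true minimum pin the estimate to the boundary and degrade the fit. All names and values here are illustrative:

```python
# Box-bounded fitting: bounds restrict the search space, but bounds that
# exclude the true optimum force the fit onto the boundary with worse cost.
import numpy as np
from scipy.optimize import least_squares

t = np.linspace(0, 10, 30)
y = 0.64 * np.exp(-0.7 * t)           # noiseless synthetic data, k = 0.7

def resid(p):
    return p[0] * np.exp(-p[1] * t) - y

free = least_squares(resid, [1.0, 1.2])                    # unbounded
tight = least_squares(resid, [1.0, 1.2],
                      bounds=([0.0, 1.0], [2.0, 2.0]))     # excludes k = 0.7
print(free.x)    # recovers approximately (0.64, 0.7)
print(tight.x)   # k pinned near the lower bound 1.0, worse fit
```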

Hope this helps
Pedro
-- 
Pedro Mendes, PhD
Professor and Director,
Richard D. Berlin Center for Cell Analysis and Modeling
University of Connecticut School of Medicine
group website: http://www.comp-sys-bio.org

ICSB 2021 will take place in Hartford, CT, October 9-14, 2021
http://icsb2020.bioscience-ct.net

Liang Meng Wee

Nov 18, 2020, 11:35:41 PM
to copasi-u...@googlegroups.com, pme...@uchc.edu

Hello Pedro,

Thanks for your prompt reply.

On Nov 18, 2020, at 7:30 AM, Mendes,Pedro <pme...@uchc.edu> wrote:

Hello Wee,

I looked at your model and data and I would say that you have a decent fit. I actually tried fitting with a lot more algorithms and never got a better objective value than 0.00388. The fit seems OK to me. You do have a few parameters that have large standard deviations and I think that is likely because of identifiability problems.

1) I realize that a robust fit requires that I input an initial value for Ei greater than 0 (I am using a random value of 0.5), even though I know that this intermediate should not be present at the zero time point. Is this going to be an issue for the validity of the estimated rate constants?

Yes, you need to have a finite amount of E in the system, otherwise there is no reaction. You should know from the experiments how much is the total [E] (the total concentration of RNA polymerase in the assay). If this is not a known parameter, then you need to either use a good guess or, if you don't have one, maybe a random value (but it really should be at least in the correct order of magnitude). It could be possible to fit this value too, but most likely the data are not sufficient to estimate it (i.e. it would become unidentifiable).

By not having the correct value of [E]_0 (the initial total concentration of E), the values of your estimated rate constants are not reliable in absolute terms. Note that the model really fits the concentrations of the intermediates, which appear with rates of [E]*k_on, so if you have the wrong [E] the k_on you estimate will not be the correct one (but because all k's are affected similarly, their relative values are trustworthy, since the [E] cancels out). The fit essentially gets [E]*k_on right, but since you don't know the value of [E], the k_on is also not correct.

This is even a little more complex than I wrote above, because it is the total amount [E]+[ES]+[E2S]+[E3S]+[ESi] that corresponds to the total RNA polymerase in the assay (because in the preparation you don't know how much of each form you have). Also, what I wrote for [E]*k1_on eventually applies to [ES]*k2_on etc.

Finally I also notice that you have [ESi]_0 = 0.5 which also affects the whole thing. Bear in mind that the system you have has a total mass conservation equation [E]+[ES]+[E2S]+[E3S]+[ESi]. You can check this under Tasks->Stoichiometric Analysis->Mass Conservation
That gives you the total RNAPol in your system.


In the model and raw data files that I posted earlier, the concentrations of my observables (E, ES, E2S and E3S) were presented as fractions of the total enzyme concentration (unitless quantities). I had my initial E set to 0.64 while the rest (ES, ESi, E2S and E3S) were set to 0 at time = 0. With these parameters, the fit wasn't great. But when I set an initial value of Ei = 0.5, the fit was decent. This is what puzzles me, because in theory at time = 0 all species except E should be 0 (i.e. Ei should be 0). Initially, I thought this might reflect an incorrect kinetic scheme, but the decent fit obtained with an initial Ei = 0.5 suggests otherwise. I would like to get your advice on this.

Fitting with estimated concentration of E (t = 0) instead of using normalized fraction
Total [enzyme] added ~ 50 nM. Plotting the raw data for E against time and extrapolating back to the y-axis gives a normalized active [E] at time zero of 0.64 of the total enzyme added. This translates to 0.64 x 50 nM = ~32 nM, which is what I used for the initial [E]. The rest of the enzyme was presumed to be inactive.

In a new parameter estimation using initial [E] = 32 nM and Ei = 15 nM, I fed in raw data listing the concentrations of the observables over time (obtained by multiplying the species fractions by the starting E concentration), so that my initial guess for E and the input raw data are expressed in the same unit of nmol/L (nM). In this fit, the estimated parameters were very close to those from the fit using the unitless quantities. I hope I am doing this correctly. I have attached the updated Model1C.cps file and the raw data (Data_1C, Data_2C and Data_3C). My interpretation of why it works in both cases is that, in the former, I removed the need for an absolute enzyme concentration by using normalized values, while in the latter fit both the estimates and the raw data are multiplied by a constant, the final active enzyme concentration. Therefore, in either case the resultant rate constants after fitting are similar.


Model1C.cps
Data_1C.txt
Data_2C.txt
Data_3C.txt
Objective Function Value.tiff
OBFn.tiff
Flux.tiff
Parameter Estimate.tiff

Liang Meng Wee

Nov 19, 2020, 11:29:06 AM
to COPASI User Forum
Sorry, I just realized that my reply was cut short. I am reposting here with my reply in blue.

Hello Wee,


Yes, you need to have a finite amount of E in the system, otherwise there is no reaction. You should know from the experiments how much is the total [E] (the total concentration of RNA polymerase in the assay). If this is not a known parameter, then you need to either use a good guess or, if you don't have one, maybe a random value (but it really should be at least in the correct order of magnitude). It could be possible to fit this value too, but most likely the data are not sufficient to estimate it (i.e. it would become unidentifiable).

By not having the correct value of [E]_0 (the initial total concentration of E), the values of your estimated rate constants are not reliable in absolute terms. Note that the model really fits the concentrations of the intermediates, which appear with rates of [E]*k_on, so if you have the wrong [E] the k_on you estimate will not be the correct one (but because all k's are affected similarly, their relative values are trustworthy, since the [E] cancels out). The fit essentially gets [E]*k_on right, but since you don't know the value of [E], the k_on is also not correct.

This is even a little more complex than I wrote above, because it is the total amount [E]+[ES]+[E2S]+[E3S]+[ESi] that corresponds to the total RNA polymerase in the assay (because in the preparation you don't know how much of each form you have). Also, what I wrote for [E]*k1_on eventually applies to [ES]*k2_on etc.

Finally I also notice that you have [ESi]_0 = 0.5 which also affects the whole thing. Bear in mind that the system you have has a total mass conservation equation [E]+[ES]+[E2S]+[E3S]+[ESi]. You can check this under Tasks->Stoichiometric Analysis->Mass Conservation
That gives you the total RNAPol in your system.

In the model and raw data files that I posted earlier, the concentrations of my observables (E, ES, E2S and E3S) were presented as fractions of the total enzyme concentration (unitless quantities). I had my initial E set to 0.64 while the rest (ES, ESi, E2S and E3S) were set to 0 at time = 0. With these parameters, the fit wasn't great. But when I set an initial value of Ei = 0.5, the fit was decent. This is what puzzles me, because in theory at time = 0 all species except E should be 0 (i.e. Ei should be 0). Initially, I thought this might reflect an incorrect kinetic scheme, but the decent fit obtained with an initial Ei = 0.5 suggests otherwise. I would like to get your advice on this.

Fitting with estimated concentration of E (t = 0) instead of using normalized fraction
Total [enzyme] added ~ 50 nM. Plotting the raw data for E against time and extrapolating back to the y-axis gives a normalized active [E] at time zero of 0.64 of the total enzyme added. This translates to 0.64 x 50 nM = ~32 nM, which is what I used for the initial [E]. The rest of the enzyme was presumed to be inactive.

In a new parameter estimation using initial [E] = 32 nM and Ei = 15 nM, I fed in raw data listing the concentrations of the observables over time (obtained by multiplying the species fractions by the starting E concentration), so that my initial guess for E and the input raw data are expressed in the same unit of nmol/L (nM). In this fit, the estimated parameters were very close to those from the fit using the unitless quantities. I hope I am doing this correctly. I have attached the updated Model1C.cps file and the raw data (Data_1C, Data_2C and Data_3C). My interpretation of why it works in both cases is that, in the former, I removed the need for an absolute enzyme concentration by using normalized values, while in the latter fit both the estimates and the raw data are multiplied by a constant, the final active enzyme concentration. Therefore, in either case the resultant rate constants after fitting are similar.

Those are good indicators of fit. Bear in mind that they are values that depend on your model and data, so the value 0.0038 seems to be good for your problem, but a different problem may have a very good fit with an objective function value of 3.4e6 ... 

Another way of analyzing the quality of the fit is to look at the residuals (weighted errors). You can do that by just making visible the weighted errors on your plot (ie hide the measured and fitted values). The weighted errors should be small and randomly scattered around 0. When the fit is bad they will be larger and will reflect the data (ie will show a pattern like the measured data, or its mirror)

Got it!

I am not sure about the conclusions you can get from this. At face value what you write may be correct. But I think we need to analyze the model better. Your model only represents the states of the RNAPol, and it doesn't allow it to be free after full transcription (ie it ends in E3S state, but normally it would be released and become free again, ie E3S -> E).

In my setup, when RNA polymerase transcribes through the DNA template, it falls off and does not restart. Thus, each RNA polymerase transcribes only once through the sites ES, E2S and E3S. In that sense, it is unlike other enzymes with multiple turnover, where free E is recovered and fed into the next catalytic cycle. For that reason, I did not include E3S -> E, because I thought this would assume that RNA polymerase, once it completes its run through the template, begins another. Is this thinking correct?

Well, when you have that situation you can make the step irreversible and see how that affects the result. That is one of the advantages of modeling, it allows you to try out hypotheses. Sometimes what seems obvious does not materialize as we think.. .so you can always try and compare.

I have tried making either all the main steps or just the last step irreversible, and the fit wasn't good. From these observations, plus the fact that the forward rate is comparable to the reverse rate for most of the steps, I infer that I cannot ignore the reverse reactions, making the steps necessarily reversible.

No, the problem with the more complex model is that it entails a number of nonlinear constraints that now make the solution harder to find. Indeed it may be generating local minima in the error function, or it could be adding regions that have little or no gradient (though this is not easy to demonstrate, as the error function is high-dimensional).

Point taken

Yes, but... adding constraints makes finding the solution harder (because you are making parts of the parameter space "forbidden", which makes some algorithms not work well). In fact, you can add contradictory constraints, and then the problem is unsolvable. So if you want to add constraints, I would be careful: add them one by one and carefully study the behavior of the parameter estimation process.

Great! This is good to know.

************************************************************************************************************************************
Additional Questions

1) In my new fit using initial guesses for the enzyme concentration, the objective function value has increased, as you highlighted it would change based on the input values/experiments. Can I construe from the breakdown of the final objective function value that [E3S], which has the highest value, also has the most uncertain associated estimates, i.e. that the coefficient of variation is highest for rate constants associated with E3S? (Figures Parameter Estimate.png and OBFn.png)

2) Looking at the final output more carefully, can you explain how COPASI determines the flux? These values might be very informative if I compare them to a different set of experiments. (Figure: Flux.png)

3) A naive question that I can't wrap my head around and feel I should know: do the output rate constants, fluxes, or even the estimated concentrations of intermediates reflect steady-state or perhaps even pre-steady-state conditions?



Liang Meng Wee

Nov 19, 2020, 11:31:26 AM
to COPASI User Forum
Here are the figures, new cps and raw data files.
Objective Function Value.png
Data_3C.txt
OBFn.png
Model1C.cps
Data_2C.txt
Flux.png
Data_1C.txt
Parameter Estimate.png