Goal
My goal is to perform parameter estimate for the true rate constants that characterize step-wise transcription by RNA polymerase using 3 independent datasets.
Data
The data are extracted from the quantification of radiolabeled transcripts as a proxy for transcription by RNA polymerase. Transcription is monitored, with single nucleotide resolution, as an increased in transcript length with time. Each lane corresponds to a specific time point during which transcription is stopped. And for every lane, except for zero time, will have bands of varying size that are separately quantified and normalized to the total intensity of each lane. Different sized transcript bands per lane correspond to the positions of RNA polymerase on DNA and thus tracks transcription. In summary, my input raw data will have a column for time and 3 additional columns that tracks the normalized population of transcript that corresponds to RNA polymerase at the specific positions of E, ES, E2S and E3S (see model below). These positions are picked because they constitute the rate limiting steps in the overall scheme. Here, I have attached my raw data as 3 separate tab delimited files.
Model for fitting
The simplest model contains 3 rate limiting steps along the main pathway from E to ES, from ES to E2S and finally from E2S to E3S and include 1 off-pathway state into ESi from ES. All transitions are assumed to be reversible. Given that substrate is in excess, I assume that it exists purely as a constant and is not explicitly defined in the model. Hence, the pseudo first order model (Model 1) is depicted as follows,
Model 1:
Fitting results
I am using Levenberg-Marquardt method for the fit that gives me an objective value of 0.00388, a root mean square value of 0.0037 and a standard deviation of 0.0037. Of the 8 estimated values for the true rate constants, 6 have coefficient of variation that are less than 10% and 2 that consistently hover around 20% for good fits (ie. fitted lines run through most of the experimental data points). Even when I randomized the initial start values for all parameters without updating the model, several consecutive fitting runs converge to pretty much the same estimated values for all parameters indicating a pretty robust and reproducible fit for the specified kinetic scheme. I have included the Model 1.cps file along in this post.
Questions
1) I realize that a robust fit requires that I input an initial value for Ei that is greater than 0 (I am using a random value of 0.5) even though I know that this intermediate should not be present at zero timepoint. Is this going to be an issue to my fit results pertaining to the validity of the estimated rate constants?
2) Currently, I am looking at the coefficient of variation, standard deviation and sum of squares in the progress of fit curve as indicators for how good my fit is. Are there any other built-in statistical outputs that I should be aware of that can help me make the judgement call as to whether the parameter estimates are trustworthy?
3) In my bid to simplify the model further (to have the most parsimonious model without over parameterizing), I have tried making the main pathway steps irreversible in its entirety or for some steps. One of the model tested is as follows,
Irreversible model
In doing so, I notice that the modified model does not fit my data well. Furthermore, the estimated reverse rate constants using the all reversible model are comparable in magnitude to the forward rate constants for each step indicating that assuming irreversible steps are not valid. My question is: Can I conclude that based on the bad fit using an irreversible model plus the observations of comparable forward and reverse rates with the all reversible model justify sticking with Model 1?
4) I know that a kinetic model is written based on a priori information of the underlying biological mechanism. Assuming that I have no prior knowledge of the underlying mechanism, can I start with a model with all steps being reversible (Model 1) and if the estimated parameters indicate that for a particular step, the reverse rate is far smaller than a forward rate, I can then proceed to make that step irreversible?
5) Given prior information that RNA polymerase at E and at E2S positions can fall into an off-pathway states of Ei and E2Si respectively as indicated by the following model,
I proceed to introduce four additional rate constants describing these new transitions in this second model. To simplify this second model, I assume that k2f = k6f and that k2r = k6r by defining them as global quantities. The simplified model is now as follows,
Model 2
With these changes, I notice that the process of fitting is more challenging despite having only 2 additional parameters k2f and k2r. There is a tendency for the final estimated parameters to fluctuate given the same set of initial parameter values. Likewise, it is less likely for the fitted parameters to converge to the same values using random initial guesses and I suspect that the fitting gets trapped in numerous local minimal. Only with fine guesses for Ei, ESi and E2Si, can I achieve a good parameter estimates where their coefficient of variation are kept low (<10%) throughout. What is assuring is that with a good fit (low coefficient of variation for all parameters), the estimated values for the parameters have similar values to that obtained using Model 1. Can I assume that this is a consequence of over parameterizing with the more complicated model? i.e there are many more possible solutions but not necessary with a good fit. I have also included the Model 2.cps file for this second model here.
6) For my last question which is related to the previous question. By imposing constraints, I believe I should be able to limit the search space and reduce trappings in local minimal. Is this assumption correct?
Best,
Wee
1) I realize that a robust fit requires that I input an initial value for Ei that is greater than 0 (I am using a random value of 0.5) even though I know that this intermediate should not be present at zero timepoint. Is this going to be an issue to my fit results pertaining to the validity of the estimated rate constants?
2) Currently, I am looking at the coefficient of variation, standard deviation and sum of squares in the progress of fit curve as indicators for how good my fit is. Are there any other built-in statistical outputs that I should be aware of that can help me make the judgement call as to whether the parameter estimates are trustworthy?
3) In my bid to simplify the model further (to have the most parsimonious model without over parameterizing), I have tried making the main pathway steps irreversible in its entirety or for some steps. One of the model tested is as follows,
Irreversible model
In doing so, I notice that the modified model does not fit my data well. Furthermore, the estimated reverse rate constants using the all reversible model are comparable in magnitude to the forward rate constants for each step indicating that assuming irreversible steps are not valid. My question is: Can I conclude that based on the bad fit using an irreversible model plus the observations of comparable forward and reverse rates with the all reversible model justify sticking with Model 1?
4) I know that a kinetic model is written based on a priori information of the underlying biological mechanism. Assuming that I have no prior knowledge of the underlying mechanism, can I start with a model with all steps being reversible (Model 1) and if the estimated parameters indicate that for a particular step, the reverse rate is far smaller than a forward rate, I can then proceed to make that step irreversible?
5) Given prior information that RNA polymerase at E and at E2S positions can fall into an off-pathway states of Ei and E2Si respectively as indicated by the following model,
I proceed to introduce four additional rate constants describing these new transitions in this second model. To simplify this second model, I assume that k2f = k6f and that k2r = k6r by defining them as global quantities. The simplified model is now as follows,
Model 2
With these changes, I notice that the process of fitting is more challenging despite having only 2 additional parameters k2f and k2r. There is a tendency for the final estimated parameters to fluctuate given the same set of initial parameter values. Likewise, it is less likely for the fitted parameters to converge to the same values using random initial guesses and I suspect that the fitting gets trapped in numerous local minimal. Only with fine guesses for Ei, ESi and E2Si, can I achieve a good parameter estimates where their coefficient of variation are kept low (<10%) throughout. What is assuring is that with a good fit (low coefficient of variation for all parameters), the estimated values for the parameters have similar values to that obtained using Model 1. Can I assume that this is a consequence of over parameterizing with the more complicated model? i.e there are many more possible solutions but not necessary with a good fit. I have also included the Model 2.cps file for this second model here.
6) For my last question which is related to the previous question. By imposing constraints, I believe I should be able to limit the search space and reduce trappings in local minimal. Is this assumption correct?
-- Pedro Mendes, PhD Professor and Director, Richard D. Berlin Center for Cell Analysis and Modeling University of Connecticut School of Medicine group website: http://www.comp-sys-bio.org ICSB 2021 will take place in Hartford, CT, October 9-14, 2021 http://icsb2020.bioscience-ct.net
Hello Pedro,Thanks for your prompt reply.
On Nov 18, 2020, at 7:30 AM, Mendes,Pedro <pme...@uchc.edu> wrote:Hello Wee,
I looked at your model and data and I would say that you have a decent fit. I actually tried fitting with a lot more algorithms and never got a better objective value than 0.00388. The fit seems OK to me. You do have a few parameters that have large standard deviations and I think that is likely because of identifiability problems.Yes, you need to have a finite amount of E in the system, otherwise there is no reaction. You should know from the experiments how much is the total [E] (the total concentration of RNAPolymerase in the assay). If this is not a known parameter, then you need to either use an good guess or if you don't have one maybe a random value (but really should be at least in the correct order of magnitude). It could be possible to fit this value too, but most likely the data is not sufficient to estimate it (ie it would become unidentifiable).1) I realize that a robust fit requires that I input an initial value for Ei that is greater than 0 (I am using a random value of 0.5) even though I know that this intermediate should not be present at zero timepoint. Is this going to be an issue to my fit results pertaining to the validity of the estimated rate constants?
By not having the correct value of [E]_0 (the initial total concentration of E), the values of your estimated rate constants are not reliable in absolute terms. Note that the model really fits the concentration of the intermediates which appear with rates of [E]*k_on so if you have the wrong [E] the k_on you estimate will not be the correct one. (but because all k's are affected similarly, their relative values are trustworthy since the [E] cancels out.). The fit essentially gets [E]*k_on correctly but since you don't know the value of [E] then k_on is also not correct.
This is even a little more complex than I wrote above, because it is the total amount [E]+[ES]+[E2S]+[E3S]+[ESi] that corresponds to the total RNAPol that you have in the assay (because in the preparation you don't know how much of each form you have). Also what I wrote for [E]*k1_on will eventually be [ES]*k2_on etc.
Finally I also notice that you have [ESi]_0 = 0.5 which also affects the whole thing. Bear in mind that the system you have has a total mass conservation equation [E]+[ES]+[E2S]+[E3S]+[ESi]. You can check this under Tasks->Stoichiometric Analysis->Mass Conservation
That gives you the total RNAPol in your system.
In the model and raw data files that I posted earlier, the concentrations of my observables (E, ES, E2S and E3S) were presented as fractions of total enzyme concentration (unitless quantities). I had my initial E set as 0.64 while the rest (ES, ESi, E2S and E3S) were set to 0 at time. = 0. With these parameters, the fit wasn’t great. But, when I set an initial value for Ei = 0.5, the fit was decent. This is what puzzles me because in theory at time = 0, all except E should be 0 (ie.e Ei should be = 0). Initially, I thought this might reflect an incorrect kinetic scheme but the decent fit obtained with an initial Ei = 0.5 suggests otherwise. I like to get your advice on this.Fitting with estimated concentration of E (t = 0) instead of using normalized fractionTotal [enzyme] added ~ 50 nM. Using the raw data for E, plotted with time and extrapolated back to the y-axis, the normalized active [E] at initial time zero = 0.64 of total enzyme added. This translates to 0.64 x 50 nM = ~32 nM. This is what I used for the initial [E]. The rest of the enzyme was presumed to be inactive.In a new parameter estimates using initial [E] = 32 nM; Ei = 15 nM, I fed in raw data that lists the concentrations of the observables with time (by multiplying the fractions of species to the starting E concentration) so that my initial guess for E and the input raw data are represented in the same unit of nmole/L (nM). In this fit, the estimated parameters were similar (very close) to the fit when I had used the unitless parameter. I hope I am doing this correctly. I have attached here the updated Model1C.cps file and the raw data (Data_1C, Data_2C and Data_3C). My interpretation as to why it works in both cases is because I had removed the need for an absolute concentration of the enzyme by using normalized values in the former. In the later fit, both the estimates and raw data are multiplied by a constant, which is the final active enzyme concentration. Therefore, in either case, the resultant rate constants after fitting are similar.