Parameter estimation is a common and crucial task in modeling, as many models depend on unknown parameters that need to be inferred from data. Various tools exist for tasks such as model development, model simulation, optimization, or uncertainty analysis, each with different capabilities and strengths. To combine tools in an interoperable manner, and to make results accessible and reusable for other researchers, it is valuable to define parameter estimation problems in a standardized form. Here, we introduce PEtab, a parameter estimation problem definition format that integrates with established systems biology standards for model and data specification. As the novel format is already supported by eight software tools with hundreds of users in total, we expect it to be of great use and impact in the community, both for modeling and algorithm development.
Copyright: 2021 Schmiester et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The authors confirm that all data underlying the findings are fully available without restriction. Specifications of PEtab, the PEtab Python library, links to examples, and all supporting software tools are available at -dev/PEtab; a snapshot is available as well. All original content is available under permissive licenses.
Dynamical modeling is central to systems biology, providing insights into the underlying mechanisms of complex phenomena [1]. It enables the integration of heterogeneous data, the testing and generation of hypotheses, and experimental design. However, to achieve this, the unknown model parameters commonly need to be inferred from experimental observations.
The Simulation Experiment Description Markup Language (SED-ML) builds on top of such model definitions and allows for a machine-readable description of simulation experiments based on XML [14]. More complex simulation experiments, such as parameter scans, can also be encoded, and a human-readable adaptation is provided by the phraSED-ML format [15]. Similarly, the XML-based Systems Biology Results Markup Language (SBRML) was designed to associate models with experimental data and share simulation experiment results in a machine-readable way [16]. Like SED-ML, SBRML can also be used for parameter scans. As a complement, SBtab is a set of table-based conventions for the definition of experimental data and models, designed for human readability and writability [17].
However, parameter estimation is so far not in the scope of any of the available formats, and important information for it, like the definition of a noise model, is missing. Parameter estimation toolboxes usually use their own specific input formats, making it difficult for the user to switch between tools to benefit from their complementary functionalities and hindering reusability and reproducibility.
Based on our experience with parameter estimation and tool development for systems biology, we developed PEtab, a tabular format for specifying parameter estimation problems. This includes the specification of biological models, observation and noise models, experimental data and their mapping to the observation model, as well as parameters in an unambiguous way.
The scope of PEtab is the full specification of parameter estimation problems in typical systems biology applications. In our experience, a typical setup of data-based modeling starts either with (i) the model of a biological system that is to be calibrated, or with (ii) experimental data that are to be integrated and analyzed using a computational model. Measurements are linked to the biological model by an observation and noise model. Often, measurements are taken after some perturbations have been applied, which are modeled as variations of a generic model (Fig 1A). Therefore, one goal was to specify such a setup in the least redundant way. Furthermore, we wanted to establish an intuitive, modular, machine- and human-readable and -writable format that makes use of existing standards.
(A) Example of a typical setup for data-based modeling. Usually, a model of a biological system is developed and calibrated based on measurements from perturbation experiments, which are linked to the biological model by an observation model. Different instances of a generic model are used to account for different perturbations or measurement setups. (B) Simplified illustration of how different entities from (A) map to different PEtab files (not all table columns are shown).
PEtab defines parameter estimation problems using a set of files that are outlined in Fig 2. A detailed specification of PEtab version 1 is provided in supplementary file S1 File, as well as at -dev/PEtab. Additionally, we created a tutorial illustrating how to set up a PEtab problem, covering the most common features (supplementary file S2 File). Further example problems can be found at -Initiative/Benchmark-Models-PEtab. The different files specify the biological model, the observation model, experimental conditions, measurements, parameters and visualizations (Fig 1B). These files are described in more detail in the following.
PEtab consists of a model in the SBML format and several tab-separated value (TSV) files to specify measurements and link them to the model. A visualization file can be provided optionally. A YAML file can be used to group the aforementioned files unambiguously.
Model (SBML): File specifying the biological process using the established and well-supported SBML format [11]. Any existing SBML model can be used without modification. PEtab supports all SBML versions; a given version can be used as long as the respective toolbox supports it.
Observables (TSV): File linking model properties such as state variables and parameter values to measurement data via observation functions and noise models. Various noise models including normal and Laplace distributions are supported, and noise model parameters can be estimated. Observables can be on linear or logarithmic scale.
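To illustrate, the following sketch constructs a minimal observables table using only the Python standard library. The column names follow the PEtab observables format described above; the entity names (`pErk`, `scaling_pErk`, `sd_pErk`) are purely illustrative and not taken from any specific model.

```python
import csv
import io

# Hypothetical observables table: each row links a model quantity to data
# via an observation function and a noise model. Identifiers are illustrative.
rows = [
    {
        "observableId": "obs_pErk",
        "observableFormula": "scaling_pErk * pErk",  # readout of model state pErk
        "observableTransformation": "log10",         # compare on logarithmic scale
        "noiseFormula": "sd_pErk",                   # width of the noise model
        "noiseDistribution": "normal",               # normally distributed noise
    },
]

# Serialize as a tab-separated value (TSV) file, as required by PEtab.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)
tsv = buf.getvalue()
print(tsv)
```

Because `sd_pErk` appears as a symbol in the noise formula rather than a fixed number, it can itself be declared an estimated parameter, which is how noise model parameters are estimated alongside the dynamic parameters.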
Measurements (TSV): File specifying and linking experimental data to the experimental conditions and the observables via the respective identifiers. Optionally, simulation conditions for pre-equilibration can be defined (Fig 3B). Parameters that are relevant for the observation process of a given measurement, such as offsets or scaling parameters, can be provided along with the measured values. This allows for overriding generic output parameters in a measurement-specific manner (Fig 3A).
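A measurement table can be sketched in the same way. The column names follow the PEtab measurement format; condition and parameter identifiers (`egf_high`, `scaling_pErk_exp1`, ...) are hypothetical. The `observableParameters` column shows the measurement-specific override of a generic output parameter mentioned above, and `preequilibrationConditionId` the optional pre-equilibration condition.

```python
import csv
import io

# Hypothetical measurement table rows (PEtab measurement format columns).
# Each row assigns one measured value to an observable, a condition, and a time.
rows = [
    {"observableId": "obs_pErk", "preequilibrationConditionId": "steady_state",
     "simulationConditionId": "egf_high", "measurement": "0.73", "time": "5",
     "observableParameters": "scaling_pErk_exp1", "noiseParameters": "0.1"},
    {"observableId": "obs_pErk", "preequilibrationConditionId": "steady_state",
     "simulationConditionId": "egf_low", "measurement": "0.21", "time": "5",
     "observableParameters": "scaling_pErk_exp2", "noiseParameters": "0.1"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)

# Reading the TSV back yields one record per measurement.
records = list(csv.DictReader(io.StringIO(buf.getvalue()), delimiter="\t"))
print(len(records))  # 2
```

Here the two measurements share the same observable but override its generic scaling parameter with experiment-specific values, avoiding the need to duplicate the observable definition.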
Parameters (TSV): File defining the parameters to be estimated, including lower and upper bounds as well as transformations (e.g., linear or logarithmic) to be used in parameter estimation. Furthermore, prior information on the parameters can be specified to inform starting points for parameter estimation, or to perform Bayesian inference.
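The effect of the parameter scale can be sketched as follows: optimizers operate on the transformed values within transformed bounds. The table rows use PEtab parameter format column names; the parameter identifiers and values are illustrative.

```python
import math

# Hypothetical parameter table rows (PEtab parameter format columns).
parameters = [
    {"parameterId": "k_on", "parameterScale": "log10",
     "lowerBound": 1e-5, "upperBound": 1e3, "nominalValue": 0.1, "estimate": 1},
    {"parameterId": "sd_pErk", "parameterScale": "log10",
     "lowerBound": 1e-3, "upperBound": 1e2, "nominalValue": 1.0, "estimate": 1},
]

def to_scale(value, scale):
    """Map a parameter value to the scale on which it is estimated."""
    if scale == "log10":
        return math.log10(value)
    if scale == "log":
        return math.log(value)
    return value  # 'lin': no transformation

# Starting point for estimation: transformed nominal values of all
# parameters flagged for estimation.
start = [to_scale(p["nominalValue"], p["parameterScale"])
         for p in parameters if p["estimate"]]
print(start)  # -> [-1.0, 0.0]
```

Estimating on a logarithmic scale is a common choice for rate constants spanning several orders of magnitude, which is why the scale is recorded per parameter rather than globally.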
Visualization (TSV): Optional visualization file specifying how to combine data and simulations for plotting. Different plots such as time-course or dose-response curves can be automatically created based on this file using the PEtab Python library described below. This allows, for example, quickly creating visualizations to inspect parameter estimation results. A default visualization file can be automatically generated.
PEtab problem file (YAML): File linking all of the above-mentioned PEtab files together. This allows combinations of, e.g., multiple models or measurement files into a single parameter estimation problem, as well as easy reuse of various files in different parameter estimation problems (e.g., for model selection). The current YAML version 1.2 is used here.
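A minimal problem file might look as follows. The file names are placeholders; the grouping of files under a problem entry is what allows, e.g., several measurement files to be combined into one estimation problem.

```yaml
# Hypothetical PEtab problem file; file names are placeholders.
format_version: 1
parameter_file: parameters.tsv
problems:
  - sbml_files:
      - model.xml
    condition_files:
      - conditions.tsv
    measurement_files:
      - measurements.tsv
    observable_files:
      - observables.tsv
```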
We designed PEtab to cover common features needed for parameter estimation. The TSV files comprise a set of mandatory columns. These provide all necessary information to define an objective function such as the χ2 or likelihood function. However, some methods tailored to specific problems require additional information to estimate the unknown parameters. To accommodate this, we allow for optional application-specific extensions in addition to the required columns in the PEtab files, e.g., if some parameters can be calculated analytically using hierarchical optimization approaches [18].
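For illustration, under independent, normally distributed measurement noise, the mandatory columns determine the standard negative log-likelihood and χ2 function (a textbook formulation, not wording from the PEtab specification): with measured values $y_i$, model outputs $\bar{y}_i(\theta)$, and noise widths $\sigma_i$,

```latex
% Negative log-likelihood for independent, normally distributed noise:
J(\theta) = \frac{1}{2} \sum_{i=1}^{N}
  \left[ \frac{\left(\bar{y}_i(\theta) - y_i\right)^2}{\sigma_i^2}
       + \log\!\left(2 \pi \sigma_i^2\right) \right],
\qquad
\chi^2(\theta) = \sum_{i=1}^{N}
  \frac{\left(\bar{y}_i(\theta) - y_i\right)^2}{\sigma_i^2}.
```

Here the $y_i$ come from the measurement table, the mapping to $\bar{y}_i(\theta)$ from the observables and condition tables, the $\sigma_i$ from the noise formulas, and the bounds and scales of $\theta$ from the parameter table.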
To facilitate easy usability, PEtab ( -dev/PEtab) comes with detailed documentation describing the specific format of each of the different files in a concise yet comprehensive manner. Additionally, we provide a Python-based library that can be used to read, modify, write, and validate existing PEtab problems. Furthermore, the PEtab library provides functionality to package PEtab files into COMBINE archives [19]. After parameter estimation, the modeler usually investigates how well the model fits the experimental data. To support this, the PEtab library provides various visualization routines to analyze data and parameter estimation results.
We implemented support for PEtab in currently eight systems biology toolboxes, namely COPASI [2], AMICI [6], pyPESTO [20], pyABC [21], Data2Dynamics [5], dMod [10], parPE [18], and MEIGO [4]. These toolboxes provide a broad range of distinct features for model creation, model simulation, parameter inference, and uncertainty quantification (Table 1). Combining different tools with complementary features is often desirable. However, in practice this was hitherto hampered by the substantial overhead of tedious and error-prone re-implementation of the parameter estimation problem in the specific format required by the respective tool. With all of these tools now supporting PEtab, a user can more easily combine different tools and make use of their specific strengths. For example, one can use COPASI for model creation and testing, AMICI for efficient simulation of large models, pyPESTO for multi-start local optimization and sampling, or MEIGO for global scatter searches, and Data2Dynamics or dMod for profiling. The ease of switching between tools also provides the opportunity to reproduce and verify results, e.g., by checking whether different tools yield similar results. Additionally, developers can compare the performance of newly developed methods with existing algorithms implemented in different toolboxes, independent of the programming language, to select the most appropriate one for a given setting.