Datacard Software Download

The input to Combine, which defines the details of the analysis, is a plain ASCII file we will refer to as the datacard. This is true whether the analysis is a simple counting experiment or a shape analysis.

Following this, one declares the number of observables, imax, that are present in the model used to set limits / extract confidence intervals. The number of observables will typically be the number of channels in a counting experiment. The value * can be specified for imax, which tells Combine to determine the number of observables from the rest of the datacard. In order to better catch mistakes, it is recommended to explicitly specify the value.
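As an illustration only (the numbers are placeholders), the header of a one-channel datacard could read:

imax 1  number of channels
jmax 1  number of backgrounds
kmax 2  number of nuisance parameters

Here jmax and kmax are the analogous declarations for the number of background processes and of nuisance parameters; like imax, they can be set to *.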


After providing this information, the following lines describe what is observed in data: the number of events observed in each channel. The first line, starting with bin, defines the label used for each channel. In the example we have 1 channel, labelled 1, and in the following line, observation, the number of observed events is given: 0 in this example.
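Written out, the two lines for the example just described would look like:

bin         1
observation 0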


The expected shape can be parametric, or not. In the former case the parametric PDFs have to be given as input to the tool. In the latter case, for each channel, histograms have to be provided for the expected shape of each process. The data have to be provided as a histogram to perform a binned shape analysis, and as a RooDataSet to perform an unbinned shape analysis.


If using RooFit-based inputs (RooDataHists/RooDataSets/RooAbsPdfs) then you need to ensure you are using different RooRealVars as the observable in each category entering the statistical analysis. It is possible to use the same RooRealVar if the observable has the same range (and binning if using binned data) in each category, although in most cases it is simpler to avoid doing this.


As with the counting experiment, the total nominal rate of a given process must be identified in the rate line of the datacard. However, there are special options for shape-based analyses.
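For instance (assuming this matches your version of Combine), a rate of -1 tells Combine to take the expected yield of a process from the normalization of its input shape rather than from an explicit number. A sketch of the process block, with hypothetical labels:

bin       1        1
process   signal   background
process   0        1
rate      -1       -1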


The Combine tool can take as input histograms saved as TH1, as RooDataHist in a RooFit workspace (an example of how to create a RooFit workspace and save histograms is available on GitHub), or from a pandas dataframe (example).


In addition, user-defined keywords can be used. Any occurrence of $WORD in the datacard will be replaced by VALUE when the option --keyword-value WORD=VALUE is given. This option can be repeated multiple times for multiple keywords.
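As a hypothetical illustration (file name and keyword invented for the example), a datacard containing the line

shapes * * shapes_$ERA.root $CHANNEL/$PROCESS

could be run as

combine -M AsymptoticLimits datacard.txt --keyword-value ERA=2018

so that $ERA is replaced by 2018 when the card is parsed.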


Shape uncertainties can be taken into account by vertical interpolation of the histograms. The shapes (fraction of events \(f\) in each bin) are interpolated using a spline for shifts within \(\pm 1\sigma\), i.e. for nuisance parameter values \(|\nu| \leq 1\), and linearly extrapolated outside of that range.


The normalizations are interpolated linearly in log scale, just like we do for log-normal uncertainties. If the value in a given bin is negative for some value of \(\nu\), the value will be truncated at 0.
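Schematically, and assuming a symmetric normalization effect for simplicity, interpolating linearly in log scale means the yield behaves as \( n(\nu) = n_{0}\,\kappa^{\nu} \), where \(n_{0}\) is the nominal yield and \(\kappa\) is the ratio of the \(+1\sigma\) yield to the nominal one.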


For each shape uncertainty and process/channel affected by it, two additional input shapes have to be provided. These are obtained by shifting the parameter up and down by one standard deviation. When building the likelihood, each shape uncertainty is associated with a nuisance parameter constrained by a unit Gaussian distribution, which is used to interpolate or extrapolate using the specified histograms.


The effect can be "-" or 0 for no effect, 1 for the normal effect, and something different from 1 to test larger or smaller effects (in that case, the unit Gaussian is scaled by that factor before using it as the parameter for the interpolation).
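For instance, a hypothetical entry such as

jes   shape   0.5   1

would apply half of the nominal shift to the first process and the full shift to the second (the systematic name and the process ordering are purely illustrative).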


There are two options for the interpolation algorithm in the "shape" uncertainty. Specifying shape will result in an interpolation of the fraction of events in each bin. That is, the histograms are first normalized before interpolation. Specifying shapeN will instead base the interpolation on the logs of the fraction in each bin. For both shape and shapeN, the total normalization is interpolated using an asymmetric log-normal, so that the effects of the systematic on both the shape and the normalization are accounted for. A figure comparing the two algorithms for the example datacard can be found in the Combine documentation.


In this case there are two processes, signal and background, and two uncertainties affecting the background shape (alpha) and the signal shape (sigma). In the ROOT file, two histograms per systematic have to be provided: the shapes obtained, for the specific process, by shifting the parameter associated with the uncertainty up and down by one standard deviation, i.e. background_alphaUp and background_alphaDown, and signal_sigmaUp and signal_sigmaDown.
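A sketch of the corresponding systematics lines, assuming the process columns are ordered signal then background (the actual tutorial card may differ in details):

sigma   shape   1   -
alpha   shape   -   1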


If there is also an uncertainty that affects the shape, e.g. the jet energy scale, shape histograms for the jet energy scale shifted up and down by one sigma need to be included. This could be done by creating a folder for each process and writing a line like
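the following, where the file name is invented for illustration and the exact patterns depend on how the histograms are arranged in the input file:

shapes * * shapes.root $PROCESS/nominal $PROCESS/$SYSTEMATIC

The second pattern tells Combine where to find the shifted histograms; the Up/Down suffix is appended to the resolved name.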


If you have a nuisance parameter that has shape effects on some processes (using shape) and rate effects on other processes (using lnN), you should use a single line for the systematic uncertainty with shape?. This will tell Combine to first look for Up/Down systematic templates for that process and, if it doesn't find them, to interpret the number that you put for the process as a lnN instead.
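A hedged sketch (name and values invented): a line such as

lumi   shape?   1.02   1.02

would make Combine look for lumiUp/lumiDown templates for each process and, for any process where they are missing, interpret the corresponding number as a lnN instead.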


In some cases, it can be convenient to describe the expected signal and background shapes in terms of analytical functions, rather than templates. Typical examples are searches/measurements where the signal is apparent as a narrow peak over a smooth continuum background. In this context, uncertainties affecting the shapes of the signal and backgrounds can be implemented naturally as uncertainties in the parameters of those analytical functions. It is also possible to adopt an agnostic approach in which the parameters of the background model are left freely floating in the fit to the data, i.e. only requiring the background to be well described by a smooth function.


Technically, this is implemented by means of the RooFit package, which allows writing generic probability density functions, and saving them into ROOT files. The PDFs can be either taken from RooFit's standard library of functions (e.g. Gaussians, polynomials, ...) or hand-coded in C++, and combined together to form even more complex shapes.
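As a minimal, self-contained sketch (not the tutorial's actual script; the observable range, PDF choices and parameter values are assumptions), such a workspace could be built with PyROOT along these lines:

import ROOT

# Workspace holding the parametric model
w = ROOT.RooWorkspace("w", "w")

# Observable and parametric signal/background PDFs via the factory interface
w.factory("x[0,100]")
w.factory("Gaussian::sig(x,MH[30,0,100],sigma[3,1,10])")   # signal peaked at the hypothesis mass MH
w.factory("Exponential::bkg(x,tau[-0.05,-1,0])")           # smooth falling background

# A toy observed dataset named data_obs (in a real analysis this comes from data)
data = w.pdf("bkg").generate(ROOT.RooArgSet(w.var("x")), 1000)
data.SetName("data_obs")
getattr(w, "import")(data)

w.writeToFile("simple-shapes-parametric_input.root")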


In the datacard using templates, the column after the file name would have been the name of the histogram. For parametric analysis we need two names to identify the mapping, separated by a colon (:).


The first part identifies the name of the input RooWorkspace containing the PDF, and the second part the name of the RooAbsPdf inside it (or, for the observed data, the RooAbsData). It is possible to have multiple input workspaces, just as there can be multiple input ROOT files. You can use any of the usual RooFit pre-defined PDFs for your signal and background models.


If in your model you are using RooAddPdfs, in which the coefficients are not defined recursively, Combine will not interpret them correctly. You can add the option --X-rtd ADDNLL_RECURSIVE=0 to any Combine command in order to recover the correct interpretation, however we recommend that you instead re-define your PDF so that the coefficients are recursive (as described in the RooAddPdf documentation) and keep the total normalization (i.e. the extended term) as a separate object, as in the case of the tutorial datacard.


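In the tutorial datacard, the shapes lines map each process name to an object inside a workspace; the mapping is roughly of this form (the exact line in the tutorial card may differ):

shapes * * simple-shapes-parametric_input.root w:$PROCESS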
which indicates that the input file simple-shapes-parametric_input.root should contain an input workspace (w) with PDFs named sig and bkg, since these are the names of the two processes in the datacard. Additionally, we expect there to be a data set named data_obs. If we look at the contents of the workspace in data/tutorials/shapes/simple-shapes-parametric_input.root, this is indeed what we see.


In this datacard, the signal is parameterized in terms of the hypothesized mass (MH). Combine will use this variable instead of creating its own, and it will be set to the value passed with the -m option. For this reason, we should add the option -m 30 (or something else within the observable range) when running Combine. You will also see there is a variable named bkg_norm. This is used to normalize the background rate (see the section on Rate parameters below for details).
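For instance, a limit calculation on this card might be run as (the choice of method is only an example):

combine -M AsymptoticLimits simple-shapes-parametric.txt -m 30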


Combine will not accept RooExtendedPdfs as input. This is to alleviate a bug that led to improper treatment of the normalization when using multiple RooExtendedPdfs to describe a single process. You should instead use RooAbsPdfs and provide the rate as a separate object (see the Rate parameters section).


Uncertainties in the parameters of the signal and background PDFs are encoded with param lines of the form name param X Y: the parameter is assigned a Gaussian uncertainty of Y around its mean value X. One can change the mean value from 0 to 1 (or to any value, if one so chooses) if the parameter in question is multiplicative instead of additive.


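For example, the parametric tutorial card contains a line of the form

sigma   param   1.0   0.1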
meaning there is a parameter in the input workspace called sigma that should be constrained with a Gaussian centered at 1.0 with a width of 0.1. Note that the exact interpretation of these parameters is left to the user, since the signal PDF is constructed externally by you. All Combine knows is that 1.0 should be the most likely value and 0.1 is its 1σ uncertainty. Asymmetric uncertainties are written using the syntax -1σ/+1σ in the datacard, as is the case for lnN uncertainties.


A parameter can also be left freely floating across its given range, rather than Gaussian constrained, by declaring it with the flatParam directive. This is not strictly necessary in frequentist methods using profiled likelihoods, as Combine will still profile these nuisances when performing fits (as is the case for the simple-shapes-parametric.txt datacard).
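A hypothetical example, reusing the background slope from the sketch above (the parameter name is an assumption):

tau   flatParam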


RooFit uses the integral of the PDF, computed analytically (or numerically, but disregarding the binning), to normalize it, but computes the expected event yield in each bin by evaluating the PDF at the bin center. This means that if the variation of the PDF is sizeable within a bin, there is a mismatch between the sum of the event yields per bin and the PDF normalization, which can cause a bias in the fits. More specifically, the bias is present if the contribution of the second derivative integrated over the bin width is not negligible. For linear functions, an evaluation at the bin center is correct. There are two recommended ways to work around this issue.
