Sequenza Parameters

921 views
Skip to first unread message

jane.me...@gmail.com

unread,
Feb 1, 2016, 4:02:40 AM2/1/16
to Sequenza User Group
Dear sequenza users,

I posted these questions on seqanwers (http://seqanswers.com/forums/showthread.php?p=188559#post188559) last week, with no answer at the moment. I can be luckier here.

I am using Sequenza only recently. After the python script, I ran the 3 functions sequenza.extract(), sequenza.fit() and sequenza.results() with default parameters. My data is a set of 5 tumor-normal WES pairs.
Now, I would like to adjust the parameters more precisely. The problem is that I do not clearly understand part of them, even after reading the vignette, the discussions from this google group and the description of the R functions.


Here are the parameters:
1) sequenza.extract
  • window=10^6
    "size of windows used when plotting mean and quartile ranges of depth ratios and B-allele frequencies. Smaller windows will take more time to compute"
    Is this parameter useful only for plotting?
    I changed it to 500 and I did not see changes when looking at the genome_view.pdf
  • overlap=1
    "integer specifying the number of overlapping windows"
    If we consider a specific window, it can overlap only 0, 1 or 2 window(s), right?
  • min.type.freq=0.9
    "minimum frequency of aberrant types"
    What does it mean?
  • weighted.mean=TRUE
    "boolean to select if the segments should be calculated using the read depth as weights to calculate depth ratio and B-allele frequency means"
    What does this mean?

2) sequenza.fit
  • N.ratio.filter=10
    "Threshold of minimum number of observation of depth ratio in a segment"
    Minimum number of variants in a segment?
  • N.BAF.filter=1
    "threshold of minimum number of observation of B-allele frequency in a segment"
    Minimum number of variants in a segment?
    Why is the default value not the same as for N.ratio.filter?
  • segment.filter=3 10^6
    "threshold segment length (in base pairs) to filter out short segments, that can cause noise when fitting the cellularity and ploidy parameters. The threshold will not affect the allele-specific segmentation"
    Is it the minimum length of a segment?
    What is the usual range? 3 10^6 seems big.
  • ratio.priority=FALSE
    "logical, if TRUE only the depth ratio will be used to determine the copy number state, while the Bf value will be used to determine the number of B-alleles"
    Does this mean that with FALSE, both depth ratio and BAF will be use to determine copy number and only BAF to determine the number of B-alleles?

My last concern is the gamma parameter, which is crucial for segmentation.
Since I have WES data, I chose breaks.method="full" and now I want to determine gamma.pcf and kmin.pcf.
From what I understood, in cancer data, the range varies often from 15 to 40.
I started with 40. From 140 to 40, copy number estimations did not change but 2 out 5 cellularity estimations changed: from 0.46 to 0.97 and from 0.35 to 0.26.
Do you have a way to determine gamma? Is it possible to use the gamma.plot function of copynumber package?

I am surprised to get very distant cellularities from what we believe. We are pretty sure to have high purity, most likely >90%, but sequenza returned range between 0.35 to 0.5 with default settings.


Sorry for the high number of questions and the long post! I do not know a lot sequenza and I am not an expert in segmentation.
Thank you in advance for your help.
Jane
Jane M is online now Report Post   Edit/Delete Message

Francesco Favero

unread,
Feb 4, 2016, 10:04:30 PM2/4/16
to Sequenza User Group

Hi Jane,

I'm sorry for taking so long to reply. I haven't been too much on seqanswers lately, and so all the sequenza users, apparently :).

I'll try to answer all you questions:

I am using Sequenza only recently. After the python script, I ran the 3 functions sequenza.extract(), sequenza.fit() and sequenza.results() with default parameters. My data is a set of 5 tumor-normal WES pairs.
Now, I would like to adjust the parameters more precisely. The problem is that I do not clearly understand part of them, even after reading the vignette, the discussions from this google group and the description of the R functions.


Here are the parameters:
1) sequenza.extract
  • window=10^6
    "size of windows used when plotting mean and quartile ranges of depth ratios and B-allele frequencies. Smaller windows will take more time to compute"
    Is this parameter useful only for plotting?
 Yes, the windows have no influence on the segmentation. In theory, I've added a way to segment the windows, with the breaks.method="fast", but it's not always working...
The idea of using windows is to reduce the data in a way to be more light to plot.
  • I changed it to 500 and I did not see changes when looking at the genome_view.pdf
  • overlap=1
    "integer specifying the number of overlapping windows"
    If we consider a specific window, it can overlap only 0, 1 or 2 window(s), right?
In theory you can have as many overlapping windows as you want, the constrains are the window size. In practice, you could keep a big windows size but increase the overlap to have a more windows..
  • min.type.freq=0.9
    "minimum frequency of aberrant types"
    What does it mean?
I'm sorry for this quite unhelpful explanation. Since in sequenza we don't aim to call for mutation I implemented some very basic filtering to remove some obvious noise. This one in particular compare the frequency of the aberrant bases. So if you have represented only one aberrant bases it will be 1, in other cases, due to mis-alignment or else, it is possible to have other calls... a practical example

ref C: 10 reads; other position score A: 5 reads G 0 reads T 1 reads
the type.freq would be 5/6(0.83) 0/6(0) and 1/6(0.17)

So in this case, none ot the call will be reported, unless you lower the min.freq.type to 0.8
  • weighted.mean=TRUE
    "boolean to select if the segments should be calculated using the read depth as weights to calculate depth ratio and B-allele frequency means"
    What does this mean?
Simply that it uses a weighted mean, positions with more reads have more weight in the calculation of the mean value for the segment (a segment value is an average of all the positions included in it)... In some case is better with the feature turned off in presence of bias (positions with huge amount of reads)  
2) sequenza.fit
  • N.ratio.filter=10
    "Threshold of minimum number of observation of depth ratio in a segment"
    Minimum number of variants in a segment?
No, just all possible positions, we are executing the segmentation on all the genomic data. Especially in exomes, there might be gaps, so a segment of 10Mb could virtually have just few data points at the two edges of the segment, with this parameter we filter out those segment. The fit step is particularly important to filter out unreliable segments. This will produce the estimation, we are not throwing out segments for downstream analysis.
  • N.BAF.filter=1
    "threshold of minimum number of observation of B-allele frequency in a segment"
    Minimum number of variants in a segment?
    Why is the default value not the same as for N.ratio.filter?
The segment is a bivariate data, the ratio, using any position in the genome, and the b-allele frequency, using the heterozygous position.

So a segment could have 100 "ratio" pos, but only 2 "heterozygous" pos.
I think you are confusing yourself with the variants in the segment in this case: the only variant that we consider for the segment are the germlines heterozygous, so position that are detected het already in the normal. 
  • segment.filter=3 10^6
    "threshold segment length (in base pairs) to filter out short segments, that can cause noise when fitting the cellularity and ploidy parameters. The threshold will not affect the allele-specific segmentation"
    Is it the minimum length of a segment?
    What is the usual range? 3 10^6 seems big.
This is to give to the model the "good" segments. In many case works without even filtering, in other you want to go up to bigger segments, 3 Mb is a compromise... 
  • ratio.priority=FALSE
    "logical, if TRUE only the depth ratio will be used to determine the copy number state, while the Bf value will be used to determine the number of B-alleles"
    Does this mean that with FALSE, both depth ratio and BAF will be use to determine copy number and only BAF to determine the number of B-alleles?
It means that if TRUE both will be used, if false the absoulute cn rely only on the depth ratio, and once estimated the cn you determine more likely allele copy number, given the baf. 

This is advisable if the BAF profile looks very noisy. In WES it could be the case, in WGS the heterozygous position are much more and more reliable, but in exome it depends a lot on the specific dataset.

My last concern is the gamma parameter, which is crucial for segmentation.
Since I have WES data, I chose breaks.method="full" and now I want to determine gamma.pcf and kmin.pcf.
From what I understood, in cancer data, the range varies often from 15 to 40.
I started with 40. From 140 to 40, copy number estimations did not change but 2 out 5 cellularity estimations changed: from 0.46 to 0.97 and from 0.35 to 0.26.
Do you have a way to determine gamma? Is it possible to use the gamma.plot function of copynumber package?
Well you can visually inspect the segmentation by looking at the chromosome view or genome view. Unfortunately there is no easy way to determine the perfect gamma kmin set

I am surprised to get very distant cellularities from what we believe. We are pretty sure to have high purity, most likely >90%, but sequenza returned range between 0.35 to 0.5 with default settings.

If you have cellularity bigger the 90% it should be clear upon visual inspection that the cellularity is higher then predicted, but the model clearly picked on some misplaced segment (this can happen if the genome does not have much aberrations, he model focus on some telomeric or centromeric regions that are generally badly aligned and noisy). In this case you could use the estimation using the mutation frequency, but it is not very well tested (you need to use only the trusted mutations....) 

Sorry for the high number of questions and the long post! I do not know a lot sequenza and I am not an expert in segmentation.

No problem, unfortunately segmentation is not optimal, if we manage to improve on that also the model fitting will have a much cleaner way of weight the data. 
Thank you in advance for your help.
Jane

Thanks for trying sequenza, and  you ar welcome to post any other problem you have in the future.

`Francesco

jane.me...@gmail.com

unread,
Feb 5, 2016, 8:26:23 AM2/5/16
to Sequenza User Group
Dear Franceso,

Thank you a lot for your complete answer.
I will take time to well understand everything and I will try to adjust more precisely the parameters based on your explanations.
Jane
Reply all
Reply to author
Forward
0 new messages