low variant call amount for snp2seqz

Sebastian Hollizeck

unread,

Jan 28, 2020, 2:38:21 AM1/28/20

to Sequenza User Group

Hey,

I would love to use the already called variants (by strelka2) to generate the seqz file.

however i am unsure if i am doing something wrong, as the log messages are not really how i would expect them to be.

ext <- sequenza.extract("/Volumes/dawson_genomics/Projects/CASCADE/CA82/analysis/CA82-1/sequenza/CA82-1.seqz.gz", chromosome.list=paste0("chr",c(1:22,"X","Y")), assembly='hg38', parallel=10)
Collecting GC information  done

Processing chr1:
   1 variant calls.
   2 copy-number segments.
   26 heterozygous positions.
   2663 homozygous positions.
Processing chr2:
   1 variant calls.
   2 copy-number segments.
   23 heterozygous positions.
   2195 homozygous positions.
Processing chr3:
   1 variant calls.
   2 copy-number segments.
   20 heterozygous positions.
   1682 homozygous positions.
Processing chr4:
   1 variant calls.
   2 copy-number segments.
   13 heterozygous positions.
   1825 homozygous positions.
...

I am just worried with the 1 variant calls, which surprises me, as the vcf i started out with has about 40K variants.
I used to do the analysis from the original bams, but the amount of time it took to calculate the results as well as the inability of bam2seqz to read crams makes me want to switch to the precomputed variants.

And while i am at it: When using snp2seqz, am I mean to do the binning as well afterwards? it seems to me, that that would be unnecessary for the snp based approach, but I just wanted to hear what other people think.

Cheers,

Sebastian

Francesco Favero

unread,

Jan 28, 2020, 9:22:26 AM1/28/20

to Sebastian Hollizeck, Sequenza User Group

Hi Sebastian,

Which file are you using from strelka2? the germline or the somatic?

The correct approach (or at least what I think it is) is to use snp2seqz for germline and somatic VCFs separately.

And then merge them using seuenza-utils seqz_merge -1 seqz_from_somatic.seqz.gz -2 seqz_from_germline.seqz.gz ….

The primary goal of binning is to reduce the computation requirements by reducing the size of the seqz.
The resulting file from snp2seqz should be already order-of-magnitude smaller then the average seqz from bam/pileup, so you are right (no binning necessary).

Thanks for testing the snp2seqz program, it’s in kind of a beta feature. I’ve used it quite a lot but I haven’t consolidated the documentation and the API yet.

Best

Francesco

> --
> You received this message because you are subscribed to the Google Groups "Sequenza User Group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sequenza-user-g...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/sequenza-user-group/e9318372-ae29-4bbe-bfd6-c7bb0865c99a%40googlegroups.com.

Sebastian Hollizeck

unread,

Jan 28, 2020, 4:52:53 PM1/28/20

to Sequenza User Group

Hey Francesco,

so far I was only using the somatic snvs vcf from strelka.

So what you are saying is I call the germline variants in both the tumour and the normal samples as well as the somatic variant calling (basically the difference) and then combine the results?

That seems reasonable to me, so i am happy to try that.

It would be amazing to not have to analyse the results from the pileup, as this is such a lengthy process in high depth WGS

Cheers,

Sebastian

> To unsubscribe from this group and stop receiving emails from it, send an email to sequenza-user-group+unsub...@googlegroups.com.

Francesco Favero

unread,

Jan 29, 2020, 7:30:12 AM1/29/20

to Sebastian Hollizeck, Sequenza User Group

Hi again,

however, it seems that the number of variant is a bit too low.

Can you check if the normal/tumor order (command line vs the columns in the VCF) is set properly?

The vcf parsing do a bit of rethinking by looking at the variants genotypes, so it’s not taking directly all the PASS calls

Cheers

Francesco

On 28 Jan 2020, at 08.38, Sebastian Hollizeck <sebastian...@gmail.com> wrote:

--
You received this message because you are subscribed to the Google Groups "Sequenza User Group" group.

To unsubscribe from this group and stop receiving emails from it, send an email to sequenza-user-g...@googlegroups.com.

Sebastian Hollizeck

unread,

Jan 29, 2020, 8:33:11 PM1/29/20

to Sequenza User Group

Hey Francesco,

I tried the approach you said, by calling germline variants for tumor and normal jointly and supplying that to snp2seqz and then also supplying the somatic vcfs and combine the seqz

This will then make a much bigger seqz with about 5Mio lines

when I then want to use the R analysis it fails


library(sequenza)
ext <- sequenza.extract('/dawson_genomics/Projects/CASCADE/CA82/analysis/CA82-4/sequenza/CA82.4.seqz.gz', kmin=10, gamma=80, assembly='hg38')

Collecting GC information ......... done

Processing chr1:
Error in data.frame(base.count = as.integer(n.base.mut), maj.base.freq = as.numeric(max.freqs[,  : 
  arguments imply differing number of rows: 976, 909
Calls: sequenza.extract -> mutation.table -> mut.fractions -> data.frame
Execution halted
In addition: There were 50 or more warnings (use warnings() to see the first 50)

If I then use warnings()

> warnings()
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion
5: In FUN(X[[i]], ...) : NAs introduced by coercion
6: In FUN(X[[i]], ...) : NAs introduced by coercion
7: In FUN(X[[i]], ...) : NAs introduced by coercion
8: In FUN(X[[i]], ...) : NAs introduced by coercion
9: In FUN(X[[i]], ...) : NAs introduced by coercion
...

Is there anything i need to do differently?

Here is an extract of the seqz.gz file

zcat /dawson_genomics/Projects/CASCADE/CA82/analysis/CA82-4/sequenza/CA82.4.seqz.gz | head

chromosome position base.ref depth.normal depth.tumor depth.ratio Af Bf zygosity.normal GC.percent good.reads AB.normal AB.tumor tumor.strand

chr1 10177 A 90 148 1.644 0.574 0.426 het 46 148 AAC . 0

chr1 10178 CCTAA 89 163 1.831 0.748 0.252 het 46 163 CCTAAC . 0

chr1 10231 C 20 33 1.650 0.848 0.152 het 50 33 CA . 0

chr1 10247 TAAACCCTA 89 127 1.427 0.756 0 hom 50 127 TAAACCCTA . 0

chr1 10329 AC 131 223 1.702 0.839 0 hom 60 223 AC . 0

chr1 10333 C 42 70 1.667 0.768 0 hom 60 69 C . 0

chr1 10349 CCCTA 80 127 1.587 0.583 0.417 het 60 127 CCCTAC . 0

chr1 10439 AC 44 67 1.523 0.806 0 hom 50 67 AC . 0

chr1 10446 A 35 61 1.743 0.836 0 hom 50 61 A . 0

Francesco Favero

unread,

Jan 30, 2020, 3:14:49 AM1/30/20

to Sebastian Hollizeck, Sequenza User Group

Hi Sebastian,

Sorry, I haven’t mentioned that you may also need to remove indels/mnv.

Corrently the seqz file supports only single nucleotides.

I need to either add support for indels in the R package or remove the variants with ref/alt other then the 4 single nucleotides.

I think the last approach would be the faster to implement.

A side note; if you want to try a native hg38 support in sequenza, you could try a the “chemins” branch of the R package

https://bitbucket.org/sequenzatools/sequenza/src/chemins/

devtools::install_bitbucket(“sequenzatools/sequenza@chemins”)

|t uses a very simple segmentation approach (dumb) but on my test is not worst then aspfc used from copynumber (which seems unmaintained from many years now)

the command line changes slightly, as there are no more gamma/kmin arguments, but instead there are slide_win and peak_win parameter.

The first (slide_win) doesn’t have much effect on the segmentation (determine the size in data points of a sliding window).

Instead, the second parameter (peak_win) does: smaller values more segments, higher values bigger segments.

I have many changes that I hope to merge soon, which includes the automatic per chromosome selection of the peak_win value, but it’s not impossible to fiddle with it manually :)

Best

Francesco

On 30 Jan 2020, at 02.33, Sebastian Hollizeck <sebastian...@gmail.com> wrote:

Hey Francesco,

I tried the approach you said, by calling germline variants for tumor and normal jointly and supplying that to snp2seqz and then also supplying the somatic vcfs and combine the seqz

This will then make a much bigger seqz with about 5Mio lines

when I then want to use the R analysis it fails

library(sequenza) ext <- sequenza.extract('/dawson_genomics/Projects/CASCADE/CA82/analysis/CA82-4/sequenza/CA82.4.seqz.gz',kmin=10, gamma=80, assembly='hg38')

--
You received this message because you are subscribed to the Google Groups "Sequenza User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sequenza-user-g...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/sequenza-user-group/f7269b0a-37cd-4d9b-90ec-047d8c4d519c%40googlegroups.com.

Sebastian Hollizeck

unread,

Jan 30, 2020, 9:09:47 PM1/30/20

to Sequenza User Group

Hi Franceso,

i figured that out through a bit of debugging and i restricted the input to snps (maybe something that the snp2seqz program could take care of as well?)

I also found the reason of the low variant calls found.

it is because min.fw.freq is set to 0 default but the check only checks if the found frequency is greater not greater equals. snp2seqz, at least for the strelka case does not populate that field though so all of the entries are 0 and fail the filter

From: https://bitbucket.org/sequenzatools/sequenza/src/master/R/mutations.R

if (!is.na(min.fw.freq)) {
     fw.2 <- 1 - min.fw.freq
     fw.2 <- sort(c(fw.2, min.fw.freq))
     fw.filt <- mu.fracts$fw.freq > fw.2[1] &
     mu.fracts$fw.freq < fw.2[2]
     mufreq.filt <- mufreq.filt & type.filt & prop.filt & fw.filt
 } else {
     mufreq.filt <- mufreq.filt & type.filt & prop.filt
 }

if you either set the default to NA or change the test to >= and <= respectively, it works how i would expect it. (the greater and smaller equals is the better change in my opinion.
Then my output is

Collecting GC information ....... done
Processing chr1:
   956 variant calls.
   243 copy-number segments.
   181664 heterozygous positions.
   157571 homozygous positions.

Thank you very much for all your help in this issue 

To unsubscribe from this group and stop receiving emails from it, send an email to sequenza-user-group+unsub...@googlegroups.com.

Reply all

Reply to author

Forward