Estimating Contamination - Targeted Panel


Robert Sicko

Jan 30, 2018, 4:03:15 PM
to verifyBamID
Hi,

We are trying to detect sample contamination in batches of targeted sequencing samples (15 per run). No comparison genotype data are available, but we want to detect whether any sample contains a mixture of more than one sample. We created a mock contaminated sample by spiking DNA from sample2 into sample1.

I ran verifyBamID (1.1.3) on each sample as follows:

./verifyBamID --vcf 180124-1000G_phase1.snps.high_confidence.hg19.intersected_w_scid.vcf --bam $sample_bam --out ${sample_bam}_verifybam --maxDepth 1000 --precise --ignoreRG
 
Results are shown below. I notice FREEMIX is >0.03 for all samples, but, as stated in the docs, the contam sample does have a large FREELK1-FREELK0 difference. I just don't know why FREEMIX is so large.

Thanks,
Bob

#SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT LK1-LK0
NA12878 ALL NA 190 105684 556.23 0.41738 12869.78 13016.96 NA NA NA NA NA NA NA NA NA NA -147.18
contam ALL NA 190 99082 521.48 0.11943 13757.01 16077.28 NA NA NA NA NA NA NA NA NA NA -2320.27
101 ALL NA 190 122038 642.31 0.1473 18395.67 18793.14 NA NA NA NA NA NA NA NA NA NA -397.47
102 ALL NA 190 128412 675.85 0.1189 14683.43 14782.45 NA NA NA NA NA NA NA NA NA NA -99.02
103 ALL NA 190 67040 352.84 0.20848 6258.31 7008.78 NA NA NA NA NA NA NA NA NA NA -750.47
104 ALL NA 190 125581 660.95 0.22077 14120.24 14687 NA NA NA NA NA NA NA NA NA NA -566.76
105 ALL NA 190 48454 255.02 0.2325 4889.09 5294.43 NA NA NA NA NA NA NA NA NA NA -405.34
106 ALL NA 190 105552 555.54 0.03397 12108.93 12490.17 NA NA NA NA NA NA NA NA NA NA -381.24
107 ALL NA 190 95481 502.53 0.19454 9507.44 10291.24 NA NA NA NA NA NA NA NA NA NA -783.8
107-2 ALL NA 190 100201 527.37 0.11967 10317.52 10649.87 NA NA NA NA NA NA NA NA NA NA -332.35
108 ALL NA 190 105646 556.03 0.40653 11863.23 11966.78 NA NA NA NA NA NA NA NA NA NA -103.55
109 ALL NA 190 107761 567.16 0.21798 12910.33 13361.75 NA NA NA NA NA NA NA NA NA NA -451.42
110 ALL NA 190 2727 14.35 0.24626 436.96 480.72 NA NA NA NA NA NA NA NA NA NA -43.76
111 ALL NA 190 118092 621.54 0.40362 11095.55 11545.68 NA NA NA NA NA NA NA NA NA NA -450.13
112 ALL NA 190 59090 311 0.13762 8884.5 8924.07 NA NA NA NA NA NA NA NA NA NA -39.57
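
For anyone comparing samples in output like the above, here is a minimal Python sketch that parses the whitespace-separated rows and ranks samples by FREELK1 minus FREELK0 (the same quantity as the last column of the table). The embedded text is a subset of the rows from this post with the NA columns dropped; the names `parse_selfsm` and `rank_by_lk_diff` are made up for illustration and are not part of verifyBamID.

```python
# Subset of the rows posted above (NA columns omitted for brevity).
SELFSM = """\
#SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0
NA12878 ALL NA 190 105684 556.23 0.41738 12869.78 13016.96
contam ALL NA 190 99082 521.48 0.11943 13757.01 16077.28
103 ALL NA 190 67040 352.84 0.20848 6258.31 7008.78
106 ALL NA 190 105552 555.54 0.03397 12108.93 12490.17
107 ALL NA 190 95481 502.53 0.19454 9507.44 10291.24
"""


def parse_selfsm(text):
    """Turn whitespace-separated rows into a list of dicts keyed by column name."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = [name.lstrip("#") for name in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


def rank_by_lk_diff(rows):
    """Return (sample, FREEMIX, FREELK1 - FREELK0), most negative difference first."""
    ranked = []
    for row in rows:
        diff = float(row["FREELK1"]) - float(row["FREELK0"])
        ranked.append((row["SEQ_ID"], float(row["FREEMIX"]), round(diff, 2)))
    return sorted(ranked, key=lambda item: item[2])


ranked = rank_by_lk_diff(parse_selfsm(SELFSM))
for sample, freemix, diff in ranked:
    print(f"{sample}\tFREEMIX={freemix}\tFREELK1-FREELK0={diff}")
```

On these rows the spiked sample sorts first, since its difference is the largest in magnitude, even though its FREEMIX is not the highest in the batch.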

Hyun Min Kang

Feb 4, 2018, 11:55:34 PM
to verif...@googlegroups.com
Bob, 190 SNPs may be too few to reliably estimate the contamination. Is there any chance you could include more variants?


Robert Sicko

Feb 9, 2018, 1:19:28 PM
to verifyBamID
Our panel only targets 39 genes and each sample only contains ~300 variants after filtering. I could try other reference VCFs if you think getting closer to 300 will make a difference (for the above example I intersected 1000G_phase1.snps.high_confidence with our ROI).

Thanks,
Bob

Hyun Min Kang

Feb 9, 2018, 6:16:40 PM
to verif...@googlegroups.com
When estimating FREEMIX, you can still use all 1000 Genomes variants within the target region, which may increase the marker density.

Hyun. 
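
One way to build such a site list is to keep every record of the full reference VCF that falls inside the panel's target intervals. The sketch below does this with the Python standard library only, assuming the targets are available as a BED file (0-based, half-open intervals) and that chromosome names match between the BED and the VCF (e.g. both `chr1` or both `1`). The function names and the toy records are hypothetical, not taken from the thread.

```python
from collections import defaultdict


def load_bed(lines):
    """Read BED intervals into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def in_target(regions, chrom, pos):
    """VCF POS is 1-based, so BED [start, end) covers positions start+1 .. end."""
    return any(start < pos <= end for start, end in regions.get(chrom, ()))


def filter_vcf(vcf_lines, regions):
    """Yield header lines unchanged and only those records inside a target interval."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        if in_target(regions, fields[0], int(fields[1])):
            yield line


# Toy data (made up): one 100 bp target and four sites around its edges.
bed = ["chr1\t100\t200"]
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.2",
    "chr1\t101\t.\tC\tT\t.\tPASS\tAF=0.4",
    "chr1\t200\t.\tG\tA\t.\tPASS\tAF=0.1",
    "chr1\t201\t.\tT\tC\t.\tPASS\tAF=0.3",
]
kept = list(filter_vcf(vcf, load_bed(bed)))
print("\n".join(kept))
```

The linear scan over intervals is fine for a panel of a few dozen genes; for real files you would read from disk line by line and write the kept lines to a new VCF, which can then be passed to `--vcf` as in the original command.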