Numbers of CHIPMIX and FREEMIX to contamination


Jordi Valls

Sep 4, 2018, 12:20:30 PM
to verifyBamID
Hi Hyun,

I have a question about interpreting the contamination values.

I have genotyping and WGS data for the same samples. It seems that my samples are contaminated, because my values exceed the documented thresholds ([CHIPMIX] >> 0.02 and/or [FREEMIX] >> 0.02). This is the output for one sample:

#SEQ_ID RG      CHIP_ID #SNPS   #READS  AVG_DP  FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF   RDPHET  RDPALT
CWGS102 ALL     CWGS102 287554  8038205 27.95   0.49992 3826715.71      3834431.88      NA      NA      0.07153 3705451.26      3707852.91      NA      NA      -nan    -nan    -nan
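
For reference, the threshold check described above can be sketched in Python. This is only a minimal illustration: the helper names (`parse_selfsm`, `flag_contamination`) are hypothetical and not part of verifyBamID, and treating 0.02 as a hard cutoff is a simplification of the ">> 0.02" guidance.

```python
# Hypothetical helpers, not part of verifyBamID.
# Assumption: 0.02 is used as a hard cutoff, which simplifies ">> 0.02".
THRESHOLD = 0.02

HEADER = ("#SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 "
          "FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT")
ROW = ("CWGS102 ALL CWGS102 287554 8038205 27.95 0.49992 3826715.71 3834431.88 "
       "NA NA 0.07153 3705451.26 3707852.91 NA NA -nan -nan -nan")


def parse_selfsm(header_line, data_line):
    """Map column names to raw string values for one whitespace-separated row."""
    names = header_line.lstrip("#").split()
    values = data_line.split()
    return dict(zip(names, values))


def to_float(text):
    """Return a float, or None for NA / nan placeholders."""
    if text.upper() in ("NA", "NAN", "-NAN"):
        return None
    return float(text)


def flag_contamination(row, threshold=THRESHOLD):
    """Return {'FREEMIX': bool, 'CHIPMIX': bool}; missing values are not flagged."""
    flags = {}
    for key in ("FREEMIX", "CHIPMIX"):
        value = to_float(row[key])
        flags[key] = value is not None and value > threshold
    return flags


if __name__ == "__main__":
    print(flag_contamination(parse_selfsm(HEADER, ROW)))
```

Both values in the row above exceed the cutoff, so both would be flagged by this sketch.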

I don't understand these results, because when I called variants from the BAM file with GATK HaplotypeCaller, I found all the mutations with the same genotypes as on the microarray. But I get the output above when I run the tool like this:

verifyBamID --vcf CWGS102_.vcf --bam CWGS102.bam --out CWGS102_precise_ok_ --verbose --ignoreRG --best --maxDepth 30 --precise

(The coverage of the BAMs is 30x, which is why I use the --precise option.)

I did not use a multi-sample VCF from the same population to analyse the contamination. Could this be the problem? I hope that my samples are not contaminated: the HaplotypeCaller output and the microarray genotypes agree, so I think my BAM file is well constructed, otherwise I would not get the same results as the microarray data. Can you give me some idea about this? I aligned the reads to the hs37d5 reference genome. Could that be the problem? I marked duplicates and recalibrated base qualities as the manual indicates.

Thanks for your help.

Jordi

Hyun Min Kang

Sep 4, 2018, 12:29:03 PM
to verif...@googlegroups.com
Can you run verifyBamID with the Omni VCF file provided at http://csg.sph.umich.edu/kang/verifyBamID/download/ and see whether you get similar FREEMIX results?

Thanks,
Hyun.


Jordi Valls

Sep 5, 2018, 12:54:51 PM
to verifyBamID
Hi Hyun,
I tried the VCF you pointed me to in your previous message, and my output is as follows:


#SEQ_ID RG      CHIP_ID #SNPS   #READS  AVG_DP  FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF   RDPHET  RDPALT
CWGS102 ALL     NA      1781659 50082987        28.11   0.00000 9629638.14      9629638.14      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA
This means that there is no contamination!

My next step was to use a VCF with all the samples from the same population. Below is the VCF header; it includes CWGS102, the sample I want to check for contamination:
##fileformat=VCFv4.3
##fileDate=20180904
##source=PLINKv2.00
##contig=<ID=1,length=249210708>
##contig=<ID=2,length=243041384>
##contig=<ID=3,length=197872161>
##contig=<ID=4,length=190915651>
##contig=<ID=5,length=180698154>
##contig=<ID=6,length=170890385>
##contig=<ID=7,length=159122660>
##contig=<ID=8,length=146292682>
##contig=<ID=9,length=141090314>
##contig=<ID=10,length=135434304>
##contig=<ID=11,length=134945710>
##contig=<ID=12,length=133838354>
##contig=<ID=13,length=115106997>
##contig=<ID=14,length=107287664>
##contig=<ID=15,length=102398753>
##contig=<ID=16,length=90153347>
##contig=<ID=17,length=81151540>
##contig=<ID=18,length=77972067>
##contig=<ID=19,length=59094137>
##contig=<ID=20,length=62915232>
##contig=<ID=21,length=48081491>
##contig=<ID=22,length=51174332>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  CWGS141 CWGS236 CWGS109 CWGS225 CWGS222 CWGS136 CWGS108 CWGS111 CWGS139 CWGS106 CWGS122 CWGS128 CWGS103 CWGS239 CWGS230 CWGS144 CWGS125 CWGS117 CWGS211 CWGS228 CWGS105 CWGS233 CWGS114 CWGS130 CWGS102 CWGS133

This is the result (with the --best option):

#SEQ_ID RG      CHIP_ID #SNPS   #READS  AVG_DP  FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF   RDPHET  RDPALT
CWGS102 ALL     CWGS102 682416  18974304        27.80   0.00090 4376731.90      4377641.31      NA      NA      0.00111 3957000.50      3958168.31      NA      NA      27.836  0.9973  0.9943

FREEMIX is 0.00090 and CHIPMIX is 0.00111, which means that there is no contamination!
I also have a question about IBD:

Comparing with individual CWGS102.. Optimal fIBD = 0.998891, LLK0 = 3958168.310553, LLK1 = 3957000.499242 for readgroup -1
Comparing with individual CWGS133.. Optimal fIBD = 0.000393, LLK0 = 6043278.471088, LLK1 = 4377244.245771 for readgroup -1
....

Why is fIBD highest for CWGS102 (the sample I am analysing)? What does it mean? And why do I need all the samples from the population in the genotyping VCF to analyse the contamination?

Thanks a lot for your help! If something is wrong, please tell me!

Jordi

Hyun Min Kang

Sep 5, 2018, 3:02:26 PM
to verif...@googlegroups.com
I think the reason might be that you used a VCF with a single sample, in which case the allele frequency estimate is very poor. If you annotate your VCF with an AF INFO field, you may get a reliable estimate. If you use the --self option, it will still use the whole VCF, but will compare with only the individual with the matching ID.
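
One way to add an AF INFO field, sketched in Python under some assumptions: the VCF is tab-delimited with a GT entry in FORMAT, sites are biallelic, and the frequency is simply counted from the genotype columns of the multi-sample VCF. The names `annotate_af` and `alt_allele_frequency` are hypothetical helpers, not part of verifyBamID, and frequencies taken from a larger reference panel would usually be preferable to counts from a few dozen samples.

```python
import re

# Hypothetical helper code, not part of verifyBamID. Assumes biallelic sites.
AF_HEADER = ('##INFO=<ID=AF,Number=A,Type=Float,'
             'Description="Allele frequency computed from GT fields">')


def alt_allele_frequency(format_field, sample_fields):
    """Fraction of called alleles equal to ALT ('1'); None if nothing is called."""
    gt_index = format_field.split(":").index("GT")
    alt = called = 0
    for sample in sample_fields:
        genotype = sample.split(":")[gt_index]
        for allele in re.split(r"[/|]", genotype):
            if allele == ".":  # missing call
                continue
            called += 1
            if allele == "1":
                alt += 1
    return alt / called if called else None


def annotate_af(lines):
    """Yield VCF lines with an AF entry appended to the INFO column."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            yield line
        elif line.startswith("#CHROM"):
            yield AF_HEADER  # declare AF just before the column header
            yield line
        else:
            fields = line.split("\t")
            af = alt_allele_frequency(fields[8], fields[9:])
            if af is not None:
                entry = f"AF={af:.4f}"
                fields[7] = entry if fields[7] == "." else f"{fields[7]};{entry}"
            yield "\t".join(fields)
```

Passing an open file handle to `annotate_af` and writing the yielded lines to a new file would give an annotated copy of the VCF.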

fIBD is 1 - CHIPMIX, so the higher the better.
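
The relation can be checked against the numbers posted earlier in the thread with a short Python sketch. The log-line pattern is inferred from the two lines quoted above, `parse_fibd` is a hypothetical helper, and picking the individual with the highest fIBD is an assumption made for illustration, not necessarily the exact criterion the tool applies with --best.

```python
import re

# Hypothetical helper; the pattern is inferred from the log lines quoted above.
LOG_PATTERN = re.compile(r"Comparing with individual (\S+?)\.\. Optimal fIBD = ([0-9.]+)")

LOG_LINES = [
    "Comparing with individual CWGS102.. Optimal fIBD = 0.998891, "
    "LLK0 = 3958168.310553, LLK1 = 3957000.499242 for readgroup -1",
    "Comparing with individual CWGS133.. Optimal fIBD = 0.000393, "
    "LLK0 = 6043278.471088, LLK1 = 4377244.245771 for readgroup -1",
]


def parse_fibd(lines):
    """Return {individual: fIBD} for every matching log line."""
    result = {}
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            result[match.group(1)] = float(match.group(2))
    return result


fibd = parse_fibd(LOG_LINES)
# Assumption for this sketch: the best match is the individual with the highest fIBD.
best = max(fibd, key=fibd.get)
chipmix = 1.0 - fibd[best]

if __name__ == "__main__":
    print(best, round(chipmix, 5))
```

For CWGS102 this gives 1 - 0.998891 = 0.001109, which matches the reported CHIPMIX of 0.00111 after rounding.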

Hyun.
