Hi Hyun,
I tried with the VCF that you provided in the previous comment, and my output is the following:
#SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT
CWGS102 ALL NA 1781659 50082987 28.11 0.00000 9629638.14 9629638.14 NA NA NA NA NA NA NA NA NA NA
This means that there is no contamination!
My next step was to use the VCF with all samples from the same population. Its header is shown below; it includes CWGS102, the sample whose contamination I want to analyse:
##fileformat=VCFv4.3
##fileDate=20180904
##source=PLINKv2.00
##contig=<ID=1,length=249210708>
##contig=<ID=2,length=243041384>
##contig=<ID=3,length=197872161>
##contig=<ID=4,length=190915651>
##contig=<ID=5,length=180698154>
##contig=<ID=6,length=170890385>
##contig=<ID=7,length=159122660>
##contig=<ID=8,length=146292682>
##contig=<ID=9,length=141090314>
##contig=<ID=10,length=135434304>
##contig=<ID=11,length=134945710>
##contig=<ID=12,length=133838354>
##contig=<ID=13,length=115106997>
##contig=<ID=14,length=107287664>
##contig=<ID=15,length=102398753>
##contig=<ID=16,length=90153347>
##contig=<ID=17,length=81151540>
##contig=<ID=18,length=77972067>
##contig=<ID=19,length=59094137>
##contig=<ID=20,length=62915232>
##contig=<ID=21,length=48081491>
##contig=<ID=22,length=51174332>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CWGS141 CWGS236 CWGS109 CWGS225 CWGS222 CWGS136 CWGS108 CWGS111 CWGS139 CWGS106 CWGS122 CWGS128 CWGS103 CWGS239 CWGS230 CWGS144 CWGS125 CWGS117 CWGS211 CWGS228 CWGS105 CWGS233 CWGS114 CWGS130 CWGS102 CWGS133
The result is this (best option):
#SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT
CWGS102 ALL CWGS102 682416 18974304 27.80 0.00090 4376731.90 4377641.31 NA NA 0.00111 3957000.50 3958168.31 NA NA 27.836 0.9973 0.9943
FREEMIX is 0.00090 and CHIPMIX is 0.00111, which means that there is no contamination!
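To make that reading explicit, here is a minimal Python sketch that pairs the header with the value row from the output above and compares FREEMIX and CHIPMIX against a cutoff. The 0.02 cutoff is an assumption on my side (a commonly quoted rule of thumb, not something stated in this thread), and the helper names `parse_selfsm` and `contamination_flags` are made up for illustration.

```python
# Header and value row copied from the output above (whitespace-separated).
header = (
    "#SEQ_ID RG CHIP_ID #SNPS #READS AVG_DP FREEMIX FREELK1 FREELK0 FREE_RH "
    "FREE_RA CHIPMIX CHIPLK1 CHIPLK0 CHIP_RH CHIP_RA DPREF RDPHET RDPALT"
).split()
row = (
    "CWGS102 ALL CWGS102 682416 18974304 27.80 0.00090 4376731.90 "
    "4377641.31 NA NA 0.00111 3957000.50 3958168.31 "
    "NA NA 27.836 0.9973 0.9943"
).split()

# Assumption: 0.02 is used here only as an illustrative cutoff.
THRESHOLD = 0.02


def parse_selfsm(header, row):
    """Pair each column name with its value; fail if the counts differ."""
    if len(header) != len(row):
        raise ValueError(f"{len(header)} columns but {len(row)} values")
    return dict(zip(header, row))


def contamination_flags(record, threshold=THRESHOLD):
    """Return True/False per estimate, or None when the value is NA."""
    flags = {}
    for key in ("FREEMIX", "CHIPMIX"):
        value = record[key]
        flags[key] = None if value == "NA" else float(value) >= threshold
    return flags


record = parse_selfsm(header, row)
flags = contamination_flags(record)
print(record["#SEQ_ID"], flags)
```

With the values above, neither estimate reaches the assumed cutoff, which is how I arrive at "no contamination".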
I have a question about IBD...
Comparing with individual CWGS102.. Optimal fIBD = 0.998891, LLK0 = 3958168.310553, LLK1 = 3957000.499242 for readgroup -1
Comparing with individual CWGS133.. Optimal fIBD = 0.000393, LLK0 = 6043278.471088, LLK1 = 4377244.245771 for readgroup -1
....
Why is the fIBD largest for CWGS102 (the sample that I am analysing)? What does it mean? And why do I need a genotype VCF with all samples from the population to analyse the contamination?
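One thing I noticed while comparing the numbers: for CWGS102, 1 - fIBD rounds to the reported CHIPMIX, and LLK1/LLK0 round to CHIPLK1/CHIPLK0. The small Python sketch below only checks that arithmetic on the two log lines I pasted; it is an observation on these values, not a claim about how the tool computes CHIPMIX internally. The regular expression is my own guess at the log layout.

```python
import re

# The two log lines pasted above.
log_lines = [
    "Comparing with individual CWGS102.. Optimal fIBD = 0.998891, "
    "LLK0 = 3958168.310553, LLK1 = 3957000.499242 for readgroup -1",
    "Comparing with individual CWGS133.. Optimal fIBD = 0.000393, "
    "LLK0 = 6043278.471088, LLK1 = 4377244.245771 for readgroup -1",
]

# Assumption: every comparison line follows this exact layout.
pattern = re.compile(
    r"Comparing with individual (\S+?)\.\. Optimal fIBD = ([0-9.]+), "
    r"LLK0 = ([0-9.]+), LLK1 = ([0-9.]+)"
)

results = {}
for line in log_lines:
    match = pattern.search(line)
    if match:
        sample, fibd, llk0, llk1 = match.groups()
        results[sample] = {
            "fIBD": float(fibd),
            "LLK0": float(llk0),
            "LLK1": float(llk1),
        }

# Sample with the largest fIBD among the parsed lines.
best = max(results, key=lambda name: results[name]["fIBD"])

# Values reported in the output table for CWGS102.
reported_chipmix = 0.00111
reported_chiplk1 = 3957000.50
reported_chiplk0 = 3958168.31

one_minus_fibd = round(1 - results[best]["fIBD"], 5)
print(best, one_minus_fibd)
print(round(results[best]["LLK1"], 2), round(results[best]["LLK0"], 2))
```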
Thanks a lot for your help! If something is wrong please tell me!
Jordi