Minimum Pihat is 0.5 for totally unrelated sample

310 views
Skip to first unread message

bassu

unread,
Sep 26, 2017, 2:54:25 AM9/26/17
to plink2-users
Hi,

The Minimum pihat value I got is 0.5 for totally unrelated 5 WGS samples (variant called using GATK). Please do let me know where did i go wrong. 

I used the following steps to do filter before running in plink(PLINK v1.90p 64-bit )

Step1: Each individual VCF is Filtered for biallelic snp using bcftools
bcftools view -m2 -M2 -v snps -O z -o Sample1_BIA.vcf.gz Sample1.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample2_BIA.vcf.gz Sample2.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample3_BIA.vcf.gz Sample3.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample4_BIA.vcf.gz Sample4.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample5_BIA.vcf.gz Sample5.vcf.gz


Step2:Merged individual VCF to single vcf using bcftools:
bcftools merge -m all --threads 10 -O z -o Merged.vcf.gz *.vcf.gz 

Step3: Filtered the Merged VCF for nonbiallelic SNP 
bcftools view -m2 -M2 -v snps -O z -o Merged_BIA.vcf.gz Merged.vcf.gz

Step4: Only considered those biallelic SNP  which is present in all the samples (no missing data) with good DP and GQ (1,318,224 positions)

Plink commands 

plink2 --vcf Filtered_100.vcf --double-id --recode --out ALL --make-bed --no-pheno --no-sex
plink2
--bfile ALL --genome --out ALL_ibs

Output :

              FID1               IID1               FID2               IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO
  Sample1  Sample1  Sample2  Sample2 UN    NA  0.0000  0.9397  0.0603  0.5302  -1  0.812023  1.0000      NA
  Sample1  Sample1  Sample3  Sample3 UN    NA  0.0000  0.8510  0.1490  0.5745  -1  0.829754  1.0000      NA
  Sample1  Sample1  Sample4  Sample4 UN    NA  0.0000  0.8491  0.1509  0.5754  -1  0.830140  1.0000      NA
  Sample1  Sample1  Sample5  Sample5 UN    NA  0.0000  0.9369  0.0631  0.5315  -1  0.812576  1.0000      NA
  Sample2  Sample2  Sample3  Sample3 UN    NA  0.0000  0.9500  0.0500  0.5250  -1  0.809958  1.0000      NA
  Sample2  Sample2  Sample4  Sample4 UN    NA  0.0000  0.9395  0.0605  0.5302  -1  0.812054  1.0000      NA
  Sample2  Sample2  Sample5  Sample5 UN    NA  0.0000  0.8500  0.1500  0.5750  -1  0.829959  1.0000      NA
  Sample3  Sample3  Sample4  Sample4 UN    NA  0.0000  0.1666  0.8334  0.9167  -1  0.966667  1.0000      NA
  Sample3  Sample3  Sample5  Sample5 UN    NA  0.0000  0.9484  0.0516  0.5258  -1  0.810282  1.0000      NA
  Sample4  Sample4  Sample5  Sample5 UN    NA  0.0000  0.9446  0.0554  0.5277  -1  0.811044  1.0000      NA

I thad tried with maf and max-maf cutoff too but the result was similar. 

Looking forward for a reply.
 

Christopher Chang

unread,
Sep 26, 2017, 12:02:14 PM9/26/17
to plink2-users
How are homozygous-reference calls being handled here?  A common problem with VCF data is that homozygous-reference calls are not explicitly stored.  If that's true here, your SNP set will be highly skewed.  A bad solution is to include "--missing-to-ref" in the bcftools merge step; a better one is to start with gVCF files instead of VCF files and use the additional information to distinguish between homozygous-reference and missing calls.

Also, the PI_HAT estimate requires reasonably accurate population allele frequencies, which you'd need to load with --read-freq.  Right now, it's just estimating allele frequencies from your 5 samples; this gives horrible results.  An alternative method that doesn't require an allele frequency file is plink 2.0's --make-king/--make-king-table.

bassu

unread,
Sep 27, 2017, 7:42:59 AM9/27/17
to plink2-users
Hi ,

Thanks for the reply. 

How are homozygous-reference calls being handled here?  A common problem with VCF data is that homozygous-reference calls are not explicitly stored.  If that's true here, your SNP set will be highly skewed.  A bad solution is to include "--missing-to-ref" in the bcftools merge step; a better one is to start with gVCF files instead of VCF files and use the additional information to distinguish between homozygous-reference and missing calls.
t I am considering only those calls which are present across all samples(no missing data), i.e; I was considering only hetrozygous/homozygous-alt calls. 

I am creating the gvcf files for the samples and will update on the same.


Also, the PI_HAT estimate requires reasonably accurate population allele frequencies, which you'd need to load with --read-freq.  Right now, it's just estimating allele frequencies from your 5 samples; this gives horrible results.  An alternative method that doesn't require an allele frequency file is plink 2.0's --make-king/--make-king-table.

Currently I am using PLINK v1.90p 64-bit  and it seems it doesnot have --make-king/make--kingt-able options. I will install the plink2 and run the same
Reply all
Reply to author
Forward
0 new messages