Minimum Pihat is 0.5 for totally unrelated sample

bassu

unread,

Sep 26, 2017, 2:54:25 AM9/26/17

to plink2-users

Hi,

The Minimum pihat value I got is 0.5 for totally unrelated 5 WGS samples (variant called using GATK). Please do let me know where did i go wrong.

I used the following steps to do filter before running in plink(PLINK v1.90p 64-bit )

Step1: Each individual VCF is Filtered for biallelic snp using bcftools

bcftools view -m2 -M2 -v snps -O z -o Sample1_BIA.vcf.gz Sample1.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample2_BIA.vcf.gz Sample2.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample3_BIA.vcf.gz Sample3.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample4_BIA.vcf.gz Sample4.vcf.gz
bcftools view -m2 -M2 -v snps -O z -o Sample5_BIA.vcf.gz Sample5.vcf.gz

Step2:Merged individual VCF to single vcf using bcftools:

bcftools merge -m all --threads 10 -O z -o Merged.vcf.gz *.vcf.gz

Step3: Filtered the Merged VCF for nonbiallelic SNP

bcftools view -m2 -M2 -v snps -O z -o Merged_BIA.vcf.gz Merged.vcf.gz

Step4: Only considered those biallelic SNP which is present in all the samples (no missing data) with good DP and GQ (1,318,224 positions)

Plink commands

plink2 --vcf Filtered_100.vcf --double-id --recode --out ALL --make-bed --no-pheno --no-sex
plink2 --bfile ALL --genome --out ALL_ibs

Output :

FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO

Sample1 Sample1 Sample2 Sample2 UN NA 0.0000 0.9397 0.0603 0.5302 -1 0.812023 1.0000 NA

Sample1 Sample1 Sample3 Sample3 UN NA 0.0000 0.8510 0.1490 0.5745 -1 0.829754 1.0000 NA

Sample1 Sample1 Sample4 Sample4 UN NA 0.0000 0.8491 0.1509 0.5754 -1 0.830140 1.0000 NA

Sample1 Sample1 Sample5 Sample5 UN NA 0.0000 0.9369 0.0631 0.5315 -1 0.812576 1.0000 NA

Sample2 Sample2 Sample3 Sample3 UN NA 0.0000 0.9500 0.0500 0.5250 -1 0.809958 1.0000 NA

Sample2 Sample2 Sample4 Sample4 UN NA 0.0000 0.9395 0.0605 0.5302 -1 0.812054 1.0000 NA

Sample2 Sample2 Sample5 Sample5 UN NA 0.0000 0.8500 0.1500 0.5750 -1 0.829959 1.0000 NA

Sample3 Sample3 Sample4 Sample4 UN NA 0.0000 0.1666 0.8334 0.9167 -1 0.966667 1.0000 NA

Sample3 Sample3 Sample5 Sample5 UN NA 0.0000 0.9484 0.0516 0.5258 -1 0.810282 1.0000 NA

Sample4 Sample4 Sample5 Sample5 UN NA 0.0000 0.9446 0.0554 0.5277 -1 0.811044 1.0000 NA

I thad tried with maf and max-maf cutoff too but the result was similar.

Looking forward for a reply.

Christopher Chang

unread,

Sep 26, 2017, 12:02:14 PM9/26/17

to plink2-users

How are homozygous-reference calls being handled here? A common problem with VCF data is that homozygous-reference calls are not explicitly stored. If that's true here, your SNP set will be highly skewed. A bad solution is to include "--missing-to-ref" in the bcftools merge step; a better one is to start with gVCF files instead of VCF files and use the additional information to distinguish between homozygous-reference and missing calls.

Also, the PI_HAT estimate requires reasonably accurate population allele frequencies, which you'd need to load with --read-freq. Right now, it's just estimating allele frequencies from your 5 samples; this gives horrible results. An alternative method that doesn't require an allele frequency file is plink 2.0's --make-king/--make-king-table.

bassu

unread,

Sep 27, 2017, 7:42:59 AM9/27/17

to plink2-users

Hi ,

Thanks for the reply.

How are homozygous-reference calls being handled here? A common problem with VCF data is that homozygous-reference calls are not explicitly stored. If that's true here, your SNP set will be highly skewed. A bad solution is to include "--missing-to-ref" in the bcftools merge step; a better one is to start with gVCF files instead of VCF files and use the additional information to distinguish between homozygous-reference and missing calls.

t I am considering only those calls which are present across all samples(no missing data), i.e; I was considering only hetrozygous/homozygous-alt calls.

I am creating the gvcf files for the samples and will update on the same.

Also, the PI_HAT estimate requires reasonably accurate population allele frequencies, which you'd need to load with --read-freq. Right now, it's just estimating allele frequencies from your 5 samples; this gives horrible results. An alternative method that doesn't require an allele frequency file is plink 2.0's --make-king/--make-king-table.

Currently I am using PLINK v1.90p 64-bit and it seems it doesnot have --make-king/make--kingt-able options. I will install the plink2 and run the same

Reply all

Reply to author

Forward