6 chr6:152697706:C:T 0 152697706 0 0 C T 0 0 C T 0 0 C T 0 0 C T
7 chr7:4310365:C:T 0 4310365 0 0 T T 0 0 C T C T C T T T C T
7 chr7:13894276:A:G 0 13894276 A G 0 0 A G G G G G G G A G A G
7 chr7:137029838:T:C 0 137029838 T C T C T C T C 0 0 T C T C T C
8 chr8:28411072:T:C 0 28411072 T C T C T C T C T C T C C C T C
and here is an example from my .genome file:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 1 2 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.733333 0.9959 NA
1 1 3 3 UN NA 0.0000 0.9111 0.0889 0.5445 -1 0.784314 0.9990 NA
1 1 1 2 OT 0 0.0000 0.9183 0.0817 0.5409 -1 0.782609 0.9990 NA
1 1 1 3 PO 0.5 0.0000 0.7572 0.2428 0.6214 -1 0.820755 0.9994 NA
There, the 1 1 1 2 pair is the relationship between the two parents, and 1 1 1 3 is the relationship between a parent and a child.
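(As I understand it, plink computes PI_HAT as Z2 + Z1/2, so e.g. the 1 1 3 3 row works out to 0.0889 + 0.9111/2 = 0.5445.)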
This analysis uses just 92 SNPs. I have another analysis using 250k SNPs that gives similar results.
Thoughts? I'm not even sure what questions to ask, because as far as I can tell this all looks right up until the 0.0 values for Z0 and the 0.5 PI_HATs. Then again, this is my first plink project.
Thanks.
The next issue is likely that you need to provide reference minor allele frequencies with the --read-freq option, or merge these samples with a much larger dataset. Right now plink is inferring the population allele frequencies from the samples in your dataset. Since your dataset appears to be very small and is expected to contain relatives, inferring allele frequencies from it will give poor IBD estimates.
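For example, the merge route would be something like this (just a sketch; "your_data" and "ref_panel" are placeholder prefixes for your fileset and a fileset of unrelated reference samples):
plink --bfile your_data --bmerge ref_panel --make-bed --out merged_with_ref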
Jeff
~/tools/plink-1.9/plink --bfile corrected_ids --read-freq plink.frq --extract pakstis_snps --make-bed --out cleaned
My frequencies were generated on the same data set using:
~/tools/plink-1.9/plink --bfile corrected_ids --freq
Is that what you were referring to, or should I be calculating that from an orthogonal data set?
For the 250k SNP analysis, here is the equivalent set of lines from the genome file:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 1 2 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.762557 1.0000 NA
1 1 3 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.764591 1.0000 NA
1 1 1 2 OT 0 0.0000 1.0000 0.0000 0.5000 -1 0.733975 1.0000 NA
1 1 1 3 PO 0.5 0.0000 0.9056 0.0944 0.5472 -1 0.833411 1.0000 NA
and here are some example variant lines from that dataset:
1 chr1:3303446:T:C 0 3303446 T C C C T C C C T C C C T C C C
1 chr1:3306855:T:G 0 3306855 T G G G T G G G T G G G G G G G
1 chr1:3311212:C:G 0 3311212 C G G G C G G G C G G G G G G G
1 chr1:3313282:G:A 0 3313282 G A G A G A G A G A A A G A A A
1 chr1:3315359:C:T 0 3315359 C T C T C T C T C T T T C T T T
Yes, use an unrelated reference dataset of more than 50 individuals with a similar ancestral background to your samples. You can either merge the reference with your dataset, or run --freq on the reference and read the result in with --read-freq for your dataset. Also restrict the analysis to common, high-quality variants; I use --maf 0.1 and --geno 0.05.
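In plink terms, that second route looks roughly like this ("ref_panel" again being a placeholder for the reference fileset, whose variant IDs need to match yours):
plink --bfile ref_panel --freq --out ref_freq
plink --bfile your_data --read-freq ref_freq.frq --maf 0.1 --geno 0.05 --genome --out ibd_check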
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
1 1 2 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.863452 1.0000 NA
1 1 1 2 OT 0 0.0000 1.0000 0.0000 0.5000 -1 0.855480 1.0000 NA
1 1 6 1 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.856943 1.0000 NA
1 1 4 3 UN NA 0.0000 1.0000 0.0000 0.5000 -1 0.862032 1.0000 NA
1 1 1 3 PO 0.5 0.0000 0.8077 0.1923 0.5961 -1 0.925370 1.0000 NA
command = "gzip -dc %s/Variations/%s.genome.vcf.gz | ~/tools/gvcftools-0.16/bin/break_blocks --region-file /volume/Coord_for_GRM.bed --ref /volume/hg19_whole_genome.fa --exclude-off-target --include-variants" % (d, pid)
command += " | ~/tools/bcftools/bcftools norm -Ou -m -any"
command += " | ~/tools/bcftools/bcftools norm -Ou -f /volume/hg19_whole_genome.fa"
command += " | ~/tools/bcftools/bcftools annotate -Ob -x ID -I +"
command += '%CHROM:%POS:%REF:%ALT'
and create ped/map files with vcftools:
command += " | ~/tools/vcftools_0.1.13/bin/vcftools --bcf - --plink --out %s/Variations/%s"%(d, pid)