pi hat values starting at 0.5 for IBD

1,102 views
Skip to first unread message

Deacon Sweeney

unread,
Oct 4, 2016, 1:14:22 PM10/4/16
to plink2-users
I'm having an issue when I run IBD with the --genome flag. My Z0 values are all at 0, and my Pi Hat values all start at 0.5.

Here is some example from my merged .tped file from 8 individuals:


6 chr6:152697706:C:T 0 152697706 0 0 C T 0 0 C T 0 0 C T 0 0 C T

7 chr7:4310365:C:T   0 4310365   0 0 T T 0 0 C T C T C T T T C T

7 chr7:13894276:A:G  0 13894276  A G 0 0 A G G G G G G G A G A G

7 chr7:137029838:T:C 0 137029838 T C T C T C T C 0 0 T C T C T C

8 chr8:28411072:T:C  0 28411072  T C T C T C T C T C T C C C T C


and here is some example from my .genome file:


 FID1 IID1 FID2 IID2 RT    EZ      Z0      Z1      Z2    PI_HAT PHE    DST       PPC     RATIO

   1   1   2    3    UN    NA    0.0000  1.0000  0.0000  0.5000  -1  0.733333  0.9959      NA

   1   1   3    3    UN    NA    0.0000  0.9111  0.0889  0.5445  -1  0.784314  0.9990      NA

   1   1   1    2    OT    0     0.0000  0.9183  0.0817  0.5409  -1  0.782609  0.9990      NA

   1   1   1    3    PO    0.5   0.0000  0.7572  0.2428  0.6214  -1  0.820755  0.9994      NA


There, 1 1 1 2 is the relationship between two parents, and 1 1 1 3 is the relationship between a parent and child.


This analysis uses just 92 SNPs. I have another analysis using 250k that gives similar results.


Thoughts? I'm not even sure what questions to ask because as far as I can tell this all looks right until the 0.0 for Z0 and 0.5 Pi-hats. Then again though this is my first plink project.


Thanks.

Jeff Staples

unread,
Oct 5, 2016, 7:59:41 AM10/5/16
to plink2-users
Under 100 SNPs is far too few for an IBD analysis. Please share the results from the 250 K SNP analysis.

The next issue is likely that you need to provide reference minor allele frequencies with the --read-freq option or merge these samples with a much larger dataset. Right now plink is inferring the population allele frequencies from the samples in your data set. Since your provided dataset appears to be very small and is expect to contain relatives, infer allele frequencies from it will give poor IBD estimates.

Jeff

Deacon Sweeney

unread,
Oct 5, 2016, 1:34:30 PM10/5/16
to plink2-users
Thanks Jeff.

The panel of 92 SNPs was designed specifically for the purpose of individual identification. See here:

http://link.springer.com/article/10.1007/s00439-009-0771-1

To merge my bed files for individual samples I use the following plink command, which includes the --read-freq option:

~/tools/plink-1.9/plink --bfile corrected_ids --read-freq plink.frq --extract pakstis_snps --make-bed --out cleaned


My frequences were generated on the same data set using:


~/tools/plink-1.9/plink --bfile corrected_ids --freq


Is that what you were referring to, or should I be calculating that from an orthogonal data set?



For the 250k SNP analysis, here is the equivalent set of lines from the genome file:


FID1 IID1 FID2 IID2 RT    EZ      Z0      Z1      Z2    PI_HAT  PHE   DST       PPC     RATIO

1    1    2    3    UN    NA    0.0000  1.0000  0.0000  0.5000  -1  0.762557  1.0000      NA

1    1    3    3    UN    NA    0.0000  1.0000  0.0000  0.5000  -1  0.764591  1.0000      NA

1    1    1    2    OT    0     0.0000  1.0000  0.0000  0.5000  -1  0.733975  1.0000      NA

1    1    1    3    PO    0.5   0.0000  0.9056  0.0944  0.5472  -1  0.833411  1.0000      NA


And here are some arbitrary lines from my 250k tped:

1 chr1:3303446:T:C 0 3303446 T C C C T C C C T C C C T C C C

1 chr1:3306855:T:G 0 3306855 T G G G T G G G T G G G G G G G

1 chr1:3311212:C:G 0 3311212 C G G G C G G G C G G G G G G G

1 chr1:3313282:G:A 0 3313282 G A G A G A G A G A A A G A A A

1 chr1:3315359:C:T 0 3315359 C T C T C T C T C T T T C T T T


Thoughts? I'll build a larger data set and post my results as it completes. Should the larger data be limited to unrelated individuals to generate minor allele frequencies?

Jeff Staples

unread,
Oct 6, 2016, 7:28:04 AM10/6/16
to plink2-users
The 92 SNP panel will work well for checking for 100% identity, but not for accurate genome wide sharing between 2 non-identical samples.

Yes, use an unrelated reference dataset of more that 50 individuals with a similar ancestral background to younsamples. You can either merge the reference with your data set or use the --freq on the reference and read it in the --read-freq for your dataset. Also restrict the analysis to common, high quality variants. I use --maf 0.1 and --geno 0.05.

Deacon Sweeney

unread,
Oct 6, 2016, 3:13:50 PM10/6/16
to plink2-users
That's great advice that I'll follow. I'm not sure it addresses my issue though, it looks like I'm doing something fundamentally wrong. 

I've rerun with 87 samples and am seeing the same issue, namely that my Z0 score is 0.0 for all samples, and my pi-hat starts at 0.5. 

 FID1 IID1 FID2 IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO

   1   1    2    3   UN    NA  0.0000  1.0000  0.0000  0.5000  -1  0.863452  1.0000      NA

   1   1    1    2   OT     0  0.0000  1.0000  0.0000  0.5000  -1  0.855480  1.0000      NA

   1   1    6    1   UN    NA  0.0000  1.0000  0.0000  0.5000  -1  0.856943  1.0000      NA

   1   1    4    3   UN    NA  0.0000  1.0000  0.0000  0.5000  -1  0.862032  1.0000      NA

   1   1    1    3   PO   0.5  0.0000  0.8077  0.1923  0.5961  -1  0.925370  1.0000      NA


So I'm going in with a probability of non-familial relatedness of zero. Will frequencies from a large independent data set will fix this?


I've tried using gvcftools break_blocks to reduce the size of my vcf, and to break vcf blocks into individual positions to see if that will fix the issue:

gzip -dc %s/Variations/%s.genome.vcf.gz | ~/tools/gvcftools-0.16/bin/break_blocks --region-file /volume/Coord_for_GRM.bed --ref /volume/hg19_whole_genome.fa --exclude-off-target --include-variants


It didn't. Then I normalize, left-shift, and re-label with bcftools:

    command += " | ~/tools/bcftools/bcftools norm -Ou -m -any"

    command += " | ~/tools/bcftools/bcftools norm -Ou -f /volume/hg19_whole_genome.fa"

    command += " | ~/tools/bcftools/bcftools annotate -Ob -x ID -I +"

    command += '%CHROM:%POS:%REF:%ALT'


and create ped/fam files with vcftools:


    command += " | ~/tools/vcftools_0.1.13/bin/vcftools --bcf - --plink --out %s/Variations/%s"%(d, pid)


Does that look right?

Thanks!

Reply all
Reply to author
Forward
0 new messages