Calculating relatedness

Sal

Oct 9, 2014, 2:09:19 PM
to plink2...@googlegroups.com
Hi,
I am working on finding which of my samples are related and which are not.

I used the following commands

plink --vcf myvcf.vcf.gz --indep-pairwise 50 5 0.2
plink --vcf myvcf.vcf.gz --extract plink.prune.in --genome

When I check the output, no pair has a nonzero pi_hat score below 0.2, or even below 0.6. The RT column only reports "UN" despite the high pi_hat scores. There are also a lot of 0.0 values, which looks suspicious.
Can someone tell me what I am doing wrong, and where?

thank you in advance.

Christopher Chang

Oct 9, 2014, 2:25:46 PM
to plink2...@googlegroups.com
The RT column is only based on paternal/maternal ID information in the .fam file.  If you don't already have the pedigree, and instead are trying to infer it with the help of --genome, you should ignore RT.

Sal

Oct 9, 2014, 2:57:18 PM
to plink2...@googlegroups.com
Cool, that is good to know.
What about the very high pi_hat scores and the exact 0 values? Every pair is either 0 or higher than 0.6.
We know that not all samples are related. How can I check what I am doing wrong?

Jeff Staples

Oct 10, 2014, 4:05:01 AM
to plink2...@googlegroups.com
Sorry for jumping in, but my research is very much in the area of relationship inference and pedigree reconstruction from the inferred relationships.

A pi_hat score of 0.0 is expected between unrelated pairs of samples. Pi_hat > 0.6 seems a little inflated, but could still correspond to parent/child relationships, particularly if there is inbreeding in the population, if your samples are admixed, or if some of your samples come from a very different genetic background than the others.

pi_hat can be useful in some cases, but I have found the IBD0/1/2 proportions (Z0/Z1/Z2 in the .genome output) more informative. There are expected IBD0/1/2 proportions for different types of relationships. For example, parent/offspring would have about 0 IBD0, 1 IBD1, and 0 IBD2 (0/1/0), unrelated would be 1/0/0, MZ twins 0/0/1, and full siblings 0.25/0.5/0.25. Using these expected IBD proportions you can tell which types of relationships you have in your data. However, if your IBD proportions are not close to any of these expected proportions (for example, 0.9/0/0.1), that is often an indication of some other problem that may be resolved if you provide good reference allele frequencies or remove ancestry informative markers (AIMs). An inflated IBD2 value is often a good indicator that you may need to do some additional work to improve the IBD estimates.
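
As a rough illustration of the comparison described above, here is a minimal Python sketch that matches a pair's Z0/Z1/Z2 against the four expected proportions quoted in this post. The function name `closest_relationship` and the `max_distance` cutoff of 0.1 are assumptions made for this example; they are not PLINK or PRIMUS settings, and more distant relationships are deliberately left out.

```python
# Expected (Z0, Z1, Z2) proportions quoted in the post above.
EXPECTED_IBD = {
    "unrelated": (1.0, 0.0, 0.0),
    "parent/offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "MZ twins": (0.0, 0.0, 1.0),
}


def closest_relationship(z0, z1, z2, max_distance=0.1):
    """Return the nearest expected relationship, or None if none is close.

    max_distance is an arbitrary illustrative cutoff (Euclidean distance
    in Z0/Z1/Z2 space), not a default taken from any tool.
    """
    best_label, best_dist = None, float("inf")
    for label, (e0, e1, e2) in EXPECTED_IBD.items():
        dist = ((z0 - e0) ** 2 + (z1 - e1) ** 2 + (z2 - e2) ** 2) ** 0.5
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else None


if __name__ == "__main__":
    print(closest_relationship(0.02, 0.97, 0.01))  # near 0/1/0
    print(closest_relationship(0.9, 0.0, 0.1))     # not near any expectation
```

A pair that returns None here is not necessarily unrelated; it only means the estimates do not sit near one of the four textbook patterns, which is the situation where checking allele frequencies and AIMs is suggested.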

I have developed a program (PRIMUS) that can help get good IBD estimates and reconstruct those estimates into their pedigree structures. The paper is in press at AJHG and the program should be available for download at primus.gs.washington.edu within a week.

I hope this was somewhat helpful.

All the best,

Jeff Staples

Gad Abraham

Oct 10, 2014, 6:12:14 AM
to Jeff Staples, plink2-users
Jumping in to your jumping in, here's a concrete example that I can't
figure out, from another discussion here. Based on 98K LD-thinned
SNPs, --genome gives:

Z0 Z1 Z2 PI_HAT
0.8792 0.1208 0 0.0604
0.8797 0.1203 0 0.0602
0.8829 0.1171 0 0.0585
0.8837 0.1163 0 0.0581
0.8925 0.1075 0 0.0537

I strongly suspect that these individuals are not related, but pi-hat
seems too high for that. When you mention removing AIMs, do you mean
that allele frequency differences due to ancestry can confound IBD?
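
For reference, the PI_HAT column in the table above is derived from the Z columns: PLINK documents it as P(IBD=2) + 0.5 * P(IBD=1), that is, Z2 + Z1/2. A short Python sketch checking this against the five rows posted here (the helper name `pi_hat` and the rounding tolerance are choices made for this example):

```python
# Rows copied from the table above: (Z0, Z1, Z2, reported PI_HAT).
rows = [
    (0.8792, 0.1208, 0.0, 0.0604),
    (0.8797, 0.1203, 0.0, 0.0602),
    (0.8829, 0.1171, 0.0, 0.0585),
    (0.8837, 0.1163, 0.0, 0.0581),
    (0.8925, 0.1075, 0.0, 0.0537),
]


def pi_hat(z1, z2):
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


if __name__ == "__main__":
    for z0, z1, z2, reported in rows:
        # Small differences are expected because the Z values are rounded.
        print(f"Z1={z1:.4f} Z2={z2:.4f} "
              f"computed={pi_hat(z1, z2):.5f} reported={reported:.4f}")
```

So with Z2 = 0, a PI_HAT of about 0.06 here simply reflects a Z1 of about 0.12.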

Jeff Staples

Oct 10, 2014, 1:01:04 PM
to plink2...@googlegroups.com, gra...@gmail.com
If you can rule out the possibility of a bottlenecked or inbred background, then I would venture a guess that it has something to do with allele frequencies.

Yes, allele frequency differences due to ethnic background will confound IBD estimates. For example, if your dataset includes African and European samples, then the Africans will look more closely related to each other, and the Europeans more closely related to each other, than they truly are. In the case of unrelated samples, this confounding can make them look related. KING's IBD estimation method allows for different allele frequencies among the samples, so it is less sensitive to admixture and differing ethnic backgrounds. However, I have found that using PLINK after removing the AIMs and supplying appropriate reference allele frequencies from an unrelated reference dataset produces better IBD estimates than KING. I did hear that the authors of KING have a big announcement at ASHG this coming week, so maybe they have improved their method.
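
To make the confounding concrete, here is a deliberately simplified Python sketch of the method-of-moments idea for Z0: the rate at which a pair is opposite homozygotes (IBS0) divided by the rate expected for an unrelated pair, which is 2p²q² per SNP under Hardy-Weinberg equilibrium. It uses expected rates rather than observed genotype counts and ignores finite-sample corrections and the Z1/Z2 steps, so it is not PLINK's actual computation. The two-population setup, the frequencies 0.1 and 0.9, and the function names are all hypothetical.

```python
def expected_ibs0_given_ibd0(p):
    """P(two unrelated people are opposite homozygotes) at a biallelic SNP.

    p is the allele frequency; Hardy-Weinberg equilibrium is assumed.
    """
    q = 1.0 - p
    return 2.0 * p * p * q * q


def naive_z0(true_freqs, assumed_freqs):
    """Simplified Z0: IBS0 rate under the true frequencies divided by the
    IBS0 rate expected under the frequencies supplied to the estimator."""
    observed = sum(expected_ibs0_given_ibd0(p) for p in true_freqs)
    expected = sum(expected_ibs0_given_ibd0(p) for p in assumed_freqs)
    return min(1.0, observed / expected)


if __name__ == "__main__":
    n_snps = 1000
    pop_a = [0.1] * n_snps   # hypothetical frequencies in population A
    pooled = [0.5] * n_snps  # average of A (0.1) and a population B (0.9)

    # Unrelated pair from population A, frequencies from the pooled sample:
    print(naive_z0(pop_a, pooled))
    # Same pair, frequencies from a matching reference population:
    print(naive_z0(pop_a, pop_a))
```

In this toy setup the pooled frequencies push the Z0 estimate for a truly unrelated pair far below 1, while population-matched frequencies recover 1. Real data would show a milder effect, since real populations differ much less than 0.1 versus 0.9 at most SNPs.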

Are you using the --read-freq option when you get the IBD estimates in PLINK? If so, what dataset did you use to get those frequencies? If not, how many samples are in your dataset? The quality of the allele frequencies depends heavily on these two details.

With regard to LD and the --genome option, I haven't seen much bias in IBD estimates when I don't remove LD. I have even obtained decent IBD estimates from exome data without pruning LD. You might try that to see if it has any effect, but to be honest, I haven't yet done thorough testing to compare IBD estimates with and without LD pruning.

Richard Anney

Oct 21, 2014, 7:14:26 PM
to plink2...@googlegroups.com
Hi Sal - I would suggest you have a look at the IDs of the pairs which are "related". Poor genotyping can lead to spurious PI_HAT scores; perform a full QC pipeline on your data, rerun --genome, and then look at the number of high-PI_HAT observations per FID/IID. In a recent QC of a control dataset, a single individual was "related" to over 1000 others - unlikely in even the friendliest community. Simple step-wise exclusion of the top offenders can clear up the problem.
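
A minimal Python sketch of that per-individual tally, assuming the whitespace-delimited .genome layout with a header row containing FID1, IID1, FID2, IID2 and PI_HAT. The sample rows, the 0.2 threshold, and the function name `count_high_pi_hat` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical excerpt in the .genome column layout (values are made up).
SAMPLE_GENOME = """\
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
F1 S1 F2 S2 UN NA 0.3000 0.1000 0.6000 0.6500 -1 0.90 1.0 NA
F1 S1 F3 S3 UN NA 0.2500 0.1000 0.6500 0.7000 -1 0.91 1.0 NA
F1 S1 F4 S4 UN NA 0.3000 0.0000 0.7000 0.7000 -1 0.92 1.0 NA
F2 S2 F3 S3 UN NA 1.0000 0.0000 0.0000 0.0000 -1 0.70 0.5 2.0
"""


def count_high_pi_hat(genome_text, threshold=0.2):
    """Count, per (FID, IID), the pairs whose PI_HAT exceeds threshold."""
    lines = genome_text.strip().splitlines()
    col = {name: i for i, name in enumerate(lines[0].split())}
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if float(fields[col["PI_HAT"]]) > threshold:
            # Both members of the pair get credited with one "relative".
            counts[(fields[col["FID1"]], fields[col["IID1"]])] += 1
            counts[(fields[col["FID2"]], fields[col["IID2"]])] += 1
    return counts


if __name__ == "__main__":
    # Individuals with the most high-PI_HAT partners come first.
    for (fid, iid), n in count_high_pi_hat(SAMPLE_GENOME).most_common():
        print(fid, iid, n)
```

On a real file you would read the text from the .genome output instead of the inline string; individuals at the top of the list with implausibly many partners are the candidates for step-wise exclusion.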
Ric   

Mike Miller

Oct 23, 2014, 10:33:27 AM
to Jeff Staples, plink2...@googlegroups.com
I would say from my experience that the Z2 tends to get inflated by
ancestral stratification in the data set. Jeff obviously has all good
ideas, but I would add that I think Z2 inflation also happens more if I
haven't done LD pruning. I don't know any papers on this.

So I'll see stuff like all members of some ancestral group are related to
one another with Z1=0 but Z2=0.12.

If you ask me, this...

Z0 Z1 Z2 PI_HAT
0.8792 0.1208 0 0.0604
0.8797 0.1203 0 0.0602
0.8829 0.1171 0 0.0585
0.8837 0.1163 0 0.0581
0.8925 0.1075 0 0.0537

...looks like related pairs with coefficient of relationship of about
0.0625. So maybe first cousins once removed, or similar. In large data
sets I have always found some previously unknown relatives. In the real
world, everyone is related to everyone else in some way.
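
The 0.0625 figure can be checked against the standard pedigree expectations for the coefficient of relationship. A small Python sketch, assuming PI_HAT is read as an estimate of that coefficient; the lookup table lists only a few common relationships, several other pedigree relationships share the same values, and the function name is made up for this example.

```python
# Textbook expected coefficients of relationship (no inbreeding assumed).
EXPECTED_R = {
    "parent/offspring or full siblings": 0.5,
    "half siblings, grandparent/grandchild, or avuncular": 0.25,
    "first cousins": 0.125,
    "first cousins once removed": 0.0625,
    "second cousins": 0.03125,
    "unrelated": 0.0,
}


def nearest_relationship(pi_hat):
    """Return the label whose expected coefficient is closest to pi_hat."""
    return min(EXPECTED_R, key=lambda label: abs(EXPECTED_R[label] - pi_hat))


if __name__ == "__main__":
    # PI_HAT values from the table quoted above.
    for value in (0.0604, 0.0602, 0.0585, 0.0581, 0.0537):
        print(value, nearest_relationship(value))
```

This only says which pedigree expectation the numbers are closest to; it cannot tell a real distant relationship apart from a PI_HAT inflated by stratification or poor allele frequencies.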

Mike