Re: Individuals related to unbelievable number of other individuals in IBD estimation

Chris Chang

unread,

Sep 2, 2015, 12:39:53 AM9/2/15

to Molly Scannell Bryan, plink2-users

The Z0/Z1 estimation procedure is described in the original PLINK paper; an ungated copy is available at e.g. http://research.mssm.edu/statgen/library/purcell-2007-plink.pdf .

Meanwhile, the GCTA genetic relationship matrix (see http://www.ncbi.nlm.nih.gov/pubmed/21167468 ) is the more common choice nowadays for estimating relatedness. You can use "--rel-cutoff 0.0625" to prune the samples until no pair has relationship coefficient >= 0.0625, and --make-grm-gz to dump the relationship matrix in a text format. (It may still be necessary to use a higher cutoff with only 13000 variants, though.)

On Tue, Sep 1, 2015 at 9:55 AM, Molly Scannell Bryan <scan...@gmail.com> wrote:

Hello,
I'm using plink to analyze exome chip data, roughly following the QC that was suggested in this paper http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4441213/. I have removed variants and individuals with more than 5% missing, and removed people whose genotypic sex was not as expected. These participants are not known to be related to one another. I'm now trying to make sure that there isn't cryptic relatedness in the sample, and I am getting results that are difficult for me to interpret. The code I use is:

plink --bfile exome_s6 --maf 0.1 --exclude chr23_26-1.txt --indep-pairwise 50 5 0.2 --out exome_s7_indepautoSNP --noweb
plink --bfile exome_s6 --extract exome_s7_indepautoSNP.prune.in --make-bed --out exome_s8_independentsnps --noweb
plink --bfile exome_s8_independentsnps --freq --out exome_s8_independentsnps_freq --noweb
plink --bfile exome_s8_independentsnps --read-freq exome_s8_independentsnps_freq.frq --genome --min 0.0625 --out exome_s9_ibdestimate --noweb

The output file indicates that out of 4626 participants (~3600 cases and ~1000 controls), all of them have at least one "close" (PI_HAT > 0.0625) relative. Some more investigating shows that this is to some extent driven by a handful of participants who seem to be "related" to almost everyone (I'm attaching the plot of the number of cousins). There are about 160 participants who are calculated to be related to more than 5 other people in the data set.

Can anyone suggest what would give rise to this behavior? I've had a few thoughts, but none of them seemed to change the results:
Ethnicity: These samples are all the same self-described ethnicity. Although I haven't done PCA pruning yet, these samples are all of the same self-described ethnicity, so there shouldn't be many outliers by ancestry.
Calculating IBD based on rare variants: The --maf flag is there to keep rare variants from entering into the calculation. I've tried thresholds of 0.1, 0.05, and 0.01, all with qualitatively similar results
Calculating IBD on not enough variants: Because this is an exome chip, after pruning there are only 13,000 variants that are used in the IBD estimation. I understand this is not ideal, but the plink documentation suggests that 1000 is a bare minimum required (http://pngu.mgh.harvard.edu/~purcell/plink/ibdibs.shtml).
Is there a link to the paper that describes how Z0 and Z1 are estimated?

Has anyone else seen this behavior? I'm having a hard time figuring out why a person would show up as "related" to so many other people?

Thank you in advance.
-Molly

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jeff Staples

unread,

Sep 2, 2015, 9:27:01 AM9/2/15

to plink2-users, scan...@gmail.com

I have been estimating IBD estimates on every growing exome datasets (now up to >40K exomes) and I have seen very similar occurrence in the data. I have sorted things out and developed my own custom pipeline to clean both individuals and SNPs from the dataset. Depending on how your dataset was collected, it is not entirely unreasonable to find several individuals with five 3rd degree relatives. If those IBD estimates look clean, and the samples are not flagged/removed for other reasons, I would be included to believe those relationships, especially if they are closer than 3rd degree and the Z2 values is close to 0 (except for full-sibs). You may try using PRIMUS (https://primus.gs.washington.edu/primusweb) to see if you can reconstruct those relationships into pedigrees. I would be very interested to see the Z0 vs Z1 plot as it should look something like the IBD0 vs IBD1 plot (Figure S5) in the PRIMUS paper (http://www.cell.com/ajhg/abstract/S0002-9297(14)00427-3).

In my dataset, I too found several individuals who are 3rd degree or closer relatives to 10s of thousands of others in the dataset, which is not possible. I often find that these samples are problematics and should be removed. One common problem is that they often have high heterozygosity rates compare to the other samples in the dataset. PLINK's --het option will give you the het rates and you can make a histogram of the output in R. Inflated het rates are often due to some form of contamination. Another reason I have found overly related samples is high SNP missingness in those samples. I would recommend removing samples with >10% missing calls (--mind 0.1), but make sure to run this option after you have ran the --geno. Other filters I have found useful are --het 0.1, --geno 0.05, --hwe 0.000001. Many people recommend removing SNPs in LD, but I often don't and I haven't seen any problems yet. I find that with exomes, the more good SNPs you can keep in the dataset, the better of you will be.

I also calculate reference allele frequencies for the population by using an unrelated subset of the data. I have found that the GCTA genetic relationship matrix calculations from PLINK are much less accurate at predicting relationships than my cleaned IBD estimates. I calculate naive IBD estimates, remove individuals with >100 relatives with pi_hat > 0.1875, prune the dataset to an unrelated set using the IBD estimates, then use the unrelated set to obtain reference allele frequencies (using the --freq option in PLINK). I then calculate IBD estimates by reading in the reference allele frequencies with the --read-freq option in PLINK. If the population appears to be admixed, then I also remove ancestry informative markers.

Much of this pipeline has been built out and automated in the public version of PRIMUS (call prePRIMUS or PRIMUS IBD pipeline), but you may find it easier and/or better to do the filtering on your own.

Christopher Chang

unread,

Sep 2, 2015, 2:16:55 PM9/2/15

to plink2-users, scan...@gmail.com

You've found PI_HAT to be better than the GCTA relationship coefficient even when you apply the same filters, use --read-freq, etc. when computing the relationship matrix? If so, I'd like to add a link to your comment in the --rel-cutoff section of the official documentation (and I might extend that command so it can operate on PI_HAT values instead of GRM entries), if you don't mind.

Jeff Staples

unread,

Sep 3, 2015, 9:14:35 AM9/3/15

to plink2-users, scan...@gmail.com

To be fair, I have not invested nearly as much time in optimizing the GCTA genetic relationship matrix as I have the IBD estimates, and as a result I may not have used all the same parameters with GCTA as I did with --genome. I think it will require additional validation before I feel comfortable making that claim across the board. However, I do know that some of colleagues have independently come to the same conclusion I have.

I'll see if I can do a fair apples to apples comparison between the two methods and report back on what I find.

Molly Scannell Bryan

unread,

Sep 22, 2015, 10:33:27 AM9/22/15

to Jeff Staples, plink2-users

This is a delayed thank you all for your thoughts. On your suggestions, I moved my heterozygosity QC to before the relatedness calculation, and that made quite a bit of difference.

Also with your suggestion, I have run the relatedness calculation in both PLINK and GCTA (using the same parameters, with the same QC before it), and ended up with estimates that were fairly different, in both absolute value and ordering. The horizontal red line is where I had decided to make the relationship cutoff for PLINK. The vertical red line is where the GCTA cutoff would have needed to be to exclude the same number of relationships. As you can see, there are quite a few relationships that look close in PLINK but not in GCTA, and vice versa.

I liked the idea of pruning on relationship with GCTA because it automatically removes the least number of people, but given the conversation that has happened here, does anyone have any thoughts on whether the GCTA or the PLINK IBD estimates are more reliable?

gctavsplinkzoom.jpg

Reply all

Reply to author

Forward