30 samples with KING kinship estimate equal to 0.5

Diego Salazar Tortosa

Jun 28, 2024, 12:56:04 PM
to plink2-users

Dear Christopher,

First of all, thank you for developing (and continuing to develop) these amazing tools.

I am performing QC on a dataset with approximately 1400 samples and around 600K SNPs (hg38). I am using the KING-robust method implemented in PLINK 2 to detect and filter out second-degree relatives and closer. Following your explanation in the documentation, I am using as the cutoff the geometric mean of the expected kinship coefficients of second- and third-degree relatives (i.e., 0.088). The resulting KING table is attached.
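For reference, the 0.088 cutoff can be reproduced with a few lines of Python. This is a minimal sketch, assuming the usual expected kinship of 1/2 for duplicates or MZ twins and a halving with each further degree of relatedness; the function names are illustrative and are not part of PLINK.

```python
import math


def expected_kinship(degree):
    """Expected kinship coefficient: 1/2 for duplicates/MZ twins (degree 0),
    1/4 for first degree, 1/8 for second degree, 1/16 for third degree."""
    return 0.5 ** (degree + 1)


def cutoff_between(closer_degree):
    """Geometric mean of the expected kinship for `closer_degree`
    and for the next, more distant degree."""
    return math.sqrt(expected_kinship(closer_degree) * expected_kinship(closer_degree + 1))


labels = ["duplicate/MZ twin", "first degree", "second degree", "third degree"]
for degree, label in enumerate(labels):
    print(f"{label}: {cutoff_between(degree):.4f}")
# second degree prints 0.0884, the value quoted above as 0.088
```

Under the same assumption, the cutoff separating duplicates from first-degree relatives comes out at about 0.354, so pairs with a kinship estimate near 0.5 sit well above it.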

As you can see, I have 15 pairs with a coefficient of 0.5, suggesting these are duplicates. Is this number common, or are my KING results strange? It seems a bit high to me, as it implies a significant number of errors during sample processing and/or genotyping.

Please note that before this step, I removed duplicated SNPs, filtered SNPs by MAF (0.05), missingness (0.01), and HWE (1e-25), and filtered samples by missingness (0.01). Around 270K SNPs passed all the filters. The KING table remains mostly the same even if I perform the analysis without these filters. Also, note that filtering by the usual pi_hat threshold for second-degree relatives (0.2) yields exactly the same results. For pi_hat I used LD-pruned data (95K autosomal SNPs), but not for KING.

In case my data contained an abnormally high proportion of twins, I checked whether the two samples in each pair have different phenotype values and, specifically, different ages. They do, so these are not twins.

Finally, I have found that 4 of these samples show a mismatch between self-reported sex and the sex inferred from genetic data. In these cases, it is very clear that the genetic data of one sample has been duplicated and matched with another sample having a different self-reported sex.

All this leads me to think that I could actually have a high number of duplicated samples. If you think this is the case, what would be your recommendation for dealing with them?

Removing samples with --king-cutoff essentially removes one sample in each pair, but I am unsure whether I should instead remove all the samples with a coefficient of 0.5. If two samples share both alleles at almost every locus but have different phenotypes, which phenotype does this genotype belong to?

My point is that, in each pair, we have two different phenotypes and one genome, so we cannot be sure which phenotype actually matches the genotype. I guess all 30 samples should be removed. Does this make sense?

Thank you in advance!

Best regards,

Diego

king_table.kin0

Chris Chang

Jun 28, 2024, 1:00:42 PM
to Diego Salazar Tortosa, plink2-users
Yes, I would remove both copies of any duplicate where you’re unsure about the phenotype.

A ~1% rate of sample-handling errors is pretty typical.
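Because --king-cutoff keeps one member of each related pair, dropping both copies of a duplicate means building the removal list separately. The following is a minimal sketch of one way to do that from a .kin0 table. It assumes a tab-delimited file whose header contains columns named IID1, IID2, and KINSHIP (check the header of your own file), and uses 0.354 as the duplicate cutoff; the function name and the example rows are made up for illustration and are not taken from the attached table.

```python
import csv
import io

# Assumed cutoff: geometric mean of 0.5 (duplicates) and 0.25 (first degree).
DUPLICATE_CUTOFF = 0.354


def duplicate_samples(kin0_text, cutoff=DUPLICATE_CUTOFF):
    """Return the IDs of every sample that appears in at least one pair
    whose kinship estimate is at or above `cutoff`."""
    reader = csv.reader(io.StringIO(kin0_text), delimiter="\t")
    # The first header field may carry a leading '#'; strip it before lookup.
    header = [name.lstrip("#") for name in next(reader)]
    id1 = header.index("IID1")      # assumed column name
    id2 = header.index("IID2")      # assumed column name
    kin = header.index("KINSHIP")   # assumed column name
    flagged = set()
    for row in reader:
        if not row:
            continue  # skip blank lines
        if float(row[kin]) >= cutoff:
            flagged.update((row[id1], row[id2]))  # keep BOTH members of the pair
    return flagged


# Hypothetical rows for illustration only.
example = (
    "#IID1\tIID2\tNSNP\tHETHET\tIBS0\tKINSHIP\n"
    "S1\tS2\t270000\t0.30\t0.0000\t0.4998\n"
    "S3\tS4\t270000\t0.15\t0.0200\t0.1250\n"
    "S5\tS6\t270000\t0.30\t0.0000\t0.5000\n"
)
print(sorted(duplicate_samples(example)))
```

The resulting IDs could then be written to a text file and passed to PLINK's --remove; the exact ID-file layout expected (for example, whether FID columns are needed) depends on the dataset and should be checked against the PLINK documentation.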

Diego Salazar Tortosa

Jul 3, 2024, 10:50:57 AM
to plink2-users
Ok, that's great. Thank you so much for the clarification!

Best,
Diego