On gender calling


Fengyuan Hu

Sep 26, 2014, 7:30:12 AM
to plink2...@googlegroups.com
Dear Chris,

Thanks for improving the performance of this great toolset. I've been advised to use plink to infer sample gender. My only input is a VCF file; could you advise on the best practice for calling gender, please?

Best wishes,
Fengyuan

Christopher Chang

Sep 26, 2014, 10:51:18 AM
to plink2...@googlegroups.com
This depends on what kind of Y chromosome calls you have, if any.


Without Ychr calls, I recommend the following steps:

1. Convert the VCF to plink format:

plink --vcf [VCF filename] --out vcf_data

(You can add "--chr 23-24" if the *only* plink operation you want to perform is gender imputation.)

2. Prune your SNP set to reduce LD.  The best parameters depend on SNP density, but the following seems to work well with 1000 Genomes-like data:

plink --bfile vcf_data --indep-pairphase 20000 2000 0.5 --chr 23-24
plink --bfile vcf_data --extract plink.prune.in --make-bed --out ld_pruned_xy

3. Run --check-sex just to see the distribution of F coefficients.

plink --bfile ld_pruned_xy --check-sex

4. You should get a clump of F coefficients near 1, and a bunch of dispersed values.  (You can load the .sexcheck file into R and make a plot.)  There should be at least a small gap between the two distributions, usually close to 0.8.

4a. If you're unsure, you can retry --check-sex with the non-LD-pruned dataset.  In this case, the gap tends to be near 0.9 in my experience.

5. Use --impute-sex with the gap position.  I.e. either

plink --bfile ld_pruned_xy --impute-sex [~0.8] [~0.8] --make-bed --out imputed_sex

or

plink --bfile vcf_data --impute-sex [~0.9] [~0.9] --make-bed --out imputed_sex

6. If necessary, import the imputed sex values into the main dataset.  E.g. if you used --impute-sex on the ld_pruned_xy dataset,

plink --bfile vcf_data --update-sex ld_pruned_xy.fam 3 --make-bed --out vcf_data_with_sexes


With Ychr data, you'll probably want to use the --check-sex 'ycount' or 'y-only' modifier; see https://www.cog-genomics.org/plink2/basic_stats#check_sex for some discussion.
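For steps 3 and 4, the gap can also be located without plotting. The following is a minimal sketch, not part of plink: it assumes the usual whitespace-delimited .sexcheck layout with a header row and the F estimate in column 6, and the function name largest_f_gap is made up for illustration. It prints the midpoint of the widest gap between sorted F values, which is only a starting point; a plot is still worth checking.

```shell
# largest_f_gap SEXCHECK_FILE
# Hypothetical helper: prints the midpoint of the widest gap between
# consecutive sorted F values (column 6 of a .sexcheck file).
largest_f_gap() {
    # Skip the header row and samples whose F estimate is "nan".
    awk 'NR > 1 && $6 != "nan" { print $6 }' "$1" |
    LC_ALL=C sort -g |
    awk '
        # Track the widest gap seen so far between neighbouring values.
        NR > 1 && $1 - prev > best { best = $1 - prev; mid = (prev + $1) / 2 }
        { prev = $1 }
        END { if (best > 0) printf "%.3f\n", mid }'
}
```

Usage would be something like "largest_f_gap plink.sexcheck", with the printed value then passed to --impute-sex in step 5 if it agrees with what the plot shows.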

Christopher Chang

Sep 26, 2014, 10:54:57 AM
to plink2...@googlegroups.com
Correction: step 6 should have imputed_sex.fam in place of ld_pruned_xy.fam .

Christopher Chang

Sep 26, 2014, 11:48:04 AM
to plink2...@googlegroups.com
Oh, forgot one other thing: you'll want to use --split-x between steps 1 and 2, unless you don't have any pseudoautosomal region variant calls.  E.g.

plink --bfile vcf_data_unsplit --split-x b37 --make-bed --out vcf_data
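
Putting the original steps together with the two corrections above (imputed_sex.fam in step 6, and --split-x between steps 1 and 2), the no-Ychr workflow could be scripted roughly as follows. This is a sketch only: the function names, the PLINK and DRY_RUN variables, and the run wrapper are invented for illustration, the plink commands are the ones quoted in this thread, and the --split-x step should be dropped if there are no pseudoautosomal region calls.

```shell
# Hypothetical wrapper around the commands from this thread (PLINK 1.9).
PLINK=${PLINK:-plink}

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# prepare_and_check VCF_FILE: steps 1-3, ending with --check-sex.
prepare_and_check() {
    # 1. Convert the VCF to plink format.
    run "$PLINK" --vcf "$1" --out vcf_data_unsplit &&
    # 1b. Split off the pseudoautosomal region (b37 coordinates assumed).
    run "$PLINK" --bfile vcf_data_unsplit --split-x b37 --make-bed --out vcf_data &&
    # 2. LD-prune the sex chromosomes; writes plink.prune.in.
    run "$PLINK" --bfile vcf_data --indep-pairphase 20000 2000 0.5 --chr 23-24 &&
    run "$PLINK" --bfile vcf_data --extract plink.prune.in --make-bed --out ld_pruned_xy &&
    # 3. Write plink.sexcheck so the F distribution can be inspected.
    run "$PLINK" --bfile ld_pruned_xy --check-sex
}

# impute_and_merge F_GAP: steps 5-6, once the gap has been chosen by eye.
impute_and_merge() {
    run "$PLINK" --bfile ld_pruned_xy --impute-sex "$1" "$1" --make-bed --out imputed_sex &&
    run "$PLINK" --bfile vcf_data --update-sex imputed_sex.fam 3 --make-bed --out vcf_data_with_sexes
}
```

Step 4 stays manual: run "prepare_and_check my.vcf", look at plink.sexcheck, then call "impute_and_merge 0.8" (or whatever the gap turns out to be). Setting DRY_RUN=1 first prints the commands without running them.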



Fengyuan Hu

Sep 26, 2014, 11:57:00 AM
to Christopher Chang, plink2...@googlegroups.com
Dear Chris,

This is really informative! 

I can quickly check whether the VCF contains Y calls. One thing confuses me: if I have Y calls in the data, at which step should I add 'ycount' or 'y-only' to the commands? Could you elaborate?

Thanks a lot.

Fengyuan




--
Fengyuan Hu
Bioinformatician
Department of Haematology
University of Cambridge
NHS Blood & Transplant Building
Long Road
Cambridge CB2 0PT
UK

Christopher Chang

Sep 26, 2014, 12:21:44 PM
to plink2...@googlegroups.com
Replace --check-sex with "--check-sex ycount", and then see if there's an obvious gap in number of Y calls (which should line up with the F statistic clumps).  If there is, you may want to run --impute-sex with 'ycount'/'y-only', but it's not actually necessary unless it looks like F-statistic-based imputation would get a few samples wrong.
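
One way to see whether the Y call counts line up with the F clumps is to summarise YCOUNT on each side of the chosen F gap. The sketch below is an illustration rather than plink functionality: it assumes the .sexcheck file was produced with "--check-sex ycount", so that F is in column 6 and YCOUNT in column 7, and the function name ycount_by_f_clump is hypothetical.

```shell
# ycount_by_f_clump SEXCHECK_FILE F_GAP
# Hypothetical helper: for samples below and at/above F_GAP, report how many
# there are and the range of their nonmissing Y call counts (column 7).
ycount_by_f_clump() {
    awk -v gap="$2" '
        NR == 1 || $6 == "nan" { next }          # skip header and undefined F
        {
            k = ($6 + 0 >= gap + 0) ? "high_F" : "low_F"
            n[k]++
            if (!(k in min) || $7 + 0 < min[k]) min[k] = $7 + 0
            if (!(k in max) || $7 + 0 > max[k]) max[k] = $7 + 0
        }
        END {
            for (k in n)
                printf "%s n=%d ycount_min=%d ycount_max=%d\n", k, n[k], min[k], max[k]
        }' "$1" | LC_ALL=C sort
}
```

For example, "ycount_by_f_clump plink.sexcheck 0.8". If the two YCOUNT ranges do not overlap, there is an obvious gap in the number of Y calls that matches the F statistic clumps.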



Fengyuan Hu

Sep 26, 2014, 12:27:06 PM
to Christopher Chang, plink2...@googlegroups.com

Thank you! I'll give it a go.


freeseek

Jan 5, 2015, 12:07:49 PM
to plink2...@googlegroups.com
I have grown a little wary of using the Y chromosome for sex imputation. The Y chromosome has been shown to be missing in the blood of a significant fraction of old males (see http://dx.doi.org/10.1038/ng.2966 and http://dx.doi.org/10.1126/science.1262092), so missingness on chromosome Y in old males can be due to somatic loss. If you have a large dataset with thousands of non-young individuals, you are guaranteed to run into this issue to some extent. It is much more common than the issue of XXY males, estimated to exist "in 1:500 to 1:1000 male live births" (http://en.wikipedia.org/wiki/Klinefelter_syndrome). For XXY individuals, "--impute-sex {female max F} {male min F}" will fail to identify them as males, while "--impute-sex y-only {female max Y obs} {male min Y obs}" would likely classify them as males. I believe females can also lose a copy of chromosome X in the blood (see http://dx.doi.org/10.1056/NEJMoa1409405), but this is much less common than losing chromosome Y.

Christopher Chang

Jan 5, 2015, 6:31:24 PM
to plink2...@googlegroups.com
Yes, it makes more sense to use "ycount" mode, which treats X chromosome heterozygosity as the primary signal but requires a confirmatory Y chromosome nonmissing call count; this is likely to flag both the old males you mention and Klinefelter cases as unknown, while imputing most other sexes correctly.  "y-only" is mostly intended as a shortcut for setting genders when they were previously already called, but are missing from the immediate .fam file for some reason (e.g. conversion from VCF).
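
To list the samples where the two signals disagree (the Y-loss and XXY-like cases discussed above), something like the following could be run on a "--check-sex ycount" report. It is a rough sketch under stated assumptions: F in column 6 and YCOUNT in column 7, thresholds supplied by the user in the same order as the {female max F} {male min F} {female max Y obs} {male min Y obs} parameters, and a hypothetical function name flag_discordant. It does not reproduce plink's own logic exactly.

```shell
# flag_discordant SEXCHECK_FILE FEMALE_MAX_F MALE_MIN_F FEMALE_MAX_Y MALE_MIN_Y
# Hypothetical helper: prints the IID of each sample whose X-based and
# Y-based evidence point in opposite directions.
flag_discordant() {
    awk -v fmax="$2" -v mmin="$3" -v ymax="$4" -v ymin="$5" '
        NR == 1 || $6 == "nan" { next }          # skip header and undefined F
        # Male-like X homozygosity but (almost) no Y calls: possible Y loss.
        $6 + 0 > mmin + 0 && $7 + 0 <= ymax + 0 { print $2, "high F but few Y calls"; next }
        # Female-like X heterozygosity but many Y calls: possible XXY.
        $6 + 0 < fmax + 0 && $7 + 0 >= ymin + 0 { print $2, "low F but many Y calls" }
    ' "$1"
}
```

A call such as "flag_discordant plink.sexcheck 0.2 0.8 0 1" uses illustrative thresholds only; the real ones should come from the gaps seen in the data.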

Brian Haas

May 7, 2016, 10:31:42 PM
to plink2-users
Thanks for the helpful posts here!

I ran through the protocol described here like so:

plink --vcf gatk-HC-VQSR-annotated.final.vcf --out gatk-HC-VQSR-annotated.final.vcf.plink.data
plink --bfile gatk-HC-VQSR-annotated.final.vcf.plink.data --indep-pairphase 20000 2000 0.5 --chr 23-24
plink --bfile gatk-HC-VQSR-annotated.final.vcf.plink.data --extract plink.prune.in --make-bed --out ld_pruned_xy
plink --bfile ld_pruned_xy --check-sex ycount

# manual step, examine file 'imputed_sex.sexcheck' using R, determine gap param
# see attached histogram of F values

plink --bfile ld_pruned_xy --impute-sex 0.6 0.6 --make-bed --out imputed_sex
plink --bfile gatk-HC-VQSR-annotated.final.vcf.plink.data --update-sex ld_pruned_xy.fam 3 --make-bed --out vcf_data_with_sexes

  


but then when I examined the predictions in the 'imputed_sex.sexcheck' file, all the predictions were exactly reversed from what I expected (all males were inferred as females and vice-versa).  


Is the coding of male (1) and female (2) reversed in this output (a bug?), or could it be that I did something wrong here?


thanks!

F_hist.png

Brian Haas

May 7, 2016, 10:41:09 PM
to plink2-users
Actually, plink is making the correct gender calls.  Another tool I'm using must have the assignments reversed...
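
For anyone double-checking the coding: plink's .fam files store sex in column 5 as 1 = male, 2 = female and 0 = unknown, and the PEDSEX and SNPSEX columns of the .sexcheck report use the same codes. A quick tally of a .fam file, sketched below with a hypothetical function name count_sex_codes, makes it easy to compare against another tool's output.

```shell
# count_sex_codes FAM_FILE
# Hypothetical helper: tallies the sex codes in column 5 of a .fam file.
count_sex_codes() {
    awk '{ n[$5]++ }
         END { printf "male(1)=%d female(2)=%d unknown(0)=%d\n", n[1], n[2], n[0] }' "$1"
}
```

For example, "count_sex_codes imputed_sex.fam".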

apologies for the false alarm!

Ahmed Aljumaili

Oct 11, 2017, 10:28:58 AM
to plink2-users
Dear Chris,

Could I apply the same procedure to a non-human species with different sex chromosomes? If yes, how?


any help will be appreciated, Thanks


Ahmed

Christopher Chang

Oct 11, 2017, 6:01:27 PM
to plink2-users
If it's a ZW instead of an XY species, the hacky way to do this is
(i) flip the female and male gender codes, and accept that all of plink's output will have the sexes reversed.
(ii) encode Z -> X, and W -> Y.
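
As a rough illustration of those two recodings, the helpers below rewrite the files with awk. They are assumptions layered on top of the advice above, not plink features: the function names are hypothetical, the VCF is assumed to be uncompressed with chromosomes named exactly "Z" and "W", and header lines (including any ##contig entries) are passed through unchanged.

```shell
# flip_fam_sex FAM_FILE
# Hypothetical helper: swaps sex codes 1 <-> 2 in column 5 of a .fam file;
# 0 (unknown) is left alone. Output is single-space delimited.
flip_fam_sex() {
    awk '{ if ($5 == 1) $5 = 2; else if ($5 == 2) $5 = 1; $1 = $1; print }' "$1"
}

# zw_to_xy_vcf VCF_FILE
# Hypothetical helper: renames chromosome Z to X and W to Y in the CHROM
# column of an uncompressed VCF.
zw_to_xy_vcf() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { print; next }                    # header lines unchanged
         $1 == "Z" { $1 = "X" }
         $1 == "W" { $1 = "Y" }
         { print }' "$1"
}
```

Both functions write to standard output, e.g. "zw_to_xy_vcf birds.vcf > birds_xy.vcf" (file names here are placeholders), so the originals stay untouched.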