--impute-sex/--check-sex

Matthew Maher

Jun 18, 2022, 9:04:43 PM
to plink2-users
Not sure if I'm misunderstanding something or if this is a bug. 

My apologies that this is so long - it's hard to explain!

Summary question: with PLINK 1.9, if I use --make-bed --impute-sex .... with no other options and then immediately apply --check-sex (with the same female-max/male-min parameters), should I get no 'PROBLEM's of the PEDSEX != SNPSEX sort? I was assuming so (but maybe I'm wrong), since a mismatch would seem to mean PLINK is disagreeing with the imputation that it just made.

Or reworded:  should the F-stat calculation used by --check-sex/--impute-sex be in any way dependent on the current sex specified in the FAM file?  Again, I think not, but maybe I'm wrong.

Details:  My female-max parameter was the default of 0.2, and upon closer inspection, I could see that I have cases where the --impute-sex step calculated an F statistic of .2005 (indeterminate) but then the immediately following --check-sex on the newly created fileset calculated an F statistic of 0.1986 (female), resulting in a 'PROBLEM' report. 

I don't pretend to understand the math behind the F-statistic calculation, but the documentation makes clear it is based on allele frequencies, which you can supply/freeze with --read-freq. And sure enough, when I supply the same *.frqx file to the before and after (sex-imputation) filesets, the discrepancy goes away. Okay, so it seems that the frequency data must be different between the before and after filesets. And if I apply --freqx to both, sure enough the results are quite different. The last few columns reference 'haploid' and 'male X chromosome', so I certainly would not expect those to match, since the SEX in the FAM just got updated. But I also would think --impute-sex/--check-sex would NOT use those columns. But even the earlier columns ( C(HOM A1)    C(HET)    C(HOM A2) ) have quite different contents before/after the --impute-sex step, which makes me think those are also dependent on the SEX values in the FAM file. But then that would create a situation where the act of inferring the SEX could change the answer if you ask again (with --check-sex).
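
For readers following along, the dependence on allele frequencies can be sketched as follows. This is a minimal illustration that assumes the usual method-of-moments inbreeding estimator F = (O - E) / (N - E), where O is the observed number of homozygous calls, E the number expected from the allele frequencies and N the number of non-missing calls. PLINK's actual implementation may differ in details such as small-sample corrections. The function name and the toy numbers are hypothetical.

```python
def inbreeding_f(genotypes, a1_freqs):
    """Method-of-moments F for one sample (illustrative, not PLINK's code).

    genotypes: per-variant A1 allele counts (0, 1, 2), or None if missing.
    a1_freqs:  per-variant A1 allele frequency estimates.
    """
    observed_hom = 0
    expected_hom = 0.0
    n_called = 0
    for g, p in zip(genotypes, a1_freqs):
        if g is None:
            continue  # missing calls contribute nothing
        n_called += 1
        if g != 1:
            observed_hom += 1
        # Expected homozygosity under Hardy-Weinberg: 1 - 2pq
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    return (observed_hom - expected_hom) / (n_called - expected_hom)


# Toy data (hypothetical): identical genotypes, two frequency estimates
genotypes = [0, 2, 1, 0, 2, 0, 1, 2, 0, 0]
freqs_before = [0.42, 0.13, 0.32, 0.12, 0.20, 0.01, 0.05, 0.28, 0.29, 0.40]
freqs_after = [0.46, 0.15, 0.35, 0.13, 0.22, 0.01, 0.05, 0.30, 0.31, 0.43]

f_before = inbreeding_f(genotypes, freqs_before)
f_after = inbreeding_f(genotypes, freqs_after)
print(f_before, f_after)
```

The genotypes never change here, yet the two F values differ because the expected homozygosity is recomputed from different frequencies. Under this assumed estimator, that is the kind of shift that can move a sample across a fixed threshold.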

More details: prior to doing this, I had done --split-x and --indep-pairwise.... to get just a pruned chrX to work with. After the --impute-sex step, I do see the warning for "het. haploid genotypes present", which makes sense, but I would expect those not to matter, as I would expect --check-sex/--impute-sex to make some calculation that attempts to be blind to current SEX status. Or maybe I'm fundamentally misunderstanding things, and I really need to capture that --freqx output from before the imputation if I want things to match?

Below are logs of PLINK runs interspersed with just the head of the *.frqx  file from before and after doing a --impute-sex operation, since I assume the difference in the *.frqx results underpins my question.  

Thanks for any info and thanks for PLINK(2)!



PLINK v1.90b6.26 64-bit (2 Apr 2022)           www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to SICtestXPruned.log.
Options in effect:
  --bfile SICtestXPruned
  --freqx
  --out SICtestXPruned

515437 MB RAM detected; reserving 257718 MB for main workspace.
30774 variants loaded from .bim file.
5143 people (0 males, 0 females, 5143 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SICtestXPruned.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5143 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998911.
--freqx: Allele frequencies (founders only) written to SICtestXPruned.frqx .

CHR    SNP    A1    A2    C(HOM A1)    C(HET)    C(HOM A2)    C(HAP A1)    C(HAP A2)    C(MISSING)
23    rs2306737    G    A    842    2663    1633    0    0    5
23    exm1625831    A    G    363    629    4151    0    0    0
23    rs5939320    G    A    1061    1193    2887    0    0    2
23    JHU_X.2701184    C    T    323    568    4247    0    0    5
23    rs113564299    CA    C    625    795    3722    0    0    1
23    JHU_X.2701482    G    A    0    8    5135    0    0    0
23    JHU_X.2701698    C    A    4    18    5118    0    0    3
23    rs111595179    C    T    142    2597    2387    0    0    17
23    rs1419931    A    G    908    1122    3112    0    0    1

PLINK v1.90b6.26 64-bit (2 Apr 2022)           www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to SICtestXPrunedSI.log.
Options in effect:
  --bfile SICtestXPruned
  --impute-sex 0.2 0.8
  --keep-allele-order
  --make-bed
  --out SICtestXPrunedSI

515437 MB RAM detected; reserving 257718 MB for main workspace.
30774 variants loaded from .bim file.
5143 people (0 males, 0 females, 5143 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SICtestXPrunedSI.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5143 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998911.
30774 variants and 5143 people pass filters and QC.
Note: No phenotypes present.
--impute-sex: 28606 Xchr and 0 Ychr variant(s) scanned, 5133/5143 sexes
imputed. Report written to SICtestXPrunedSI.sexcheck .
--make-bed to SICtestXPrunedSI.bed + SICtestXPrunedSI.bim +
SICtestXPrunedSI.fam ... done.

PLINK v1.90b6.26 64-bit (2 Apr 2022)           www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to SICtestXPrunedSI.log.
Options in effect:
  --bfile SICtestXPrunedSI
  --freqx
  --out SICtestXPrunedSI

515437 MB RAM detected; reserving 257718 MB for main workspace.
30774 variants loaded from .bim file.
5143 people (2434 males, 2699 females, 10 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SICtestXPrunedSI.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5143 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 313802 het. haploid genotypes present (see SICtestXPrunedSI.hh ); many
commands treat these as missing.
Total genotyping rate is 0.998911.
--freqx: Allele frequencies (founders only) written to SICtestXPrunedSI.frqx .

CHR    SNP    A1    A2    C(HOM A1)    C(HET)    C(HOM A2)    C(HAP A1)    C(HAP A2)    C(MISSING)
23    rs2306737    G    A    840    1309    560    2    1073    1359
23    exm1625831    A    G    52    624    2033    311    2118    5
23    rs5939320    G    A    286    1188    1234    775    1653    7
23    JHU_X.2701184    C    T    38    564    2102    285    2145    9
23    rs113564299    CA    C    162    792    1755    463    1967    4
23    JHU_X.2701482    G    A    0    6    2703    0    2432    2
23    JHU_X.2701698    C    A    0    18    2691    4    2427    3
23    rs111595179    C    T    14    309    2385    128    2    2305
23    rs1419931    A    G    235    1116    1358    673    1754    7

Christopher Chang

Jun 18, 2022, 9:38:40 PM
to plink2-users
As you noticed, filling in missing sex information changes chrX allele frequency estimates... and yes, as you suspected, --check-sex is *not* blind to this.
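
To make this concrete, here is a small sketch that recomputes the A1 frequency of rs2306737 from the two .frqx rows posted above. It assumes each diploid call contributes two alleles and each haploid call contributes one, with het. haploid calls landing in the missing column. The exact weighting PLINK applies on chrX may differ, and the helper name is made up for illustration.

```python
def a1_freq(hom_a1, het, hom_a2, hap_a1, hap_a2):
    """A1 allele frequency from .frqx-style genotype counts (illustrative)."""
    a1_alleles = 2 * hom_a1 + het + hap_a1
    total_alleles = 2 * (hom_a1 + het + hom_a2) + hap_a1 + hap_a2
    return a1_alleles / total_alleles


# rs2306737 counts copied from the two .frqx excerpts in the original post
before = a1_freq(842, 2663, 1633, 0, 0)    # all samples treated as diploid
after = a1_freq(840, 1309, 560, 2, 1073)   # imputed males treated as haploid
print(f"before: {before:.4f}, after: {after:.4f}")
```

Under these assumptions the estimate moves from roughly 0.42 to roughly 0.46, because the male heterozygous calls are dropped and the male homozygous calls are counted once instead of twice.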

However, it shouldn't really matter in practice.  There should be an obvious separation between the male and female F-statistic clusters.  Your mistake was sticking to the default 0.2 threshold chosen in ~2007, which we now know is inappropriate for many datasets; this is discussed in the plink 1.9 --check-sex documentation.  (To address this, when --check-sex is implemented in plink 2.0, there will be no default thresholds at all; you will be forced to provide them.)
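
The borderline case from the original post can be sketched like this, assuming the documented rule that an F estimate below the female maximum gives a female call, one above the male minimum gives a male call, and anything in between is left ambiguous. The function is hypothetical, uses the .fam sex codes (1 = male, 2 = female, 0 = unknown), and the alternative thresholds in the last lines are illustrative only; real ones should come from inspecting the F distribution of the dataset.

```python
def call_sex(f_stat, female_max=0.2, male_min=0.8):
    """Sex call from an X-chromosome F estimate (illustrative sketch)."""
    if f_stat < female_max:
        return 2  # female
    if f_stat > male_min:
        return 1  # male
    return 0      # ambiguous


# F values reported in the original post for the same sample
pedsex = call_sex(0.2005)  # --impute-sex run: just above 0.2, left ambiguous
snpsex = call_sex(0.1986)  # later --check-sex run: just below 0.2, female
print(pedsex, snpsex)

# With a threshold placed inside the gap between clusters (hypothetical 0.5),
# the same small shift in F no longer changes the call.
stable_before = call_sex(0.2005, female_max=0.5, male_min=0.8)
stable_after = call_sex(0.1986, female_max=0.5, male_min=0.8)
print(stable_before, stable_after)
```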