(with the same female-max/male-min parameters) should it be the case that I would get
'PROBLEM' rows of the PEDSEX != SNPSEX kind? I was assuming so (but maybe I'm wrong), since that would seem to mean PLINK is disagreeing with the imputation it just made.
Or reworded: should the F-stat calculation used by --check-sex/--impute-sex be in any way dependent on the current sex specified in the FAM file? Again, I think not, but maybe I'm wrong.
Details: my female-max parameter was the default of 0.2. On closer inspection, I have cases where the --impute-sex step calculated an F statistic of 0.2005 (indeterminate), but the immediately following --check-sex on the newly created fileset calculated an F statistic of 0.1986 (female), resulting in a 'PROBLEM' report.
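To make the borderline case concrete, here is a minimal sketch of the thresholding as I understand it from the documentation (F below female-max is called female, F above male-min is called male, anything in between is left indeterminate). The function name and the strictness of the comparisons are my own assumptions, not PLINK code.

```python
def call_sex(f_stat, female_max=0.2, male_min=0.8):
    """Return a .fam-style sex code: 2 = female, 1 = male, 0 = indeterminate.

    Assumption: strict comparisons against both thresholds.
    """
    if f_stat < female_max:
        return 2
    if f_stat > male_min:
        return 1
    return 0


# The two F values I saw for the same sample:
print(call_sex(0.2005))  # F from --impute-sex
print(call_sex(0.1986))  # F from the follow-up --check-sex
```

A shift of about 0.002 in F is enough to move a sample across the 0.2 boundary, which is all it takes to produce the mismatch.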
I don't pretend to understand the math behind the F-statistic calculation, but the documentation makes clear that it is based on allele frequencies, which you can supply (and so freeze) with --read-freq. Sure enough, when I supply the same *.frqx file to both the before and after (sex-imputation) filesets, the discrepancy goes away. So the frequency data must differ between the two filesets.
Running --freqx on both confirms this: the results are quite different. The last few columns reference 'haploid' and 'male X chromosome' counts, so I certainly would not expect those to match, since the SEX column in the FAM file was just updated. But I also would not expect --impute-sex/--check-sex to use those columns. Yet even the earlier columns (C(HOM A1), C(HET), C(HOM A2)) have quite different contents before and after the --impute-sex step, which makes me think those also depend on the SEX values in the FAM file. That would create a situation where the act of inferring the sex can change the answer if you ask again (with --check-sex).
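As a rough illustration of how much the frequency itself can move, the sketch below computes an A1 frequency from the .frqx count columns for the exm1625831 row shown in the outputs further down. The weighting (diploid genotypes contribute two alleles, haploid genotypes one, missing calls none) is my assumption about how the counts combine, and a1_freq is a hypothetical helper, not a PLINK function.

```python
def a1_freq(hom_a1, het, hom_a2, hap_a1, hap_a2):
    """A1 frequency from .frqx counts (assumed weighting: diploid = 2, haploid = 1)."""
    a1_copies = 2 * hom_a1 + het + hap_a1
    total_alleles = 2 * (hom_a1 + het + hom_a2) + hap_a1 + hap_a2
    return a1_copies / total_alleles


# exm1625831 before --impute-sex: every sample is counted as diploid.
before = a1_freq(363, 629, 4151, 0, 0)
# Same variant after --impute-sex: males now appear in the haploid columns.
after = a1_freq(52, 624, 2033, 311, 2118)

print(f"before={before:.5f} after={after:.5f} diff={after - before:+.5f}")
```

Under that assumption the frequency for this variant changes by less than 0.001, which seems small but is plausibly enough to nudge a borderline F statistic.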
More details: prior to doing this, I had run --split-x and --indep-pairwise.... to get just a pruned chrX to work with. After the --impute-sex step, I do see the "het. haploid genotypes present" warning, which makes sense, but I would expect those genotypes not to matter, since I would expect --check-sex/--impute-sex to make a calculation that is blind to the current SEX status. Or maybe I'm fundamentally misunderstanding things, and I really need to capture the --freqx output from before the imputation if I want things to match?
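My rough understanding, which may well be wrong, is that the F statistic is a method-of-moments inbreeding estimate of the form (observed homozygotes - expected homozygotes) / (called genotypes - expected homozygotes), where the expectation comes from the allele frequencies. The sketch below is a simplified version under that assumption; PLINK's actual calculation may apply corrections I am not reproducing, and the sample genotypes and frequencies are made up purely for illustration.

```python
def inbreeding_f(genotypes, a1_freqs):
    """Simplified method-of-moments F for one sample.

    genotypes: A1 copy count per variant (0, 1 or 2), or None if missing.
    a1_freqs:  A1 allele frequency per variant, in the same order.
    """
    observed_hom = 0
    expected_hom = 0.0
    n_called = 0
    for g, p in zip(genotypes, a1_freqs):
        if g is None:
            continue  # missing calls contribute nothing
        n_called += 1
        if g != 1:
            observed_hom += 1
        # Expected homozygosity under Hardy-Weinberg: 1 - 2pq
        expected_hom += 1.0 - 2.0 * p * (1.0 - p)
    return (observed_hom - expected_hom) / (n_called - expected_hom)


# Hypothetical sample: identical genotypes, two slightly different frequency tables.
genos = [0, 0, 1, 2, 0, 1, 0, None, 0, 0]
f_a = inbreeding_f(genos, [0.13] * len(genos))
f_b = inbreeding_f(genos, [0.14] * len(genos))
print(f_a, f_b)
```

If something like this is what happens internally, the same genotypes give a different F whenever the frequency table changes, which would explain why fixing the frequencies with --read-freq makes the two runs agree.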
Below are logs of PLINK runs interspersed with just the head of the *.frqx file from before and after doing a --impute-sex operation, since I assume the difference in the *.frqx results underpins my question.
Thanks for any info and thanks for PLINK(2)!
PLINK v1.90b6.26 64-bit (2 Apr 2022) www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to SICtestXPruned.log.
Options in effect:
--bfile SICtestXPruned
--freqx
--out SICtestXPruned
515437 MB RAM detected; reserving 257718 MB for main workspace.
30774 variants loaded from .bim file.
5143 people (0 males, 0 females, 5143 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SICtestXPruned.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5143 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998911.
--freqx: Allele frequencies (founders only) written to SICtestXPruned.frqx .
CHR SNP A1 A2 C(HOM A1) C(HET) C(HOM A2) C(HAP A1) C(HAP A2) C(MISSING)
23 rs2306737 G A 842 2663 1633 0 0 5
23 exm1625831 A G 363 629 4151 0 0 0
23 rs5939320 G A 1061 1193 2887 0 0 2
23 JHU_X.2701184 C T 323 568 4247 0 0 5
23 rs113564299 CA C 625 795 3722 0 0 1
23 JHU_X.2701482 G A 0 8 5135 0 0 0
23 JHU_X.2701698 C A 4 18 5118 0 0 3
23 rs111595179 C T 142 2597 2387 0 0 17
23 rs1419931 A G 908 1122 3112 0 0 1
PLINK v1.90b6.26 64-bit (2 Apr 2022) www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to SICtestXPrunedSI.log.
Options in effect:
--bfile SICtestXPruned
--impute-sex 0.2 0.8
--keep-allele-order
--make-bed
--out SICtestXPrunedSI
515437 MB RAM detected; reserving 257718 MB for main workspace.
30774 variants loaded from .bim file.
5143 people (0 males, 0 females, 5143 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SICtestXPrunedSI.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5143 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998911.
30774 variants and 5143 people pass filters and QC.
Note: No phenotypes present.
--impute-sex: 28606 Xchr and 0 Ychr variant(s) scanned, 5133/5143 sexes
imputed. Report written to SICtestXPrunedSI.sexcheck .
--make-bed to SICtestXPrunedSI.bed + SICtestXPrunedSI.bim +
SICtestXPrunedSI.fam ... done.
PLINK v1.90b6.26 64-bit (2 Apr 2022) www.cog-genomics.org/plink/1.9/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to SICtestXPrunedSI.log.
Options in effect:
--bfile SICtestXPrunedSI
--freqx
--out SICtestXPrunedSI
515437 MB RAM detected; reserving 257718 MB for main workspace.
30774 variants loaded from .bim file.
5143 people (2434 males, 2699 females, 10 ambiguous) loaded from .fam.
Ambiguous sex IDs written to SICtestXPrunedSI.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5143 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 313802 het. haploid genotypes present (see SICtestXPrunedSI.hh ); many
commands treat these as missing.
Total genotyping rate is 0.998911.
--freqx: Allele frequencies (founders only) written to SICtestXPrunedSI.frqx .
CHR SNP A1 A2 C(HOM A1) C(HET) C(HOM A2) C(HAP A1) C(HAP A2) C(MISSING)
23 rs2306737 G A 840 1309 560 2 1073 1359
23 exm1625831 A G 52 624 2033 311 2118 5
23 rs5939320 G A 286 1188 1234 775 1653 7
23 JHU_X.2701184 C T 38 564 2102 285 2145 9
23 rs113564299 CA C 162 792 1755 463 1967 4
23 JHU_X.2701482 G A 0 6 2703 0 2432 2
23 JHU_X.2701698 C A 0 18 2691 4 2427 3
23 rs111595179 C T 14 309 2385 128 2 2305
23 rs1419931 A G 235 1116 1358 673 1754 7