Check-sex .hh warning


Rocio Mariana

Apr 21, 2019, 2:12:38 PM
to plink2-users
Hi, I have an issue. When I LD-pruned my data before running --check-sex, to check that the recorded sex of my samples is correct, I got these warnings.


plink1.9 --bfile Bed --extract Bed_ld.prune.in --make-bed --out Bed_LD

Ambiguous sex ID written to Bed_LD.nosex .
--extract: 379174 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 667 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 48318 het. haploid genotypes present (see Bed_LD.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
Total genotyping rate is 0.993207.


Then, when I ran --check-sex, I still got the .hh file as output. So I ran --split-x to see if that would resolve the issue, and the output was:

Ambiguous sex ID written to prueba_split.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 667 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 48318 het. haploid genotypes present (see prueba_split.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
Total genotyping rate is 0.993207.
379174 variants and 667 people pass filters and QC.
Note: No phenotypes present.
Error: --split-x cannot be used when the dataset already contains an XY region.

I asked about this before, and it means that my data does not need --split-x. So I set F thresholds for my data and ran --check-sex 0.6 0.9:

Ambiguous sex ID written to Bed_sexcheck_X.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 667 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 48318 het. haploid genotypes present (see Bed_sexcheck_X.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
Total genotyping rate is 0.993207.
379174 variants and 667 people pass filters and QC.
Note: No phenotypes present.
--check-sex: 7124 Xchr and 0 Ychr variant(s) scanned, 26 problems detected.

Then I ran --check-sex y-only, and I don't understand why so many IIDs are flagged as problems:

379174 variants loaded from .bim file.
667 people (429 males, 237 females, 1 ambiguous) loaded from .fam.
Ambiguous sex ID written to Bed_sexcheck_Y.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 667 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 48318 het. haploid genotypes present (see Bed_sexcheck_Y.hh ); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands treat these as missing.
Total genotyping rate is 0.993207.
379174 variants and 667 people pass filters and QC.
Note: No phenotypes present.
--check-sex: 0 Xchr and 45 Ychr variant(s) scanned, 238 problems detected.

I ran --set-hh-missing after excluding all the variants listed in the .hh file, and the --check-sex y-only analysis no longer reports problems, except for one IID that had no sex defined before the analysis. I got many more problem detections with the plain --check-sex analysis.

I apologize for asking so much, but I want to know whether this is right. I know I could just delete the 26 IDs that have problems, but I want to know if that is correct. My sample is small, close to 600 people, so every individual will affect the MAF in the subsequent case-control analysis.

Thanks!

Christopher Chang

Apr 21, 2019, 5:48:40 PM
to plink2-users
“--check-sex y-only” is worthless after --set-hh-missing, and is generally just for recovering sex calls you already trust after the data has gone through an intermediate format (e.g. VCF) which doesn’t have a well-supported standard way of representing them.

Meanwhile, please read the online documentation on how your data should be preprocessed before running --check-sex; you should be able to eliminate most of the problems by following it (remember to recalibrate your --check-sex thresholds before your final run). It’s fine to set the lower threshold above 0.6; I did so even for 1000 Genomes data.

Rocio Mariana

Apr 21, 2019, 7:19:12 PM
to plink2-users

checksex_X.png

So, I ran --split-x and the output was "Error: --split-x cannot be used when the dataset already contains an XY region." I ran --freqx first, then --indep-pairphase 20000 2000 0.5, then --check-sex with --read-freq, and I made a histogram. Then I ran --check-sex again with thresholds 0.66 and 0.88, and I still get 25 IDs with problems in the --check-sex analysis. Is the problem in my samples, or am I still missing something?

Christopher Chang

Apr 21, 2019, 8:54:33 PM
to plink2-users
You are supposed to choose the thresholds based on your data. You haven’t done that.
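
One way to look at the data before choosing thresholds is a plain-text histogram of the F column from the .sexcheck output. The sketch below is only an illustration: the function name `f_histogram`, the bin width, and the sample values are assumptions, not part of PLINK or of this thread's data. The idea is to find the empty region between the low-F (female) cluster and the high-F (male) cluster and place the two --check-sex thresholds inside it.

```python
import math
from collections import Counter


def f_histogram(f_values, bin_width=0.1):
    """Count F values per bin; keys are integer bin indices (floor(F / bin_width))."""
    counts = Counter(math.floor(f / bin_width) for f in f_values)
    return dict(sorted(counts.items()))


def print_histogram(hist, bin_width=0.1):
    """Print one text bar per non-empty bin, labelled with its lower and upper edge."""
    for index, n in hist.items():
        low, high = index * bin_width, (index + 1) * bin_width
        print(f"{low:6.2f} to {high:6.2f} | {'#' * n} {n}")


# Made-up F values for illustration only (not taken from the poster's dataset).
example_f = [-0.05, 0.02, 0.46, 0.97, 0.99]
print_histogram(f_histogram(example_f))
```

With real data the list would hold one F value per sample, read from the .sexcheck file, and the thresholds would be picked by eye from the gap between the two clusters.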

Christopher Chang

Apr 21, 2019, 9:18:49 PM
to plink2-users
More precisely, have you looked at what the F values and sex calls in the last set of --check-sex problems are? Do that. If it’s a mix of female -> male and male -> female miscalls, with F values all mixed up, that’s really weird. Otherwise, fix your thresholds.

Rocio Mariana

Apr 21, 2019, 9:52:58 PM
to plink2-users
I have, and it is mixed. 

FID      IID      PEDSEX  SNPSEX  STATUS          F
CTR015   CTR015        1       2  PROBLEM  -0.05316
EN185    EN185         1       2  PROBLEM  -0.06334
EN189    EN189         2       0  PROBLEM    0.4609
EN197    EN197         1       0  PROBLEM     0.485
GC068    GC068         2       0  PROBLEM    0.2357
GC170    GC170         2       1  PROBLEM    0.9723
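
The rows above can be grouped by direction of mismatch (recorded sex versus inferred sex) to see whether the miscalls really are mixed. This is a minimal sketch, assuming the usual .sexcheck column order (FID, IID, PEDSEX, SNPSEX, STATUS, F) and the PLINK sex coding 1 = male, 2 = female, 0 = unknown; the function name `problems_by_direction` is made up for this example.

```python
from collections import defaultdict

# The six PROBLEM rows quoted above, under the standard .sexcheck column header.
SEXCHECK_TEXT = """\
FID IID PEDSEX SNPSEX STATUS F
CTR015 CTR015 1 2 PROBLEM -0.05316
EN185 EN185 1 2 PROBLEM -0.06334
EN189 EN189 2 0 PROBLEM 0.4609
EN197 EN197 1 0 PROBLEM 0.485
GC068 GC068 2 0 PROBLEM 0.2357
GC170 GC170 2 1 PROBLEM 0.9723
"""

SEX = {"0": "unknown", "1": "male", "2": "female"}


def problems_by_direction(text):
    """Group PROBLEM rows by (recorded sex, inferred sex), keeping IID and F."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    groups = defaultdict(list)
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        if row["STATUS"] != "PROBLEM":
            continue
        key = (SEX[row["PEDSEX"]], SEX[row["SNPSEX"]])
        groups[key].append((row["IID"], float(row["F"])))
    return dict(groups)


for (recorded, inferred), members in sorted(problems_by_direction(SEXCHECK_TEXT).items()):
    print(f"recorded {recorded} -> inferred {inferred}: {members}")
```

An inferred sex of "unknown" (SNPSEX 0) means the F value fell between the two thresholds in effect for that run, so those rows are ambiguous calls rather than outright contradictions of the recorded sex.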


Christopher Chang

Apr 21, 2019, 11:27:15 PM
to plink2-users
That’s some seriously low-quality data; I’ve usually seen clerical error rates closer to 1%. I would try to figure out why you are getting such a high rate of incorrect sex calls before proceeding further, since there probably are significant sources of error which affect other parts of your data.

Rocio Mariana

Apr 21, 2019, 11:40:04 PM
to plink2-users
Ok. So maybe I'm getting this error because I'm doing this step first, at the start of the quality-control analysis. Maybe I should do this step after the missingness filter.




Rocio Mariana

Apr 22, 2019, 10:01:34 AM
to plink2-users
I did the missingness analysis first and removed 15 individuals with a missing rate > 0.05, then repeated all the steps from before:

1. --indep-pairphase 20000 2000 0.5
2. --freqx
3. --read-freq freq.frqx --check-sex
4. F histogram
5. --check-sex 0.6 0.9
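
The PLINK steps listed above could be scripted roughly as follows. This is a hedged sketch, not the poster's actual commands: the helper name `sexcheck_commands` and the output prefixes are invented for illustration, and passing the pruned variant list to --check-sex via --extract is an assumption based on the --extract command in the first post. The function only builds the command lines; it does not run PLINK.

```python
def sexcheck_commands(bfile, out, female_max=None, male_min=None, plink="plink1.9"):
    """Build PLINK 1.9 command lines for LD pruning, --freqx and --check-sex."""
    prune = f"{out}_prune"  # hypothetical output prefix for the pruning step
    freq = f"{out}_freq"    # hypothetical output prefix for the frequency step
    commands = [
        # Step 1: LD pruning; writes <prune>.prune.in and <prune>.prune.out
        [plink, "--bfile", bfile, "--indep-pairphase", "20000", "2000", "0.5",
         "--out", prune],
        # Step 2: allele frequency report; writes <freq>.frqx
        [plink, "--bfile", bfile, "--freqx", "--out", freq],
    ]
    # Steps 3 and 5: --check-sex on the pruned variants, reusing the frequencies.
    check = [plink, "--bfile", bfile, "--extract", f"{prune}.prune.in",
             "--read-freq", f"{freq}.frqx", "--check-sex"]
    if female_max is not None and male_min is not None:
        # Thresholds chosen after inspecting the F histogram (step 4).
        check += [str(female_max), str(male_min)]
    check += ["--out", f"{out}_sexcheck"]
    commands.append(check)
    return commands


for command in sexcheck_commands("Bed", "qc", 0.6, 0.9):
    print(" ".join(command))
```

Calling the function without thresholds gives the first --check-sex pass (step 3); calling it again with the values read off the histogram gives the final pass (step 5).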

And I still have 13 individuals with problems, like the ones I showed you before. At this point I'm guessing the best thing to do is to remove them from the analysis, but do you know why this is happening?



Rocio Mariana

Apr 22, 2019, 1:38:38 PM
to plink2-users
I forgot to say that when I ran --check-sex y-only, I got 235 problems.

I have problems with all the women, which is expected, but most of them have this kind of output:
FID IID PEDSEX SNPSEX STATUS YCOUNT
   IID1    IID1            2            1      PROBLEM       15

Should I just stay with the --check-sex problems? If I grep for the 13 samples flagged by --check-sex in the --check-sex y-only output, only 5 of them are flagged as problems there.


Christopher Chang

Apr 22, 2019, 5:31:16 PM
to plink2-users
1. You misunderstood my last comment. I was saying that you appear to have low-quality *phenotype* information, and you should look into this more. You may want to throw out ALL the samples that came from whatever source is responsible for most of these misrecorded sexes, if one source has a much higher error rate than the rest.

2. I already mentioned earlier that “--check-sex y-only” does not appear to be relevant to your use case.
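
To follow up on point 1, the PROBLEM rate can be tabulated per sample source. The thread does not say how sources are recorded, so the sketch below makes a loud assumption: the leading letters of the IID (CTR, EN, GC in the rows posted earlier) stand in for the source. If the real source is recorded elsewhere, a different `source_of` function can be passed in. The function name and the toy input are hypothetical.

```python
import re
from collections import Counter


def iid_prefix(iid):
    """Assumption: the leading letters of an IID identify where the sample came from."""
    return re.match(r"[A-Za-z]*", iid).group(0)


def problem_rate_by_source(rows, source_of=iid_prefix):
    """rows: iterable of (iid, status). Returns {source: (problems, total, rate)}."""
    totals, problems = Counter(), Counter()
    for iid, status in rows:
        source = source_of(iid)
        totals[source] += 1
        if status == "PROBLEM":
            problems[source] += 1
    return {s: (problems[s], n, problems[s] / n) for s, n in totals.items()}


# Toy input: three PROBLEM IIDs from the thread plus invented OK rows for illustration.
toy_rows = [
    ("CTR001", "OK"), ("CTR015", "PROBLEM"),
    ("EN185", "PROBLEM"), ("EN189", "PROBLEM"), ("EN200", "OK"), ("EN201", "OK"),
    ("GC001", "OK"),
]
for source, (bad, total, rate) in sorted(problem_rate_by_source(toy_rows).items()):
    print(f"{source}: {bad}/{total} problems ({rate:.1%})")
```

On a real .sexcheck file, a source whose rate stands far above the others would be the one to investigate or drop as a whole.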