extract male ChrY data from UKB's ChrXY


jie huang

Jun 25, 2024, 10:31:24 AM
to plink2-users

Dear Chris:

I am trying to extract male ChrY data to feed into 23andMe's Yhaplo software (https://github.com/23andMe/yhaplo), which takes VCF-format input.

I used the following PLINK command:  plink2 --pfile chrXY --filter-males --mind 0.02 --geno 0.02 --export vcf id-paste=iid bgz --out chrY

But then I got an error from Yhaplo: ValueError: VCF must include exactly one contig with a label in: ['24', 'Y', 'chrY']. Observed: {'X', 'XY'}

Below are the first 5 lines of the PLINK2-generated VCF file:
##fileformat=VCFv4.3
##fileDate=20240625
##source=PLINKv2.00
##contig=<ID=XY>
##contig=<ID=X>
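
A minimal sketch of how the contig labels declared in such a header could be checked before running Yhaplo. The helper name `contig_ids` is hypothetical and not part of PLINK or Yhaplo; the set of accepted Y labels is copied from the error message above.

```python
import re

def contig_ids(header_lines):
    """Return the IDs declared in VCF ##contig header lines, in file order."""
    pattern = re.compile(r"^##contig=<ID=([^,>]+)")
    ids = []
    for line in header_lines:
        match = pattern.match(line)
        if match:
            ids.append(match.group(1))
    return ids

# The five header lines quoted in this post.
header = [
    "##fileformat=VCFv4.3",
    "##fileDate=20240625",
    "##source=PLINKv2.00",
    "##contig=<ID=XY>",
    "##contig=<ID=X>",
]

# Labels Yhaplo reports as acceptable in its error message.
y_labels = {"24", "Y", "chrY"}

declared = contig_ids(header)
has_y_contig = any(label in y_labels for label in declared)
```

Here `declared` holds only the XY and X labels and `has_y_contig` is false, which matches the complaint in the Yhaplo error.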


I then removed the 4th line (##contig=<ID=XY>), changed the 5th line to ##contig=<ID=Y>, and recreated the ChrY.vcf.gz file. Yhaplo seems to run this time.

I just want to confirm whether the above steps are a correct way to extract male ChrY data.

Thank you & best regards,
Jie

Christopher Chang

Jul 2, 2024, 12:55:46 PM
to plink2-users
As noted in e.g. the --chr documentation, the "XY" chromosome code in a plink-formatted fileset is an old way to refer to the pseudoautosomal region, which is shared between chrX and chrY.

Because this sequence is present at the ends of chrY, changing its chromosome code to Y is fine for some purposes.  However, the POS values are relative to chrX, so you should convert them to be relative to chrY.  Specifically:
- If your data is based on the GRCh38 reference genome, the first part of the pseudoautosomal region has identical coordinates on chrX and chrY, but the second part is [155701383, 156030895) on chrX and [56887903, 57217415) on chrY, so you should subtract 98813480 from the later POS values.
- Refer to the linked Wikipedia article for the GRCh37 coordinates.
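
The GRCh38 adjustment described above can be sketched as follows. This is only an illustration: the function and constant names are hypothetical, the interval bounds are the ones quoted in this reply, and the input is assumed to be restricted to variants already coded as XY (pseudoautosomal), since any position outside the second interval is returned unchanged.

```python
# GRCh38 bounds of the second pseudoautosomal region, as quoted above
# (half-open intervals: start included, end excluded).
PAR2_X_START = 155701383
PAR2_X_END = 156030895
PAR2_Y_START = 56887903

# Offset between the chrX and chrY coordinates of that region.
PAR2_OFFSET = PAR2_X_START - PAR2_Y_START

def par_pos_x_to_y_grch38(pos):
    """Convert a chrX-relative pseudoautosomal POS to a chrY-relative one.

    Assumes pos belongs to a pseudoautosomal (XY-coded) variant on GRCh38.
    Positions in the first region are identical on chrX and chrY, so only
    positions in the second region are shifted.
    """
    if PAR2_X_START <= pos < PAR2_X_END:
        return pos - PAR2_OFFSET
    return pos
```

The offset works out to the 98813480 mentioned above. GRCh37 data would need different constants, which are not given in this thread.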

Meanwhile, most of chrY lies outside the shared pseudoautosomal region.  It sounds like your main job is to find that data.

jie huang

Jul 4, 2024, 6:54:58 PM
to plink2-users

Dear Chris:

I am working on the UK Biobank genotyped dataset, which is based on GRCh37 coordinates. 

Please see the screenshot below. A total of 39,431 SNPs remain after I run plink2 --pfile chrXY --filter-males --mind 0.02 --geno 0.02.
The 36,581st SNP is the breakpoint between part 1 and part 2, that is, the 3rd row of the following screenshot.

[Screenshot: 2024-07-05 064705.png]

Based on your explanation, it seems that all 39,431 SNPs belong to the pseudoautosomal region, so there are no real ChrY SNPs in this dataset.
Therefore, I cannot use this extracted dataset for Y-chromosome-based phylogenetic analysis, correct?

Thank you very much & best regards,
JIE
