recode vcf and sex chromosomes

freeseek

May 6, 2014, 11:54:23 AM
to plink2...@googlegroups.com
When using "--recode vcf-iid", chromosomes 23, 24, and 26 get encoded with numbers rather than X, Y, and MT. The resulting VCF files comply with the VCF standard, but they are not consistent with typical human genome VCF files. I am not sure what the best solution would be. It is also worth noting that with hg38/GRCh38 the chromosome names will converge to chr1, chr2, etc. rather than 1, 2, etc., so it might help to be able to use this alternative nomenclature as well.

Christopher Chang

May 6, 2014, 12:46:17 PM
to plink2...@googlegroups.com
This issue is not specific to VCF.  In fact, it theoretically affects every output file that currently contains chromosome IDs.

Fortunately, most of the infrastructural work needed to support "X"/"chr1"/"chrX" output was done when support for arbitrary contig names was added, so it's now largely down to choosing an appropriate interface.  How about this:

--chr-prefix will cause all numeric (or X/Y/XY/MT) chromosome codes to be preceded by "chr" in output files.
--chr-alphacode will cause X/Y/XY/MT chromosome codes to be represented alphabetically instead of numerically.  So --chr-prefix + --chr-alphacode leads to "chrX", and --chr-alphacode by itself leads to "X".  (If you can think of a more appropriate flag name than "alphacode", I'm all ears.)

freeseek

May 6, 2014, 1:16:56 PM
to plink2...@googlegroups.com
On Tuesday, May 6, 2014 12:46:17 PM UTC-4, Christopher Chang wrote:
> --chr-prefix will cause all numeric (or X/Y/XY/MT) chromosome codes to be preceded by "chr" in output files.
> --chr-alphacode will cause X/Y/XY/MT chromosome codes to be represented alphabetically instead of numerically.  So --chr-prefix + --chr-alphacode leads to "chrX", and --chr-alphacode by itself leads to "X".  (If you can think of a more appropriate flag name than "alphacode", I'm all ears.)

I am a little opposed. Do notice that hg19 uses chrM while b37 uses MT, and hg38/grch38 should converge to chrM, so it is not as clean as you would imagine at first. An alternative suggestion would be to use --chr-encode [build code] in a "--split-x" fashion. You would need to include a few dictionaries for different builds (two for humans) and it could be easily extended to other species. You could then also extend this capability for people that want to use their own dictionary provided as a file (for example if someone really wants to use the unplaced contigs he might be interested in using the numbers after 26 for those).

Christopher Chang

May 6, 2014, 5:43:19 PM
to plink2...@googlegroups.com
Oh, I was not aware of the chrM vs. MT issue.  Hmm.

User-specified dictionaries seem like overkill here--it's almost easier to write a shell script for that case, since most of the files involved have chromosome codes in the first column.  The --chr-encode idea sounds reasonable, though; I just wonder if support for plain "M" and/or "MT" without the "chr" prefix is also important.
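For the shell-script route mentioned above, a minimal sketch could look like the following. It assumes tab-delimited files with the chromosome code in the first column and a two-column, tab-delimited dictionary file (old code, new code); the function name remap_chr and the dictionary format are hypothetical, not part of PLINK.

```shell
#!/bin/sh
# Hypothetical helper (not a PLINK feature): rewrite the chromosome code in
# column 1 of a tab-delimited stream, using a two-column dictionary file
# (old code <TAB> new code). Lines starting with '#' pass through unchanged,
# and codes missing from the dictionary are kept as they are.
remap_chr() {
    dict=$1
    awk -v OFS='\t' '
        BEGIN       { FS = "\t" }
        NR == FNR   { map[$1] = $2; next }   # first input: load the dictionary
        /^#/        { print; next }          # header/comment lines: untouched
        ($1 in map) { $1 = map[$1] }         # rename codes found in the dictionary
                    { print }
    ' "$dict" -
}
```

Usage would be along the lines of `remap_chr dict.tsv < input.tsv > output.tsv`. Note that this only touches the first column of data lines, so chromosome names inside VCF header lines (for example contig declarations) would still need separate handling.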

Christopher Chang

May 8, 2014, 11:22:51 AM
to plink2...@googlegroups.com
The 8 May development build has an --output-chr flag, to which you provide your desired human mitochondrial code.  E.g.

"--output-chr 26" has no effect
"--output-chr chr26" adds chr prefixes and does nothing else
"--output-chr M" has no effect on autosomes but uses X/Y/XY/M coding for 23-26.
(the other three options are "MT", "chrM", and "chrMT", as you'd expect)

I ended up going with "--output-chr" instead of "--chr-encode" to make it a little bit clearer that there's no need to use this just to properly read differently-encoded chromosome IDs.  I decided against build codes since some of the MT -> M switchover actually happened before build 38.
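
To make the six options listed above concrete, here is a small shell sketch of the mapping from a numeric human chromosome code (1-26) to its output label. It is an illustration of the behaviour described in this message only, not PLINK's implementation; the function name chr_label is hypothetical. The flag itself is spelled --output-chr in later messages in this thread.

```shell
#!/bin/sh
# Illustration only: mimics the mapping described in the message above.
# It is not taken from PLINK's source code.
chr_label() {
    code=$1   # numeric human chromosome code, 1-26
    mt=$2     # desired MT code: 26, chr26, M, MT, chrM or chrMT
    # A leading "chr" on the MT code means every label gets the prefix.
    case $mt in
        chr*) prefix=chr; mt=${mt#chr} ;;
        *)    prefix= ;;
    esac
    if [ "$mt" = 26 ]; then
        name=$code                  # numeric coding is left alone
    else
        case $code in
            23) name=X ;;
            24) name=Y ;;
            25) name=XY ;;
            26) name=$mt ;;         # M or MT, as requested
            *)  name=$code ;;       # autosomes stay numeric
        esac
    fi
    printf '%s%s\n' "$prefix" "$name"
}
```

For example, `chr_label 23 chrM` prints chrX and `chr_label 26 MT` prints MT, while `chr_label 25 M` prints XY.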


In other news, the beta 1 build is scheduled for 11 May, so any bug reports that arrive in the next three days are especially valuable.

freeseek

Aug 8, 2016, 5:23:36 PM
to plink2-users
I am trying to set up an imputation pipeline with plink/eagle/minimac3.

I need to recode plink files as VCF files and I am using the following command:
plink --recode vcf-iid bgz --output-chr MT

However, chromosome 25 gets encoded as XY in the VCF file. This does not help, as it is not the way GRCh37 likes things to be encoded.

The same problem happens when using the following command:
plink --recode vcf-iid bgz --output-chr chrM

Chromosome 25 gets encoded as chrXY in the VCF file, which is not the way GRCh38 encodes the pseudo-autosomal regions.

Is this really the desired behavior? I cannot guess a proper way to generate a correct VCF file, short of using the "--merge-x" option.

I think "--output-chr MT" should cause "25 -> X" and "--output-chr chrM" should cause "25 -> chrX".

Also, I have noticed that the option "--merge-x no-fail" causes the output "Error: --merge-x doesn't accept parameters.", which is definitely not the desired behavior.

Maybe two easy pigeon bugs with one stone here.

Christopher Chang

Aug 8, 2016, 8:52:25 PM
to plink2-users
Yes, --merge-x no-fail's guaranteed failure is obviously a bug; this will be fixed in today's development build.

I'm very wary of mapping 25 to X/chrX, though, because that produces an unsorted X chromosome.  Maybe I'll add a warning recommending the use of --merge-x when a VCF containing some pseudoautosomal region data is being exported.

freeseek

Aug 9, 2016, 2:11:35 AM
to plink2-users
Yes, that makes sense. I just wish there were a single command to export the pseudo-autosomal regions, as "--merge-x no-fail" cannot be combined with "--recode vcf-iid bgz" in one run.
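
In the meantime, the two-step workaround discussed above can be wrapped in a small shell function. This is a sketch under a few assumptions: a PLINK 1.9 build recent enough that "--merge-x no-fail" is accepted, a binary fileset as input, and placeholder file prefixes. The function name export_vcf_merged_x and the PLINK variable are hypothetical conveniences.

```shell
#!/bin/sh
# Sketch of the two-step export; file prefixes are placeholders.
# PLINK can be overridden to point at a specific binary.
PLINK=${PLINK:-plink}

export_vcf_merged_x() {
    src=$1   # input binary fileset prefix (src.bed/.bim/.fam)
    dst=$2   # output prefix for the VCF
    # Step 1: merge the XY (code 25) variants back into X, writing a new
    # binary fileset; no-fail asks PLINK to proceed even if nothing merges.
    "$PLINK" --bfile "$src" --merge-x no-fail --make-bed --out "$dst.merged" || return 1
    # Step 2: export the merged fileset as a bgzipped VCF with X/Y/MT names.
    "$PLINK" --bfile "$dst.merged" --recode vcf-iid bgz --output-chr MT --out "$dst"
}
```

A call such as `export_vcf_merged_x mydata mydata_vcf` would leave an intermediate mydata_vcf.merged fileset behind, which can be deleted once the VCF has been written.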

Sander W. van der Laan

Nov 8, 2017, 5:21:58 PM
to plink2-users
Aha!

I didn't understand the explanation here: https://www.cog-genomics.org/plink/1.9/data#irreg_output. After reading this thread I do understand it now, but the behaviour is not immediately clear from the description. May I suggest the following:


--output-chr [MT code]

Normally, autosomal/sex/mitochondrial chromosome codes in PLINK output files are numeric, e.g. '23' for human X. The  --output-chr flag lets you specify a different coding scheme by providing the desired human mitochondrial code. The supported options are: '26' (default), 'M', 'MT', '0M', 'chr26', 'chrM', and 'chrMT'. 

(Note: PLINK 1.9 correctly interprets all of these encodings in input files.)

Examples: 

--output-chr 26  --> all chromosomes are numeric, 1-26

--output-chr M  --> all autosomal chromosomes are numeric, then X, XY, Y, M

--output-chr MT  --> all autosomal chromosomes are numeric, then X, XY, Y, MT

--output-chr 0M  --> all autosomal chromosomes are numeric (01-09, 10-22), then 0X, XY, 0Y, 0M

--output-chr chrM  --> all autosomal chromosomes are numeric (chr1-chr22), then chrX, chrXY, chrY, chrM

and so on.

This behaviour was not immediately clear to me.

Thanks!

Sander