CHR POS recoded when subsetting samples in vcf

182 views
Skip to first unread message

Just Me

unread,
May 28, 2020, 4:55:59 PM5/28/20
to plink2-users
Hello,

I am trying to subset individuals from an imputed vcf file, but the resulting vcf has altered chr and position. For example, in chr 1, the chr changes to chr 5 part way through the file, then to ":"+pos further down, causing the pos to change to the snp id. There are no errors, but I discovered the issue when tabix failed. If there is anything I can do to correct this, any advice would be appreciated.

Example commands:

 plink2 --vcf original_chr1_dose.vcf.gz dosage=DS \           #### original imputed vcf file
   --keep keep_IDs.txt \                                     #### File with list of IIDs contained in vcf for subsetting
   --export vcf vcf-dosage=DS-force bgz \                    #### output in bgzipped vcf format retaining dosage
   --out subset_chr1_dose                                    #### output prefix

Example switch in chr/pos map:

gunzip -cd subset_dose.vcf.gz | cut -f 1-3 > chr1_chr_pos_subset.txt


##fileformat=VCFv4.3
##fileDate=20200526
##source=PLINKv2.00
## rest of header not posted
#CHROM    POS    ID
1    13305    chr1:13305:T:C
1    15778    chr1:15778:G:A
...
1    976639    chr1:976639:C:A
1    976655    chr1:976655:G:A
1    976669    chr1:976669:T:C
5    976671    chr1:976671:G:A
5    976675    chr1:976675:C:T
...
5    119885467    chr1:119885467:C:T
5    119885474    chr1:119885474:A:T
:0119885518    chr1:119885518:C:A    C
:0119885527    chr1:119885527:T:C    T
:0119885543    chr1:119885543:C:T    C

Christopher Chang

unread,
May 28, 2020, 5:10:16 PM5/28/20
to plink2-users
Can you post the full .log file from this run?

Just Me

unread,
May 28, 2020, 5:23:35 PM5/28/20
to plink2-users
I overwrote the log for chr1, but it happened for the others as well, so here is another log file for chr14 with filenames and paths altered.

PLINK v2.00a2LM 64-bit Intel (21 May 2018)
Options in effect:
  --export vcf vcf-dosage=DS-force bgz
  --keep keep_IDs_Freeze_19May2019.txt
  --memory 30000
  --out subset_chr14_dose
  --threads 6
  --vcf original_dose.vcf.gz dosage=DS

Hostname: omitted
Working directory: /extract_vcfs
Start time: Tue May 26 09:37:58 2020

Random number seed: 1590500278
257768 MiB RAM detected; reserving 30000 MiB for main workspace.
Using up to 6 compute threads.
--vcf: 3574348 variants scanned.
Warning: 14265 multiallelic variants skipped (not yet supported).
--vcf:
subset_chr14_dose-temporary.pgen +
subset_chr14_dose-temporary.pvar +
subset_chr14_dose-temporary.psam
written.
86089 samples (0 females, 0 males, 86089 ambiguous; 86089 founders) loaded from
subset_chr14_dose-temporary.psam.
3574348 variants loaded from
subset_chr14_dose-temporary.pvar.
Note: No phenotype data present.
--keep: 1858 samples remaining.
1858 samples (0 females, 0 males, 1858 ambiguous; 1858 founders) remaining
after main filters.
Warning: '_' present in original sample IDs; --export vcf will not be able to
reconstruct them. Consider rerunning with a suitable --export id-delim= value.
--export vcf bgz to
subset_chr14_dose.vcf.gz ... done.

End time: Tue May 26 11:45:52 2020

Just Me

unread,
May 28, 2020, 5:25:39 PM5/28/20
to plink2-users
Just to illustrate that it happened for chr14:

#CHROM  POS     ID      REF
0       1106777528      chr14:106777528:A:G     A
0       1106777532      chr14:106777532:G:C     G
0       1106777570      chr14:106777570:G:A     G
0       1106777611      chr14:106777611:C:T     C
0       1106777622      chr14:106777622:G:A     G
0       1106777623      chr14:106777623:C:T     C
...                                                                                                                   
14      106775945       chr14:106775945:T:A     T
14      106776073       chr14:106776073:G:T     G
14      106776118       chr14:106776118:G:A     G
14      106776119       chr14:106776119:G:T     G
14      106776182       chr14:106776182:C:T     C
14      106776337       chr14:106776337:G:A     G
14      106776442       chr14:106776442:G:A     G
14      106776448       chr14:106776448:C:T     C
14      106776527       chr14:106776527:C:T     C
14      106776528       chr14:106776528:G:A     G



On Thursday, May 28, 2020 at 4:55:59 PM UTC-4, Just Me wrote:

Christopher Chang

unread,
May 28, 2020, 5:27:24 PM5/28/20
to plink2-users
Can you try rerunning this with a more recent plink2 build?
Message has been deleted

Just Me

unread,
May 28, 2020, 5:37:10 PM5/28/20
to plink2-users
I tried to with the new version but got the below error.

PLINK v2.00a3LM AVX2 Intel (11 May 2020)

Options in effect:
  --export vcf vcf-dosage=DS-force bgz
  --keep keep_IDs_Freeze_19May2019.txt
  --memory 15000
  --out subset_dose
  --threads 6
  --vcf original_chr1_dose.vcf.gz dosage=DS

Hostname: omitted
Working directory: /omitted/extract_vcfs
Start time: Thu May 28 10:22:30 2020

Random number seed: 1590675750
40073 MiB RAM detected; reserving 15000 MiB for main workspace.

Using up to 6 compute threads.
--vcf: 9128437 variants scanned.
Error: GparseWriteByteCt multiallelic dosage size request.

On Thursday, May 28, 2020 at 4:55:59 PM UTC-4, Just Me wrote:

Christopher Chang

unread,
May 28, 2020, 5:43:03 PM5/28/20
to plink2-users
Dosages aren't supported for multiallelic variants yet by plink2.  You'll need to split or filter out the multiallelic variants first with e.g. bcftools.

Just Me

unread,
May 28, 2020, 5:54:29 PM5/28/20
to plink2-users
When I have used the same script for 1KGP3 imputed files, it just skips over the multiallelic variants, and did not produce this same kind of output with the CHR altered (example log below). Could there be something different here that has caused it to alter the chr? The successful subsets were for 1KGP3 in build 37, these current failed ones are in TOPMed imputed build38.

PLINK v2.00a2LM 64-bit Intel (21 May 2018)
Options in effect:
  --export vcf vcf-dosage=DS-force bgz
  --keep AHI_IDs.txt
  --maf 0.0005
  --out AHI_chr10_1kgp3_Rsqfiltered_maf0.0005_dose
  --threads 4
  --vcf /omitted/original_chr10.recode_clean.vcf.gz dosage=DS

Hostname: omitted
Working directory: /omitted/vcf_AHI
Start time: Tue Apr 14 12:22:12 2020

Random number seed: 1586881332
257768 MiB RAM detected; reserving 128884 MiB for main workspace.
Using up to 4 compute threads.
--vcf: 1232567 variants scanned.
Warning: 2862 multiallelic variants skipped (not yet supported).
--vcf: AHI_chr10_1kgp3_Rsqfiltered_maf0.0005_dose-temporary.pgen +
AHI_chr10_1kgp3_Rsqfiltered_maf0.0005_dose-temporary.pvar +
AHI_chr10_1kgp3_Rsqfiltered_maf0.0005_dose-temporary.psam written.
89462 samples (0 females, 0 males, 89462 ambiguous; 89462 founders) loaded from
AHI_chr10_1kgp3_Rsqfiltered_maf0.0005_dose-temporary.psam.
1232567 variants loaded from
AHI_chr10_1kgp3_Rsqfiltered_maf0.0005_dose-temporary.pvar.

Note: No phenotype data present.
--keep: 7034 samples remaining.
7034 samples (0 females, 0 males, 7034 ambiguous; 7034 founders) remaining
after main filters.
Calculating allele frequencies... done.
76233 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
1156334 variants remaining after main filters.

Warning: '_' present in original sample IDs; --export vcf will not be able to
reconstruct them. Consider rerunning with a suitable --export id-delim= value.
--export vcf bgz to AHI_chr10_1kgp3_Rsqfiltered_maf0.0005_dose.vcf.gz ... done.

End time: Tue Apr 14 13:36:42 2020



On Thursday, May 28, 2020 at 4:55:59 PM UTC-4, Just Me wrote:

Christopher Chang

unread,
May 28, 2020, 6:03:25 PM5/28/20
to plink2-users
Well, the relevant question is whether the bug is still in the latest plink2 build.

To find out, it’s necessary to explicitly split or filter out multiallelic variants, because multiallelic variants *are* normally supported now; it’s only your rare multiallelic + dosage use case that is still under construction. Sorry about the inconvenience.

Just Me

unread,
May 28, 2020, 6:21:56 PM5/28/20
to plink2-users
Ahh, no worries. I would like to assist in finding the potential bug and appreciate your help. I'll see if I can figure it out how to remove the multiallelics and follow-up when I am able to retry. I will also try this build38 file with the maf flag to see if that helps. Thanks!


On Thursday, May 28, 2020 at 4:55:59 PM UTC-4, Just Me wrote:

Just Me

unread,
Jun 22, 2020, 11:59:27 PM6/22/20
to plink2-users
The issue does not appear if I either remove multiallelic variants using bcftools first, or don't specify DS option.

Christopher Chang

unread,
Jun 23, 2020, 12:35:43 AM6/23/20
to plink2-users
Thanks for verifying this.
Reply all
Reply to author
Forward
0 new messages