Trouble merging UKB imputed data in plink 2.0

Dorothy Chen

Jul 1, 2022, 1:18:13 PM
to plink2-users
Hi Christopher, 

Hope you're well. I'm relatively new to PLINK and have been having some trouble merging autosomal imputed UKB genetic data using pmerge-list in plink2 (3 Jun 2022 update). Specifically, when attempting the concat job, I'm getting the following error in my log file: 
__________________________

PLINK v2.00a3.3LM 64-bit Intel (3 Jun 2022)
Options in effect:
 --make-pgen
 --memory 30000
 --merge-max-allele-ct 2
 --out <path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.20220627
 --pmerge-list <path_to_file>/merged_autosome_bfile/autosomefiles_pgen.list
  --threads 1

Hostname: <redacted>

Working directory:  <redacted>

Start time: Wed Jun 29 21:09:28 2022

Random number seed: 1656562168

385624 MiB RAM detected; reserving 30000 MiB for main workspace.

Using 1 compute thread.

--pmerge-list: 22 filesets specified.

--pmerge-list: 487409 samples present.

--pmerge-list: Merged .psam written to <path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.20220627-merge.psam

--pmerge-list: 22 .pvar files scanned.

Concatenation job detected.

Concatenating... 93095623/93095623 variants complete.

Results written to

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.20220627-merge.pgen

+

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.20220627-merge.pvar

.

487409 samples (264247 females, 222956 males, 206 ambiguous; 487409 founders)

loaded from

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.20220627-merge.psam.

Error: Line 1 of

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.20220627-merge.pvar

has fewer tokens than expected.

End time: Thu Jun 30 20:28:07 2022

2

________________________

To get to this point, I used the following code to rename the variants and convert bgen-formatted UKB genetic data to plink2 binary pgen files, adapted from here: https://groups.google.com/g/plink2-users/c/i942_CQaBc4/m/TQM-b8cKEQAJ 

________________________

plink2 \
 --bgen $scratchpath/ukb22828_c${c}_b0_v3.bgen ref-first \
 --sample $scratchpath/ukb22828_b0_v3_s487203.sample \
 --make-pgen \
 --memory 10000 \
 --threads 1 \
 --new-id-max-allele-len 100 missing \
 --set-all-var-ids @:#\$r,\$a \
 --out $scratchpath/ukb.renamed.chr${c}.${outdate}

________________________

I checked the headers of individual chromosome files to ensure that they were more or less in line with the description that you wrote up here: https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf

psam file header (missing paternal & maternal IDs, phenotype information): 

#FID    IID     SEX

pvar header 

#CHROM  POS     ID      REF     ALT

-----------

I then submitted a merge job, where each line of the file list names one fileset's .pgen, .pvar, and .psam files, in that order:

________________________

for chr in `seq 1 1 22`
do
        echo "$pgenfiles/ukb.renamed.chr${chr}.${date}.pgen $pgenfiles/ukb.renamed.chr${chr}.${date}.pvar $pgenfiles/ukb.renamed.chr${chr}.${date}.psam" >> $autofile
done

# Merge autosomes list using plink2alpha3.3 (downloaded into home directory)
<path-to-files>/plink2_v2.00_3.3/plink2 \
 --pmerge-list $autofile \
 --make-pgen \
 --merge-max-allele-ct 2 \
 --threads 1 \
 --memory 30000 \
 --out $outpath/ukb.autosomes.renamed.merged.${outdate}

_______________________

The output from the merge job also appears to have the same format for the pvar headers, and each individual pvar chromosome file appears to have the correct number of columns as well. We've had some storage issues on the cluster that have led to some incomplete files, but I believe those issues should have been fixed. Is there an error in the plink code I submitted that is leading to this error?
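One way to rule out a truncated or malformed input is to compare every line of each per-chromosome .pvar against the header's column count. The following is only a rough sketch and not part of plink2: `check_pvar_columns` is a hypothetical helper name, and it assumes whitespace-delimited fields and that the first line not starting with `##` is the `#CHROM` header.

```shell
# Hypothetical helper (not a plink2 feature): report every line of a .pvar
# whose field count differs from the header's field count.
# Assumes whitespace-delimited fields, as in the headers shown above.
check_pvar_columns() {
  awk '
    /^##/ { next }                            # skip meta-information lines
    !seen { expected = NF; seen = 1; next }   # first remaining line is the header
    NF != expected {
      printf "%s:%d has %d fields, expected %d\n", FILENAME, FNR, NF, expected
      bad = 1
    }
    END { exit bad + 0 }                      # non-zero exit if any line mismatched
  ' "$1"
}

# Possible usage over the per-chromosome files (variable names as in the loop above):
#   for chr in $(seq 1 22); do
#     check_pvar_columns "$pgenfiles/ukb.renamed.chr${chr}.${date}.pvar"
#   done
```

A non-zero exit status, or any printed line, would point at a file worth regenerating before retrying the merge.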

I also tried a truncated set of chromosomes (1-3) to see whether the error replicates, but that run completed successfully (log file below).

_______________________

PLINK v2.00a3.3LM 64-bit Intel (3 Jun 2022)

Options in effect:

  --make-pgen

  --memory 30000

  --merge-max-allele-ct 2

  --out <path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630

  --pmerge-list <path_to_file>/merged_autosome_bfile/autosomefiles_pgen_trunc.list

  --threads 1


Hostname: <redacted>

Working directory: <path_to_file>/Genetic_Data_Cleaning

Start time: Thu Jun 30 15:46:15 2022


Random number seed: 1656629175

1031989 MiB RAM detected; reserving 30000 MiB for main workspace.

Using 1 compute thread.

--pmerge-list: 3 filesets specified.

--pmerge-list: 487409 samples present.

--pmerge-list: Merged .psam written to

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630-merge.psam

.

--pmerge-list: 3 .pvar files scanned.

Concatenation job detected.

Concatenating... 22228534/22228534 variants complete.

Results written to

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630-merge.pgen

+

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630-merge.pvar

.

487409 samples (264247 females, 222956 males, 206 ambiguous; 487409 founders)

loaded from

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630-merge.psam.

22228534 variants loaded from

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630-merge.pvar.

Note: No phenotype data present.

Writing

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630.psam

... done.

Writing

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630.pvar

... done.

Writing

<path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.trunc.20220630.pgen

... done.

_______________________

I'm currently running a version of the merge script that includes flags for specifying irregular input files; I'll update on whether that job is successful once it finishes running. Any help would be appreciated! Feel free to also point me to specific documentation - I've read over the plink2 pages but may have missed something. Thanks for your help! 

_______________________

<path-to-software>/plink2_v2.00_3.3/plink2 \
 --no-parents \
 --no-pheno \
 --pmerge-list $autofile \
 --make-pgen \
 --merge-max-allele-ct 2 \
 --threads 1 \
 --memory 30000 \
 --out $outpath/ukb.autosomes.renamed.merged.${outdate}





Christopher Chang

Jul 1, 2022, 1:55:08 PM
to plink2-users
This looks like a bug; I'm guessing there's something involving your irregular input flags which I didn't previously test and is currently mishandled by --pmerge-list.

If you could create a minimal example (e.g. try going down to just 2 filesets, and one variant per file) that I can use to reproduce the error you're seeing, that would be great.

Christopher Chang

Jul 1, 2022, 2:01:49 PM
to plink2-users
Er, ignore the last message unless you run into some irregular-input-specific error.

It's interesting that the merge works for chromosomes 1-3 and only fails for 1-22.  This makes me suspect some write error is not being bubbled up properly.  What's the size of the ukb.autosomes.renamed.merged.20220627-merge.pvar produced by the crashing run?  Is it possible for you to send me just the first 1-2 lines of that file?
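For anyone following along, the two pieces of information requested here (file size and the first couple of lines) could be gathered with something like the sketch below. `pvar_summary` is a hypothetical helper name, and the path in the usage comment is just the redacted one from the log above.

```shell
# Hypothetical helper: print a file's size in bytes, then its first two lines.
pvar_summary() {
  # wc -c may pad its output with whitespace on some systems, so strip it
  printf 'size_bytes=%s\n' "$(wc -c < "$1" | tr -d '[:space:]')"
  head -n 2 "$1"
}

# Possible usage:
#   pvar_summary <path_to_file>/merged_autosome_bfile/ukb.autosomes.renamed.merged.20220627-merge.pvar
```

A size of zero, or a first line that is empty or cut short, would be consistent with the suspected write error.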

Dorothy Chen

Jul 1, 2022, 4:45:19 PM
to plink2-users
Hi Christopher, 

Absolutely - I've had to remove the ukb.autosomes.renamed.merged.20220627-merge files from the cluster due to storage restrictions, but I'll rerun the script that generated the error and send you the file size and the first lines. More soon!