Merging many vcf sequence files


Roger Lee

Nov 25, 2014, 7:45:30 PM
to plink2...@googlegroups.com
Hi,

I have about 200 VCF sequence files (v4.1). Each file represents a single person. I want to merge these 200 files and get summary allele frequencies.

Can I use plink2 to merge 200 files? Should I use a text file listing all the file names? Each file is about 5 MB.

thank you.
chris

Christopher Chang

Nov 25, 2014, 9:21:05 PM
to plink2...@googlegroups.com
Yes, converting each VCF to PLINK binary format, and then using --merge-list to merge all 200 at once, would work.
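
A minimal sketch of the convert-then-merge workflow described above, assuming PLINK 1.9 is available on the PATH as `plink` and that the per-person files match `person*.vcf` in the current directory. The function name, the file pattern, and the `merge_list.txt` / `merged_data` names are illustrative assumptions rather than anything prescribed in this thread.

```shell
# Convert each per-person VCF to a PLINK binary fileset (.bed/.bim/.fam),
# record each output prefix in a merge list, then merge them in one pass.
# The person*.vcf pattern and the output names are hypothetical.
convert_and_merge() {
    : > merge_list.txt                      # start with an empty list
    for vcf in person*.vcf; do
        [ -e "$vcf" ] || continue           # skip if the pattern matched nothing
        prefix="${vcf%.vcf}"                # person1.vcf -> person1
        plink --vcf "$vcf" --make-bed --out "$prefix" || return 1
        echo "$prefix" >> merge_list.txt    # one fileset prefix per line
    done
    plink --merge-list merge_list.txt --make-bed --out merged_data
}

# Example call, run from the directory holding the VCF files:
# convert_and_merge
```

Listing one fileset prefix per line keeps the merge list short; PLINK also accepts lines that name the .bed, .bim, and .fam files explicitly.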

Roger Lee

Nov 26, 2014, 4:20:21 AM
to plink2...@googlegroups.com
Hi,

I tried --merge-list and got the following warning:

Warning: Multiple positions seen for variant '.'.

I do have a bunch of SNPs in the VCF files without names. Their ID is given as '.'.

Can this be resolved? I wasn't able to get a merged file from the output (just a .fam file and a .missnp file).

The exact command I used is plink --merge-list datanames.txt --out merged_data

In datanames.txt, I just have the prefix of each binary fileset, one per line:
dataname1
dataname2
.
.
.
dataname200

Thank you.


 


Christopher Chang

Nov 26, 2014, 4:42:01 AM
to plink2...@googlegroups.com
You need to assign position-based names for your SNPs.  See the --set-missing-var-ids documentation (https://www.cog-genomics.org/plink2/data#set_missing_var_ids ).
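
As a rough sketch of what that could look like for a single file, assuming PLINK 1.9 on the PATH as `plink`: the function name and its two arguments are hypothetical. In the template, `@` stands for the chromosome code and `#` for the base-pair position, and `$1` and `$2` stand for allele names. Most shells expand an unquoted `$1` or `$2` before PLINK ever sees it, so the template is wrapped in single quotes here.

```shell
# Assign position-based IDs to variants whose ID is '.', writing a new
# binary fileset. Single quotes stop the shell from treating $1 and $2 in
# the template as positional parameters.
name_missing_ids() {
    vcf="$1"      # input VCF path (hypothetical argument)
    out="$2"      # output fileset prefix (hypothetical argument)
    plink --vcf "$vcf" \
          --set-missing-var-ids '@:#$1,$2' \
          --make-bed --out "$out"
}

# Example call with placeholder file names:
# name_missing_ids person1.vcf corrected_snpname1
```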

Roger Lee

Nov 26, 2014, 4:37:40 PM
to plink2...@googlegroups.com
Hi, 

I used the following command for each VCF file:

plink --vcf person1.vcf  --set-missing-var-ids @:#$1,$2  --make-bed --out corrected_snpname1

When I tried to merge the files using --merge-list, I got the following messages:

Warning: Multiple positions seen for variant 'rs2297463'.
Warning: Multiple positions seen for variant 'rs3117557,rs71252698'.
Warning: Multiple positions seen for variant 'rs3840862'.
Error: 11 variants with 3+ alleles present.

Did the error about having 3+ alleles stop the program? How can I fix the multi-allelic issue? And how should I address the multiple positions for some SNPs?

Thank you and sorry for the trouble.


Christopher Chang

Nov 26, 2014, 4:45:26 PM
to plink2...@googlegroups.com
* PLINK 1's data format is completely incapable of handling triallelic variants.  If you want to keep them (and then, e.g., use plink --vcf to keep the reference and the most common alternate allele), you must perform the merge with another tool.

For your job, though, it's safe to just exclude those 11 variants.  The failed merge will generate a .missnp file.  Use something like

plink --bfile corrected_snpname1 --exclude failed_merge.missnp --make-bed --out tmp_snpname1

to remove those variants from every fileset, and then retry the merge.

* If you're lazy, you can just ignore the multiple position warnings.  You might want to figure out why your data sources are inconsistent with each other, though.
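
The exclude-and-retry step above has to be repeated for every fileset, so a loop may help. This is only a sketch, assuming PLINK 1.9 on the PATH as `plink` and a merge list of bare fileset prefixes in the current directory. The function name, the `tmp_` prefix, and the `cleaned_list.txt` / `merged_clean` names are hypothetical, and the .missnp path is passed in rather than guessed.

```shell
# Remove the variants listed in the failed merge's .missnp file from every
# fileset named in the merge list, then retry the merge on the cleaned copies.
exclude_and_remerge() {
    list="$1"       # original merge list, one fileset prefix per line
    missnp="$2"     # .missnp file written by the failed merge
    : > cleaned_list.txt
    while IFS= read -r prefix; do
        [ -n "$prefix" ] || continue        # ignore blank lines
        plink --bfile "$prefix" --exclude "$missnp" \
              --make-bed --out "tmp_$prefix" || return 1
        echo "tmp_$prefix" >> cleaned_list.txt
    done < "$list"
    plink --merge-list cleaned_list.txt --make-bed --out merged_clean
}

# Example call with placeholder file names:
# exclude_and_remerge datanames.txt failed_merge.missnp
```

Writing the cleaned filesets under a new prefix leaves the originals untouched, which makes it easy to go back and inspect the excluded variants later.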



Roger Lee

Nov 26, 2014, 7:09:34 PM
to plink2...@googlegroups.com
Thank you. That worked. Sorry, I have one more question.

After the files are merged and a merged binary fileset has been written, is there a way to write out a merged VCF file from that fileset?

thanks



Christopher Chang

Nov 26, 2014, 7:11:48 PM
to plink2...@googlegroups.com
plink --bfile merged_data --recode vcf --out [...]
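
Since the original goal was a summary of allele frequencies, the export can be paired with a frequency report. A hedged sketch, assuming PLINK 1.9 on the PATH as `plink`: the function name and both arguments are hypothetical, and the output prefix is left to the caller.

```shell
# Summarise allele frequencies and export the merged fileset as VCF.
summarise_and_export() {
    merged="$1"   # merged fileset prefix (hypothetical argument)
    out="$2"      # output prefix chosen by the caller (hypothetical argument)
    plink --bfile "$merged" --freq --out "$out" || return 1   # writes $out.frq
    plink --bfile "$merged" --recode vcf --out "$out"         # writes $out.vcf
}

# Example call; the second argument is whatever output prefix you prefer:
# summarise_and_export merged_data my_output
```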