loading lots of vcfs - help please


d vanheel

Mar 26, 2012, 1:02:20 PM
to plinkseq-users
Hi plinkseq group,

How do I load a selected set of VCFs (I don't want all the VCFs in the
directory)?

I thought plinkseq might read piped input, but it doesn't seem to.
See below.

thanks for any help, david



[hmw208@sunfirex4600 plinkseq_projects]$ ls ../interim/*.vcf \
    | grep -F -f ../SampleList_lib01_1412PostQCof1419of1536samples_gt57300calls.txt \
    | xargs cat \
    | /data_n2/hmw208/software/plinkseq-0.08-x86_64/pseq lib01_26Mar2012_proj1 load-vcf --vcf -
xargs: cat: terminated by signal 13

Brett Thomas

Mar 26, 2012, 1:17:12 PM
to plinkse...@googlegroups.com
Hi David -- 

I'd reorganize the command line a little, using find instead of ls to get each file's path, and letting xargs pass the file names to pseq as arguments:
find interim | grep -F -f [samples] | xargs pseq proj load-vcf --vcf

There may be something different about your data, but this command works on the tutorial files:
find data | grep vcf.gz | xargs pseq proj1 load-vcf --vcf
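
For anyone who wants to check the selection before loading, here is a minimal dry-run sketch of the same find | grep | xargs pattern. The directory layout, the sample names and samples.txt are made up for illustration, and echo stands in for pseq so that nothing is actually loaded; only the "pseq proj load-vcf --vcf" form quoted above is assumed.

```shell
#!/bin/sh
# Dry run: select VCFs by sample name and pass the file NAMES (not the
# file contents) to the loader as command-line arguments.
set -eu

# Hypothetical test layout in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir interim
touch interim/sampleA.vcf interim/sampleB.vcf interim/sampleC.vcf
printf 'sampleA\nsampleC\n' > samples.txt   # one sample name per line

# find prints one path per line; grep -F -f keeps the paths that contain
# any of the fixed strings in samples.txt; sort makes the order stable.
# echo prints the command that xargs would otherwise run.
find interim -name '*.vcf' \
    | grep -F -f samples.txt \
    | sort \
    | xargs echo pseq proj load-vcf --vcf > cmd.txt

cat cmd.txt
```

Removing the echo runs the real command. One thing to keep in mind: if the list is very long, xargs may split it across several invocations, each of which gets its own "--vcf" followed by a batch of files.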

Brett

Purcell, Shaun

Mar 26, 2012, 2:05:09 PM
to <plinkseq-users@googlegroups.com>

Just something like

  --vcf `cat list | grep XYZ | etc`

would work (unless the number of VCFs is so large that the shell expansion doesn't work).

But it is probably easiest just to make the "project specification" file manually, from the grep statement, and then run "load-vcf" (i.e. it doesn't necessarily require a "--vcf" option: it will go through the project spec. file and load any VCF that hasn't already been loaded).
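
A rough sketch of the "make the project specification file from the grep statement" idea follows. The "path, then the VCF tag" line layout is taken from the project file listing that appears later in this thread; the tab separator, the file name proj_spec.txt and the sample names are assumptions for illustration, so copy whatever separator the existing lines of your own project file use.

```shell
#!/bin/sh
# Append one "<full path> VCF" line per selected file to a project file.
set -eu

# Hypothetical test layout in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"
mkdir interim
touch interim/sampleA.vcf interim/sampleB.vcf interim/sampleC.vcf
printf 'sampleA\nsampleC\n' > samples.txt
: > proj_spec.txt   # stand-in for an existing project specification file

# Starting find from an absolute directory makes it print full paths.
find "$PWD/interim" -name '*.vcf' \
    | grep -F -f samples.txt \
    | sort \
    | while IFS= read -r vcf; do
          # Tab between the path and the tag is an assumption (see above).
          printf '%s\tVCF\n' "$vcf"
      done >> proj_spec.txt

cat proj_spec.txt
```

After that, running load-vcf on the project without a "--vcf" option should pick up the newly listed files, as described above. Because nothing is expanded on a command line, this route also avoids the shell argument-length limit.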

d vanheel

Mar 26, 2012, 4:58:05 PM
to plinkseq-users
thanks gents,

I tried Brett's suggestion first and it works.

(And Shaun: yes, I have in fact hit the shell argument-length limit
before, when demultiplexing a flowcell lane with barcoded samples.)

PS: plinkseq could not have come at a better time for our project and
data... great stuff.

regards, david vh

d vanheel

Mar 27, 2012, 6:55:04 PM
to plinkseq-users
Shaun,

Sorry, your suggestion above doesn't seem to work:

"but probably easiest just to make the "project specification" file
manually, from the grep statement, and then run "load-vcf" (i.e. it
doesn't necessarily require a "--vcf" option: it will go through the
project spec. file and load any VCF that hasn't already been loaded.)"

I had to make a new project file, so I just appended the VCF names to
the file manually. See below for the first few lines of the project
file.
Nothing happens when I run load-vcf (it thinks for a few seconds, but
gives no output at all). Running summary gives 0 variants and 0 samples,
then at the end lots of "Added VCF..." lines. But these don't seem to
have been added when I run summary again or run other commands.

Do I need to bgzip-compress the VCFs, or something?

thanks, david



[hmw208@sunfirex4600 pseq_projects]$ more lib01_26Mar2012_proj2
/data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/pseq_projects/lib01_26Mar2012_proj2_out/ OUTPUT
/data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/ RESOURCES
/data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/pseq_projects/lib01_26Mar2012_proj2_out/vardb VARDB
/data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/pseq_projects/lib01_26Mar2012_proj2_out/inddb INDDB
/data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/locdb LOCDB
/data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/refdb REFDB
/data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/seqdb SEQDB
/data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010002E__1511065030.sorted.realigned.bam.raw.vcf VCF
/data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010009M__1511065030.sorted.realigned.bam.raw.vcf VCF
/data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010016V__1511065030.sorted.realigned.bam.raw.vcf VCF
/data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010027G__1511065030.sorted.realigned.bam.raw.vcf VCF
************** and lots more VCFs after this.....





[hmw208@sunfirex4600 pseq_projects]$ ./pseq lib01_26Mar2012_proj2 summary

---Variant DB summary---

0 unique variants

---Individual DB summary---

0 unique individuals

---Locus DB summary---

Group : refseq (33234 entries) n/a

---Reference DB summary---

Group : dbsnp (50776096 entries) : dbsnp (default name)

---Sequence DB summary---

chr1:1..249250621 MB=249
chr2:1..243199373 MB=243
chr3:1..198022430 MB=198
chr4:1..191154276 MB=191
chr5:1..180915260 MB=180
chr6:1..171115067 MB=171
chr7:1..159138663 MB=159
chr8:1..146364022 MB=146
chr9:1..141213431 MB=141
chr10:1..135534747 MB=135
chr11:1..135006516 MB=135
chr12:1..133851895 MB=133
chr13:1..115169878 MB=115
chr14:1..107349540 MB=107
chr15:1..102531392 MB=102
chr16:1..90354753 MB=90
chr17:1..81195210 MB=81
chr18:1..78077248 MB=78
chr19:1..59128983 MB=59
chr20:1..63025520 MB=63
chr21:1..48129895 MB=48
chr22:1..51304566 MB=51
chrX:1..155270560 MB=155
chrY:1..59373566 MB=59
chrM:1..16571 MB=0

SEQDB meta-information: BUILD = hg19
SEQDB meta-information: DESC = from-UCSC-20-dec-2010
SEQDB meta-information: IUPAC = 0
SEQDB meta-information: NAME = hg19
SEQDB meta-information: REPEATMODE = lower

---File-index summary---

Core project specification index : lib01_26Mar2012_proj2
Core OUTPUT file : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/pseq_projects/lib01_26Mar2012_proj2_out/
Core RESOURCES file : /data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/
Core LOCDB file : /data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/locdb
Core INDDB file : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/pseq_projects/lib01_26Mar2012_proj2_out/inddb
Core VARDB file : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/pseq_projects/lib01_26Mar2012_proj2_out/vardb
Core LOG file : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/pseq_projects/lib01_26Mar2012_proj2_out/log.txt
Core SEQDB file : /data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/seqdb
Core REFDB file : /data_n2/hmw208/software/plinkseq-0.08-x86_64/hg19/refdb
Added VCF : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010002E__1511065030.sorted.realigned.bam.raw.vcf
Added VCF : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010009M__1511065030.sorted.realigned.bam.raw.vcf
Added VCF : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010016V__1511065030.sorted.realigned.bam.raw.vcf
Added VCF : /data_n2/hmw208/Fluidigm_resequencing/1014AAP11O1/plinkseq_projects/../interim/101010027G__1511065030.sorted.realigned.bam.raw.vcf
************** and lots more VCFs after this.....

d vanheel

Mar 27, 2012, 7:01:18 PM
to plinkseq-users
OK, sorry, I have fixed the problem described in my post above.

I had some symbolic links as shortcuts to various directories.
These seemed to mess up the full-path filenames used in the project
specification file.

load-vcf works fine now.
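
In case it helps anyone who hits the same thing, here is one possible way to turn a path that goes through a symlinked directory into its physical full path before writing it into the project file. The helper name canonical and the directory names are hypothetical; the sketch uses only POSIX cd and pwd -P, and it resolves symlinked directories but not a VCF file that is itself a symlink.

```shell
#!/bin/sh
# Resolve symlinked directory components of a file path.
set -eu

# Hypothetical test layout: "shortcut" is a symlink to real/interim.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p real/interim
touch real/interim/sampleA.vcf
ln -s real/interim shortcut

# canonical FILE: print FILE with its directory part made absolute and
# free of symlinks. The cd happens inside a command substitution, so the
# caller's working directory is not changed.
canonical() {
    dir=$(cd "$(dirname "$1")" && pwd -P)
    printf '%s/%s\n' "$dir" "$(basename "$1")"
}

resolved=$(canonical shortcut/sampleA.vcf)
echo "$resolved"
```

The printed path ends in real/interim/sampleA.vcf rather than shortcut/sampleA.vcf, which is the form that could then be written into the project specification file.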

david