So, I'll pitch my vote in favor of being able to create a .udb (or .vdb) database!
Support for UDB files has now been added. The following commands are supported: --makeudb_usearch, --udbinfo, --udbstats, and --udb2fasta. The --usearch_global command now detects automatically whether the database given with the --db option is a UDB file.
The changes have not been tested extensively.
The changes are not yet in any release, but have been committed to the GitHub repo.
Enjoy!
Feedback appreciated.
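For anyone who wants to try the new commands, here is a minimal sketch of a round trip: build a UDB file from a FASTA file, inspect it, convert it back, and search against it. The file names, the single test sequence and the 0.97 identity threshold are made up for illustration, and the script assumes a vsearch build that already contains the UDB changes is on the PATH; it skips the run otherwise.

```shell
#!/bin/sh
# Sketch only: file names and the test sequence below are hypothetical.
workdir=$(mktemp -d)
status=skipped

# One 60-nt sequence, used both as the database and as the query.
printf '>seq1\nACGTTGCAAGCTTAGGCTAACGATCGGATCCATGCAAGTCGATCGTTAGCCATGGACTAG\n' \
    > "$workdir/db.fa"
cp "$workdir/db.fa" "$workdir/query.fa"

if command -v vsearch >/dev/null 2>&1; then
    # Build the UDB file from the FASTA database.
    vsearch --makeudb_usearch "$workdir/db.fa" --output "$workdir/db.udb"

    # Show general information and word statistics for the UDB file.
    vsearch --udbinfo "$workdir/db.udb"
    vsearch --udbstats "$workdir/db.udb"

    # Convert the UDB file back to FASTA.
    vsearch --udb2fasta "$workdir/db.udb" --output "$workdir/roundtrip.fa"

    # Search against the UDB file; the format is detected from the file itself.
    vsearch --usearch_global "$workdir/query.fa" --db "$workdir/db.udb" \
        --id 0.97 --uc "$workdir/hits.uc"

    status=ran
fi

echo "UDB round trip: $status (files in $workdir)"
```

The temporary directory is left in place so the generated files can be inspected afterwards.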
time vsearch --usearch_global sub.fna --db 99_otus.udb --id .97 --uc test2.uc --threads 10
vsearch v2.4.4_linux_x86_64, 62.9GB RAM, 24 cores
https://github.com/torognes/vsearch
Reading UDB file 99_otus.udb 0%
Fatal error: Unable to read from UDB file or invalid UDB file
real 1m18.338s
user 0m0.002s
sys 1m1.863s
test_directory> vsearch --udbstats CGDBv2.0.udb
vsearch v2.4.4_linux_x86_64, 252.3GB RAM, 20 cores
https://github.com/torognes/vsearch
Reading UDB file 100%
test_directory>
vsearch --usearch_global short_sequences.fa -db my.udb ...
vsearch --usearch_global test_primers.fa -db CGDBv2.0.udb --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_udb.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 8192 --threads 15 --maxseqlength 1000000
vsearch --makeudb_usearch CGDBv2.0.fa --id 0.83 --minwordmatches 1 --wordlength 8 --output CGDB2.0.udb -dbmask none -qmask none --strand both --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 -maxrejects 4096 --maxseqlength 1000000
vsearch --usearch_global test_primers.fa -db CGDBv2.0.udb --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_udb.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 4096 --threads 15 --maxseqlength 1000000
vsearch --usearch_global test_primers.fa -db CGDBv2.0.fa --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_fasta.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 4096 --threads 15 --maxseqlength 1000000
> wc -l against_*.vsearch
340137 against_fasta.vsearch
337004 against_udb.vsearch
677141 total
> cat <(awk '{printf "fasta\t%s\n", $0}' against_fasta.vsearch) <(awk '{printf "udb\t%s\n", $0}' against_udb.vsearch) | sort -k2,13 | uniq -u -f 1 > uniq-both.lines
> wc -l uniq-both.lines
22621 uniq-both.lines
> egrep -c "^fasta" uniq-both.lines
12877
> egrep -c "^udb" uniq-both.lines
9744
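The same comparison can also be split per file with `sort` and `comm`, which gives the lines found only in the FASTA run, only in the UDB run, and in both. This is a minimal sketch on two tiny made-up files standing in for against_fasta.vsearch and against_udb.vsearch; the file names and hit lines are hypothetical. Note that `comm` treats repeated lines differently from `uniq -u`, so the counts can differ from those above if a file contains duplicate hits.

```shell
#!/bin/sh
# Sketch only: the two hit files below are hypothetical stand-ins.
workdir=$(mktemp -d)
printf 'q1\t+\tt1\nq2\t+\tt2\nq3\t-\tt3\n' > "$workdir/fasta.hits"
printf 'q1\t+\tt1\nq3\t-\tt3\nq4\t+\tt4\n' > "$workdir/udb.hits"

# comm needs both inputs sorted in the same order; LC_ALL=C makes it byte-wise.
LC_ALL=C sort "$workdir/fasta.hits" > "$workdir/fasta.sorted"
LC_ALL=C sort "$workdir/udb.hits" > "$workdir/udb.sorted"

# -23 keeps lines only in the first file, -13 only in the second, -12 in both.
LC_ALL=C comm -23 "$workdir/fasta.sorted" "$workdir/udb.sorted" > "$workdir/only_fasta.hits"
LC_ALL=C comm -13 "$workdir/fasta.sorted" "$workdir/udb.sorted" > "$workdir/only_udb.hits"
LC_ALL=C comm -12 "$workdir/fasta.sorted" "$workdir/udb.sorted" > "$workdir/shared.hits"

# tr strips the padding that some wc implementations add.
only_fasta=$(wc -l < "$workdir/only_fasta.hits" | tr -d ' ')
only_udb=$(wc -l < "$workdir/only_udb.hits" | tr -d ' ')
shared=$(wc -l < "$workdir/shared.hits" | tr -d ' ')

echo "only in FASTA run: $only_fasta"
echo "only in UDB run:   $only_udb"
echo "in both runs:      $shared"

rm -rf "$workdir"
```

Keeping the two "only in" files separate makes it easier to check whether the missing hits cluster around particular queries or targets.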