Indexing database

patrick...@gmail.com

May 2, 2016, 6:05:54 AM
to VSEARCH Forum
Hey Guys,

Did you implement this option?

http://drive5.com/usearch/manual/cmd_makeudb_usearch.html

I'm running --usearch_global against a big FASTA database (45 GB), but it's very slow ...

Torbjørn Rognes

May 2, 2016, 7:46:46 AM
to VSEARCH Forum
Hi!

No, we haven't implemented any of the udb-file related options. We have considered it, but so far we haven't found compelling reasons to make the effort.

Do you save considerable time by creating a udb database as opposed to searching the FASTA database directly?

Are you performing many individual searches against the same database?

- Torbjørn

patrick...@gmail.com

May 2, 2016, 7:55:52 AM
to VSEARCH Forum
I didn't try the 64-bit usearch version, but based on the figure at http://www.drive5.com/usearch/features.html, it seems much faster than BLAST.

Actually, I used blastn for the search and it took 5 h (with an indexed database), whereas the FASTA database search implemented in vsearch took me 4 days...
So I think it would be good to implement this in your next update.


Torbjørn Rognes

May 2, 2016, 8:04:13 AM
to VSEARCH Forum
Hi

vsearch is generally as fast as usearch, but it depends a bit on the length of the sequences.

vsearch uses algorithms similar to those in usearch; it is just the use of the udb files that we have not implemented. We still use similar indices.

What you save by using udb files is the time used to index the database each time you perform a search. That time is only saved if you perform many searches against the same database. Also, that time is often negligible compared to the actual search time.
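
As a rough back-of-the-envelope illustration of this trade-off (not a measurement), the sketch below compares n searches that each re-index the FASTA file with n searches that load a pre-built index. The function name `estimate_saving` and all timings are hypothetical, and it assumes that building the index once costs about as much as one indexing pass.

```sh
# Hypothetical helper: estimate the time saved by a pre-built index.
# Arguments: number of searches, indexing time, search time, index load time
# (all times in seconds). Assumes one index build costs one indexing pass.
estimate_saving() {
    awk -v n="$1" -v t_index="$2" -v t_search="$3" -v t_load="$4" 'BEGIN {
        direct   = n * (t_index + t_search)            # index rebuilt on every run
        prebuilt = t_index + n * (t_load + t_search)   # index built once, then loaded
        printf "direct=%d prebuilt=%d saved=%d\n", direct, prebuilt, direct - prebuilt
    }'
}
```

With a single search the pre-built index gives no gain under these assumptions; the saving only grows as the number of searches against the same database increases.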

- Torbjørn

christian...@gmail.com

Mar 21, 2017, 3:28:11 PM
to VSEARCH Forum
Hi Torbjorn,
    I would like to add my voice to the need for a "makeudb" feature for vsearch. I am performing *many* individual searches against the same database, and of course the indexing takes 98% of the time. So having a precomputed database to load would make a big difference for my applications.

     Regards,
     Christian

Torbjørn Rognes

Mar 21, 2017, 3:31:52 PM
to VSEARCH Forum
Hi

Thanks for your suggestion. Could you say something about the number and lengths of the query and database sequences? Are you using the usearch_global command? Any unusual options?

- Torbjørn

christian...@gmail.com

Mar 21, 2017, 3:44:21 PM
to VSEARCH Forum
Yes, I am using usearch_global.

The query sequences are 18-24 nt, and if the udb functionality were available then I would have ~50-200 query sequences per vsearch run. Unusual options: wordlength 6 and minwordmatches 3.

The database is 1322811516 nt in 619226 seqs, min 32, max 49287, avg 2136.

While we're on the topic, what would be ideal is to be able to load the udb into memory and keep it there between runs so that multiple calls to vsearch wouldn't need to keep reloading from disk. (I'm sure that you've got a laundry list of todos for vsearch, plus a regular day job. I just wanted to go for broke here in terms of bells-and-whistles requests... just in case.)

eric.norm...@gmail.com

Apr 6, 2017, 3:00:33 PM
to VSEARCH Forum
I am also running multiple `vsearch --usearch_global` searches on the same database. Indexing the database accounts for a very significant proportion (50-90%) of the time needed for the search, even though I am using 16 CPUs for the search. Being able to pre-index the database would reduce that time considerably, and keeping it in memory (I have no clue how you would achieve this, however) would be very fast.

So, I'll pitch my vote in favor of being able to create a .udb (or .vdb) database!

eric.norm...@gmail.com

Apr 6, 2017, 3:01:52 PM
to VSEARCH Forum
The sequences in my database are 300-700bp and my queries 300-400bp.

br...@ciad.mx

Apr 17, 2017, 2:37:53 PM
to VSEARCH Forum
I agree with the community: implementing creation and use of udb's would be much appreciated!

Torbjørn Rognes

Sep 8, 2017, 2:00:26 PM
to VSEARCH Forum

Support for UDB files has now been added. The following commands are supported: --makeudb_usearch, --udbinfo, --udbstats, and --udb2fasta. The --usearch_global command now automatically detects whether the database specified with the --db option is a UDB file.


The changes have not been tested extensively.


The changes are not yet in any release, but have been committed to the GitHub repo.


Enjoy!


Feedback appreciated.

christian...@gmail.com

Sep 8, 2017, 4:00:41 PM
to VSEARCH Forum
Thanks so much! This will be a very useful addition to vsearch. Now, on to the bug reports...

This works fine: vsearch --makeudb_usearch CGDBv2.0.fa --wordlength 11 --output CGDBv2.0.udb

And so does this:
> vsearch --udbinfo CGDBv2.0.udb
vsearch v2.4.4_linux_x86_64, 252.3GB RAM, 20 cores
https://github.com/torognes/vsearch

           Seqs  631167
     SeqIx bits  32
          Alpha  nt (4)
     Word width  11
          Slots  0
      Dict size  4194304 (4194.3k)
         DBstep  1
        DBAccel  100%

Now the badness:
> vsearch --udbstats CGDBv2.0.udb
vsearch v2.4.4_linux_x86_64, 252.3GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading UDB file 0% 

Fatal error: Unable to read from UDB file or invalid UDB file

I get the same fatal error when I try:
vsearch --usearch_global test_seqs.fa -db CGDBv2.0.udb ...

Torbjørn Rognes

Sep 8, 2017, 4:52:17 PM
to VSEARCH Forum
Thanks for reporting these bugs. I will look into this asap. It seems like it does not handle non-default wordlengths correctly.

christian...@gmail.com

Sep 8, 2017, 4:56:13 PM
to VSEARCH Forum
I get the same fatal error even if I make the database with the default wordlength (vsearch --makeudb_usearch CGDBv2.0.fa --output CGDBv2.0.udb).

Colin Brislawn

Sep 8, 2017, 6:27:09 PM
to VSEARCH Forum
I also have this error. Maybe I'm using this command wrong?

time vsearch --usearch_global sub.fna --db 99_otus.udb --id .97 --uc test2.uc --threads 10

vsearch v2.4.4_linux_x86_64, 62.9GB RAM, 24 cores
https://github.com/torognes/vsearch

Reading UDB file 99_otus.udb 0%

Fatal error: Unable to read from UDB file or invalid UDB file

real 1m18.338s
user 0m0.002s
sys 1m1.863s


Colin


Torbjørn Rognes

Sep 8, 2017, 7:02:26 PM
to VSEARCH Forum
I am not able to reproduce these bugs with the data I am using. Would it be possible for any of you to share a FASTA file that generates these errors?

There was an error with the progress indicator when making udb's, but it's just cosmetic. Fixed in latest commit.

Just to make it clear: The binary UDB format is similar to that used in usearch version 8, but not necessarily exactly the same, so the UDB files should be both written and read with VSEARCH. I am sure you already do that, but I just wanted to point that out.

Some examples:

To make a udb file:
vsearch --makeudb_usearch db.fasta --output db.udb

To get info:
vsearch --udbinfo db.udb

To search with a udb database:
vsearch --usearch_global query.fasta --db db.udb --id 0.9 --uc out.uc
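
For the use case raised earlier in the thread (many separate searches against one database), the commands above could be wrapped in a loop along these lines. This is only a sketch: the function name `search_all`, the directory layout, and the `.fasta` extension are assumptions, and only options already shown in this thread are used.

```sh
# Hypothetical wrapper: search every query FASTA file in a directory
# against one pre-built UDB file.
search_all() {
    db="$1"         # UDB file made earlier with --makeudb_usearch
    query_dir="$2"  # directory holding the query *.fasta files
    out_dir="$3"    # directory where the .uc result files are written
    mkdir -p "$out_dir" || return 1
    for q in "$query_dir"/*.fasta; do
        [ -e "$q" ] || continue   # skip if the glob matched nothing
        name=$(basename "$q" .fasta)
        vsearch --usearch_global "$q" --db "$db" --id 0.9 \
            --uc "$out_dir/$name.uc" || return 1
    done
}
```

It would be called as, for example, `search_all db.udb queries results`. Each vsearch call still reads the UDB file from disk; only the indexing step is skipped.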

Torbjørn Rognes

Sep 9, 2017, 6:34:12 AM
to VSEARCH Forum
I was able to reproduce it now, so no need to share any data. It seems I had used files that were too small. Looking into it.

Torbjørn Rognes

Sep 9, 2017, 8:53:26 AM
to VSEARCH Forum
I think the bug is fixed in the latest commit. There was a problem when VSEARCH tried to read too large a part of a file at once.

christian...@gmail.com

Sep 9, 2017, 11:08:11 AM
to VSEARCH Forum
I just checked out the latest commit.

vsearch --udbstats no longer generates the fatal error, although no information is printed:

test_directory> vsearch --udbstats CGDBv2.0.udb
vsearch v2.4.4_linux_x86_64, 252.3GB RAM, 20 cores
https://github.com/torognes/vsearch

Reading UDB file 100%
test_directory>

On the other hand,

vsearch --usearch_global short_sequences.fa -db my.udb ...

still generates the same fatal error.

Torbjørn Rognes

Sep 11, 2017, 8:55:24 AM
to VSEARCH Forum
I think I have removed these and some other UDB-related bugs in the latest commit. I have also restructured the code.

The udbstats command only writes its report to the log file. Please specify the log file with the "--log" option. This is similar to usearch.
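
A minimal sketch of what that looks like in practice, using only the `--udbstats` and `--log` options named above. The helper name `show_udbstats` and the default log file name are arbitrary choices for illustration.

```sh
# Hypothetical helper: write the --udbstats report to a log file, then print it.
show_udbstats() {
    udb="$1"
    log="${2:-udbstats.log}"   # log file name is an arbitrary choice
    vsearch --udbstats "$udb" --log "$log" || return 1
    cat "$log"
}
```

For example, `show_udbstats CGDBv2.0.udb` would leave the report in `udbstats.log` and also print it to the terminal.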

christian...@gmail.com

Sep 11, 2017, 3:08:42 PM
to VSEARCH Forum
Appears to work as expected now.

As an aside, I ran a test to compare vsearch output when using as the database a) a fasta file and b) a udb file generated from the same fasta file. All other command line parameters were the same. To my surprise, the difference in the set of database matches returned by each vsearch was large---even when I set max_rejects to values up to 4,000 or more. Since vsearch uses a heuristic search, I would expect differences in the database matches returned up to a point. But I thought that for fairly "thorough" settings the sets of database matches returned for fasta file vs udb file would converge.

Here is the udb version of the vsearch command used. (Note: test_primers are 18-23 nt).
vsearch --usearch_global test_primers.fa -db CGDBv2.0.udb --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_udb.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 8192 --threads 15 --maxseqlength 1000000

Anyhow, this observation is tangential to this thread. Thanks for implementing the --makeudb_usearch option. It really helps.



Torbjørn Rognes

Sep 11, 2017, 3:16:52 PM
to VSEARCH Forum
Thanks for your feedback.

The results should be exactly the same whether you use a FASTA or a UDB database. If there are differences, there must be bugs remaining. I did a similar test myself, and obtained identical results. I will perform additional tests to investigate.

Torbjørn Rognes

Sep 12, 2017, 11:37:01 AM
to VSEARCH Forum
I would very much like to identify the cause of the differences you see when you run searches directly on a FASTA file and on a UDB file generated from the same FASTA file. The results should be identical, otherwise there is something wrong. If you run with multiple threads the order of the matches in the result files could be different, but the results should essentially be the same.

What options did you use when you generated the UDB file from the FASTA file? Did you include the same options regarding word length, database masking, etc. (--wordlength 8 --dbmask none --maxseqlength 1000000)?
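
Because the order of matches can differ between multi-threaded runs, one way to compare two result files is to sort both before looking for differences. The helper below is a sketch of that idea; the name `compare_hits` is hypothetical, and it treats each output line as one hit, so duplicate lines within a file are collapsed.

```sh
# Hypothetical helper: count result lines unique to each file, ignoring order.
compare_hits() {
    a_sorted=$(mktemp) || return 1
    b_sorted=$(mktemp) || return 1
    LC_ALL=C sort -u "$1" > "$a_sorted"
    LC_ALL=C sort -u "$2" > "$b_sorted"
    # comm -23: lines only in the first file; comm -13: lines only in the second
    only_a=$(LC_ALL=C comm -23 "$a_sorted" "$b_sorted" | wc -l | tr -d ' ')
    only_b=$(LC_ALL=C comm -13 "$a_sorted" "$b_sorted" | wc -l | tr -d ' ')
    rm -f "$a_sorted" "$b_sorted"
    echo "only_in_first=$only_a only_in_second=$only_b"
}
```

If both counts are zero, the two runs returned the same set of hits and any remaining difference is only in the ordering.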

christian...@gmail.com

Sep 12, 2017, 1:33:57 PM
to VSEARCH Forum
Trying this again to address your question about parameters used in the database build, I remade the database with all possible options (even though many are probably ignored or not used):

vsearch --makeudb_usearch CGDBv2.0.fa --id 0.83 --minwordmatches 1 --wordlength 8 --output CGDB2.0.udb -dbmask none -qmask none --strand both --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 -maxrejects 4096 --maxseqlength 1000000

I then ran vsearch with and without the udb file as the database:
vsearch --usearch_global test_primers.fa -db CGDBv2.0.udb --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_udb.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 4096 --threads 15 --maxseqlength 1000000

and
vsearch --usearch_global test_primers.fa -db CGDBv2.0.fa --id 0.83 --minwordmatches 1 --wordlength 8 --dbmask none --qmask none --strand both --userout against_fasta.vsearch --userfields 'query+qstrand+qlo+qhi+qrow+target+tstrand+tilo+tihi+trow+mism+opens' --maxaccepts 10000000 --query_cov 0.95 --maxgaps 1 --maxrejects 4096 --threads 15 --maxseqlength 1000000

Straightaway I can see that the two output files are different:
> wc -l against_*.vsearch
340137 against_fasta.vsearch
337004 against_udb.vsearch
677141 total

To determine how many different lines there are in total:
> cat <(awk '{printf "fasta\t%s\n", $0}' against_fasta.vsearch) <(awk '{printf "udb\t%s\n", $0}' against_udb.vsearch) | sort -k2,13 | uniq -u -f 1 > uniq-both.lines
> wc -l uniq-both.lines
22621 uniq-both.lines

The number of results that the fasta-based output has that the udb-based output does not have, and conversely:
> egrep -c "^fasta" uniq-both.lines
12877
> egrep -c "^udb" uniq-both.lines
9744

So, roughly the same.






eric.norm...@gmail.com

Oct 6, 2017, 2:11:42 PM
to VSEARCH Forum
I see that support for .udb has been added in 2.5.0. Has the bug you were reporting and investigating been resolved?

Thanks for adding .udb support! This will gain me ~20-50% in my similarity searches depending on the dataset. 

Torbjørn Rognes

Oct 6, 2017, 2:29:13 PM
to VSEARCH Forum
I have not been able to reproduce the error yet, but will continue my effort.

Torbjørn Rognes

Oct 24, 2017, 7:32:06 AM
to VSEARCH Forum
The differences in results reported with and without udb databases might be due to the `--minseqlength` option. By default this option is 32 for the `--usearch_global` command and some other commands, but 1 for `--makeudb_usearch`. When some of the database sequences are shorter than 32 bp, this could give different results unless the `--minseqlength 32` option is passed when running `--makeudb_usearch`. I will correct this problem.

Could you try to run makeudb_usearch with --minseqlength 32, rerun the tests and see if the results still differ?
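
Another way to test this hypothesis, independent of how `--makeudb_usearch` handles the option, would be to drop the short sequences from the FASTA file before building the UDB file. The awk-based filter below is a sketch and not part of vsearch; the function name `filter_fasta_minlen` is hypothetical, and it writes each kept sequence on a single line.

```sh
# Hypothetical helper: keep only FASTA records whose sequence length is at
# least the given minimum. Usage: filter_fasta_minlen MINLEN input.fasta
filter_fasta_minlen() {
    awk -v min="$1" '
        # print the buffered record if it is long enough
        function emit() {
            if (header != "" && length(seq) >= min) { print header; print seq }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { gsub(/[ \t\r]/, ""); seq = seq $0 }          # join wrapped sequence lines
        END { emit() }
    ' "$2"
}
```

For instance, `filter_fasta_minlen 32 db.fasta > db.min32.fasta` would give a file containing only sequences of 32 bp or longer, which could then be used for both the FASTA search and the UDB build.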

eric.norm...@gmail.com

Jan 27, 2020, 9:47:54 AM
to VSEARCH Forum
More than two years later, I want to create a database in which some of the sequences are shorter than 32 bp.

It seems that the `--minseqlength` option is not accepted by `vsearch --makeudb_usearch`.

See the error message (for vsearch v2.14.1_linux_x86_64, 220.3GB RAM, 40 cores):

```
vsearch --makeudb_usearch db_teleo_vsearch_formatted.fasta --output test.vsearchdb --minseqlength 100
Fatal error: Invalid options to command makeudb_usearch
Invalid option(s): --minseqlength
The valid options for the makeudb_usearch command are: --bzip2_decompress --dbmask --gzip_decompress --hardmask --log --no_progress --notrunclabels --output --quiet --threads --wordlength
```

Torbjørn Rognes

Jan 28, 2020, 6:31:55 AM
to VSEARCH Forum
Thank you for reporting this. This is indeed a bug and I have created an issue on GitHub for it:


I'll fix it soon.

- Torbjørn
