using vsearch-1.8.0 in qiime

Diana

Oct 30, 2015, 3:48:36 PM

to VSEARCH Forum

Hi,

I am analyzing 16S data in qiime and have reached the point where I need to run chimera detection, dereplication, etc. I am running qiime 1.9.1 in a VirtualBox virtual machine that does not have usearch61 installed, but based on what I have read on the qiime forum and elsewhere I would like to use vsearch anyway. I downloaded vsearch-1.8.0 from https://github.com/torognes/vsearch and renamed the binary to usearch61 so that qiime will recognize it. I added its directory to my PATH as follows: qiime@qiime-190-virtual-box:/$ export PATH=/home/qiime/Desktop/vsearch-1.8.0/bin/:$PATH
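
A minimal sketch, in Python, of checking which file a tool name such as usearch61 resolves to on the search path. An export like the one above only applies to the shell session in which it is run, so the check is worth repeating in a new terminal. The helper name resolve_tool is hypothetical and is not part of qiime or vsearch; shutil.which comes from the Python 3 standard library (it is not available in the Python 2.7 that qiime 1.9.1 runs on).

```python
import shutil


def resolve_tool(name, search_path=None):
    """Return the full path that `name` resolves to, or None if it is not found.

    `search_path` is a PATH-style string; None means use the current PATH.
    """
    return shutil.which(name, path=search_path)


# Report what the name "usearch61" currently resolves to, if anything.
location = resolve_tool("usearch61")
if location is None:
    print("usearch61 was not found on the PATH")
else:
    print("usearch61 resolves to", location)
```

If the reported location is not the renamed vsearch binary, or nothing is found, the PATH of the current session does not include the vsearch bin directory.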

As a test I ran the command below on a subset of my data and received the error message shown. The output directory contained six files, which I have listed below along with the contents of the two log files. I am not sure what to expect, as this is my first attempt at 16S data analysis with these tools. Any feedback is greatly appreciated.

Thanks,
Diana
______

qiime@qiime-190-virtual-box:~/Desktop/TestFolder/Seqs$ identify_chimeric_seqs.py -i seqs.fna -m usearch61 -o usearch_check_chimeras/ -r rdp_gold.fa

Traceback (most recent call last):
File "/usr/local/bin/identify_chimeric_seqs.py", line 354, in <module>
    main()
File "/usr/local/bin/identify_chimeric_seqs.py", line 350, in main
    threads=threads)
File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 774, in usearch61_chimera_check
    log_lines, verbose, threads)
File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 961, in identify_chimeras_usearch61
    HALT_EXEC=HALT_EXEC)
File "/usr/local/lib/python2.7/dist-packages/bfillings/usearch.py", line 2411, in usearch61_chimera_check_ref
    app_result = app()
File "/usr/local/lib/python2.7/dist-packages/burrito/util.py", line 285, in __call__
    'StdErr:\n%s\n' % open(errfile).read())
burrito.util.ApplicationError: Unacceptable application exit status: 1
Command:
cd "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/"; usearch61 --mindiffs 3 --uchime_ref "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta" --minh 0.28 --xn 8.0 --minseqlength 64 --threads 0.5 --mindiv 0.8 --uchimeout "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_ref.uchime" --dn 1.4 --strand plus --db "/home/qiime/Desktop/TestFolder/Seqs/rdp_gold.fa" --log "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_ref.log" > "/tmp/tmpDG5xUFHBJqxjVyIYCcBE.txt" 2> "/tmp/tmpRKnFMjRLIuRmYHXT9wUl.txt"
StdOut:

StdErr:

Fatal error: Illegal option argument

List of the 6 files obtained in the output directory:

seqs.fna_chimeras_denovo.log
seqs.fna_chimeras_denovo.uchime
seqs.fna_consensus_fixed.fasta
seqs.fna_consensus_with_abundance.fasta
seqs.fna_consensus_with_abundance.uc
seqs.fna_smallmem_clustered.log

Below I have copy-pasted the contents of the log files

seqs.fna_chimeras_denovo.log
vsearch v1.8.0_linux_x86_64, 4.4GB RAM, 2 cores
usearch61 --mindiffs 3 --uchime_denovo /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta --minh 0.28 --xn 8.0 --minseqlength 64 --mindiv 0.8 --abskew 2.0 --uchimeout /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_chimeras_denovo.uchime --dn 1.4 --log /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_chimeras_denovo.log
Started Fri Oct 30 09:55:16 2015
166166 nt in 730 seqs, min 71, max 283, avg 228

    0.28 minh
    8.00 xn
    1.40 dn
    1.00 xa
    0.80 mindiv
    0.55 id
       2 maxp

/home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta: 233/730 chimeras (31.9%)

Finished Fri Oct 30 09:55:16 2015
Elapsed time 00:00
Max memory 5.2MB

seqs.fna_smallmem_clustered.log
vsearch v1.8.0_linux_x86_64, 4.4GB RAM, 2 cores
usearch61 --maxaccepts 1 --consout /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_consensus_with_abundance.fasta --usersort --id 0.97 --sizeout --minseqlength 64 --wordlength 8 --uc /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_consensus_with_abundance.uc --cluster_smallmem /home/qiime/Desktop/TestFolder/Seqs/seqs.fna --maxrejects 8 --strand plus --log /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_smallmem_clustered.log
Started Fri Oct 30 09:54:32 2015
40848859 nt in 173083 seqs, min 163, max 283, avg 236

      Alphabet nt
    Word width 8
     Word ones 8
        Spaced No
        Hashed No
         Coded No
       Stepped No
         Slots 65536 (65.5k)
       DBAccel 100%

Clusters: 730 Size min 1, max 25992, avg 237.1
Singletons: 231, 0.1% of seqs, 31.6% of clusters

Finished Fri Oct 30 09:55:15 2015
Elapsed time 00:43
Max memory 87.2MB

Torbjørn Rognes

Oct 30, 2015, 5:09:01 PM

Hi Diana!

Thanks for trying out vsearch. Even though we have tried to make the options to vsearch very similar to usearch, they are not all identical. Also, we have not tested if it can directly replace usearch61 in qiime. But it seems that some have successfully used vsearch instead of usearch as you indicate.

In this case it seems like the argument "0.5" to the option "--threads" is what vsearch does not like, as far as I can see from the logs. To this option vsearch requires an integer argument. If it is zero, vsearch will launch as many threads as there are cores in the computer, if it is 1 or higher it will launch that number of threads. It does not make much sense to supply a decimal number, so I do not understand why qiime supplies the number 0.5 as the argument to this option. I think usearch will simply round this down to zero, but vsearch treats it as an error and aborts with a fatal error message.

We could change vsearch to accept such a number, but the better fix would be for qiime not to supply a decimal number to this option. Are you able to specify the number of threads that qiime should use with some setting when you run it?
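
The rounding behaviour described above can be sketched in Python as follows. This is an illustration only: to_integer_threads is a hypothetical helper, not code from vsearch, usearch or qiime, and rounding down is the behaviour guessed at above for usearch, not something confirmed.

```python
import math


def to_integer_threads(value):
    """Round a thread setting such as "0.5", "1.0" or 2 down to a whole number.

    Raises ValueError for negative, non-finite or non-numeric input.
    """
    number = float(value)
    if not math.isfinite(number) or number < 0:
        raise ValueError("thread count must be a non-negative number: %r" % (value,))
    return math.floor(number)


# The values that appear in this thread, plus a plain integer.
for setting in ("0.5", "1.0", "2"):
    print(setting, "->", to_integer_threads(setting))
```

Under the vsearch convention described above, a result of 0 would mean one thread per core, and 1 or higher would mean that exact number of threads.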

- Torbjørn

Diana

Oct 31, 2015, 8:54:22 AM

Hi Torbjørn,

Thanks for your quick response. I checked the options with 'identify_chimeric_seqs.py -h' and the number of threads can be specified as follows:

--threads=THREADS     Specify number of threads per core to be used for
                        usearch61 commands that utilize multithreading. By
                        default, will calculate the number of cores to utilize
                        so a single thread will be used per CPU. Specify a
                        fractional number, e.g. 1.0 for 1 thread per core, or
                        0.5 for a single thread on a two core CPU. Only
                        applies to usearch61. [default: one_per_cpu]

Therefore I tried the command below, which also gave an error message, along with the same output files. I pasted the new log files below for reference. Thoughts?

As vsearch is not yet supported within qiime, maybe I should just use it separately? I don't mind that at all, but I am not sure how to do so within the VirtualBox, so I would appreciate help getting started.

Thanks,
Diana

-----------------
qiime@qiime-190-virtual-box:~/Desktop/TestFolder/Seqs$ identify_chimeric_seqs.py -i seqs.fna -m usearch61 -o usearch_check_chimeras/ -r rdp_gold.fa --threads 1.0

Traceback (most recent call last):
File "/usr/local/bin/identify_chimeric_seqs.py", line 354, in <module>
    main()
File "/usr/local/bin/identify_chimeric_seqs.py", line 350, in main
    threads=threads)
File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 774, in usearch61_chimera_check
    log_lines, verbose, threads)
File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 961, in identify_chimeras_usearch61
    HALT_EXEC=HALT_EXEC)
File "/usr/local/lib/python2.7/dist-packages/bfillings/usearch.py", line 2411, in usearch61_chimera_check_ref
    app_result = app()
File "/usr/local/lib/python2.7/dist-packages/burrito/util.py", line 285, in __call__
    'StdErr:\n%s\n' % open(errfile).read())
burrito.util.ApplicationError: Unacceptable application exit status: 1
Command:

cd "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/"; usearch61 --mindiffs 3 --uchime_ref "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta" --minh 0.28 --xn 8.0 --minseqlength 64 --threads 1.0 --mindiv 0.8 --uchimeout "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_ref.uchime" --dn 1.4 --strand plus --db "/home/qiime/Desktop/TestFolder/Seqs/rdp_gold.fa" --log "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_ref.log" > "/tmp/tmpn9J3xHYLh3YxxdGJhbKB.txt" 2> "/tmp/tmpkHoAyvAJfEA1XrcY7y8P.txt"

StdOut:

StdErr:

Fatal error: Illegal option argument

Log files generated:

1. seqs.fna_chimeras_denovo.log

vsearch v1.8.0_linux_x86_64, 4.4GB RAM, 2 cores
usearch61 --mindiffs 3 --uchime_denovo /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta --minh 0.28 --xn 8.0 --minseqlength 64 --mindiv 0.8 --abskew 2.0 --uchimeout /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_chimeras_denovo.uchime --dn 1.4 --log /home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_chimeras_denovo.log
Started Fri Oct 30 09:55:16 2015
166166 nt in 730 seqs, min 71, max 283, avg 228

    0.28 minh
    8.00 xn
    1.40 dn
    1.00 xa
    0.80 mindiv
    0.55 id
       2 maxp

/home/qiime/Desktop/TestFolder/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta: 233/730 chimeras (31.9%)

Finished Fri Oct 30 09:55:16 2015
Elapsed time 00:00
Max memory 5.2MB

2. seqs.fna_smallmem_clustered.log

Torbjørn Rognes

Nov 2, 2015, 4:29:48 AM

Hi Diana

Thanks for trying vsearch again. The problem is still the floating point number (1.0) specified for threads, as vsearch will not accept a floating point number here, only integers. You could try an integer by specifying "--threads 1" as an option to identify_chimeric_seqs.py (instead of "--threads 1.0").

I'll probably release vsearch version 1.8.1 later today that fixes a different problem. I'll include a fix for this problem as well, so that it will accept a floating point number for the threads option.

- Torbjørn

Diana

Nov 2, 2015, 8:12:59 AM

Hi Torbjørn,

Thanks for the suggestion. I reran the command using an integer and still get the error message, along with the same output files. The log files are included below for reference.

Diana

qiime@qiime-190-virtual-box:~/Desktop/TestFolder/Seqs$ identify_chimeric_seqs.py -i seqs.fna -m usearch61 -o usearch_check_chimeras/ -r rdp_gold.fa --threads 1

Traceback (most recent call last):
File "/usr/local/bin/identify_chimeric_seqs.py", line 354, in <module>
    main()
File "/usr/local/bin/identify_chimeric_seqs.py", line 350, in main
    threads=threads)
File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 774, in usearch61_chimera_check
    log_lines, verbose, threads)

File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 891, in identify_chimeras_usearch61
    consout_filepath=output_consensus_fp)
File "/usr/local/lib/python2.7/dist-packages/bfillings/usearch.py", line 2278, in usearch61_smallmem_cluster
    app = Usearch61(params, WorkingDir=output_dir, HALT_EXEC=HALT_EXEC)
File "/usr/local/lib/python2.7/dist-packages/burrito/util.py", line 201, in __init__
    self._error_on_missing_application(params)
File "/usr/local/lib/python2.7/dist-packages/burrito/util.py", line 468, in _error_on_missing_application
    "Is it in your path?" % command)
burrito.util.ApplicationNotFoundError: Cannot find usearch61. Is it installed? Is it in your path?
qiime@qiime-190-virtual-box:~/Desktop/TestFolder/Seqs$ PATH=/home/qiime/Desktop/vsearch-1.8.0/bin/:$PATH

qiime@qiime-190-virtual-box:~/Desktop/TestFolder/Seqs$ identify_chimeric_seqs.py -i seqs.fna -m usearch61 -o usearch_check_chimeras/ -r rdp_gold.fa --threads 1

Traceback (most recent call last):
File "/usr/local/bin/identify_chimeric_seqs.py", line 354, in <module>
    main()
File "/usr/local/bin/identify_chimeric_seqs.py", line 350, in main
    threads=threads)
File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 774, in usearch61_chimera_check
    log_lines, verbose, threads)
File "/usr/local/lib/python2.7/dist-packages/qiime/identify_chimeric_seqs.py", line 961, in identify_chimeras_usearch61
    HALT_EXEC=HALT_EXEC)
File "/usr/local/lib/python2.7/dist-packages/bfillings/usearch.py", line 2411, in usearch61_chimera_check_ref
    app_result = app()
File "/usr/local/lib/python2.7/dist-packages/burrito/util.py", line 285, in __call__
    'StdErr:\n%s\n' % open(errfile).read())
burrito.util.ApplicationError: Unacceptable application exit status: 1
Command:

cd "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/"; usearch61 --mindiffs 3 --uchime_ref "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta" --minh 0.28 --xn 8.0 --minseqlength 64 --threads 1.0 --mindiv 0.8 --uchimeout "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_ref.uchime" --dn 1.4 --strand plus --db "/home/qiime/Desktop/TestFolder/Seqs/rdp_gold.fa" --log "/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_ref.log" > "/tmp/tmpQPOLftAfyAby92GdKfCO.txt" 2> "/tmp/tmp449gKCICNB8cZ0KyECY8.txt"

StdOut:

StdErr:

Fatal error: Illegal option argument

1. seqs.fna_chimeras_denovo.log

vsearch v1.8.0_linux_x86_64, 4.4GB RAM, 2 cores

usearch61 --mindiffs 3 --uchime_denovo /home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta --minh 0.28 --xn 8.0 --minseqlength 64 --mindiv 0.8 --abskew 2.0 --uchimeout /home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_denovo.uchime --dn 1.4 --log /home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_chimeras_denovo.log
Started Mon Nov 2 05:59:00 2015
166166 nt in 730 seqs, min 71, max 283, avg 228

    0.28 minh
    8.00 xn
    1.40 dn
    1.00 xa
    0.80 mindiv
    0.55 id
       2 maxp

/home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_consensus_fixed.fasta: 233/730 chimeras (31.9%)

Finished Mon Nov 2 05:59:00 2015

Elapsed time 00:00
Max memory 5.2MB

2. seqs.fna_smallmem_clustered.log

vsearch v1.8.0_linux_x86_64, 4.4GB RAM, 2 cores

usearch61 --maxaccepts 1 --consout /home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_consensus_with_abundance.fasta --usersort --id 0.97 --sizeout --minseqlength 64 --wordlength 8 --uc /home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_consensus_with_abundance.uc --cluster_smallmem /home/qiime/Desktop/TestFolder/Seqs/seqs.fna --maxrejects 8 --strand plus --log /home/qiime/Desktop/TestFolder/Seqs/usearch_check_chimeras/seqs.fna_smallmem_clustered.log
Started Mon Nov 2 05:58:22 2015
40848859 nt in 173083 seqs, min 163, max 283, avg 236

      Alphabet nt
    Word width 8
     Word ones 8
        Spaced No
        Hashed No
         Coded No
       Stepped No
         Slots 65536 (65.5k)
       DBAccel 100%

Clusters: 730 Size min 1, max 25992, avg 237.1
Singletons: 231, 0.1% of seqs, 31.6% of clusters

Finished Mon Nov 2 05:59:00 2015
Elapsed time 00:38
Max memory 87.2MB

Torbjørn Rognes

Nov 2, 2015, 10:48:10 AM

I've just released vsearch version 1.8.1 that should fix your problem by allowing a floating point argument to the threads option.

- Torbjørn

Diana

Nov 2, 2015, 1:17:36 PM

Hi Torbjørn,

Thank you! I reinstalled the new version and the qiime script now works.

I am now trying out vsearch independently of qiime, following workflow suggestions posted by vsearch users on the qiime forum. My first step was dereplication of the sequences (the input being the seqs.fna file I generated with the multiple_split_libraries_fastq.py script) using the command below. It seemed to work and I got the output file, but I also got a warning that I don't understand. I would appreciate your feedback.

Diana

qiime@qiime-190-virtual-box:~/Desktop/RSL_NewAnalyses/Mult_Split_Lib_Output2$ vsearch -derep_fulllength seqs.fna -output seqs.dereplicated.fna
vsearch v1.8.1_linux_x86_64, 4.4GB RAM, 2 cores
https://github.com/torognes/vsearch

Reading file seqs.fna 100%
604622440 nt in 2534485 seqs, min 163, max 283, avg 239
WARNING: 2579255 sequences shorter than 32 nucleotides discarded.
Dereplicating 100%
Sorting 100%
456971 unique sequences, avg cluster 5.5, median 1, max 13174
Writing output file 100%

Torbjørn Rognes

Nov 2, 2015, 2:12:32 PM

Hi Diana

For some commands (dereplication, search and clustering), vsearch will automatically discard all sequences shorter than a certain length (default 32). This is similar to usearch.

It seems like more than half of your sequences (2579255 discarded and 2534485 kept) are shorter than 32 nucleotides. That's a lot. Is it correct?

You can set this limit lower if you want to, using the "minseqlength" option, e.g. "--minseqlength 1" to include all.
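
To see how many sequences fall below the length limit before choosing a value for minseqlength, something like the following Python sketch could be used. It is not how vsearch is implemented; read_fasta and split_by_length are hypothetical helpers, and the default of 32 is taken from the warning above. The last lines work out the discarded share for the counts quoted above, which comes to about 50.4%.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, min_length=32):
    """Separate records into those at least min_length long and those shorter."""
    kept, discarded = [], []
    for header, sequence in records:
        target = kept if len(sequence) >= min_length else discarded
        target.append((header, sequence))
    return kept, discarded


# Tiny in-memory example: one 8 nt read and one 40 nt read.
example = [">read1", "ACGTACGT", ">read2", "ACGT" * 10]
kept, discarded = split_by_length(read_fasta(example))
print(len(kept), "kept,", len(discarded), "discarded")

# Share of discarded sequences for the counts reported above.
discarded_share = 2579255 / (2579255 + 2534485)
print("discarded share: %.1f%%" % (100 * discarded_share))
```

To run it on a real file, an open file handle can be passed instead of the example list, e.g. split_by_length(read_fasta(open("seqs.fna"))).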

- Torbjørn

Diana

Nov 2, 2015, 3:51:00 PM

Hi Torbjørn,

Thanks for taking a look. I think I found the problem - the seqs.fna input file contains the barcode sequences that were removed in the previous step (i.e. multiple_extract_barcodes.py). Not sure why, but I am working on correcting it.

Diana

Diana

Nov 2, 2015, 4:17:43 PM

Hi Torbjørn,

OK, I seem to have fixed the issue by moving the files containing the stripped primer sequences (not barcodes) into a separate directory and then running 'multiple_split_libraries_fastq.py' to generate a new seqs.fna file, which I used as input to vsearch for dereplication. The command is below; it looks like 18% of the sequences are retained as unique sequences (456971 of 2534485), and I hope this is not unusual.

In the next step I tried to remove singletons with the -sortbysize command, and the output file was empty. I am not sure what this means; am I running this command at the wrong point in the workflow? The output is below.

Diana

qiime@qiime-190-virtual-box:~/Desktop/RSL_NewAnalyses/Mult_Split_Lib_Output$ vsearch -derep_fulllength seqs.fna -output seqs.dereplicated.fna

vsearch v1.8.1_linux_x86_64, 4.4GB RAM, 2 cores
https://github.com/torognes/vsearch

Reading file seqs.fna 100%
604622440 nt in 2534485 seqs, min 163, max 283, avg 239

Dereplicating 100%
Sorting 100%
456971 unique sequences, avg cluster 5.5, median 1, max 13174
Writing output file 100%

qiime@qiime-190-virtual-box:~/Desktop/RSL_NewAnalyses/Mult_Split_Lib_Output$ vsearch -sortbysize seqs.dereplicated.fna -minsize 2 -output seqs.derep.mc2.fasta

vsearch v1.8.1_linux_x86_64, 4.4GB RAM, 2 cores
https://github.com/torognes/vsearch

Reading file seqs.dereplicated.fna 100%
107904317 nt in 456971 seqs, min 163, max 283, avg 236
Getting sizes 100%
Sorting 100%
Median abundance: 0
Writing output 100%

Torbjørn Rognes

Nov 2, 2015, 4:26:27 PM

Hi Diana

You need to use the --sizeout option with --derep_fulllength for vsearch to include the abundance information in the output file. Then use the --sizein option with the --sortbysize command to read that information. vsearch will add abundance info to the fasta header by appending something like ";size=123;" at the end.

You can also dereplicate and remove singletons in one step by using the "--minuniquesize 2" option during dereplication.
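
A minimal Python sketch of what dereplication with size annotations amounts to, assuming exact string matching between sequences (vsearch's own comparison rules may differ). The function names are hypothetical; only the ";size=N;" header format is taken from the description above.

```python
import re
from collections import Counter

# Matches a size annotation such as ";size=123;" anywhere in a header.
SIZE_PATTERN = re.compile(r"(?:^|;)size=(\d+)")


def dereplicate(records, min_unique_size=1):
    """Collapse identical sequences and append ";size=N;" to the first header seen.

    Unique sequences seen fewer than min_unique_size times are dropped.
    """
    counts = Counter()
    first_header = {}
    for header, sequence in records:
        counts[sequence] += 1
        first_header.setdefault(sequence, header)
    return [
        ("%s;size=%d;" % (first_header[sequence], count), sequence)
        for sequence, count in counts.most_common()
        if count >= min_unique_size
    ]


def read_size(header):
    """Return the abundance from a ";size=N;" annotation, or None if absent."""
    match = SIZE_PATTERN.search(header)
    return int(match.group(1)) if match else None


# Three reads, two of them identical; keep only sequences seen at least twice.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "GGCC")]
for header, sequence in dereplicate(reads, min_unique_size=2):
    print(header, sequence, read_size(header))
```

The point of the sketch is that the abundance lives only in the header text: if it is never written during dereplication, a later step has nothing to read.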

- Torbjørn

Diana

Nov 6, 2015, 10:53:32 AM

Hi Torbjørn,

Thanks for that tip! I successfully processed my data through the vsearch workflow and was impressed by the speed and ease of use. After reviewing my workflow I have a question regarding the clustering step. I used the recentering strategy described by R. Edgar in the usearch documentation (http://drive5.com/usearch/manual6/recenter.html). Is this a valid approach within vsearch? I have copied the commands and output below.

Thanks,

Diana

qiime@qiime-190-virtual-box:~/Desktop/RSL_NewAnalyses/vsearch_output$ vsearch -cluster_fast seqs.dereplicated_min2.fna -consout cons.fasta -id 0.97 -sizeout

vsearch v1.8.1_linux_x86_64, 4.4GB RAM, 2 cores
https://github.com/torognes/vsearch

Reading file seqs.dereplicated_min2.fna 100%
19199207 nt in 81411 seqs, min 163, max 283, avg 236
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 520 Size min 1, max 7207, avg 156.6
Singletons: 54, 0.1% of seqs, 10.4% of clusters
Multiple alignments 100%

qiime@qiime-190-virtual-box:~/Desktop/RSL_NewAnalyses/vsearch_output$ vsearch -sortbysize cons.fasta -output cons_bysize.fasta

vsearch v1.8.1_linux_x86_64, 4.4GB RAM, 2 cores
https://github.com/torognes/vsearch

Reading file cons.fasta 100%
121634 nt in 520 seqs, min 45, max 283, avg 234
Getting sizes 100%
Sorting 100%
Median abundance: 54
Writing output 100%

qiime@qiime-190-virtual-box:~/Desktop/RSL_NewAnalyses/vsearch_output$ vsearch -cluster_smallmem cons_bysize.fasta -id 0.97 -centroids otus.fasta -usersort

vsearch v1.8.1_linux_x86_64, 4.4GB RAM, 2 cores
https://github.com/torognes/vsearch

Reading file cons_bysize.fasta 100%
121634 nt in 520 seqs, min 45, max 283, avg 234
Masking 100%
Counting unique k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 362 Size min 1, max 9, avg 1.4
Singletons: 289, 55.6% of seqs, 79.8% of clusters

Torbjørn Rognes

Nov 9, 2015, 7:38:31 AM

Hi Diana

This approach should be as valid for vsearch as it is with usearch. It seems to sort by length, cluster, sort by abundance and then cluster again.

You can also sort by abundance and cluster in one step in vsearch using the "cluster_size" command (instead of sortbysize + cluster_smallmem).
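
The sort-by-abundance step that precedes clustering can be sketched in Python as below: records are ordered by decreasing ";size=N;" value so that the most abundant sequences are considered first. The helper names are hypothetical, the clustering itself is not shown, and treating a header without an annotation as size 1 is an assumption of this sketch, not documented vsearch behaviour.

```python
import re

# Matches a size annotation such as ";size=123;" anywhere in a header.
SIZE_PATTERN = re.compile(r"(?:^|;)size=(\d+)")


def abundance(header, default=1):
    """Return the ";size=N;" abundance of a header, or `default` if absent."""
    match = SIZE_PATTERN.search(header)
    return int(match.group(1)) if match else default


def sort_by_size(records, min_size=1):
    """Return records with abundance >= min_size, most abundant first.

    Records with equal abundance keep their original relative order.
    """
    kept = [record for record in records if abundance(record[0]) >= min_size]
    return sorted(kept, key=lambda record: abundance(record[0]), reverse=True)


records = [("a;size=3;", "AAAA"), ("b;size=10;", "CCCC"), ("c", "GGGG")]
for header, sequence in sort_by_size(records, min_size=2):
    print(header, sequence)
```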

I would also recommend looking at other clustering algorithms, like Swarm: https://peerj.com/preprints/1222/

A peer-reviewed version of the paper has just been accepted and will be available in a few days.

- Torbjørn

Diana

Nov 12, 2015, 10:18:09 PM

Hi Torbjørn,

Thanks for the heads-up. I will give it a try... new tools appear faster than I can get through my data set :)

Diana
