cluster_fast and cluster_size issue

emily...@gmail.com

Apr 24, 2019, 6:25:52 PM
to VSEARCH Forum
Hi all!

I've been clustering using cluster_fast or cluster_size after dereplicating my sequences using --derep_fulllength. I dereplicated using the loop:

for %i in (*.fasta) do vsearch --derep_fulllength %i --minuniquesize 1 --sizeout --relabel UNIQ_ --output %i.UNIQ --log %i.UNIQ_LOG

so my dereplicated sequences have the labels "UNIQ_##;size=N" 

This seems to work well. 

HOWEVER, 

I was expecting cluster_fast and cluster_size to then use the given abundances from the dereplicated sequence files, but I've noticed that this is not happening. 

I clustered using the loop: 

for %i in (*.UNIQ) do vsearch -cluster_size %i -centroids %i.OTU_min1 -sizein -minsize 1 -id 0.95 -uc %i.class_OTU_min1 -biomout %i.OTUTAB -threads 8

I've also tried the same with cluster_fast, both with and without the -sizein option.

In USEARCH, you could specify the sort order of your input file for clustering so that the most abundant sequences were chosen as the cluster centroids, and I was expecting that giving the clustering command the sorted dereplicated sequence file would lead to the same outcome.

I just looked at all of my VSEARCH generated -uc files that I've run recently, and I've noticed that the order is by increasing alpha-numeric label order. The abundance is NOT being taken into account. 

Is there an option that I'm missing to make sure that cluster centroids are the most abundant sequence in the cluster, and that the labels are being ignored? 

Also, if you choose the -relabel option for clustering and do not check the -uc file, it is not obvious that the sort order was by alphanumeric label and not by sequence abundance, which is why I initially missed this issue.

Thank you! 
Emily

Colin Brislawn

Apr 24, 2019, 7:38:01 PM
to VSEARCH Forum
Hello Emily,

This is a pretty major bug! The --cluster_size function should sort the dereplicated reads by size before clustering them. 

I wonder if the uc file is showing the input reads in alphanumeric order, even though they were really clustered by size... That could explain the discrepancy without a bug occurring.

Colin

emily...@gmail.com

Apr 24, 2019, 7:43:44 PM
to VSEARCH Forum

I’m pretty sure they weren’t sorted. I didn’t relabel the centroids, and they appear in the same order as in the -uc file. For example, the first entry in the -uc file is UNIQ_100000000;size=1 and the first centroid is also UNIQ_100000000;size=1.

Usually (in USEARCH) the first centroid would be UNIQ_1, because it has the largest size.

toro...@gmail.com

Apr 25, 2019, 10:06:44 AM
to VSEARCH Forum
Hi Emily

Thank you for reporting this problem.

It looks as if VSEARCH for some reason did not recognize the abundance numbers in your dereplicated files, even though you included the "--sizeout" option when dereplicating and "--sizein" when clustering. It seems that all sequences have been given size=1, which would be the case if VSEARCH does not find the abundance in the headers.

The only idea I have about what could be wrong is this: VSEARCH requires the size-attribute (with the abundance), e.g. ";size=12345", to come before the first space on the header lines. Earlier, some people have had problems because there was a space in the beginning of the header and the size-attribute was at the end.
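
To make the header rule above concrete, here is a minimal Python sketch of reading an abundance from the part of a header that comes before the first space. The parse_size helper and its regular expression are illustrative assumptions, not VSEARCH's actual parser. The fallback value of 1 mirrors the behaviour described above when no abundance is found.

```python
import re

# Hypothetical pattern: a "size=N" attribute delimited by ";" or the label ends.
SIZE_RE = re.compile(r"(?:^|;)size=(\d+)(?:;|$)")

def parse_size(header: str) -> int:
    """Return the abundance found before the first space, or 1 if none is found."""
    # Only the text up to the first space is inspected.
    label = header.lstrip(">").split(" ", 1)[0]
    match = SIZE_RE.search(label)
    return int(match.group(1)) if match else 1

print(parse_size(">UNIQ_1;size=5513"))             # 5513
print(parse_size(">UNIQ_1 some text;size=5513"))   # 1, size comes after a space
print(parse_size("> UNIQ_1;size=5513"))            # 1, header starts with a space
```

Running a check like this over the header lines of a dereplicated file is a quick way to see whether the annotations sit where the clustering step expects them.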

To investigate what has happened, could you please show the first few lines of one of the dereplicated input files (using e.g. "head prefix.UNIQ", where prefix is the beginning of the name of one of the files)?

What version of VSEARCH are you using? On which platform (Linux x86/ARM/POWER, Mac, Windows)? Look at the first line of output from "vsearch -v".

I am also a bit curious about why you write %i for the loop variable. Which scripting language is this?

When clustering, VSEARCH shall sort the sequences in order of decreasing abundance or length when using cluster_size and cluster_fast, respectively. If two sequences are equal by abundance or length, they should be sorted alphanumerically by the header, as you observe. With the cluster_smallmem command and the "--usersort" option it will use the original order.
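
As a rough illustration of the ordering described above for cluster_size, the following Python sketch sorts records by decreasing abundance and breaks ties by header. The function name and the plain string comparison used for ties are assumptions made for this example, not VSEARCH's implementation. The labels and sizes are taken from headers posted in this thread.

```python
# (label, abundance) pairs; values copied from headers posted in this thread.
records = [
    ("UNIQ_1019", 16),
    ("UNIQ_1", 5513),
    ("UNIQ_10001", 1),
    ("UNIQ_1000", 32),
]

def cluster_size_order(recs):
    """Sort by decreasing abundance, then by label for equal abundances."""
    return sorted(recs, key=lambda r: (-r[1], r[0]))

# Abundances recognised: the most abundant sequence comes first.
ordered = cluster_size_order(records)
print([label for label, _ in ordered])
# ['UNIQ_1', 'UNIQ_1000', 'UNIQ_1019', 'UNIQ_10001']

# Abundances not recognised: every record counts as 1, so only the label decides.
unrecognised = cluster_size_order([(label, 1) for label, _ in records])
print([label for label, _ in unrecognised])
# ['UNIQ_1', 'UNIQ_1000', 'UNIQ_10001', 'UNIQ_1019']
```

The second ordering is what a purely label-sorted -uc file would look like, which is consistent with all sizes being treated as 1.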

Best,

- Torbjørn

emily...@gmail.com

Apr 25, 2019, 1:40:08 PM
to VSEARCH Forum
I'm using VSEARCH v2.10.4 for Windows.

The loops are just for Windows cmd.

The first four sequences from my dereplicated input file are below: 


>UNIQ_1;size=5513
TACGTAGGTGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGATAGCAAGTCAGCTGTGAAAACTA
TGGGCTCAACCCATAAACTGCAGTTGAAACTGTTATTCTTGAGTGGAGTAGAGGCAAGCGGAATTCCGAGTGTAGCGGTG
AAATGCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGGCTCTAACTGACGCTGAGGCTCGAAAGTGTG
GGGAGCAAACA
>UNIQ_2;size=3036
TACGTAGGGGGCAAGCGTTGTCCGGAATAATTGGGCGTAAAGGGCGCGTAGGCGGCTCGGTAAGTCTGGAGTGAAAGTCC
TGCTTTTAAGGTGGGAATTGCTTTGGATACTGTCGGGCTTGAGTGCAGGAGAGGTTAGTGGAATTCCCAGTGTAGCGGTG
AAATGCGTAGAGATTGGGAGGAACACCAGTGGCGAAGGCGACTAACTGGACTGTAACTGACGCTGAGGCGCGAAAGTGTG
GGGAGCAAACA
>UNIQ_3;size=2810
TACGTAGGTGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGATAGCAAGTCAGCTGTGAAAACTA
TGGGCTCAACCCATAAACTGCAGTTGAAACTGTTATTCTTGAGTGGAGTAGAGGCAAGCGGAATTCCGAGTGTAGCGGTG
AAATGCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGGCTCTAACTGACGCTGAGGCTCGAAAGGGTG
GGGAGCAAACA
>UNIQ_4;size=1315
TACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGTGTAGCCGGGAGGGCAAGTCAGATGTGAAATCCA
CGGGCTCAACTCGTGAACTGCATTTGAAACTACTCTTCTTGAGTATCGGAGAGGCAATCGGAATTCCTAGTGTAGCGGTG
AAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGATTGCTGGACGACAACTGACGGTGAGGCGCGAAAGCGTG
GGGAGCAAACA

emily...@gmail.com

Apr 25, 2019, 5:23:58 PM
to VSEARCH Forum
I just ran a quick check to see if it was the underscore causing the "sort by header" issue. 

I ran 

for %i in (*.UNIQ) do vsearch -cluster_fast %i -centroids %i.OTU_min1 -minsize 1 -id 0.95 -uc %i.class_OTU_min1 -biomout %i.OTUTAB -threads 8

for %i in (*.UNIQ) do vsearch -cluster_size %i -centroids %i.OTU_min1 -sizein -minsize 1 -id 0.95 -uc %i.class_OTU_min1 -biomout %i.OTUTAB -threads 8

for %i in (*.UNIQ) do vsearch -cluster_size %i -centroids %i.OTU_min1 -minsize 1 -id 0.95 -uc %i.class_OTU_min1 -biomout %i.OTUTAB -threads 8

for %i in (*.UNIQ) do vsearch -cluster_size %i -centroids %i.OTU_min1 -sizein -minsize 1 -id 0.95 -uc %i.class_OTU_min1 -biomout %i.OTUTAB -threads 8

for %i in (*.UNIQ) do vsearch -cluster_smallmem %i -centroids %i.OTU_min1 -usersort -minsize 1 -id 0.95 -uc %i.class_OTU_min1 -biomout %i.OTUTAB -threads 8

all with a file of dereplicated sequences formatted as below (the relabel and size annotations were generated by VSEARCH). 

The only command that used the size sort order was the last command, using -cluster_smallmem with --usersort. 



>UNIQ1;size=1324
TACGGGGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGCAAGGTAAGTCAGCGGTGAAAGACC
AGGGCTCAACTCTGGAAGTGCCGTTGATACTGTCTGGCTAGAATGATTCCGCCGTGGGAGGAATGAGTAGTGTAGCGGTG
AAATGCATAGATATTACTCAGAACACCGATTGCGAAGGCATCTCACGAGGGGTTCATTGACGCTGAGGCACGAAAGCGTG
GGGATCGAACA
>UNIQ2;size=1269
TACGGAAGGTCCAGGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGCAGGCGGCGGCGTAAGTCAGTTGTGAAATCGT
GCGGCTTAACCGTGCAATTGCAGTTGATACTGCGTCGCTTGAGTGCACACAGGGATGTTGGAATTCATGGTGTAGCGGTG
AAATGCTTAGATATCATGAAGAACTCCGATCGCGAAGGCATATGTCCGGAGTGCAACTGACGCTGAGGCTCGAAAGTGTG
GGTATCAAACA
>UNIQ3;size=1017
TACGGAGGATTCAAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGCGGTTAGATAAGTTAGAGGTGAAATCCC
GGGGCTTAACTCCGAAATTGCCTCTAATACTGTTTGACTAGAGAGTAGTTGCGGTAGGCGGAATGTATGGTGTAGCGGTG
AAATGCTTAGAGATCATACAGAACACCGATTGCGAAGGCAGCTTACCAAACTATATCTGACGTTGAGGCACGAAAGCGTG
GGGAGCAAACA
>UNIQ4;size=878
TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGTGCGTAGGTGGCAAGGCAAGTCTGAAGTGAAAATCC
GGGGCTCAACCCCGGAACTGCTTTGGAAACTGTTTAGCTGGAGTACAGGAGAGGTAAGTGGAATTCCTAGTGTAGCGGTG
AAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGACTTACTGGACTGCTACTGACACTGAGGCACGAAAGCGTG
GGGAGCAAACA

toro...@gmail.com

Apr 26, 2019, 7:45:02 AM
to VSEARCH Forum
I've tried to reproduce the problem, but I am not able to.

I have run vsearch 2.10.4 on Windows 7 (technically through the VirtualBox virtual machine on a Mac) with those four sequences (shuffled) as input and I always get the correct result.

This is probably unrelated to the problem, but the "-minsize" option is ignored by this command. It is only used with the fastx_filter command. In version 2.13.0 it is reported as an error. It should be removed.

I am sorry, but I have no idea about where the error is.

Could you try using the latest version (2.13.0)?

- Torbjørn

emily...@gmail.com

Apr 26, 2019, 12:07:02 PM
to VSEARCH Forum
Hi there! 

Could you try again with the following?

>UNIQ_1;size=5513
TACGTAGGTGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGATAGCAAGTCAGCTGTGAAAACTA
TGGGCTCAACCCATAAACTGCAGTTGAAACTGTTATTCTTGAGTGGAGTAGAGGCAAGCGGAATTCCGAGTGTAGCGGTG
AAATGCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGGCTCTAACTGACGCTGAGGCTCGAAAGTGTG
GGGAGCAAACA
>UNIQ_1000;size=32
TACGTAGGGGGCAAGCGTTGTCCGGAATAATTGGGCGTAAAGGGCGCGTAGGCGGCTCGGTAAGTCTGGAGTGAAAGTCC
TGCTTTTAAGGTGGGAATTGCTTTGGATACTGTCGGGCTTGAGTGCAGGAGAGGTTAGTGGAATTCCCAGTGTAGCGGTG
AAATGCGTAGAGATTGGGAGGAACACCAGTGGCGAAGGCGACTAACTGGACTGTAACTGACGCTGAGGCGCGAAAGTGTG
GGGAGCAAACA
>UNIQ_10001;size=1
TACGTAGGTGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGAGCGCAGGCGGGATAGCAAGTCAGCTGTGAAAACTA
TGGGCTCAACCCATAAACTGCAGTTGAAACTGTTATTCTTGAGTGGAGTAGAGGCAAGCGGAATTCCGAGTGTAGCGGTG
AAATGCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGGCTCTAACTGACGCTGAGGCTCGAAAGGGTG
GGGAGCAAACA
>UNIQ_1019;size=16
TACGTAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGCGTGTAGCCGGGAGGGCAAGTCAGATGTGAAATCCA
CGGGCTCAACTCGTGAACTGCATTTGAAACTACTCTTCTTGAGTATCGGAGAGGCAATCGGAATTCCTAGTGTAGCGGTG
AAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGATTGCTGGACGACAACTGACGGTGAGGCGCGAAAGCGTG
GGGAGCAAACA

toro...@gmail.com

Apr 29, 2019, 6:45:31 AM
to VSEARCH Forum
Hi

I've tried the new sequences, and they work well, both on Windows and Mac.

Notice that you need to add the --sizeout option for the total cluster abundances to be shown in the centroids file. I don't know if this matters for you.

I am sorry that I cannot help you. Perhaps someone else is able to reproduce it.

- Torbjørn

emily...@gmail.com

Apr 29, 2019, 1:25:09 PM
to VSEARCH Forum
So strange! 

Any idea why mine may not be working correctly? 

I just ran with the sequences above in Test_Sequences.txt

vsearch -cluster_size Test_Sequences.txt -centroids Test_Sequences.txt.OTU_min1  -sizeout -id 0.97 -uc Test_Sequences.txt.class_OTU_min1 -biomout Test_Sequences.txt.OTUTAB

and my -uc file looks like this: 

[Attached image: Uc file.PNG]


and my -centroids file looks like this: 

[Attached image: Centroids File.PNG]


To me it looks like it is interpreting the ;size=# from the dereplication as part of the header, and doesn't recognise it is the size... 
Also, the new cluster sizes are put on a new line, instead of with the header. I haven't done anything weird to the headers that would add in a hidden character, they're just passed from the dereplication step to the clustering step. 

Emily 

toro...@gmail.com

Apr 30, 2019, 3:19:03 AM
to VSEARCH Forum
This looks interesting. The size labels on the second lines are not how they should be. Perhaps this is something Windows-related. I'll look into it.

- Torbjørn

toro...@gmail.com

Apr 30, 2019, 4:07:05 AM
to VSEARCH Forum
Hi Emily

I have found the bug and am working on a fix. I will release a new version later today.

I have opened an issue on GitHub for it:


Thanks again for reporting this bug!

- Torbjørn

toro...@gmail.com

Apr 30, 2019, 7:35:26 AM
to VSEARCH Forum
The problem should be fixed in vsearch version 2.13.2, which was just released. Please install this version and test it.


You may need to perform the dereplication again and then the clustering.

The old version did not properly remove the carriage return (CR) characters (ASCII character 13) that are present at the end of lines on Windows systems. When the size attribute was added at the end of the header with the sizeout option, the header got an illegal format.
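
The effect described above can be sketched in a few lines of Python. This is only an illustration of the failure mode, not VSEARCH's code. The variable names and the has_carriage_returns helper are assumptions introduced for this example.

```python
# A header line as it appears in a file with Windows (CRLF) line endings.
raw_header = ">UNIQ_1\r\n"

# Removing only the newline leaves the CR inside the header, so the size
# attribute is appended after a stray control character.
kept_cr = raw_header.rstrip("\n") + ";size=5513"

# Removing CR and LF first yields a well-formed header.
stripped = raw_header.rstrip("\r\n") + ";size=5513"

def has_carriage_returns(data: bytes) -> bool:
    """Return True if the raw file contents contain any CR byte (13)."""
    return b"\r" in data

print(repr(kept_cr))    # '>UNIQ_1\r;size=5513'
print(repr(stripped))   # '>UNIQ_1;size=5513'
print(has_carriage_returns(b">UNIQ_1\r\nACGT\r\n"))  # True
```

A CR in the middle of a header may be rendered as a line break by some viewers, which would fit the size labels appearing on a second line. Checking the raw bytes of an input file for CR characters is one way to tell whether it is affected.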

Sorry for this. I hope this resolves the case for you.

- Torbjørn

toro...@gmail.com

Apr 30, 2019, 8:05:25 AM
to VSEARCH Forum
Sorry, please use version 2.13.3 instead:


I introduced a new bug in version 2.13.2.

- Torbjørn
