Hi all!
I've been clustering using cluster_fast or cluster_size after dereplicating my sequences using --derep_fulllength. I dereplicated using this loop:
for %i in (*.fasta) do vsearch --derep_fulllength %i --minuniquesize 1 --sizeout --relabel UNIQ_ --output %i.UNIQ --log %i.UNIQ_LOG
so my dereplicated sequences have the labels "UNIQ_##;size=N"
This seems to work well.
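For reference, the abundance can be read back out of labels of this form with a few lines of Python. This is only a minimal sketch: it assumes every annotated label ends in `;size=N` (optionally followed by `;`), and the helper names `label_size` and `total_abundance` are made up for illustration. Summing the sizes over a dereplicated file should give back the number of input reads, which is a quick sanity check on the dereplication step.

```python
import re

# Hypothetical helper names; assumes labels look like "UNIQ_12;size=34".
SIZE_RE = re.compile(r";size=(\d+);?$")

def label_size(label, default=1):
    """Return the abundance annotated on a label, or `default` if none is present."""
    match = SIZE_RE.search(label)
    return int(match.group(1)) if match else default

def total_abundance(fasta_lines):
    """Sum the size annotations over all header lines of a FASTA file."""
    return sum(
        label_size(line[1:].strip())
        for line in fasta_lines
        if line.startswith(">")
    )
```

Usage would be along the lines of `total_abundance(open("sample.fasta.UNIQ"))`, where the file name is just a placeholder.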
HOWEVER,
I was expecting cluster_fast and cluster_size to then use the given abundances from the dereplicated sequence files, but I've noticed that this is not happening.
I clustered using the loop:
for %i in (*.UNIQ) do vsearch -cluster_size %i -centroids %i.OTU_min1 -sizein -minsize 1 -id 0.95 -uc %i.class_OTU_min1 -biomout %i.OTUTAB -threads 8
I've also tried the same with cluster_fast, both with and without the -sizein option.
In USEARCH, you could specify the sort order of your input file for clustering so that the most abundant sequences were chosen as the cluster centroids, and I was expecting that giving the clustering command the sorted dereplicated sequence file would lead to the same outcome.
I just looked at all of the -uc files from my recent VSEARCH runs, and I've noticed that the records are in increasing alphanumeric label order. The abundance is NOT being taken into account.
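A check like this can be scripted rather than done by eye. The following is a minimal sketch, under these assumptions: the -uc file is the usual tab-separated format with the record type in column 1 and the query label in column 9, the S (centroid) records are written in the order the sequences were processed, and the labels still carry their `;size=N` annotation. The function names are hypothetical.

```python
import re

SIZE_RE = re.compile(r";size=(\d+)")

def centroid_sizes(uc_lines):
    """Return (label, size) for each S record, in file order.

    Assumes tab-separated UC records with the query label in column 9.
    Labels without a ";size=N" annotation are counted as size 1.
    """
    result = []
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[0] == "S":
            label = fields[8]
            match = SIZE_RE.search(label)
            result.append((label, int(match.group(1)) if match else 1))
    return result

def is_abundance_sorted(uc_lines):
    """True if the centroid records appear in non-increasing abundance order."""
    sizes = [size for _, size in centroid_sizes(uc_lines)]
    return all(a >= b for a, b in zip(sizes, sizes[1:]))
```

Running `is_abundance_sorted(open(path))` on each -uc file would flag the ones where centroids were not taken in order of decreasing abundance.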
Is there an option that I'm missing to make sure that cluster centroids are the most abundant sequence in the cluster, and that the labels are being ignored?
Also, if you choose the -relabel option for clustering and do not check the -uc file, it is not obvious that the sort order was by alphanumeric label and not by sequence abundance, which is why I initially missed this issue.
Thank you!
Emily