Clustering unique ASV sequences (no abundances)

447 views
Skip to first unread message

Peter Pellitier

unread,
Feb 18, 2021, 2:32:37 PM2/18/21
to VSEARCH Forum
Hi VSEARCH,

I have approximately 1.5 million unique ASV sequences processed using DADA2, and I would like to cluster them into 97% OTU. I have a fasta file where each line is an individual ASV, but no abundance information is contained for each ASV.  I would like to pass this fasta file to cluster_size to cluster and obtain a list of ASV that map to each OTU. From there, the idea would be to map back the ASV into the appropriate OTU cluster and collapse an ASV abundance table into and OTU abundance table. Do I need frequency based information for each ASV sequence in the clustering input?

I think this may be useful for other folks who want to go from Dada2/phyloseq to 97% clustering and back into phyloseq. 
Thanks,
Peter

Peter Pellitier

unread,
Feb 22, 2021, 2:25:08 PM2/22/21
to VSEARCH Forum
Hi all,

I would like to repost my question: I am trying to do a simple 97% clustering approach with ITS ASV that vary in length. I have each ASV as the fasta input (DOB.fasta) into the following command. Can you please advise how I would be able to get the list of input ASV sequences that map onto individual OTU clusters?

Thanks,
Peter

VSEARCH --cluster_size DOB.fasta \
    --threads 4 \
    --id 0.97 \
    --strand plus \
    --sizein \
    --sizeout \
    --fasta_width 0 \
    --uc all.clustered.uc \
    --relabel OTU_ \
    --centroids all.otus.fasta \
    --otutabout all.otutab.txt

Torbjørn Rognes

unread,
Feb 23, 2021, 3:59:01 AM2/23/21
to VSEARCH Forum
Hi Peter,

I think you can obtain the information you want from the ".uc" file produced during clustering (all.clustered.uc in the example).

The lines starting with an S (and C) in these lines contains the label of the seed (centroid) sequences in the 9th column (separated by tabs).
The lines starting with an H contains the label of the other sequences belonging to a cluster in the 9th column and the corresponding seed in the 10th column
There is more info about the uc files here: https://www.drive5.com/usearch/manual/opt_uc.html

Best wishes,
- Torbjørn

Peter Pellitier

unread,
Feb 23, 2021, 11:38:55 AM2/23/21
to VSEARCH Forum

Thank you,

I'm a bit confused however, how the label of the query sequence is generated (9th column). The input sequences I provided have fasta headers as sequential integers, ASV1, ASV2, ASV3, etc.
Are the zero's for the ASV labels place holders somehow?

Screen Shot 2021-02-23 at 8.26.04 AM.png

Torbjørn Rognes

unread,
Feb 24, 2021, 11:20:19 AM2/24/21
to VSEARCH Forum
Could it be that you have more than a million input sequences labeled ASV1, ASV2, ... ASV10, ... ASV100, ... ASV1000, ... ASV1000011, ... and that they are shown in alphabetical and not numerical order?

Peter Pellitier

unread,
Feb 26, 2021, 4:04:20 PM2/26/21
to VSEARCH Forum
Thats it. Thank you! 
Reply all
Reply to author
Forward
0 new messages