# dereplicate reads
vsearch --derep_fulllength seqs.fna --output derep.fna --minuniquesize 2 --fasta_width 0 --sizeout --relabel_keep;
# must sort by abundance because the search is done greedily
# see http://drive5.com/usearch/manual/uparseotu_algo.html
vsearch --fasta_width 0 --sortbysize derep.fna --output derep_sorted.fna --relabel_keep --notrunclabels;
# search against Green Genes to generate closed ref OTUs
# remember to set MAX_REJECTS, etc.
vsearch --fasta_width 0 --usearch_global derep_sorted.fna --threads 0 --dbmask none --qmask none --rowlen 0 --top_hits_only --notmatched closed_ref_fail.fna --db GG_97.fasta --id 0.97 --matched closed_ref.fna --uc closed_ref.uc --relabel_keep --notrunclabels;
# sort reads that failed closed ref by abundance
vsearch --fasta_width 0 --sortbysize closed_ref_fail.fna --output closed_ref_fail_sorted.fna --relabel_keep --notrunclabels;
# randomly subsample 10% of failed closed ref reads
# this will already be sorted by abundance since the input is sorted
vsearch --fasta_width 0 --fastx_subsample closed_ref_fail_sorted.fna --fastaout closed_ref_fail_subsample_sorted.fna --sample_pct 10 --relabel_keep --notrunclabels;
# cluster failed closed ref subsample reads -> ref DB for new ref OTU
vsearch --fasta_width 0 --cluster_size closed_ref_fail_subsample_sorted.fna --clusterout_id --consout new_ref_db.fna --id 0.97 --qmask none --relabel_keep --notrunclabels;
# search against new ref DB
# hits are considered New.ReferenceOTU
# failures are considered New.CleanUp
vsearch --fasta_width 0 --usearch_global closed_ref_fail_sorted.fna --threads 0 --dbmask none --qmask none --rowlen 0 --top_hits_only --notmatched new_ref_fail.fna --db new_ref_db.fna --id 0.97 --matched new_ref.fna --uc new_ref.uc --relabel_keep --notrunclabels;
# denovo cluster of new ref failures
vsearch --fasta_width 0 --cluster_size new_ref_fail.fna --clusterout_id --consout new_ref_cleanup.fna --id 0.97 --qmask none --uc new_ref_cleanup.uc --relabel_keep --notrunclabels;
Sequence labels must have sample identifiers (input set) and OTU identifiers (database) as explained later in this page. This means that you cannot use the input file to cluster_otus for this step because several samples often have the same unique sequence, so the dereplicated (unique) sequence labels either do not have a sample identifier, or have a misleading sample identifier because the same sequence may be found in other samples. The way to deal with this is usually to go back to the "raw" reads after merging or truncating to a fixed length. See sample identifiers for ways to add sample identifiers to the read labels.
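As a rough illustration of the point about sample identifiers, the sketch below prefixes every read label in one sample's FASTA file with a sample name and a running counter before the samples are pooled. The helper name `add_sample_ids` and the `sample_N` label layout are assumptions made for this example; they are not taken from the USEARCH page, so check which label format your downstream tool expects.

```shell
# Hypothetical helper: relabel reads as ">SAMPLE_N original_label".
# The "SAMPLE_N" layout is an assumption for illustration only.
# usage: add_sample_ids SAMPLE_NAME reads.fna > labelled.fna
add_sample_ids() {
    sample=$1
    awk -v s="$sample" '
        /^>/ { n++; print ">" s "_" n " " substr($0, 2); next }  # header line
             { print }                                            # sequence line
    ' "$2"
}
```

It would be run once per sample on the merged or truncated reads, and the outputs concatenated before the final mapping step.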
If a size annotation is found in a read label, the abundance will be added to the total for its OTU.
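To make the size rule concrete, here is a minimal sketch that sums the `;size=N` annotations per target OTU from a `.uc` file. It assumes the usual 10-column tab-separated UC layout (record type in column 1, query label in column 9, target label in column 10) and counts a read without an annotation as 1. The helper name `sum_sizes_per_otu` is hypothetical, and this is not how any particular tool builds its table internally.

```shell
# Hypothetical helper: total abundance per target OTU from hit ("H") records.
# Assumes 10 tab-separated UC columns: $1 = type, $9 = query, $10 = target.
# usage: sum_sizes_per_otu final.uc
sum_sizes_per_otu() {
    awk -F'\t' '
        $1 == "H" {
            size = 1                          # no annotation: count the read once
            if (match($9, /size=[0-9]+/))     # pull N out of ";size=N"
                size = substr($9, RSTART + 5, RLENGTH - 5) + 0
            total[$10] += size
        }
        END { for (otu in total) print otu "\t" total[otu] }
    ' "$1" | sort
}
```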
cat step1_otus/closed_ref.fna step2_otus/new_ref.fna step3_otus/new_ref_cleanup.fna > final.fna;
vsearch --fasta_width 0 --usearch_global derep_sorted.fna --threads 0 --dbmask none --qmask none --rowlen 0 --top_hits_only --db final.fna --id 0.97 --uc final.uc --relabel_keep --notrunclabels
biom from-uc -i final.uc -o final.biom --rep-set-fp final.fna;
ValueError: Not all sequence identifiers in the input BIOM file are present in description fields in the representative sequence fasta file.
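One way to start narrowing down an error like this is to list the target labels in the `.uc` file that never appear in any header of the representative FASTA file. The sketch below does only that. The helper name `missing_uc_targets` is hypothetical, and it makes no claim about which header field `biom` actually inspects; it simply checks whether each label occurs as a whole header or as a whitespace-separated token in one.

```shell
# Hypothetical helper: print UC target labels (column 10 of "H" records)
# that match no FASTA header, neither as the full header text nor as one
# of its whitespace-separated tokens.
# usage: missing_uc_targets final.uc final.fna
missing_uc_targets() {
    uc_file=$1
    fasta_file=$2
    awk -F'\t' '
        NR == FNR {                            # first file: the FASTA
            if (substr($0, 1, 1) == ">") {
                header = substr($0, 2)
                seen[header] = 1               # full header (for --notrunclabels)
                n = split(header, tok, /[ \t]+/)
                for (i = 1; i <= n; i++) seen[tok[i]] = 1
            }
            next
        }
        # second file: the UC; report each unmatched target label once
        $1 == "H" && !($10 in seen) && !reported[$10]++ { print $10 }
    ' "$fasta_file" "$uc_file"
}
```

If this prints nothing, the labels are at least present somewhere in the headers, and the mismatch is more likely about which field of the header is being compared.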