Inconsistent reads between derep.fa and otutab?

45 views
Skip to first unread message

Hu Peiyao

unread,
Oct 11, 2023, 11:00:16 AM10/11/23
to VSEARCH Forum
Hi,
The reads number in the final OTUTAB for each sample is always less than derep.fa. I can't figure out the process from dereplication of the sample to the final clustering.
 
Thanks!

Colin J Brislawn

unread,
Oct 11, 2023, 2:26:52 PM10/11/23
to VSEARCH Forum
I wonder if this is a bug, or a default setting doing something unexpected?

Can you replicate this finding using the vsearch test data and post the result here?

Colin

Hu Peiyao

unread,
Oct 16, 2023, 4:29:02 AM10/16/23
to VSEARCH Forum
The analysis was presented below:

```sh
#######################
THREADS=5

PERL=$(which perl)
VSEARCH=$(which vsearch)
MAP=/public/home/pyhu/project/mGWAS/mGWAS-rawdata/mGWAS-vsearch/database/map.pl
REF=/public/home/pyhu/project/mGWAS/mGWAS-rawdata/mGWAS-vsearch/database/gold.fasta
SEQ_DIR=/public/home/pyhu/project/xjw/20230922-xjwabs/1_RawData/
sample_file=/public/home/pyhu/project/xjw/20230922-xjwabs/samplename
sample_id=$(cat $sample_file | awk -F , '{print $1}')
s=PR2-18S-rRNA-V4
echo
echo Dereplicate at sample level and relabel with sample_n

$VSEARCH --threads $THREADS \
--derep_fulllength $s.fsa \
--strand plus \
--output $s.derep.fasta \
--sizeout \
--uc $s.derep.uc \
--relabel $s. \
--fasta_width 0  


echo Sum of unique sequences in each sample: $(grep -c "^>" *.derep.fasta)

# At this point there should be one fasta file for each sample                  
# It should be quality filtered and dereplicated.                              

echo
echo ====================================
  echo Processing all samples together
echo ====================================
  cat *.derep.fasta > all.fasta

echo
echo Dereplicate across samples and remove singletons

$VSEARCH --threads $THREADS \
--derep_fulllength all.fasta \
--minuniquesize 2 \
--sizein \
--sizeout \
--relabel_keep \
--fasta_width 0 \
--uc all.derep.uc \
--output all.derep.fasta

echo Unique non-singleton sequences: $(grep -c "^>" all.derep.fasta)

## unoise the reads 2019.09.04
echo
echo Unoise the sequences


echo
echo Precluster at 98% before chimera detection

$VSEARCH --threads $THREADS \
--cluster_size all.derep.fasta \
--id 0.98 \
--strand plus \
--sizein \
--sizeout \
--fasta_width 0 \
--uc all.preclustered.uc \
--centroids all.preclustered.fasta

echo Unique sequences after preclustering: $(grep -c "^>" all.preclustered.fasta)



echo
echo De novo chimera detection

$VSEARCH --threads $THREADS \
--uchime_denovo all.preclustered.fasta \
--sizein \
--sizeout \
--fasta_width 0 \
--nonchimeras all.denovo.nonchimeras.fasta \

echo Unique sequences after de novo chimera detection: $(grep -c "^>" all.denovo.nonchimeras.fasta)

echo
echo Reference chimera detection

$VSEARCH --threads $THREADS \
--uchime_ref all.denovo.nonchimeras.fasta \
--db $REF \
--sizein \
--sizeout \
--fasta_width 0 \
--nonchimeras all.ref.nonchimeras.fasta

echo Unique sequences after reference-based chimera detection: $(grep -c "^>" all.ref.nonchimeras.fasta)

echo
echo Extract all non-chimeric, non-singleton sequences, dereplicated

$PERL $MAP all.derep.fasta all.preclustered.uc all.ref.nonchimeras.fasta > all.nonchimeras.derep.fasta

echo Unique non-chimeric, non-singleton sequences: $(grep -c "^>" all.nonchimeras.derep.fasta)

echo
echo Extract all non-chimeric, non-singleton sequences in each sample

$PERL $MAP all.fasta all.derep.uc all.nonchimeras.derep.fasta > all.nonchimeras.fasta

echo Sum of unique non-chimeric, non-singleton sequences in each sample: $(grep -c "^>" all.nonchimeras.fasta)

echo
echo Cluster at 97%, 99% and 100% and relabel with OTU_n, generate OTU table

$VSEARCH --threads $THREADS \
--cluster_size all.nonchimeras.fasta \
--id 0.97 \
--strand plus \
--sizein \
--sizeout \
--fasta_width 0 \
--uc all.clustered.uc \
--relabel OTU_ \
--relabel_keep \
--centroids all.otus.fasta \
--otutabout all.otutab.txt

echo Number of OTUs: $(grep -c "^>" all.otus*.fasta)
#######################


The number of sequences in each fasta file is as below:
=============================================
all.fasta 61048
all.derep.fasta 11695
all.nonchimeras.derep.fasta 11579
all.nonchimeras.fasta 11579
all.preclustered.fasta 5782
all.ref.nonchimeras.fasta 5699
all.denovo.nonchimeras.fasta 5699
all.otus.fasta 4695
=============================================

Hu Peiyao

unread,
Oct 16, 2023, 7:44:19 AM10/16/23
to VSEARCH Forum
In the OTUTAB-all.otutab.txt , column_sum = 56907 which is inconsistent with all.fasta (61048) or any other files.

Looking forward to your reply. THANKS!

Frédéric Mahé

unread,
Oct 17, 2023, 10:10:47 AM10/17/23
to VSEARCH Forum
Between all.fasta and all.otutab.txt, there is a global dereplication step where singletons are removed:

--minuniquesize 2  # discard sequences with a post-dereplication abundance value smaller than 2

Maybe this is the point where reads are seemingly lost?

Hu Peiyao

unread,
Oct 17, 2023, 1:22:15 PM10/17/23
to VSEARCH Forum
I got it, Thanks a lot!
Reply all
Reply to author
Forward
0 new messages