OTU Table Labels

Marita White

Jun 26, 2024, 2:46:40 PM

to VSEARCH Forum

Hello,

I am using VSEARCH to process microbiome sequencing data. I have many different samples, and they are all in different files. I have successfully made OTU tables for each of my samples (referenced against the filtered, dereplicated, de-chimera-ed aggregate OTUs from all samples). However, the sample ID in each of my tables doesn't contain the full sample name, but truncates it at the first "-", resulting in many tables with the same sample label.

For example, this is the top of the OTU table for my first sample. As you can see, the sample is identified as "C1". However, the full name of the sample should be "C1-1-rhizo-ITS1". I have several other samples that start with the same characters (for example, "C1-1-soil-ITS1", "C1-2-rhizo-ITS1", etc.). Based on the VSEARCH documentation, I'm guessing that the sample ID label is being truncated at the first hyphen. Is there something I can do to work around this?

In case it's helpful, this is the code I used to make the OTU tables:

for i in *_filtered.fasta; do
    vsearch --usearch_global "${i}" --db zotus_nochime.fa --id 0.97 --otutabout "OTU_tables/${i/_*}_otu_counts_its.txt"
done

torognes

Jun 27, 2024, 4:24:50 AM

to VSEARCH Forum

Hi

VSEARCH usually extracts the sample names from the header lines in the FASTA files used as input. You can specify the sample name using the "sample" tag, like this:

>abc123;sample=C1-1-rhizo-ITS1;size=5

acgt...

In this case you can use any printable character except semicolon (;) in the name. Here VSEARCH will use "C1-1-rhizo-ITS1" as the sample name.

If you don't use the sample tag VSEARCH will use the initial part of the header, like this:

>C1-1-rhizo-ITS1

acgt...

But in this case only letters (A-Z, a-z), digits (0-9) and underscore (_) are allowed. The first illegal character and everything after it will be ignored. Here only "C1" will be used as the sample name. This is probably what has happened in your case.
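The fallback rule described here can be mimicked in the shell to predict which label a given header would produce. This is only a sketch of the rule as stated in this thread (keep the leading run of letters, digits and underscores), not VSEARCH's actual code, and the function name fallback_sample is made up for illustration.

```shell
# Hypothetical helper: emulate the fallback rule described in this thread.
fallback_sample() {
    # Strip a leading ">" and keep only the leading run of allowed characters.
    printf '%s\n' "${1#>}" | sed -E 's/^([A-Za-z0-9_]*).*$/\1/'
}

fallback_sample ">C1-1-rhizo-ITS1"   # prints C1
fallback_sample ">C1_1_rhizo_ITS1"   # prints C1_1_rhizo_ITS1
```

Running it over the header lines of a file (for example with grep '^>') is a quick way to spot samples that would collapse into the same label.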

I'd advise you to put the sample name after the "sample" tag as in my first example. Alternatively, replace the dashes with underscores.
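Either option can in principle be applied to existing FASTA files with sed, without re-running earlier steps. The sketch below is an illustration rather than a tested recipe for any particular dataset: the function names are hypothetical, it assumes the headers carry no "sample" tag yet, and it assumes the sample name contains no "/" or "&" (which sed would interpret).

```shell
# Option 1 (hypothetical helper): append ";sample=NAME" to every header line.
add_sample_tag() {
    sed -E "/^>/ s/\$/;sample=${1}/"
}

# Option 2 (hypothetical helper): replace dashes with underscores,
# in header lines only, leaving sequence lines untouched.
dashes_to_underscores() {
    sed -E '/^>/ s/-/_/g'
}

printf '>abc123;size=5\nacgt\n' | add_sample_tag "C1-1-rhizo-ITS1"
printf '>C1-1-rhizo-ITS1\nacgt\n' | dashes_to_underscores
```

The first call turns the header into ">abc123;size=5;sample=C1-1-rhizo-ITS1"; the second turns it into ">C1_1_rhizo_ITS1". In a real run the input would be a file per sample, with the sample name derived from the file name.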

- Torbjørn

Frédéric Mahé

Jun 27, 2024, 4:45:53 AM

to VSEARCH Forum

Hello,

building on Torbjørn's answer, here is a minimalist but complete pipeline.

- create two minimalist samples,
- use the option --sample to add arbitrary header labels (dashes do not matter),
- concatenate samples (cat),
- use the command --usearch_global to compare the sample sequences with a fake database (--db),
- use the option --otutabout to produce an occurrence table with the correct sample names.

SAMPLE1=$(mktemp)
SAMPLE2=$(mktemp)

printf ">s1\nA\n" | \
vsearch \
--fastx_filter - \
--sample "sample1-ITS1-good" \
--quiet \
--fastaout ${SAMPLE1}
printf ">s1\nA\n" | \
vsearch \
--fastx_filter - \
--sample "sample2-ITS1-good" \
--quiet \
--fastaout ${SAMPLE2}

cat ${SAMPLE1} ${SAMPLE2} | \
vsearch \
--usearch_global - \
--db <(printf ">s\nA\n") \
--minseqlength 1 \
--id 1.0 \
--quiet \
--otutabout - 2> /dev/null

rm ${SAMPLE1} ${SAMPLE2}

The table produced:

#OTU ID sample1-ITS1-good sample2-ITS1-good
s 1 1

Marita White

Jun 27, 2024, 12:01:56 PM

to VSEARCH Forum

Thanks both of you for your helpful and quick replies!

I ended up adding the --sample option to my filtering step, which added a sample name to my files, then made the OTU tables as I had previously:

for i in *merged.fq; do
    vsearch \
        --fastq_filter "${i}" \
        --fastq_maxee 1 \
        --fastq_minlen 200 \
        --fastq_qmax 42 \
        --sample "${i/_*}" \
        --fastaout "${i/_*}_filtered.fasta"
done
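For anyone reusing this loop: the value passed to the sample option comes from the file name. The bash expansion ${i/_*} deletes the first underscore and everything after it, so the dashes in the name survive. A quick check with a made-up file name, written with the POSIX form ${i%%_*}, which gives the same result and also works outside bash:

```shell
i="C1-1-rhizo-ITS1_merged.fq"   # hypothetical file name
sample="${i%%_*}"               # drop the first "_" and everything after it
echo "${sample}"                # prints C1-1-rhizo-ITS1
```

Note that this only works if the sample name itself contains no underscore; a file such as C1_1_merged.fq would be reduced to C1.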
