Clustering and OTU table generation


sonumu...@gmail.com

May 16, 2017, 9:48:51 AM
to VSEARCH Forum

Dear VSEARCH experts,

I have a question about clustering and OTU table generation using VSEARCH.

Using the procedure below, I generated a uc mapping file (see data below). I would now like to convert it into an OTU table with the original abundances. I tried several options (uc2otutable.py, map.pl, etc.) but was unable to produce an OTU table.

Could you please tell me which function or procedure in vsearch, or which alternative, I can use to make the final OTU table? I have heard that a pivot table also works, but I am not sure how to do it.

 

Also, please let me know if there is anything wrong with the clustering procedure I used here.

 

Looking forward to hearing from you!

 

Regards

Sunil



20. Dereplication using VSEARCH and removal of global singletons

vsearch -derep_fulllength clean.fasta -output clean_derep.fna -sizeout --minuniquesize 2

Writing output file 100%

245823 uniques written, 933142 clusters (global singletons) discarded (79.1%)

 

21. Clustering

module load vsearch/2.0.3

vsearch --cluster_size clean_derep.fasta --id 0.97 --sizein --sizeout --sizeorder --relabel OTU_ --centroids otus97_vsearch_repset.fasta --uc clean_derep_cluster_uc

Reading file clean_derep.fasta 100%

48008630 nt in 245823 seqs, min 100, max 297, avg 195

Masking 100%

Sorting by abundance 100%

Counting unique k-mers 100%

Clustering 100%

Sorting clusters 100%

Writing clusters 100%

Clusters: 10547 Size min 2, max 3470234, avg 23.3

Singletons: 0, 0.0% of seqs, 0.0% of clusters

 

21. Making the OTU table

vsearch -usearch_global clean_derep.fasta -db otus97_vsearch_repset.fasta -strand plus -id 0.97 -uc otu_table_mapping.uc



The UC mapping looks like this:


H 5 145 100.0 + 0 0 = A10_226;size=734721; OTU_6;size=917068;

H 145 146 99.3 + 0 0 146M A10_89;size=330416; OTU_146;size=37375;

H 4 192 100.0 + 0 0 = A10_539;size=762083; OTU_5;size=929140;

H 8 159 100.0 + 0 0 = A10_12;size=401560; OTU_9;size=458630;

H 2 213 100.0 + 0 0 = A10_34;size=1644162; OTU_3;size=2322780;

H 1 190 100.0 + 0 0 = A10_581;size=1975476; OTU_2;size=3250102;

H 4469 190 97.4 + 0 0 189MD A10_712;size=397379; OTU_4470;size=372;

H 7 214 100.0 + 0 0 = A10_171;size=416077; OTU_8;size=536604;

H 0 186 100.0 + 0 0 = A10_21;size=2391901; OTU_1;size=2852990;

H 3 191 100.0 + 0 0 = A10_820;size=1641279; OTU_4;size=3470234;

H 3 191 99.5 + 0 0 191M A10_936;size=287918; OTU_4;size=3470234;

H 8568 191 97.4 + 0 0 141M5I50M A10_598;size=223423; OTU_8569;size=10;

H 10 198 100.0 + 0 0 = A10_124;size=218744; OTU_11;size=278533;

H 12 146 100.0 + 0 0 = A10_88;size=156791; OTU_13;size=248216;

H 11 148 100.0 + 0 0 = A10_3;size=172717; OTU_12;size=337548;

H 13 144 100.0 + 0 0 = A10_139;size=145594; OTU_14;size=268905;

H 9 267 100.0 + 0 0 = A10_1325;size=389839; OTU_10;size=2642492;

H 3 191 99.5 + 0 0 191M A10_26295;size=147308; OTU_4;size=3470234;

H 9 281 100.0 + 0 0 14D267M A10_6781756;size=313058; OTU_10;size=2642492;

H 2746 267 98.5 + 0 0 14I78MD12MI176M A10_1224;size=200518; OTU_2747;size=46973;

H 3 191 99.5 + 0 0 191M A10_464;size=136128; OTU_4;size=3470234;

H 1 190 99.5 + 0 0 190M A10_685;size=134315; OTU_2;size=3250102;

H 15 147 100.0 + 0 0 = A10_3906;size=95598; OTU_16;size=133092;

H 3 191 99.5 + 0 0 191M A10_394;size=103326; OTU_4;size=3470234;

H 19 145 100.0 + 0 0 = A10_39;size=64603; OTU_20;size=97331;

H 9600 191 98.4 + 0 0 2D189M A10_747;size=96852; OTU_9601;size=156;

H 17 145 100.0 + 0 0 = A10_2947;size=71629; OTU_18;size=135810;

H 2 215 99.1 + 0 0 18M2D195M A10_122;size=125897; OTU_3;size=2322780;

H 2677 190 97.9 + 0 0 179MI11M A10_1919;size=76653; OTU_2678;size=1570;

H 108 148 99.3 + 0 0 139MI9M A10_261;size=89155; OTU_109;size=22316;

H 16 142 100.0 + 0 0 = A10_567;size=87479; OTU_17;size=100297;

H 2 214 99.5 + 0 0 18MD195M A10_431;size=90304; OTU_3;size=2322780;

H 2746 281 98.6 + 0 0 92MD12MI176M A10_6781947;size=156876; OTU_2747;size=46973;

H 14 192 100.0 + 0 0 = A10_258;size=125837; OTU_15;size=422795;

H 18 161 100.0 + 0 0 = A10_1677;size=66888; OTU_19;size=83076;

H 6 146 100.0 + 0 0 = A10_13;size=658161; OTU_7;size=1285564;

H 3832 145 97.2 + 0 0 145M4I A10_708;size=85814; OTU_3833;size=103;

H 11 147 99.3 + 0 0 138MI9M A10_108;size=75713; OTU_12;size=337548;

H 5183 145 97.2 + 0 0 142M3D A10_6;size=67760; OTU_5184;size=115;

H 20 139 100.0 + 0 0 = A10_816;size=60761; OTU_21;size=79621;

H 22 103 100.0 + 0 0 = A10_423;size=54082; OTU_23;size=57714;

H 23 144 100.0 + 0 0 = A10_416;size=53035; OTU_24;size=70642;

H 7336 191 97.9 + 0 0 191M A10_10991;size=43215; OTU_7337;size=142;

H 1 190 98.9 + 0 0 190M A10_2144;size=41371; OTU_2;size=3250102;

H 24 173 100.0 + 0 0 = A10_44;size=51278; OTU_25;size=62582;

H 27 148 100.0 + 0 0 = A10_2368;size=45170; OTU_28;size=76682;

H 347 144 97.9 + 0 0 I144M A10_1264;size=33918; OTU_348;size=4456;

H 28 158 100.0 + 0 0 = A10_5478;size=44799; OTU_29;size=80258;

H 7547 191 97.9 + 0 0 172MI19M A10_7590;size=32535; OTU_7548;size=276;

H 3 191 99.0 + 0 0 191M A10_447;size=38856; OTU_4;size=3470234;

H 31 146 100.0 + 0 0 = A10_192;size=32419; OTU_32;size=101444;

H 7080 146 97.3 + 0 0 45I146M A10_294;size=29998; OTU_7081;size=44;

H 33 145 100.0 + 0 0 = A10_1042;size=29135; OTU_34;size=38026;

H 34 146 100.0 + 0 0 = A10_3194;size=26760; OTU_35;size=72476;

H 9600 191 97.9 + 0 0 2D189M A10_5466;size=22991; OTU_9601;size=156;

H 6671 186 98.3 + 0 0 7D179M A10_117;size=25941; OTU_6672;size=205;

H 12 146 99.3 + 0 0 146M A10_52;size=28543; OTU_13;size=248216;



Torbjørn Rognes

May 18, 2017, 3:58:25 AM
to VSEARCH Forum
Hi

You can use the otutabout, mothur_shared_out or biomout option with the usearch_global command to produce an OTU table in the different formats. The query sequence headers must contain sample labels and the database sequence headers must contain otu labels. Please see the manual for details.

- Torbjørn
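
A minimal sketch of the labelling requirement, assuming read headers of the form `A10_3` (sample name, underscore, read number) and OTU headers such as `OTU_6;size=917068;`. As far as I understand the vsearch manual, the sample label comes from a `;sample=...;` annotation when one is present, and otherwise from the start of the header up to the first `_` or `.`; please check the manual for your version. The helper names `sample_of` and `otu_of` are hypothetical and only mimic that rule.

```python
import re

def sample_of(query_header):
    """Guess the sample label from a read header such as 'A10_3'."""
    # Assumed convention: an explicit ';sample=NAME;' annotation wins if present.
    match = re.search(r"(?:^|;)sample=([^;]+)", query_header)
    if match:
        return match.group(1)
    # Otherwise keep everything before the first '_', '.' or ';'.
    return re.split(r"[_.;]", query_header, maxsplit=1)[0]

def otu_of(db_header):
    """Drop annotations such as ';size=917068;' from an OTU header."""
    return db_header.split(";", 1)[0]

for header in ["A10_3", "A34_1254822", "A10_226;size=734721;"]:
    print(header, "->", sample_of(header))
print("OTU_6;size=917068;", "->", otu_of("OTU_6;size=917068;"))
```

Under these assumptions, headers like `>A10_3` and `>A34_1254822` already carry usable sample labels (`A10`, `A34`).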

sonumu...@gmail.com

May 18, 2017, 5:15:33 AM
to VSEARCH Forum
Thank you for your reply.
I am using vsearch/2.0.3, and it reports the --biomout and --otutabout options as unrecognised.

My main question was how to assign the original abundance data (which was reduced during dereplication) back to the OTU table. Using the above-mentioned steps (20, 21) I managed to produce an OTU table, but the abundance information is tied only to the dereplicated file, not to the original fasta file.

Looking forward to hearing from you!
Regards
Sunil

Torbjørn Rognes

May 18, 2017, 5:26:20 AM
to VSEARCH Forum
You need to use vsearch version 2.2.0 or later for the biomout, otutabout or mothur_shared_out options. These options will take the original abundances into account when producing the OTU table.

Make sure you use the "sizein" and "sizeout" options at all steps to propagate abundance information all the way.

There are scripts that convert uc files into OTU tables, but they are not included with vsearch.

- Torbjørn
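
For readers stuck on an older vsearch, here is a rough sketch of what such a uc-to-OTU-table script could look like. It is not one of the scripts mentioned above. It assumes the ten-column uc layout shown earlier in this thread (query label in column 9, target label in column 10), that the sample name is the part of the query label before the first underscore, and that a `;size=N;` annotation on the query means N reads. The function names and the example hits are made up for illustration.

```python
import re
from collections import defaultdict

SIZE_RE = re.compile(r";size=(\d+)")

def uc_to_otu_table(uc_lines):
    """Sum hits per (OTU, sample) from uc 'H' records, weighting each by its size."""
    table = defaultdict(lambda: defaultdict(int))
    for line in uc_lines:
        # uc files are tab-separated; fall back to whitespace for pasted text.
        fields = line.rstrip("\n").split("\t") if "\t" in line else line.split()
        if len(fields) < 10 or fields[0] != "H":
            continue  # skip blank lines and non-hit records
        query, target = fields[8], fields[9]
        match = SIZE_RE.search(query)
        size = int(match.group(1)) if match else 1  # no ;size= means one read
        sample = query.split("_", 1)[0]             # assumption: sample precedes first "_"
        otu = target.split(";", 1)[0]               # drop ;size=...; from the OTU label
        table[otu][sample] += size
    return table

def format_table(table):
    """Render the nested dict as a tab-separated OTU-by-sample matrix."""
    samples = sorted({s for row in table.values() for s in row})
    out = ["#OTU ID\t" + "\t".join(samples)]
    for otu in sorted(table):
        out.append(otu + "\t" + "\t".join(str(table[otu].get(s, 0)) for s in samples))
    return "\n".join(out)

# Made-up hits for illustration: the read labels follow the A10_3 style used in
# this thread, but the OTU assignments are invented.
example_uc = [
    "H 0 148 100.0 + 0 0 = A10_3 OTU_1;size=12;",
    "H 0 148 100.0 + 0 0 = A34_1254822 OTU_1;size=12;",
    "H 0 148 99.3 + 0 0 148M A41_1699888 OTU_1;size=12;",
    "H 0 148 100.0 + 0 0 = A41_1742307 OTU_1;size=12;",
    "H 1 91 100.0 + 0 0 = A4_2149621 OTU_2;size=5;",
]
print(format_table(uc_to_otu_table(example_uc)))
```

Note what happens if the queries are dereplicated sequences: a record such as `A10_226;size=734721;` adds all 734721 reads to sample A10, whichever samples they really came from. Per-sample counts only come out right when the original, non-dereplicated reads are the queries.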



sonumu...@gmail.com

May 18, 2017, 1:25:28 PM
to VSEARCH Forum
I managed to use the --otutabout option with the latest vsearch release, but the result still looks strange.

My original fasta sequence file looks like this (sample name before the underscore and sequence number after it).
Q1. Is this header format compatible with vsearch?
>A10_3
TAACCACTCAAGCTCTCGCTTGGTATTGGGGTGCGCGGCTCCGCGGCCCCTAAAGTCAGTGGCGGTGCCTGTCGGCTCTACGCGTAGTAATACTCCTCGCGTCTGGGTCCGGCCGGTTGCTTGCCAACAACCCCCAAATTTTTTTACA
>A34_1254822
TAACCACTCAAGCTCTCGCTTGGTATTGGGGTGCGCGGCTCCGCGGCCCCTAAAGTCAGTGGCGGTGCCTGTCGGCTCTACGCGTAGTAATACTCCTCGCGTCTGGGTCCGGCCGGTTGCTTGCCAACAACCCCCAAATTTTTTTACA
>A41_1699888
TAACCACTCAAGCTCTCGCTTGGTATTGGGGTGCGCGGCTCCGCGGCCCCTAAAGTCAGTGGCGGTGCCTGTCGGCTCTACGCGTAGTAATACTCCTCGCGTCTGGGTCCGGCCGGTTGCTTGCCAACAACCCCCAAATTTTTTTACA
>A41_1742307
TAACCACTCAAGCTCTCGCTTGGTATTGGGGTGCGCGGCTCCGCGGCCCCTAAAGTCAGTGGCGGTGCCTGTCGGCTCTACGCGTAGTAATACTCCTCGCGTCTGGGTCCGGCCGGTTGCTTGCCAACAACCCCCAAATTTTTTTACA
>A41_1758110
TAACCACTCAAGCTCTCGCTTGGTATTGGGGTGCGCGGCTCCGCGGCCCCTAAAGTCAGTGGCGGTGCCTGTCGGCTCTACGCGTAGTAATACTCCTCGCGTCTGGGTCCGGCCGGTTGCTTGCCAACAACCCCCAAATTTTTTTACA
>A48_2079057
TAACCACTCAAGCTCTCGCTTGGTATTGGGGTGCGCGGCTCCGCGGCCCCTAAAGTCAGTGGCGGTGCCTGTCGGCTCTACGCGTAGTAATACTCCTCGCGTCTGGGTCCGGCCGGTTGCTTGCCAACAACCCCCAAATTTTTTTACA
>A4_2149621
GTATTGGGGTGCGCGGCTCCGCGGCCCCTAAAGTCAGTGGCGGTGCCTGTCGGCTCTACGCGTAGTAATACTCCTCGCGTCTGGGTCCGGC

After derep_fulllength, the fasta file looks like this.
Is this the correct format after dereplication?
>A10_21;size=2391901;
AATTCTCAACCTTCAACTTTATTGATGAAGGCTTGGACTTGGAGGTTGTGTCGGCTCTTGTAGTCGACTCCTCTGAAATGCATTAGTGCGAACGTTACCAGCCGCTTCAGCGTGATAATTATCTGCGTTGCTGTGGAGGGTATTCTGGTGTTCACGCTTCGAACCGTCTTCGGACAAATTTCTGAA
>A10_581;size=1975476;
AATTCTCAACCTATAAATCCTTGTGATATATAGGCTTGGACTTGGAGGCTTGCTGGCCCTTGCGGTCGGCTCCTCTTGAACGCATTAGCTTGATTCCGTACGGATCGGCTCTCAGTGTGATAATTGTCTACGCTGTGACCGTGAAGTGTTTTGGCGAGCTTCTAACCGTCCACTAGGACAACTTTTTAAC
>A10_34;size=1644162;
AATTCTCAACCTTATCAGTTTTTTATTAAATTGGTTCAAGGCTTGGATTTTGGGAGTTGCAGGCTTCTCTTTTGAAGTCA
GCTCTTCTTAAATGTATTAGTGGAGACTTGTAACCGTCGCCTTGGTGTGATAATTATCTGCGCCTTGGTGTATTGGTGACTAGTAATGTCTTTGCTTATAACAGTCCATTAGATTGGACAATCACTTTATGAC
>A10_820;size=1641279;
AATTCTCAACTTATAAATCCTTGTGATCTATAAGCTTGGACTTGGAGGCTTGCTGG

The dereplication uc file looks like this. From the step above it seems that there are 2391901 reads identical to the A10_21 sequence, so the size of A10_21 is 2391901. But these reads do not all come from the same sample; they also belong to other samples, for example A16, A1, A20, A21, A23, etc.
Is this correct? How will this information be transferred to the next step, given that we do not use the uc file from the derep step any further? Only the size data is carried forward; does that mean all the reads are treated as belonging to sample A10?
I am a bit confused about the dereplication step.

S       0       186     *       *       *       *       *       A10_21  *
H       0       186     100.0   +       0       0       *       A10_31391       A10_21
H       0       186     100.0   +       0       0       *       A16_311341      A10_21
H       0       186     100.0   +       0       0       *       A1_527493       A10_21
H       0       186     100.0   +       0       0       *       A20_537172      A10_21
H       0       186     100.0   +       0       0       *       A20_548589      A10_21
H       0       186     100.0   +       0       0       *       A20_559905      A10_21
H       0       186     100.0   +       0       0       *       A20_561449      A10_21
H       0       186     100.0   +       0       0       *       A20_568473      A10_21
H       0       186     100.0   +       0       0       *       A21_585914      A10_21
H       0       186     100.0   +       0       0       *       A23_684625      A10_21
H       0       186     100.0   +       0       0       *       A29_935134      A10_21

Looking forward to hearing from you.
Regards
Sunil

Colin Brislawn

Jun 11, 2017, 11:01:33 PM
to VSEARCH Forum
Hello Sunil,

But these reads do not all come from the same sample; they also belong to other samples, for example A16, A1, A20, A21, A23, etc.
That is correct!

How will this information be transferred to the next step?
Dereplication removes info about individual samples. This information is not transferred to the next step through dereplication. Instead, information about specific samples is added back in the final usearch_global step. 
vsearch -usearch_global clean_derep.fasta -db otus97_vsearch_repset.fasta
Note how clean_derep.fasta has info about individual samples, which shows up in the final OTU table.

Does that help answer your question?
Colin
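
To make the point about dereplication concrete, here is a small sketch that tallies which samples contributed to one dereplicated sequence, using the first few records of the derep uc excerpt posted above. It assumes the sample name is the part of the read label before the first underscore; `samples_per_unique` is a hypothetical helper, not a vsearch feature.

```python
from collections import Counter

# First records of the dereplication uc excerpt from this thread (tab-separated).
derep_uc = "\n".join([
    "S\t0\t186\t*\t*\t*\t*\t*\tA10_21\t*",
    "H\t0\t186\t100.0\t+\t0\t0\t*\tA10_31391\tA10_21",
    "H\t0\t186\t100.0\t+\t0\t0\t*\tA16_311341\tA10_21",
    "H\t0\t186\t100.0\t+\t0\t0\t*\tA1_527493\tA10_21",
    "H\t0\t186\t100.0\t+\t0\t0\t*\tA20_537172\tA10_21",
    "H\t0\t186\t100.0\t+\t0\t0\t*\tA20_548589\tA10_21",
    "H\t0\t186\t100.0\t+\t0\t0\t*\tA21_585914\tA10_21",
])

def samples_per_unique(uc_text):
    """Count, for each unique sequence, how many of its reads came from each sample."""
    per_unique = {}
    for line in uc_text.splitlines():
        fields = line.split()
        if len(fields) < 10 or fields[0] not in ("S", "H"):
            continue
        read = fields[8]
        # 'S' records name the representative itself; 'H' records point to it.
        unique = read if fields[0] == "S" else fields[9]
        sample = read.split("_", 1)[0]  # assumption: sample precedes first "_"
        per_unique.setdefault(unique, Counter())[sample] += 1
    return per_unique

for unique, tally in samples_per_unique(derep_uc).items():
    print(unique, dict(tally))
```

The dereplicated fasta keeps only the label `A10_21` and a total size, so this per-sample breakdown has to be recovered later by mapping the individual reads back to the OTUs.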

sonumu...@gmail.com

Jun 14, 2017, 4:57:09 PM
to VSEARCH Forum
Dear Colin,
Thank you for your reply.
In your email you mentioned that the dereplicated file should be used when generating the OTU table with --usearch_global, but I guess it should be the original fasta file that was the input to dereplication.
I tried the vsearch workflow shown below.
Q1. I am getting many false positives in the chimera analysis. I checked the chimeras detected by vsearch, but they look fine in a BLAST analysis. Should I change something in the chimera analysis step?

Q2. I know that vsearch can generate otu_table.txt directly with --otutabout, but with this option it takes a long time (days) and the output file is quite big (several GB). When I tried to open the file, Excel crashed, and it also did not look like an OTU matrix. Could you please comment on this?

Q3. Alternatively, I used a Python script (from usearch) and the QIIME workflow to generate the OTU table. Please let me know if this is correct. The problem with this approach is that sometimes I get the exact abundance data, but sometimes the abundance data look strange and do not match what they were after the chimera analysis.


#12 Dereplication and sort by abundance using vsearch
################################################
vsearch --derep_fulllength Fungal_lib.fasta --output derep.fasta --minuniquesize 2 --sizeout --relabel uniq --uc derep.uc --log=log

#13 Clustering at 97% using vsearch
################################################
module load vsearch/2.4.3
vsearch --cluster_size derep.fasta --id 0.97 --sizein --sizeout --relabel OTU_ --centroids derep_repset.fasta --uc derep_repset.uc

#14. Denovo chimera analysis
###############################################
module load vsearch/2.4.3
vsearch --uchime_denovo derep_repset.fasta --sizein --sizeout --abskew 2 --mindiffs 3 --mindiv 0.8 --minh 0.28 --chimeras chimera.derep_repset.fasta --nonchimeras nonchimera.derep_repset.fasta

#15. OTU table
###############################################
module load vsearch/2.4.3
vsearch --usearch_global Fungal_lib.fasta --sizein --sizeout --db nonchimera.derep_repset.fasta --strand plus --id 0.97 --uc otu_table.uc --biomout otu_table.biom


#15.1 Converting UC file to cluster file
python mesas-uc2clust.py otu_table.uc seqs.txt

#15.2 making OTU table using QIIME
make_otu_table.py -i seqs.txt -o final_otu_table.biom

#15.3 Converting biom to txt using QIIME
biom convert -i final_otu_table.biom -o final_otu_table.txt --to-tsv


I would really like to use VSEARCH for my datasets, but I need to make sure that I am doing the analysis in the correct order. Your input is highly appreciated.

Looking forward to hearing back from you.

Regards
Sunil
Researcher,
University of Oslo

Colin Brislawn

Jul 12, 2017, 9:01:49 PM
to VSEARCH Forum
Hello Sunil,

Good to hear from you.

but I guess it should be the original fasta file that was the input to dereplication.
Good catch!
 
Q1. I am getting many false positives in the chimera analysis. I checked the chimeras detected by vsearch, but they look fine in a BLAST analysis. Should I change something in the chimera analysis step?
You could also look at the identified chimeras to see whether they make sense by using the --uchimealns option. Maybe BLAST is not catching all chimeras? You could also change the settings of the uchime algorithm (things like --mindiffs, --minh, --xn).

Q2. On --otutabout not working well.
--otutabout writes a flat text file, which is a poor format for large tables: every OTU-by-sample cell is written out, including the zeros. Try --biomout, which writes the OTU table in JSON encoding. This could be smaller.
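
A toy sketch of why the two formats can differ in size, assuming the JSON output stores only non-zero cells as [row, column, value] triples in the style of BIOM 1.0. The dict built here is a simplified, BIOM-1.0-like structure, not a validated BIOM file, and the data are invented.

```python
import json

def dense_tsv(counts, otus, samples):
    """Classic table: one cell per OTU/sample pair, zeros included."""
    rows = ["#OTU ID\t" + "\t".join(samples)]
    for o in otus:
        rows.append(o + "\t" + "\t".join(str(counts.get((o, s), 0)) for s in samples))
    return "\n".join(rows) + "\n"

def sparse_biom_like(counts, otus, samples):
    """BIOM-1.0-style dict: only non-zero cells, as [row, column, value] triples."""
    data = [[i, j, counts[(o, s)]]
            for i, o in enumerate(otus)
            for j, s in enumerate(samples)
            if counts.get((o, s), 0)]
    return {
        "format": "Biological Observation Matrix 1.0.0",
        "type": "OTU table",
        "matrix_type": "sparse",
        "matrix_element_type": "int",
        "shape": [len(otus), len(samples)],
        "rows": [{"id": o, "metadata": None} for o in otus],
        "columns": [{"id": s, "metadata": None} for s in samples],
        "data": data,
    }

# Hypothetical toy data: 1000 OTUs, 50 samples, each OTU seen in one sample only.
otus = [f"OTU_{i + 1}" for i in range(1000)]
samples = [f"A{j + 1}" for j in range(50)]
counts = {(o, samples[i % 50]): i + 1 for i, o in enumerate(otus)}

dense = dense_tsv(counts, otus, samples)
sparse = json.dumps(sparse_biom_like(counts, otus, samples))
print("cells written, dense: ", len(otus) * len(samples))
print("cells written, sparse:", len(json.loads(sparse)["data"]))
print("bytes, dense vs sparse:", len(dense), len(sparse))
```

How much is saved depends on how sparse the table is; the JSON still carries one entry per OTU and per sample, so a nearly full table would not shrink.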

Q3. Alternatively, I used a Python script (from usearch) and the QIIME workflow to generate the OTU table.
I've used a different method for making .biom files out of .uc files. Some methods produce the same output, but QIIME removes whitespace, so the formatting looks different and the files have different md5 hashes. Validating this can be tricky, but it's possible.
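
One way to validate this, sketched under the assumption that both tables have been exported as tab-separated text with OTUs in rows and samples in columns: compare the parsed counts rather than the file hashes. `parse_table` is a hypothetical helper and the two tables are invented.

```python
import hashlib

def parse_table(text):
    """Map (OTU, sample) -> count, ignoring row order, column order and number formatting."""
    rows = [line.split("\t") for line in text.strip().splitlines() if line.strip()]
    samples = [name.strip() for name in rows[0][1:]]
    return {
        (row[0].strip(), sample): int(float(value))
        for row in rows[1:]
        for sample, value in zip(samples, row[1:])
    }

# Two invented tables with the same counts but different layout and formatting.
table_a = "#OTU ID\tA1\tA10\nOTU_1\t5\t0\nOTU_2\t2\t7\n"
table_b = "#OTU ID\tA10\tA1\nOTU_2\t7.0\t2.0\nOTU_1\t0.0\t5.0\n"

same_hash = (hashlib.md5(table_a.encode()).hexdigest()
             == hashlib.md5(table_b.encode()).hexdigest())
same_counts = parse_table(table_a) == parse_table(table_b)
print("same md5:   ", same_hash)
print("same counts:", same_counts)
```

If the parsed counts differ, diffing the two dictionaries shows which OTU and sample pairs disagree, which helps track down where abundance was lost.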


Your pipeline as a whole looks very good to me. I think it should work, but the true testing is up to you!
If you are interested, here is a similar pipeline I used on a recent project.
If you open the 'scripts' folder, scripts 4, 5, and 6 cover the process you mention here. 

Colin
 
