Dereplication, chimera removal and rereplication

apa...@gmail.com

Feb 13, 2017, 11:19:44 AM

to VSEARCH Forum

Hi

I was running some data through VSEARCH and have a question about dereplication and rereplication. Here is a summary of what I did to the data, with the goal of chimera removal:

1. sort by length (--sortbylength)
2. dereplication (--derep_full)
3. reference-based chimera search (--uchime_ref)
4. de novo chimera search (--uchime_denovo)
5. rereplication (--rereplicate)

When I do step 5, the output file does not look right: a header that appears once in the original input file (before dereplication) appears 2252 times in the rereplicated, chimera-removed file. Also, the number of chimeras detected plus the number of non-chimeras reported is less than the total number of sequences in the input file.

Is there a cluster file that the rereplicate command refers to that I don't see? Is the information needed to rereplicate being lost when I dereplicate the file?

Thanks a lot!

Apaala

Torbjørn Rognes

Feb 13, 2017, 11:40:27 AM

Hi

Do not use the rereplication command. It lacks the information needed to restore the original labels and therefore simply duplicates the same label over and over again. It is only for use in very special test cases.

I would also recommend clustering at about 97% before performing chimera detection.

- Torbjørn
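
The behaviour described above can be illustrated with a minimal Python sketch. It is not VSEARCH's actual implementation; it only assumes that dereplicated headers carry an abundance annotation of the form `;size=N` (as written with --sizeout), and the function name `rereplicate` is hypothetical.

```python
import re

def rereplicate(records):
    """Expand (header, sequence) records according to their ;size=N annotation."""
    expanded = []
    for header, sequence in records:
        match = re.search(r";size=(\d+)", header)
        copies = int(match.group(1)) if match else 1
        label = re.sub(r";size=\d+;?", "", header)
        # Only the representative's label survives dereplication, so every
        # copy gets that same label; the labels of the merged reads are gone.
        expanded.extend([(label, sequence)] * copies)
    return expanded

# Hypothetical dereplicated input: readA stands for three identical reads.
dereplicated = [("readA;size=3", "ACGT"), ("readB;size=1", "GGCC")]
restored = rereplicate(dereplicated)
labels = [label for label, _ in restored]
```

Here `labels` contains "readA" three times, which is the same effect as the header that showed up 2252 times in the rereplicated file.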

apa...@gmail.com

Feb 13, 2017, 1:23:24 PM

Hi

Thanks for the quick reply, I appreciate it! So do you recommend that I skip the dereplication and just run the chimera removal on a file clustered at 97%? Or is there a better way to rereplicate the data?

Thanks!

Apaala

Torbjørn Rognes

Feb 13, 2017, 1:25:28 PM

I'd recommend dereplication first, then clustering at 97%, before you do the chimera detection. Dereplicating first is faster than clustering the raw reads directly. Chimera detection also improves when the data is preclustered.

- Torbjørn
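
The recommended order (dereplicate, cluster at 97%, then detect chimeras) could be sketched as the following command lists, built in Python. This is only one possible arrangement: the file names and the helper `build_pipeline` are hypothetical, and the choice of --cluster_size and reference-based detection is an assumption (--uchime_denovo could be used instead or in addition).

```python
def build_pipeline(reads, reference_db, identity=0.97):
    """Return vsearch command lines as argument lists; nothing is executed here."""
    # Step 1: collapse identical reads and record abundances in the headers.
    derep = ["vsearch", "--derep_fulllength", reads,
             "--sizeout", "--output", "derep.fasta"]
    # Step 2: cluster the unique sequences, most abundant first.
    cluster = ["vsearch", "--cluster_size", "derep.fasta",
               "--id", str(identity), "--sizein", "--sizeout",
               "--centroids", "centroids.fasta"]
    # Step 3: reference-based chimera detection on the cluster centroids.
    chimera = ["vsearch", "--uchime_ref", "centroids.fasta",
               "--db", reference_db,
               "--chimeras", "chimeras.fasta",
               "--nonchimeras", "nonchimeras.fasta",
               "--borderline", "borderline.fasta"]
    return [derep, cluster, chimera]

# Hypothetical input and reference file names.
commands = build_pipeline("reads.fasta", "reference.fasta")
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the file names match your data.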

apa...@gmail.com

Feb 14, 2017, 10:43:47 AM

OK, thanks. One more question: if I perform reference-based chimera removal on an input file, the numbers of sequences written to --chimeras and --nonchimeras should add up to the number of input sequences, correct? Or is there a chance of losing some sequences in the process?

Torbjørn Rognes

Feb 15, 2017, 9:14:30 AM

Some sequences may be classified as "borderline". The total number of chimeras + non-chimeras + borderline should add up to the total number of input sequences.

There is an option ("--borderline") to output those sequences.

- Torbjørn
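
A quick way to verify this accounting is to count the records in each output file. The sketch below is a minimal Python check on in-memory FASTA text; the helper names and the tiny example data are made up for illustration.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))

def accounts_for_all(n_input, n_chimeras, n_nonchimeras, n_borderline):
    """True when the three output categories add up to the input count."""
    return n_input == n_chimeras + n_nonchimeras + n_borderline

# Made-up example data standing in for the input and the three output files.
input_fasta = ">a\nACGT\n>b\nGGCC\n>c\nTTAA\n>d\nCCGG\n"
chimeras = ">b\nGGCC\n"
nonchimeras = ">a\nACGT\n>d\nCCGG\n"
borderline = ">c\nTTAA\n"

counts = [count_fasta_records(text.splitlines())
          for text in (input_fasta, chimeras, nonchimeras, borderline)]
complete = accounts_for_all(*counts)
without_borderline = accounts_for_all(counts[0], counts[1], counts[2], 0)
```

With real files, pass an open file handle to `count_fasta_records` instead of `text.splitlines()`. Leaving out the borderline count reproduces the shortfall noticed in the first post.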

apa...@gmail.com

Feb 15, 2017, 2:08:35 PM

Hi

I should have mentioned that I want to use an in-house classifier for taxonomy. This classifier requires a chimera-removed FASTA file that is not dereplicated as input. Is there a way to use VSEARCH for this purpose?

Torbjørn Rognes

Mar 8, 2017, 11:15:25 AM

Hi

It is difficult to give you a precise answer to this question.

I've now added an example script to the VSEARCH Wiki that contains a simple pipeline that uses VSEARCH to process MiSeq 16S rRNA sequences. It may be helpful for you.
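
Independently of that script, one possible way to get a chimera-free file that is not dereplicated is to filter the original reads against the non-chimeric unique sequences. The Python sketch below is not taken from the Wiki pipeline. It assumes chimera detection was run directly on the dereplicated file (no clustering in between), so that every original read is identical to exactly one unique sequence; the function names are hypothetical, and reads matching borderline sequences are dropped along with chimeras.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def keep_nonchimeric_reads(original, nonchimeras):
    """Keep original reads whose sequence matches a non-chimeric unique sequence."""
    # Compare case-insensitively; this is a choice made for the sketch.
    allowed = {sequence.upper() for _, sequence in nonchimeras}
    return [(header, sequence) for header, sequence in original
            if sequence.upper() in allowed]

# Made-up data: r1 and r2 are identical reads, r3 was flagged as a chimera.
original = parse_fasta(">r1\nACGT\n>r2\nacgt\n>r3\nGGCC\n")
nonchimeras = parse_fasta(">r1;size=2\nACGT\n")
kept = keep_nonchimeric_reads(original, nonchimeras)
```

The result keeps both r1 and r2 with their original labels, which is what the rereplicate command cannot do.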