Merge differences between vsearch and usearch

2,188 views
Skip to first unread message

ast...@aucklanduni.ac.nz

unread,
Mar 8, 2016, 4:18:43 PM3/8/16
to VSEARCH Forum
Hi

I've just been merging some data with vsearch and I noticed that it merges at a vastly different rate than usearch when using exactly the same commands (~10% vsearch, ~80% usearch).
Here is the command I ran:

vsearch -fastq_mergepairs $R1 -reverse $R2 -fastqout $SAMPLE.fq -fastq_truncqual 3 -fastq_minmergelen 200
&
usearch -fastq_mergepairs $R1 -reverse $R2 -fastqout $SAMPLE.fq -fastq_truncqual 3 -fastq_minmergelen 200

And the stdout:

Merging reads 100% 
     44324  Pairs
      4667  Merged (10.5%)
     39657  Not merged (89.5%)

00:04 2.8Mb  100.0% 44.3k recs, 36038 merged (81.3%)  
     44324  Pairs                                  
     36038  Converted (81.3%)
        96  Exact overlaps (0.22%)
      8235  Not aligned (18.58%)
       939  Gaps
    648720  Mismatches
    182393  Fwd errs
    466327  Rev errs
        28  Merged too short (< 200)

Could anyone provide any insight to this please? I can provide the input files I used for this, but they are client data so would need to be kept confidential.

Regards,

Alex

Torbjørn Rognes

unread,
Mar 8, 2016, 4:45:17 PM3/8/16
to VSEARCH Forum
Hi Alex

The source code of USEARCH is not available and details of its implementation is not known. We are therefore unable to make VSEARCH work exactly like USEARCH. We strive to make VSEARCH better. The fastq_mergepairs command is relatively new in VSEARCH and we do not have so much experience with it. However on most of the datasets we have tested it on it seemed to give results that where rather similar to USEARCH.

If you compare different versions of USEARCH you will probably get different results as well.

Based on the output shown it appears to me that the merged fragments generated by USEARCH includes a very high number of mismatches between the merged sequence and the forward and reverse reads. The average number of mismatches is as high as 18 (648720 mismatches / 36038 merged sequences). By default VSEARCH will not merge reads with more than 5 mismatches (fastq_maxdiffs). This sounds far too high.

I hope this could help you.

Best,

- Torbjørn

ast...@aucklanduni.ac.nz

unread,
Mar 8, 2016, 5:48:26 PM3/8/16
to VSEARCH Forum
Hi Torbjørn

Thanks for that, I will have to do some more testing here. But I can understand what the differences are now.
I understand that you can't make it the same without seeing the source. To be honest, I prefer to use vsearch over usearch.

Cheers,

Alex

aditya...@gmail.com

unread,
Mar 28, 2016, 11:08:52 PM3/28/16
to VSEARCH Forum
Hi Torbjorn

Since Usearch takes Q scores into account and recalculates it during merging, isn't it fair enough to allow for upto 50% mismatches in the overlap region (especially for partial coverage MiSeq reads; My overlap region is approx. 80bp)? Also given the generally relatively bad quality of Read 2 compared to Read 1. 

This allows for merging a lot more reads, which can then be quality filtered in the subsequent step prior to clustering. The discarded reads can then be remapped to OTUs as suggested in the UPARSE pipeline

I would like to know your opinion on this, thanks in advance

Best,
Aditya

Torbjørn Rognes

unread,
Mar 29, 2016, 5:19:02 AM3/29/16
to VSEARCH Forum
Hi Aditya

I do not think this sounds like a good idea. If you allow many mismatches, the alignment between the sequences may not be correct, and you will end up with a lot of really bad sequences. It may be possible to do this by increasing the maxdiffs option.

In fact, vsearch uses the same method as usearch to recalculate the quality of the merged sequence. It is based on the article by Edgar and Flyvbjerg.

Best regards,

- Torbjørn

aditya...@gmail.com

unread,
Mar 29, 2016, 5:51:05 AM3/29/16
to VSEARCH Forum
Hi Torbjorn

Thank you for the suggestion! 

In a recent dataset that I processed using the UPARSE pipeline, I allowed upto 30 mismatches given an overlap length of approx. 90bp which resulted in approx. 80% of the reads merging. The merged reads had the expected sequence length distribution peaking at 466bp, but ofcourse a few thousand reads probably misaligned and had higher read lengths. 

In the subsequent quality filtering step using a maximum error of 2 and truncating reads to 400bp, only 23% of the merged data passed this filter. Finally 50% of the merged reads mapped to OTUs using the usearch_global command

My contention is that, since the discarded reads (during the quality filtering step) can still be used to construct the OTU table, why not allow for a larger percentage of reads to merge, which are anyway filtered prior to clustering? I largely allow this leeway during merging due to the very bad quality of bases in Read 2 in the overlap region

However, I do agree that if a very large number of reads misalign and this messes up the expected read lengths after merging, then its a bad idea. 

Looking forward to hearing your thoughts on this

Best,
Aditya

Frédéric Mahé

unread,
Mar 30, 2016, 1:49:13 PM3/30/16
to VSEARCH Forum
Hi Aditya,

I agree with Torbjørn, what you are trying seems dangerous.

I think the root of the problem is that we are trying to sequence markers that are close to the maximum length of paired-end reads (let's say 500bp markers when using Illumina MiSeq 2 x 300bp). Mergers like PEAR, USEARCH, VSEARCH or older software give more or less the same results when the marker is short compared to paired-read length. Stringent algorithms like vsearch or pear start to yield lower numbers of assembled reads when the overlapping length gets shorter. That yield decrease is a good thing for your downstream analysis, as dubious overlap alignment not only creates sequences with higher error rates, it can also produce artificial sequences that may not be eliminated by a filtering of expected errors > 2.

So, I agree with the idea of allowing noisy sequences to go through the pipeline to get the most of each sequencing. The problem here is that tweaking the merging step can produce bad sequences hard to identify as such, and hard to discard during downstream steps.

Best,

aditya...@gmail.com

unread,
Mar 30, 2016, 10:56:41 PM3/30/16
to VSEARCH Forum
Hi Frédéric

Thank you for the detailed reply. I totally agree on all your points and will have a relook at the data. I just compared merging with usearch8.1 and PEAR (both with default settings) on the same dataset.

Following are the results:
a) usearch8.1 Average Merged Reads/Sample (%) 2.9%
b) PEAR0.9.6 Average Merged Reads/Sample (%) 78%

Given these vastly different results, is it possible to evaluate which tool would be better? I really like the approach that PEAR takes, but given these results, I am not sure if its better to proceed with it

Thanks in advance

Best,
Aditya

Frédéric Mahé

unread,
Mar 31, 2016, 6:56:36 AM3/31/16
to VSEARCH Forum
Hi Aditya,

Wow, the numbers you report for usearch8 and pear are so different.
I used pear for several years, and I switch to using vsearch recently. My tests did not show a big difference in terms of assembled reads, just a speed up and more robust recomputed quality values. I've never observed such discrepancy, and I don't think I can give an informed opinion on what to do next.

Your dataset seems problematic (marker too long? failed R2 sequencing?). If another merging algorithm (vsearch?) confirms usearch8 results, you might have to resort to a re-sequencing, or using only R1.

Best,

aditya...@gmail.com

unread,
Mar 31, 2016, 2:00:55 PM3/31/16
to VSEARCH Forum
Hi Frédéric

Thank you for the suggestions. I think the issue I am facing is due to consistently bad quality towards the ends of Read 2 (right after about 175bp in a 300bp read). 
This I think messes up with the merging as the bad quality bases are right in the region where the reads overlap.

The primers I am using are 926F and 1392R targtetting the V6-V8 region. I am currently planning to proceed with the results I obtained using Read 1 only. Would this analysis be considered valid?
Although I must emphasize that the taxa results have not changed significantly in comparison to the previous approach I had taken.

Thank you

Best,
Aditya

Frédéric Mahé

unread,
Apr 1, 2016, 4:02:08 AM4/1/16
to VSEARCH Forum
Hi Aditya,

The
V6-V8 region is in the ballpark of 460-470 bp, so normally a 2x300 bp sequencing should work. However, I heard that Illumina has a quality-problem with the MiSeq v3 chemistry (the one allowing 2x300 bp). That problem has been around for a year now, I don't know if it has been solved. Maybe that could explain your results?

Resorting to use only R1 is not ideal, but if you explain that R2 is not usable, I don't think anyone can hold it against you. You can indeed compare with the taxonomic profiles obtained with assembled reads.

Best,

aditya...@gmail.com

unread,
Apr 2, 2016, 10:29:25 AM4/2/16
to VSEARCH Forum
Hi Frédéric

It could be due to the reason you pointed out. I am planning to flag this to my sequencing core.

However, I did a quick literature search of recent articles in top journals including ISME and PNAS, where they have merged their reads using FLASH. 

Apparently, they have used a mismatch ratio of (less than or equal to) 0.25, which also happens to be FLASHs default.
This parameter value roughly approximates to allowing 25 mismatches at most for a 100bp overlap, which is roughly what I had adopted while merging reads using fastq_mergepairs with max_diffs set to 30

Surprisingly, benchmarks for merging parameters seem to vary quite a bit.

Best,
Aditya

Frédéric Mahé

unread,
Apr 4, 2016, 5:41:11 AM4/4/16
to VSEARCH Forum
Hi Aditya,

well, recent articles in top journals can nonetheless use
suboptimal methods (old habits die hard). I'll quote the paper describing PEAR:

Both, Shera and FLASH (see later in the text) ignore the quality scores of the base calls.

and

FLASH constructs merged reads that maximize the overlap length-to-matches ratio. FLASH requires the mean DNA fragment size and standard deviation of the fragment size as input parameters. Therefore, it can only merge paired-end reads into fragments of ‘almost’ identical size. Furthermore, our tests show that FLASH performs poorly when the overlaps between reads are short (Section 3).

 In late 2013, pear appeared as a more advanced merging method, a bit more conservative (less true-positives), but with less false-positives too (which is important to reduce OTU inflation). In 2015 Robert Edgar and Henrik Flyvbjerg published an exact method to update quality values when merging read pairs and implemented it in usearch (pear developers independently arrived to the same solution but did not publish it). I my opinion the most advanced and reliable merging solutions today are usearch, pear and now vsearch.

Best,

robin....@gmail.com

unread,
Jul 9, 2018, 7:14:02 PM7/9/18
to VSEARCH Forum
This is a great discussion of read merging options! 

I'm curious if any of you are familiar with MeFiT/CASPER and have looked at how that compares to vsearch. Unfortunately the paper mostly uses the tools everyone has pointed out as flawed above for benchmarking. ref: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1358-1

The CASPER-based method uses kmer frequency to decide which base to call when the conflicting assignments have similar quality. My guess is that vsearch uses the same strategy as usearch- go with the higher one and break ties using the forward read? This kmer approach seems like a cool idea to me, but I've not seen anyone actually using it. Also, it's not clear to me how they calculate the new qual scores...

Thanks for any insight!

Robin

Torbjørn Rognes

unread,
Aug 15, 2018, 6:02:52 AM8/15/18
to VSEARCH Forum
Hi Robin

To decide on which base to use when there is a conflict in the overlapping region, VSEARCH chooses the one with highest quality. If the quality is equal, it chooses the forward read. I think this is similar to USEARCH, as you write.

I looked very briefly at CASPER a year ago, but was disappointed.
 
- It merges too many wrong read pairs (In one of my test it merged 3.4% of pairs that should not have been merged compared to less than 0.1% for vsearch and bbmerge. See https://github.com/torognes/merge-eval for details.)
- It is very slow (10x slower than VSEARCH in a simple test)
- The software has not been updated since October 2015.
- The k-mer approach may not be appropriate with all types of data
- It's rarely cited and therefore probably hardly used

Best,
- Torbjørn

chaims...@gmail.com

unread,
Aug 16, 2018, 5:31:49 PM8/16/18
to VSEARCH Forum
How can I get more information about what vsearch is seeing when it doesn't merge a read pair? I have a dataset where usearch merges ~55% of the read pairs, while vsearch only merges ~16%. In the course of investigating this, I've pulled out about 30 read pairs for which usearch generates an alignment with a low (<6%) mismatch rate and 100-140 bp of overlap. 25 of the 28 merged amplicons pass all of my usual qc checks with flying colors. Yet vsearch says for all of them: "alignment score too low, or score drop to high." How can I dig into that to find out what's going on? The only leads I have right now are the R2 drops below Q20 at about 270 bp and in at least half the cases the mismatches in usearch's alignment are in this region. But using --fastq_truncqual 20 only allows 2 of the pairs to merge successfully.

I am happy to send the sample reads, but can't seem to attach them here.

Thanks,
Chaim

Torbjørn Rognes

unread,
Aug 22, 2018, 5:59:50 AM8/22/18
to VSEARCH Forum
Hi Chaim,

It's a bit difficult to say exactly what happens, but in general this message should appear when there are too many mismatches or when the quality of the bases are too low in the overlap region.

If you could send the selected 30 read pairs to me by email (toro...@ifi.uio.no) I can try to figure out what's going on in more detail.

- Torbjørn

oum...@gmail.com

unread,
Aug 29, 2018, 7:33:29 PM8/29/18
to VSEARCH Forum
Hi,
I am a student in bioinformatics and as part of my memory I work on metagenomics.
however, I ran the pear command:  

"pear -f 3704_R1.fastq -r 3704_R2.fastq -o ~/Meta/3704Pear"

Forward reads file.................: 3704_R1.fastq
Reverse reads file.................: 3704_R2.fastq
PHRED..............................: 33
Using empirical frequencies........: YES
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 50
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 1

Allocating memory..................: 200,000,000 bytes
Computing empirical frequencies....: DONE
  A: 0.259712
  C: 0.240956
  G: 0.240444
  T: 0.258888
  332000 uncalled bases

And here are the results he generated:

Assemblying reads: 100%

Assembled reads ...................: 14,617,310 / 94,044,602 (15.543%)
Discarded reads ...................: 830 / 94,044,602 (0.001%)
Not assembled reads ...............: 79,426,462 / 94,044,602 (84.456%)
Assembled reads file...............: /home/dolo/Meta/3704Pear.assembled.fastq
Discarded reads file...............: /home/dolo/Meta/3704Pear.discarded.fastq
Unassembled forward reads file.....: /home/dolo/Meta/3704Pear.unassembled.forward.fastq
Unassembled reverse reads file.....: /home/dolo/Meta/3704Pear.unassembled.reverse.fastq

My question is: is the result of the readings assembled acceptable?


Best regards!

Colin Brislawn

unread,
Aug 30, 2018, 3:06:55 PM8/30/18
to VSEARCH Forum
Hello!

Thank you for posting your full command and log file. While many settings are OK, I noticed this line in your log file. 
Not assembled reads ...............: 79,426,462 / 94,044,602 (84.456%)

Having 84% of all your reads fail to join is pretty high! I would hope to get that number down to <10%.

How much overlap do you expect between your reads? If there is enough overlap, you should be able to get most of your reads to join.

Let me know what you find!
Colin

P.S. Vsearch also supports pairing, and it's really good. You could try joining with vsearch. 

vsearch --fastq_mergepairs FILENAME --reverse FILENAME --fastqout FILENAME 

oum...@gmail.com

unread,
Aug 30, 2018, 8:40:10 PM8/30/18
to VSEARCH Forum
Thank you very much Colin for your reaction, I did it with vsearch too and here is the report!
it's about the same proportions.

vsearch --fastq_mergepairs 3704_R1.fastq --reverse 3704_R2.fastq --fastqout 3704_assembled.fastq
vsearch v2.8.2_linux_x86_64, 3.8GB RAM, 4 cores

Merging reads 100%  
  94044602  Pairs
  12965286  Merged (13.8%)
  81079316  Not merged (86.2%)

Pairs that failed merging due to various reasons:
  56838035  too few kmers found on same diagonal
    173603  multiple potential alignments
        32  too many differences
  23545301  alignment score too low, or score drop to high
    472862  overlap too short
     49483  staggered read pairs

Statistics of all reads:
     99.13  Mean read length

Statistics of merged reads:
    171.66  Mean fragment length
     16.26  Standard deviation of fragment length
      0.15  Mean expected error in forward sequences
      0.14  Mean expected error in reverse sequences
      0.13  Mean expected error in merged sequences
      0.02  Mean observed errors in merged region of forward sequences
      0.02  Mean observed errors in merged region of reverse sequences
      0.04  Mean observed errors in merged region



Can I afford to continue with this rate of assembled reads ?

Dolo

Colin Brislawn

unread,
Aug 30, 2018, 9:36:17 PM8/30/18
to VSEARCH Forum
I'm glad you discovered vsearch! It should be better supported than pear at this point.

Can I afford to continue with this rate of assembled reads ?
Yes! After all, 12 million reads were able to join. That's a lot of reads! 
However, it would be good to make use of those other reads too. 

The vsearch log also shows us something important; why the reads did not join.
  56838035  too few kmers found on same diagonal

So it appears to be having trouble finding kmers to do the alignment. Given that your reads are 100 bp long and many reads that joined are 170 bp long, then there is only 30 bp of overlap in the ones that did join... 

Are these 18S amplicons or untargeted metagenomic reads? I as because these reads are pretty short, and when I sequence shorter regions, say a common region of the 18S amplicon, I find that I need to use the --fastq_allowmergestagger command to join my reads well.

My next suggestion is to run this script while adding the --fastq_allowmergestagger flag to see if that helps your results.

If these are untargeted metagenomic reads or metatranscriptomic reads, you might not have to align them at all! Many aligners allow you to pass in reads that have not been joined, and this is a really good method to use! 

Keep me posted!
Colin

Torbjørn Rognes

unread,
Aug 31, 2018, 2:50:37 AM8/31/18
to VSEARCH Forum
Hi

As you can see from the output, the average length of the fragments that were joined is 172bp and the average read length is 99bp. I suspect that most of your fragments are simply too long (190bp or longer), making it impossible to join them.

Then you could choose to use either the merged sequences or use the reads as they are. It depends on what you are going to do with them.

- Torbjørn

oum...@gmail.com

unread,
Sep 4, 2018, 7:27:08 PM9/4/18
to VSEARCH Forum
Hello,

Thank you all for your support, which has been very helpful.

My next suggestion is to run this script while adding the --fastq_allowmergestagger flag to see if that helps your results.

I executed this script but it did not work, maybe I did not do well.

Then you could choose to use either the merged sequences or use the reads as they are. It depends on what you are going to do with them

In fact, it is within the framework of my memory (characterization of the intestinal microbiome in healthy subjects) whose objective is to master the technique.

Reply all
Reply to author
Forward
0 new messages