Problem: vsearch fastq_mergepairs gives drastically different results than usearch

tima...@gmail.com

Aug 6, 2019, 12:00:57 PM
to VSEARCH Forum
Hi,

I am currently trying to run the vsearch --fastq_mergepairs tool and found that quite a lot of reads were discarded due to a "too low alignment score". So I tried increasing the "--fastq_maxdiffs" value, but beyond a certain point it apparently has no effect on the output.

So my question is whether there is another parameter, besides "--fastq_maxdiffpct", "--fastq_maxdiffs" and "--fastq_minovlen", that influences the merging output and might help to increase the number of merged reads.

USEARCH, in comparison, merged 97.7% of the reads in this example with equivalent settings.

Here are the commands and the output from vsearch and the equivalent usearch:

VSEARCH command: vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 16 --quiet

VSEARCH Output:
    106059  Pairs
     90083  Merged (84.9%)
     15976  Not merged (15.1%)

Pairs that failed merging due to various reasons:
        17  too few kmers found on same diagonal
       211  multiple potential alignments
     15665  alignment score too low, or score drop to high
        10  overlap too short
        73  staggered read pairs

Statistics of all reads:
    294.79  Mean read length

Statistics of merged reads:
    364.55  Mean fragment length
      2.84  Standard deviation of fragment length
      1.25  Mean expected error in forward sequences
      4.57  Mean expected error in reverse sequences
      0.31  Mean expected error in merged sequences
      1.61  Mean observed errors in merged region of forward sequences
      6.33  Mean observed errors in merged region of reverse sequences
      7.94  Mean observed errors in merged region

USEARCH command: usearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_pctid 75 --fastq_maxdiffs 99 --fastq_minovlen 16

USEARCH Output:
00:01 101Mb   100.0% 97.7% merged

Totals:
    106059  Pairs (106.1k)
    103667  Merged (103.7k, 97.74%)
      2026  Alignments with zero diffs (1.91%)
      2160  Too many diffs (> 99) (2.04%)
       232  No alignment found (0.22%)
         0  Alignment too short (< 16) (0.00%)
        88  Staggered pairs (0.08%) merged & trimmed
    225.11  Mean alignment length
    364.43  Mean merged length
      1.54  Mean fwd expected errors
      5.23  Mean rev expected errors
      0.53  Mean merged expected errors

Thanks in advance
Till

Colin Brislawn

Aug 6, 2019, 12:50:02 PM
to VSEARCH Forum
This is a great question, Till! I also wonder what defines 'alignment score too low' and whether it could be adjusted to make this work better.

Colin

Torbjørn Rognes

Aug 27, 2019, 7:32:03 AM
to VSEARCH Forum
Hi

Thanks for the question. Sorry for the late reply.

Unfortunately, there are really no options or parameters other than fastq_maxdiffpct, fastq_maxdiffs and fastq_minovlen, which you already mentioned, that can be adjusted to increase the number of merged reads. The only possibility is, if appropriate, to specify fastq_allowmergestagger to merge staggered reads, as USEARCH has done (see the example below). The only other options that influence the merging are fastq_maxee, fastq_maxmergelen, fastq_maxns, fastq_minlen and fastq_minmergelen, but they will only reduce the number of merged pairs, never increase it.
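
For example, merging the staggered pairs as well only requires adding that one option to your original command. A minimal sketch, reusing the file names from your post:

vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 16 --fastq_allowmergestagger

In your run only 73 pairs were rejected as staggered, so this alone will not close the gap to USEARCH, but it makes the comparison between the two tools more equal.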

The merging algorithm in VSEARCH was designed independently of USEARCH. It is intentionally strict and would rather leave a pair unmerged than merge it when there is doubt about whether merging is correct. There are some internal parameters involved, but they cannot be adjusted by users; they were set based on experiments with simulated sequences of varying length and quality. I might consider adding options for adjusting these parameters in the future.

In this case there are many alignments rejected with "alignment score too low, or score drop to high". This means that in the overlap region there are too many mismatches, or that the matching residues are of too low quality. There might also be a "score drop", which could mean that most of the overlap is fine but there are short regions within it that do not match well, e.g. two or more mismatches in a row. VSEARCH looks both at the actual bases (whether they match or mismatch) and at their base quality.

There are also a few cases with "multiple potential alignments". This means that the read pair can apparently be aligned with at least two different overlap sizes, which is often due to repeat-like or low-complexity sequences. VSEARCH will not merge these read pairs because it cannot tell which alignment is correct.
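
If you want to inspect the pairs that were rejected for these reasons, the unmerged reads can be written to separate files and examined by hand. A sketch, assuming your vsearch version supports the notmerged output options (please check vsearch --help; the output file names here are just placeholders):

vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastqout_notmerged_fwd notmerged_r1.fastq --fastqout_notmerged_rev notmerged_r2.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 16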

VSEARCH will often merge fewer reads than other tools, but you'll most likely end up with fewer incorrectly merged reads. If you look at the expected number of errors in the merged sequences, you see that the number is 0.31 for VSEARCH and 0.53 for USEARCH. That might be one indication.

I hope this clarified things and helped you.

- Torbjørn

Aia Oz

Oct 6, 2022, 2:47:22 PM
to VSEARCH Forum

Hi, bringing this up many years later,
What are the k-mers in "too few kmers found on same diagonal"?
Thanks in advance
Aia

Torbjørn Rognes

Oct 10, 2022, 7:50:42 AM
to VSEARCH Forum
Hi

The merge_pairs function initially looks for k-mers shared between the R1 and R2 reads in the potentially overlapping region. It usually requires at least 4 shared 5-mers on the same diagonal, that is, 5-mers that could be aligned without any gaps between them. The shared k-mers may overlap, so at the minimum this can be achieved by 8 consecutive matching nucleotides.

If vsearch cannot find at least 4 shared 5-mers, you'll see the message you mentioned.

Please note that we made some changes to this function in March 2021, in vsearch 2.17.0 and later. It now allows merging of sequences with overlaps as short as 5 bp if the --fastq_minovlen option is lowered from its default of 10. In addition, far fewer read pairs should now be rejected with the reason 'multiple potential alignments', as the algorithm for detecting those has been changed.
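
With vsearch 2.17.0 or later, lowering the minimum overlap is therefore just a matter of adjusting that option. A minimal sketch, reusing the file names from the original post:

vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 5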

- Torbjørn