Problem: vsearch fastq_mergepairs gives drastically different results than usearch

tima...@gmail.com

Aug 6, 2019, 12:00:57 PM
to VSEARCH Forum
Hi,

I am currently trying to run the vsearch --fastq_mergepairs tool and found that quite a lot of reads were discarded due to a "too low alignment score". So I tried increasing the "--fastq_maxdiffs" value, but beyond a certain point it apparently has no effect on the output.

So my question is whether there is another parameter, besides "--fastq_maxdiffpct", "--fastq_maxdiffs" and "--fastq_minovlen", that influences the merging output and might help to increase the number of merged reads.

USEARCH, in comparison, merged 97.7% of the reads in this example with equivalent settings.

Here are the commands and the output from vsearch and the equivalent usearch:

VSEARCH command: vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 16 --quiet

VSEARCH Output:
    106059  Pairs
     90083  Merged (84.9%)
     15976  Not merged (15.1%)

Pairs that failed merging due to various reasons:
        17  too few kmers found on same diagonal
       211  multiple potential alignments
     15665  alignment score too low, or score drop to high
        10  overlap too short
        73  staggered read pairs

Statistics of all reads:
    294.79  Mean read length

Statistics of merged reads:
    364.55  Mean fragment length
      2.84  Standard deviation of fragment length
      1.25  Mean expected error in forward sequences
      4.57  Mean expected error in reverse sequences
      0.31  Mean expected error in merged sequences
      1.61  Mean observed errors in merged region of forward sequences
      6.33  Mean observed errors in merged region of reverse sequences
      7.94  Mean observed errors in merged region

USEARCH command: usearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_pctid 75 --fastq_maxdiffs 99 --fastq_minovlen 16

USEARCH Output:
00:01 101Mb   100.0% 97.7% merged

Totals:
    106059  Pairs (106.1k)
    103667  Merged (103.7k, 97.74%)
      2026  Alignments with zero diffs (1.91%)
      2160  Too many diffs (> 99) (2.04%)
       232  No alignment found (0.22%)
         0  Alignment too short (< 16) (0.00%)
        88  Staggered pairs (0.08%) merged & trimmed
    225.11  Mean alignment length
    364.43  Mean merged length
      1.54  Mean fwd expected errors
      5.23  Mean rev expected errors
      0.53  Mean merged expected errors

Thanks in advance
Till

Colin Brislawn

Aug 6, 2019, 12:50:02 PM
to VSEARCH Forum
This is a great question, Till! I also wonder what defines 'alignment score too low' and whether it could be adjusted to make this work better.

Colin

Torbjørn Rognes

Aug 27, 2019, 7:32:03 AM
to VSEARCH Forum
Hi

Thanks for the question. Sorry for the late reply.

Unfortunately, there are really no options or parameters other than fastq_maxdiffpct, fastq_maxdiffs and fastq_minovlen, which you already mentioned, that can be adjusted to increase the number of merged reads. The only possibility is, if appropriate, to specify fastq_allowmergestagger to merge staggered reads, as USEARCH has done (see the example below). The only other options that influence the merging are fastq_maxee, fastq_maxmergelen, fastq_maxns, fastq_minlen and fastq_minmergelen, but they will only reduce the number of merged pairs, never increase it.
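
For example, merging the staggered pairs as well only requires adding that one option to your original command. A minimal sketch, reusing the file names from your post:

vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 16 --fastq_allowmergestagger

In your run only 73 pairs were rejected as staggered, so this alone will not close the gap to USEARCH, but it makes the comparison between the two tools more equal.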

The merging algorithm in VSEARCH was designed independently of USEARCH. It is intentionally strict and would rather leave a pair unmerged than merge it when there is doubt about whether merging is correct. There are some internal parameters involved, but they cannot be adjusted by users; they were set based on experiments with simulated sequences of varying length and quality. I might consider adding options for adjusting these parameters in the future.

In this case there are many alignments rejected with "alignment score too low, or score drop to high". This means that in the overlap region there are too many mismatches, or that the matching residues are of too low quality. There might also be a "score drop", which could mean that most of the overlap is fine but there are short regions within it that do not match well, e.g. two or more mismatches in a row. VSEARCH looks both at the actual bases (whether they match or mismatch) and at their base quality.

There are also a few cases with "multiple potential alignments". This means that the read pair can apparently be aligned with at least two different overlap sizes, which is often due to repeat-like or low-complexity sequences. VSEARCH will not merge these read pairs because it cannot tell which alignment is correct.
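
If you want to inspect the pairs that were rejected for these reasons, the unmerged reads can be written to separate files and examined by hand. A sketch, assuming your vsearch version supports the notmerged output options (please check vsearch --help; the output file names here are just placeholders):

vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastqout_notmerged_fwd notmerged_r1.fastq --fastqout_notmerged_rev notmerged_r2.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 16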

VSEARCH will often merge fewer reads than other tools, but you'll most likely end up with fewer incorrectly merged reads. If you look at the expected number of errors in the merged sequences, you see that the number is 0.31 for VSEARCH and 0.53 for USEARCH. That might be one indication.

I hope this clarified things and helped you.

- Torbjørn

Aia Oz

Oct 6, 2022, 2:47:22 PM
to VSEARCH Forum

Hi, bringing this up many years later,
What are the k-mers in "too few kmers found on same diagonal"?
Thanks in advance
Aia

Torbjørn Rognes

Oct 10, 2022, 7:50:42 AM
to VSEARCH Forum
Hi

The merge_pairs function initially looks for k-mers shared between the R1 and R2 reads in the potentially overlapping region. It usually requires at least 4 shared 5-mers on the same diagonal, that is, 5-mers that could be aligned without any gaps between them. The shared k-mers may overlap, so at the minimum this can be achieved by 8 consecutive matching nucleotides.

If vsearch cannot find at least 4 shared 5-mers, you'll see the message you mentioned.

Please note that we made some changes to this function in March 2021, in vsearch 2.17.0 and later. It now allows merging of sequences with overlaps as short as 5 bp if the --fastq_minovlen option is lowered from its default of 10. In addition, far fewer read pairs should now be rejected with the reason 'multiple potential alignments', as the algorithm for detecting those has been changed.
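
With vsearch 2.17.0 or later, lowering the minimum overlap is therefore just a matter of adjusting that option. A minimal sketch, reusing the file names from the original post:

vsearch --fastq_mergepairs file1_r1.fastq --reverse file1_r2.fastq --fastqout merged_file1.fastq --fastq_maxdiffpct 25 --fastq_maxdiffs 99 --fastq_minovlen 5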

- Torbjørn