Mapping quality computation

Brendan O'Fallon

unread,

Mar 26, 2014, 10:13:43 AM3/26/14

to snap...@googlegroups.com

Just wondering if anyone can point me to documentation for how individual read mapping quality is computed. Looking at a few reads I'm seeing some behavior that seems confusing, for instance, reads with multiple mismatches are getting cigars of 101M and mapping quality of 70 (identical to reads with no mismatches). Sometimes even indels are present and the read still gets a mapping quality of 70. Shouldn't mapping quality be decreased if there are multiple mismatches? Shouldn't the cigar string reflect mismatches?

Note: I'm getting this info by looking at the alignment in IGV, so this could be some weird IGV-related bug.

Bill Bolosky

unread,

Mar 26, 2014, 2:23:51 PM3/26/14

to snap...@googlegroups.com

The mapping quality is determined by dividing the probability that the read was generated by the DNA at the aligned location by the probability that it would have been generated from anywhere in the genome (and then expressing that on a log scale). So even though the probability that you’d get a read with multiple mismatches is pretty low, it might be much lower that it came from somewhere else.

In practice, this means that you see high mapping quality when a read maps much better to one location than to any other. So, for instance, if a read has 3 differences form location x, but at least 6 from any other location in the genome, it will map to x with high quality.

While we were in development, we played around with using the absolute difference between the read and the mapped location as an input into the mapping quality computation, but we didn’t get it to work better than using the principled method, so we took it out.

In terms of the cigar string, when you specify –M it will only show indels and not single base mismatches. If you leave off the –M flag, then you’ll get a cigar string that uses = for an exactly matching base and X for a mismatched base (as well as I and D for insertions and deletions).

Make sense?

--Bill

--
You received this message because you are subscribed to the Google Groups "SNAP Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to snap-user+...@googlegroups.com.
To post to this group, send email to snap...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/snap-user/6749a988-831e-4f99-ba35-d7cc7175c917%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Zhuyi Xue

unread,

Nov 17, 2015, 5:54:04 PM11/17/15

to SNAP Users

Hi Bill,

Could you please point the location in the code https://github.com/amplab/snap where the MAPQ is calculated? I wonder how the following values are actually calculated.

the probability that the read was generated by the DNA at the aligned location

the probability that it would have been generated from anywhere in the genome

Thank you!

Bill Bolosky

unread,

Nov 17, 2015, 6:41:02 PM11/17/15

to snap...@googlegroups.com

The final computation is in mapq.cpp::computeMAPQ, but that’s probably not what you’re interested in.

SNAP is really two different aligners, one for single end and one for paired-end. They have different code and different MAPQ computations.

As you deduced, SNAP determines the probability that a given DNA location generated a particular read. It does this each time it evaluates a candidate location. The mechanics of it are in the code around line 886 in BaseAligner.cpp (in the dev branch, though I don’t think it differs much if any from the master). When SNAP evaluates a candidate location, what it does is to compute the edit distance for the part of the read before the seed going backwards, and the edit distance for the part of the read after the seed going forward. These are the two calls to landauVishkin->computeEditDistance(). (NB: “LandauVishkin” is a misnomer, it’s really Ukkonen’s algorithm, but we never changed the code).

computeEditDistance() is in LandauVishkin.h. If it’s a perfect match, then the match probability is (1 – SNP_PROBABILITY) ^ length, which is precomputed in the lv_perfectMatchProbability array (and SNP_PROB is just a constant .001). If it’s not a perfect match, then the match probability is computed at the bottom of the function (line 265 and after) . It’s the product of the perfect match probability raised to the number of perfectly matching bases times the mismatch probability for each mismatching base, times the indel probability for each indel. The mismatch probability depends on the quality score for the base in question and SNP_PROB, and is computed at LandauVishkin.h:748. The indel probability is the product of the GAP_OPEN_PROB (which is .001) and the GAP_EXTEND_PROB (.5) raised to the length of the indel minus one. This is computed at LandauVishkin:738.

Once the aligner has computed the edit distance for the part before the seed and the part after the seed, it computes the final match probability as the product of those two match probabilities times (1 – SNP_PROB) ^ seedLength. This is the probability that that DNA locus would have generated that read.

SNAP basically works by looking at a bunch of candidate locations and choosing the best one as the aligned location. It computes the match probability for all of them, and uses the sum of them as the probability that the read came from anywhere in the genome, and the final MAPQ is computed as Pr(alignedLocation) / Pr(All locations) (converted into MAPQ numbers). If you followed the logic above, you’ll see that matchProbability drops exponentially in the number of mismatches (and for SNPs, the base of the exponent is .001, so very very quickly). Therefore, the only meaningful contribution to the denominator is from loci that are pretty close to the best match. SNAP handles this by searching for all matches it can find that are within some edit distance of the best match. This is by default 2 (meaning for SNPs about 10^-6 as much probability as the best hit), but you can change it with the –D flag, trading MAPQ accuracy for performance.

The paired-end aligner is in IntersectingPairedEndAligner.cpp. It works more-or-less the same, except that it’s using the probability of a pair match, which is the product of the probabilities of the individual ends.

Does this make sense? If not, let me know and I’ll try to clarify more.

It took me several months to work though this and add it to SNAP, it was way more of pain than I’d hoped, but it seems like it’s pretty good now. I haven’t touched it in years.

To view this discussion on the web visit https://groups.google.com/d/msgid/snap-user/016cbb17-7d18-4bea-823f-012ce220d553%40googlegroups.com.

Zhuyi Xue

unread,

Nov 19, 2015, 1:55:21 PM11/19/15

to SNAP Users

Hi Bill, Thank you so much for the detailed explanation, really appreciated! Just to let you know I haven't fully digested it yet. I will come back with further questions.

Zhuyi Xue

unread,

Dec 3, 2015, 5:52:58 PM12/3/15

to SNAP Users

Hi Bill,

I have a few questions, please:

1. Could you elucidate a bit more about this sentence: It’s the product of the perfect match probability raised to the number of perfectly matching bases times the mismatch probability for each mismatching base, times the indel probability for each indel. I get a bit confused.

2. LandauVishkin.h is only 512 lines long (https://github.com/amplab/snap/blob/dev/SNAPLib/LandauVishkin.h#L265), why LandauVishkin.h:748?

3. The probability that the read came from anywhere in the genome actually meant probability that the read came from any possible candidate locations in the genomes, right?

4. I don't quite understand this sentence, SNAP handles this by searching for all matches it can find that are within some edit distance of the best match. This is by default 2 (meaning for SNPs about 10^-6 as much probability as the best hit). How is 10^-6 deduced from 2?

Thanks!

Reply all

Reply to author

Forward