I have changed the code. When a mismatch occurs it is incorrect to use the higher quality score. The higher quality base is used, but the quality score has to be lowered.
Here's a good example: imagine there's a mismatch and the two Q scores are the same. What should the new score be? A 0.50 error rate, which is a Q score of about 3.
- Erik
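
For readers who want to check the error rate quoted above, here is a minimal Python sketch of the standard Phred relation, p = 10^(-Q/10). The function names are illustrative only and are not part of ea-utils.

```python
import math

def phred_to_error(q: float) -> float:
    """Error probability for a Phred quality score: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)

def error_to_phred(p: float) -> float:
    """Phred quality score for an error probability: q = -10 * log10(p)."""
    return -10 * math.log10(p)

print(round(error_to_phred(0.5), 2))  # 3.01
print(round(phred_to_error(3), 3))    # 0.501
```

So a 50% chance of error sits almost exactly at Q3.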
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+
BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/?HGFCDGC2F>2GGHHGFHHHHFGFJLKLLLLLK5LLLLLLLKK7HKJKGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>
Essentially the quality score *should be*
a) if they match: the higher quality score
b) if they don't match: the *difference* between the scores
In your example, the length of the reversed quality string is 150, and the length of your reversed sequence is 151, so there's probably some shifting in the data that's being pasted here. Likewise the merged sequence and the merged quality scores have different lengths.
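
A quick way to spot this kind of shifting is to compare the sequence and quality lengths of each record before merging. The sketch below is only an illustration: `check_lengths` is a hypothetical helper, and the records are toy data, not the reads from this thread.

```python
def check_lengths(records):
    """Return (name, seq_len, qual_len) for records whose lengths differ.

    `records` is an iterable of (name, sequence, quality) tuples.
    """
    return [(name, len(seq), len(qual))
            for name, seq, qual in records
            if len(seq) != len(qual)]

# Toy data for illustration only.
toy = [
    ("ok", "ACGT", "FFFF"),
    ("shifted", "ACGTA", "FFFF"),
]
print(check_lengths(toy))  # [('shifted', 5, 4)]
```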
It would help if you could make a small file containing the original reads, as is.
I'm extrapolating from what I think occurred:
1. The first mismatched base had qualities '1' and '2', and it used '2', which is incorrect. It should have been set to '$'.
2. The second mismatched base had qualities 'F' and '/', and it kept the lower quality '/', which is also incorrect, as it should be an 'A'.
See below:
                 M             M
R1    /FDFCDGC2F>2DDHH2F22@><F1FFFFHGGDC..11F1FGH0G0DG00
R2    1HG?@11F1F<1GGFDG0HHHHFGF//HGHHHHHG1HHHHHHHGB33GFG
MERG  ?HGFCDGC2F>2GGHHGFHHHHFGF/FHGHHHHHG1HHHHHHHGG3DGFG
Again, if you can send the two original reads, I can test and see what's going on and either explain or fix it.
OK!
1. I fixed the issue… and there definitely was one
2. You can run with "-d -d" to see the per-base output.
From: ea-u...@googlegroups.com [mailto:ea-u...@googlegroups.com]
On Behalf Of Brittany Demmitt
Sent: Thursday, August 14, 2014 11:37 AM
To: ea-u...@googlegroups.com
Subject: Re: fastq-join quality score mismatch base
Hi Erik,
--
Forward
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTG
+
BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/?FDFCDGC2F>2DDHH2F22@><F1FFFFHGGDC..11F1FGH0G0DG00
Reverse Comp
TTGGAAACTGTCTAACTTGAGTGCAGGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+
1HG@11F?1F<1GGFDG0HHHHFGF//HGHHHHHG1HHHHHHHGB33GFGGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>
Merged
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG
+
BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/?HGFCDGC2F>$GGHHGFHHHHFGF8FHGHHHHHG1HHHHHHHGG3DGFGGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>
Also, when you refer to using the "difference" between quality scores for mismatched bases, how exactly is that calculated?
Thank you!
Britt
'F' minus '/' equals '8':
'F' = 37
'/' = 14
37 - 14 = 23
23 = '8'
I floor it at 3 because 10**-(3/10) ≈ 0.5, which is "50%", i.e., a 50% chance of either base being correct.
The true distribution would involve calculations with "e", but in a simulation we ran, this floored difference was every bit as accurate. I posted some stuff about that a while back.
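
The rule described in this thread can be sketched in Python as follows. This is a rough illustration, not the actual fastq-join source: `merge_base` and the constants are hypothetical names, the base letters in the examples are made up, and Phred+33 encoding is assumed (consistent with 'F' = 37 above).

```python
PHRED_OFFSET = 33   # assumption: Phred+33 quality encoding
MISMATCH_FLOOR = 3  # Q3 is roughly a 50% error rate

def merge_base(b1, q1, b2, q2):
    """Merge one overlapping position; q1 and q2 are quality characters."""
    s1 = ord(q1) - PHRED_OFFSET
    s2 = ord(q2) - PHRED_OFFSET
    if b1 == b2:
        # Match: keep the base with the higher of the two scores.
        return b1, chr(max(s1, s2) + PHRED_OFFSET)
    # Mismatch: keep the higher-quality base (ties go to the first read
    # here, an arbitrary choice) and use the floored score difference.
    base = b1 if s1 >= s2 else b2
    score = max(abs(s1 - s2), MISMATCH_FLOOR)
    return base, chr(score + PHRED_OFFSET)

print(merge_base("A", "F", "G", "/"))  # ('A', '8')
print(merge_base("A", "2", "G", "1"))  # ('A', '$')
print(merge_base("A", "F", "A", "/"))  # ('A', 'F')
```

The first two calls reproduce the '8' and '$' values discussed in the thread; the third shows the match case, where the higher score is kept.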