fastq-join quality score mismatch base

Brittany Demmitt

unread,

Aug 8, 2014, 6:01:23 PM8/8/14

to ea-u...@googlegroups.com

Hello,

Thank you so much for this script! I have joined some paired end reads using your script, and had one question regarding the quality score assigned to mismatched bases in the merged reads. My understanding is that the base with the highest quality score is selected and that this base as well as its quality score are used. While I do find this to be the case in many of the merges. I have found a few instances where the higher quality base is used, but the lower quality score from the other read is assigned to it,

Here is an example, with the two different instances of assigning the quality score have been used. I have highlighted the mismatched bases and their quality scores in yellow, red font, bolded and underlined.

In short, in the first instance the forward read has a T (quality score 2) and the reverse has a C (quality score 1) and in the merged read we see the T with quality score 2. However, in the second instance the forward read has a A (quality score F) and the reverse has a G (quality score /) and in the merged read we see the A with quality score /

Forward Read 1:N:0:0

TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTG

+

BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/?FDFCDGC2F>2DDHH2F22@><F1FFFFHGGDC..11F1FGH0G0DG00

Reverse Complement of Reverse Read 2:N:0:0

TTGGAAACTGTCTAACTTGAGTGCAGGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG

+

1HG@11F1F<1GGFDG0HHHHFGF//HGHHHHHG1HHHHHHHGB33GFGGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>

Merged Sequence 1:N:0:0

TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG

+

BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/HGFCDGC2F>2GGHHGFHHHHFGF/FHGHHHHHG1HHHHHHHGG3DGFGGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>

This was run within the qiime pipeline http://qiime.org/scripts/join_paired_ends.html and run with the flag that the minimum allowed overlap was 40bp(-m), and the percent max allowed different was 4%(-p)

The fastq-join used was the ea-utils-1.1.2-537-release

Please let me know if you need anything else, and thank you in advance for your help!

Cheers,

Brittany

Aronesty, Erik

unread,

Aug 11, 2014, 2:36:07 PM8/11/14

to ea-u...@googlegroups.com

I have changed the code. When a mismatch occurs it is incorrect to use the higher quality score. The higher quality base is used, but the quality score has to be lowered.

Here's a good example, imagine if there's a mismatch, and the 2 q scores are the same… what should the new score be? A 0.50 error rate, or a q score of 2.

- Erik

--
You received this message because you are subscribed to the Google Groups "EA Utils" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ea-utils+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brittany Demmitt

unread,

Aug 11, 2014, 6:16:01 PM8/11/14

to ea-u...@googlegroups.com

Thanks Erik!

Brittany Demmitt

unread,

Aug 13, 2014, 1:34:26 PM8/13/14

to ea-u...@googlegroups.com

Hi Erik,

I downloaded the newest version of fastq-join and reran the analysis, and am not completely sure I understand the results. The new joined read is shown below.

The first mismatch still has the T with the higher quality score of 2 and the second mismatch is still A but now has a quality score of J. It seems that several other quality score of the joined portion of the read has now changed as well. Could you explain how these quality scores are being selected?

Thank you!

Brittany

TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG

+

BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/?HGFCDGC2F>2GGHHGFHHHHFGFJLKLLLLLK5LLLLLLLKK7HKJKGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>

Aronesty, Erik

unread,

Aug 13, 2014, 5:17:45 PM8/13/14

to ea-u...@googlegroups.com

Essentially the quality score *should be*

a) if they match: the higher quality score

b) if they don't match: the *difference* between the scores

In your example, the length of the reversed quality string is 150, and the length of your reversed sequence is 151, so there's probably some shifting in the data that's being pasted here. Likewise the merged sequence and the merged quality scores have different lengths.

It would be nice if you make a little file which contained the original reads, as is.

I'm extrapolating from what I think occurred

1. the first mismatch base was '1' and '2', and it used '2', which is incorrect, it should have been set to to '$'

2. the second mismatch base was 'F' and '/', and kept the lower quality '/' which is also incorrect, as it should be a 'A'

See below:

M M

R1 /FDFCDGC2F>2DDHH2F22@><F1FFFFHGGDC..11F1FGH0G0DG00

R2 1HG?@11F1F<1GGFDG0HHHHFGF//HGHHHHHG1HHHHHHHGB33GFG

MERG ?HGFCDGC2F>2GGHHGFHHHHFGF/FHGHHHHHG1HHHHHHHGG3DGFG

Again, if you can send the two original reads, I can test and see what's going on and either explain or fix it.

Brittany Demmitt

unread,

Aug 14, 2014, 11:37:22 AM8/14/14

to ea-u...@googlegroups.com

Hi Erik,

I have attached to mini forward and reverse fastq files for the read I have posting about. Please let me know if you need anything else, and thank you so much for your help! :)

Britt

Undetermined_S0_L001_R2_001_subset.fastq

Undetermined_S0_L001_R1_001_subset.fastq

Aronesty, Erik

unread,

Aug 14, 2014, 3:44:42 PM8/14/14

to ea-u...@googlegroups.com

OK!

1. I fixed the issue… and there definitely was one

2. You can run with "-d -d" and see the per base output.

From: ea-u...@googlegroups.com [mailto:ea-u...@googlegroups.com] On Behalf Of Brittany Demmitt
Sent: Thursday, August 14, 2014 11:37 AM
To: ea-u...@googlegroups.com
Subject: Re: fastq-join quality score mismatch base

Hi Erik,

--

brittany

unread,

Sep 17, 2014, 6:18:31 PM9/17/14

to ea-u...@googlegroups.com

Hi Erik,

Sorry its been a while since I started this forum...I thought things were working fine, but on closer examination I get an unexpected base quality score result when merging the reads.

In the example above I now do get a score of $ in the merged reads, however for the second mismatch I get a 8 not an A as you suggest

Below are the details:

Forward

TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTG

+

BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/?FDFCDGC2F>2DDHH2F22@><F1FFFFHGGDC..11F1FGH0G0DG00

Reverse Comp

TTGGAAACTGTCTAACTTGAGTGCAGGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG

+

1HG@11F?1F<1GGFDG0HHHHFGF//HGHHHHHG1HHHHHHHGB33GFGGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>

Merged

TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTACGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGGTTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG

+

BBBBBFFBFFFFGGGGGGGGGGHGGGGGHHHHGHHHGGGGHHBFCEGGGGGGGGGGGGHHFGGHHGHGBFDGFFHHHHHHHHHHFBGGHHHHHHBHBFD?/?HGFCDGC2F>$GGHHGFHHHHFGF8FHGHHHHHG1HHHHHHHGG3DGFGGGGGHHHF4GGGEFHFHHHGFB41HGHHHEGEFEEHEGGEEFFB0EEAA1FHHGFEB35HHHHHGGEGEGBHHGEAAABFGFEECAGGGFFFDB5FAABA>

Also when you refer to using the "difference" between quality scores for mismatch bases, how exactly is that calculated?

Thank you!

Britt

Aronesty, Erik

unread,

Sep 18, 2014, 10:00:59 AM9/18/14

to ea-u...@googlegroups.com

'F' minus '/' equals '8'

'F'=37

'/'=14

37 - 14 =23

23='8'

brittany

unread,

Sep 18, 2014, 10:34:12 AM9/18/14

to ea-u...@googlegroups.com

Hi Erik,

Thank you for the response sense. However, I am still confused as to why the first mismatch gets a score of $ The forward read is 2=17 and the reverse read is 1=16 so 17-16=1 so I thought that the difference score should be a "=1 not $=3

Thanks!

Britt

Aronesty, Erik

unread,

Sep 18, 2014, 10:49:36 AM9/18/14

to ea-u...@googlegroups.com

I floor it at 3 because 10**-(3/10) = 0.5 , which is "50%"… iE: 50% chance of either base being correct.

The true distribution would involve calculations with "e", but in a simulation we ran this floored difference every bit as accurate. I posted some stuff about that a while back.

brittany

unread,

Sep 18, 2014, 10:51:11 AM9/18/14

to ea-u...@googlegroups.com

Ahhh I see, that makes sense. Thank you so much!!!! :-)

Britt

Reply all

Reply to author

Forward