Read pairing issue

433 views
Skip to first unread message

Gavin Douglas

unread,
Sep 23, 2019, 10:29:27 AM9/23/19
to kneaddata users
Hey there,

I noticed the same problem as reported at this issue: https://bitbucket.org/biobakery/kneaddata/issues/14/differing-numbers-of-reads-in-paired-end. Briefly, I'm noticing differing numbers of final forward and reverse reads. It also looks like all of the reads with the same headerline are being put into either the final forward or reverse FASTQs randomly. For example, the line "@SRR3586068.87354" is in both the forward and reverse FASTQ, but both mates of this read are placed in the final R1 FASTQ.

This is the output of kneaddata_read_count_table with the files I've been testing:

Sample  raw pair1       raw pair2       trimmed pair1   trimmed pair2   trimmed orphan1 trimmed orphan2 decontaminated Homo_sapiens pair1       decontaminated Homo_sapiens pair2       decontaminated Homo_sapiens orphan1   decontaminated Homo_sapiens orphan2     final pair1     final pair2     final orphan1   final orphan2
p144C_R1_kneaddata      75000.0 75000.0 50386.0 50386.0 6975.0  1832.0  36666.0 10219.0 49328.0 147.0   36666.0 10219.0 49328.0 147.0

This issue is only occurring with a certain dataset downloaded from SRA, which makes me think it may be related to the headerlines.

I created test input files based on only 4 reads, which look like this for the forward reads:

@SRR3586068.87354
ACGCGAGGTTGGCGGGATCGGGCGGCGTGAAGGCTTCGGCGTCCGCGACCGTTTCGATCAGCGACATTGCGCCTTCGGGCACCTGACCGAACGTGCCGATC
+
CCCFFFAD@?><FHIJJIJIJJJFBDD-7?BCCDDDCDDDDBBDDDDDDDB-?CABDEB;::57;B;AC@@BDDDCDDDDB?BDDDDDDBDDB?BBDD@<9
@SRR3586068.123205
TCCATTAAGGCAGGGACTTCCTCATTCCAAAACCTAGAACAGTGCCTGTCACATAGTGTTAGTAAAAAATTGTTGAATGAATCAGTTTTAAAGAAAGAGGT
+
CCCFFFFFHHHHHJJBGHBHHGIGIJJJJJIJIJJJJIJJJJF<BGHGJJDGIJIIFHHIJJJJHGIJJGAHEHFDFFFFFEECEA@CDCCCDCDDDD?B#
@SRR3586068.154362
GTATTTCTTTAATAATTCTAGCCATTTCTTTCTAAGGTCTTTTTCTGTAAAACCAGGTTCAAGTCCTAATTCTGTCATACATTCAAAATATTTTCTATCCA
+
@@@DDFFFGFHHHGGHIIIJJIJGIJIIJJJIIHGIJCFEEIIIIIIIIJJJGHHHDDFGAHHF<EGHHE<CGFEFEIGHIAGGI@HEGEHEHFBEFFCC;
@SRR3586068.12
ATCAGAGCCGAATCATGGCCCATCTGTGTGAGCAGAACCTGTTTTCAAACTTTATCCCAGGGGGAGATGAGATTTGCTTGTAGGGTGAGTGTTTCCACGGT
+
C@@FFDFFHHHDHJJJGGIJJIIJJJCHGIGGHIEHIIJJID>FIICHIGIJJJIJHJIJJJDE?>BCDDDCCDDCCCDDBCCC?:?BD@C@CCACDDB##

And this for the reverse reads:

@SRR3586068.87354
GGCCGAACGGCTGGTCGATTCCGGCCGCCATATCGTCTTCATCGGCCATGCCGGCCACCCGGAGGTGATCGGCACGTTCGGTCAGGTGCCCGAAGGCGCAA
+
@@CFFDDDHHHHHJAFHGIJJIGIJGGIJEG@CECHHFFF;C9CD<AACCCA?BBDDDDDDD7;B2<::@8>;@BD<ABDD<@<CD>C@CCB39@BA@>B5
@SRR3586068.123205
ATGTAGTTAGGCTTAGAAAATATGATTTATGAAGATTATATAATATTCTGTAAAATGGATATTCCACAATTTCATCAATCCACCTCTTTCTTTAAAACTGA
+
@@@DFFFDHHHHHIJIFHIHHGIEHGGIJIIIIIHJIIIFIIJJJGIIIIHGFGGDFCH:FHHGIJJJJJJGIJJJIIJFDFHIJJHHHHHGHFFFFFF:;
@SRR3586068.154362
TTTTGTGGTAAATTACTTAAAGTTAATTACAGATAAATTTTAAAATAGGGAAGAGGGGATTAGAAATGGATAGAAAATATTTTGAATGTATGACAGAATTA
+
@@@BDEDDCDFHAHHIJIIBGGF>HIIJCHHIAHCFFIIJJGHDGHI<FHGGGCHHEGABH=CCEHAHIGIIGHHHGFDFFFFF@@AE;66@ACDCCCC:>
@SRR3586068.12
GAAAGTCAACAGTTCGCTCTCACGCAGCGATCTCAGGCCGGGGTCCACATAAACCCCCCTAGACAAGGATGGAGCAGGAGGGCACAGCACCGTGGAAACAC
+
@@@DDDEBHHHGHIIGIGGIGHIIIIIGIIIIIIDGIIIIIII4AGEEHE).;CEFDDD8<?>C>C?BACCDCACD?BBBB<B?BCDDDDDBBDDDCCD><


I ran this command:

kneaddata -i raw_data_example/p144C_R1_subset.fastq -i raw_data_example/p144C_R2_subset.fastq -o kneaddata_out_subset --trimmomatic /usr/local/prg/Trimmomatic-0.36/ -t 1 -db /home/gavin/projects/kneaddata_db/Homo_sapiens
 
The final "paired" files are empty when I input only these small files, but the unmatched files look like this:

p144C_R1_subset_kneaddata_unmatched_1.fastq:
@SRR3586068.87354
ACGCGAGGTTGGCGGGATCGGGCGGCGTGAAGGCTTCGGCGTCCGCGACCGTTTCGATCAGCGACATTGCGCCTTCGGGCACCTGACCGAACGTGCCGATC
+
CCCFFFAD@?><FHIJJIJIJJJFBDD-7?BCCDDDCDDDDBBDDDDDDDB-?CABDEB;::57;B;AC@@BDDDCDDDDB?BDDDDDDBDDB?BBDD@<9
@SRR3586068.123205
TCCATTAAGGCAGGGACTTCCTCATTCCAAAACCTAGAACAGTGCCTGTCACATAGTGTTAGTAAAAAATTGTTGAATGAATCAGTTTTAAAGAAAGAGG
+
CCCFFFFFHHHHHJJBGHBHHGIGIJJJJJIJIJJJJIJJJJF<BGHGJJDGIJIIFHHIJJJJHGIJJGAHEHFDFFFFFEECEA@CDCCCDCDDDD?B
@SRR3586068.87354
GGCCGAACGGCTGGTCGATTCCGGCCGCCATATCGTCTTCATCGGCCATGCCGGCCACCCGGAGGTGATCGGCACGTTCGGTCAGGTGCCCGAAGGCGCAA
+
@@CFFDDDHHHHHJAFHGIJJIGIJGGIJEG@CECHHFFF;C9CD<AACCCA?BBDDDDDDD7;B2<::@8>;@BD<ABDD<@<CD>C@CCB39@BA@>B5
@SRR3586068.123205
ATGTAGTTAGGCTTAGAAAATATGATTTATGAAGATTATATAATATTCTGTAAAATGGATATTCCACAATTTCATCAATCCACCTCTTTCTTTAAAACTGA
+
@@@DFFFDHHHHHIJIFHIHHGIEHGGIJIIIIIHJIIIFIIJJJGIIIIHGFGGDFCH:FHHGIJJJJJJGIJJJIIJFDFHIJJHHHHHGHFFFFFF:

p144C_R1_subset_kneaddata_unmatched_2.fastq:
@SRR3586068.154362
GTATTTCTTTAATAATTCTAGCCATTTCTTTCTAAGGTCTTTTTCTGTAAAACCAGGTTCAAGTCCTAATTCTGTCATACATTCAAAATATTTTCTATCCA
+
@@@DDFFFGFHHHGGHIIIJJIJGIJIIJJJIIHGIJCFEEIIIIIIIIJJJGHHHDDFGAHHF<EGHHE<CGFEFEIGHIAGGI@HEGEHEHFBEFFCC;
@SRR3586068.12
ATCAGAGCCGAATCATGGCCCATCTGTGTGAGCAGAACCTGTTTTCAAACTTTATCCCAGGGGGAGATGAGATTTGCTTGTAGGGTGAGTGTTTCCACG
+
C@@FFDFFHHHDHJJJGGIJJIIJJJCHGIGGHIEHIIJJID>FIICHIGIJJJIJHJIJJJDE?>BCDDDCCDDCCCDDBCCC?:?BD@C@CCACDDB
@SRR3586068.154362
TTTTGTGGTAAATTACTTAAAGTTAATTACAGATAAATTTTAAAATAGGGAAGAGGGGATTAGAAATGGATAGAAAATATTTTGAATGTATGACAGAATTA
+
@@@BDEDDCDFHAHHIJIIBGGF>HIIJCHHIAHCFFIIJJGHDGHI<FHGGGCHHEGABH=CCEHAHIGIIGHHHGFDFFFFF@@AE;66@ACDCCCC:>
@SRR3586068.12
GAAAGTCAACAGTTCGCTCTCACGCAGCGATCTCAGGCCGGGGTCCACATAAACCCCCCTAGACAAGGATGGAGCAGGAGGGCACAGCACCGTGGAAACAC
+
@@@DDDEBHHHGHIIGIGGIGHIIIIIGIIIIIIDGIIIIIII4AGEEHE).;CEFDDD8<?>C>C?BACCDCACD?BBBB<B?BCDDDDDBBDDDCCD><


Note that read ids with the same headerline are placed in the same output file. These analyses were run with kneaddata v0.7.2 (and the same problem occurred with v0.7.0), bowtie2 2.2.3, Python 3.7.4, and the latest human_genome bowtie2 database downloaded today with "kneaddata_database". On a minor note I also noticed that the database name should be specified as "human_genome" rather than "human" in the installation instruction here: http://huttenhower.sph.harvard.edu/kneaddata


Thanks for any insight!

Gavin




Lauren McIver

unread,
Sep 23, 2019, 5:10:34 PM9/23/19
to Gavin Douglas, kneaddata users
Hi Gavin, Would you check out the formatting of the sequence ids in your original fastq files? I wonder if kneaddata is having an issue picking up the pairs. Ideally the ids would end with "/1" and "/2". There are also other variants it can pick up but it is not able to pick up all possible variations. 

Thanks!
Lauren
 

--
You received this message because you are subscribed to the Google Groups "kneaddata users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kneaddata-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kneaddata-users/337c0f10-67ed-439b-ab24-f85969e57ae0%40googlegroups.com.

Gavin Douglas

unread,
Sep 24, 2019, 9:22:47 AM9/24/19
to Lauren McIver, kneaddata users
Thanks for the quick response Lauren.

Adding “/1” and “/2” to the end of the R1 and R2 headerlines resolved this problem, thanks! It might be worth mentioning this in the manual (unless I missed it), because for whatever reason it's possible to download FASTQs from SRA / ENA without these numbers present in the headers. I think this is likely what’s going on in that issue that I referenced as well.


Thanks,

Gavin

Gavin Douglas

unread,
Nov 4, 2019, 9:36:21 AM11/4/19
to Lauren McIver, kneaddata users
In case it's useful to others - I've also noticed that this issue can occur even when FASTQ headerlines are formatted correctly. The problem seems to be related to using older versions of bowtie2 as well (at least with v2.2.3), but was resolved when I tried a newer version (v2.3.4.3).


Gavin
Reply all
Reply to author
Forward
0 new messages