Hey there,
I noticed the same problem as reported at this issue:
https://bitbucket.org/biobakery/kneaddata/issues/14/differing-numbers-of-reads-in-paired-end. Briefly, I'm noticing differing numbers of final forward and reverse reads. It also looks like all of the reads with the same headerline are being put into either the final forward or reverse FASTQs randomly. For example, the line "@SRR3586068.87354" is in both the forward and reverse FASTQ, but both mates of this read are placed in the final R1 FASTQ.
This is the output of kneaddata_read_count_table with the files I've been testing:
Sample raw pair1 raw pair2 trimmed pair1 trimmed pair2 trimmed orphan1 trimmed orphan2 decontaminated Homo_sapiens pair1 decontaminated Homo_sapiens pair2 decontaminated Homo_sapiens orphan1 decontaminated Homo_sapiens orphan2 final pair1 final pair2 final orphan1 final orphan2
p144C_R1_kneaddata 75000.0 75000.0 50386.0 50386.0 6975.0 1832.0 36666.0 10219.0 49328.0 147.0 36666.0 10219.0 49328.0 147.0
This issue is only occurring with a certain dataset downloaded from SRA, which makes me think it may be related to the headerlines.
I created test input files based on only 4 reads, which look like this for the forward reads:
@SRR3586068.87354
ACGCGAGGTTGGCGGGATCGGGCGGCGTGAAGGCTTCGGCGTCCGCGACCGTTTCGATCAGCGACATTGCGCCTTCGGGCACCTGACCGAACGTGCCGATC
+
CCCFFFAD@?><FHIJJIJIJJJFBDD-7?BCCDDDCDDDDBBDDDDDDDB-?CABDEB;::57;B;AC@@BDDDCDDDDB?BDDDDDDBDDB?BBDD@<9
@SRR3586068.123205
TCCATTAAGGCAGGGACTTCCTCATTCCAAAACCTAGAACAGTGCCTGTCACATAGTGTTAGTAAAAAATTGTTGAATGAATCAGTTTTAAAGAAAGAGGT
+
CCCFFFFFHHHHHJJBGHBHHGIGIJJJJJIJIJJJJIJJJJF<BGHGJJDGIJIIFHHIJJJJHGIJJGAHEHFDFFFFFEECEA@CDCCCDCDDDD?B#
@SRR3586068.154362
GTATTTCTTTAATAATTCTAGCCATTTCTTTCTAAGGTCTTTTTCTGTAAAACCAGGTTCAAGTCCTAATTCTGTCATACATTCAAAATATTTTCTATCCA
+
@@@DDFFFGFHHHGGHIIIJJIJGIJIIJJJIIHGIJCFEEIIIIIIIIJJJGHHHDDFGAHHF<EGHHE<CGFEFEIGHIAGGI@HEGEHEHFBEFFCC;
@SRR3586068.12
ATCAGAGCCGAATCATGGCCCATCTGTGTGAGCAGAACCTGTTTTCAAACTTTATCCCAGGGGGAGATGAGATTTGCTTGTAGGGTGAGTGTTTCCACGGT
+
C@@FFDFFHHHDHJJJGGIJJIIJJJCHGIGGHIEHIIJJID>FIICHIGIJJJIJHJIJJJDE?>BCDDDCCDDCCCDDBCCC?:?BD@C@CCACDDB##
And this for the reverse reads:
@SRR3586068.87354
GGCCGAACGGCTGGTCGATTCCGGCCGCCATATCGTCTTCATCGGCCATGCCGGCCACCCGGAGGTGATCGGCACGTTCGGTCAGGTGCCCGAAGGCGCAA
+
@@CFFDDDHHHHHJAFHGIJJIGIJGGIJEG@CECHHFFF;C9CD<AACCCA?BBDDDDDDD7;B2<::@8>;@BD<ABDD<@<CD>C@CCB39@BA@>B5
@SRR3586068.123205
ATGTAGTTAGGCTTAGAAAATATGATTTATGAAGATTATATAATATTCTGTAAAATGGATATTCCACAATTTCATCAATCCACCTCTTTCTTTAAAACTGA
+
@@@DFFFDHHHHHIJIFHIHHGIEHGGIJIIIIIHJIIIFIIJJJGIIIIHGFGGDFCH:FHHGIJJJJJJGIJJJIIJFDFHIJJHHHHHGHFFFFFF:;
@SRR3586068.154362
TTTTGTGGTAAATTACTTAAAGTTAATTACAGATAAATTTTAAAATAGGGAAGAGGGGATTAGAAATGGATAGAAAATATTTTGAATGTATGACAGAATTA
+
@@@BDEDDCDFHAHHIJIIBGGF>HIIJCHHIAHCFFIIJJGHDGHI<FHGGGCHHEGABH=CCEHAHIGIIGHHHGFDFFFFF@@AE;66@ACDCCCC:>
@SRR3586068.12
GAAAGTCAACAGTTCGCTCTCACGCAGCGATCTCAGGCCGGGGTCCACATAAACCCCCCTAGACAAGGATGGAGCAGGAGGGCACAGCACCGTGGAAACAC
+
@@@DDDEBHHHGHIIGIGGIGHIIIIIGIIIIIIDGIIIIIII4AGEEHE).;CEFDDD8<?>C>C?BACCDCACD?BBBB<B?BCDDDDDBBDDDCCD><
I ran this command:
kneaddata -i raw_data_example/p144C_R1_subset.fastq -i raw_data_example/p144C_R2_subset.fastq -o kneaddata_out_subset --trimmomatic /usr/local/prg/Trimmomatic-0.36/ -t 1 -db /home/gavin/projects/kneaddata_db/Homo_sapiens
The final "paired" files are empty when I input only these small files, but the unmatched files look like this:
p144C_R1_subset_kneaddata_unmatched_1.fastq:
@SRR3586068.87354
ACGCGAGGTTGGCGGGATCGGGCGGCGTGAAGGCTTCGGCGTCCGCGACCGTTTCGATCAGCGACATTGCGCCTTCGGGCACCTGACCGAACGTGCCGATC
+
CCCFFFAD@?><FHIJJIJIJJJFBDD-7?BCCDDDCDDDDBBDDDDDDDB-?CABDEB;::57;B;AC@@BDDDCDDDDB?BDDDDDDBDDB?BBDD@<9
@SRR3586068.123205
TCCATTAAGGCAGGGACTTCCTCATTCCAAAACCTAGAACAGTGCCTGTCACATAGTGTTAGTAAAAAATTGTTGAATGAATCAGTTTTAAAGAAAGAGG
+
CCCFFFFFHHHHHJJBGHBHHGIGIJJJJJIJIJJJJIJJJJF<BGHGJJDGIJIIFHHIJJJJHGIJJGAHEHFDFFFFFEECEA@CDCCCDCDDDD?B
@SRR3586068.87354
GGCCGAACGGCTGGTCGATTCCGGCCGCCATATCGTCTTCATCGGCCATGCCGGCCACCCGGAGGTGATCGGCACGTTCGGTCAGGTGCCCGAAGGCGCAA
+
@@CFFDDDHHHHHJAFHGIJJIGIJGGIJEG@CECHHFFF;C9CD<AACCCA?BBDDDDDDD7;B2<::@8>;@BD<ABDD<@<CD>C@CCB39@BA@>B5
@SRR3586068.123205
ATGTAGTTAGGCTTAGAAAATATGATTTATGAAGATTATATAATATTCTGTAAAATGGATATTCCACAATTTCATCAATCCACCTCTTTCTTTAAAACTGA
+
@@@DFFFDHHHHHIJIFHIHHGIEHGGIJIIIIIHJIIIFIIJJJGIIIIHGFGGDFCH:FHHGIJJJJJJGIJJJIIJFDFHIJJHHHHHGHFFFFFF:
p144C_R1_subset_kneaddata_unmatched_2.fastq:
@SRR3586068.154362
GTATTTCTTTAATAATTCTAGCCATTTCTTTCTAAGGTCTTTTTCTGTAAAACCAGGTTCAAGTCCTAATTCTGTCATACATTCAAAATATTTTCTATCCA
+
@@@DDFFFGFHHHGGHIIIJJIJGIJIIJJJIIHGIJCFEEIIIIIIIIJJJGHHHDDFGAHHF<EGHHE<CGFEFEIGHIAGGI@HEGEHEHFBEFFCC;
@SRR3586068.12
ATCAGAGCCGAATCATGGCCCATCTGTGTGAGCAGAACCTGTTTTCAAACTTTATCCCAGGGGGAGATGAGATTTGCTTGTAGGGTGAGTGTTTCCACG
+
C@@FFDFFHHHDHJJJGGIJJIIJJJCHGIGGHIEHIIJJID>FIICHIGIJJJIJHJIJJJDE?>BCDDDCCDDCCCDDBCCC?:?BD@C@CCACDDB
@SRR3586068.154362
TTTTGTGGTAAATTACTTAAAGTTAATTACAGATAAATTTTAAAATAGGGAAGAGGGGATTAGAAATGGATAGAAAATATTTTGAATGTATGACAGAATTA
+
@@@BDEDDCDFHAHHIJIIBGGF>HIIJCHHIAHCFFIIJJGHDGHI<FHGGGCHHEGABH=CCEHAHIGIIGHHHGFDFFFFF@@AE;66@ACDCCCC:>
@SRR3586068.12
GAAAGTCAACAGTTCGCTCTCACGCAGCGATCTCAGGCCGGGGTCCACATAAACCCCCCTAGACAAGGATGGAGCAGGAGGGCACAGCACCGTGGAAACAC
+
@@@DDDEBHHHGHIIGIGGIGHIIIIIGIIIIIIDGIIIIIII4AGEEHE).;CEFDDD8<?>C>C?BACCDCACD?BBBB<B?BCDDDDDBBDDDCCD><
Note that read ids with the same headerline are placed in the same output file. These analyses were run with kneaddata v0.7.2 (and the same problem occurred with v0.7.0), bowtie2 2.2.3, Python 3.7.4, and the latest human_genome bowtie2 database downloaded today with "kneaddata_database". On a minor note I also noticed that the database name should be specified as "human_genome" rather than "human" in the installation instruction here:
http://huttenhower.sph.harvard.edu/kneaddata
Thanks for any insight!
Gavin