PRINSEQ Paired-end reads input format data


perrin...@gmail.com

Aug 6, 2015, 4:52:08 AM
to Edwards Lab Tools, mmper...@sbr-roscoff.fr
Hello,

I am trying to use PRINSEQ with Paired-end reads.
I know that "The sequence identifiers for two matching paired-end sequences in separate files can be marked by /1 and /2, or _L and _R, or _left and _right, or must have the exact same identifier in both input files.".

Can I have an example of the input data?
I am not sure, but I think that mine are marked by .1 and .2 (see the excerpts below).
Can you confirm that I have to "transform" them?
And if so, how should I do that?

Thank you
Marie

___SRR496757_1.fastq
@SRR496757.1 B802KKABXX:8:1:1321:1990 length=90
ATTCAAAAGTATCACAATTGAGCTTGAAAATCACACGAGCTGTATTTTTTTTTTGTCAACAGCGAGGAGAGAACTACACAAGCAAAAAAG
+SRR496757.1 B802KKABXX:8:1:1321:1990 length=90
=GGGGDEFFCGGFFGDAGGGBGGGGGGFGDEGGF?EGDFEGGEEGFEEC@6?;<37*?@?@@?@?A5)@?4/7&/86:48EBDEC@@?##
@SRR496757.2 B802KKABXX:8:1:1737:1916 length=90
NTAAATCTCAATTGAAGGCATGACTTCGGCGAATTTCGACAGACACCCGCATGTGGCAAGCTGTTCAGTTCGAGTTCAGTTCGACCCCCC
+SRR496757.2 B802KKABXX:8:1:1737:1916 length=90
#**-(27272EEEE?EEEEEEEE?EEEEE>9@@@@?A@A9@@@@?@>9>>BBBB<0:=<7133-.)30+47770099999>8>>>BBBBB

___SRR496757_2.fastq
@SRR496757.1 B802KKABXX:8:1:1321:1990 length=90
CCCCCCCATGGCACAGTCACAAGTAGTATTAAAGGTAGCCCCGGGCTACAGACGATACTACAAAAGATAGAATACCAGTACAGTCTTTTT
+SRR496757.1 B802KKABXX:8:1:1321:1990 length=90
GGDFGG>GGGG?ED?DBDDDAAB?:DB?DDCCCCC:;@=>DDDDDBD?ACAA-A>CC-C:=?############################
@SRR496757.2 B802KKABXX:8:1:1737:1916 length=90
GGAGTATATAGTGCCGGTTGCCGCTATAGTGCCGGCCTTATTGGCTGGGGGGGAACCAAAAAACCGGACAGAAAATAAAGGGGGGTCTAT
+SRR496757.2 B802KKABXX:8:1:1737:1916 length=90
GGEGGGGF:FACEEEEEEEE5CDDD>@B?@EB:=ADDCEE:BEEC-CA=?########################################

Kate

Aug 6, 2015, 7:01:31 PM
to Edwards Lab Tools, mmper...@sbr-roscoff.fr
It looks as though they are not marked yet.  The .1 and .2 appear to be sequence numbers rather than pair markers: they seem to increment .1, .2, .3 and so on (although your excerpt doesn't show the third), and the same IDs appear in both of your files.
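
One rough way to confirm this is to compare the first token of every header line across the two files. If the IDs agree record by record, the .1/.2 suffixes are read counters, not pair markers. This is only a sketch: `count_id_mismatches` is a made-up helper name, and it assumes strict 4-line FASTQ records.

```shell
# Hypothetical helper: count header lines whose read ID (first token) differs
# between two FASTQ files. Assumes strict 4-line records in both files.
count_id_mismatches() {
    awk '
        FNR % 4 != 1 { next }                      # keep header lines only
        NR == FNR    { id[FNR] = $1; n1++; next }  # first file: remember IDs
        { n2++; if (id[FNR] != $1) bad++ }         # second file: compare
        END { print bad + (n1 > n2 ? n1 - n2 : 0) }
    ' "$1" "$2"
}

# Usage (file names taken from the question):
#   count_id_mismatches SRR496757_1.fastq SRR496757_2.fastq
# A result of 0 means every ID matched at the same position in both files.
```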

The /1 or /2 needs to be at the end of the sequence ID (before any spaces).  So the correct notation would be "@SRR496757.1/1 B802KKABXX:8:1:1321:1990 length=90".  Many commands you might find on the web will incorrectly have you add the /1 to the end of the entire header line, which is wrong whenever the header contains a space.
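
To make the difference concrete, here is a small sketch that applies both placements to the first header from your excerpt. The variable names are just for illustration, and `[^ ]*` is used as a portable stand-in for the GNU-only `\S*`.

```shell
header='@SRR496757.1 B802KKABXX:8:1:1321:1990 length=90'

# Wrong: the suffix is appended to the end of the whole header line.
wrong=$(printf '%s\n' "$header" | sed 's/$/\/1/')

# Right: the suffix is inserted directly after the ID, before the first space.
right=$(printf '%s\n' "$header" | sed 's/^\([^ ]*\)/\1\/1/')

printf '%s\n' "$wrong"   # @SRR496757.1 B802KKABXX:8:1:1321:1990 length=90/1
printf '%s\n' "$right"   # @SRR496757.1/1 B802KKABXX:8:1:1321:1990 length=90
```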

The correct way to format the files is with the following two commands:

cat file_1.fastq | paste - - | sed 's/^\(\S*\)/\1\/1/' | tr "\t" "\n" > file_1_renamed.fastq

cat file_2.fastq | paste - - | sed 's/^\(\S*\)/\1\/2/' | tr "\t" "\n" > file_2_renamed.fastq


See: https://edwards.sdsu.edu/research/changing-the-label-of-paired-end-sequences-in-fastq-files/



**The above commands assume you have standard 4-line FASTQ files (which it looks like you do).  They will break on FASTQ files that are multi-line, a.k.a. "line wrapped".  Multi-line FASTQ files shouldn't exist, but they do.
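
If you want to guard against that, below is a rough awk sketch of two helpers. Both names (`check_four_line`, `add_pair_suffix`) are invented for this example and are not part of PRINSEQ. The first is a cheap heuristic for the 4-line layout, not a full validator. The second does the same renaming as the commands above, but it leaves a bare "+" separator line untouched instead of turning it into "+/1".

```shell
# Hypothetical helper: exit non-zero if stdin does not look like 4-line FASTQ.
# Heuristic only: checks the line count and the "@" / "+" positions.
check_four_line() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }
        END { if (bad || NR % 4 != 0) exit 1 }
    '
}

# Hypothetical helper: append /N to the read ID, where N is the first argument.
# Reads FASTQ on stdin and writes the renamed records to stdout.
add_pair_suffix() {
    awk -v tag="/$1" '
        # Insert the tag after the first whitespace-free token of the line.
        function tagged(line,    n) {
            n = match(line, /[ \t]/)
            if (n == 0) return line tag
            return substr(line, 1, n - 1) tag substr(line, n)
        }
        NR % 4 == 1                   { $0 = tagged($0) }  # "@" header line
        NR % 4 == 3 && length($0) > 1 { $0 = tagged($0) }  # "+" line repeating the ID
        { print }
    '
}

# Usage (file names as in the commands above):
#   check_four_line < file_1.fastq && add_pair_suffix 1 < file_1.fastq > file_1_renamed.fastq
#   check_four_line < file_2.fastq && add_pair_suffix 2 < file_2.fastq > file_2_renamed.fastq
```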