Question regarding the SOAPdenovo2 pipeline

sendru

Nov 29, 2012, 2:03:54 AM
to bgi-...@googlegroups.com
I am quite new to de novo assembly and recently received some mammalian reads to analyze.
The reads are from a paired-end (PE) library, 101 bases per end, 400M reads in total.
Some reads look really bad in terms of base-calling quality. Here are some examples:

@SRR518712.1 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1247:2069 length=202
ATGGACATTGCACCTGCCCTACTGTCTCCCTCCTATCATGCAGATCCTTGTTCCAAACCGCACATTGATTTCCTGGGTGTTCCAACTGATCTTAGGTCACTNAAGNAANNCANAAGGGGGGAAATCAAGCAGTAATACANCNNNATGNGTCTGAGGGGTTANGNNNNCAGAGAANGGGGGGGTCGTGGGTGGNNNNNNGAGC
+SRR518712.1 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1247:2069 length=202
?@@DB?ADFBFFHEHGAE+AC:?@,2CCFGB)):9?C@D<DH99DH@9@**9?98?##################################################################################################################################################
@SRR518712.2 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1232:2172 length=202
GTTTACAGTTCTTCCTCCTATGTTGGAGGTTCTCAGAATAACAAAGTTTTTCAACTAGATCCAGGAGAACTATGGAGGTCGTCATGTAAGACCGNGAGGGCCTGGAATAGGCCCAGATTATGTTTGGTTTGCTGGTATCCTTANTATCGGTGTCCCTTTTACTCTTAAAAGTTACTGTGGGAAAATGGCAGCGCAGNTCCAG
+SRR518712.2 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1232:2172 length=202
??@DDDDD2,2CCEGEGAEBIEFHE=+AFA3?3?<E@***11CB:B0:4?)090*9?*9DF########################################+:=B+A+2+CDBF+A######################################################################################
@SRR518712.3 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1317:2060 length=202
GAACAGAGGTGCTATTAAACAACAGAAAGGAAGAGGTTAAATTAAGTAGGGGTTACAATAATGTTTCCTGGAGGACAGGACACTTGAGAGGAAAGTTATTTNNGNNNNNNNNNNACAGTTTCGCCAGGGGTTACTTTAANNNNNNGNNCGTTTTGGGGCNNNNNNNNNNTGTNTNNGTTGCTACATCAAANNNNNNNNNAGT
+SRR518712.3 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1317:2060 length=202
@@@DDDBDA+ADDI@A@A:EF=FGHCHHIGEG>E9):3??DGH4?D0?9BD0('888C:@:@@###########################################################################################################################################

The first 101 bases are one end, followed by 101 bases from the other end; the two ends form a forward-reverse pair.
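To make that layout concrete, here is a minimal Python sketch that splits one concatenated 202-base record into its two 101-base mates. The function name `split_pair` and the default `read_len=101` are my own choices for illustration, not part of any SOAPdenovo2 tool. The sketch leaves the second mate as sequenced (no reverse-complementing), which is an assumption about what the downstream tools expect.

```python
def split_pair(seq, qual, read_len=101):
    """Split a concatenated paired-end record into (mate1, mate2).

    Each mate is returned as a (sequence, quality) tuple.
    Assumes both mates have the same length `read_len` (hypothetical default: 101).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    if len(seq) != 2 * read_len:
        raise ValueError("record length is not 2 * read_len")
    mate1 = (seq[:read_len], qual[:read_len])
    mate2 = (seq[read_len:], qual[read_len:])  # kept as sequenced, not reverse-complemented
    return mate1, mate2


# Usage with a synthetic 202-base record
seq = "A" * 101 + "C" * 101
qual = "I" * 101 + "#" * 101
(m1_seq, m1_qual), (m2_seq, m2_qual) = split_pair(seq, qual)
```

In practice the same split can usually be done when the reads are extracted from the archive, so a script like this is only needed if the data are already in concatenated form.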

My concerns are mainly about the data preparation step:

1) Do I need to delete low-quality reads before ErrorCorrection, and if so, what criteria should I use?
2) Do I need to delete identical reads, which are presumably generated by PCR amplification, before ErrorCorrection?
3) Does the current version of ErrorCorrection make use of base-calling quality information? What difference does it make if I use FASTA format instead?
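To illustrate what a criterion for question 1 might look like, here is a minimal Python sketch of one possible per-read filter. It assumes Phred+33 quality encoding (the example reads contain characters such as `)` and `+`, which fit that encoding; under it, `#` is Q2). The thresholds `min_q=20`, `max_low_frac=0.4`, and `max_n_frac=0.1` are hypothetical placeholders, not values recommended by SOAPdenovo2.

```python
def low_quality_fraction(qual, min_q=20, offset=33):
    """Fraction of bases whose Phred score is below `min_q`.

    `offset=33` assumes Phred+33 (Sanger) encoding.
    """
    if not qual:
        raise ValueError("empty quality string")
    low = sum(1 for c in qual if ord(c) - offset < min_q)
    return low / len(qual)


def passes_filter(seq, qual, min_q=20, max_low_frac=0.4, max_n_frac=0.1):
    """Return True if the read passes both hypothetical thresholds.

    A read fails if too many bases are low quality or too many are 'N'.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    n_frac = seq.upper().count("N") / len(seq)
    return (low_quality_fraction(qual, min_q) <= max_low_frac
            and n_frac <= max_n_frac)


# Usage: a read whose second half is all '#' (Q2) fails the 40% cutoff
print(passes_filter("A" * 100, "I" * 50 + "#" * 50))
```

For paired-end data, a filter like this would normally be applied to both mates together, discarding the pair if either mate fails, so that the two files stay in sync.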
