sendru
Nov 29, 2012, 2:03:54 AM
to bgi-...@googlegroups.com
I am quite new to de novo assembly and recently got some mammalian reads to analyze.
The reads are from a paired-end (PE) library, 101 bases per end, 400M reads in total.
Some of the reads look really bad in terms of base-calling quality. Here are some examples:
@SRR518712.1 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1247:2069 length=202
ATGGACATTGCACCTGCCCTACTGTCTCCCTCCTATCATGCAGATCCTTGTTCCAAACCGCACATTGATTTCCTGGGTGTTCCAACTGATCTTAGGTCACTNAAGNAANNCANAAGGGGGGAAATCAAGCAGTAATACANCNNNATGNGTCTGAGGGGTTANGNNNNCAGAGAANGGGGGGGTCGTGGGTGGNNNNNNGAGC
+SRR518712.1 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1247:2069 length=202
?@@DB?ADFBFFHEHGAE+AC:?@,2CCFGB)):9?C@D<DH99DH@9@**9?98?##################################################################################################################################################
@SRR518712.2 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1232:2172 length=202
GTTTACAGTTCTTCCTCCTATGTTGGAGGTTCTCAGAATAACAAAGTTTTTCAACTAGATCCAGGAGAACTATGGAGGTCGTCATGTAAGACCGNGAGGGCCTGGAATAGGCCCAGATTATGTTTGGTTTGCTGGTATCCTTANTATCGGTGTCCCTTTTACTCTTAAAAGTTACTGTGGGAAAATGGCAGCGCAGNTCCAG
+SRR518712.2 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1232:2172 length=202
??@DDDDD2,2CCEGEGAEBIEFHE=+AFA3?3?<E@***11CB:B0:4?)090*9?*9DF########################################+:=B+A+2+CDBF+A######################################################################################
@SRR518712.3 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1317:2060 length=202
GAACAGAGGTGCTATTAAACAACAGAAAGGAAGAGGTTAAATTAAGTAGGGGTTACAATAATGTTTCCTGGAGGACAGGACACTTGAGAGGAAAGTTATTTNNGNNNNNNNNNNACAGTTTCGCCAGGGGTTACTTTAANNNNNNGNNCGTTTTGGGGCNNNNNNNNNNTGTNTNNGTTGCTACATCAAANNNNNNNNNAGT
+SRR518712.3 DGM97JN1_110802_0138_B817D2ABXX:1:1101:1317:2060 length=202
@@@DDDBDA+ADDI@A@A:EF=FGHCHHIGEG>E9):3??DGH4?D0?9BD0('888C:@:@@###########################################################################################################################################
The first 101 bases are one end, followed by 101 bases from the other end; together they form a forward-reverse pair.
My concerns are mainly about the data preparation step:
1) Do I need to remove low-quality reads before ErrorCorrection? If so, what criteria should I use?
2) Do I need to remove identical reads, which are presumably generated by PCR amplification, before ErrorCorrection?
3) Does the current version of ErrorCorrection make use of base-calling quality information? What difference does it make if I use FASTA format instead?
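To make the layout concrete, here is a minimal Python sketch of how one such concatenated record could be split back into its two mates. It assumes each record is exactly four lines and each mate is 101 bases; the function name `split_pair` and the `/1`, `/2` suffixes are only illustrative choices, not something any particular tool requires.

```python
def split_pair(record, mate_len=101):
    """Split one concatenated FASTQ record into two mates.

    `record` is a (header, sequence, plus, quality) tuple of strings
    without trailing newlines. `mate_len` is an assumption taken from
    the library description (101 bases per end).
    """
    header, seq, plus, qual = record
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    if len(seq) != 2 * mate_len:
        raise ValueError("expected %d bases, got %d" % (2 * mate_len, len(seq)))
    # Keep only the read ID (first whitespace-separated token, with '@').
    name = header.split()[0]
    mate1 = (name + "/1", seq[:mate_len], "+", qual[:mate_len])
    mate2 = (name + "/2", seq[mate_len:], "+", qual[mate_len:])
    return mate1, mate2
```

The second mate is kept exactly as stored in the record; no reverse-complementing is done here.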
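To make question 1 more concrete, this is roughly how the "badness" of a read could be measured. It is a minimal Python sketch, assuming the qualities are Phred+33 encoded, where `#` corresponds to Q2. The function name `tail_stats` is hypothetical, and no threshold is implied; which cutoff to apply is exactly what I am asking about.

```python
def tail_stats(seq, qual, low_char="#"):
    """Summarize low-quality signals for a single read.

    Assumes Phred+33 qualities, so '#' means Q2. Returns the read
    length, the number of trailing `low_char` quality values, and
    the number of N bases in the sequence.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Length of the run of low-quality characters at the 3' end.
    trailing_low = len(qual) - len(qual.rstrip(low_char))
    return {
        "length": len(seq),
        "trailing_low": trailing_low,
        "n_count": seq.upper().count("N"),
    }
```

For the records shown above, the statistics would have to be computed on each 101-base half separately, since the two ends are concatenated in one record.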