read.Bismark

raza rahman

unread,

Oct 16, 2014, 5:02:17 AM10/16/14

to methylkit_...@googlegroups.com

Hello,
Is it necessary for the sam file to have header lines? I mean I sorted the sam output file from bismark and removed the headers? Is it necessary to have them in the file for read.bismark?

Error

not enough alignments that pass coverage and phred score thresholds to calculate conversion rates
 EXITING....

 
Error in .local(location, sample.id, assembly, save.folder, save.context,  : 
  
Error in methylation calling...
Make sure the file is sorted correctly and it is a legitimate Bismark SAM file

thanks

Kalyan K Pasumarthy

unread,

Oct 16, 2014, 5:18:47 AM10/16/14

to methylkit_...@googlegroups.com

Methylkit vignette recommends to remove the headers. Make sure that SAM file is sorted.

Regards,
Kalyan

--
You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.
To post to this group, send email to methylkit_...@googlegroups.com.
Visit this group at http://groups.google.com/group/methylkit_discussion.
For more options, visit https://groups.google.com/d/optout.

raza rahman

unread,

Oct 16, 2014, 6:00:14 AM10/16/14

to methylkit_...@googlegroups.com

Thanks for the reply. Yes the sam file is sorted. This is how my flow is

I have output sam files from Bismark.
        I removed the duplicates through bismark_deduplicate.
        I sorted them as described in the methylkit documentation (unix sort).
        I added chr to the chromosome number.

My file looks like this now (50 GB)


HWI-ST1140:146:C45ELACXX:5:1305:20630:83995_1:N:0:TTAGGC 99 chr1 3000086 40 38M = 3000120 134 TTGGTTTTGGGTTTTTTTTTTTTTTTTTTTTTTTTTTT CCCFFFFFHHHHHJJJJJJJJJJJHDDDDDDDDDDDDB NM:i:4 MD:Z:0C4C0C4C26 XM:Z:x....hx....h.......................... XR:Z:CT XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:1305:20630:83995_1:N:0:TTAGGC 147 chr1 3000120 40 100M = 3000086 -134 TTTTGGTTGGGAGATTATTGATGATTGTTTTTATTTTTTTAGGGGAAATGGGATTTTTAGTTTATGAATTTGATTTTGATTTAGTTTTGGTATTTGGCAT DDDDDDDDDDDDDEDEDDDDEEDEDDFFFFHHHHJJJJJJJJJJJJIJIHIIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC NM:i:17 MD:Z:6G7C9C2C0C1C5C16C7C0C6C4C0C8C7C0C3T2 XM:Z:..............h.........x..hh.h.....h................h.......hh......x....hx........h.......hx...... XR:Z:GA XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:1116:12516:66117_1:N:0:TTAGGC 99 chr1 3000667 42 99M = 3000681 112 GGTTATGTTGTGGATTTATTTTTATTAAATTTTAAAATATTTTTAATTTTTTTTTTTATTTTATTATTGATTAAGTTATTATTAAGTAGAGTATTGTTT @CBDDFFFHHFHHIJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJHFFFEEEEEEEEFDDEDEDDDCDEDEEEEDDDEDDDCDDEDEDD NM:i:13 MD:Z:2C19C6C7C2C12C7C2C5C0C3C3C18C0 XM:Z:..h...................h......h.......h..h............h.......h..h.....hh...h...h..................x XR:Z:CT XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:1116:12516:66117_1:N:0:TTAGGC 147 chr1 3000681 42 98M = 3000667 -112 TTTATTTTTATTAAATTTTAAAATATTTTTAATTTTTTTTTTTATTTTATTATTGATTAAGTTATTATTAAGTAGAGTATTGTTTAGTTTTTAGGTGA DDDDDDDDECEDCCEDDDEEEEFDDDDDDDDDDDHJJJJJJJJJJJJIJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJIIJJJJJHHHHHFFFFFCCC NM:i:14 MD:Z:8C6C7C2C12C7C2C5C0C3C3C18C5C0C6 XM:Z:........h......h.......h..h............h.......h..h.....hh...h...h..................x.....hx...... XR:Z:GA XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:1207:19464:21081_1:N:0:TTAGGC 99 chr1 3000735 42 99M = 3000824 188 GATTAAGTTATTATTAAGTAGAGTATTGTTTAGTTTTTAGGTGAATGTTGGTTTTTTATTATTTATGTTGTTATTGAAGATTAGTTTTAGTTCGTAGTT BBCFDFFFHHHHHJJJJJHJJJJFIIJJJHJIJHHJJJJJJEHIIJJIIJJIJJJJJJJJJJJJJJJIJJIIJJJJHHHHHHHFBEFFFEEEEDDDDDE NM:i:14 MD:Z:2C0C3C3C18C5C0C13C3C11C13C2C0C5C7 XM:Z:..hh...h...h..................x.....hx.............h...h...........x.............x..hh.....xZ...... XR:Z:CT XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:1207:19464:21081_1:N:0:TTAGGC 147 chr1 3000824 42 99M = 3000735 -188 GTTCGTAGTTATTTGAAAGGATGTATGGGAAAATTTTAATATTTTTGTATTTGTTGAGGATTTTTTGTGAGTGATTATATGGTTAATTTTGGAGGATTT DDDDDDDDDDDEEEEEEEFFFFFFHHHHHHIJJJJJJJJIJJJJJJJIHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC NM:i:8 MD:Z:2C9C10C12C13C9C13C8C15 XM:Z:..xZ........x..........h............h.............x.........h.............h........h............... XR:Z:GA XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:2112:13241:99119_1:N:0:TTAGGC 99 chr1 3000922 42 100M = 3000965 142 TGGTATTGAGAAGAAGGTATATATTTTTTTGTTTTATGATAAAATGTTTTGTAGATATTTATTAAATTTATTTGTTTTATAATTTCGGTTAGTGTTCGTG CCCFFFFFHHHGHJIJJAHJJJIJJJJJJJJHIJJJJJJJJJIJJIIIJJJHIJJJJJJJIJJJJJJJJIJHHHHEHFFFFFFFEEDDDDDDDDDFDDDD NM:i:10 MD:Z:5C18C0C6C15C9C9C8C4C12C4 XM:Z:.....x..................hh......h...............x.........h.........h........h....h..Z.........xZ... XR:Z:CT XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:2112:13241:99119_1:N:0:TTAGGC 147 chr1 3000965 42 99M = 3000922 -142 ATGTTTTGTAGATATTTATTAAATTTATTTGTTTTATAATTTCGGTTAGTGTTCGTGTGTTTTTGTTTAGTTTTTGTTTTTAGGATTTGTTTTTTGGCG DDDDDDDDEEEEEEDDEDEDDEEDDDDDDDDDEEEEDDDAAABFHHFFEJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJHHHHHFFFFFCCC NM:i:17 MD:Z:5C9C9C8C4C12C7C1C4C5C2C2C0C5C3C1C4T1 XM:Z:.....x.........h.........h........h....h..Z.........xZ......h.x....x.....x..h..hx.....x...h.h...... XR:Z:GA XG:Z:CT
HWI-ST1140:146:C45ELACXX:5:1103:3371:72689_1:N:0:TTAGGC 163 chr1 3001053 42 99M = 3001055 100 GTCTCTTAATAAAAATATAATCTTAAAATCTCCCAATATTATTTTATAAAATACAATATATACTTTAATCTTTAACAAAATATATTTAATAAATATAAC @BBFDFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJIIIIIJJJJJJJIJFHHHHHHFFFF NM:i:26 MD:Z:7G0G1G1G1G1G1G0G4G2G7G11G1G0G1G4G1G1G4G7G4G1G8G3G1G0G1 XM:Z:X......hh.h.h.h.h.hh....h..h.......x...........h.hh.h....h.h.h....h.......h....h.h........h...h.hh. XR:Z:GA XG:Z:GA
HWI-ST1140:146:C45ELACXX:5:2116:9780:51547_1:N:0:TTAGGC 163 chr1 3001053 24 99M = 3001138 184 ATCTTTTAATAAAAATATAATCTTAAAATCTCCCAATATTATTTTATAAAATACAATATATACTTTAATCTTTAACAAAATATATTTAATAAATATAAC CC@FFFFFHHHGHJJJIIIHIHIIJIIJJGHJIJIGIJJJJIJJJJJJIJJHIJIIIJJJJJDAGIJJIJJJJIIJIIJJJIJIJJJJHHHHGHHFFFF NM:i:28 MD:Z:0G3C2G0G1G1G1G1G1G0G4G2G7G11G1G0G1G4G1G1G4G7G4G1G8G3G1G0G1 XM:Z:x......hh.h.h.h.h.hh....h..h.......x...........h.hh.h....h.h.h....h.......h....h.h........h...h.hh. XR:Z:GA XG:Z:GA


I used the following script
listOfFiles<-list("sorted_A.deduplicated.sam","sorted_B.deduplicated.sam","sorted_C.deduplicated.sam","sorted_D.deduplicated.sam")
myobj=read.bismark(location=listOfFiles,sample.id=list("A","B","C","D"),assembly="mm10",save.folder=path,save.context="CpG",read.context="none",mincov=5,minqual=20,treatment=c(1,1,0,0))


It gives me this error



not enough alignments that pass coverage and phred score thresholds to calculate conversion rates
 EXITING....

Error in .local(location, sample.id, assembly, save.folder, save.context,  : 
  
Error in methylation calling...
Make sure the file is sorted correctly and it is a legitimate Bismark SAM file

I was getting this error before and through I came to know that I should remove the duplicated alignments only through the bismark script, (which I did now)
I also tried to run it without these parameters mincov=5,minqual=20, but it still gives me the same error.

Any suggestions.

thanks in advance

On Thursday, October 16, 2014 11:18:47 AM UTC+2, Kalyan Kumar Pasumarthy wrote:

Methylkit vignette recommends to remove the headers. Make sure that SAM file is sorted.

Regards,
Kalyan

On 16 October 2014 12:02, raza rahman <raza9p...@gmail.com> wrote:

Hello,
Is it necessary for the sam file to have header lines? I mean I sorted the sam output file from bismark and removed the headers? Is it necessary to have them in the file for read.bismark?

Error
not enough alignments that pass coverage and phred score thresholds to calculate conversion rates
 EXITING....

 
Error in .local(location, sample.id, assembly, save.folder, save.context,  : 
  
Error in methylation calling...
Make sure the file is sorted correctly and it is a legitimate Bismark SAM file
thanks

--
You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.

To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discussion+unsub...@googlegroups.com.

Kalyan K Pasumarthy

unread,

Oct 16, 2014, 6:23:24 AM10/16/14

to methylkit_...@googlegroups.com

You may let us know how many alignments were left after deduplication. These stats are provided in the bismark deduplication report.

Also, try using minqual =5 or 10 or 15 (This will help you only if you have enough alignments. So, the first step is to look at the deduplicated stats)

Regards,
Kalyan

To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.

raza rahman

unread,

Oct 16, 2014, 6:31:50 AM10/16/14

to methylkit_...@googlegroups.com

This is the report for all the four files. I tried with minqual 20 earlier but now left it default.

sorted_A.deduplicated.sam

Total number of alignments analysed in output/A/sorted_A.deduplicated.sam:    110368117
Total number duplicated alignments removed:    60519650 (54.83%)
Duplicated alignments were found at:    30052677 different position(s)

Total count of deduplicated leftover sequences: 49848467 (45.17% of total)

sorted_B.deduplicated.sam

Total number of alignments analysed in output/B/sorted_B.deduplicated.sam:	172223700
Total number duplicated alignments removed:	98591672 (57.25%)
Duplicated alignments were found at:	46569641 different position(s)

Total count of deduplicated leftover sequences: 73632028 (42.75% of total)

sorted_C.deduplicated.sam
Total number of alignments analysed in output/C/sorted_C.deduplicated.sam:	107532033
Total number duplicated alignments removed:	63739408 (59.27%)
Duplicated alignments were found at:	28465321 different position(s)

Total count of deduplicated leftover sequences: 43792625 (40.73% of total)

sorted_D.deduplicated.sam

Total number of alignments analysed in output/D/sorted_D.deduplicated.sam:    166817026
Total number duplicated alignments removed:    95903509 (57.49%)
Duplicated alignments were found at:    44933344 different position(s)

Total count of deduplicated leftover sequences: 70913517 (42.51% of total)

Regards,
Kalyan

To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discussion+unsubscrib...@googlegroups.com.

To post to this group, send email to methylkit_...@googlegroups.com.
Visit this group at http://groups.google.com/group/methylkit_discussion.
For more options, visit https://groups.google.com/d/optout.

Kalyan K Pasumarthy

unread,

Oct 16, 2014, 7:02:54 AM10/16/14

to methylkit_...@googlegroups.com

Hi Raza,

I am assuming that this data belongs to WGBS (For RRBS we do not deduplicate) and the genome is mm10.

The aligned sequences after deduplication are in the range of 43M-73M. This is a very low number of reads for WGBS. I do not expect much coverage in this data. So, this could be the reason for failure. Also, it would have been more informative if details of the experiment ie., if this data is single end or paired end and the read length.

mm10 is ~700MB. Which means one would need close to two lanes of HiSeq 2500 ie., ~ 360M PE 100bp data for 10X coverage of the genome (Consider the duplication in data and overlap in paired reads). Based on this information you may estimate what else you need!

Regards,
Kalyan

To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.

raza rahman

unread,

Oct 16, 2014, 7:42:01 AM10/16/14

to methylkit_...@googlegroups.com

Hi Kalyan,
You are right. This is WGBS data. This is paired end data and yes the genome is mm10. The read length is 100 bp.

On Thursday, October 16, 2014 1:02:54 PM UTC+2, Kalyan Kumar Pasumarthy wrote:

Hi Raza,

I am assuming that this data belongs to WGBS (For RRBS we do not deduplicate) and the genome is mm10. Yes my experiment design is almost the same as you said. here are the states for one of samples from bismark alignment.

Sequence pairs analysed in total:   354807808
Number of paired-end alignments with a unique best hit:   110368117
Mapping efficiency:   31.1%

Sequence pairs with no alignments under any condition:   238382622
Sequence pairs did not map uniquely:   6057069
Sequence pairs which were discarded because genomic sequence could not be extracted:   0

Total number of C's analysed:    4499496092

Total methylated C's in CpG context:    175337457
Total methylated C's in CHG context:    34728555
Total methylated C's in CHH context:    98112657
Total methylated C's in Unknown context:    23

Total unmethylated C's in CpG context:    51041673
Total unmethylated C's in CHG context:    1076524230
Total unmethylated C's in CHH context:    3063751520
Total unmethylated C's in Unknown context:    124

C methylated in CpG context:    77.5%
C methylated in CHG context:    3.1%
C methylated in CHH context:    3.1%
C methylated in unknown context (CN or CHN):    15.6%

I had this number of reads that you mentioned even more.

So what could be the possible solution to this problem now?

thanks again.

Reply all

Reply to author

Forward