missing EOF marker after splitting bam file

pegos...@gmail.com

unread,

Sep 10, 2013, 1:23:52 PM9/10/13

to bamu...@googlegroups.com

Hi,

I used the splitBam program to sort a BAM by read group into multiple BAM files. Unfortunately the BAM files generated do not have the EOF marker and cannot be used for further analyses. Is there an easy fix to this issue?

Many thanks

Carlos A Machado

Mary Kate Wing

unread,

Sep 10, 2013, 3:20:34 PM9/10/13

to bamu...@googlegroups.com

I am sorry to hear you are running into a problem. I am happy to help resolve the issue.

Unfortunately, I was not able to reproduce your problem. The BAMs that I generated using splitBam have the EOF marker at the end.

Here is what I get when I do a hex dump of the last 28 bytes on my generated BAM file:

xxd -s -28 results/splitBam.RG2.bam

00002dc: 1f8b 0804 0000 0000 00ff 0600 4243 0200 ............BC..

00002ec: 1b00 0300 0000 0000 0000 0000 ............

The last 28 bytes is the empty BGZF block.

Do you see that on your BAMs?

If you cd into bamUtil/test and run ./testSplitBam.sh, does it report any errors?

I would definitely like to help figure this out. We want the splitBam output to work well as an input to other programs.

If you do see those empty blocks, what error message are you seeing/what program is failing on the outputs of splitBam? I would like to try running my bams through to make sure I didn't do something wrong with generating that block.

Mary Kate

--
You received this message because you are subscribed to the Google Groups "bamUtils" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bamutils+u...@googlegroups.com.
To post to this group, send email to bamu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bamutils/86c4204c-faca-4a55-a317-7b4b52f48443%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

pegos...@gmail.com

unread,

Sep 10, 2013, 3:48:55 PM9/10/13

to bamu...@googlegroups.com

Hi Mary Kate

many thanks for the quick answer.

Here is what I get when doing what you suggest for two different bam files generated by splitBam:

> xxd -s -28 New.Og54.bam

a266cfe4:3d55 dfcc 8474 9cf3 1e2b 7ad7 2d6d b799 =U...t...+z.-m..

a266cff4:5644 dc78 a468 795f 6cba 24eb VD.x.hy_l.$.

> xxd -s -28 New.Ob137.bam

208643fe40670 9ed3 d214 e0e2 552c 2f45 24bd eba6 .p......U,/E$...

208643ff4bc75 b87b f755 a1b3 3340 4953 .u.{.U..3@IS

Here is what I get in the large bam file that was splitted:

> xxd -s -28 ObOgOp.bam

6ae828c081f8b 0804 0000 0000 00ff 0600 4243 0200 ............BC..

6ae828c091b00 0300 0000 0000 0000 0000 ............

Here is what I get when I cd into bamUtil/test and run ./testSplitBam.sh:

ERROR: Cannot open the log file results/splitBam.log. Check if the directory exists and you have the permission to create a file./testSplitBam.sh: line 3: 10027 Aborted (core dumped) ../bin/bam splitBam --in testFiles/splitBam.bam --out results/splitBam

diff: results/splitBam.RG1.bam: No such file or directory

diff: results/splitBam.RG2.bam: No such file or directory

ERROR: Cannot open the log file results/splitSam.log. Check if the directory exists and you have the permission to create a file./testSplitBam.sh: line 17: 10031 Aborted (core dumped) ../bin/bam splitBam -i testFiles/splitBam.sam -o results/splitSam

diff: results/splitSam.RG1.bam: No such file or directory

diff: results/splitSam.RG2.bam: No such file or directory

Carlos

Mary Kate Wing

unread,

Sep 10, 2013, 4:25:56 PM9/10/13

to bamutils

A few more things...

1) What version of bamUtil do you have?

If you run bamUtil/bin/bam with no parameters it will print out the options. At the top of that printout is a version number.

1a) Do you know what version of libStatGen you are using?

You can use:

grep -m 1 VERSION libStatGen/Makefile

Update the last part with the correct path to libStatGen (likely up one directory from bamUtil).

2) Is bamUtil in a directory that you have permission to write in?

If so, can you cd bamUtil/test; mkdir results; ./testSplitBam.sh

I'm sorry, I forgot that it doesn't create the directory if it doesn't already exist.

If you don't have write permission, can you just run ../bin/bam splitBam --in testFiles/splitBam.bam --out ~/splitBam

That will create: splitBam.log splitBam.RG1.bam splitBam.RG2.bam

Thanks for the additional information.

Mary Kate

To view this discussion on the web visit https://groups.google.com/d/msgid/bamutils/fa421c36-403d-4597-a40c-c33d676d793e%40googlegroups.com.

pegos...@gmail.com

unread,

Sep 10, 2013, 9:16:00 PM9/10/13

to bamu...@googlegroups.com

HI,

1) What version of bamUtil do you have?
If you run bamUtil/bin/bam with no parameters it will print out the options. At the top of that printout is a version number.

version 1.0.9

1a) Do you know what version of libStatGen you are using?

VERSION ?= 1.0.9

2) Is bamUtil in a directory that you have permission to write in?
If so, can you cd bamUtil/test; mkdir results; ./testSplitBam.sh

OK, I did that and the files were created. I checked again if the EOF was present:

> xxd -s -28 splitBam.RG1.bam

00002d3: 1f8b 0804 0000 0000 00ff 0600 4243 0200 ............BC..

00002e3: 1b00 0300 0000 0000 0000 0000 ............

it looks fine...so, what could be going on with my other files?

Thanks

Carlos

Mary Kate Wing

unread,

Sep 11, 2013, 2:00:32 PM9/11/13

to bamutils

Sorry for the delay in responding.

Unfortunately, I am still uncertain as to what happened. I wonder if it hit an error and the program never finished running.

Can you look at the start of your BAM files:

xxd -l 16 New.Og54.bam

xxd -l 16 New.Og137.bam

xxd -l 16 ObOgOp.bam

They should all look like:

0000000: 1f8b 0804 0000 0000 00ff 0600 4243 0200 ............BC..

If not, it would appear that they aren't compressed at all. Let me know the results of this.

If all the BAMs have the correct starting header, I guess we can look at the BAMs and see if they are valid:

bam validate --in New.Og54.bam

That should fail right away due to the missing EOF marker, so run it again with:

bam validate --in New.Og54.bam --noeof

Can you also run this on your original BAM and your other RG bams:

bam validate --in New.Og137.bam --noeof

bam validate --in ObOgOp.bam

That will ensure the BAMs are valid and print how many records are in each one.

The sum of the number of records in all the BAMs produced by the split should equal the number of records in the original BAM.

How many RG bams are you getting?

Thanks for the additional info. Hopefully we can track this down soon.

Mary Kate

To view this discussion on the web visit https://groups.google.com/d/msgid/bamutils/0ef0cffc-eb2a-4731-9b78-1ce119def9ea%40googlegroups.com.

pegos...@gmail.com

unread,

Sep 12, 2013, 3:53:59 PM9/12/13

to bamu...@googlegroups.com

HI,

no problem, we are all busy but I truly appreciate your help trying to figure out what is going on here

On Wednesday, September 11, 2013 2:00:32 PM UTC-4, Mary Kate Wing wrote:

Sorry for the delay in responding.

Unfortunately, I am still uncertain as to what happened. I wonder if it hit an error and the program never finished running.

Can you look at the start of your BAM files:

xxd -l 16 New.Og54.bam
xxd -l 16 New.Og137.bam
xxd -l 16 ObOgOp.bam

They should all look like:

0000000: 1f8b 0804 0000 0000 00ff 0600 4243 0200 ............BC..

If not, it would appear that they aren't compressed at all. Let me know the results of this.

YES, they all look like that

If all the BAMs have the correct starting header, I guess we can look at the BAMs and see if they are valid:

bam validate --in New.Og54.bam
That should fail right away due to the missing EOF marker, so run it again with:

bam validate --in New.Og54.bam --noeof

> bam validate --in New.Og54.bam --noeof

Record 30475278:

FAIL_IO: Failed to read the record size.

Number of records read = 30475278

Number of valid records = 30475278

TotalReads(e6) 30.48

MappedReads(e6) 30.00

PairedReads(e6) 30.48

ProperPair(e6) 29.19

DuplicateReads(e6) 2.47

QCFailureReads(e6) 0.00

MappingRate(%) 98.45

PairedReads(%) 100.00

ProperPair(%) 95.79

DupRate(%) 8.11

QCFailRate(%) 0.00

TotalBases(e6) 2925.63

BasesInMappedReads(e6) 2880.39

Returning: 3 (FAIL_IO)

Can you also run this on your original BAM and your other RG bams:
bam validate --in New.Og137.bam --noeof

> bam validate --in New.Ob137.bam --noeof

Record 109676960:

FAIL_IO: Failed to read the record

Number of records read = 109676960

Number of valid records = 109676960

TotalReads(e6) 109.68

MappedReads(e6) 106.19

PairedReads(e6) 109.68

ProperPair(e6) 100.11

DuplicateReads(e6) 13.76

QCFailureReads(e6) 0.00

MappingRate(%) 96.82

PairedReads(%) 100.00

ProperPair(%) 91.28

DupRate(%) 12.54

QCFailRate(%) 0.00

TotalBases(e6) 10528.99

BasesInMappedReads(e6) 10194.27

Returning: 3 (FAIL_IO)

bam validate --in ObOgOp.bam

> bam validate --in ObOgOp.bam

OK, I will send these results later, the process is taking a long time (it is a 500 GB file with 40 RG bams)

That will ensure the BAMs are valid and print how many records are in each one.

The sum of the number of records in all the BAMs produced by the split should equal the number of records in the original BAM.

OK, I would take a while to do this for all the bam files (40 files)

How many RG bams are you getting?

The large bam has 40 RG bams and I am getting all of those

Many thanks

Carlos

Mary Kate Wing

unread,

Sep 12, 2013, 4:48:08 PM9/12/13

to bamutils

Ah, I see, you have a large number of output RG bams.

You don't need to validate any more of the RG specific ones, but I am interested to see the validation results on the original file.

What does your splitBam log file say?

If you didn't specify --log, I should be called New.log.

Also, about how long did it take to run split?

Unfortunately, with a 500G file & 40 RGs, I'm guessing it took a while.

I'm hoping the log will lead me to another idea.

But if it doesn't, I might request that you rerun with --out New1 or something like that.

I am interested to see if it gets the exact same results.

But before rerunning (since it takes so long), I'm hoping the log will help lead me to something.

Mary Kate

To view this discussion on the web visit https://groups.google.com/d/msgid/bamutils/0a2e84f6-93fb-42bd-baaa-09b119222677%40googlegroups.com.

pegos...@gmail.com

unread,

Sep 13, 2013, 10:08:49 AM9/13/13

to bamu...@googlegroups.com

Hi Mary Kate,

You don't need to validate any more of the RG specific ones, but I am interested to see the validation results on the original file.

Here it is:

>bam validate --in ObOgOp.bam

Number of records read = 1316728014

Number of valid records = 1316728014

TotalReads(e6) 5611.70

MappedReads(e6) 4528.66

PairedReads(e6) 5611.70

ProperPair(e6) 4401.25

DuplicateReads(e6) 260.85

QCFailureReads(e6) 0.00

MappingRate(%) 80.70

PairedReads(%) 100.00

ProperPair(%) 78.43

DupRate(%) 4.65

QCFailRate(%) 0.00

TotalBases(e6) 535330.17

BasesInMappedReads(e6) 430435.49

Returning: 0 (SUCCESS)

What does your splitBam log file say?
If you didn't specify --log, I should be called New.log.

Here is the content of New.log

Input BAM file : ObOgOp.bam

Output BAM prefix : New

Output log file : New.log

Verbose mode : On

BGFZ EOF indicator : On

The following ReadGroup IDs are identified. Splitting into 40 BAM files..

1: Ob003

2: Ob008

3: Ob026

4: Ob047

5: Ob085

6: Ob092

7: Ob094

8: Ob096

9: Ob100122

10: Ob100931

11: Ob100934

12: Ob103895

13: Ob104084

14: Ob104119

15: Ob105608

16: Ob106234

17: Ob137

18: Ob140

19: Ob214

20: Og101049

21: Og103469

22: Og41

23: Og43

24: Og45

25: Og46

26: Og51

27: Og54

28: Og57

29: Og67563

30: Og68939

31: Og68976

32: Og75500

33: Og96841

34: OgTOG5457

35: OgTOG5467

36: OgTOG5923

37: OgTOG5949

38: OgTOG7025

39: OgTOG7102

40: Opunctata

Also, about how long did it take to run split?
Unfortunately, with a 500G file & 40 RGs, I'm guessing it took a while.

Yes, I don't remember exactly but it was between 24-48 hours.

I'm hoping the log will lead me to another idea.
But if it doesn't, I might request that you rerun with --out New1 or something like that.
I am interested to see if it gets the exact same results.

But before rerunning (since it takes so long), I'm hoping the log will help lead me to something.

As I am running out of space in my drive, I guess I will delete the problematic split bam files as it does not look like we can add the EOF marker to them, correct?

Thanks

Carlos

Mary Kate Wing

unread,

Sep 13, 2013, 11:30:42 AM9/13/13

to bamutils

Hmm, your log file is definitely interesting. At the end of the log file, it is supposed to print the number of records that were successfully written to each RG specific file.

Here is the log file from my testcase

more ~/code/bamUtil/test/results/splitBam.log

Input BAM file : testFiles/splitBam.bam

Output BAM prefix : results/splitBam

Output log file : results/splitBam.log

Verbose mode : Off

BGFZ EOF indicator : On

The following ReadGroup IDs are identified. Splitting into 2 BAM files..

1: RG1

2: RG2

Successfully wrote 10 record for readGroup RG1

Successfully wrote 10 record for readGroup RG2

As you can see on the bottom of the log it says how many successfully wrote.

It appears that your run did not complete successfully.

Unfortunately the log does not indicate an error. I'm assuming you didn't see one on the command-line either, is that correct?

I'm guessing that something happened that caused it to stop running. That would explain why it never wrote those end of file markers - it never hit that part of the code.

That would also explain why your log is missing the number of records in each read group.

Since it appears to me that splitBam did not finish running, I would guess that your per RG bams are incomplete - missing records. Also, based on the validation failures - it appears that they did not end on a record boundary - only parts of records were written.

If they were valid BAMs, you could run bam convert --in New.Ob137.bam --out New1.Ob137.bam --noeof

That would add an EOF block to the end. You can try it, but I would expect that it would fail on reading the last record due to it being truncated.

So unfortunately, I think at this point I would have to recommend to rerun. You can delete your old RG specific BAMs.

Hopefully this time splitBam completes successfully or reports an error on the command-line or in the log to explain what happened.

I'm sorry for the troubles that this has caused. It shouldn't silently fail like that unless maybe it got killed and was unable to report anything.

If you are unable to rerun, and don't mind that you may be missing records, you can run recovery on your per RG bams. That should truncate them at the last valid BGZF block, preserving as much of the BAM as it can figure out. This tool was written to help us recover some of our BAM files that were corrupted by our filesystem in the past.

Use the --recover option of bam convert to try this: http://genome.sph.umich.edu/wiki/BamUtil:_convert

But as I said, using recover will leave your BAMs missing records. I think the "best" thing to do is rerun.

Mary Kate

To view this discussion on the web visit https://groups.google.com/d/msgid/bamutils/bbddba17-f890-4777-a617-9006c24a9499%40googlegroups.com.

pegos...@gmail.com

unread,

Sep 13, 2013, 12:20:14 PM9/13/13

to bamu...@googlegroups.com

HI Mary Kate,

yes, I see now. I run again splitbam on another smalller bam file that I created and here is the log

Input BAM file : 3japonica.bam

Output BAM prefix : New

Output log file : New.log

Verbose mode : On

BGFZ EOF indicator : On

The following ReadGroup IDs are identified. Splitting into 3 BAM files..

1: OJ28275

2: OJ36420

3: OJ3860

Successfully wrote 37579408 record for readGroup OJ28275

Successfully wrote 73649244 record for readGroup OJ36420

Successfully wrote 50461396 record for readGroup OJ3860

WHich looks correct now. These bam files were all manipulated previously using your set of tools (adding RG tags and header) and the merge was also done with your merge tool (for the large bam file I used samtools).

I will run things again adn will let you know if I see the same problem but it may take some time.