missing EOF marker after splitting bam file

307 views
Skip to first unread message

pegos...@gmail.com

unread,
Sep 10, 2013, 1:23:52 PM9/10/13
to bamu...@googlegroups.com
Hi,

I used the splitBam program to sort a BAM by read group into multiple BAM files. Unfortunately the BAM files generated do not have the EOF marker and cannot be used for further analyses. Is there an easy fix to this issue?
Many thanks

Carlos A Machado

Mary Kate Wing

unread,
Sep 10, 2013, 3:20:34 PM9/10/13
to bamu...@googlegroups.com
I am sorry to hear you are running into a problem.  I am happy to help resolve the issue.

Unfortunately, I was not able to reproduce your problem.  The BAMs that I generated using splitBam have the EOF marker at the end.  

Here is what I get when I do a hex dump of the last 28 bytes on my generated BAM file:
xxd -s -28 results/splitBam.RG2.bam
00002dc: 1f8b 0804 0000 0000 00ff 0600 4243 0200  ............BC..
00002ec: 1b00 0300 0000 0000 0000 0000            ............

The last 28 bytes is the empty BGZF block.

Do you see that on your BAMs?

If you cd into bamUtil/test and run ./testSplitBam.sh, does it report any errors?

I would definitely like to help figure this out.  We want the splitBam output to work well as an input to other programs.

If you do see those empty blocks, what error message are you seeing/what program is failing on the outputs of splitBam?  I would like to try running my bams through to make sure I didn't do something wrong with generating that block.

Mary Kate


--
You received this message because you are subscribed to the Google Groups "bamUtils" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bamutils+u...@googlegroups.com.
To post to this group, send email to bamu...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bamutils/86c4204c-faca-4a55-a317-7b4b52f48443%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

pegos...@gmail.com

unread,
Sep 10, 2013, 3:48:55 PM9/10/13
to bamu...@googlegroups.com
Hi Mary Kate

many thanks for the quick answer.

Here is what I get when doing what you suggest for two different bam files generated by splitBam:

> xxd -s -28 New.Og54.bam
a266cfe4:3d55 dfcc 8474 9cf3 1e2b 7ad7 2d6d b799  =U...t...+z.-m..
a266cff4:5644 dc78 a468 795f 6cba 24eb            VD.x.hy_l.$.

> xxd -s -28 New.Ob137.bam
208643fe40670 9ed3 d214 e0e2 552c 2f45 24bd eba6  .p......U,/E$...
208643ff4bc75 b87b f755 a1b3 3340 4953            .u.{.U..3@IS

Here is what I get in the large bam file that was splitted:

> xxd -s -28 ObOgOp.bam
6ae828c081f8b 0804 0000 0000 00ff 0600 4243 0200  ............BC..
6ae828c091b00 0300 0000 0000 0000 0000            ............

Here is what I get when I cd into bamUtil/test and run ./testSplitBam.sh:

ERROR: Cannot open the log file results/splitBam.log. Check if the directory exists and you have the permission to create a file./testSplitBam.sh: line 3: 10027 Aborted                 (core dumped) ../bin/bam splitBam --in testFiles/splitBam.bam --out results/splitBam
diff: results/splitBam.RG1.bam: No such file or directory
diff: results/splitBam.RG2.bam: No such file or directory
ERROR: Cannot open the log file results/splitSam.log. Check if the directory exists and you have the permission to create a file./testSplitBam.sh: line 17: 10031 Aborted                 (core dumped) ../bin/bam splitBam -i testFiles/splitBam.sam -o results/splitSam
diff: results/splitSam.RG1.bam: No such file or directory
diff: results/splitSam.RG2.bam: No such file or directory

Carlos

Mary Kate Wing

unread,
Sep 10, 2013, 4:25:56 PM9/10/13
to bamutils
A few more things...

1) What version of bamUtil do you have?
If you run bamUtil/bin/bam with no parameters it will print out the options.  At the top of that printout is a version number.

1a) Do you know what version of libStatGen you are using?  
You can use:
     grep -m 1 VERSION libStatGen/Makefile
Update the last part with the correct path to libStatGen (likely up one directory from bamUtil).



2) Is bamUtil in a directory that you have permission to write in?  
If so, can you cd bamUtil/test; mkdir results; ./testSplitBam.sh

I'm sorry, I forgot that it doesn't create the directory if it doesn't already exist.

If you don't have write permission, can you just run ../bin/bam splitBam --in testFiles/splitBam.bam --out ~/splitBam 
That will create: splitBam.log  splitBam.RG1.bam  splitBam.RG2.bam

Thanks for the additional information.
Mary Kate



pegos...@gmail.com

unread,
Sep 10, 2013, 9:16:00 PM9/10/13
to bamu...@googlegroups.com
HI,

1) What version of bamUtil do you have?
If you run bamUtil/bin/bam with no parameters it will print out the options.  At the top of that printout is a version number.

version 1.0.9

 
1a) Do you know what version of libStatGen you are using?  

VERSION ?= 1.0.9

2) Is bamUtil in a directory that you have permission to write in?  
If so, can you cd bamUtil/test; mkdir results; ./testSplitBam.sh

OK, I did that and the files were created. I checked again if the EOF was present:

> xxd -s -28 splitBam.RG1.bam
00002d3: 1f8b 0804 0000 0000 00ff 0600 4243 0200  ............BC..
00002e3: 1b00 0300 0000 0000 0000 0000            ............

it looks fine...so, what could be going on with my other files? 
Thanks

Carlos

Mary Kate Wing

unread,
Sep 11, 2013, 2:00:32 PM9/11/13
to bamutils
Sorry for the delay in responding.

Unfortunately, I am still uncertain as to what happened.  I wonder if it hit an error and the program never finished running.

Can you look at the start of your BAM files:

xxd -l 16 New.Og54.bam
xxd -l 16 New.Og137.bam
xxd -l 16 ObOgOp.bam

They should all look like:
0000000: 1f8b 0804 0000 0000 00ff 0600 4243 0200  ............BC..

If not, it would appear that they aren't compressed at all.  Let me know the results of this.

If all the BAMs have the correct starting header, I guess we can look at the BAMs and see if they are valid:

bam validate --in New.Og54.bam
That should fail right away due to the missing EOF marker, so run it again with:

bam validate --in New.Og54.bam --noeof

Can you also run this on your original BAM and your other RG bams:
bam validate --in New.Og137.bam --noeof
bam validate --in ObOgOp.bam

That will ensure the BAMs are valid and print how many records are in each one.

The sum of the number of records in all the BAMs produced by the split should equal the number of records in the original BAM.

How many RG bams are you getting?

Thanks for the additional info.  Hopefully we can track this down soon.
Mary Kate


pegos...@gmail.com

unread,
Sep 12, 2013, 3:53:59 PM9/12/13
to bamu...@googlegroups.com
HI,

no problem, we are all busy but I truly appreciate your help trying to figure out what is going on here


On Wednesday, September 11, 2013 2:00:32 PM UTC-4, Mary Kate Wing wrote:
Sorry for the delay in responding.

Unfortunately, I am still uncertain as to what happened.  I wonder if it hit an error and the program never finished running.

Can you look at the start of your BAM files:

xxd -l 16 New.Og54.bam
xxd -l 16 New.Og137.bam
xxd -l 16 ObOgOp.bam

They should all look like:
0000000: 1f8b 0804 0000 0000 00ff 0600 4243 0200  ............BC..

If not, it would appear that they aren't compressed at all.  Let me know the results of this.


YES, they all look like that
 
If all the BAMs have the correct starting header, I guess we can look at the BAMs and see if they are valid:

bam validate --in New.Og54.bam
That should fail right away due to the missing EOF marker, so run it again with:

bam validate --in New.Og54.bam --noeof


> bam validate --in New.Og54.bam --noeof

Record 30475278: 
FAIL_IO: Failed to read the record size.

Number of records read = 30475278
Number of valid records = 30475278

TotalReads(e6) 30.48
MappedReads(e6) 30.00
PairedReads(e6) 30.48
ProperPair(e6) 29.19
DuplicateReads(e6) 2.47
QCFailureReads(e6) 0.00

MappingRate(%) 98.45
PairedReads(%) 100.00
ProperPair(%) 95.79
DupRate(%) 8.11
QCFailRate(%) 0.00

TotalBases(e6) 2925.63
BasesInMappedReads(e6) 2880.39
Returning: 3 (FAIL_IO)

 
Can you also run this on your original BAM and your other RG bams:
bam validate --in New.Og137.bam --noeof

> bam validate --in New.Ob137.bam --noeof

Record 109676960: 
FAIL_IO: Failed to read the record

Number of records read = 109676960
Number of valid records = 109676960

TotalReads(e6) 109.68
MappedReads(e6) 106.19
PairedReads(e6) 109.68
ProperPair(e6) 100.11
DuplicateReads(e6) 13.76
QCFailureReads(e6) 0.00

MappingRate(%) 96.82
PairedReads(%) 100.00
ProperPair(%) 91.28
DupRate(%) 12.54
QCFailRate(%) 0.00

TotalBases(e6) 10528.99
BasesInMappedReads(e6) 10194.27
Returning: 3 (FAIL_IO)


 
bam validate --in ObOgOp.bam


> bam validate --in ObOgOp.bam

OK, I will send these results later, the process is taking a long time (it is a 500 GB file with 40 RG bams)

 
That will ensure the BAMs are valid and print how many records are in each one.

The sum of the number of records in all the BAMs produced by the split should equal the number of records in the original BAM.

OK, I would take a while to do this for all the bam files (40 files)
 
How many RG bams are you getting?

The large bam has 40 RG bams and I am getting all of those

Many thanks

Carlos

Mary Kate Wing

unread,
Sep 12, 2013, 4:48:08 PM9/12/13
to bamutils
Ah, I see, you have a large number of output RG bams.
You don't need to validate any more of the RG specific ones, but I am interested to see the validation results on the original file.

What does your splitBam log file say?  
If you didn't specify --log, I should be called New.log.

Also, about how long did it take to run split?
Unfortunately, with a 500G file & 40 RGs, I'm guessing it took a while.  

I'm hoping the log will lead me to another idea.
But if it doesn't, I might request that you rerun with --out New1 or something like that.
I am interested to see if it gets the exact same results.

But before rerunning (since it takes so long), I'm hoping the log will help lead me to something.

Mary Kate


pegos...@gmail.com

unread,
Sep 13, 2013, 10:08:49 AM9/13/13
to bamu...@googlegroups.com
Hi Mary Kate,
 
You don't need to validate any more of the RG specific ones, but I am interested to see the validation results on the original file.

Here it is:

>bam validate --in ObOgOp.bam

Number of records read = 1316728014
Number of valid records = 1316728014

TotalReads(e6) 5611.70
MappedReads(e6) 4528.66
PairedReads(e6) 5611.70
ProperPair(e6) 4401.25
DuplicateReads(e6) 260.85
QCFailureReads(e6) 0.00

MappingRate(%) 80.70
PairedReads(%) 100.00
ProperPair(%) 78.43
DupRate(%) 4.65
QCFailRate(%) 0.00

TotalBases(e6) 535330.17
BasesInMappedReads(e6) 430435.49
Returning: 0 (SUCCESS)
 
What does your splitBam log file say?  
If you didn't specify --log, I should be called New.log.

Here is the content of New.log

Input BAM file : ObOgOp.bam
Output BAM prefix : New
Output log file : New.log
Verbose mode    : On
BGFZ EOF indicator : On
The following ReadGroup IDs are identified. Splitting into 40 BAM files..
        1: Ob003
        2: Ob008
        3: Ob026
        4: Ob047
        5: Ob085
        6: Ob092
        7: Ob094
        8: Ob096
        9: Ob100122
        10: Ob100931
        11: Ob100934
        12: Ob103895
        13: Ob104084
        14: Ob104119
        15: Ob105608
        16: Ob106234
        17: Ob137
        18: Ob140
        19: Ob214
        20: Og101049
        21: Og103469
        22: Og41
        23: Og43
        24: Og45
        25: Og46
        26: Og51
        27: Og54
        28: Og57
        29: Og67563
        30: Og68939
        31: Og68976
        32: Og75500
        33: Og96841
        34: OgTOG5457
        35: OgTOG5467
        36: OgTOG5923
        37: OgTOG5949
        38: OgTOG7025
        39: OgTOG7102
        40: Opunctata


Also, about how long did it take to run split?
Unfortunately, with a 500G file & 40 RGs, I'm guessing it took a while.  

 
Yes, I don't remember exactly but it was between 24-48 hours.
 
I'm hoping the log will lead me to another idea.
But if it doesn't, I might request that you rerun with --out New1 or something like that.
I am interested to see if it gets the exact same results.

But before rerunning (since it takes so long), I'm hoping the log will help lead me to something.


As I am running out of space in my drive, I guess I will delete the problematic split bam files as it does not look like we can add the EOF marker to them, correct?

 Thanks

Carlos

Mary Kate Wing

unread,
Sep 13, 2013, 11:30:42 AM9/13/13
to bamutils
Hmm, your log file is definitely interesting.  At the end of the log file, it is supposed to print the number of records that were successfully written to each RG specific file.

Here is the log file from my testcase 


 more ~/code/bamUtil/test/results/splitBam.log 
Input BAM file : testFiles/splitBam.bam
Output BAM prefix : results/splitBam
Output log file : results/splitBam.log
Verbose mode    : Off
BGFZ EOF indicator : On
The following ReadGroup IDs are identified. Splitting into 2 BAM files..
1: RG1
2: RG2
Successfully wrote 10 record for readGroup RG1
Successfully wrote 10 record for readGroup RG2


As you can see on the bottom of the log it says how many successfully wrote.

It appears that your run did not complete successfully.  

Unfortunately the log does not indicate an error.  I'm assuming you didn't see one on the command-line either, is that correct?

I'm guessing that something happened that caused it to stop running.  That would explain why it never wrote those end of file markers - it never hit that part of the code.  
That would also explain why your log is missing the number of records in each read group.

Since it appears to me that splitBam did not finish running, I would guess that your per RG bams are incomplete - missing records.  Also, based on the validation failures - it appears that they did not end on a record boundary - only parts of records were written.

If they were valid BAMs, you could run bam convert --in New.Ob137.bam --out New1.Ob137.bam --noeof
That would add an EOF block to the end.  You can try it, but I would expect that it would fail on reading the last record due to it being truncated.

So unfortunately, I think at this point I would have to recommend to rerun.  You can delete your old RG specific BAMs.

Hopefully this time splitBam completes successfully or reports an error on the command-line or in the log to explain what happened.

I'm sorry for the troubles that this has caused.  It shouldn't silently fail like that unless maybe it got killed and was unable to report anything.

If you are unable to rerun, and don't mind that you may be missing records, you can run recovery on your per RG bams.  That should truncate them at the last valid BGZF block, preserving as much of the BAM as it can figure out.  This tool was written to help us recover some of our BAM files that were corrupted by our filesystem in the past.
Use the --recover option of bam convert to try this: http://genome.sph.umich.edu/wiki/BamUtil:_convert

But as I said, using recover will leave your BAMs missing records.  I think the "best" thing to do is rerun.

Mary Kate



pegos...@gmail.com

unread,
Sep 13, 2013, 12:20:14 PM9/13/13
to bamu...@googlegroups.com
HI Mary Kate,

yes, I see now. I run again splitbam on another smalller bam file that I created and here is the log

Input BAM file : 3japonica.bam
Output BAM prefix : New
Output log file : New.log
Verbose mode    : On
BGFZ EOF indicator : On
The following ReadGroup IDs are identified. Splitting into 3 BAM files..
        1: OJ28275
        2: OJ36420
        3: OJ3860
Successfully wrote 37579408 record for readGroup OJ28275
Successfully wrote 73649244 record for readGroup OJ36420
Successfully wrote 50461396 record for readGroup OJ3860

WHich looks correct now. These bam files were all manipulated previously using your set of tools (adding RG tags and header) and the merge was also done with your merge tool (for the large bam file I used samtools). 

I will run things again adn will let you know if I see the same problem but it may take some time.

Many thanks for all your help

Carlos
Reply all
Reply to author
Forward
0 new messages