Tabix compression, bgzip file extension, confusing documentation

Blaise Li

Dec 9, 2014, 6:38:24 AM
to pysam-us...@googlegroups.com
Dear pysam users,


According to the pysam manual (*), pysam.tabix_index expects a filename ending in ".gz", which is the standard extension for files compressed with gzip.

The tabix man page (on my computer) states that:
-----
Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.
-----

This suggests that bgzip is not equivalent to gzip, and the use of a different file extension is therefore a good idea.
(However, the example section uses the ".gz" extension, which I think is confusing.)


My guess is that pysam uses bgzip to compress files before making a tabix index. If I'm correct, then it would seem preferable that pysam.tabix_index look for a ".bgz" extension before deciding whether to compress or not.

Another source of confusion for me is that the pysam documentation refers to a "bgzf" program:
-----
If filename ends in gz, the file is assumed to be already compressed with bgzf.
-----

My guess is that this is just a typo and that it should read "bgzip".


So, assuming my guesses are correct, here are my suggestions:
1) Use ".bgz" instead of ".gz" as the extension for files that are compressed and ready for tabix indexing?
2) Update the documentation accordingly
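
To illustrate suggestion 1, here is a rough sketch of an extension check that treats ".bgz" the same way as ".gz". This is not pysam's actual code: the function name assumed_compressed and the suffix tuple are hypothetical, and it only mirrors the documented rule that the extension alone decides whether a file is considered already compressed.

```python
from pathlib import Path

# Hypothetical: suffixes treated as "already compressed with bgzip".
# The documented pysam behaviour only mentions names ending in "gz";
# ".bgz" is the addition proposed in this message.
COMPRESSED_SUFFIXES = (".gz", ".bgz")


def assumed_compressed(filename: str) -> bool:
    """Return True if the name alone suggests a bgzip-compressed file."""
    return Path(filename).suffix.lower() in COMPRESSED_SUFFIXES
```

With this rule, "in.tab.bgz" and "in.tab.gz" would both be indexed as they are, while "in.tab" would be compressed first. Like the documented behaviour, the check only looks at the name and never at the file content.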


Best regards.

Blaise


(*) http://pysam.readthedocs.org/en/latest/api.html#pysam.TabixFile

Peter Cock

Dec 9, 2014, 6:53:38 AM
to pysam-us...@googlegroups.com
Hi Blaise,

I personally use *.bgz for BGZF style GZIPPED files to make this
difference clear.

However, BGZF files are also fully standards-compliant GZIP files
(just with a little added metadata), so naming them *.gz means
all the existing gzip aware tools and settings "just work" system
wide. e.g. colour coding in file listing.
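
To make the "added metadata" concrete, here is a minimal Python sketch using only the standard library. It assumes the block layout given in the BGZF section of the SAM specification: each block is a gzip member whose extra field carries a "BC" subfield holding the total block size minus one. The helper names make_bgzf_block and looks_like_bgzf are made up for illustration and are not part of pysam or bgzip.

```python
import struct
import zlib


def make_bgzf_block(data: bytes) -> bytes:
    """Wrap a small payload (well under 64 KiB) in a single BGZF block."""
    # Raw deflate stream: negative wbits means no zlib or gzip wrapper.
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    deflated = compressor.compress(data) + compressor.flush()
    # 12-byte fixed gzip header + 6-byte extra field + payload + 8-byte trailer
    block_size = 12 + 6 + len(deflated) + 8
    # ID1, ID2, CM=8 (deflate), FLG=4 (FEXTRA), MTIME, XFL, OS=255, XLEN=6
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Extra subfield: SI1='B', SI2='C', SLEN=2, BSIZE = block size minus 1
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, block_size - 1)
    # Standard gzip trailer: CRC32 and length of the uncompressed data
    trailer = struct.pack("<II", zlib.crc32(data) & 0xFFFFFFFF, len(data))
    return header + extra + deflated + trailer


def looks_like_bgzf(blob: bytes) -> bool:
    """Check whether the first gzip member has a 'BC' extra subfield."""
    if len(blob) < 12:
        return False
    id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
        "<BBBBIBBH", blob[:12]
    )
    if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
        return False  # not gzip/deflate, or no extra field present
    extra = blob[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen  # skip to the next subfield
    return False
```

A block built this way can be read back with gzip.decompress like any other gzip data, while looks_like_bgzf returns False for the output of a plain gzip.compress call because that output has no extra field. The reverse does not hold: a plain gzip file is not BGZF, which is why the extension alone cannot tell the two apart.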

Note that BGZF is short for blocked gzip format. While bgzip
is one way to create such files, it is not the only way. So I think
this line is fine:

"If filename ends in gz, the file is assumed to be already
compressed with bgzf."

Regards,

Peter