Dear pysam users,
According to the pysam manual (*), pysam.tabix_index expects a filename ending in ".gz", which is the standard extension for files compressed with gzip.
The tabix man page (on my computer) states that:
-----
Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface.
-----
This suggests that bgzip is not equivalent to gzip, and the use of a different file extension is therefore a good idea.
(However, the example section uses the ".gz" extension, which I think is confusing.)
My guess is that pysam uses bgzip to compress files before making a tabix index. If I'm correct, then it would seem preferable that pysam.tabix_index look for a ".bgz" extension before deciding whether to compress or not.
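As an aside, the two formats could also be told apart by content rather than by extension. Below is a minimal sketch of such a check, assuming the BGZF layout as I understand it from the SAM specification: a gzip member with the FEXTRA flag set and an extra subfield tagged "BC" of length 2. The function name looks_like_bgzf is hypothetical; it is not part of pysam, and I am not claiming pysam does anything like this.

```python
import struct


def looks_like_bgzf(path):
    """Return True if the file appears to start with a BGZF block header.

    Hypothetical helper for illustration only, not a pysam function.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        if len(header) < 12:
            return False
        # Gzip magic bytes, deflate method, and the FEXTRA flag (0x04).
        if header[0:3] != b"\x1f\x8b\x08" or not header[3] & 0x04:
            return False
        # XLEN: little-endian length of the extra field that follows.
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for the "BC" tag with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return True
        pos += 4 + slen
    return False
```

A file written by plain gzip normally lacks that subfield, so the check would return False for it whatever its extension.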
Another source of confusion for me is that the pysam documentation refers to a "bgzf" program:
-----
If filename ends in gz, the file is assumed to be already compressed with bgzf.
-----
My guess is that this is just a typo and that it should read "bgzip".
So, assuming my guesses are correct, here are my suggestions:
1) Use ".bgz" instead of ".gz" as the extension for bgzip-compressed files that are to be tabix-indexed?
2) Update the documentation accordingly.
Best regards.
Blaise
(*) http://pysam.readthedocs.org/en/latest/api.html#pysam.TabixFile