Indeed there's an issue with the tsize estimate and the SAM parser. The original implementation:
640 def tsize (self):
641 s = 0
642 n = 0
643 m = 0
644 while n<10 and m<1000:
645 m += 1
646 thisline = self.fhd.readline()
647 (chromosome,fpos,strand) = self.__fw_parse_line(thisline)
648 if not fpos or not chromosome:
649 continue
650 thisline = thisline.rstrip()
651 thisfields = thisline.split("\t")
652 s += len(thisfields[9])
653 n += 1
654 self.fhd.seek(0)
655 return int(s/n)
if the read is not aligned, line 647 doesn't return a chromosome, the loop at line 648 skips to the next iteration. If one has more than 1000 unaligned reads at the beginning of the file, n remain 0 and the division raises an error. The same could happen if there are more than 1000 headers (as in zebrafish). I've applied this:
$ diff -u !$
diff -u lib/IO/Parser.py*
--- lib/IO/Parser.py 2010-09-23 16:31:59.000000000 +0200
+++ lib/IO/Parser.py.old 2010-09-23 16:31:15.000000000 +0200
@@ -641,13 +641,11 @@
s = 0
n = 0
m = 0
- c = 1000
- while n<10 and m<c:
+ while n<10 and m<1000:
m += 1
thisline = self.fhd.readline()
(chromosome,fpos,strand) = self.__fw_parse_line(thisline)
if not fpos or not chromosome:
- if c - m < 10: c += 1000
continue
thisline = thisline.rstrip()
thisfields = thisline.split("\t")
When the checked lines is close to the limit (c = 1000) but there are less than required reads (n), extends the limit to other 1000...
Besides, it may be useful to re-enable options.tsize parsing.
d