seqz_binning not reducing file size with high-coverage WGS data (>150x) + segmentation fault

5 views
Skip to first unread message

Mihaela Martis

unread,
Dec 16, 2025, 6:54:59 PM (3 days ago) Dec 16
to Sequenza User Group

Hello,

I'm experiencing issues with sequenza-utils on high-coverage whole-genome sequencing data and would appreciate guidance.

System Information:

  • Sequenza-utils version: 3.9.0
  • Python version: Python 3.10.16
  • Coverage: 130-250x (tumor), 40-100x (normal)
  • Genome: GRCh37
1. seqz_binning is not reducing the file size.
I'm running the commands bam2seqz and seqz_binning to obtain the CNVs, however the seqz_binning tool barley decreases the file size.

# Original raw seqz: 27 GB 
# After binning -w 50: 6 GB 
1       10001   N       175     542     3.138   1.0     0       hom     48      51      N       .       0
1       10004   C       72      208     2.889   0.995   0       hom     48      208     C       G0.005  G0.0
1       10005   C       73      207     2.836   0.986   0       hom     48      207     C       T0.014  T0.0
1       10007   T       87      291     3.345   0.993   0       hom     48      291     T       C0.007  C0.0
1       10008   A       87      283     3.253   0.975   0       hom     48      283     A       G0.007:T0.011   G0.5:T0.333
1       10012   C       114     341     2.991   0.994   0       hom     48      341     C       A0.003:G0.003   A0.0:G0.0
1       10013   T       123     381     3.098   0.982   0       hom     48      379     T       A0.003:C0.011:G0.005    A0.0:C0.5:G0.5
1       10014   A       118     373     3.161   0.973   0       hom     48      372     A       T0.003  T0.0
1       10015   A       123     383     3.114   0.969   0       hom     48      383     A       G0.008:T0.005   G0.667:T0.5
1       10016   C       126     408     3.238   0.998   0       hom     48      408     C       G0.002  G0.0

# After binning -w 500: 5.2 GB 
1       10001   N       230     1019    4.737   1.0     0       hom     54      501     N       .       0
1       10004   C       72      208     2.889   0.995   0       hom     54      208     C       G0.005  G0.0
1       10005   C       73      207     2.836   0.986   0       hom     54      207     C       T0.014  T0.0
1       10007   T       87      291     3.345   0.993   0       hom     54      291     T       C0.007  C0.0
1       10008   A       87      283     3.253   0.975   0       hom     54      283     A       G0.007:T0.011   G0.5:T0.333
1       10012   C       114     341     2.991   0.994   0       hom     54      341     C       A0.003:G0.003   A0.0:G0.0
1       10013   T       123     381     3.098   0.982   0       hom     54      379     T       A0.003:C0.011:G0.005    A0.0:C0.5:G0.5
1       10014   A       118     373     3.161   0.973   0       hom     54      372     A       T0.003  T0.0
1       10015   A       123     383     3.114   0.969   0       hom     54      383     A       G0.008:T0.005   G0.667:T0.5
1       10016   C       126     408     3.238   0.998   0       hom     54      408     C       G0.002  G0.0
# After binning -w 2000: 5.1 GB

As you can see, there is no change at the beginning of the file.
Here are the first 10 lines for another chromosome:
zless chr4.strict.seqz.gz |head -10
chromosome      position        base.ref        depth.normal    depth.tumor     depth.ratio     Af      Bf      zygosity.normal GC.percent    good.reads       AB.normal       AB.tumor        tumor.strand
4       10007   N       202     609     3.369   1.0     0       hom     43      39496   N       .       0
4       10011   T       20      104     5.2     0.981   0       hom     43      53      T       G0.019  G0.0
4       10025   A       31      135     4.355   0.989   0       hom     43      94      A       C0.011  C1.0
4       10030   A       32      146     4.562   0.978   0       hom     43      91      A       T0.011  T1.0
4       10031   A       33      149     4.515   0.982   0       hom     43      113     A       C0.018  C0.5
4       10035   T       29      153     5.276   0.98    0       hom     43      98      T       A0.02   A1.0
4       10037   C       40      203     5.075   0.985   0       hom     43      131     C       A0.015  A1.0
4       10040   T       52      246     4.731   0.972   0       hom     43      178     T       A0.017  A0.333
4       10041   A       54      256     4.741   0.964   0       hom     43      196     A       G0.005  G1.0


Here are the commands I run:
 sequenza-utils bam2seqz -n normal.bam -t tumor.bam -F myfasta.fa -gc ref.wig.gz -q 30 -N 100 -o myOut.seqz.gz
sequenza-utils seqz_binning -w 200 -s myOut.seqz.gz -o myOut.small.seqz.gz

The output still contains for example ~10 million lines for chr4. The binning is doing something, but it's not aggregating positions into windows as expected.  Sample output shows positions at: 10007, 10011, 10025, 10030, 10031, 10035... (every few bp)

Problem 2: Segmentation fault

When trying to run the downstream analysis, I get a segmentation fault.

Questions:

  1. Is there a known issue with seqz_binning at very high coverage (200-250x)?
  2. Should I downsample my BAM files before running sequenza-utils?
  3. Are there additional filtering parameters I should use for WGS at this coverage?
  4. How can I fix this? I really need the Sequenza results, since I use those also as an input in calculating HRD.

Any guidance would be greatly appreciated!

Best regards,

Mihaela

Reply all
Reply to author
Forward
0 new messages