seqz_binning not reducing file size with high-coverage WGS data (>150x) + segmentation fault

9 views

Skip to first unread message

Mihaela Martis

unread,

Dec 16, 2025, 6:54:59 PM12/16/25

to Sequenza User Group

Hello,

I'm experiencing issues with sequenza-utils on high-coverage whole-genome sequencing data and would appreciate guidance.

System Information:

Sequenza-utils version: 3.9.0
Python version: Python 3.10.16
Coverage: 130-250x (tumor), 40-100x (normal)
Genome: GRCh37

1. seqz_binning is not reducing the file size.

I'm running the commands bam2seqz and seqz_binning to obtain the CNVs, however the seqz_binning tool barley decreases the file size.

# Original raw seqz: 27 GB

# After binning -w 50: 6 GB

1 10001 N 175 542 3.138 1.0 0 hom 48 51 N . 0
1 10004 C 72 208 2.889 0.995 0 hom 48 208 C G0.005 G0.0
1 10005 C 73 207 2.836 0.986 0 hom 48 207 C T0.014 T0.0
1 10007 T 87 291 3.345 0.993 0 hom 48 291 T C0.007 C0.0
1 10008 A 87 283 3.253 0.975 0 hom 48 283 A G0.007:T0.011 G0.5:T0.333
1 10012 C 114 341 2.991 0.994 0 hom 48 341 C A0.003:G0.003 A0.0:G0.0
1 10013 T 123 381 3.098 0.982 0 hom 48 379 T A0.003:C0.011:G0.005 A0.0:C0.5:G0.5
1 10014 A 118 373 3.161 0.973 0 hom 48 372 A T0.003 T0.0
1 10015 A 123 383 3.114 0.969 0 hom 48 383 A G0.008:T0.005 G0.667:T0.5
1 10016 C 126 408 3.238 0.998 0 hom 48 408 C G0.002 G0.0

# After binning -w 500: 5.2 GB

1 10001 N 230 1019 4.737 1.0 0 hom 54 501 N . 0
1 10004 C 72 208 2.889 0.995 0 hom 54 208 C G0.005 G0.0
1 10005 C 73 207 2.836 0.986 0 hom 54 207 C T0.014 T0.0
1 10007 T 87 291 3.345 0.993 0 hom 54 291 T C0.007 C0.0
1 10008 A 87 283 3.253 0.975 0 hom 54 283 A G0.007:T0.011 G0.5:T0.333
1 10012 C 114 341 2.991 0.994 0 hom 54 341 C A0.003:G0.003 A0.0:G0.0
1 10013 T 123 381 3.098 0.982 0 hom 54 379 T A0.003:C0.011:G0.005 A0.0:C0.5:G0.5
1 10014 A 118 373 3.161 0.973 0 hom 54 372 A T0.003 T0.0
1 10015 A 123 383 3.114 0.969 0 hom 54 383 A G0.008:T0.005 G0.667:T0.5
1 10016 C 126 408 3.238 0.998 0 hom 54 408 C G0.002 G0.0

# After binning -w 2000: 5.1 GB

As you can see, there is no change at the beginning of the file.

Here are the first 10 lines for another chromosome:

zless chr4.strict.seqz.gz |head -10
chromosome position base.ref depth.normal depth.tumor depth.ratio Af Bf zygosity.normal GC.percent good.reads AB.normal AB.tumor tumor.strand
4 10007 N 202 609 3.369 1.0 0 hom 43 39496 N . 0
4 10011 T 20 104 5.2 0.981 0 hom 43 53 T G0.019 G0.0
4 10025 A 31 135 4.355 0.989 0 hom 43 94 A C0.011 C1.0
4 10030 A 32 146 4.562 0.978 0 hom 43 91 A T0.011 T1.0
4 10031 A 33 149 4.515 0.982 0 hom 43 113 A C0.018 C0.5
4 10035 T 29 153 5.276 0.98 0 hom 43 98 T A0.02 A1.0
4 10037 C 40 203 5.075 0.985 0 hom 43 131 C A0.015 A1.0
4 10040 T 52 246 4.731 0.972 0 hom 43 178 T A0.017 A0.333
4 10041 A 54 256 4.741 0.964 0 hom 43 196 A G0.005 G1.0

Here are the commands I run:

sequenza-utils bam2seqz -n normal.bam -t tumor.bam -F myfasta.fa -gc ref.wig.gz -q 30 -N 100 -o myOut.seqz.gz

sequenza-utils seqz_binning -w 200 -s myOut.seqz.gz -o myOut.small.seqz.gz

The output still contains for example ~10 million lines for chr4. The binning is doing something, but it's not aggregating positions into windows as expected. Sample output shows positions at: 10007, 10011, 10025, 10030, 10031, 10035... (every few bp)

Problem 2: Segmentation fault

When trying to run the downstream analysis, I get a segmentation fault.

Questions:

Is there a known issue with seqz_binning at very high coverage (200-250x)?
Should I downsample my BAM files before running sequenza-utils?
Are there additional filtering parameters I should use for WGS at this coverage?
How can I fix this? I really need the Sequenza results, since I use those also as an input in calculating HRD.

Any guidance would be greatly appreciated!

Best regards,

Mihaela

Reply all

Reply to author

Forward

0 new messages