Hello,
I'm experiencing issues with sequenza-utils on high-coverage whole-genome sequencing data and would appreciate guidance.
System Information:
- Sequenza-utils version: 3.9.0
- Python version: Python 3.10.16
- Coverage: 130-250x (tumor), 40-100x (normal)
- Genome: GRCh37
1. seqz_binning is not reducing the file size.
I'm running the commands bam2seqz and seqz_binning to obtain the CNVs, however the seqz_binning tool barley decreases the file size.
# Original raw seqz: 27 GB
# After binning -w 50: 6 GB
1 10001 N 175 542 3.138 1.0 0 hom 48 51 N . 0
1 10004 C 72 208 2.889 0.995 0 hom 48 208 C G0.005 G0.0
1 10005 C 73 207 2.836 0.986 0 hom 48 207 C T0.014 T0.0
1 10007 T 87 291 3.345 0.993 0 hom 48 291 T C0.007 C0.0
1 10008 A 87 283 3.253 0.975 0 hom 48 283 A G0.007:T0.011 G0.5:T0.333
1 10012 C 114 341 2.991 0.994 0 hom 48 341 C A0.003:G0.003 A0.0:G0.0
1 10013 T 123 381 3.098 0.982 0 hom 48 379 T A0.003:C0.011:G0.005 A0.0:C0.5:G0.5
1 10014 A 118 373 3.161 0.973 0 hom 48 372 A T0.003 T0.0
1 10015 A 123 383 3.114 0.969 0 hom 48 383 A G0.008:T0.005 G0.667:T0.5
1 10016 C 126 408 3.238 0.998 0 hom 48 408 C G0.002 G0.0
# After binning -w 500: 5.2 GB
1 10001 N 230 1019 4.737 1.0 0 hom 54 501 N . 0
1 10004 C 72 208 2.889 0.995 0 hom 54 208 C G0.005 G0.0
1 10005 C 73 207 2.836 0.986 0 hom 54 207 C T0.014 T0.0
1 10007 T 87 291 3.345 0.993 0 hom 54 291 T C0.007 C0.0
1 10008 A 87 283 3.253 0.975 0 hom 54 283 A G0.007:T0.011 G0.5:T0.333
1 10012 C 114 341 2.991 0.994 0 hom 54 341 C A0.003:G0.003 A0.0:G0.0
1 10013 T 123 381 3.098 0.982 0 hom 54 379 T A0.003:C0.011:G0.005 A0.0:C0.5:G0.5
1 10014 A 118 373 3.161 0.973 0 hom 54 372 A T0.003 T0.0
1 10015 A 123 383 3.114 0.969 0 hom 54 383 A G0.008:T0.005 G0.667:T0.5
1 10016 C 126 408 3.238 0.998 0 hom 54 408 C G0.002 G0.0
# After binning -w 2000: 5.1 GB
As you can see, there is no change at the beginning of the file.
Here are the first 10 lines for another chromosome:
zless chr4.strict.seqz.gz |head -10
chromosome position base.ref depth.normal depth.tumor depth.ratio Af Bf zygosity.normal GC.percent good.reads AB.normal AB.tumor tumor.strand
4 10007 N 202 609 3.369 1.0 0 hom 43 39496 N . 0
4 10011 T 20 104 5.2 0.981 0 hom 43 53 T G0.019 G0.0
4 10025 A 31 135 4.355 0.989 0 hom 43 94 A C0.011 C1.0
4 10030 A 32 146 4.562 0.978 0 hom 43 91 A T0.011 T1.0
4 10031 A 33 149 4.515 0.982 0 hom 43 113 A C0.018 C0.5
4 10035 T 29 153 5.276 0.98 0 hom 43 98 T A0.02 A1.0
4 10037 C 40 203 5.075 0.985 0 hom 43 131 C A0.015 A1.0
4 10040 T 52 246 4.731 0.972 0 hom 43 178 T A0.017 A0.333
4 10041 A 54 256 4.741 0.964 0 hom 43 196 A G0.005 G1.0
Here are the commands I run:
sequenza-utils bam2seqz -n normal.bam -t tumor.bam -F myfasta.fa -gc ref.wig.gz -q 30 -N 100 -o myOut.seqz.gz
sequenza-utils seqz_binning -w 200 -s myOut.seqz.gz -o myOut.small.seqz.gz
The output still contains for example ~10 million lines for chr4. The binning is doing something, but it's not aggregating positions into windows as expected.
Sample output shows positions at: 10007, 10011, 10025, 10030, 10031, 10035... (every few bp)
Problem 2: Segmentation fault
When trying to run the downstream analysis, I get a segmentation fault.
Questions:
- Is there a known issue with seqz_binning at very high coverage (200-250x)?
- Should I downsample my BAM files before running sequenza-utils?
- Are there additional filtering parameters I should use for WGS at this coverage?
- How can I fix this? I really need the Sequenza results, since I use those also as an input in calculating HRD.
Any guidance would be greatly appreciated!
Best regards,
Mihaela