core dumps (failed assertion?) in SAM_pairer_update_orphant_table

145 views

Skip to first unread message

Philip Lijnzaad

unread,

Dec 10, 2019, 7:08:08 PM12/10/19

to Subread

Dear Subread authors,

I'm having trouble with using featureCounts on some bam files. Most work fine, but a few of them keep giving core dumps, both with version 1.5.2, 1.6.4 and 1.6.5 (if not needed, I prefer not to go to 2.x for now).

The bam files (~ 12 GB; ~300M reads, 2 x 150 bp) are coordinate-sorted and picard tools SamValidate-d, giving only a few of these errors:

ERROR::INVALID_CIGAR:Record 40894024, Read name A00295:112:HH73JDMXX:1:2180:20546:7733, No real operator (M|I|D|N) in CIGAR

I assume these are mostly harmless, and I think (but have not checked) that the bam files that worked fine also contain such errors) The bam files, incidentially, were converted from cram first, therefore the quality scores look unusual.

My featureCounts-1.6.5 command line invocation was:

featureCounts -s 2 -a annotations.gtf --tmpDir /tmp/1323873.1.default -p -R BAM -B -o thefile.out thefile.bam

featureCounts was compiled with GNU C 4.8.5 20150623 (Red Hat 4.8.5-39) -mtune=core2 -mpreferred-stack-boundary=4 -mstackrealign -march=x86-64 -ggdb -O2 -fmessage-length=0

(I added the ``-mpreferred-stack-boundary=4 -mstackrealign'' because the first few core dumps suggested it was an alignment problem)

It was run on CentOS 7.7.1908 (Linux 3.10.0-1062.4.3.el7.x86_64 with glibc-2.17-292.el7.x86_64 zlib-1.2.7-18.el7.x86_64

There is, I think, plenty of memory (32 GB) and tmpspace available (100 GB).

The stack trace is shown below, as are the bam header and the first few reads.

What to do? Thanks kindly in advance,

Philip Lijnzaad

========================================================================

# stack trace:

This GDB was configured as "x86_64-redhat-linux-gnu".

For bug reporting instructions, please see:

<http://www.gnu.org/software/gdb/bugs/>...

Reading symbols from /hpc/local/CentOS7/gen/software/subread-1.6.5/bin/featureCounts...done.

[New LWP 84815]

[Thread debugging using libthread_db enabled]

Using host libthread_db library "/lib64/libthread_db.so.1".

Core was generated by `featureCounts -s 2 -a /hpc/local/CentOS7/gen/data/genomes/human/gencode/gencode'.

Program terminated with signal 6, Aborted.

#0 0x00002af313314337 in raise () from /lib64/libc.so.6

Missing separate debuginfos, use: debuginfo-install glibc-2.17-292.el7.x86_64 zlib-1.2.7-18.el7.x86_64

(gdb) where

#0 0x00002af313314337 in raise () from /lib64/libc.so.6

#1 0x00002af313315a28 in abort () from /lib64/libc.so.6

#2 0x00002af31330d156 in __assert_fail_base () from /lib64/libc.so.6

#3 0x00002af31330d202 in __assert_fail () from /lib64/libc.so.6

#4 0x0000000000451805 in SAM_pairer_update_orphant_table (pairer=pairer@entry=0x7fffacf60bb8, thread_context=0x389acf0) at input-files.c:4375

#5 0x00000000004530ad in SAM_pairer_finish_margin_table (pairer=pairer@entry=0x7fffacf60bb8) at input-files.c:4612

#6 0x00000000004532d8 in SAM_pairer_run_once (pairer=pairer@entry=0x7fffacf60bb8) at input-files.c:4634

#7 0x00000000004567b3 in SAM_pairer_run (pairer=pairer@entry=0x7fffacf60bb8) at input-files.c:5620

#8 0x0000000000413c0e in fc_thread_wait_threads (global_context=0x7fffacf60a60) at readSummary.c:4463

#9 readSummary_single_file (global_context=global_context@entry=0x7fffacf60a60, column_numbers=column_numbers@entry=0x2862f10, nexons=nexons@entry=1062876, geneid=0x2af3158d9010, chr=0x2af316d1f010, start=0x2af315ce7010, stop=0x2af316503010, sorted_strand=0x2af3129ee010 "", anno_chr_2ch=0x244ba40 "\310]j\023\363*", anno_chrs=0x245b6b0,

anno_chr_head=0x245ba60, block_end_index=0x279b0d0, block_min_start=0x2af317d57010, block_max_end=0x25a6a70, my_read_counter=my_read_counter@entry=0x244bcb0, junction_global_table=junction_global_table@entry=0x0, splicing_global_table=splicing_global_table@entry=0x0, merged_RG_table=merged_RG_table@entry=0x0) at readSummary.c:6254

#10 0x000000000041597f in readSummary (argc=argc@entry=56, argv=argv@entry=0x7fffacf63970) at readSummary.c:6004

#11 0x000000000040384a in main (argc=<optimized out>, argv=<optimized out>) at readSummary.c:6760

========================================================================

# bam header and first few reads:

@HD VN:1.5 SO:coordinate

@PG ID:STAR PN:STAR VN:STAR_2.6.1a_08-27 CL:STAR --runThreadN 4 --genomeDir /tmp/4767736.1.default//hpc/pmc_gen/references/RNA-Seq/staralign --readFilesType SAM PE --readFilesIn /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-StarAlignBam/inputs/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ssh_md5/execution/PMABM000COX_RNA-Seq.bam --readFilesCommand samtools view -h --outFileNamePrefix PMABM000COX_RNA-Seq.sorted --outStd BAM_SortedByCoordinate --outReadsUnmapped Within --outSAMtype BAM SortedByCoordinate --outSAMunmapped Within KeepPairs --outSAMorder Paired --outBAMsortingThreadN 4 --alignIntronMax 100000 --alignMatesGapMax 100000 --alignSJDBoverhangMin 10 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentMin 12 --chimJunctionOverhangMin 12 --chimSegmentReadGapMax 0 --twopassMode Basic

@PG ID:MarkDuplicates VN:2.10.10-SNAPSHOT CL:MarkDuplicates INPUT=[/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-MarkDuplicates/inputs/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-MergeBamAlignment/execution/PMABM000COX_RNA-Seq.aligned.sorted.bam] OUTPUT=PMABM000COX_RNA-Seq.aligned.sorted.duplicates_marked.bam METRICS_FILE=PMABM000COX_RNA-Seq.duplicate_metrics ASSUME_SORT_ORDER=coordinate OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true CREATE_MD5_FILE=true USE_JDK_DEFLATER=true USE_JDK_INFLATER=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates PP:STAR

@PG ID:GATK SplitNCigarReads VN:4.0.10.1 CL:SplitNCigarReads --output PMABM000COX_RNA-Seq.split.bam --input /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-SplitNCigarReads/inputs/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-MarkDuplicates/execution/PMABM000COX_RNA-Seq.aligned.sorted.duplicates_marked.bam --reference /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-SplitNCigarReads/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa --create-output-bam-index true --create-output-bam-md5 true --refactor-cigar-string false --skip-mapping-quality-transform false --max-reads-in-memory 150000 --max-mismatches-in-overhang 1 --max-bases-in-overhang 40 --do-not-fix-overhangs false --process-secondary-alignments false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false PN:GATK SplitNCigarReads

@PG ID:GATK SplitNCigarReads.1 VN:4.0.10.1 CL:SplitNCigarReads --output PMABM000COX_RNA-Seq.split.bam --input /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-SplitNCigarReads/inputs/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-MarkDuplicates/execution/PMABM000COX_RNA-Seq.aligned.sorted.duplicates_marked.bam --reference /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-SplitNCigarReads/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa --create-output-bam-index true --create-output-bam-md5 true --refactor-cigar-string false --skip-mapping-quality-transform false --max-reads-in-memory 150000 --max-mismatches-in-overhang 1 --max-bases-in-overhang 40 --do-not-fix-overhangs false --process-secondary-alignments false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false PN:GATK SplitNCigarReads

@PG ID:GATK ApplyBQSR VN:4.0.10.1 CL:ApplyBQSR --output PMABM000COX_RNA-Seq.aligned.duplicate_marked.recalibrated.bam --bqsr-recal-file /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ApplyBQSR/inputs/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-BaseRecalibrator/execution/PMABM000COX_RNA-Seq.recal_data.csv --use-original-qualities true --static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30 --input /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ApplyBQSR/inputs/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-SplitNCigarReads/execution/PMABM000COX_RNA-Seq.split.bam --reference /hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ApplyBQSR/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa --create-output-bam-md5 true --add-output-sam-program-record true --preserve-qscores-less-than 6 --quantize-quals 0 --round-down-quantized false --emit-original-quals false --global-qscore-prior -1.0 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false PN:GATK ApplyBQSR

@SQ SN:chr1 LN:248956422 M5:2648ae1bacce4ec4b6cf337dcae37816 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr2 LN:242193529 M5:4bb4f82880a14111eb7327169ffb729b UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr3 LN:198295559 M5:a48af509898d3736ba95dc0912c0b461 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr4 LN:190214555 M5:3210fecf1eb92d5489da4346b3fddc6e UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr5 LN:181538259 M5:f7f05fb7ceea78cbc32ce652c540ff2d UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr6 LN:170805979 M5:6a48dfa97e854e3c6f186c8ff973f7dd UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr7 LN:159345973 M5:94eef2b96fd5a7c8db162c8c74378039 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr8 LN:145138636 M5:c67955b5f7815a9a1edfaa15893d3616 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr9 LN:138394717 M5:addd2795560986b7491c40b1faa3978a UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr10 LN:133797422 M5:907112d17fcb73bcab1ed1c72b97ce68 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr11 LN:135086622 M5:1511375dc2dd1b633af8cf439ae90cec UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr12 LN:133275309 M5:e81e16d3f44337034695a29b97708fce UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr13 LN:114364328 M5:17dab79b963ccd8e7377cef59a54fe1c UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr14 LN:107043718 M5:acbd9552c059d9b403e75ed26c1ce5bc UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr15 LN:101991189 M5:f036bd11158407596ca6bf3581454706 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr16 LN:90338345 M5:24e7cabfba3548a2bb4dff582b9ee870 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr17 LN:83257441 M5:a8499ca51d6fb77332c2d242923994eb UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr18 LN:80373285 M5:11eeaa801f6b0e2e36a1138616b8ee9a UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr19 LN:58617616 M5:b0eba2c7bb5c953d1e06a508b5e487de UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr20 LN:64444167 M5:b18e6c531b0bd70e949a7fc20859cb01 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr21 LN:46709983 M5:2f45a3455007b7e271509161e52954a9 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chr22 LN:50818468 M5:221733a2a15e2de66d33e73d126c5109 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chrX LN:156040895 M5:49527016a48497d9d1cbd8e4a9049bd3 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chrY LN:57227415 M5:b2b7e6369564d89059e763cd6e736837 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@SQ SN:chrM LN:16569 M5:c68f52674c9fb33aef52dcf399755519 UR:/hpc/pmc_seq/everwiel/cromwell-executions/rnafusion/a879d6d0-e954-484a-ad1a-17665204c93f/call-ConvertToCram/inputs/hpc/pmc_gen/references/RNA-Seq/references/ref_genome.fa

@RG ID:PMABM000COX_AH7MHFDRXX SM:PMABM000COX LB:PMABM000COX PL:Illumina PU:A00295 CN:Illumina

A00295:158:H7MHFDRXX:2:1260:9968:10050 99 chr1 11680 1 151M = 182315 170786 CTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCCGGTGTTGAGAATGACTGCGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAG ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:151M PG:Z:MarkDuplicates HI:i:2 MQ:i:1 UQ:i:0 AS:i:296 MD:Z:151 NM:i:0 RG:Z:PMABM000COX_AH7MHFDRXX

A00295:158:H7MHFDRXX:2:1260:9968:10050 409 chr1 11796 1 151M = 11796 0 TTTCCTTTGCTGTTCCTGCATGTAGTTTAAACGAGATTGCCAGCACCGGGTATCATTCACCATTTTTATTTTCGTTAACTTGCCGTCAGCCTTTTCTTTGACCTCTTCTTTCTGTTCATGTGTATTAGCTGTCTCTTAGCCCAGACTTCCC ????????????????????????????????????+?????????????????????????????????????????????????????????????????????????????????5???????????????????????????????? PG:Z:MarkDuplicates HI:i:2 UQ:i:74 AS:i:296 MD:Z:67C58T24 NM:i:2 RG:Z:PMABM000COX_AH7MHFDRXX

A00295:158:H7MHFDRXX:1:1232:31656:9502 99 chr1 12608 1 151M = 183151 170694 CCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCG ??5???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:151M PG:Z:MarkDuplicates HI:i:2 MQ:i:1 UQ:i:0 AS:i:300 MD:Z:151 NM:i:0 RG:Z:PMABM000COX_AH7MHFDRXX

A00295:158:H7MHFDRXX:1:2278:20265:20619 1123 chr1 12608 1 151M = 183151 170694 CCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCG ??5???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:151M PG:Z:MarkDuplicates HI:i:2 MQ:i:1 UQ:i:0 AS:i:300 MD:Z:151 NM:i:0 RG:Z:PMABM000COX_AH7MHFDRXX

A00295:158:H7MHFDRXX:2:2158:16423:10113 1123 chr1 12608 1 151M = 183151 170694 CCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCG ??5???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:151M PG:Z:MarkDuplicates HI:i:2 MQ:i:1 UQ:i:0 AS:i:300 MD:Z:151 NM:i:0 RG:Z:PMABM000COX_AH7MHFDRXX

A00295:158:H7MHFDRXX:1:1232:31656:9502 409 chr1 12632 1 151M = 12632 0 ATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGC ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????5??? PG:Z:MarkDuplicates HI:i:2 UQ:i:0 AS:i:300 MD:Z:151 NM:i:0 RG:Z:PMABM000COX_AH7MHFDRXX

A00295:158:H7MHFDRXX:1:2278:20265:20619 409 chr1 12632 1 151M = 12632 0 ATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGC ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????5??? PG:Z:MarkDuplicates HI:i:2 UQ:i:0 AS:i:300 MD:Z:151 NM:i:0 RG:Z:PMABM000COX_AH7MHFDRXX

A00295:158:H7MHFDRXX:2:2158:16423:10113 409 chr1 12632 1 151M = 12632 0 ATGCCCTTCCCCAGCATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTGAGTGTCCCCAGTGTTGCAGAGGTGAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGC ???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????5??? PG:Z:MarkDuplicates HI:i:2 UQ:i:0 AS:i:300 MD:Z:151 NM:i:0 RG:Z:PMABM000COX_AH7MHFDRXX

Yang LIAO

unread,

Dec 11, 2019, 8:58:01 PM12/11/19

to Subread

The error was triggered by a unsatisfied assertion statement, which was to make sure that all the unpaired reads in the BAM file were processed. This shouldn't be relevant to the "no real operator in CIGAR" issue nor the abnormal quality strings. Is it possible if you can share your BAM file and the annotation file? I hope to dig deeper into the error.

BTW I didn't have the memory-address alignment issues when I compile the program in many Linux distributions. I haven't tried CentOS 7.7 though. I will test if this is can cause the error that you encountered.

Philip Lijnzaad

unread,

Jan 28, 2020, 10:10:58 AM1/28/20

to Subread

Hi Yang,

I think I found the cause: the tmpspace requirements are enormous and I run out of diskspace (it seems like not all disk writes are checked for succes, resulting in a few obscure error messages, depending on what stage the disk is full ...). The tempspace requirements don't seem to scale linearly. Initially it looked like I needed 4* the size of the input bam (this is 2x150 bp paired end human RNA-seq data, genomically aligned to Hg38 and being assigned to protein coding gene from GenCode26). But with a 17 GB bam file (not all that big, I think), the tmpDir gets filled with ~ 800 temp-core-*.tmp files, each ~150-200 MB, totalling 122 GB. After that, there is not enough room for creating the new bam file (which I also create in tmpDir) and the thing crashes.

This is using featureCounts v1.6.5, roughly as follows

featureCounts -s 2 -a $gtf --tmpDir $tmpdir  -T 4 -p -R BAM -B -o $output $bam > $log

Our queueing system cleverly allocates tmpdir on the fly and the 128 GB I requested is apparently not enough for a 13 GB file. I find this a bit excessive to be honest. Is there a way to reduce these requirements, or alternatively, to predict the tmpspace needed ? I cannot guarantee that all my data will be smaller than e.g. 15 GB, and requesting 1TB just-in-case doesn't feel right.

Yang LIAO

unread,

Jan 28, 2020, 4:29:53 PM1/28/20

to Subread

FeatureCounts only writes alignments into temporary files if their paired alignments are unfound nearby in the BAM file. No matter if the alignments in the BAM file are sorted by the read names or by the alignment coordinates, it is likely that a pair of alignments are located pretty closely in the BAM file, hence none of these alignments are written into the temporary file.

As you said, featureCounts created 122GB of temporary data, and it is pretty unusual for an input BAM file only 13GB big. Because the temporary data is in a condensed format, it is unlikely to have 122GB of it generated even all the alignments in the BAM file were written into the temporary files. To my experience, the compression level is around 2-3, hence featureCounts shouldn't create temporary files larger than 40GB for a 13GB BAM file.

Can you try this:

cat $bam | gzip -cd > test-bam.bin

This will create a binary file named "test-bam.bin". This binary file contains all the alignments in the BAM file in the format of the temporary files. Can you check the size of it? If this file is indeed larger than 100GB, it may be some very low complexity reads making the compression ratio extremely high.

Philip Lijnzaad

unread,

Jan 29, 2020, 4:33:27 AM1/29/20

to Subread

[I thought I had replied by e-mail but apparently that got lost somewhere]

Hi Yang,

thanks for the reply. (The bam file I use is coordinate sorted, btw) The result of your test (for a 14 GB file, the 'worst' of my batch) was a decompression of factor of 12.3 (bam was 14,876,903,505, decompressed yielded 182,946,027,797 bytes). I guess part of this is due to the bam file having been re-generated from a cram file. I can dig up the details if needed, but the quality scores haven an extremely low complexity, looking pretty much like this (we have very generous 150 bp read lengths):

????????????+????5???5??5?????5??5???55555+?5??5?5???555?5555???+??5555?????55???55?5???5?555?5??5??5???55?5???555'55??+5+5??????????5????????????5????
?????????5????????????5????????+?+5??5555???5???5?55??????5????5?????5??5+5??5?5?5????+??????5?5??5???5????5???????????5?+???5????+?5????????5????55???
???55??????????????????????55???5?555?5????????5??5??????????5?5?55???55?555???5???55?5?5????5?55??5??5????????5?55???????5?55?????5???5??????5????????
??555?????????????????5?5???5?555?5???????????5???5????5?5???55???55?555???5???55?5?5????5?55??5??5????????5?55?5?????5555????55???5??????5???????5????
???5??5??????5?5?5?5??5??5???5?5?5??????5???5????55?5??5??55???5?5???55?5??55?5??5?5?5?55??5???5??55?????555??555?5?????5?5?5?5?5???5????55????????????
??5?5???????????????????5555?????5???5??????5??5????5???5???5?5?55555?5?5???5?5??5?5??5?5??5????5???5???5??55?55?5??5??5?5?5???55??55??555??????5?5????
???5???????????55??5???5???5?????555??55555???5?555?555?5?5?5????5555?????????5??????55??55?55?5?????????555??5??555???5?5???????????????5?5??????5????
???555??555?5?????5?5?5?5?5???5????5555????5????5??????55??55?55?5?????5???555??5??555???5?555??5???5??????5?5??????5???????5?5?5???????????5??5??5?5??
???555?????5????55??????5??5??5??5555?????????????????555????5???5555?5??????5?5??5??5?55????55?5?5?5?5?55?555?5???555?5??5??5?5555???5???5????5?555???

Are the quality scores used at all by featureCounts? If not, it's a pity to have to waste diskspace and i/o unpacking them; is that something to consider or would that be too complex?

Philip Lijnzaad

unread,

Jan 29, 2020, 11:56:08 AM1/29/20

to Subread

On second reading, I'm a bit puzzled by this remark

No matter if the alignments in the BAM file are sorted by the read names or by the alignment coordinates, it is likely that a pair of alignments are located pretty closely in the BAM file, hence none of these alignments are written into the temporary file.

Does this mean that if the input bam file were properly sorted, this issue would not arise because nothing would be written to tmspace to start with? The header of my bam files says

@HD VN:1.5 SO:coordinate

but I understand that not all sortings are equal. I tried to cut down on the uneccesary bits using --primary --ignoreDu, but that doesn't solve the problem: the filtering appears to be done after decompressing all the alignments so the tmpspace requirements are the same. Would it make sense to first filter, then write unsorted pairs later?

Regards,

Philip

Yang LIAO

unread,

Jan 29, 2020, 3:44:33 PM1/29/20

to Subread

The read sequences and quality scores are not written into the temporary files by default, but when you specified "-R BAM", they are written into the temporary files for reconstructing the BAM output.

I'm still wondering what are in the BAM file. It looks that every single alignment was written into the temporary file and it likely indicates some format issues in the input BAM file. Can you please show me the first 500 reads in the BAM file? If it is sensitive, you can remove the read sequences.

Philip Lijnzaad

unread,

Jan 30, 2020, 4:57:04 AM1/30/20

to Subread

Hi Yang,

thanks for your continued efforts! So reverting to the non-bam format would solve most of the tmpspace requirements?

I attached a bam file of the first 500 reads. There's a fair few PCR duplicates in there, there are weird very low complexity quality scores (the bam was re-created from a cram file) and the distance between mates of a pair can be large-ish. The header contains all the preprocessing steps as-is, the sorting was done by STAR using --outStd BAM_SortedByCoordinate --outReadsUnmapped Within --outSAMtype BAM SortedByCoordinate .

Many thanks,

Philip

PS: will there be support from cram as an input format?

foo.bam

Yang LIAO

unread,

Jan 30, 2020, 4:32:24 PM1/30/20

to Subread

Thanks for sharing the data!

I'm not sure if the pure "N" read sequences in foo.bam were removed manually from the BAM file or it was the same in the input BAM file. If it was the same in the input BAM file, the reads have a very low complexity in both read sequences and quality scores, hence the compression rate was extremely high.

The format is correct, although I see many read-pairs in your BAM file that have only one end mapped. For the mapped reads, some have pretty long template lengths. This made featureCounts unable to find the paired alignments in a small region in the BAM file, hence wrote the alignments into temporary files.

If you want to reduce the size of temporary files, you can specify "-R CORE" instead of "-R BAM". This will generate a text file containing read names and assignment details but not in SAM nor BAM format, hence the sequences and quality strings are not written into the temporary files.

Reply all

Reply to author

Forward

0 new messages