bug: subjunc mapping with extremely long skipped region

96 views
Skip to first unread message

Kamil

unread,
Jul 2, 2015, 11:49:03 AM7/2/15
to sub...@googlegroups.com

I mapped one paired-end RNA-seq sample with subjunc v1.4.6-p2 and got a very strange mapping with 268,435,444 skipped bases on chromosome 13, even though chromosome 13 is only 115,169,878 bases long. I discovered this mapping because it causes Picard CollectGcBiasMetrics to exit with an exception.

Such mappings should not be considered valid:

  • chromosome 13 is 115,169,878 bases long, less than the 268,435,444 skipped bases
  • I used --maxdist 1000000 (1M)

subjunc command

opt=(
      --subreads 20
      --index $SUBREAD_INDEX
      --BAMoutput
      --output "$bam"
      --multi 16
      --quality
      --threads $THREADS
      --maxMismatches 10
      --allJunctions
      --gzFASTQinput
      --read "$fq1"
      --read2 "$fq2"
      --order fr
      --mindist 0
      --maxdist 1000000

)
subjunc ${opt[*]}

BAM lines

samtools view subjunc/SCC10055_A01.bam | grep H7TVLADXX140423:2:1112:17883:23072
H7TVLADXX140423:2:1112:17883:23072      73      chr13   27729228        5       13S2M268435444N24M62S   *       0       0       GTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGATGTCCACAACTTTTTTTTTTTTTTTTTTTTTTTGAGGGGACCCCCCC  BBBFDDDFHHHHHJJHIEHHJJJJJJJJJJJJFDDDDDDDDDDDDDDDDB-)0(+(+33((+2(4@CDDDDDDDDDDDDDDDDDD@###############   HI:i:1  NH:i:1  CG:Z:39S19M43S  CP:i:119945131  CT:Z:-  CC:Z:chrX       NM:i:11
H7TVLADXX140423:2:1112:17883:23072      133     *       0       0       *       chr13   27729228        0       CAAACAAAAAAACAGAGCGAGGTTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAA  7;(22(((..(<-@)@985?88;=???>?<??????????#############################################################

CollectGcBiasMetrics exception

[Wed Jul 01 14:22:09 EDT 2015] picard.analysis.CollectGcBiasMetrics CHART_OUTPUT=SCC10055_A01.gc_bias_histogram.pdf SUMMARY_OUTPUT=SCC10055_A01.gc_bias_metrics INPUT=/tmp/picardmetrics_ks38_13802/SCC10055_A01.bam OUTPUT=SCC10055_A01.gc_bias_histogram TMP_DIR=[/tmp/picardmetrics_ks38_13802] VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=/data/srlab/external-data/Ensembl/pub/release-75/fasta/homo_sapiens/dna/chr/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa    WINDOW_SIZE=100 MINIMUM_GENOME_FRACTION=1.0E-5 IS_BISULFITE_SEQUENCED=false METRIC_ACCUMULATION_LEVEL=[ALL_READS] ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jul 01 14:22:09 EDT 2015] Executing as ks38 on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13; Picard version: 1.134(a7a08c474e4d99346eec7a9956a8fe71943b5d80_1434033355) IntelDeflater
INFO    2015-07-01 14:24:30     SinglePassSamProgram    Processed     1,000,000 records.  Elapsed time: 00:00:24s.  Time for last 1,000,000:    9s.  Last read position: chr12:3,320,784
Ignoring SAM validation error: ERROR: Read name H7TVLADXX140423:2:1112:17883:23072, Read CIGAR M operator maps off end of reference
Ignoring SAM validation error: ERROR: Record 1256555, Read name H7TVLADXX140423:2:1112:17883:23072, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
[Wed Jul 01 14:24:32 EDT 2015] picard.analysis.CollectGcBiasMetrics done. Elapsed time: 2.38 minutes.
Runtime.totalMemory()=5124915200
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Exception counting mismatches for read H7TVLADXX140423:2:1112:17883:23072 1/2 101b aligned read.
        at htsjdk.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:317)
        at htsjdk.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:331)
        at picard.analysis.directed.GcBiasMetricsCollector.addRead(GcBiasMetricsCollector.java:274)
        at picard.analysis.directed.GcBiasMetricsCollector.access$100(GcBiasMetricsCollector.java:25)
        at picard.analysis.directed.GcBiasMetricsCollector$PerUnitGcBiasMetricsCollector.acceptRecord(GcBiasMetricsCollector.java:123)
        at picard.analysis.directed.GcBiasMetricsCollector$PerUnitGcBiasMetricsCollector.acceptRecord(GcBiasMetricsCollector.java:64)
        at picard.metrics.MultiLevelCollector$AllReadsDistributor.acceptRecord(MultiLevelCollector.java:192)
        at picard.metrics.MultiLevelCollector.acceptRecord(MultiLevelCollector.java:315)
        at picard.analysis.directed.GcBiasMetricsCollector.acceptRecord(GcBiasMetricsCollector.java:58)
        at picard.analysis.CollectGcBiasMetrics.acceptRead(CollectGcBiasMetrics.java:149)
        at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:114)
        at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:53)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 296164673
        at htsjdk.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:304)
        ... 14 more

Kamil

unread,
Jul 2, 2015, 1:09:21 PM7/2/15
to sub...@googlegroups.com

Here are a few other instances of the same bug. Perhaps this is an integer overflow?

Skipped lengths:

  • 268,435,446
  • 268,435,451
  • 268,435,454
samtools view subjunc/SCC10055_D07.bam | grep H7TVLADXX140423:2:1102:12416:8747
H7TVLADXX140423:2:1102:12416:8747       101     *       0       0       *       chr17   43267198        0       CCACCATGTCCAGGATTTTTTTTTTTTTTTTGCGAAGGGGCCTCCCTCTGTTCCCTAGGCTCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCT  CCCFFBDDHDHHFGBBEFHHIIGGIGEIEF#######################################################################
H7TVLADXX140423:2:1102:12416:8747       153     chr17   43267198        6       26S3M268435446N19M53S   *       0       0       GAGTAGATGGTCGGCAGCGTCAGATGTGTAAAAGAGACAGCCCCCATGTCCAGGATTTTTTTTTTTTTTTTGCGATGGAGTCTCACTCTGTTGCCTAGGCT  ###########################################?@::+0393380-:@@BEEGIIIIIIIGIIIIIIGIIHHGHDGGFFAHFF;8FDD<??   HI:i:1  NH:i:1  CG:Z:48S17M36S  CP:i:14912798   CT:Z:-  CC:Z:chr6       NM:i:14
samtools view subjunc/SCC10055_G07.bam | grep H7TVLADXX140423:2:2215:5409:64069
H7TVLADXX140423:2:2215:5409:64069       73      chr10   114919611       0       15S1M268435451N18M67S   *       0       0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTACTAAACACCCCCCCCCCCCCCCCCCCCCCCCCCCTCCACCCTCGCCCCTCCCA  +1=B?DDDFAFDFFIIBD?6227@B@@@355@#####################################################################   HI:i:1  NH:i:1  CG:Z:34S17M50S  CP:i:115346243  CT:Z:-  CC:Z:chr10      NM:i:19
H7TVLADXX140423:2:2215:5409:64069       133     *       0       0       *       chr10   114919611       0       GTATCAACCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTCTTTTTTCTTTGTTTTAAACATGTCTTTTTTACACATTTATGGGCCGCCCG  @@@FFFFFDHDHH:EGIIIIJIG1?DBBHHFDDDDDDDDDDDDDDDB######################################################
samtools view subjunc/SCC10057_F11.bam | grep H9FTCADXX140602:2:2215:9745:68801
H9FTCADXX140602:2:2215:9745:68801       73      chr1    21960565        0       13S4M268435454N17M67S   *       0       0       GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGTTTTTTTTTCCCCCCCGGGGGGGGGGGGGGGGGCGGTTTCCGGCCCC  CBBFFFFFHHHHHIGHIIJJJJJJJJJJJHFDDDDDDDDDDDD##########################################################   HI:i:1  NH:i:1  CG:Z:34S19M48S  CP:i:22032080   CT:Z:-  CC:Z:chr1       NM:i:16
H9FTCADXX140602:2:2215:9745:68801       133     *       0       0       *       chr1    21960565        0       ACCCGGGAAGCAGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACTCCATCTCAAAAAAAAAAAAAAAAAAAAAA  ;;;95--@((.@.@(66688@<;>????????????<<?<<<;?;>;>?7=??>=?==<<?>?>>==<<<;===<========<::::::9:::::9:8:9

Wei Shi

unread,
Jul 15, 2015, 12:20:59 AM7/15/15
to sub...@googlegroups.com
Hi Kamil,

Could you send us your fastq file so that we can reproduce the problem you encountered? We need to have your command of index building as well.

Thanks,
Wei

Kamil

unread,
Jul 16, 2015, 11:36:13 AM7/16/15
to sub...@googlegroups.com

Hi Wei,

Thanks for the reply! I’d be happy to provide any files that you need. I sent you a Dropbox link to a shared folder with FASTQ files.

Curiously, I cannot reproduce the bug when I use two FASTQ files, where each file has just 1 read with the erroneous mapping mentioned in my previous message.

Therefore, it seems to me that the bug is not related to the sequence content of my reads. Instead, I would guess that the read mapping loop fails to reset some variables before mapping the next read. I haven’t studied your code, so this is just speculation.

subread-buildindex

Here is the index building command:

wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

f=Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
o=Homo_sapiens.GRCh37.75.dna.primary_assembly

subread-buildindex -F -B -o $o $f &> subread-buildindex.log

Here is the log:

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
    v1.4.6-p2

//=========================== indexBuilder setting ===========================\\
||                                                                            ||
||                Index name : Homo_sapiens.GRCh37.75.dna.primary_assembly    ||
||               Index space : base-space                                     ||
||           One block index : yes                                            ||
||          Repeat threshold : 24 repeats                                     ||
||  Distance to next subread : 1                                              ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o ../Homo_sapiens.GRCh37.75.dna.primary_as ... ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
||    8%,   1 mins elapsed, rate=7312.8k bps/s, total=3109m                   ||
||   16%,   2 mins elapsed, rate=6175.1k bps/s, total=3109m                   ||
||   24%,   3 mins elapsed, rate=4992.2k bps/s, total=3109m                   ||
||   33%,   4 mins elapsed, rate=4315.3k bps/s, total=3109m                   ||
||   41%,   6 mins elapsed, rate=3731.4k bps/s, total=3109m                   ||
||   49%,   8 mins elapsed, rate=3242.5k bps/s, total=3109m                   ||
||   58%,  11 mins elapsed, rate=2852.2k bps/s, total=3109m                   ||
||   66%,  14 mins elapsed, rate=2491.3k bps/s, total=3109m                   ||
||   74%,  18 mins elapsed, rate=2218.6k bps/s, total=3109m                   ||
||   83%,  23 mins elapsed, rate=1951.2k bps/s, total=3109m                   ||
||   91%,  28 mins elapsed, rate=1730.6k bps/s, total=3109m                   ||
|| 4311183 uninformative subreads were found.                                 ||
|| These subreads were excluded from index building.                          ||
|| Build the index...                                                         ||
||    8%,  37 mins elapsed, rate=1326.4k bps/s, total=3109m                   ||
||   16%,  40 mins elapsed, rate=1284.4k bps/s, total=3109m                   ||
||   24%,  44 mins elapsed, rate=1309.2k bps/s, total=3109m                   ||
||   33%,  47 mins elapsed, rate=1341.0k bps/s, total=3109m                   ||
||   41%,  50 mins elapsed, rate=1329.4k bps/s, total=3109m                   ||
||   49%,  53 mins elapsed, rate=1308.1k bps/s, total=3109m                   ||
||   58%,  57 mins elapsed, rate=1305.2k bps/s, total=3109m                   ||
||   66%,  61 mins elapsed, rate=1261.0k bps/s, total=3109m                   ||
||   74%,  65 mins elapsed, rate=1251.3k bps/s, total=3109m                   ||
||   83%,  68 mins elapsed, rate=1265.3k bps/s, total=3109m                   ||
||   91%,  71 mins elapsed, rate=1274.9k bps/s, total=3109m                   ||
|| Save current index block...                                                ||
||  [ 0.0% finished ]                                                         ||
||  [ 10.0% finished ]                                                        ||
||  [ 20.0% finished ]                                                        ||
||  [ 30.0% finished ]                                                        ||
||  [ 40.0% finished ]                                                        ||
||  [ 50.0% finished ]                                                        ||
||  [ 60.0% finished ]                                                        ||
||  [ 70.0% finished ]                                                        ||
||  [ 80.0% finished ]                                                        ||
||  [ 90.0% finished ]                                                        ||
||  [ 100.0% finished ]                                                       ||
||                                                                            ||
||                     Total running time: 100.8 minutes.                     ||
|| Index Homo_sapiens.GRCh37.75.dna.primary_assembly was successfully built!  ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Kamil

unread,
Jul 20, 2015, 3:08:21 PM7/20/15
to sub...@googlegroups.com

For others who have encountered this problem, I would like to suggest a script to drop these troublesome reads long skipped regions:

https://gist.github.com/slowkow/6c1c1b37b57fc1b7c795

This is only a temporary workaround, and the correct solution is to fix the bug in subread/subjunc that causes erroneous mappings.

Wei Shi

unread,
Sep 3, 2015, 9:24:11 PM9/3/15
to Subread
Dear Kamil,

Sorry for taking a while to get back to you. We have fixed the bug in the new release 1.4.6-p5.

Please let us know if the problem persists.

Cheers,
Wei
Reply all
Reply to author
Forward
0 new messages