I mapped one paired-end RNA-seq sample with subjunc v1.4.6-p2 and got a very strange mapping with 268,435,444 skipped bases on chromosome 13, even though chromosome 13 is only 115,169,878 bases long. I discovered this mapping because it causes Picard CollectGcBiasMetrics to exit with an exception.
Such mappings should not be considered valid:
--maxdist 1000000 (1M)opt=(
--subreads 20
--index $SUBREAD_INDEX
--BAMoutput
--output "$bam"
--multi 16
--quality
--threads $THREADS
--maxMismatches 10
--allJunctions
--gzFASTQinput
--read "$fq1"
--read2 "$fq2"
--order fr
--mindist 0
--maxdist 1000000
)
subjunc ${opt[*]}
samtools view subjunc/SCC10055_A01.bam | grep H7TVLADXX140423:2:1112:17883:23072
H7TVLADXX140423:2:1112:17883:23072 73 chr13 27729228 5 13S2M268435444N24M62S * 0 0 GTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGATGTCCACAACTTTTTTTTTTTTTTTTTTTTTTTGAGGGGACCCCCCC BBBFDDDFHHHHHJJHIEHHJJJJJJJJJJJJFDDDDDDDDDDDDDDDDB-)0(+(+33((+2(4@CDDDDDDDDDDDDDDDDDD@############### HI:i:1 NH:i:1 CG:Z:39S19M43S CP:i:119945131 CT:Z:- CC:Z:chrX NM:i:11
H7TVLADXX140423:2:1112:17883:23072 133 * 0 0 * chr13 27729228 0 CAAACAAAAAAACAGAGCGAGGTTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAA 7;(22(((..(<-@)@985?88;=???>?<??????????#############################################################
[Wed Jul 01 14:22:09 EDT 2015] picard.analysis.CollectGcBiasMetrics CHART_OUTPUT=SCC10055_A01.gc_bias_histogram.pdf SUMMARY_OUTPUT=SCC10055_A01.gc_bias_metrics INPUT=/tmp/picardmetrics_ks38_13802/SCC10055_A01.bam OUTPUT=SCC10055_A01.gc_bias_histogram TMP_DIR=[/tmp/picardmetrics_ks38_13802] VALIDATION_STRINGENCY=LENIENT REFERENCE_SEQUENCE=/data/srlab/external-data/Ensembl/pub/release-75/fasta/homo_sapiens/dna/chr/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa WINDOW_SIZE=100 MINIMUM_GENOME_FRACTION=1.0E-5 IS_BISULFITE_SEQUENCED=false METRIC_ACCUMULATION_LEVEL=[ALL_READS] ASSUME_SORTED=true STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Wed Jul 01 14:22:09 EDT 2015] Executing as ks38 on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13; Picard version: 1.134(a7a08c474e4d99346eec7a9956a8fe71943b5d80_1434033355) IntelDeflater
INFO 2015-07-01 14:24:30 SinglePassSamProgram Processed 1,000,000 records. Elapsed time: 00:00:24s. Time for last 1,000,000: 9s. Last read position: chr12:3,320,784
Ignoring SAM validation error: ERROR: Read name H7TVLADXX140423:2:1112:17883:23072, Read CIGAR M operator maps off end of reference
Ignoring SAM validation error: ERROR: Record 1256555, Read name H7TVLADXX140423:2:1112:17883:23072, bin field of BAM record does not equal value computed based on alignment start and end, and length of sequence to which read is aligned
[Wed Jul 01 14:24:32 EDT 2015] picard.analysis.CollectGcBiasMetrics done. Elapsed time: 2.38 minutes.
Runtime.totalMemory()=5124915200
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Exception counting mismatches for read H7TVLADXX140423:2:1112:17883:23072 1/2 101b aligned read.
at htsjdk.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:317)
at htsjdk.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:331)
at picard.analysis.directed.GcBiasMetricsCollector.addRead(GcBiasMetricsCollector.java:274)
at picard.analysis.directed.GcBiasMetricsCollector.access$100(GcBiasMetricsCollector.java:25)
at picard.analysis.directed.GcBiasMetricsCollector$PerUnitGcBiasMetricsCollector.acceptRecord(GcBiasMetricsCollector.java:123)
at picard.analysis.directed.GcBiasMetricsCollector$PerUnitGcBiasMetricsCollector.acceptRecord(GcBiasMetricsCollector.java:64)
at picard.metrics.MultiLevelCollector$AllReadsDistributor.acceptRecord(MultiLevelCollector.java:192)
at picard.metrics.MultiLevelCollector.acceptRecord(MultiLevelCollector.java:315)
at picard.analysis.directed.GcBiasMetricsCollector.acceptRecord(GcBiasMetricsCollector.java:58)
at picard.analysis.CollectGcBiasMetrics.acceptRead(CollectGcBiasMetrics.java:149)
at picard.analysis.SinglePassSamProgram.makeItSo(SinglePassSamProgram.java:114)
at picard.analysis.SinglePassSamProgram.doWork(SinglePassSamProgram.java:53)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 296164673
at htsjdk.samtools.util.SequenceUtil.countMismatches(SequenceUtil.java:304)
... 14 more
Here are a few other instances of the same bug. Perhaps this is an integer overflow?
Skipped lengths:
samtools view subjunc/SCC10055_D07.bam | grep H7TVLADXX140423:2:1102:12416:8747
H7TVLADXX140423:2:1102:12416:8747 101 * 0 0 * chr17 43267198 0 CCACCATGTCCAGGATTTTTTTTTTTTTTTTGCGAAGGGGCCTCCCTCTGTTCCCTAGGCTCTGTCTCTTATACACATCTCCGAGCCCACGAGACCTCTCT CCCFFBDDHDHHFGBBEFHHIIGGIGEIEF#######################################################################
H7TVLADXX140423:2:1102:12416:8747 153 chr17 43267198 6 26S3M268435446N19M53S * 0 0 GAGTAGATGGTCGGCAGCGTCAGATGTGTAAAAGAGACAGCCCCCATGTCCAGGATTTTTTTTTTTTTTTTGCGATGGAGTCTCACTCTGTTGCCTAGGCT ###########################################?@::+0393380-:@@BEEGIIIIIIIGIIIIIIGIIHHGHDGGFFAHFF;8FDD<?? HI:i:1 NH:i:1 CG:Z:48S17M36S CP:i:14912798 CT:Z:- CC:Z:chr6 NM:i:14
samtools view subjunc/SCC10055_G07.bam | grep H7TVLADXX140423:2:2215:5409:64069
H7TVLADXX140423:2:2215:5409:64069 73 chr10 114919611 0 15S1M268435451N18M67S * 0 0 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTACTAAACACCCCCCCCCCCCCCCCCCCCCCCCCCCTCCACCCTCGCCCCTCCCA +1=B?DDDFAFDFFIIBD?6227@B@@@355@##################################################################### HI:i:1 NH:i:1 CG:Z:34S17M50S CP:i:115346243 CT:Z:- CC:Z:chr10 NM:i:19
H7TVLADXX140423:2:2215:5409:64069 133 * 0 0 * chr10 114919611 0 GTATCAACCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTCTTTTTTCTTTGTTTTAAACATGTCTTTTTTACACATTTATGGGCCGCCCG @@@FFFFFDHDHH:EGIIIIJIG1?DBBHHFDDDDDDDDDDDDDDDB######################################################
samtools view subjunc/SCC10057_F11.bam | grep H9FTCADXX140602:2:2215:9745:68801
H9FTCADXX140602:2:2215:9745:68801 73 chr1 21960565 0 13S4M268435454N17M67S * 0 0 GTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGGTTTTTTTTTCCCCCCCGGGGGGGGGGGGGGGGGCGGTTTCCGGCCCC CBBFFFFFHHHHHIGHIIJJJJJJJJJJJHFDDDDDDDDDDDD########################################################## HI:i:1 NH:i:1 CG:Z:34S19M48S CP:i:22032080 CT:Z:- CC:Z:chr1 NM:i:16
H9FTCADXX140602:2:2215:9745:68801 133 * 0 0 * chr1 21960565 0 ACCCGGGAAGCAGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACTCCATCTCAAAAAAAAAAAAAAAAAAAAAA ;;;95--@((.@.@(66688@<;>????????????<<?<<<;?;>;>?7=??>=?==<<?>?>>==<<<;===<========<::::::9:::::9:8:9
Hi Wei,
Thanks for the reply! I’d be happy to provide any files that you need. I sent you a Dropbox link to a shared folder with FASTQ files.
Curiously, I cannot reproduce the bug when I use two FASTQ files, where each file has just 1 read with the erroneous mapping mentioned in my previous message.
Therefore, it seems to me that the bug is not related to the sequence content of my reads. Instead, I would guess that the read mapping loop fails to reset some variables before mapping the next read. I haven’t studied your code, so this is just speculation.
Here is the index building command:
wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
f=Homo_sapiens.GRCh37.75.dna.primary_assembly.fa
o=Homo_sapiens.GRCh37.75.dna.primary_assembly
subread-buildindex -F -B -o $o $f &> subread-buildindex.log
Here is the log:
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
v1.4.6-p2
//=========================== indexBuilder setting ===========================\\
|| ||
|| Index name : Homo_sapiens.GRCh37.75.dna.primary_assembly ||
|| Index space : base-space ||
|| One block index : yes ||
|| Repeat threshold : 24 repeats ||
|| Distance to next subread : 1 ||
|| ||
|| Input files : 1 file in total ||
|| o ../Homo_sapiens.GRCh37.75.dna.primary_as ... ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
//================================= Running ==================================\\
|| ||
|| Check the integrity of provided reference sequences ... ||
|| No format issues were found ||
|| Scan uninformative subreads in reference sequences ... ||
|| 8%, 1 mins elapsed, rate=7312.8k bps/s, total=3109m ||
|| 16%, 2 mins elapsed, rate=6175.1k bps/s, total=3109m ||
|| 24%, 3 mins elapsed, rate=4992.2k bps/s, total=3109m ||
|| 33%, 4 mins elapsed, rate=4315.3k bps/s, total=3109m ||
|| 41%, 6 mins elapsed, rate=3731.4k bps/s, total=3109m ||
|| 49%, 8 mins elapsed, rate=3242.5k bps/s, total=3109m ||
|| 58%, 11 mins elapsed, rate=2852.2k bps/s, total=3109m ||
|| 66%, 14 mins elapsed, rate=2491.3k bps/s, total=3109m ||
|| 74%, 18 mins elapsed, rate=2218.6k bps/s, total=3109m ||
|| 83%, 23 mins elapsed, rate=1951.2k bps/s, total=3109m ||
|| 91%, 28 mins elapsed, rate=1730.6k bps/s, total=3109m ||
|| 4311183 uninformative subreads were found. ||
|| These subreads were excluded from index building. ||
|| Build the index... ||
|| 8%, 37 mins elapsed, rate=1326.4k bps/s, total=3109m ||
|| 16%, 40 mins elapsed, rate=1284.4k bps/s, total=3109m ||
|| 24%, 44 mins elapsed, rate=1309.2k bps/s, total=3109m ||
|| 33%, 47 mins elapsed, rate=1341.0k bps/s, total=3109m ||
|| 41%, 50 mins elapsed, rate=1329.4k bps/s, total=3109m ||
|| 49%, 53 mins elapsed, rate=1308.1k bps/s, total=3109m ||
|| 58%, 57 mins elapsed, rate=1305.2k bps/s, total=3109m ||
|| 66%, 61 mins elapsed, rate=1261.0k bps/s, total=3109m ||
|| 74%, 65 mins elapsed, rate=1251.3k bps/s, total=3109m ||
|| 83%, 68 mins elapsed, rate=1265.3k bps/s, total=3109m ||
|| 91%, 71 mins elapsed, rate=1274.9k bps/s, total=3109m ||
|| Save current index block... ||
|| [ 0.0% finished ] ||
|| [ 10.0% finished ] ||
|| [ 20.0% finished ] ||
|| [ 30.0% finished ] ||
|| [ 40.0% finished ] ||
|| [ 50.0% finished ] ||
|| [ 60.0% finished ] ||
|| [ 70.0% finished ] ||
|| [ 80.0% finished ] ||
|| [ 90.0% finished ] ||
|| [ 100.0% finished ] ||
|| ||
|| Total running time: 100.8 minutes. ||
|| Index Homo_sapiens.GRCh37.75.dna.primary_assembly was successfully built! ||
|| ||
\\===================== http://subread.sourceforge.net/ ======================//
For others who have encountered this problem, I would like to suggest a script to drop these troublesome reads long skipped regions:
https://gist.github.com/slowkow/6c1c1b37b57fc1b7c795
This is only a temporary workaround, and the correct solution is to fix the bug in subread/subjunc that causes erroneous mappings.