read maps outside of reference boundary

61 views
Skip to first unread message

Daniel Núñez

unread,
May 18, 2017, 12:04:36 PM5/18/17
to rna-star
Hi,

I am observing two issues with the same read that I am having trouble understanding.

First, one of the sites that this multi-mapper aligns to is outside the reference chromosome it claims to align to, i.e. the left-most index indicated on the 4th column of the output SAM file is way larger than the total number of bases on that chromosome (the 211000022278463 chromosome is only 2841 bp long). 

Second, some of the alignments have the wrong sequence on the output SAM (CAGGTGATTACTTT instead of AAAGTAATCACCTG). 

Here is the output I am getting for this read: 

NB501872:18:HNTVHBGX2:1:13204:15469:1924 256 2L 12377715 0 13M1S * 0 0 AAAGTAATCACCTG //////<EA/E6/A NH:i:7 HI:i:1 AS:i:12 nM:i:0
NB501872:18:HNTVHBGX2:1:13204:15469:1924 272 2L 4475385 0 1S13M * 0 0 CAGGTGATTACTTT A/6E/AE<////// NH:i:7 HI:i:2 AS:i:12nM:i:0
NB501872:18:HNTVHBGX2:1:13204:15469:1924 272 2L 20761414 0 1S13M * 0 0 CAGGTGATTACTTT A/6E/AE<////// NH:i:7 HI:i:3 AS:i:12 nM:i:0
NB501872:18:HNTVHBGX2:1:13204:15469:1924 256 2R 20974252 0 13M1S * 0 0 AAAGTAATCACCTG //////<EA/E6/A NH:i:7 HI:i:4 AS:i:12 nM:i:0
NB501872:18:HNTVHBGX2:1:13204:15469:1924 256 211000022280607 1957 0 13M1S * 0 0 AAAGTAATCACCTG //////<EA/E6/A NH:i:7 HI:i:5 AS:i:12 nM:i:0
NB501872:18:HNTVHBGX2:1:13204:15469:1924 16 2L 13011689 0 3M74N10M1S * 0 0 CAGGTGATTACTTT A/6E/AE<////// NH:i:7 HI:i:6 AS:i:13 nM:i:0
NB501872:18:HNTVHBGX2:1:13204:15469:1924 256 211000022278463 262143 0 1S13M * 0 0 AAAGTAATCACCTG //////<EA/E6/A NH:i:7 HI:i:7 AS:i:12 nM:i:0


Any ideas what could be going on here? 

Thanks,

Daniel

Alexander Dobin

unread,
May 18, 2017, 4:30:07 PM5/18/17
to rna-star
Hi Daniel,

by standard SAM conventions, the sequence is that of the (+) strand of the genome, not the read itself. So if the read maps to the (-) strand, the sequence in the SAM file is reverse complemented.
The reads with FLAG=272 map to the (-) strand (0x10 bit set).

The other issue - mapping outside of the boundaries - looks like a bug.
Could you create a simple test example - say, add only one extra "chromosome" 211000022278463 into fly (?) genome and check if the problem persists.
If so, please send me the sequence of this extra chromosome, and I will try to replicate the problem.

Cheers
Alex

Daniel Núñez

unread,
May 18, 2017, 7:20:07 PM5/18/17
to rna-star
Hi Alex,

The 211000022278463 "chromosome" is part of the non-chromosomal fly sequences from Ensembl [Drosophila_melanogaster.BDGP6.dna.nonchromosomal.fa.gz in ftp://ftp.ensembl.org/pub/release-88/fasta/drosophila_melanogaster/dna/]. If I make a genome that is only the chromosomal DNA plus the 211000022278463 "chromosome", the mapping outside problem goes away. However, if I add all of the nonchromosomal DNA to the chromosomal DNA, the problem comes back.

In addition to being able to toss this read out given how short it is, I can also continue with just the chromosomal DNA. If you can offer any insight, I would be interested in knowing where the bug comes from though.

Thanks for your help,

Daniel

Alexander Dobin

unread,
May 19, 2017, 10:52:09 AM5/19/17
to rna-star
Hi Daniel,

I would want to fix this problem - could you please tell me which FASTA files (chromo and non-chromo) from the ftp you used for the genome, and which GTF files and genome generation parameters as well.

Cheers
Alex

Daniel Núñez

unread,
May 19, 2017, 11:47:01 AM5/19/17
to rna-star
Hi Alex,

For the reference, I used:
Drosophila_melanogaster.BDGP6.dna.chromosome.2L.fa.gz
Drosophila_melanogaster.BDGP6.dna.chromosome.2R.fa.gz
Drosophila_melanogaster.BDGP6.dna.chromosome.3L.fa.gz
Drosophila_melanogaster.BDGP6.dna.chromosome.2R.fa.gz
Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa.gz
Drosophila_melanogaster.BDGP6.dna.chromosome.Y.fa.gz
Drosophila_melanogaster.BDGP6.dna.chromosome.X.fa.gz
Drosophila_melanogaster.BDGP6.dna.chromosome.dmel_mitochondrion_genome.fa.gz
Drosophila_melanogaster.BDGP6.dna.nonchromosomal.fa.gz

I used the following GTF: Drosophila_melanogaster.BDGP6.88.gtf.gz fom ftp://ftp.ensembl.org/pub/release-88/gtf/drosophila_melanogaster

I built the index using: --runThreadN 23 --runMode genomeGenerate --genomeDir ${genomedir}  --genomeFastaFiles ${genomeFasta} --sjdbGTFfile ${genomeGTF} --sjdbOverhang 100

I made the following test.fastq:
@NB501872:18:HNTVHBGX2:1:13204:15469:1924
AAAGTAATCACCTG
+
//////<EA/E6/A

I mapped using --genomeDir ${genomedir} --runThreadN 1 --outFileNamePrefix ${tmpdir}/star. --readFilesIn test.fastq

Thanks and please let me know if you need any more information from me,

Daniel

Daniel Núñez

unread,
Jun 12, 2017, 4:46:11 PM6/12/17
to rna-star
Hi Alex,

Were you ever able to reproduce the bug? I ran into another read that has the same issue:

@NB501872:18:HNTVHBGX2:2:22202:14231:8250
ACGGTGATCAAGGAAGTCGAAATCCGAAAAGGAGTGTGAAACAACTCACCTGCCTAAGCAACTAA
+
/A6AAE/AE/////E/EE/AE6EEE///<EE///E/EE/EEEEAEEEE//EEE///E/EE/E///

Alexander Dobin

unread,
Jun 13, 2017, 6:10:55 PM6/13/17
to rna-star
Hi Daniel,

sorry for the belayed reply - in the last couple of weeks I have got a backlog of complex questions that I am trying to clear out now.
I did reproduce the problem and it appears to be a bug. I will work on fixing it tomorrow.

Cheers
Alex

Alexander Dobin

unread,
Jun 16, 2017, 12:46:27 PM6/16/17
to rna-star
Hi Daniel,

the problem was even more complex than I thought.
A quick fix for it is to use --genomeSAindexNbases 12 at the genome generation step.
If there are still any problematic alignments, you may have to lower it further.

Cheers
Alex

Daniel Núñez

unread,
Jun 18, 2017, 3:08:30 PM6/18/17
to rna-star
Hi Alex,

Setting --genomeSAindexNbases 12 indeed solved the issue. 

Thanks for looking into this!

-Daniel
Reply all
Reply to author
Forward
0 new messages