terminate called after throwing an instance of 'std::out_of_range' what(): vector::_M_range_check


Samuel Zimmerman

Oct 20, 2015, 10:30:03 AM
to rna-star
Hi All,

I am trying to create the genome indexes to run my alignment on, but am getting a "vector::_M_range_check" error. I am using STAR version 2.4.2a. Below is the standard output that shows the error.

Oct 20 00:26:48 ..... Started STAR run
Oct 20 00:26:48 ... Starting to generate Genome files
Oct 20 00:26:57 ... starting to sort  Suffix Array. This may take a long time...
Oct 20 00:27:00 ... sorting Suffix Array chunks and saving them to disk...
Oct 20 00:28:57 ... loading chunks from disk, packing SA...
Oct 20 00:29:06 ... Finished generating suffix array
Oct 20 00:29:06 ... starting to generate Suffix Array index...
Oct 20 00:31:38 ..... Processing annotations GTF
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check
/uge/8.2.0/default/spool/n9/job_scripts/166760: line 12: 473392 Aborted                 (core dumped) STAR --runMode genomeGenerate --sjdbGTFfeatureExon cDNA_match --genomeSAindexNbases 13 --genomeChrBinNbits 9 --sjdbGTFtagExonParentTranscript Parent --genomeDir ${outDir} --genomeFastaFiles ${fastaFile} --sjdbGTFfile ${annotationFile} --runThreadN 10 --sjdbOverhang ${maxReads}


Apparently there is an upload error, so I will copy and paste part of my gff file and fasta file below. I got both of these files from the NCBI RefSeq catalog.


I have 2 ideas for why the alignment is not working.

(1). Column 1 of the GFF file does not match the contig name of the fasta file.

(2). There is no "parent" attribute in column 9 of the GFF file when there should be.

I tried to correct problem number 1 by trimming the contig name. For example, I converted ">gi|242117977|ref|NR_027905.1| Mus musculus uncharacterized LOC106740 (LOC106740), long non-coding RNA" to ">NR_027905.1" but there were still many missing chromosomes in the log file.

As for the second problem (if this is indeed the problem), I am not sure how to correct it. Any advice would be appreciated.

Thank you very much.

Best,
Sam

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
NC_000068.7    RefSeq    cDNA_match    76982338    76982557    .    -    .    ID=614ff0f3-5a6a-4c26-8573-f782ad863a8f;Target=XM_006499157.2 1 220 +;gap_count=0;num_ident=107676;num_mismatch=0;pct_coverage=100;pct_identity_gap=100;pct_identity_ungap=100
NC_000068.7    RefSeq    cDNA_match    76980092    76980195    .    -    .    ID=614ff0f3-5a6a-4c26-8573-f782ad863a8f;Target=XM_006499157.2 221 324 +;gap_count=0;num_ident=107676;num_mismatch=0;pct_coverage=100;pct_identity_gap=100;pct_identity_ungap=100
NC_000068.7    RefSeq    cDNA_match    76977093    76977296    .    -    .    ID=614ff0f3-5a6a-4c26-8573-f782ad863a8f;Target=XM_006499157.2 325 528 +;gap_count=0;num_ident=107676;num_mismatch=0;pct_coverage=100;pct_identity_gap=100;pct_identity_ungap=100


>gi|242117977|ref|NR_027905.1| Mus musculus uncharacterized LOC106740 (LOC106740), long non-coding RNA
TTAATATCTATACTATGTTAGTGCTACCACAAAAGTTTCAACTTATTGTGCTATTTTTTCAACAAGAAAAGTAAGTATAC
ACGAGCCTTTTATTTGAGATGCTGAGGGCTGTTGACTGAAGAAAACTAAAAGGCTGTGGCAGTGTGTGAAGGCGATCAGA
GGGTCCTAGTGAGGCACAGGGCAGTCCTCCCCCTCCCTCAGTGCTGGTCTTCACTTTGTCTGTGAGGCCCTGGATGCAGG
GTGTGGCCTCTTAGAGCTGGGATGCTGCAGGAACAGTGTGCTTTTGTGTGTTGAACTGTAGCATCTCTACAAAGGGCCAC
AGTGGCCTCTCTCCCACTTCTCCTGCTGCCCTCCCACCTATCACCTTTCTTCTTTCTTCCCCATCCCGCTTCCCTCTCCC
ATCCTCCAGTATTTTTCTTTTCAACAAGATGGAGATCATTGGGTAAGAAAAACGTGAAGTGGTTATCAGGGCATTGTGTG
ATTTTAAAGGTGAACCCCGAAGTTGCTAGTGTCTCCTTTTATTATAGGCATTATGTATACTTTAGTGATTATAGAACTTG
AATTGCTCTAGAATGTGAATTAGTTTGTGTTTTATTTCTTTTGAGTTGCTTTGTAAAGAGTCAATGAGGAATTCTCTTTT
CAAAATTTAATATTGTGTGGTTTCTTCCCTGACTTTAGTAAAAGATATTAGAGGACTTACTCTGCTAGTATCTAACTTAA
AGTGTAAGTTTATCTTTTACATATAGTACAGGTTATTGAAATTTCTGCTGCAGATCACAGCAGTTAAGGCTCAATCTTAG
AAGTGAATTTCCTAGCGTTTTTTTTCCACTGATGTCCAGGACACAGTTAGAGCATGTGCTTAGGATGCAGATGCAGCAGG
AAGAGCAGAGTGACTAACTGCTATCTGGGAGAGCAGGGTATAAACGAGAGGAGGTGGAGAGTAGTCAATGGGACATTCTA
TATGTTTGTTTAACTCCATATTTATTATTTTGTAGAGACCTTTGTGATTGTTTAGTTATTGTTTTTTTATCATATTTATG
TATAAGGTTGACTTTTTCAAAAATAAATAAGCCAAAATTGTTTTTGAA
>gi|156523272|ref|NM_001013372.2| Mus musculus neural regeneration protein (Nrp), mRNA
CGGTCCAAGGAATTTTTCTGACAAACGCAATAGGCCGACCAGTACTGGAACGCAGTGCGCTTAGCCCCTTTATGGCGGAG
GCTGCCATGTTAAAACGGAATGAATCGAAACCCTGGAGTCGTGACCCCGGAAGAACCTGCCAGAGCCGGAATTTCGAGTT
CTGCTTCCGGGCCAAACTGTTGGCAGCCTCGAGATGGGGAAGATGGCGGCTGCTGTGGCTTCATTAGCCACGCTGGCTGC
AGAGCCCAGAGAGGATGCTTTCCGGAAGCTTTTCCGCTTCTACCGGCAGAGCCGGCCGGGGACAGCGGACCTGGGAGCCG
TCATCGACTTCTCAGAGGCGCACTTGGCTCGGAGCCCGAAGCCCGGCGTGCCCCAGGTAGGAAAGGAGGAGTAGTGTGTG
CCAGCCTAGCGGCCGACTGGGCCACCCGAGACTGGGCCGCCTCCGGGCCGGCTTTGGAGGGAAGCCCCTGCTGGGCCTGT
CCAGTGAGCTGTAATGTCGAGCGATGAGCGACCAGCTGCCTCGCTGTCCCAACGCTCTGGCCACGGCTTGTGCCTTGCCG
CCATTTCCCCCAACCCACGCGGGCCACGGCTTGTGCCCTGCCGCCATTTCCCCCAACCCACGCGACCTTGCTAAAAAAAA
AAAAAGAAAGAAAAGAAAAGAAAGAAAGAAAGAAAAAAATCTGGAAATTGCTTGTACCTCCTTAACTATCTGTTTAATAC
TAATACGATATTTTGTGTAAAGCTCAGAAGAACATCTTCGTGGACGTTAGGGTGGCCTCATAACTTCAGATAAAAGCAGC
CATTTAATAAGTCTCAAACCGTTAATCCGTTGGGCCTGAGACTCGATCGACCCTGTCTTCTCTGAGGCTTTGAAAGTAAA
GGTAAAATTAGCAGGTTTTTTTCCTGAGAATCTAGGAGCCTGGAGAGATAGCTCAGTAATTAAGAGCATTTACCTACTGG
TGTTCCCAAGAACACCAAGTAGATTTGGTTCCTTGCAGCCACGTGGCAGCTCACAGCCTTCTTGTAACTCTTCCGGAGGA
TCAGACACCCTCTCTTGAGCTCCACAGGAGAGCACTCGTAGACATGTAAATAAACTTCTAAGCTAAATCTAAACAATTTA
TGTACCCTCCCTATTTCTTCGTGATGAGAAGAAAGGGGCCAGAGGGTATG

Kirill Tsyganov

Oct 20, 2015, 5:46:21 PM
to Samuel Zimmerman, rna-star
Hi Sam, 

This looks identical to the problem that Arti had. I just posted an answer for that here: https://groups.google.com/forum/#!topic/rna-star/MiscsB2SPPw. In a nutshell, you need to make sure your "chromosomeId" string is the same in the FASTA and GTF files.
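
A quick way to see whether the names agree is to compare them directly. The sketch below is only an illustration: the function names are made up here and are not part of STAR. It assumes the sequence name is the text after ">" up to the first whitespace in each FASTA header, and that column 1 of the annotation file is tab-separated.

```python
def fasta_names(lines):
    """Collect sequence names: text after '>' up to the first whitespace (assumed convention)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def annotation_names(lines):
    """Collect column 1 of a GTF/GFF file, skipping comment and blank lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def compare_names(fasta_lines, annotation_lines):
    """Return the names shared by both files and the names unique to each."""
    fa = fasta_names(fasta_lines)
    an = annotation_names(annotation_lines)
    return {
        "shared": fa & an,
        "only_in_fasta": fa - an,
        "only_in_annotation": an - fa,
    }
```

Both functions take any iterable of lines, so open file handles can be passed in, e.g. `compare_names(open(fasta_path), open(gff_path))` with your own paths. For the excerpts pasted in this thread, column 1 of the GFF is NC_000068.7 while the FASTA records are NR_/NM_ sequences, so a check like this would report no shared names for those lines.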

Unless you are using some special type of mouse and/or a novel annotation/sequence, maybe try Ensembl or GENCODE as your reference files instead. Another option is not to use a GTF file at the `genomeGenerate` step, unless of course you are studying splice junctions.

Cheers, 

Kirill


Samuel Zimmerman

Oct 21, 2015, 8:36:23 PM
to rna-star, samuel.e....@gmail.com
Hi Kirill,

Thanks for helping. I tried editing the contig names of the fasta file to be like this:

">NR_027905.1" instead of ">gi|242117977|ref|NR_027905.1| Mus musculus uncharacterized LOC106740 (LOC106740), long non-coding RNA" however maybe that is not the correct contig name either.

I am specifically using this fasta file as my reference file because I want to find the number of genes that align to different types of RNA (i.e. rRNA, mRNA/protein coding regions). My plan was to align the reference to my reads. This would allow me to parse the SAM file to find how reads mapped to genes that coded for the different types of RNA.

How would not using a GTF file at the genomeGenerate step help? Don't I have to specify it in the alignment step anyway?

Thanks for your help.

Best,
Sam

Kirill Tsyganov

Oct 21, 2015, 9:09:03 PM
to Samuel Zimmerman, rna-star
I think there are two things here.

1.

General workflow: get FASTQ files, then map them to the reference genome. The reference can really be anything. You can map your reads to a transcriptome or to any other nucleotide sequences in FASTA format.

With most of the mapping tools that I know of, including STAR, there is a two-step process:

  A. make the index
  B. do the alignment

As an aside, some aligners are "aware" of splicing and some aren't, e.g. STAR and TopHat can detect splicing events whereas bwa and bowtie can't!

All you need the GTF/GFF file for is to enable your aligner to identify splicing events in your RNA-seq data; obviously there won't be any splicing events in DNA-seq data.

In STAR you can, optionally, specify a GTF file during the indexing step (in the latest release you can also do that at the alignment step, but let's not worry about that for now), which enables STAR to detect splicing events.

If you don't care about splicing events and maybe just care about differentially expressed genes, you don't really need to supply a GTF file.

2. 

As for different RNA types, the Ensembl GTF file does hold that information under the "gene_biotype" tag. I've attached a picture that might help; those three columns were parsed from a standard Ensembl GTF file using a custom Python script (I need this for my own project).
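
For anyone who wants to pull that tag out themselves, here is a minimal sketch of the idea. It is not the script mentioned above, and the function name is made up for illustration. It assumes a tab-separated GTF whose column 9 uses the usual `key "value";` attribute style, and that the file contains `gene` feature rows carrying `gene_id` and `gene_biotype`.

```python
import re

# Matches GTF-style attributes such as: gene_id "X"; gene_biotype "lncRNA";
GTF_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def gene_biotypes(lines):
    """Map gene_id -> gene_biotype for 'gene' records in an iterable of GTF lines."""
    biotypes = {}
    for line in lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # Only look at complete records whose feature type (column 3) is 'gene'.
        if len(columns) < 9 or columns[2] != "gene":
            continue
        attributes = dict(GTF_ATTRIBUTE.findall(columns[8]))
        if "gene_id" in attributes:
            # "NA" is an arbitrary placeholder for records without the tag.
            biotypes[attributes["gene_id"]] = attributes.get("gene_biotype", "NA")
    return biotypes
```

Passing an open file handle works, e.g. `gene_biotypes(open(gtf_path))` with your own path; counting the values of the returned dictionary gives the number of genes per RNA type.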

As for parsing the SAM file, I'm not sure about your strategy; maybe consult the BioStars forums, they're really good for those things. In general, SAM (Sequence Alignment/Map) just holds information about your mapping coordinates and no information about gene names and types.

Obviously, if you are using a reference FASTA file that just holds reference sequences for all possible non-coding RNAs, then yes, you can just map to that file and use samtools to count the percentage of mapped reads, I guess to find out whether your sample is contaminated with non-coding RNAs. But that still wouldn't (directly) give gene names.
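
If the reference is a collection of RNA sequences, the reference name in column 3 of each SAM record is the sequence the read mapped to, so counting per name is straightforward. The sketch below is one hedged way to do it in plain Python rather than with samtools. The function name is made up, it reads uncompressed SAM text, and it counts only primary alignments of mapped reads (FLAG bits 0x4, 0x100 and 0x800 are skipped).

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count primary alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory fields
            continue
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        counts[fields[2]] += 1
    return counts
```

With paired-end data each mate is counted separately here. Turning the reference names into RNA types would still need a separate lookup table from sequence name to type.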
 
Kirill
gtfInfo.png

Alexander Dobin

Oct 23, 2015, 5:11:01 PM
to rna-star, samuel.e....@gmail.com
Hi Samuel,

Kirill is right that the chromosome names in the FASTA and GTF files should coincide, otherwise there will be a lot of warnings in the Log.out file.
On top of that, the lines without the Parent attribute will cause the error you are seeing.
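
To find such lines before running genomeGenerate, something like the following could be used. It is a sketch, not part of STAR, and the function names are made up. It assumes tab-separated GFF3 with `key=value;` attributes in column 9; the defaults mirror the feature type and tag given in the command above (`cDNA_match` and `Parent`).

```python
def parse_gff3_attributes(column9):
    """Parse a GFF3 attribute string 'key=value;key=value' into a dict."""
    attributes = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def lines_missing_attribute(lines, feature="cDNA_match", attribute="Parent"):
    """Yield (line_number, line) for records of the given feature type that lack the attribute."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != feature:
            continue
        if attribute not in parse_gff3_attributes(columns[8]):
            yield number, line.rstrip("\n")
```

The cDNA_match lines pasted earlier in the thread carry ID and Target attributes but no Parent, so each of them would be reported by this check.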

Could you explain a bit more what you are trying to do? Are you mapping your reads to the collection of RNA sequences, not to the whole genome?
If this is the case, then you do not need the GTF file, since the RNA sequences already contain all the information from the GTF.

Cheers
Alex