Issues with processing annotations

Rez

unread,

Aug 15, 2018, 11:09:15 AM8/15/18

to rna-star

I am mapping RNAseq reads to a de novo transcriptome reference I have assembled. The reference has also been processed post-assembly to resemble unigenes (not the raw assembly data). The use of the annotation file (a GFF3 file generated using InterProScan) is therefore very important.

I am getting the following error when running a mapping job:

Aug 15 14:37:13 ..... processing annotations GTF

terminate called after throwing an instance of 'std::out_of_range'

what(): vector::_M_range_check

The same issue occurs if I include the annotation file when generating the indices. Having looked at various posts elsewhere I have seen a number of reasons why this may be, but I'm still having difficulty troubleshooting the issue. The chromosome names in the GFF3 file are identical to the FASTA.

The Log.out file does give various warnings: 'no transcript_id for line' or 'no gene_id for line'.

For reference here is part of my gff3 annotation file:

##gff-version 3

##feature-ontology http://song.cvs.sourceforge.net/viewvc/song/ontology/sofa.obo?revision=1.269

##interproscan-version 5.30-69.0

##sequence-region L.tenue_CDS|Contig_000045 1 2145

L.tenue_CDS|Contig_000045 provided_by_user nucleic_acid 1 2145 . + . ID=L.tenue_CDS|Contig_000045;Name=L.tenue_CDS|Contig_000045;md5=ebb335ac9f2744865c41287bd5ab63b1

L.tenue_CDS|Contig_000045 getorf ORF 131 1543 . + . Target=pep_L.tenue_CDS|Contig_000045_131_1543 1 471;ID=orf_L.tenue_CDS|Contig_000045_131_1543;Name=L.tenue_CDS|Contig_000045_14;md5=ebb335ac9f2744865c41287bd5ab63b1

L.tenue_CDS|Contig_000045 getorf polypeptide 1 471 . + . ID=pep_L.tenue_CDS|Contig_000045_131_1543;md5=118d52e86011586a296e6bd9ae6c8748

L.tenue_CDS|Contig_000045 Pfam protein_match 334 392 1.5E-17 + . date=02-08-2018;Target=null 334 392;Ontology_term="GO:0007165";ID=match$1_334_392;signature_desc=NAF domain;Name=PF03822;status=T;Dbxref="InterPro:IPR004041","KEGG:04150+2.7.11.1","KEGG:04151+2.7.11.1","KEGG:04714+2.7.11.1","KEGG:04926+2.7.11.1","KEGG:05165+2.7.11.1"

L.tenue_CDS|Contig_000045 Pfam protein_match 39 295 5.8E-73 + . date=02-08-2018;Target=null 39 295;Ontology_term="GO:0004672","GO:0005524","GO:0006468";ID=match$2_39_295;signature_desc=Protein kinase domain;Name=PF00069;status=T;Dbxref="InterPro:IPR000719"

L.tenue_CDS|Contig_000045 ProSiteProfiles protein_match 39 295 51.962 + . date=02-08-2018;Target=null 39 295;Ontology_term="GO:0004672","GO:0005524","GO:0006468";ID=match$3_39_295;signature_desc=Protein kinase domain profile.;Name=PS50011;status=T;Dbxref="InterPro:IPR000719"

L.tenue_CDS|Contig_000045 ProSiteProfiles protein_match 332 356 11.41 + . date=02-08-2018;Target=null 332 356;Ontology_term="GO:0007165";ID=match$4_332_356;signature_desc=NAF domain profile.;Name=PS50816;status=T;Dbxref="InterPro:IPR018451","KEGG:04150+2.7.11.1","KEGG:04151+2.7.11.1","KEGG:04714+2.7.11.1","KEGG:04926+2.7.11.1","KEGG:05165+2.7.11.1"

L.tenue_CDS|Contig_000045 Gene3D protein_match 28 118 7.8E-32 + . date=02-08-2018;Target=null 28 118;ID=match$5_28_118;Name=G3DSA:3.30.200.20;status=T

And here is part of my FASTA:

>L.tenue_CDS|Contig_000001 len=2064 TAIR10_locus=AT1G0102

CCAAAGCTTATCGTCATCTACTCTTCAACGTTCTCAATCAGAAGAGTGCAGTTATTGAGG

TACAATGGCAGCTTACATTCGGTTATGGTTGTCTCTTATCCAATGATTGCTAATGGTTGG

TGATGGAGAGCTGCTGAAAATTTTCAGGGTCTGTTGAGGAATGCGATTTTTGGTTTCATT

CTTCTGGATGGTTGTATCCACTTGAGTGGCCATCATGTGTCTTGCTTTAATCAAGTTGAA

ATGCCCTTAACTTACTGAATGTTATTTGTTGAGGTCCTTAGCAAATCCAGATAGATGTAT

GCTTTTGAATTGGCAGGAGGGTGAAAGAAGAGGTTCTTCATCGAGCATAATTTCTTTGTT

TTGGAGATACCGAAAGGTGTTCATGGCTTGTCTTGTTGGTAATTCACTTTTTCTCTGTGT

TTTCCTTCTCGCCACAAGGATTCTGCTGAACTCTTCGCTTACATTTTCCAGGTAAATGTT

TCCCCTGCCCTACCCTAATTTGTCCCCCCTATACATTTACAAGTTTTTTCATTAAACCCA

And finally here are the STAR commands I am using:

/path/to/STAR \

--runThreadN 8 \

--runMode genomeGenerate \

--outFileNamePrefix gendir/prefix \

--genomeDir gendir \

--limitGenomeGenerateRAM 119287430528 \

--genomeFastaFiles fasta.fa

/path/to/STAR \

--runThreadN 8 \

--genomeDir gendir \

--sjdbGTFfile fasta.gff3.gff3 \

--sjdbGTFtagExonParentGene Parent \

--sjdbGTFfeatureExon protein_match \

--alignEndsType EndToEnd \

--alignIntronMax 100000 \

--outFileNamePrefix outdir/pref \

--outSAMtype BAM SortedByCoordinate \

--outMultimapperOrder Random --outFilterMultimapNmax 1 \

--readFilesCommand zcat \

--readFilesIn condition_paired_1.fq.gz condition_paired_2.fq.gz

Rez

unread,

Aug 15, 2018, 11:59:45 AM8/15/18

to rna-star

For additional information, I have attemted converting the gff3 file to a gtf using gffread, but get the following errors:

GFF Error: overlapping duplicate polypeptide feature (ID=pep_L.tenue_CDS|Contig_168610|L.tenue_CDS|Contig_168610_119_196_r)

GFF Error: overlapping duplicate nucleic_acid feature (ID=L.tenue_CDS|Contig_168983)

Alexander Dobin

unread,

Aug 16, 2018, 4:00:28 PM8/16/18

to rna-star

Hi Rez,

gffread complaining is a bad sign - it means the GFF is not compliant with the standard GFF formatting.

GFF files has to contain "exon" lines (i.e. col3==exon) with Parent tag pointing to the transcripts, "transcript" lines with ID tags and Parent pointing to genes, and "gene" lines with ID tags.