Hello all,
I've already used STAR many times to align RNA-Seq reads to the mouse whole-genome reference. I generally retrieve the genome and annotation files from UCSC.
Now I want to align reads against a lncRNA reference only, and in the GENCODE database I found the FASTA (Long non-coding RNA transcript sequences) and GTF (Long non-coding RNA gene annotation) files for this purpose.
I used the following command to generate the STAR index:
nohup STAR --runThreadN 12 --runMode genomeGenerate \
  --genomeDir /work/gap/aire/genome_Gencode/Longnon-coding/Sequence \
  --genomeFastaFiles /work/gap/aire/genome_Gencode/Longnon-coding/Sequence/gencode.vM11.lncRNA_transcripts.fa \
  --sjdbGTFfile /work/gap/aire/genome_Gencode/Longnon-coding/Annotation/gencode.vM11.long_noncoding_RNAs.gtf \
  --sjdbOverhang 99 &
But the index wasn't generated because of a fatal error (log file attached):
Fatal INPUT FILE error, no valid exon lines in the GTF file: /work/gap/aire/genome_Gencode/Longnon-coding/Annotation/gencode.vM11.long_noncoding_RNAs.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.
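After that, I checked the sequence identifiers in both the FASTA and GTF files and found that they really are different, even though both files were downloaded from the same database: the FASTA headers are transcript IDs, while the first column of the GTF holds chromosome names.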
Fasta ID: >ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-001|4933401J01Rik|1070|
gtf ID: chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
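For anyone who wants to reproduce the identifier comparison, here is a minimal sketch of the kind of check I mean. The function names (fasta_names, gtf_names, shared_names) are just hypothetical helpers, not part of STAR, and the sketch assumes that the sequence name is the FASTA header text up to the first whitespace and that the GTF is tab-delimited with the sequence name in column 1.

```shell
# Hypothetical helpers for comparing sequence names between a FASTA and a GTF.

# Unique sequence names from FASTA headers:
# the text after '>' up to the first whitespace (assumption, see above).
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u
}

# Unique sequence names from column 1 of a tab-delimited GTF,
# skipping '#' comment lines and empty lines.
gtf_names() {
    awk -F'\t' '!/^#/ && NF { print $1 }' "$1" | sort -u
}

# Names that occur in both files. No output means no name is shared.
shared_names() {
    tmp=$(mktemp)
    fasta_names "$1" > "$tmp"
    # -F: fixed strings, -x: whole-line match, -f: read patterns from file
    gtf_names "$2" | grep -Fxf "$tmp" || true
    rm -f "$tmp"
}
```

Usage would be something like: shared_names gencode.vM11.lncRNA_transcripts.fa gencode.vM11.long_noncoding_RNAs.gtf. Judging from the two example lines above, I would expect this to print nothing for my files, which matches what the error message suggests.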
Can anybody help me fix this problem?
Thanks a lot
Amanda