Problems in generating STAR index from Gencode v11 Mus musculus long noncoding RNA


Amanda Freire De Assis Riccardi

Nov 16, 2016, 5:56:44 PM
to rna-star
Hello all,

I've already used STAR many times to align RNA-Seq reads to the mouse whole-genome reference. Generally I retrieve the genome and annotation files from UCSC.

Now I want to align reads against a lncRNA-only reference. In the GENCODE database I found the FASTA (Long non-coding RNA transcript sequences) and GTF (Long non-coding RNA gene annotation) files for this purpose.

I used the following command to generate STAR index: 
 
nohup STAR  --runThreadN 12  --runMode genomeGenerate --genomeDir /work/gap/aire/genome_Gencode/Longnon-coding/Sequence  --genomeFastaFiles /work/gap/aire/genome_Gencode/Longnon-coding/Sequence/gencode.vM11.lncRNA_transcripts.fa --sjdbGTFfile /work/gap/aire/genome_Gencode/Longnon-coding/Annotation/gencode.vM11.long_noncoding_RNAs.gtf --sjdbOverhang 99 &

But the index wasn't generated because of a fatal error (log file attached):

Fatal INPUT FILE error, no valid exon lines in the GTF file: /work/gap/aire/genome_Gencode/Longnon-coding/Annotation/gencode.vM11.long_noncoding_RNAs.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.

After that I checked the identifiers in the FASTA and GTF files and realized that they are indeed different, although both files were downloaded from the same database.
Fasta ID: >ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-001|4933401J01Rik|1070|

gtf ID: chr1    HAVANA  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";
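
The mismatch described above can be checked programmatically before running genome generation. The following is a minimal sketch, not part of STAR: the function names are hypothetical, and it assumes that the sequence name is the text after `>` up to the first whitespace and that the GTF is tab-delimited with the sequence name in the first column.

```python
def fasta_names(lines):
    """Collect sequence names: text after '>' up to the first whitespace (assumed convention)."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(lines):
    """Collect the first (seqname) column of every non-comment, non-empty GTF line."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def shared_names(fasta_lines, gtf_lines):
    """Names present in both files; an empty set means the GTF cannot match the FASTA."""
    return fasta_names(fasta_lines) & gtf_seqnames(gtf_lines)
```

For the two GENCODE lncRNA files, `shared_names(open(fasta_path), open(gtf_path))` would be expected to return an empty set, since the FASTA names are transcript identifiers while the GTF names are chromosomes such as chr1.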

Can anybody help me fix this problem?

Thanks a lot

Amanda
nohup.out

chaoi

Nov 16, 2016, 11:37:43 PM
to rna-star
Hi,
 

I was just about to post a message about a problem like Amanda's.

I ran the same command as hers except for the "--sjdbOverhang" value, but the error message was different:
~/tools/STAR/bin/Linux_x86_64_static$ STAR --runThreadN 8 --runMode genomeGenerate --genomeDir ./STARindex.lncRNA.gencode.vM11 --genomeFastaFiles ./gencode.vM11.lncRNA_transcripts.fa --sjdbGTFfile ./gencode.vM11.long_noncoding_RNAs.gtf --sjdbOverhang 100

EXITING because of fatal input ERROR: could not open readFilesIn=Read1

Even when I replaced gencode.vM11.lncRNA_transcripts.fa with the genome FASTA file, the same message was returned.

Of course, the input files were downloaded from GENCODE.

I attached the log file.

I would also appreciate any advice.

Thanks.

chaoi 

On Thursday, November 17, 2016 at 7:56:44 AM UTC+9, Amanda Freire De Assis Riccardi wrote:
Log.out

Alexander Dobin

Nov 17, 2016, 4:21:52 PM
to rna-...@googlegroups.com
Hi Amanda, Chaoi,

since you are mapping to the lncRNA sequences extracted from the genome, you do not need to supply the gtf file (which contains their coordinates in the genome).
If you omit the --sjdbGTFfile and --sjdbOverhang parameters, genome generation should work.
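
As a rough illustration of the suggestion above, the genome generation call can be assembled without the two annotation parameters. This is only a sketch: the helper name `star_index_command` is hypothetical, the paths are the ones from Amanda's original command, and only flags that already appear in this thread are used.

```python
import shlex


def star_index_command(genome_dir, fasta, threads=12):
    """Build a STAR genomeGenerate command with no --sjdbGTFfile / --sjdbOverhang."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
    ]


cmd = star_index_command(
    "/work/gap/aire/genome_Gencode/Longnon-coding/Sequence",
    "/work/gap/aire/genome_Gencode/Longnon-coding/Sequence/gencode.vM11.lncRNA_transcripts.fa",
)
# Print the command for inspection; to execute it, pass `cmd` to subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```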

You should expect a low mapping rate, as most of the reads will not map to such a limited reference.

Cheers
Alex

chaoi

Nov 18, 2016, 12:35:36 AM
to rna-star
Hi, Alexander 

Thanks for your advice.

The index was generated successfully!

Chaoi

On Friday, November 18, 2016 at 6:21:52 AM UTC+9, Alexander Dobin wrote:

Ehsan Hajiramezanali

Jun 19, 2019, 11:42:58 PM
to rna-...@googlegroups.com
Hi Alex,

In such a case, when no GTF file was used at the indexing step, can we still use --quantMode to get count results?

Thanks
Ehsan

Alexander Dobin

Jun 20, 2019, 2:23:52 PM
to rna-star
Hi Ehsan,

What "genome" are you mapping to?
For --quantMode options you need to supply the GTF file to tell STAR the locations of the genes/transcripts.

Cheers
Alex

Ehsan Hajiramezanali

Jun 21, 2019, 12:56:45 PM
to rna-...@googlegroups.com
Thanks Alex, 

I'm using the lncRNA files from GENCODE and had a similar problem with indexing. As you suggested, I did not use the GTF in the indexing step. As Amanda mentioned, the FASTA and GTF files use different identifiers, although both were downloaded from the same database.

Fasta ID: >ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-001|4933401J01Rik|1070|

gtf ID: chr1    HAVANA  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693.1"; gene_type "TEC"; gene_status "KNOWN"; gene_name "4933401J01Rik"; level 2; havana_gene "OTTMUSG00000049935.1";

I have another question too.
Can one use the whole genome sequence with the lncRNA GTF file (in the genome generation step, etc.) to do differential expression analysis of lncRNAs?

Thanks

Alexander Dobin

Jun 24, 2019, 11:00:34 AM
to rna-star
Hi Ehsan,

If you are mapping to the sequences of the transcripts and you want to count reads per gene with --quantMode GeneCounts, you would need to create a different type of GTF file, one that refers to the transcript sequences rather than the whole genome chromosomes.
The whole genome GTF cannot be used with the FASTA made of transcript sequences. The whole genome GTF can only be used with the whole genome FASTA.
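
One possible way to build such a transcript-level GTF is sketched below. It is an illustration under assumptions, not an official STAR or GENCODE tool: the function name `transcript_gtf_lines` and the `source` label are hypothetical, each transcript is treated as its own "chromosome" with a single exon covering its full length on the + strand, and the gene and transcript IDs are assumed to be the second and first `|`-separated fields of the FASTA header, as in the example header earlier in the thread.

```python
def transcript_gtf_lines(fasta_lines, source="transcript_fasta"):
    """Yield one GTF 'exon' line per FASTA record, spanning the whole transcript."""

    def emit(name, length):
        fields = name.split("|")
        transcript_id = fields[0]
        # Assumption: the gene ID is the second '|' field; fall back to the first.
        gene_id = fields[1] if len(fields) > 1 else fields[0]
        attrs = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
        return "\t".join(
            [name, source, "exon", "1", str(length), ".", "+", ".", attrs]
        )

    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None and length > 0:
                yield emit(name, length)
            # Sequence name: text after '>' up to the first whitespace (assumed).
            parts = line[1:].split()
            name = parts[0] if parts else None
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None and length > 0:
        yield emit(name, length)
```

The first column of each generated line is the full FASTA sequence name, so that the GTF and FASTA names agree; whether STAR accepts these long `|`-containing names in every step has not been verified here.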

>>>Can one use the whole genome sequence with the lncRNA GTF file (in the genome generation step, etc.) to do differential expression analysis of lncRNAs?
Yes, this is what I would recommend. Use the whole genome FASTA and whole genome GTF. The latter will contain annotations for lncRNAs (in addition to all other classes of RNAs, of course).
If you are interested in the lncRNAs, you can look only at the counts of lncRNA genes. You would need to think about how to properly normalize lncRNA expression between the samples.
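
Restricting the counts to lncRNA genes could look roughly like the sketch below. It assumes the gene IDs are taken from the GENCODE lncRNA-only GTF (same release as the whole genome GTF used for indexing) and that the counts come from STAR's ReadsPerGene.out.tab, a tab-separated file with the gene ID in the first column, three count columns, and summary rows starting with `N_`. The function names are hypothetical, and normalization is deliberately left out.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_ids_from_gtf(gtf_lines):
    """Collect every gene_id that appears in a GTF (here: the lncRNA-only GTF)."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        match = GENE_ID.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def lncrna_counts(reads_per_gene_lines, lncrna_ids, column=1):
    """Keep counts for lncRNA genes only.

    column=1 is the unstranded count; 2 and 3 are the two stranded columns.
    """
    counts = {}
    for line in reads_per_gene_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the N_unmapped / N_multimapping / ... summary rows.
        if len(fields) < 4 or fields[0].startswith("N_"):
            continue
        if fields[0] in lncrna_ids:
            counts[fields[0]] = int(fields[column])
    return counts
```

Which count column to use depends on the strandedness of the library, so the `column` argument should be chosen per experiment.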

Cheers
Alex