Building a genome index for Saccharomyces cerevisiae


Kieran Mace

May 23, 2016, 11:58:58 AM
to rna-star
Hi, I'm having some trouble building the genome index for Saccharomyces cerevisiae.


Here is my current workflow:


Download genome data from SGD:
http://www.yeastgenome.org/download-data/sequence

I'm not sure exactly which files to download, but I'm currently trying:


http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz


After extracting this archive, I run:

STAR --runThreadN 4 --runMode genomeGenerate --genomeDir genomeDir --genomeFastaFiles S288C_reference_genome_R64-2-1_20150113/S288C_reference_sequence_R64-2-1_20150113.fsa --sjdbGTFfile S288C_reference_genome_R64-2-1_20150113/saccharomyces_cerevisiae_R64-2-1_20150113.gff --sjdbOverhang 300 --sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent

and get the following error:

Fatal INPUT FILE error, no valid exon lines in the GTF file: S288C_reference_genome_R64-2-1_20150113/saccharomyces_cerevisiae_R64-2-1_20150113.gff

Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.




May 20 15:49:10 ...... FATAL ERROR, exiting


Any chance someone has used STAR with Saccharomyces cerevisiae before?


Alexander Dobin

May 23, 2016, 5:13:03 PM
to rna-star
Hi Kieran,

The problem is that the chromosome names in the FASTA (.fsa) file do not agree with the names in the .gff file. The former are formatted as
>ref|NC_001133| [org=Saccharomyces cerevisiae] [strain=S288C] [moltype=genomic] [chromosome=I]
while the latter have chrI, chrII, ... names.
You would need to convert the names in the .fsa file into the chr... format.
Also, it seems that the GFF file has FASTA sequences attached at the end. It might be possible to save those sequences in a separate file and use them as the FASTA input to STAR.
It's best to cut these FASTA sequences from the GFF file to avoid confusion with formatting.
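
A minimal sketch of both options, assuming the combined GFF marks the start of its sequence section with the standard GFF3 `##FASTA` directive and that the .fsa headers carry a `[chromosome=...]` tag as in the example above. The function names and output file names are hypothetical, and mapping a `[location=mitochondrion]` record to `chrmt` is an assumption that should be checked against the names actually used in the GFF.

```shell
# Option 1: rewrite SGD-style headers such as
#   >ref|NC_001133| [org=...] [chromosome=I]
# to >chrI. Sequence lines are left untouched.
# ASSUMPTION: a record tagged [location=mitochondrion] is named chrmt in the GFF.
rename_sgd_headers() {
  sed -E \
    -e 's/^>.*\[chromosome=([^]]+)\].*$/>chr\1/' \
    -e 's/^>.*\[location=mitochondrion\].*$/>chrmt/' \
    "$1"
}

# Option 2: split a combined GFF at the ##FASTA directive.
#   $1 = combined GFF, $2 = annotation output, $3 = sequence output
split_gff_fasta() {
  awk -v ann="$2" -v seq="$3" '
    /^##FASTA/ { in_fasta = 1; next }   # directive itself is dropped
    in_fasta   { print > seq; next }    # everything after it is sequence
               { print > ann }          # everything before it is annotation
  ' "$1"
}
```

For example, `rename_sgd_headers S288C_reference_sequence_R64-2-1_20150113.fsa > genome_chr.fa` would write a renamed copy (the output name is a placeholder), leaving the original file unchanged.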

I generally recommend converting the GFF file into GTF. For instance, you can use gffread from the Cufflinks package to convert GFF to GTF:
$ gffread -T In.gff3 -o Out.gtf
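
After converting, it can help to confirm that the annotation actually contains the feature type STAR will look for in column 3 (`exon` by default, or whatever is passed to --sjdbGTFfeatureExon). A small sketch that tallies feature types, assuming a tab-separated 9-column file; `feature_counts` is a hypothetical helper name, not part of STAR or gffread.

```shell
# Count how often each feature type (column 3) occurs in a GTF/GFF file.
# Comment lines and lines with fewer than 9 tab-separated fields are skipped.
feature_counts() {
  awk -F '\t' '
    !/^#/ && NF >= 9 { n[$3]++ }
    END { for (f in n) print f "\t" n[f] }
  ' "$1" | sort
}
```

If `feature_counts Out.gtf` lists no `exon` rows, STAR run with default settings has nothing to build splice junctions from.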

Cheers
Alex

Aditya Saxena

Jun 19, 2016, 4:51:47 PM
to rna-...@googlegroups.com
Hi Alex,

I am trying to build a STAR genome index for Jaculus, a non-model rodent. I have genome.fa and annotation.gff from NCBI, and generated annotation.gtf with gffread.

The Jaculus genome is unscaffolded, and I get the same error as Kieran when I try to generate the genome index (with either the GFF or the GTF):
'Fatal INPUT FILE error, no valid exon lines in the GTF file'
'Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.'

Since I have no Chromosome info, what can I do to deal with this? Can I still convert .fa file into the chr... format? If so, how?

thanks in advance,

cheers
Adi

Alexander Dobin

Jun 20, 2016, 5:24:49 PM
to rna-star
Hi Adi,

Please post the first few lines of the GTF file, as well as the chromosome names from the FASTA file.

Cheers
Alex

Aditya Saxena

Jun 20, 2016, 5:43:28 PM
to rna-...@googlegroups.com
Hi Alex,

Here's the head of my GTF file:

NC_005314.1 RefSeq exon 1 65 . + . transcript_id "rna26311";
NC_005314.1 RefSeq exon 66 1033 . + . transcript_id "rna26312";
NC_005314.1 RefSeq exon 1034 1101 . + . transcript_id "rna26313";
NC_005314.1 RefSeq exon 1102 2677 . + . transcript_id "rna26314";
NC_005314.1 RefSeq exon 2678 2752 . + . transcript_id "rna26315";
NC_005314.1 RefSeq CDS 2756 3710 . + 0 transcript_id "gene25383"; gene_id "gene25383"; gene_name "ND1";
NC_005314.1 RefSeq exon 3711 3779 . + . transcript_id "rna26316";
NC_005314.1 RefSeq exon 3777 3849 . - . transcript_id "rna26317";
NC_005314.1 RefSeq exon 3850 3918 . + . transcript_id "rna26318";
NC_005314.1 RefSeq CDS 3919 4960 . + 0 transcript_id "gene25384"; gene_id "gene25384"; gene_name "ND2";
NC_005314.1 RefSeq exon 4961 5027 . + . transcript_id "rna26319";
NC_005314.1 RefSeq exon 5031 5099 . - . transcript_id "rna26320";
NC_005314.1 RefSeq exon 5102 5174 . - . transcript_id "rna26321";
NC_005314.1 RefSeq exon 5209 5276 . - . transcript_id "rna26322";
NC_005314.1 RefSeq exon 5281 5348 . - . transcript_id "rna26323";
NC_005314.1 RefSeq CDS 5350 6894 . + 0 transcript_id "gene25385"; gene_id "gene25385"; gene_name "COX1";
NC_005314.1 RefSeq exon 6892 6961 . - . transcript_id "rna26324"; gene_name "COX1";
NC_005314.1 RefSeq exon 6966 7032 . + . transcript_id "rna26325";
NC_005314.1 RefSeq CDS 7034 7717 . + 0 transcript_id "gene25386"; gene_id "gene25386"; gene_name "COX2";
NC_005314.1 RefSeq exon 7723 7787 . + . transcript_id "rna26326";


My genome.fa has no chromosome names, as the genome has only unplaced scaffolds (right?). This is how the genome.fa file starts:

>gi|484394521|ref|NW_004504313.1| Jaculus jaculus isolate JJ0015 unplaced genomic scaffold, JacJac1.0 scaffold00001, whole genome shotgun sequence
GTCTGTGAGGAAATGACCTACGAGGAAATTCAGGCCCATTATCCACTTGAGTTCGCCCTACACGACCAGG
AGAAGTACCGTTACTGGTATCCGAAGGGTGAGTCCTATGAGGACCTGGTCCAGCGACTGGAGCCTGTCAT
CTTGGAATTGGAGAGACAGGAGAACATGCTGGTCATGTGCCACCAGGCTGTGATGCGAGGCCACCTGGCA
CACTTCAAAGACAAGGCAGCAGAACAGCTGGCCTACCTCAAGTGTCCCCTTCACACGGTCCTGAAGCTGA
CCCTTGTGGCTTACGGCTGTAAAGTGAAGTCCATATTCTTGAATGTGGCAGCTGTGAATACACACTGAGA
CAGGCTGCAGAATGTAGACATCTCCAGGCCTCCAGAGGAAGCCGTTGTCACAGTCTCTGCTCACCAGTGA [.......]


Thanks much for your help and reply!


cheers

Adi

Alexander Dobin

Jun 20, 2016, 5:53:37 PM
to rna-star
Hi Adi,

When STAR reads a FASTA file, it uses the whole scaffold name up to the first space character, e.g.
gi|484394521|ref|NW_004504313.1|
However, the GTF file contains names in the N?_* format (e.g. NC_005314.1), hence the error message.
I think the best solution is to convert the names in the FASTA file.
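
A quick way to see such a mismatch before running genomeGenerate is to compare the two sets of names directly. The following is only a sketch: it takes the FASTA name to be everything after `>` up to the first space, as described above, and assumes a tab-separated 9-column annotation. `names_missing_from_fasta` is a hypothetical helper name.

```shell
# Print each sequence name used in column 1 of a GTF/GFF file that has no
# matching record in the FASTA file.
#   $1 = FASTA file, $2 = GTF/GFF file
names_missing_from_fasta() {
  awk -F '\t' '
    FNR == NR {                        # first file: collect FASTA names
      if ($0 ~ /^>/) {
        name = substr($0, 2)           # drop the leading ">"
        sub(/ .*/, "", name)           # keep text up to the first space
        in_fasta[name] = 1
      }
      next
    }
    /^#/ || NF < 9 { next }            # skip comments and non-feature lines
    !($1 in in_fasta) && !($1 in reported) { reported[$1] = 1; print $1 }
  ' "$1" "$2"
}
```

Empty output from `names_missing_from_fasta genome.fa annotation.gtf` means every annotated sequence name has a matching FASTA record; any name that is printed identifies annotation lines that cannot be placed on the genome.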

Also, the GTF file you have is not well formatted. In particular, the "exon" lines do not have "gene_id" tags (you would have to add them),
and CDS lines do not overlap any exon lines, which means CDS and exon lines define different sets of exons (likely you need to replace CDS with exon for all the CDS lines).
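
Both fixes could be sketched with awk as below. This is an illustration under loud assumptions, not a recommended recipe: it reuses each line's transcript_id value as a placeholder gene_id when none is present, and it relabels every CDS line as exon, which is only appropriate when the CDS and exon lines do not describe the same features. The function name is hypothetical.

```shell
# Patch a tab-separated GTF:
#   1. relabel CDS features (column 3) as exon
#   2. append gene_id to lines that have a transcript_id but no gene_id
#      (ASSUMPTION: the transcript_id value is reused as a placeholder gene_id)
add_gene_id_and_exons() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { print; next }
    {
      if ($3 == "CDS") $3 = "exon"
      if ($9 !~ /gene_id "/ && match($9, /transcript_id "[^"]*"/)) {
        tag = substr($9, RSTART, RLENGTH)      # transcript_id "xxx"
        sub(/^transcript_id/, "gene_id", tag)  # gene_id "xxx"
        $9 = $9 " " tag ";"
      }
      print
    }' "$1"
}
```

Usage would be along the lines of `add_gene_id_and_exons annotation.gtf > annotation_patched.gtf`, where both file names are placeholders.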

Cheers
Alex

Aditya Saxena

Jun 21, 2016, 4:12:08 PM
to rna-...@googlegroups.com
Hi Alex,

Many thanks for your reply and suggestions!

Regarding my GFF and GTF files, I noticed that the GTF file produced by the gffread conversion uses the NC_* format, which is not there in the GFF. The original GFF file uses the NW_* format (consistent with the FASTA file) and has all entries associated with a gene_id. I attach the heads of both files.
This makes me wonder whether the conversion messed up the formatting. If so, do you know of a way to fix it, or of tools other than gffread?

Thanks much in advance!
cheers
Adi
Adi_.GFF_head.png
Adi_.GTF_head.png

Alexander Dobin

Jun 21, 2016, 4:26:19 PM
to rna-star
Hi Adi,

I suspect the GFF file has both NW_ and NC_ entries, and so does the converted GTF, but they are sorted differently, so the ones that appear on top are different.
I do not think gffread will rename chromosomes in any way.

Cheers
Alex

Aditya Saxena

Jun 24, 2016, 5:29:54 PM
to rna-...@googlegroups.com
Hi Alex,

Thanks for your suggestions. Renaming the FASTA headers made indexing possible.

I changed the long names in genome.fasta to simple names using a sed command suggested elsewhere:

sed 's/^[^ ]*[|]\([^|]*\)[|] .*$/>\1/' genome.fasta > genome_renamed.fasta 

This command changed entries in the original genome.fasta file from

>gi|484394521|ref|NW_004504313.1| Jaculus jaculus isolate JJ0015 unplaced genomic scaffold, JacJac1.0 scaffold00001, whole genome shotgun sequence

GTCTGTGAGGAAATGACCTACGAGGAAATT...........


to simple headers in the genome_renamed.fasta like so,


>NW_004504313.1

GTCTGTGAGGAAATGACCTACGAGGAAATT...........
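
The renaming step can be wrapped in a small function and followed by a uniqueness check, since collapsing long headers to short names could in principle produce duplicates. This is a sketch only: the sed expression is the one quoted above, while both function names are hypothetical and the duplicate check is an added precaution, not something STAR requires to be run separately.

```shell
# Apply the sed command from this post: keep only the last |-delimited
# field before the first space, e.g.
#   >gi|484394521|ref|NW_004504313.1| Jaculus ...  ->  >NW_004504313.1
simplify_ncbi_headers() {
  sed 's/^[^ ]*[|]\([^|]*\)[|] .*$/>\1/' "$1"
}

# Print any sequence name (header text up to the first space) that occurs
# more than once in a FASTA file. No output means all names are unique.
duplicate_fasta_names() {
  awk '/^>/ { name = substr($0, 2); sub(/ .*/, "", name); print name }' "$1" \
    | sort | uniq -d
}
```

A typical sequence would be `simplify_ncbi_headers genome.fasta > genome_renamed.fasta` followed by `duplicate_fasta_names genome_renamed.fasta`, which should print nothing.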


The genomeGenerate step ran to completion with the genome_renamed.fasta and the genome.gtf (generated from .gff with gffread). I was able to map ~95% of my SR-50 reads to this indexed genome.


Thank you for your advice and help! STAR is awesome :)


cheers

Adi


Alexander Dobin

Jun 29, 2016, 3:40:18 PM
to rna-star
Hi Adi,

Great, thanks a lot for sharing your solution; it will be helpful to other users.

Cheers
Alex