Trouble indexing bovine genome (UMD3.1) with annotation file (ensembl 87)


Sergio PV

Mar 21, 2017, 7:03:41 AM
to rna-...@googlegroups.com
Dear group,
My problem is that while indexing UMD3.1 with Bos_taurus.UMD3.1.87.gtf, I get the error:
"Fatal INPUT FILE error, no valid exon lines in the GTF file: path/to/Bos_taurus.UMD3.1.87.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file."

Indeed, the first line of the GTF file looks like this:
1       ensembl gene    19774   19899   .       -       .       gene_id "ENSBTAG00000046619"; gene_version "1"; gene_name "5S_rRNA"; gene_source "ensembl"; gene_biotype "rRNA";


And for the reference file, the first entry looks like this:
>gnl|UMD3.1|GK000010.2 Chromosome 10 AC_000167.1
GTGATAGCCACGTGATAAATGCATGATCATTTGCATGATCAGTGCATGGGCAGTCAGGTGATCAGTGTAT
...


I think the way to make both files compatible is to place the chromosome number at the beginning of each header in the reference, so it would look like this:
>10 gnl|UMD3.1|GK000010.2 Chromosome 10 AC_000167.1
GTGATAGCCACGTGATAAATGCATGATCATTTGCATGATCAGTGCATGGGCAGTCAGGTGATCAGTGTAT
...
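
A minimal sketch of that header rewrite in Python, assuming every chromosome header follows the `>accession Chromosome N ...` layout of the example above. The names `rename_header` and `rename_fasta` are hypothetical, chosen only for illustration. Headers without a "Chromosome" field are passed through unchanged.

```python
import re

# Assumed header layout, taken from the example in this post:
#   >gnl|UMD3.1|GK000010.2 Chromosome 10 AC_000167.1
CHROM_RE = re.compile(r"^>(\S+)\s+Chromosome\s+(\S+)(.*)$")


def rename_header(line):
    """Return the header with the chromosome name moved to the front."""
    header = line.rstrip("\n")
    match = CHROM_RE.match(header)
    if match is None:
        # No "Chromosome N" field found: leave the header as it is.
        return header
    accession, chrom, rest = match.groups()
    return f">{chrom} {accession} Chromosome {chrom}{rest}"


def rename_fasta(lines):
    """Yield FASTA lines, rewriting header lines and copying sequence lines."""
    for line in lines:
        if line.startswith(">"):
            yield rename_header(line)
        else:
            yield line.rstrip("\n")
```

To use it, iterate over an open FASTA file with `rename_fasta` and write each yielded line, followed by a newline, to a new file.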


I have posted a similar question in Biostars to specifically address the problem of re-ordering the header:

But now I think that the solution is not so simple, because the genome also contains unplaced genomic scaffolds, whose headers look like this:
>gnl|UMD3.1|GJ060418.1 GPS_000344858.1 NW_003101163.1

Here, of course, there is no chromosome ID to match against the entries of the GTF file.


My questions are the following:
- How to make UMD3.1 compatible with its corresponding annotation (Bos_taurus.UMD3.1.87.gtf)?
- Has anyone indexed the Bos taurus genome? Would the solution be to convert the GFF3 file into GTF with Cufflinks? That seems too complicated, as a GTF file is already available.
- Is there another GTF file that will work?


Thank you in advance

Alexander Dobin

Mar 21, 2017, 3:25:26 PM
to rna-star
Hi Sergio,

If you are using the GTF from ENSEMBL, why not use their FASTA as well?
The chromosome and scaffold names are matched between their GTF and FASTA.
My only concern is that I am not sure whether patches and alternative loci are included in the "toplevel" FASTA. For the human genome they also have the "primary assembly" FASTA that excludes patches and alts, which is recommended for RNA-seq mapping.

If you want to keep using your current FASTA file, you do indeed need to make sure that the chromosome/scaffold names match between the GTF and FASTA files.
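
One way to check this before building the index is to compare the sequence names in both files. The sketch below is only an illustration, with hypothetical function names. It assumes the sequence name is the FASTA header text after `>` up to the first whitespace (as far as I know, this is how STAR derives chromosome names) and that the first tab-separated GTF column holds the sequence name.

```python
def fasta_names(lines):
    """Collect sequence names: header text after '>' up to the first whitespace."""
    names = set()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


def gtf_names(lines):
    """Collect sequence names from the first tab-separated column of a GTF."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def compare(fasta_lines, gtf_lines):
    """Return (names found in both files, GTF names missing from the FASTA)."""
    fa = fasta_names(fasta_lines)
    gtf = gtf_names(gtf_lines)
    return sorted(fa & gtf), sorted(gtf - fa)
```

If the first list comes back empty, no GTF line can be assigned to any sequence in the FASTA, which matches the "no valid exon lines" error in the original post.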

Cheers
Alex

Sergio PV

Mar 22, 2017, 2:00:19 PM
to rna-star
Thank you, Alex. Your suggestion allowed STAR to build the index using the GTF file and the toplevel genome assembly.