indexing GTF/GFF3 files

Todd Creasy

Feb 15, 2012, 2:14:56 PM
to igv-...@googlegroups.com
I've heard that there is a way to index large GTF/GFF3 files so they load into IGV quickly. Right now, anything over 10 MB loads too slowly, sometimes even timing out. Suggestions?

Thanks,

todd

James Robinson

Feb 15, 2012, 2:48:31 PM
to igv-...@googlegroups.com
Hi, you can index them with igvtools, but unfortunately the gene model (parent/child relations) will not be preserved, so this is only useful for independent GFF features. The best workaround at the moment is to convert the file to BED, then index the BED file with igvtools or tabix. I found a GFF3-to-BED converter on a Galaxy server somewhere that works really well; sorry, I didn't note the location, but Google for it and you should find it. Unfortunately, you will lose your attributes from column 9.

We'll update the group here when full GFF3 indexing is supported, but it's a beast of a format and that won't happen in the immediate future. If you can create BED files, I suggest that.

Jim
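
For reference, a minimal sketch of the GFF3-to-BED conversion described above. It is not the Galaxy converter mentioned in the message; it simply flattens each feature line into BED6, so parent/child links and most of column 9 are lost, as warned. The file names and the sample records are hypothetical, and using the ID attribute as the BED name (with a fixed score of 0) is an assumption made for illustration.

```shell
#!/bin/sh
set -eu

# Tiny hypothetical GFF3 input, for illustration only.
printf '##gff-version 3\nchr1\tsrc\texon\t200\t300\t.\t-\t.\tID=exon2\nchr1\tsrc\texon\t100\t150\t.\t+\t.\tID=exon1\n' > example.gff3

# GFF3 coordinates are 1-based and inclusive; BED is 0-based and half-open,
# so only the start shifts by one. The name falls back to the feature type
# when no ID attribute is present.
awk 'BEGIN { FS = OFS = "\t" }
     /^#/ || NF < 9 { next }
     {
       name = $3
       if (match($9, /ID=[^;]+/)) name = substr($9, RSTART + 3, RLENGTH - 3)
       print $1, $4 - 1, $5, name, 0, $7
     }' example.gff3 |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n > example.bed

cat example.bed
```

The resulting BED file is sorted by sequence name and start, which is the order igvtools and tabix expect before indexing.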

Todd Creasy

Feb 15, 2012, 2:52:37 PM
to igv-...@googlegroups.com
Do you have a recommendation for what you do when you have a human Ensembl GTF file?  Right now, the best thing I tend to do is just pull out the gene features but obviously that doesn't show exon/intron boundaries.

James Robinson

Feb 15, 2012, 3:01:46 PM
to igv-...@googlegroups.com
I infer that this GTF file is very large? Normally human gene annotations are manageable. Until we can properly index them I suggest converting to a BED file. I think there must be a GTF-to-BED converter available; if not, we can possibly provide one as an interim solution.

Jim Robinson

Feb 22, 2012, 11:31:06 PM
to igv-...@googlegroups.com
Todd,
I still have this on my to-do list, but in the meantime I found a workaround that works well while trying to deal with the tomato genome. I used the "gff3ToGenePred" converter available from UCSC. This outputs a file that is almost in the "genePred" format described on their site. You can load the file into IGV by doing the following: (1) add a column at the beginning of each row (normally this would be the database ID), and (2) name the file so that it ends with ".genepred". The value in the first column can be anything; it's ignored. You might find the resulting file is small enough to load without any indexing; if it's not, you can index it with "tabix".

Jim
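
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not a tested recipe for any particular genome: the converter call is shown only as a comment (the UCSC utility takes an input GFF3 and an output file), the file names are hypothetical, and the single sample row is a simplified stand-in with only the ten basic genePred columns.

```shell
#!/bin/sh
set -eu

# Step 0 (not run here): produce the genePred-like file with the UCSC utility, e.g.
#   gff3ToGenePred annotations.gff3 annotations.gp
# The row below is a hypothetical, simplified stand-in for one line of that output.
printf 'tx1\tchr1\t+\t99\t300\t99\t300\t2\t99,199,\t150,300,\n' > annotations.gp

# (1) Prepend a placeholder first column (its value is ignored on load) and
# (2) write the result to a file whose name ends in ".genepred".
awk 'BEGIN { FS = OFS = "\t" } { print "0", $0 }' annotations.gp > annotations.genepred

cat annotations.genepred
```

Every row keeps its original columns, shifted one position to the right.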

Todd Creasy

Feb 23, 2012, 8:25:44 PM
to igv-...@googlegroups.com
Hey Jim,

I ended up doing that myself. Sorry I didn't get back to you about it. It looks like a bigBed is more efficient too. Regardless, large GTF/GFF3 files like Ensembl's human GTF are going to be too large to load with a reasonable amount of resources.

Thanks for helping me through this!

todd

Erwan SCAON

Nov 10, 2017, 10:01:28 AM
to igv-help
Hi there.

Did things change regarding GTF indexing?
I am using the GENCODE M15 GTF to annotate the mm10 reference genome (it's much more complete than RefSeq, especially for the IGH locus).
It works fine, but each time I load the GTF in IGV I get the following messages:

An index file for /home/erwann/Desktop/PTCB/BISCEm/Data/IGV_annotation_tracks/gencode.vM15.annotation.gtf could not be located. An index is recommended to view files of this size.   Click "Go" to create one now or "Cancel to proceed without an index.

Files must be sorted by start position prior to indexing. Input file is not sorted by start position. We saw a record with a start of chr1:3205900 after a record with a start of chr1:3213608, for input source: /home/erwann/Desktop/PTCB/BISCEm/Data/IGV_annotation_tracks/gencode.vM15.annotation.gtf  Note: igvtools can be used to sort the file, select "File > Run igvtools...".

And it takes IGV some time to process things each time. I guess it would be faster if the GTF were indexed.
Any tips?

Best regards
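
The sort-order complaint in the second message can be reproduced outside IGV with a quick check. The sketch below is not IGV's own logic; it only reports the first record whose start (column 4) is smaller than the previous start on the same sequence. The function name `check_sorted`, the file name, and the sample records are hypothetical.

```shell
#!/bin/sh
set -eu

# Tiny hypothetical GTF with two records out of order, for illustration only.
printf 'chr1\tsrc\texon\t200\t300\t.\t-\t.\tgene_id "g1";\nchr1\tsrc\tgene\t100\t900\t.\t-\t.\tgene_id "g1";\n' > unsorted.gtf

# Exit status 0 if starts never decrease within a sequence, 1 otherwise.
check_sorted() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }
       $1 == seq && $4 + 0 < prev {
         printf "line %d: %s:%d follows %s:%d\n", NR, $1, $4, seq, prev
         bad = 1
         exit
       }
       { seq = $1; prev = $4 + 0 }
       END { exit bad }' "$1"
}

if check_sorted unsorted.gtf; then
  echo "sorted by start position"
else
  echo "not sorted by start position"
fi
```

A file that fails this check needs sorting before any index can be built.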

James Robinson

Nov 10, 2017, 10:45:21 PM
to igv-help
Hi,

Could you elaborate on what appears to have changed? Also, which version of IGV are you using?

I don't know the size of your GTF, but yes, it would likely be much faster if it were indexed. You can index with igvtools or tabix (part of samtools). Tabix is recommended.

Jim
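
A minimal sketch of the tabix route, assuming bgzip and tabix are installed (in current releases both ship with HTSlib). The sort keeps header lines first and orders records by sequence name and numeric start, which is what the index requires. The file names and the sample records are hypothetical, and the compress-and-index step is skipped when the tools are not found.

```shell
#!/bin/sh
set -eu

# Tiny hypothetical GTF with records out of order, for illustration only.
printf '#header\nchr1\tsrc\texon\t2000\t2500\t.\t-\t.\tgene_id "g1";\nchr1\tsrc\tgene\t1000\t9000\t.\t-\t.\tgene_id "g1";\n' > example.gtf

# Header lines first, then records sorted by sequence name and start (column 4).
{
  grep '^#' example.gtf || true
  grep -v '^#' example.gtf | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k4,4n
} > example.sorted.gtf

# Compress with bgzip and build the index only if both tools are available.
if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
  bgzip -c example.sorted.gtf > example.sorted.gtf.gz
  tabix -f -p gff example.sorted.gtf.gz   # writes example.sorted.gtf.gz.tbi
fi
```

Load the `.gtf.gz` file in IGV and keep the `.tbi` file next to it in the same directory.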

Elizabeth King

May 17, 2023, 2:23:10 AM
to igv-help
Just in case anyone is still looking into this, one workaround is to sort the file first:
bedtools sort -i genes.gtf > genes.sorted.gtf
Then, when you load the sorted GTF via File --> Load From File, the index gets autogenerated.