Question about GTF files


CHAVAN, SHWETA

Jan 8, 2013, 3:34:40 PM
to gen...@soe.ucsc.edu

Dear UCSC team,

I have a question about genome reference files with respect to next-generation sequencing. We perform RNA-Seq experiments and align the reads to three reference genomes, from Ensembl, NCBI, and UCSC (hg19 / GRCh37 build).

1. What is the difference between the three GTF files obtained from the three respective sources (Ensembl, NCBI, and UCSC)?

    1. For a particular gene A, would the start and stop coordinates differ among the three GTFs, or are they consistent across all GTFs that correspond to a particular genome assembly build (e.g. hg19 / GRCh37)? In other words, do all GTF files, irrespective of source, have the same coordinates as long as they originate from the same human genome assembly build version?
    2. In addition, how different is the GTF from the GENCODE website, http://www.gencodegenes.org/ (file name gencode.v14.annotation.gtf, corresponding to GRCh37)? And how does one decide which of these GTF files to choose?

 

Thanks,

Shweta

 

Shweta S. Chavan

Bioinformatics NGS Post-doc

Myeloma Institute for Research and Therapy

University of Arkansas Medical Sciences

Little Rock, AR.

sch...@uams.edu

 

Confidentiality Notice: This e-mail message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
confidential and privileged information.  Any unauthorized review,
use, disclosure or distribution is prohibited.  If you are not the
intended recipient, please contact the sender by reply
e-mail and destroy all copies of the original message.

Brooke Rhead

Jan 8, 2013, 9:24:28 PM
to CHAVAN, SHWETA, gen...@soe.ucsc.edu
Hello Shweta,

Can you clarify what GTF files at UCSC you are referring to? Most of
our files are not stored in GTF format, but it is possible to export
gene annotation tracks from the Table Browser in GTF format. How did
you retrieve GTF files from UCSC? Which track are the GTF files from?

To be clear, by "GTF files," I am referring to annotation files (not
sequence files) that are in this format: http://mblab.wustl.edu/GTF2.html.

Regarding your second question about GENCODE Genes, the gene sets that
have the particular version number should be the same, with the same
coordinates, regardless of the source of the download, with the
exception of coordinates on the mitochondrial chromosome on hg19/GRCh37
at UCSC. See the "Note on chrM" on the hg19 gateway page,
http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19, and the "NOTE" on the
GENCODE Gene track description page,
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeSuper.
Additionally, there are often newer versions of GENCODE Gene sets
available from http://www.gencodegenes.org/ that have not yet made their
way into the UCSC Genome Browser.

To decide which gene set would be most useful for your purposes, you can
read about the different GENCODE Gene sets (basic, comprehensive, etc.)
on the track description page. The latest set that we are displaying
(version 12) is described here:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeV12

I hope this information is helpful. Please reply to gen...@soe.ucsc.edu
with follow-up comments and questions.

--
Brooke Rhead
UCSC Genome Bioinformatics Group



CHAVAN, SHWETA

Jan 9, 2013, 5:52:31 PM
to Brooke Rhead, gen...@soe.ucsc.edu
Hello Brooke,

Thank you for your response and the information that the mitochondrial co-ordinates would change across the 3 different sources of gene annotations (UCSC/Ensembl/NCBI).

So just to clarify, below are the GTF files that we use:

UCSC (So I did not retrieve them from UCSC Table browser)
ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg18/database/refFlat.txt.gz
NCBI
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/mapview/seq_gene.md.gz
ENSEMBL
ftp://ftp.ensembl.org/pub/release-69/gtf/homo_sapiens/Homo_sapiens.GRCh37.69.gtf.gz
GENCODE
http://www.gencodegenes.org/ gencode.v14.annotation.gtf

We got these files from the above sources.

1. Sorry for any confusion. Just to clarify and repeat my question: how different are these 4 GTFs from different sources? Specifically, for a given gene, e.g. BRCA1, will the start and stop coordinates of the different features (genes/exons/transcripts) differ between the GTFs, or be the same as long as they come from the same build (i.e. all the above GTFs correspond to the hg19/GRCh37 assembly)? You mentioned that the coordinates would change for the mitochondrial chromosome; does that imply that for the other chromosomes, and specifically for the gene features, they won't change?

2. From your email: "Regarding your second question about GENCODE Genes, the gene sets that have the particular version number should be the same, with the same coordinates".
Could you please give an example of "gene sets" (gene sets as reported in the GTF file?) and "version" (of the assembly, e.g. hg19 or hg18?), just to be sure that I understand correctly?

3. This question might be naïve, as I am new to the field of genome assembly and annotation.
About the GENCODE project (part of ENCODE), I read that it provides a set of highly accurate, evidence-based annotations of gene features on the human reference genome, and that it includes all protein-coding loci with their associated alternative splice variants, as well as non-coding loci with transcript evidence. How does the GTF from GENCODE differ from the one UCSC provides for the same reference genome assembly? Does it mean that the GTF from UCSC is a more comprehensive one that includes evidence-based as well as prediction-based transcripts?

Thank you and I appreciate your help.

Thanks
Shweta



Brooke Rhead

Jan 10, 2013, 12:00:27 AM
to CHAVAN, SHWETA, gen...@soe.ucsc.edu
Hi Shweta,

Thank you for sending links to your data sources. I apologize in
advance for this long-winded response, but I think it answers your
questions.

First, the source you list for UCSC,
ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg18/database/refFlat.txt.gz,
contains coordinates of gene annotations, but it is not actually a "GTF
file." GTF is a very specific file format, among many. See
http://genome.ucsc.edu/FAQ/FAQformat.html for a list of file formats we
support in the Genome Browser. Most of our gene annotations happen to
be stored in what we call "genePred" format. Both GTF and genePred are
ways of storing gene names and coordinates, and they can both contain
the same information, but we do not refer to genePred files as GTF files.
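
To make the distinction concrete, here is a minimal sketch of how one refFlat row (genePred-style columns) could be rewritten as GTF exon lines. It assumes the usual refFlat column order (geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds) and that UCSC table starts are 0-based while GTF starts are 1-based. The function name, the choice of the gene name as gene_id, and the example row are illustrative assumptions, not a real gene or a UCSC tool.

```python
def refflat_to_gtf_exons(line, source="refFlat"):
    """Convert one refFlat (genePred-style) row into GTF exon lines."""
    fields = line.rstrip("\n").split("\t")
    gene_name, tx_name, chrom, strand = fields[0:4]
    # exonStarts / exonEnds are comma-separated lists with a trailing comma.
    exon_starts = [int(x) for x in fields[9].split(",") if x]
    exon_ends = [int(x) for x in fields[10].split(",") if x]

    gtf_lines = []
    for start, end in zip(exon_starts, exon_ends):
        # Assumption: use the gene name as gene_id; other conventions exist.
        attributes = f'gene_id "{gene_name}"; transcript_id "{tx_name}";'
        gtf_lines.append("\t".join([
            chrom, source, "exon",
            str(start + 1),  # 0-based start becomes a 1-based start
            str(end),        # half-open end equals the inclusive 1-based end
            ".", strand, ".", attributes,
        ]))
    return gtf_lines


# Made-up row for illustration only (not a real gene or accession).
example_row = (
    "GENEA\tNM_000000\tchr1\t+\t999\t2000\t1099\t1900\t2\t"
    "999,1499,\t1200,2000,"
)
for gtf_line in refflat_to_gtf_exons(example_row):
    print(gtf_line)
```

The same exons appear in both representations; only the layout and the start-coordinate convention change. This is why a refFlat file renamed with a GTF extension still is not a GTF file.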

The answer to your first question is that the coordinates in these
files may not be the same at all, with the exception of the Ensembl and
GENCODE sets (which I'll explain later). In general, there are many
methods for determining where genes are located in the reference genome
assembly, and there is no one "true" set of gene predictions. See
http://en.wikipedia.org/wiki/Gene_prediction for a brief overview of
what is involved in predicting gene locations. We display several
different gene sets from different sources in the Genome Browser. To
get a sense of what is available from UCSC, go to our GRCh37/hg19
browser (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19) and scroll
down to the "Genes and Gene Prediction Tracks" section. Click on the
blue track name to see a description of how each track was made (not
every track in this section is a gene prediction track, but you could
start by comparing UCSC Genes, GENCODE, CCDS, RefSeq Genes, and Ensembl
Genes, for instance). Each of these gene sets uses a different method
for predicting genes.
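
One way to see such differences directly is to compute the overall span of each gene in two annotation files and list the genes whose spans disagree. The following is a rough sketch only. It assumes both files are real GTF, carry a gene_name attribute (not every GTF does), and use the same chromosome naming. The helper names and the example coordinates are invented for illustration.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def gene_spans(gtf_lines):
    """Return {gene_name: (chrom, min_start, max_end)} from GTF exon lines."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if fields[2] != "exon":
            continue
        match = GENE_NAME.search(fields[8])
        if match is None:
            continue  # no gene_name attribute on this line
        name = match.group(1)
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        if name in spans:
            _, old_start, old_end = spans[name]
            start, end = min(start, old_start), max(end, old_end)
        spans[name] = (chrom, start, end)
    return spans


def differing_genes(spans_a, spans_b):
    """Genes present in both sets whose spans are not identical."""
    shared = spans_a.keys() & spans_b.keys()
    return {g: (spans_a[g], spans_b[g])
            for g in sorted(shared) if spans_a[g] != spans_b[g]}


def exon(source, start, end, gene):
    """Build one made-up GTF exon line on chr1 for the demonstration."""
    attrs = f'gene_id "{gene}"; transcript_id "{gene}.1"; gene_name "{gene}";'
    return "\t".join(["chr1", source, "exon", str(start), str(end),
                      ".", "+", ".", attrs])


# Hypothetical coordinates: the two sources disagree on the first exon of GENEA.
set_a = [exon("sourceA", 1000, 1200, "GENEA"),
         exon("sourceA", 1500, 2000, "GENEA"),
         exon("sourceA", 5000, 5500, "GENEB")]
set_b = [exon("sourceB", 900, 1200, "GENEA"),
         exon("sourceB", 1500, 2000, "GENEA"),
         exon("sourceB", 5000, 5500, "GENEB")]

differences = differing_genes(gene_spans(set_a), gene_spans(set_b))
for gene, (span_a, span_b) in differences.items():
    print(gene, span_a, span_b)
```

Comparing whole-gene spans hides differences in individual exons and transcripts, so a real comparison would probably go down to the transcript level.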

If you are completely unfamiliar with the Genome Browser, you might want
to first check out some of the tutorials linked to from this page:

http://genome.ucsc.edu/training.html

Specifically, the Open Helix videos might be useful. Go to:

http://www.openhelix.com/ucsc -> hit "Launch online tutorial" and go to
"Understanding Displays"

Finally, the "Gene structure and transcripts" section of this paper
addresses some of the differences that exist among gene sets:

Cline MS, Kent WJ. Understanding genome browsing. Nat Biotechnol. 2009
Feb;27(2):153-5.
http://www.nature.com/nbt/journal/v27/n2/full/nbt0209-153.html

In answer to your question of how different the files are from the
various sources in the links you list: the UCSC link corresponds to the
"RefSeq Genes" track, so I would expect it to be very different from the
GENCODE gene set. I'm not sure what the data from NCBI represents. The
data from your Ensembl link will correspond to the "Ensembl Genes" track
in the UCSC Genome Browser. Finally, the GENCODE link corresponds to
the "GENCODE Genes" track in the UCSC Genome Browser.

I point out one perhaps surprising detail here, though, which is also
noted on the GENCODE track description page: "As of GENCODE Version 11,
Ensembl and GENCODE have converged. The gene annotations in the GENCODE
comprehensive set are the same as the corresponding Ensembl release.
UCSC will continue to provide a separate Ensembl track on Human in the
same format as the Ensembl tracks on other organisms."

The answer to your question 2 is that there are different versions of
the assembly (hg19, hg18, etc.), AND different versions of the GENCODE
gene set, and both need to be the same if you are comparing data from
different sources and expect the coordinates to be the same. If you
have the same version of GENCODE Genes coordinates from the same
assembly from two different sources, the coordinates should be the same,
with the exception of the mitochondrial chromosome coordinates if the
data came from UCSC. The most recent gene set available from
gencodegenes.org is GENCODE gene set version 14. We are in the process
of making GENCODE version 14 available on the hg19 assembly genome
browser, but it is not there yet. Currently we display GENCODE version
12. (Hopefully this answers question 3 as well.)
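
As a sketch of the check described above, the snippet below compares transcript coordinates of the same gene-set version taken from two sources, after putting chromosome names in one style and setting the mitochondrial chromosome aside. It assumes Ensembl-style names ("1", "MT") versus UCSC-style names ("chr1", "chrM") and only handles the primary chromosomes; unplaced scaffolds are named differently and are not covered. The record tuples and transcript IDs are invented for illustration.

```python
def normalize_chrom(chrom):
    """Map Ensembl-style names ("1", "MT") onto UCSC-style ("chr1", "chrM")."""
    if chrom.startswith("chr"):
        return chrom
    return "chrM" if chrom == "MT" else "chr" + chrom


def comparable_records(records):
    """Normalize chromosome names and drop mitochondrial records."""
    kept = set()
    for transcript_id, chrom, start, end in records:
        chrom = normalize_chrom(chrom)
        if chrom == "chrM":
            continue  # chrM coordinates are expected to differ at UCSC on hg19
        kept.add((transcript_id, chrom, start, end))
    return kept


# Hypothetical records: (transcript_id, chrom, start, end).
from_ucsc = [("T1", "chr1", 1000, 2000), ("T2", "chrM", 100, 500)]
from_ensembl = [("T1", "1", 1000, 2000), ("T2", "MT", 101, 501)]

same = comparable_records(from_ucsc) == comparable_records(from_ensembl)
print("coordinates agree outside chrM:", same)
```

If the assembly or the gene-set version differs between the two sources, a check like this would be expected to fail, which is the point made above.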

If you have further questions, please contact us again at
gen...@soe.ucsc.edu.

--
Brooke Rhead
UCSC Genome Bioinformatics Group




CHAVAN, SHWETA

Jan 18, 2013, 2:50:06 PM
to Brooke Rhead, gen...@soe.ucsc.edu
Dear Dr. Rhead,

Thank you for the detailed explanations.

Actually, we download the reference genomes and their corresponding GTFs from the Illumina iGenomes FTP site, simply because it is convenient to download all 3 sets of reference files (Ensembl, NCBI, and UCSC) from a single site. So I asked Illumina Tech Support where they got the GTF files from, and the links I sent are the ones they pointed me to. I tried opening the links they sent me and found that the files seemed to be in GTF format. So apparently Illumina downloads the files from the respective websites and gives them a GTF extension.

Thank you very much for the other links and the information that the GENCODE and Ensembl GTFs might indeed be the same.

Regards,

Brooke Rhead

Jan 22, 2013, 5:19:48 PM
to CHAVAN, SHWETA, Gen...@soe.ucsc.edu
Hi Shweta,

We would be interested in following up with Illumina about distributing
the gene files from UCSC with the GTF extension, so that we can prevent
any future confusion. Do you happen to have the name/email of the
person you talked to from Illumina Tech Support? You can send it to
just me if you prefer (and don't cc the mailing list address).

Thank you!

--
Brooke Rhead
UCSC Genome Bioinformatics Group

