Hi Shweta,
Thank you for sending links to your data sources. I apologize in
advance for this long-winded response, but I think it answers your
questions.
First, the source you list for UCSC,
ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg18/database/refFlat.txt.gz,
contains coordinates of gene annotations, but it is not actually a "GTF
file." GTF is a very specific file format, among many. See
http://genome.ucsc.edu/FAQ/FAQformat.html for a list of file formats we
support in the Genome Browser. Most of our gene annotations happen to
be stored in what we call "genePred" format. Both GTF and genePred are
ways of storing gene names and coordinates, and they can both contain
the same information, but they are different formats, and we do not
refer to a genePred file as a "GTF file."
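To make the format difference concrete, here is a minimal Python sketch.
It assumes the refFlat column order given in the table schema (geneName,
name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount,
exonStarts, exonEnds). genePred-style tables store 0-based, half-open
coordinates, while GTF uses 1-based, inclusive coordinates, so the same
exon has a start that differs by one between the two formats. The
function names and the example line are made up for illustration; they
are not real annotations or part of any UCSC tool.

```python
from typing import List, NamedTuple, Tuple


class RefFlatRecord(NamedTuple):
    gene_name: str
    name: str
    chrom: str
    strand: str
    tx_start: int  # 0-based
    tx_end: int    # end-exclusive
    cds_start: int
    cds_end: int
    exons: List[Tuple[int, int]]  # (start, end), 0-based half-open


def parse_refflat_line(line: str) -> RefFlatRecord:
    """Parse one tab-separated refFlat line into a record."""
    f = line.rstrip("\n").split("\t")
    # exonStarts and exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in f[9].split(",") if x]
    ends = [int(x) for x in f[10].split(",") if x]
    return RefFlatRecord(
        f[0], f[1], f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]),
        list(zip(starts, ends)),
    )


def exons_as_gtf_coords(rec: RefFlatRecord) -> List[Tuple[int, int]]:
    """Convert 0-based half-open exons to 1-based inclusive (GTF-style)."""
    return [(start + 1, end) for start, end in rec.exons]


# Hypothetical line, not a real annotation.
EXAMPLE_LINE = (
    "GENE1\tNM_000000\tchr1\t+\t1000\t5000\t1200\t4800\t2\t"
    "1000,4000,\t2000,5000,"
)

record = parse_refflat_line(EXAMPLE_LINE)
gtf_exons = exons_as_gtf_coords(record)
print(record.exons)  # [(1000, 2000), (4000, 5000)]
print(gtf_exons)     # [(1001, 2000), (4001, 5000)]
```

If you compare a genePred-style file against a GTF file without this
conversion, every start coordinate will appear to be off by one even
when the annotations are identical.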
The answer to your first question is that the coordinates in these
files may not be the same at all, with the exception of the Ensembl and
GENCODE sets (which I'll explain later). In general, there are many
methods for determining where genes are located in the reference genome
assembly, and there is no one "true" set of gene predictions. See
http://en.wikipedia.org/wiki/Gene_prediction for a brief overview of
what is involved in predicting gene locations. We display several
different gene sets from different sources in the Genome Browser. To
get a sense of what is available from UCSC, go to our GRCh37/hg19
browser (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19) and scroll
down to the "Genes and Gene Prediction Tracks" section. Click on the
blue track name to see a description of how each track was made (not
every track in this section is a gene prediction track, but you could
start by comparing UCSC Genes, GENCODE, CCDS, RefSeq Genes, and Ensembl
Genes, for instance). Each of these gene sets uses a different method
for predicting genes.
If you are completely unfamiliar with the Genome Browser, you might want
to first check out some of the tutorials linked to from this page:
http://genome.ucsc.edu/training.html
Specifically, the Open Helix videos might be useful. Go to:
http://www.openhelix.com/ucsc -> hit "Launch online tutorial" and go to
"Understanding Displays"
Finally, the "Gene structure and transcripts" section of this paper
addresses some of the differences that exist among gene sets:
Cline MS, Kent WJ. Understanding genome browsing. Nat Biotechnol. 2009
Feb;27(2):153-5.
http://www.nature.com/nbt/journal/v27/n2/full/nbt0209-153.html
In answer to your question of how different the files are from the
various sources in the links you list: the UCSC link corresponds to the
"RefSeq Genes" track, so I would expect it to be very different from the
GENCODE gene set. I'm not sure what the data from NCBI represents. The
data from your Ensembl link will correspond to the "Ensembl Genes" track
in the UCSC Genome Browser. Finally, the GENCODE link corresponds to
the "GENCODE Genes" track in the UCSC Genome Browser.
I should point out one perhaps surprising detail here, which is also
noted on the GENCODE track description page: "As of GENCODE Version 11,
Ensembl and GENCODE have converged. The gene annotations in the GENCODE
comprehensive set are the same as the corresponding Ensembl release.
UCSC will continue to provide a separate Ensembl track on Human in the
same format as the Ensembl tracks on other organisms."
The answer to your question 2 is that there are different versions of
the assembly (hg19, hg18, etc.), AND different versions of the GENCODE
gene set, and both need to be the same if you are comparing data from
different sources and expect the coordinates to be the same. If you
have the same version of GENCODE Genes coordinates from the same
assembly from two different sources, the coordinates should be the same,
with the exception of the mitochondrial chromosome coordinates if the
data came from UCSC. The most recent gene set available from
gencodegenes.org is GENCODE gene set version 14. We are in the process
of making GENCODE version 14 available on the hg19 assembly genome
browser, but it is not there yet. Currently we display GENCODE version
12. (Hopefully this answers question 3 as well.)
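As a rough illustration of that comparison, here is a minimal Python
sketch. It assumes you have already loaded each source into a dictionary
keyed by transcript ID, that both sources use the same assembly, the
same GENCODE version, the same coordinate convention, and the same
chromosome naming. The function name, the transcript IDs, and the
coordinates are hypothetical. Mitochondrial entries are skipped because
of the exception noted above.

```python
from typing import Dict, FrozenSet, Set, Tuple

Coord = Tuple[str, int, int]  # (chrom, start, end)


def differing_transcripts(
    source_a: Dict[str, Coord],
    source_b: Dict[str, Coord],
    skip_chroms: FrozenSet[str] = frozenset({"chrM", "MT"}),
) -> Set[str]:
    """Return IDs present in both sources whose coordinates differ."""
    diffs = set()
    for tid in source_a.keys() & source_b.keys():
        a, b = source_a[tid], source_b[tid]
        # Skip mitochondrial entries, which may legitimately differ.
        if a[0] in skip_chroms or b[0] in skip_chroms:
            continue
        if a != b:
            diffs.add(tid)
    return diffs


# Hypothetical data for illustration only.
from_ucsc = {
    "tx1": ("chr1", 1000, 5000),
    "tx2": ("chr2", 200, 900),
    "tx3": ("chrM", 10, 500),
}
from_other = {
    "tx1": ("chr1", 1000, 5000),
    "tx2": ("chr2", 250, 900),
    "tx3": ("chrM", 12, 502),
}

result = differing_transcripts(from_ucsc, from_other)
print(result)  # {'tx2'}
```

If many transcripts show up as different, the first things to check are
whether the two files really come from the same assembly and the same
GENCODE version.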
If you have further questions, please contact us again at
gen...@soe.ucsc.edu.
--
Brooke Rhead
UCSC Genome Bioinformatics Group
> 3. This question might be naïve, as I am new to the field of genome