Two ways of downloading gene annotation files? (refGene)

Jackie Jia Zhou

May 15, 2013, 1:33:17 PM
to gen...@lists.soe.ucsc.edu
Hi,

I am trying to download gene annotation files for mm9 from genome.ucsc.edu.
I thought there might be two ways of getting the annotation files: (1) go to the 'downloads' page for mm9 and download 'refGene.txt' from there; (2) go to the 'Table' page and select the correct 'genome' and 'assembly', then select 'Gene and Gene prediction Tracks' --> 'RefSeq Genes' --> 'refGene', click on 'get output', and select 'whole Gene' to get the .bed file for genes.

However, the files I get from these two methods are very different. The total number of entries is the same, but the start and end coordinates of each entry differ between the files.

I wonder which file I can trust more, and why is there such a difference in the start and end coordinates?

Thank you,

Jackie Zhou
PhD Candidate 
Division of Biology & Biological Medical Sciences
Washington University in St. Louis, School of Medicine


Brooke Rhead

May 15, 2013, 5:55:25 PM
to Jackie Jia Zhou, gen...@lists.soe.ucsc.edu
Hi Jackie,

The file you are getting from our downloads page is in genePred format.
You can see the format of the table by hitting the "describe table
schema" button in the Table Browser when you have the refGene table
selected.

If you choose the output format "all fields from selected table" in the
Table Browser, you should get results in genePred format, just like you
see it on the downloads page. (Note however that we update the table
available via the Table Browser daily, while we update the file
available via the downloads page only on the weekends, so it is possible
to see a small number of differences between the two files.)
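
To check how much two downloads of the same table really differ, something like the following rough Python sketch could help. It is only an illustration, not a UCSC tool: the function names and the example rows are made up, the rows are truncated to the first six columns (bin, name, chrom, strand, txStart, txEnd), and it assumes any header line starts with "#".

```python
def record_keys(lines):
    """Collect (name, chrom, txStart, txEnd) from refGene rows with a leading bin column."""
    keys = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines (assumed to start with '#')
        cols = line.rstrip("\n").split("\t")
        keys.add((cols[1], cols[2], int(cols[4]), int(cols[5])))
    return keys


def diff_records(a_lines, b_lines):
    """Return the records found only in a_lines and only in b_lines."""
    a, b = record_keys(a_lines), record_keys(b_lines)
    return sorted(a - b), sorted(b - a)


# Hypothetical rows for illustration only (not real annotations).
table_browser = [
    "#bin\tname\tchrom\tstrand\ttxStart\ttxEnd",
    "1\tNM_EXAMPLE1\tchr1\t+\t1000\t5000",
    "1\tNM_EXAMPLE2\tchr2\t-\t7000\t9000",
]
downloads_file = [
    "1\tNM_EXAMPLE1\tchr1\t+\t1000\t5000",
]

only_tb, only_dl = diff_records(table_browser, downloads_file)
print(only_tb)
print(only_dl)
```

With real files, the two arguments could simply be open file handles, since the functions only iterate over lines.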

It sounds like you were using the BED output format option in the Table
Browser. Many table types can be converted to BED format in the Table
Browser, including genePred. BED format is described here:

http://genome.ucsc.edu/FAQ/FAQformat.html#format1

The "blockSizes" and "blockStarts" fields in BED format look similar to
the "exonStarts" and "exonEnds" fields in genePred format, but the
fields in BED format are not chromosomal positions, as they are in
genePred format.
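
To make the relationship concrete, here is a minimal Python sketch of how the BED block fields can be derived from the genePred exon fields: each block size is exonEnd minus exonStart, and each block start is the exon start relative to the transcript start. The function name and the example record are hypothetical, and the sketch assumes the leading "bin" column found in refGene.txt has already been dropped.

```python
def genepred_to_bed12(fields):
    """Convert one genePred record (columns starting at 'name') to BED12 fields."""
    name, chrom, strand = fields[0], fields[1], fields[2]
    tx_start, tx_end = int(fields[3]), int(fields[4])
    cds_start, cds_end = int(fields[5]), int(fields[6])
    # exonStarts/exonEnds are comma-separated chromosomal positions with a trailing comma
    exon_starts = [int(x) for x in fields[8].rstrip(",").split(",")]
    exon_ends = [int(x) for x in fields[9].rstrip(",").split(",")]
    # BED blocks are stored as sizes plus offsets from chromStart
    block_sizes = [end - start for start, end in zip(exon_starts, exon_ends)]
    block_starts = [start - tx_start for start in exon_starts]
    return [
        chrom, tx_start, tx_end, name, 0, strand,
        cds_start, cds_end, "0", len(block_sizes),
        ",".join(map(str, block_sizes)) + ",",
        ",".join(map(str, block_starts)) + ",",
    ]


# Hypothetical record for illustration only (not a real transcript).
example = ["NM_EXAMPLE", "chr1", "+", "1000", "5000", "1200", "4800", "3",
           "1000,2000,4000,", "1500,2500,5000,"]
bed = genepred_to_bed12(example)
print(bed[10])  # blockSizes
print(bed[11])  # blockStarts
```

In this example the first exon starts at the transcript start, so its blockStarts entry is 0 even though its exonStarts entry is 1000.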

I hope this information explains what you are seeing in the various
downloads of the refGene table. If you have further questions, please
contact us again at gen...@soe.ucsc.edu.

--
Brooke Rhead
UCSC Genome Bioinformatics Group



Jackie Jia Zhou

May 15, 2013, 6:57:08 PM
to Brooke Rhead, gen...@lists.soe.ucsc.edu
Hi Brooke,

Thank you for the detailed explanations! It is very helpful.
However, one part still confuses me a little. I am working on mm9; here is an example of the difference I was talking about:
(1) Download the 'refGene.txt' file from 'downloads'.
(2) Go to 'Tables', choose the output format "all fields from selected table", and click on 'get output'. The gene_id (name) of the first row is 'NM_028778', and its chrSt and chrEnd are 134212701 and 134230065, respectively. Then, in the 'refGene.txt' file, the row whose gene_id (name) is 'NM_028778' has chrSt and chrEnd of 132316124 and 132333488, respectively.

I am thinking that maybe 132316124 and 132333488 in the 'refGene.txt' file do not actually represent the start and end positions of the gene, and that is why they differ from what I see in the 'Tables' output? Anyway, I thought getting some clarification from the list would help me understand the files better.

Thank you very much!

Best,

-Jackie

Brooke Rhead

May 15, 2013, 7:15:46 PM
to Jackie Jia Zhou, gen...@lists.soe.ucsc.edu
Hi Jackie,

I think I see the problem. I see start and end coordinates of 134212701
and 134230065 for the alignment of NM_028778 on mm9 in both the Table
Browser and via the downloads page. However, I see start and end
coordinates of 132316124 and 132333488 for NM_028778 on *mm10*. It
sounds like you may have inadvertently downloaded the mm10 version of
refGene.txt from the downloads page.
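
One quick way to check which assembly a file came from is to look up a known accession and compare its coordinates. The Python sketch below is only an illustration: the function name is made up, and the rows are truncated to the first six columns of the refGene layout (bin, name, chrom, strand, txStart, txEnd). Only the accession and the start/end values come from this thread; the bin, chromosome ("chrN"), and strand are placeholders.

```python
def tx_coords(lines, accession):
    """Return (chrom, txStart, txEnd) for every row whose name matches accession."""
    hits = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 5 and cols[1] == accession:
            hits.append((cols[2], int(cols[4]), int(cols[5])))
    return hits


# Placeholder rows: bin, chrom, and strand are NOT real values.
mm9_rows = ["0\tNM_028778\tchrN\t+\t134212701\t134230065"]
mm10_rows = ["0\tNM_028778\tchrN\t+\t132316124\t132333488"]

mm9_hit = tx_coords(mm9_rows, "NM_028778")[0]
mm10_hit = tx_coords(mm10_rows, "NM_028778")[0]

# Same transcript length, different offsets: consistent with one gene on two assemblies.
mm9_len = mm9_hit[2] - mm9_hit[1]
mm10_len = mm10_hit[2] - mm10_hit[1]
print(mm9_len, mm10_len)
```

Both spans come out to the same length (17364 bases), which fits the same alignment being reported on two different assemblies rather than two different definitions of start and end.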

--
Brooke Rhead
UCSC Genome Bioinformatics Group

