[Genome] Downloading old RefSeq and Ensembl transcripts with the "version numbers" in the accession IDs.


Laura Smith

May 29, 2012, 4:55:51 PM
to gen...@soe.ucsc.edu
Hello, 

I have been using the RefSeq and Ensembl transcripts that I downloaded from the UCSC Genome Browser Table Browser on June 23, 2011.
The transcript IDs in these datasets do not include version numbers (for example, in NM_134564.2 the version number is the ".2" after the period).

Recently, however, I found that I need the version number of each transcript. I tried to find and download them using the instructions at the link below, but there is no way to select the RefSeq transcripts as of June 23, 2011:

https://lists.soe.ucsc.edu/pipermail/genome/2011-September/027099.html


Would it be possible for you to send me the RefSeq and Ensembl transcripts from June 23, 2011 from your archives, including the version number of each transcript?


Or, if there is a way I could access this data myself, I would very much appreciate it if you could let me know.


Thank you,
Laura

Steve Heitner

May 30, 2012, 1:47:58 PM
to Laura Smith, gen...@soe.ucsc.edu
Hello, Laura.

As you mentioned, the refGene table does not actually list the GenBank
version number. The hg19.gbStatus.version field does list the version
number, but the problem is that this is the current version number and not
necessarily the version that was current as of June 23, 2011. There is also
a field called hg19.gbStatus.modDate that lists the last modified date, but
there are two problems with this. First, our modDate does not necessarily
coincide precisely with the official GenBank version date (e.g., our modDate
for NM_021219.2 is March 21, 2012 while GenBank lists it as April 21, 2012).
Second, if the particular transcript you are looking at is a version 3 (e.g.,
NR_001458.3), the gbStatus table does not keep a history of previous
versions and modDates, so there is no way to know whether it was NR_001458.1
or NR_001458.2 on June 23, 2011.

We do not keep histories of the refGene table, so there is no June 23, 2011
version of refGene that we can direct you to. There is no easy way to get a
snapshot of the data as it existed on June 23, 2011. It is possible to look
directly at GenBank to find the dates corresponding with the various
transcript versions (e.g., http://www.ncbi.nlm.nih.gov/nuccore/NM_021219.1
shows that NM_021219.1 was released on April 24, 2002 and
http://www.ncbi.nlm.nih.gov/nuccore/NM_021219.2 shows that NM_021219.2 was
released on April 21, 2012), but if you have a large number of IDs, this
would be very tedious without some kind of custom script.

Please contact us again at gen...@soe.ucsc.edu if you have any further
questions.

---
Steve Heitner
UCSC Genome Bioinformatics Group
_______________________________________________
Genome maillist - Gen...@soe.ucsc.edu
https://lists.soe.ucsc.edu/mailman/listinfo/genome
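
A minimal sketch of the kind of custom script mentioned above, assuming the release date of each version has already been collected (for example, by hand from the GenBank pages). The function names `version_on_date` and `nuccore_url` are hypothetical; the two dates are the ones quoted for NM_021219 in this thread, and the URL pattern is copied from the links above. The sketch does not query NCBI itself.

```python
from bisect import bisect_right
from datetime import date


def version_on_date(releases, when):
    """Return the version that was current on `when`.

    `releases` maps version number -> release date. Returns None when
    `when` is earlier than the first release.
    """
    ordered = sorted(releases.items(), key=lambda item: item[1])
    release_dates = [released for _, released in ordered]
    # Count the releases dated on or before `when`.
    count = bisect_right(release_dates, when)
    if count == 0:
        return None
    return ordered[count - 1][0]


def nuccore_url(accession, version):
    """Build the per-version GenBank page URL, following the links in the thread."""
    return f"http://www.ncbi.nlm.nih.gov/nuccore/{accession}.{version}"


# Release dates quoted in the thread for NM_021219.
NM_021219_RELEASES = {1: date(2002, 4, 24), 2: date(2012, 4, 21)}

if __name__ == "__main__":
    current = version_on_date(NM_021219_RELEASES, date(2011, 6, 23))
    print(current, nuccore_url("NM_021219", current))
```

With the dates given in the thread, the version current on June 23, 2011 comes out as 1, since version 2 was not released until April 21, 2012.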


Laura Smith

Jun 1, 2012, 4:53:17 PM
to st...@soe.ucsc.edu, gen...@soe.ucsc.edu
Hello Steve, 

Thank you very much for your reply. Based on your suggestion, I decided to download the newest RefSeq and Ensembl transcripts from the UCSC Browser with all of the gbStatus subfields.

I have tried to download these files with the gbStatus fields, but I keep getting an error from the UCSC Genome Browser website. I am following the directions listed here:

https://lists.soe.ucsc.edu/pipermail/genome/2011-September/027099.html 


Is there something I am doing wrong? Please see the two attached screenshots of the error messages from the UCSC browser.


If not, since I am not able to download these files, would it be possible for you to send me, or provide a link to, the latest RefSeq and Ensembl transcripts with all of the gbStatus subfields?
Alternatively, if you could let me know how I can download them with the gbStatus fields, I would very much appreciate it.

Thank you,
Laura 



Brooke Rhead

Jun 4, 2012, 8:15:31 PM
to Laura Smith, gen...@soe.ucsc.edu
Hi Laura,

It looks like the Table Browser is timing out on this large query.
There are a couple of ways you could work around this:

You could try limiting the output by pasting a list of the RefSeq
identifiers that you are interested in. When I followed the
instructions in the link you sent but pasted in a single identifier, I
was able to get results.

Another way to get the information you want would be to download the two
tables you are working with from our downloads server:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/
(this page takes a while to load)

Specifically, you would need:
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gbStatus.txt.gz

Then you could join the two tables yourself.

If you don't have a good way to accomplish a join of the tables, you
could use Galaxy: https://main.g2.bx.psu.edu/. You would need to first
fetch each of the tables separately using the "UCSC Main table browser"
link (under "Get Data"), and then join them on the
refGene.name/gbStatus.acc fields using the "Join two Datasets" link
(under "Join, Subtract and Group").

If you have any questions about using Galaxy, please contact their
helpdesk at galax...@lists.bx.psu.edu.

--
Brooke Rhead
UCSC Genome Bioinformatics Group
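
The join described above could be sketched locally as follows. This is an illustration under assumptions, not UCSC's own tooling: the column positions are assumed from the usual layout of these dumps (refGene with a leading bin column followed by name; gbStatus starting with acc and version) and should be checked against the refGene.sql and gbStatus.sql files in the same downloads directory. The helper names `read_tsv` and `add_versions` are hypothetical.

```python
import gzip

# Assumed 0-based column positions; verify against refGene.sql and gbStatus.sql.
REFGENE_NAME_COL = 1      # refGene.name (column 0 is assumed to be "bin")
GBSTATUS_ACC_COL = 0      # gbStatus.acc
GBSTATUS_VERSION_COL = 1  # gbStatus.version


def read_tsv(path):
    """Yield each line of a tab-separated file (gzip-compressed or plain) as a list."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            yield line.rstrip("\n").split("\t")


def add_versions(refgene_rows, gbstatus_rows):
    """Join refGene.name to gbStatus.acc and append 'accession.version' to each row.

    Rows whose accession is missing from gbStatus keep the bare accession.
    """
    versions = {
        row[GBSTATUS_ACC_COL]: row[GBSTATUS_VERSION_COL] for row in gbstatus_rows
    }
    for row in refgene_rows:
        name = row[REFGENE_NAME_COL]
        version = versions.get(name)
        versioned = f"{name}.{version}" if version is not None else name
        yield row + [versioned]


# Example usage with the two downloaded files (paths are placeholders):
# for row in add_versions(read_tsv("refGene.txt.gz"), read_tsv("gbStatus.txt.gz")):
#     print("\t".join(row))
```

As noted earlier in the thread, the version obtained this way is the current one in gbStatus, not necessarily the version that was current on an earlier date.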

Laura Smith

Jun 6, 2012, 2:40:20 PM
to Brooke Rhead, gen...@soe.ucsc.edu
Hi Brooke, 

Thank you very much for your very informative email. 

I followed your instructions: I downloaded the RefSeq and Ensembl transcripts through Galaxy exactly as you described, downloaded gbStatus as well, and did a "join" on the transcript name.

Now I need to know which versions of Ensembl and RefSeq I downloaded. Could you please let me know how and where I can retrieve this information?

To summarize: which versions of the Ensembl and RefSeq transcripts are currently on the UCSC website? How often are they updated? Is there an online link where this information is provided?


Another question: is it certain that Galaxy is in sync with all updates from the UCSC website? Perhaps this is a question for Galaxy, but I wanted to ask in case you know. When users access "UCSC Main" from Galaxy, are they connected to the live UCSC website or to some in-house copy of the UCSC browser within Galaxy?

Once again, thank you very much for your help. 

Laura
 



Pauline Fujita

Jun 6, 2012, 7:56:45 PM
to Laura Smith, gen...@soe.ucsc.edu
Hello Laura,

For many tracks you can find the version number or time of last update
on the description page for the track. To see the description for any
track you can click on the gray bar to the left of the track in the main
display or click on the track title above its configuration pulldown menu.

For Ensembl we are currently displaying v65. Our RefSeq tracks are
updated nightly.

Regarding Galaxy, they run a proxy server which is a real-time interface
to the UCSC Table Browser so it is in sync.

Best regards,

Pauline Fujita
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu

Laura Smith

Jun 7, 2012, 12:20:09 PM
to Brooke Rhead, gen...@soe.ucsc.edu
Hi Brooke, 

Thank you again for your email. 

I have a question about gbStatus. I downloaded the gbStatus.txt file from the link you sent me:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gbStatus.txt.gz 


However, this file contains only RefSeq transcripts; it does not contain Ensembl transcripts.

I would also like to get the gbStatus information for Ensembl transcripts.

Does the UCSC browser provide it?

To be clearer:

For example, on the UCSC Genome Browser website, when I look at a gene's RefSeq and Ensembl transcripts and click on a RefSeq transcript, I see the accession with its version, as in NM_123.4, where ".4" is the version.
The Ensembl transcripts, however, do not have version numbers on the UCSC website; they have only accession numbers such as ENS_123. Ensembl transcripts should also have version numbers, as listed on the Ensembl website.

So, do you know why this information is not included on the UCSC Genome Browser website?


Another thing I tried: when I fetch gbStatus through the Galaxy website and join it with the Ensembl transcripts, an empty set is returned. I am guessing this is because the original gbStatus.txt file contains no Ensembl transcripts, so there is nothing to join on by accession ID.


Is there any plan to include Ensembl versions in the gbStatus.txt file in the near future? If not, is there another way for me to retrieve them from the UCSC Genome Browser?

If you could provide any recommendation, I would greatly appreciate it.

Thank you,
Laura 
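
For reference, the accession.version convention described above can be handled with a small helper. This is a sketch with a hypothetical function name (`split_versioned_id`); it only splits on the final period and does not check that the accession is a valid RefSeq or Ensembl ID. The example ID is the one quoted at the start of this thread.

```python
def split_versioned_id(identifier):
    """Split 'NM_134564.2' into ('NM_134564', 2).

    Identifiers without a numeric '.N' suffix are returned unchanged
    with a version of None.
    """
    accession, dot, version = identifier.rpartition(".")
    if dot and accession and version.isdigit():
        return accession, int(version)
    return identifier, None


if __name__ == "__main__":
    print(split_versioned_id("NM_134564.2"))
    print(split_versioned_id("NM_134564"))
```

Stripping the version this way is one option for matching versioned IDs against tables, such as refGene, that store only the bare accession.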


Brooke Rhead

Jun 8, 2012, 9:00:01 PM
to Laura Smith, gen...@soe.ucsc.edu
Hi Laura,

The gbStatus table is part of the suite of tables that supports our
GenBank tracks; it is not related to the Ensembl tracks.

I looked at the Ensembl website for an example of the version number you
are referring to. I see that they do list a version number for each
transcript or gene on their pages. However, we do not keep the version
in any of our tables. You might be able to get the version numbers for
a specific genebuild (version 65, in this case) directly from Ensembl.

If you have further questions, please contact us again at
gen...@soe.ucsc.edu.

--
Brooke Rhead
UCSC Genome Bioinformatics Group



Laura Smith

Jun 15, 2012, 4:15:57 PM
to gen...@soe.ucsc.edu
Hi, 

I have downloaded the placental phyloP scores from the UCSC Genome Browser website from here:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phyloP46way/placentalMammals/ 


http://hgdownload.cse.ucsc.edu/goldenpath/hg19/phyloP46way/ 

Could you please confirm which version of chromosome M was used when calculating the phyloP scores? Was it the new rCRS chrM or the old chrM?

Thank you,
Laura

Brooke Rhead

Jun 15, 2012, 6:19:34 PM
to Laura Smith, gen...@soe.ucsc.edu
Hi Laura,

We have not rebuilt any of our tracks with the rCRS chrM for hg19. This
includes the calculations of the phyloP scores. All tracks use the old
version (NC_001807).

There is a note about it on our hg19 gateway page
(http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19):

Note on chrM
Since the release of the UCSC hg19 assembly, the Homo sapiens
mitochondrion sequence (represented as "chrM" in the Genome Browser) has
been replaced in GenBank with the record NC_012920. We have not replaced
the original sequence, NC_001807, in the hg19 Genome Browser. We plan to
use the Revised Cambridge Reference Sequence (rCRS) in the next human
assembly release.

If you have further questions, please contact us again at
gen...@soe.ucsc.edu.

--
Brooke Rhead
UCSC Genome Bioinformatics Group
