Missing data

Leon Peshkin

Jul 20, 2015, 3:38:36 PM

to gen...@soe.ucsc.edu, ke...@soe.ucsc.edu

Dear Colleagues,

I am interested in historical data from the early releases of the human and mouse genomes, specifically in protein sequences. The downloads page linked below appears to link to at least some of these; however, all the links to proteins are stale.

Would you mind checking whether you could revive these links and, more generally, find early data in your archives?

-Leon Peshkin

http://hgdownload.soe.ucsc.edu/downloads.html#human

"Protein database for hg16"

http://hgdownload.soe.ucsc.edu/goldenPath/proteinDB/proteins040115/database/

http://hgdownload.soe.ucsc.edu/goldenPath/proteinDB/proteins0111/database/

http://hgdownload.soe.ucsc.edu/goldenPath/proteinDB/proteins040315/database/

      Name                    Last modified      Size  Description
      Parent Directory                             -   
      proteins120806/         27-Nov-2012 13:18    -   
      proteins121210/         27-Jun-2013 11:06    -   
      proteins140122/         26-Mar-2014 16:32    -   
      proteins150225/         21-Apr-2015 12:57    -

Jonathan Casper

Jul 24, 2015, 6:37:19 PM

to Leon Peshkin, gen...@soe.ucsc.edu, Jim Kent

Hello Leon,

Thank you for your question about locating old data. Our proteome browser was retired several years ago, but one of our engineers was able to recover the contents of the proteins040315 directory. Those files have been placed at http://hgdownload-test.soe.ucsc.edu/goldenPath/proteinDB/proteins040315/database. For other old data files, you may be able to find information about the processes used to generate them in documents from our kent source tree at http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=tree;f=src/hg/makeDb/doc/proteins;hb=HEAD.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

Leon Peshkin

Aug 24, 2015, 12:14:56 PM

to Jonathan Casper, gen...@soe.ucsc.edu, Jim Kent

Hi Jonathan,

Thanks for digging up these files. Unfortunately, I still cannot find what I am looking for.

I am interested in obtaining protein sequences (ideally for the "primary/longest" transcript) for all genes from early releases of the human genome. Could someone help me locate these?

-Leon

Steve Heitner

Aug 25, 2015, 1:54:09 PM

to Leon Peshkin, Jonathan Casper, gen...@soe.ucsc.edu, Jim Kent

Hello, Leon.

It is unclear whether you were certain that the information you were looking for was in the directories you referenced in your initial email, or whether you were hoping it would be there and did not find it when you viewed the contents of the directory that Jonathan made available.

Based on the contents of your latest email, it sounds like what you are looking for can be obtained from our Table Browser. You referenced hg16 in your initial email, so I assume hg16 will be sufficient. Human assemblies prior to hg16 are no longer available in the Table Browser. If you are unfamiliar with the Table Browser, please also refer to http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html.

You can use the Table Browser to retrieve protein sequences from our Known Genes track (called UCSC Genes in later human assemblies). You also mentioned that you would prefer only the primary/longest transcript. This is available with our knownCanonical table, but you can only retrieve genomic sequence from the knownCanonical table – not protein sequence. What you can do is retrieve protein sequence from the knownGene table, retrieve the list of canonical identifiers from the knownCanonical table, and then use that list of canonical identifiers to filter your list of protein sequences.

Another thing to consider is that a query of protein sequences for the entire genome will be extremely large and will likely cause the Table Browser to time out. You might want to consider breaking it up by chromosome.

To retrieve the protein sequences, perform the following steps:

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables

2. Select the following options:
Clade: Mammal
Genome: Human
Assembly: July 2003 (NCBI34/hg16)
Group: Genes and Gene Predictions
Track: Known Genes
Table: knownGene
Region: Select the “position” radio button and enter a coordinate range (e.g., “chr1” for all of chromosome 1)
Output: sequence

3. Click the “get output” button

4. Select the “protein” radio button

5. Click the “submit” button

To obtain the list of canonical IDs, perform the following steps:

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables

2. Select the following options:
Clade: Mammal
Genome: Human
Assembly: July 2003 (NCBI34/hg16)
Group: Genes and Gene Predictions
Track: Known Genes
Table: knownCanonical
Region: Select the “genome” radio button
Output: selected fields from primary and related tables

3. Click the “get output” button

4. Here, you can select the fields that you would like in your output. For a list of the canonical IDs, check only the “transcript” checkbox.

5. Click the “get output” button
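
The final filtering step (keeping only the protein sequences whose IDs appear in the canonical list) has to be done outside the Table Browser. Below is a minimal Python sketch of one way to do it, not an official UCSC tool. It rests on several assumptions that should be checked against the actual downloads: that the protein output is FASTA, that the transcript ID is the first whitespace-delimited token of each FASTA header (possibly preceded by a prefix ending in an underscore), and that the canonical ID file has one ID per line with any header line starting with "#". All function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect IDs from the first tab-separated column.

    Assumption: blank lines and lines starting with '#' carry no IDs.
    """
    ids: Set[str] = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def is_canonical(header: str, canonical_ids: Set[str]) -> bool:
    """Return True if the header's first token names a canonical ID.

    Assumption: the token is either the ID itself or a prefix joined to
    the ID with underscores, so every underscore-delimited suffix is tried.
    """
    tokens = header.split()
    if not tokens:
        return False
    token = tokens[0]
    if token in canonical_ids:
        return True
    parts = token.split("_")
    return any("_".join(parts[i:]) in canonical_ids for i in range(1, len(parts)))


def filter_canonical(
    fasta_lines: Iterable[str], id_lines: Iterable[str]
) -> List[Tuple[str, str]]:
    """Keep only the FASTA records whose ID is in the canonical list."""
    ids = read_ids(id_lines)
    return [(h, s) for h, s in read_fasta(fasta_lines) if is_canonical(h, ids)]
```

Both arguments to filter_canonical can be open file handles, so if the protein download was split by chromosome, the function can be called once per file with the same canonical ID list. If the real header layout differs from the assumption above, only is_canonical should need adjusting.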

If this is not sufficient or not what you were looking for, you may also consider looking at UniProt (http://www.uniprot.org/).

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group
