Hello Johnathan,
Thank you for your question about obtaining gene coordinates from the mm9 genome assembly. There are several gene tracks available for that assembly, including UCSC Genes, Ensembl Genes, and RefSeq Genes. Is there one in particular that you are trying to use?
You can create a file with those fields using the UCSC Table Browser, as follows:
1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options:
Clade: Mammal
Genome: Mouse
Assembly: July 2007 (NCBI37/mm9)
Group: Genes and Gene Predictions
3. Choose the gene track that you wish to use. For this example, select UCSC Genes. The primary table will be selected by default (knownGene for UCSC Genes, ensGene for Ensembl Genes, refGene for RefSeq Genes).
4. Click the "describe table schema" button for the chosen table.
5. The new page describes each of the table's fields. For UCSC Genes, you probably want name, chrom, txStart, and txEnd. UCSC Genes is also a special case: the related gene symbols are stored in a secondary table named kgXref, in the field "geneSymbol".
6. Click the back button on your browser to return to the main hgTables page. Set the following options:
Region: genome
Output Format: selected fields from primary and related tables
7. Click "get output".
8. Click the checkboxes next to the desired fields (name, chrom, txStart, and txEnd). If you would like the gene symbols as well, check the geneSymbol box among the linked mm9.kgXref fields.
9. Click "get output".
You should be presented with the data that you are looking for.
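If you save that output to a file, it can be read with a short script. The following is only a sketch in Python: it assumes the output is tab-separated with a single header line that starts with "#" and names each column as database.table.field, so please check it against the file you actually get. The function name and the sample row (including the ID uc000aaa.1 and the symbol GeneA) are made up for illustration.

```python
import csv
import io


def read_table_browser_output(handle):
    """Yield one dict per data row, keyed by short field name.

    Assumes the first non-empty line is a header such as
    '#mm9.knownGene.name<TAB>mm9.knownGene.chrom<TAB>...'.
    Only the part after the last '.' is kept as the key, so two tables
    that share a field name would collide; adjust if that applies.
    """
    header = None
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if header is None:
            header = [col.lstrip("#").split(".")[-1] for col in row]
            continue
        yield dict(zip(header, row))


# Made-up sample in the assumed layout (not real mm9 data).
sample = io.StringIO(
    "#mm9.knownGene.name\tmm9.knownGene.chrom\t"
    "mm9.knownGene.txStart\tmm9.knownGene.txEnd\tmm9.kgXref.geneSymbol\n"
    "uc000aaa.1\tchr1\t100\t200\tGeneA\n"
)
rows = list(read_table_browser_output(sample))
```

To use it on a saved file, pass an open file handle in place of the sample, for example open("output.txt", newline="").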
You can also find the data from each of these tables on our download server at http://hgdownload.soe.ucsc.edu. Click the Mouse link, then scroll down to the Jul. 2007 (mm9) section and follow the "Annotation database" link. The data for each of these tables is stored in a file of the same name (knownGene in knownGene.txt.gz, ensGene in ensGene.txt.gz, etc.). You can then extract the desired columns of data from these files on your own. Please note, however, that this method makes it harder to get gene symbols for the UCSC Genes track, because they are stored in a separate table (kgXref) that you would need to join to knownGene yourself.
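If you go the download route, extracting the columns and attaching the gene symbols could look roughly like the Python sketch below. The column positions are assumptions about the mm9 dumps (knownGene: name, chrom, strand, txStart, txEnd, ...; kgXref: kgID first and geneSymbol fifth), so please confirm them against the table schema page or the .sql file next to each .txt.gz before relying on it. The function names are hypothetical.

```python
import gzip

# Assumed 0-based column positions; verify against the table schema.
KG_NAME, KG_CHROM, KG_TXSTART, KG_TXEND = 0, 1, 3, 4
XREF_KGID, XREF_SYMBOL = 0, 4


def load_symbols(xref_lines):
    """Map knownGene ID -> gene symbol from kgXref rows."""
    symbols = {}
    for line in xref_lines:
        fields = line.rstrip("\n").split("\t")
        symbols[fields[XREF_KGID]] = fields[XREF_SYMBOL]
    return symbols


def gene_coordinates(known_gene_lines, symbols):
    """Yield (name, chrom, txStart, txEnd, geneSymbol) per transcript."""
    for line in known_gene_lines:
        fields = line.rstrip("\n").split("\t")
        name = fields[KG_NAME]
        yield (
            name,
            fields[KG_CHROM],
            int(fields[KG_TXSTART]),
            int(fields[KG_TXEND]),
            symbols.get(name, ""),  # empty string if no symbol is listed
        )


def write_gene_table(known_gene_path, kgxref_path, out):
    """Join the two downloaded files and write tab-separated rows to out."""
    with gzip.open(kgxref_path, "rt") as xref:
        symbols = load_symbols(xref)
    with gzip.open(known_gene_path, "rt") as known_gene:
        for record in gene_coordinates(known_gene, symbols):
            out.write("\t".join(str(value) for value in record) + "\n")
```

For example, write_gene_table("knownGene.txt.gz", "kgXref.txt.gz", sys.stdout) (after import sys) would print one line per transcript.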
Please note also that these files include a separate entry for each gene transcript, which may result in multiple entries for some genes. You can restrict this list for the UCSC Genes track by using the knownCanonical table instead of knownGene; knownCanonical includes only a single representative transcript for each gene.
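If you are working from the downloaded files, one way to apply that restriction is to keep only the knownGene rows whose ID appears in knownCanonical. The Python sketch below assumes the transcript ID is the fifth column of knownCanonical.txt.gz (chrom, chromStart, chromEnd, clusterId, transcript, protein) and the first column of knownGene.txt.gz; please check the table schema to confirm. The function names are hypothetical.

```python
# Assumed 0-based column positions; verify against the table schema.
CANON_TRANSCRIPT = 4
KG_NAME = 0


def canonical_ids(canonical_lines):
    """Collect the transcript IDs listed in knownCanonical rows."""
    return {
        line.rstrip("\n").split("\t")[CANON_TRANSCRIPT]
        for line in canonical_lines
        if line.strip()
    }


def keep_canonical(known_gene_lines, ids):
    """Yield only the knownGene rows whose name is a canonical transcript."""
    for line in known_gene_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[KG_NAME] in ids:
            yield line
```

Both functions take any iterable of lines, so they can be fed from gzip.open(path, "rt") on the downloaded files.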
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.
--
Jonathan Casper
UCSC Genome Bioinformatics Group
--