long descriptions from RefSeq


Anton Kratz

Apr 11, 2022, 1:10:40 PM
to gen...@soe.ucsc.edu
Dear UCSC team,

I have a list of a few hundred gene symbols, like this: BRCA1, OSBPL8, ALDH3A2, COL1A1...

I would like to get the long names, and the long descriptions as provided by RefSeq for them, using the table browser.

For example, for BRCA1 I am aiming to get a tab-delimited line with three entries:
1.) BRCA1
2.) BRCA1 DNA repair associated
3.) This gene encodes a 190 kD nuclear phosphoprotein that plays a role in maintaining genomic stability [...] [provided by RefSeq, May 2020].

How can I get this information? I am somewhat familiar with how to use the table browser (...from selected fields...), but I do not know in which tables to find this information, and how to string them together.

Best,
Anton

Daniel Schmelter

Apr 11, 2022, 8:31:34 PM
to Anton Kratz, UCSC Genome Browser Support

Hello Anton,

Thank you for contacting Genome Browser support with your question about RefSeq fields.

The method for obtaining specific information from a list of gene symbols is as follows:
1. Navigate to the Table Browser and select your assembly (hg19 or hg38) and data table (GENCODE, knownGene).

https://genome.ucsc.edu/cgi-bin/hgTables

2. Select "genome" as the region and enter your identifiers (gene symbols), either by uploading a file or by pasting text
3. Change your output format to "selected fields from primary and related tables"
4. Click the "get output" button, which brings you to a page for selecting fields and related tables
5. Select the following fields from "hg38.kgXref fields" table
  • geneSymbol
  • description
6. Select the following related table, "hgFixed | refSeqSummary" and click "allow selection...". Then select the following field:
  • summary

7. Then click "get output" again to receive your final data. If you would like to get it as a file, enter your desired filename in the "Output filename" field after step 3.

Another way to summarize this is as follows, in database.table.field format:

hg38.kgXref.geneSymbol
hg38.kgXref.description
hgFixed.refSeqSummary.summary
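
For those who prefer a script over the web form, the same three fields could in principle be requested from UCSC's public MySQL server (genome-mysql.soe.ucsc.edu, user "genome"). The snippet below is only a sketch that builds the query text; it does not connect to anything. The function name `build_query` is made up for illustration, and the join columns (`kgXref.refseq` and `refSeqSummary.mrnaAcc`) are an assumption that should be checked against the table schemas in the Table Browser before use.

```python
def build_query(symbols, db="hg38"):
    """Return (sql, params) for gene symbol, long name and RefSeq summary.

    Hypothetical helper. The join condition is an ASSUMPTION:
    kgXref.refseq is taken to match refSeqSummary.mrnaAcc.
    """
    symbols = list(symbols)
    if not symbols:
        raise ValueError("need at least one gene symbol")
    if not db.isalnum():
        raise ValueError("unexpected database name: %r" % db)

    # One placeholder per symbol, so the values are passed as parameters
    # rather than pasted into the SQL text.
    placeholders = ", ".join(["%s"] * len(symbols))
    sql = (
        "SELECT x.geneSymbol, x.description, s.summary "
        f"FROM {db}.kgXref AS x "
        "LEFT JOIN hgFixed.refSeqSummary AS s ON s.mrnaAcc = x.refseq "
        f"WHERE x.geneSymbol IN ({placeholders})"
    )
    return sql, symbols


sql, params = build_query(["BRCA1", "OSBPL8", "ALDH3A2", "COL1A1"])
print(sql)
print(params)
```

The returned pair is meant to be handed to a MySQL client library's `execute(sql, params)` call. Like the Table Browser steps above, this returns one row per transcript, not one per gene.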

I hope this was helpful. If you have any more questions, please reply-all to our public support email at gen...@soe.ucsc.edu. For private communication, please reply-all to genom...@soe.ucsc.edu.
All the best,


Anton Kratz

Jun 8, 2022, 4:27:29 PM
to gen...@soe.ucsc.edu
Is it possible to somehow reduce this to genes, not transcripts?

When I follow these steps, a list of originally a few hundred gene symbols inflates to several thousand descriptions, because of all these "transcript variant" strings.

These strings are very hard to parse and filter out systematically.

What I want to achieve is: given a list of n gene symbols, such as BRCA1, I am aiming to get one unambiguous, tab-delimited line with three entries for each gene symbol:
1.) BRCA1
2.) BRCA1 DNA repair associated
3.) This gene encodes a 190 kD nuclear phosphoprotein that plays a role in maintaining genomic stability [...] [provided by RefSeq, May 2020].

If there are n symbols, there should be n lines, not several answers per gene symbol that differ only slightly in the string because of transcript variants. I do not know how to filter those substrings out.

Is there a way to achieve that in the table browser?

Anton

Jairo Navarro Gonzalez

Jun 10, 2022, 6:32:15 PM
to Anton Kratz, UCSC Genome Browser Discussion List

Hello,

Thank you for using the UCSC Genome Browser and sending your follow-up message.

You can use the knownCanonical table instead of the knownGene table to limit the number of transcripts to one per gene. You can learn more about this table on the following help page, https://genome.ucsc.edu/FAQ/FAQgenes.html#singledownload. Also described on the help page is another table with one transcript per gene, the NCBI RefSeq Select track.
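
If switching tables is not an option, the downloaded output could also be collapsed afterwards. The sketch below is not the knownCanonical approach; it is a simple post-processing step that keeps one row per gene symbol, preferring a row that has a non-empty summary. It assumes a tab-delimited file with the gene symbol in the first column and the summary in the third, and that header lines start with "#". The function name `one_row_per_gene` is made up for illustration.

```python
def one_row_per_gene(tsv_text):
    """Collapse tab-delimited rows to one row per gene symbol.

    Assumed column order: geneSymbol, description, summary.
    For each symbol the first row is kept, unless a later row is the
    first one that carries a non-empty summary.
    """
    best = {}  # symbol -> chosen row; dicts keep first-seen order
    for line in tsv_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        fields += [""] * (3 - len(fields))  # pad rows with missing columns
        symbol = fields[0]
        kept = best.get(symbol)
        if kept is None or (not kept[2].strip() and fields[2].strip()):
            best[symbol] = fields
    return list(best.values())


example = (
    "#geneSymbol\tdescription\tsummary\n"
    "BRCA1\tdescription of variant 1\t\n"
    "BRCA1\tdescription of variant 2\tSummary text\n"
    "OSBPL8\tdescription\tOther summary\n"
)
for row in one_row_per_gene(example):
    print("\t".join(row))
```

Note that this keeps whichever description string belongs to the chosen row; it does not try to strip "transcript variant" wording from it.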

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser

Want to share the Browser with colleagues?
Host a workshop: https://bit.ly/ucscTraining