Full set of symbols


Oren Ben-Kiki

Jul 1, 2024, 1:54:44 PM
to UCSC Genome Browser Public Support
TL;DR: Is there any API or CSV file I can access that gives the properties of *any* known symbol?

For example, what is the deal with AC008770?

- It doesn't appear anywhere in the output of mysql --user=genome --host=genome-mysql.cse.ucsc.edu --disable-auto-rehash -e "SELECT name,alignID,mRNA,geneSymbol,refseq,protAcc FROM hg38.knownGene INNER JOIN hg38.kgXref ON knownGene.name = kgXref.kgID"

So, I would think it simply doesn't appear in the UCSC database (to be fair, it seems not to exist in the HGNC database at all).

However, if I search for AC008770 in https://genome.ucsc.edu/
I get two hits in hg38:

These give a lot of information about it, for example associating it with several ZNF genes and ENSG genes.

So this symbol *is* known *somewhere* in the UCSC databases (it isn't an HGNC symbol, but it is "known").

This isn't an issue with just one or a few symbols - around a third of the "genes" in some of my data sets use "symbols" which are "known" but are not HGNC symbol names and aren't listed in the above CSV files / query results.

So - am I simply looking in the wrong database tables / CSV files? If so, where should I look, or is there some other programmatic way to access the data for such symbols?

Christopher Lee

Jul 1, 2024, 2:13:19 PM
to Oren Ben-Kiki, UCSC Genome Browser Public Support
Hi Oren,

If you would like to programmatically search all of the tracks for a database, you can use the following API URL (using your AC008770 as an example here):

from the result:
{
  "downloadTime": "2024:07:01T18:00:35Z",
  "downloadTimeStamp": 1719856835,
  "genome": "hg38",
  "positionMatches": [
    {
      "name": "gold",
      "trackName": "gold",
      "description": "Assembly from Fragments",
      "vis": "pack",
      "matches": [
        {
          "position": "chr19:11869899-12021083",
          "hgFindMatches": "AC008770.7",
          "posName": "AC008770.7",
          "highlight": null,
          "canonical": false
        },
        {
          "position": "chr19:12021639-12034729",
          "hgFindMatches": "AC008770.7",
          "posName": "AC008770.7",
          "highlight": null,
          "canonical": false
        }
      ]
    }
  ]
}

The positionMatches array will contain a list of hits for various tracks, although in this case the only hits are to the Gold track, which simply holds the names of the contigs that make up the assembly. None of the hits are to genes. Here is a help page for using the API, scrolled to the search function, although you will probably find the whole page useful:
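As a minimal sketch of how such a response could be consumed, the Python below flattens the positionMatches array into one tuple per hit. It works on a saved copy of the response (trimmed from the result quoted above) rather than calling the API, and the function name extract_matches is made up for illustration.

```python
import json


def extract_matches(response_text):
    """Flatten a search response into (track, name, chrom, start, end) tuples."""
    data = json.loads(response_text)
    rows = []
    for track in data.get("positionMatches", []):
        for match in track.get("matches", []):
            # Positions look like "chr19:11869899-12021083".
            chrom, span = match["position"].split(":")
            start, end = (int(value) for value in span.split("-"))
            rows.append((track["trackName"], match["posName"], chrom, start, end))
    return rows


# Trimmed copy of the response quoted above (only the fields used here).
sample = """
{
  "genome": "hg38",
  "positionMatches": [
    {
      "name": "gold",
      "trackName": "gold",
      "matches": [
        {"position": "chr19:11869899-12021083", "posName": "AC008770.7"},
        {"position": "chr19:12021639-12034729", "posName": "AC008770.7"}
      ]
    }
  ]
}
"""

for row in extract_matches(sample):
    print(row)
```

Grouping the tuples by track name afterwards makes it easy to tell hits on the assembly (gold) track apart from hits on gene tracks.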

Thanks,
Christopher Lee
UCSC Genomics Institute


Maximilian Haeussler

3:35 AM
to Christopher Lee, Oren Ben-Kiki, UCSC Genome Browser Public Support
Hi Oren,

when you write "symbol", it sounds like you mean "gene symbol", which is how the word is usually used. However, the identifier that you mentioned as an example, "AC008770", is not a gene; it's a clone, the name of a piece of DNA that was assembled and then placed onto a chromosome in the early 2000s. These days, these clone names are no longer used, and this type of assembly technology is no longer used at all.
This may explain why the gene-based searches that you mentioned, and almost all databases, will never find this "symbol": it's not a gene symbol, but the GenBank accession of an assembly clone.

Hope this helps, let us know if you have other questions,

best
Max

Maximilian Haeussler

8:35 AM
to Oren Ben-Kiki, UCSC Genome Browser Discussion List
Hi Oren, 

to download the list of all clones of an assembly, go to the track configuration of the "Assembly" track and click "Data schema/format description and download". It will show the fields and their descriptions. You'll also see a link to the download server MySQL directory. The filename is gold.txt.gz; it's a tab-separated file.

Alternatively, you can go to the table browser and select the "Assembly" track, click "Define region of interest" = "genome", "output format" = "all fields from selected table" and click the "get output" button.
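As a rough illustration of working with gold.txt.gz once it is downloaded, the Python sketch below builds a lookup from clone name to positions. The column order in GOLD_COLUMNS is an assumption and should be checked against the field list on the schema page mentioned above; the function names and the demo rows are made up, not real data. Note that UCSC table files store 0-based start coordinates, while positions shown in search results are 1-based.

```python
import csv
import gzip
import io

# Assumed column order for the gold table; verify it against the schema page.
GOLD_COLUMNS = [
    "bin", "chrom", "chromStart", "chromEnd", "ix",
    "type", "frag", "fragStart", "fragEnd", "strand",
]


def read_gold(source):
    """Yield one dict per row of a gzip-compressed, tab-separated gold table.

    `source` may be a path or a binary file object.
    """
    with gzip.open(source, mode="rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            record = dict(zip(GOLD_COLUMNS, row))
            record["chromStart"] = int(record["chromStart"])
            record["chromEnd"] = int(record["chromEnd"])
            yield record


def clone_positions(source):
    """Map each clone name (the frag column) to its list of (chrom, start, end)."""
    index = {}
    for record in read_gold(source):
        position = (record["chrom"], record["chromStart"], record["chromEnd"])
        index.setdefault(record["frag"], []).append(position)
    return index


# Made-up rows, for illustration only; a real run would pass "gold.txt.gz".
demo_rows = (
    "585\tchr1\t0\t1000\t1\tF\tCLONE0001.1\t0\t1000\t+\n"
    "586\tchr1\t1500\t2500\t2\tF\tCLONE0001.1\t1200\t2200\t+\n"
)
demo_file = io.BytesIO(gzip.compress(demo_rows.encode("utf-8")))
index = clone_positions(demo_file)
print(index["CLONE0001.1"])
```

A clone can be placed in more than one piece, as with the two AC008770.7 hits earlier in this thread, which is why each name maps to a list.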

Hope this helps, let us know if you have other questions,

Max 

On Tue, Jul 2, 2024 at 11:54 AM Oren Ben-Kiki <or...@ben-kiki.org> wrote:
For sure the use of these identifiers "should" be phased out, but I still have to deal with data sets that use them. It is therefore great that searching for them still brings up the data.

Given that these are clones and not genes, it does explain why I didn't find them in the genes tables; that said, since the search does find them, there must be a clones database table somewhere - is it possible to download a single CSV file (or single mysql query) that lists all clones and their information?

Oren Ben-Kiki

12:18 PM
to Christopher Lee, UCSC Genome Browser Public Support
Thanks for the links.

Yes, I can use this web API to get positions in the genome, and then query https://rest.ensembl.org/overlap/region/human/{chromosome}:{start}:{end}?feature=gene;content-type=application/json to get a list of all the genes in that region.
This is a very slow process (I think the web APIs are throttled - perfectly understandable, but a PITA when I need to do this for several thousand names).

I am confused, however, when you say "None of the hits are to genes". Isn't ZNF700 a gene? It appears in https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/wgEncodeGencodeAttrsV46.txt.gz for example.

Is there some database table / CSV file where I could download "all" of the "known" names (including ZNF* and AC*) and their positions in one fell swoop? If so I could locate the overlaps myself without having to make tens of thousands of web API calls...
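Locating the overlaps locally, once both sets of positions are in hand, could look something like the Python sketch below. It assumes clone and gene positions have already been loaded as tuples with 0-based, half-open coordinates; the function name, gene names and coordinates are all hypothetical. The scan is a naive per-chromosome comparison, which is fine for a few thousand names but would want a sorted or interval-tree approach for much larger inputs.

```python
from collections import defaultdict


def overlapping_genes(clone_intervals, genes):
    """Return the sorted names of genes overlapping any of the clone intervals.

    clone_intervals: iterable of (chrom, start, end)
    genes: iterable of (name, chrom, start, end)
    Coordinates are treated as 0-based, half-open.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, name))

    hits = set()
    for chrom, start, end in clone_intervals:
        for gene_start, gene_end, name in by_chrom.get(chrom, []):
            # Two half-open intervals overlap when each starts before the other ends.
            if gene_start < end and start < gene_end:
                hits.add(name)
    return sorted(hits)


# Hypothetical data, for illustration only.
genes = [
    ("GENE_A", "chr1", 100, 200),
    ("GENE_B", "chr1", 500, 600),
    ("GENE_C", "chr2", 100, 200),
]
clone = [("chr1", 150, 550)]
print(overlapping_genes(clone, genes))
```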

Oren Ben-Kiki

12:27 PM
to Maximilian Haeussler, UCSC Genome Browser Discussion List
Many thanks - gold.txt.gz eliminates the need to make all these web API calls (at least for getting the chromosome positions).