Hello Ivan,
Thank you for writing to us to report your experience with the conservation data.
I have added your request to update the reference GENCODE file to our ticketing system, but it is not something I anticipate being completed in the near future. For reference, could you provide an example gene or session that shows this mismatch?
You should be able to work around this by displaying our current knownGene, which is GENCODE v36, and zooming in enough to see the GENCODE track's amino acid displays. I understand that this may not be ideal, but hopefully, it is sufficient to interpret the amino acid change.
Thank you for writing in and for your suggestion. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.
Daniel Schmelter
UCSC Genome Browser
Hello Ivan,
Thanks again for contacting us regarding these files that were not up-to-date with knownGene.
We have rebuilt the download files with the latest knownGene version (GENCODE v36). You can access them at the following link:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/alignments/
Let us know if you have any further concerns or difficulties.
All the best,
Daniel Schmelter
UCSC Genome Browser
Hi Dan,
Thank you for the prompt response. It looks like I was not clear about my problem, sorry. I was talking about the downloadable tracks, not the ones displayed by the Genome Browser website. Specifically, there is a file containing Multiz alignments of 99 vertebrate species against the hg38 human genome for the CDS sequences of all protein-coding transcripts annotated in the knownGene track, found here:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/alignments/knownGene.exonAA.fa.gz
Other files in this directory include the knownCanonical subset of the above as well as the corresponding nucleotide sequence alignments.
These files were last updated in October 2019 and hence use the knownGene annotations of that time, which were apparently based on GENCODE v29, although this is not specified anywhere in the documentation. The current version of the (downloadable) hg38 knownGene track is likely based on GENCODE v36/v37 (also not documented). In any case, the two versions of the transcript annotations have now diverged by at least 5% (total fraction of new, updated, or removed transcript annotations). I normally use the latest available version of the knownGene track (we have a local SQL mirror of the UCSC database), but unfortunately it now produces lots of errors when I try to map CDS sequences from the knownGene.exonAA.fa.gz file to the exon definitions in the updated knownGene track. One example is ENST00000617334.1, which encodes an AA sequence 3 times longer in the current version than the one in the knownGene.exonAA.fa.gz file. There are also numerous transcripts whose version number has been incremented, transcripts that have been removed, etc. Mismatched transcripts with the same ID and version number, like ENST00000617334.1, are the worst case since I cannot easily filter them out. But filtering is not a proper way to deal with this problem anyway, since it introduces sampling bias into the set of filtered transcripts.
If you know of some way to version-control the knownGene track (the downloadable one) and move back in time to October 2019, that would work as a solution too. Unfortunately, we do not keep copies of our local SQL mirror of the UCSC database for that long.
Once again, I am not interested in the web version of the tracks; I am talking about the downloads provided by UCSC. You have actually done this kind of update for the Multiz 100 downloadable files at least once before, after their initial release in 2018, so hopefully it should not be too hard to repeat. How to solve this discrepancy problem in the long run, however, I do not know.
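One way to flag transcripts like this before mapping is to compare the CDS length implied by the knownGene exon coordinates with the length of the amino acid sequence. The following is a minimal Python sketch under stated assumptions, not UCSC's method: the function names and the one-codon slack are hypothetical, the coordinates are assumed to be 0-based and half-open as in UCSC tables, and the example values are toy numbers rather than real annotation data.

```python
def cds_length(exon_starts, exon_ends, cds_start, cds_end):
    """Count coding bases: the overlap of each exon with [cds_start, cds_end)."""
    total = 0
    for start, end in zip(exon_starts, exon_ends):
        lo = max(start, cds_start)
        hi = min(end, cds_end)
        if hi > lo:
            total += hi - lo
    return total


def lengths_consistent(cds_len, aa_len, slack=1):
    """Return True if aa_len is within `slack` codons of cds_len // 3.

    The slack is a hypothetical allowance for a stop codon or an
    incomplete final codon; adjust it to the conventions of your files.
    """
    return abs(cds_len // 3 - aa_len) <= slack


# Toy coordinates (not real data): two exons, CDS spanning part of each.
exon_starts = [100, 300]
exon_ends = [200, 400]
n_coding = cds_length(exon_starts, exon_ends, cds_start=150, cds_end=350)

# An AA sequence about a third of the expected length would be flagged.
print(n_coding, lengths_consistent(n_coding, 33), lengths_consistent(n_coding, 11))
```

A check like this only reports a mismatch; it does not say which of the two sources is out of date.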
Thanks,
Ivan
Hello Ivan,
Thanks for your patience with this response.
The knownGene track is a periodically updated dataset that uses a few different sources to generate the details page for each item and to speed up the track loading itself. These sources include the SQL table you mentioned and a bigBed file. Each source has slightly different columns but relies on the same core data, currently GENCODE v36 for hg38. Each source has its own advantages and disadvantages, and each might be suitable for different use cases. The knownGene SQL table has the basic data, while the bigBed has more information. In general, we are transitioning away from SQL tables and towards bigBed files. Both should be accurate and usable depending on your use case.
If you want to get a Table Schema for the knownGene bigBed file, you can use the bigBedInfo command-line utility:
bigBedInfo knownGene.bb -as
We appreciate you letting us know about that SQL version issue. We certainly want our public SQL server to work with the latest version of MySQL. I'll file a fix-it ticket with our engineers, but I cannot predict when it may be fixed.
I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.
All the best,
Daniel Schmelter
UCSC Genome Browser
Hi Daniel,
The last update of the MultiZ100 downloadable files did fix a lot of the errors I was experiencing due to out-of-sync transcript annotations, but I am still struggling to debug a number of remaining discrepancies between the UCSC database SQL dumps, the UCSC Table Browser and the UCSC Genome Browser (web UI). Apparently, all three tools now use different sources of transcript annotations, and these sources are not in sync. My understanding is that the hg38 assembly now uses the .bb format for its GENCODE/knownGene transcript annotation track. Unfortunately, it looks like its format now differs from the downloadable SQL schema of the same table, which is still available in the SQL dump directory here:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/
See the knownGene.sql file for the table schema. It is not clear which file/format I should use for debugging. Comparing all four different sources is a bit too much and not really feasible. What would you suggest? Which downloadable file(s) will be supported in the future? We have been using the UCSC database SQL dumps in our projects for many years with great success, but it looks like they are now being gradually superseded by other, more flexible formats?
Also, when I try to query the remote SQL database on the UCSC MariaDB server, following the instructions found here:
https://genome.ucsc.edu/goldenPath/help/mysql.html
I receive garbage in the hg38.knownGene.exonStarts and hg38.knownGene.exonEnds columns, for instance:
$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 187750
Server version: 5.5.5-10.3.27-MariaDB MariaDB Server
Copyright (c) 2000, 2021, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use hg38
Database changed
mysql> select * from knownGene where name = 'ENST00000617334.1'\G
*************************** 1. row ***************************
name: ENST00000617334.1
chrom: chr8
strand: -
txStart: 99960939
txEnd: 100106116
cdsStart: 99962438
cdsEnd: 100105921
exonCount: 30
exonStarts: 0x39393936303933392C39393936323339332C39393936323638362C39393936323838342C39393936353333342C39393937373931362C39393938313933362C39393938373435372C39393939363436312C39393939393236312C3130303030323230312C3130303030323332302C3130303030333133372C3130303030333932352C3130303030363031362C3130303030383337342C3130303033383933302C3130303033393936312C3130303034313830312C3130303034373436322C3130303035323830312C3130303036323539302C3130303036333431352C3130303036363136362C3130303037313430342C3130303037323134342C3130303038303133332C3130303039333434362C3130303130353337332C3130303130353839362C
exonEnds: 0x39393936313139362C39393936323434332C39393936323736372C39393936323937382C39393936353433302C39393937383037352C39393938323131362C39393938373631392C39393939363533302C39393939393432302C3130303030323331382C3130303030323336312C3130303030333133392C3130303030343039382C3130303030363130392C3130303030383536392C3130303033393033322C3130303034303038372C3130303034313931362C3130303034373539362C3130303035323937362C3130303036323735322C3130303036343034332C3130303036363239362C3130303037313533372C3130303037323233302C3130303038303335352C3130303039333530392C3130303130353430322C3130303130363131362C
proteinID: A0A087WV61
alignID: uc064ozd.1
1 row in set (0.07 sec)
I am using the following OS and mysql client versions:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
$ mysql --version
mysql Ver 8.0.25-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))
I suspect this is due to some sort of incompatibility between MySQL v8 (which I am using) and the MariaDB server hosted by UCSC. Do you know if it could be fixed, other than by installing MariaDB on my computer?
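The exonStarts and exonEnds values above look like the usual comma-separated coordinate lists printed as hexadecimal ASCII bytes: 0x39 is the character 9 and 0x2C is a comma. Recent MySQL 8 clients print binary columns as hex by default in interactive sessions, and starting the client with --skip-binary-as-hex may restore plain text; that is an assumption about the cause here, not something confirmed in this thread. As a client-independent fallback, a minimal Python sketch (the function name decode_hex_blob is hypothetical) can decode the values:

```python
def decode_hex_blob(value):
    """Decode a '0x...' hex string of ASCII bytes into a list of integers."""
    # Drop the optional '0x' prefix that the mysql client prepends.
    hex_digits = value[2:] if value.lower().startswith("0x") else value
    text = bytes.fromhex(hex_digits).decode("ascii")
    # The lists end with a trailing comma, so skip empty fields.
    return [int(field) for field in text.split(",") if field]


# First two entries of the exonStarts value shown above.
sample = "0x39393936303933392C39393936323339332C"
print(decode_hex_blob(sample))  # prints [99960939, 99962393]
```

The first decoded coordinate, 99960939, equals the txStart value in the same row, which suggests the stored data is intact and only its display is affected.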
Thanks,
Ivan
On 7/13/21 10:33 PM, Adzhubey, Ivan A,Ph.D. wrote:
Hi Dan,
Thank you for such a prompt fix! And please pass my thanks to the whole support team, I really appreciate your incredible work. I have downloaded updated files and will be testing them here tomorrow.
Best,
Ivan