Multiz 100 hg38 CDS alignments transcript versions mismatch

Adzhubey, Ivan A,Ph.D.

Jul 6, 2021, 2:16:12 PM

to gen...@soe.ucsc.edu

Hi,

Any plans to update Multiz 100 alignments for hg38 CDS regions?

The version currently available for download was, I believe, compiled
using GENCODE v29, while the current hg38 knownGene track is based on
GENCODE v37 (and counting). There are numerous discrepancies between the
two sets of transcripts, making the CDS Multiz alignments hard to use
without introducing significant bias.

Thanks,

Ivan

--
Ivan Adzhubey, Ph.D.
Instructor
Division of Genetics, Dept of Medicine
Brigham & Women's Hospital
Harvard Medical School
New Research Building, Room 0464C
77 Avenue Louis Pasteur
Boston, MA 02115
tel.: (617) 525-4728
fax: (617) 525-4705
web: http://genetics.bwh.harvard.edu/wiki/sunyaevlab/

Dan Schmelter

Jul 6, 2021, 8:44:46 PM

to Adzhubey, Ivan A,Ph.D., gen...@soe.ucsc.edu

Hello Ivan,

Thank you for writing to us to report your experience with the conservation data.

I have added your request to update the reference GENCODE file to our ticketing system, but it is not something I anticipate being completed in the near future. For reference, could you provide an example gene or session that shows this mismatch?

You should be able to work around this by displaying our current knownGene, which is GENCODE v36, and zooming in enough to see the GENCODE track's amino acid displays. I understand that this may not be ideal, but hopefully, it is sufficient to interpret the amino acid change.

Thank you for writing in and for your suggestion. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.
Daniel Schmelter
UCSC Genome Browser

Dan Schmelter

Jul 13, 2021, 3:48:51 PM

to Adzhubey, Ivan A,Ph.D., UCSC Genome Browser Support

Hello Ivan,

Thanks again for contacting us regarding these files that were not up-to-date with knownGene.

We have rebuilt the download files with the latest knownGene version (GENCODE v36). You can access them at the following link:

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/alignments/

Let us know if you have any further concerns or difficulties.

All the best,

Daniel Schmelter
UCSC Genome Browser

On Tue, Jul 6, 2021 at 6:59 PM Adzhubey, Ivan A,Ph.D. <iadz...@rics.bwh.harvard.edu> wrote:

Hi Dan,

Thank you for the prompt response. It looks like I was not clear about my problem, sorry. I was talking about the downloadable tracks, not the ones displayed by the Genome Browser website. Specifically, there is a file containing Multiz alignments of 99 vertebrate species against the hg38 human genome for the CDS sequences of all protein-coding transcripts annotated in the knownGene track, found here:

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/alignments/knownGene.exonAA.fa.gz

Other files in this directory include the knownCanonical subset of the above, as well as the corresponding nucleotide sequence alignments.

These files were last updated in October 2019 and hence use the knownGene annotations of that time, which were apparently based on GENCODE v29, although this is not specified anywhere in the documentation. The current version of the (downloadable) hg38 knownGene track is likely based on GENCODE v36/v37 (also not documented). In any case, the two versions of the transcript annotations have now diverged by at least 5% (total fraction of new/updated/removed transcript annotations). I normally use the latest available version of the knownGene track (we have a local SQL mirror of the UCSC database), but unfortunately it now produces many errors when I try to map CDS sequences from the knownGene.exonAA.fa.gz file to the exon definitions in the updated knownGene track. One example is ENST00000617334.1, which encodes an AA sequence 3 times longer in the current version than in the knownGene.exonAA.fa.gz file. There are also numerous transcripts whose version number has been incremented, transcripts that have been removed, etc. Mismatched transcripts with the same ID and version number, like ENST00000617334.1, are the worst case since I cannot easily filter them out. Filtering is also not a proper way to deal with this problem, since it introduces sampling bias into the set of filtered transcripts.
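The categories of divergence described above (unchanged, same ID and version but different content, version incremented, removed, added) can be sketched as a small classifier. This is only an illustration, not the procedure used in this thread: the function names `split_version` and `compare_annotations`, the input layout (a mapping from versioned transcript ID to protein length), and all lengths in the example are hypothetical.

```python
from collections import defaultdict


def split_version(tx_id):
    """Split a versioned ID such as 'ENST00000617334.1' into (base, version).

    Assumes every ID carries a '.<integer>' version suffix.
    """
    base, _, version = tx_id.rpartition(".")
    return base, int(version)


def compare_annotations(old, new):
    """Classify transcripts between two annotation releases.

    `old` and `new` map versioned transcript IDs to protein lengths
    (hypothetical input layout, chosen for illustration).
    """
    report = defaultdict(list)
    new_bases = {split_version(tx_id)[0] for tx_id in new}
    old_bases = {split_version(tx_id)[0] for tx_id in old}

    for tx_id, old_len in old.items():
        base, _ = split_version(tx_id)
        if tx_id in new:
            # Same ID and version: only a content check can reveal a mismatch.
            kind = "unchanged" if new[tx_id] == old_len else "length_mismatch"
        elif base in new_bases:
            kind = "version_changed"
        else:
            kind = "removed"
        report[kind].append(tx_id)

    for tx_id in new:
        if split_version(tx_id)[0] not in old_bases:
            report["added"].append(tx_id)
    return dict(report)


if __name__ == "__main__":
    # Illustrative values only; not taken from the real files.
    old = {"ENST00000617334.1": 100, "ENST00000000001.2": 50}
    new = {"ENST00000617334.1": 300, "ENST00000000001.3": 50}
    print(compare_annotations(old, new))
```

Only the `length_mismatch` category needs sequence-level information. The others can be found from the IDs alone, which is why same-ID, same-version mismatches are the hardest to detect.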

If you know of some way to version-control the (downloadable) knownGene track and go back in time to October 2019, that would work as a solution too. Unfortunately, we do not keep copies of our local SQL mirror of the UCSC database for that long.

Once again, I am not interested in the web version of the tracks; I am talking about the UCSC-provided downloads. You have actually done this kind of update for the Multiz 100 downloadable files at least once before, after their initial release in 2018, so hopefully this should not be too hard to repeat. How to solve this discrepancy problem in the long run, however, I do not know.

Thanks,

Ivan

Dan Schmelter

Jul 29, 2021, 8:43:07 PM

to Adzhubey, Ivan A,Ph.D., UCSC Genome Browser Support

Hello Ivan,

Thanks for your patience with this response.

The knownGene track is a periodically updating dataset that uses a few different sources to generate the details page for each item and speed up the track loading itself. These sources include the SQL table you mentioned and a bigBed file. Each source has slightly different columns but relies on the same core data, currently GENCODE v36 for hg38. Each source has its own advantages and disadvantages; each might be suitable for different use cases. The knownGene SQL table has the basic data, while the bigBed has even more information. In general, we are transitioning away from SQL tables and towards bigBed files. Both should be accurate and usable depending on your use case.

If you want to get a Table Schema for the knownGene bigBed file, you can use the bigBedInfo command-line utility:

bigBedInfo knownGene.bb -as

We appreciate you letting us know about that SQL version issue. We certainly want our public SQL server to work with the latest version of MySQL. I'll file a fix-it ticket with our engineers, but I cannot predict when it may be fixed.

I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.

All the best,

Daniel Schmelter
UCSC Genome Browser

On Wed, Jul 21, 2021 at 9:54 PM Adzhubey, Ivan A,Ph.D. <iadz...@rics.bwh.harvard.edu> wrote:

Hi Daniel,

The last update of the Multiz 100 downloadable files did fix a lot of the errors I was experiencing due to out-of-sync transcript annotations, but I am still struggling to debug a number of remaining discrepancies between the UCSC database SQL dumps, the UCSC Table Browser and the UCSC Genome Browser (web UI). Apparently, all three tools are now using different sources of transcript annotations and these sources are not in sync. My understanding is that the hg38 assembly now uses the .bb format for its GENCODE/knownGene transcript annotations track. Unfortunately, it looks like its format is now different from the downloadable SQL schema of the same table, still available in the SQL dump directory here:

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

See the knownGene.sql file for the table schema. It is not clear which file/format I should use for debugging. Comparing all four different sources is a bit too much, not really feasible. What would you suggest? Which downloadable file(s) will be supported in the future? We've been using UCSC database SQL dumps in our projects for many years with great success, but it looks like they are now being gradually superseded by other, more flexible formats?

Also, when I try to query the remote SQL database on the UCSC MariaDB server, following the instructions found here:

https://genome.ucsc.edu/goldenPath/help/mysql.html

I receive garbage in the hg38.knownGene.exonStarts and hg38.knownGene.exonEnds columns, for instance:

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 187750
Server version: 5.5.5-10.3.27-MariaDB MariaDB Server

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> use hg38
Database changed
mysql> select * from knownGene where name = 'ENST00000617334.1'\G
*************************** 1. row ***************************
      name: ENST00000617334.1
     chrom: chr8
    strand: -
   txStart: 99960939
     txEnd: 100106116
  cdsStart: 99962438
    cdsEnd: 100105921
 exonCount: 30
exonStarts: 0x39393936303933392C39393936323339332C39393936323638362C39393936323838342C39393936353333342C39393937373931362C39393938313933362C39393938373435372C39393939363436312C39393939393236312C3130303030323230312C3130303030323332302C3130303030333133372C3130303030333932352C3130303030363031362C3130303030383337342C3130303033383933302C3130303033393936312C3130303034313830312C3130303034373436322C3130303035323830312C3130303036323539302C3130303036333431352C3130303036363136362C3130303037313430342C3130303037323134342C3130303038303133332C3130303039333434362C3130303130353337332C3130303130353839362C
exonEnds: 0x39393936313139362C39393936323434332C39393936323736372C39393936323937382C39393936353433302C39393937383037352C39393938323131362C39393938373631392C39393939363533302C39393939393432302C3130303030323331382C3130303030323336312C3130303030333133392C3130303030343039382C3130303030363130392C3130303030383536392C3130303033393033322C3130303034303038372C3130303034313931362C3130303034373539362C3130303035323937362C3130303036323735322C3130303036343034332C3130303036363239362C3130303037313533372C3130303037323233302C3130303038303335352C3130303039333530392C3130303130353430322C3130303130363131362C
 proteinID: A0A087WV61
   alignID: uc064ozd.1
1 row in set (0.07 sec)

I am using the following OS and mysql client versions:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

$ mysql --version
mysql Ver 8.0.25-0ubuntu0.20.04.1 for Linux on x86_64 ((Ubuntu))

I suspect this is due to some sort of incompatibility between MySQL v8 (which I am using) and the MariaDB server hosted by UCSC. Do you know if it could be fixed, other than by installing MariaDB on my computer?
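Note on the output above: the exonStarts and exonEnds values appear to be the expected comma-separated coordinates rendered as hexadecimal ASCII, not corrupted data. One possible explanation, not confirmed in this thread, is the mysql client's --binary-as-hex display option, which recent MySQL 8.0 clients enable by default in interactive sessions. If so, starting the client with --skip-binary-as-hex may restore the plain-text display. Output that has already been captured can be decoded with a minimal sketch like the one below; the function name `decode_hex_list` is hypothetical.

```python
def decode_hex_list(value):
    """Decode a hex-rendered blob such as '0x39392C' into a list of ints.

    Assumes the blob holds ASCII digits separated (and possibly
    terminated) by commas, as in the exonStarts/exonEnds columns.
    """
    text = value[2:] if value.lower().startswith("0x") else value
    decoded = bytes.fromhex(text).decode("ascii")
    return [int(field) for field in decoded.split(",") if field]


if __name__ == "__main__":
    # First two entries of the exonStarts value shown above.
    print(decode_hex_list("0x39393936303933392C39393936323339332C"))
```

The first decoded coordinate matches the txStart value in the same row, which supports reading the column as hex-encoded text.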

Thanks,

Ivan

On 7/13/21 10:33 PM, Adzhubey, Ivan A,Ph.D. wrote:

Hi Dan,

Thank you for such a prompt fix! And please pass my thanks to the whole support team, I really appreciate your incredible work. I have downloaded updated files and will be testing them here tomorrow.

Best,

Ivan
