Mapping hg19 multiz100way alignments to Ensembl transcripts

Laura Ponting

Dec 20, 2022, 6:33:16 PM

to gen...@soe.ucsc.edu

Hi,

I am using the multiz100way alignment file for GRCh37 (hg19) here: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/alignments/knownGene.exonAA.fa.gz

I want to associate the alignments with Ensembl transcripts, but to do this I need to map between the UCSC IDs and ENSTs.

I was thinking of using the knownToEnsembl table from here (https://genome.ucsc.edu/cgi-bin/hgTables) to do this, but in some cases I get multiple matches for a single ENST, with slightly different alignments.

For example

uc002vwo.2 and uc010znj.1 both point to ENST00000353578 in the knownToEnsembl file

but the alignments for the same region are slightly different.

>uc010znj.1_hg19_3_40 191 1 1 chr2:238285415-238285987-

(...)

>uc010znj.1_falChe1_3_40 191 1 1 KB397932:117752-118318+

-SKKDILFLIDGSANLLGSFPAVRDFVHKVISDLNVGPDATRVAVAQFSDTIQVEFDFAELPSKQDMLLKVKRMKIKTGKQLNIGAALDEAIRRLFVKEAGSRIEEGVPQFLVLLVAGRSTDDAEQPSDALKQAGVVTFAIKAKNADSAELERIVYAPQFILNVDSLPRISELQPNIVNLLKTI------T

>uc010znj.1_xipMac1_3_40 191 1 1 JH558540:21194-21248-;AGAJ01045604:8200-8668-;JH558540:21742-21749-

-P--DVVFLLDGSDDSRNGLPAFREFVRRMAEELDVGKDGVRLAVVQYSDDATVYFNLATHKTKKAVIYAIRALRHKGGRTRNTGAALEFVRKHVFSATSGS---QGVPQVLVVLTGGTSSDDVSSAALDLKQVGVFSFVIGMKDADQEELE-IASSSRFL-----------PZSK--NLI----NR----

>uc002vwo.2_hg19_5_42 191 1 1 chr2:238285415-238285987-

(...)

>uc002vwo.2_falChe1_5_42 191 1 1 KB397932:117752-118318+

ESKKDILFLIDGSANLLGSFPAVRDFVHKVISDLNVGPDATRVAVAQFSDTIQVEFDFAELPSKQDMLLKVKRMKIKTGKQLNIGAALDEAIRRLFVKEAGSRIEEGVPQFLVLLVAGRSTDDAEQPSDALKQAGVVTFAIKAKNADSAELERIVYAPQFILNVDSLPRISELQPNIVNLLKTI------T

>uc002vwo.2_xipMac1_5_42 191 1 1 JH558540:21194-21248-;AGAJ01045604:8200-8668-;JH558540:21742-21749-

GP--DVVFLLDGSDDSRNGLPAFREFVRRMAEELDVGKDGVRLAVVQYSDDATVYFNLATHKTKKAVIYAIRALRHKGGRTRNTGAALEFVRKHVFSATSGS---QGVPQVLVVLTGGTSSDDVSSAALDLKQVGVFSFVIGMKDADQEELE-IASSSRFL-----------PZSK--NLI----NR----
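
One way to pin down exactly where two such rows diverge is to compare them column by column. The following is a minimal sketch, not part of any UCSC tool: the function name `alignment_differences` is made up for illustration, and the two strings are only the opening residues of the falChe1 rows above, not the full rows.

```python
from typing import List, Tuple


def alignment_differences(row_a: str, row_b: str) -> List[Tuple[int, str, str]]:
    """Return (0-based column, residue in row_a, residue in row_b) for every mismatch."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return [
        (column, a, b)
        for column, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]


# Opening residues of the falChe1 rows for uc010znj.1 and uc002vwo.2.
prefix_uc010znj = "-SKKDILFLIDGS"
prefix_uc002vwo = "ESKKDILFLIDGS"
differences = alignment_differences(prefix_uc010znj, prefix_uc002vwo)
```

For these prefixes the only mismatch is the first column, a gap in one row and a residue in the other.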

How do I decide which entry to use for each Ensembl transcript?

Thanks,

Laura


Luis Nassar

Dec 23, 2022, 8:23:08 PM

to Laura Ponting, gen...@soe.ucsc.edu

Hello, Laura.

Thank you for your interest in the Genome Browser.

There isn't a simple answer to your question. Both the UCSC IDs and the ENST IDs are transcript annotations, so genes often have multiple isoforms in each region. Here is a session of that area (http://genome.ucsc.edu/s/Lou/RM30410); you'll see that there are 6 UCSC transcripts and 5 Ensembl transcripts. When the knownToEnsembl table was built, it used sequence similarity to declare the association. However, since the UCSC and Ensembl annotations were generated independently, there is not always a perfect 1:1 match.
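
To see how often this happens in your own data, one option is to group the knownToEnsembl rows by ENST and list the ones that have more than one UCSC ID. This is a minimal sketch, assuming the table was exported from the Table Browser as tab-separated text with the UCSC ID in the first column and the ENST in the second. The function names are made up for illustration, and the row `ucEXAMPLE.1` / `ENST_EXAMPLE` is a placeholder, not a real mapping.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def ucsc_ids_by_enst(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group UCSC transcript IDs by the ENST they map to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#'-prefixed header row (assumed export format).
        if not line or line.startswith("#"):
            continue
        ucsc_id, enst = line.split("\t")[:2]
        grouped[enst].append(ucsc_id)
    return dict(grouped)


def ambiguous_ensts(grouped: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the ENSTs that have more than one UCSC ID."""
    return {enst: ids for enst, ids in grouped.items() if len(ids) > 1}


rows = [
    "#name\tvalue",
    "uc002vwo.2\tENST00000353578",
    "uc010znj.1\tENST00000353578",
    "ucEXAMPLE.1\tENST_EXAMPLE",  # placeholder row, not a real mapping
]
grouped = ucsc_ids_by_enst(rows)
ambiguous = ambiguous_ensts(grouped)
```

With a real export you would pass the open file handle instead of `rows`. The sketch only flags the ambiguous ENSTs; it does not choose between the candidate UCSC transcripts.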

One solution, if you are able to work with hg38 data, is to use the more recent alignment, which was built directly with Ensembl transcripts. The hg19 version of that track was built in 2014, whereas the hg38 100-way was generated in 2019, after we began using ENST IDs as our default gene annotations.

You can find the equivalent file here (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/alignments/knownCanonical.exonAA.fa.gz). Looking inside, you'll see the ENST transcript IDs directly in the sequence headers, e.g.

>ENST00000641515.2_hg38_1_2 3 0 0 chr1:65565-65573+
MKK
>ENST00000641515.2_panTro4_1_2 3 0 0 chrUn_GL393541:146907-146915+
MKK
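
If it helps, the transcript ID can be pulled out of these header lines with a few lines of Python. This is a minimal sketch based only on the header shape visible in this thread: an ID of the form transcript_assembly_exonNumber_exonCount, a number that matches the sequence length in the examples shown, two further numeric fields that are kept as raw strings here, and one or more `;`-separated positions. The names `ExonHeader` and `parse_exon_header` are made up for illustration.

```python
import re
from typing import List, NamedTuple, Tuple


class ExonHeader(NamedTuple):
    transcript: str
    assembly: str
    exon_number: int
    exon_count: int
    length: int
    other_fields: Tuple[str, ...]  # numeric fields left uninterpreted
    segments: List[Tuple[str, int, int, str]]  # (sequence, start, end, strand)


# One position segment, e.g. "chr1:65565-65573+".
SEGMENT_RE = re.compile(r"^(.+):(\d+)-(\d+)([+-])$")


def parse_exon_header(line: str) -> ExonHeader:
    """Split an exonAA FASTA header into its parts (assumed layout, see above)."""
    ident, length, *other, position = line.strip().lstrip(">").split()
    # Split from the right so underscores in the transcript ID would survive.
    transcript, assembly, exon_number, exon_count = ident.rsplit("_", 3)
    segments = []
    for segment in position.split(";"):
        match = SEGMENT_RE.match(segment)
        if match is None:
            raise ValueError(f"unrecognised position: {segment!r}")
        chrom, start, end, strand = match.groups()
        segments.append((chrom, int(start), int(end), strand))
    return ExonHeader(
        transcript, assembly, int(exon_number), int(exon_count),
        int(length), tuple(other), segments,
    )


header = parse_exon_header(">ENST00000641515.2_hg38_1_2 3 0 0 chr1:65565-65573+")
```

Here `header.transcript` is `ENST00000641515.2`, so sequences can be keyed by ENST without going through knownToEnsembl. Headers that do not follow the layout above raise an error rather than being silently misparsed.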

I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute
