Duplicate transcripts in hg19 knownGene track


Adzhubey, Ivan A.

Feb 13, 2025, 1:37:07 AM
to gen...@soe.ucsc.edu
Hi,

I was doing some QC on recently downloaded SQL files with GB hg19 tracks
and noticed that the knownGene track contains some identical transcript
names mapped to different loci, e.g.:

$ zgrep ENST00000577553.1 ~/lab/data/ucsc/hg19/database/knownGene.txt.gz
ENST00000577553.1       chrX    -       425315  425416  425315 425315 1       425315, 425416,         uc326fjf.1
ENST00000577553.1       chrY    -       375315  375416  375315 375315 1       375315, 375416,         uc326fjf.1
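
The same QC check can be run on the dump file directly, without loading it into MySQL. Below is a minimal Python sketch, assuming the standard tab-separated knownGene column order (name, chrom, strand, txStart, txEnd, ...). The function names are made up for illustration, and the file path is whatever your local copy is.

```python
import gzip
from collections import defaultdict


def duplicate_names(lines):
    """Return {name: [(chrom, txStart, txEnd), ...]} for names seen on more than one row.

    `lines` is any iterable of tab-separated knownGene rows (assumed column
    order: name, chrom, strand, txStart, txEnd, ...).
    """
    loci = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        name, chrom, _strand, tx_start, tx_end = fields[:5]
        loci[name].append((chrom, int(tx_start), int(tx_end)))
    # Keep only transcript names that occur on two or more rows.
    return {name: rows for name, rows in loci.items() if len(rows) > 1}


def duplicate_names_in_gz(path):
    """Convenience wrapper (hypothetical helper) for a gzipped dump such as knownGene.txt.gz."""
    with gzip.open(path, "rt") as handle:
        return duplicate_names(handle)
```

Calling `duplicate_names_in_gz()` on the downloaded `knownGene.txt.gz` should list every repeated name together with its loci, which makes it easy to see whether the copies sit on different chromosomes.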

Are these genuine two-locus gene duplicates, or are they GENCODE liftOver
bugs? Sorry if this is a stupid question. I understand that in this
particular example the second mapping could be in the pseudoautosomal
region (PAR) of chrY. I just did not expect GB/GENCODE to use exactly the
same identifier (transcript name) for two distinct physical copies of the
gene. After all, even if both copies have identical (or highly similar)
sequence, they are still two different genes, not one. I could not find
any documentation on how transcript names are assigned in such cases.
Could you enlighten me?

Thank you,

Ivan


--
Ivan Adzhubey, Ph.D.
Research Associate
Dept of Biomedical Informatics
Harvard Medical School
10 Shattuck Street, Suite 514
Boston, MA 02115
tel: (617) 432-2144
fax: (617) 432-0693
web: https://sunyaevlab.hms.harvard.edu/wiki/!web

Adzhubey, Ivan A.

Feb 13, 2025, 7:51:31 PM
to gen...@soe.ucsc.edu
Hi,

I've done some additional debugging on the knownGene table and here's
what I found.

It looks like there are only 7 such duplicated transcript names among the
381,937 rows in the table:

mysql> SELECT name, COUNT(*) FROM knownGene GROUP BY name HAVING
COUNT(*) > 1;
+-------------------+----------+
| name              | COUNT(*) |
+-------------------+----------+
| ENST00000577553.1 |        2 |
| ENST00000577896.1 |        2 |
| ENST00000578699.1 |        2 |
| ENST00000580266.1 |        2 |
| ENST00000580687.1 |        2 |
| ENST00000581137.1 |        2 |
| ENST00000583047.1 |        2 |
+-------------------+----------+
7 rows in set (0.17 sec)

Also, all seven of them are mapped to both chrX and chrY:

mysql> SELECT DISTINCT g1.name, g1.chrom FROM knownGene g1 JOIN
knownGene g2 ON g1.name = g2.name AND g1.chrom <> g2.chrom ORDER BY
g1.name;
+-------------------+-------+
| name              | chrom |
+-------------------+-------+
| ENST00000577553.1 | chrX  |
| ENST00000577553.1 | chrY  |
| ENST00000577896.1 | chrX  |
| ENST00000577896.1 | chrY  |
| ENST00000578699.1 | chrX  |
| ENST00000578699.1 | chrY  |
| ENST00000580266.1 | chrX  |
| ENST00000580266.1 | chrY  |
| ENST00000580687.1 | chrX  |
| ENST00000580687.1 | chrY  |
| ENST00000581137.1 | chrX  |
| ENST00000581137.1 | chrY  |
| ENST00000583047.1 | chrX  |
| ENST00000583047.1 | chrY  |
+-------------------+-------+
14 rows in set (5.89 sec)
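
The two queries above can probably be folded into a single aggregate, which avoids the self-join. The sketch below runs that idea against an in-memory SQLite table from Python so it is self-contained. The table is a cut-down stand-in for knownGene, the `HYPOTHETICAL.1` row is made up, and the same SQL is expected (but not verified here) to work in MySQL.

```python
import sqlite3

# One pass over the table: row count, number of distinct chromosomes, and the
# lexically smallest/largest chromosome for every name that occurs more than once.
QUERY = """
SELECT name,
       COUNT(*)              AS n_rows,
       COUNT(DISTINCT chrom) AS n_chroms,
       MIN(chrom)            AS first_chrom,
       MAX(chrom)            AS last_chrom
FROM knownGene
GROUP BY name
HAVING COUNT(*) > 1
ORDER BY name
"""


def multi_row_names(rows):
    """Load (name, chrom, txStart, txEnd) tuples into a toy table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE knownGene "
            "(name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER)"
        )
        conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


toy_rows = [
    ("ENST00000577553.1", "chrX", 425315, 425416),  # from the zgrep output
    ("ENST00000577553.1", "chrY", 375315, 375416),  # from the zgrep output
    ("HYPOTHETICAL.1", "chr1", 100, 200),           # made-up single-locus row
]
print(multi_row_names(toy_rows))
```

With these three toy rows the query returns one result row for `ENST00000577553.1`, with two rows spread over two chromosomes (chrX and chrY).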

No other transcripts mapped to multiple loci were found in hg19
knownGene. Note that searching Ensembl Browser v113 shows these
transcripts as obsolete and removed. They are also missing from hg38
knownGene.

Looks like a bug to me.

Best,

Ivan

Galt Barber

Feb 13, 2025, 7:55:35 PM
to Adzhubey, Ivan A., gen...@soe.ucsc.edu
Hi, Ivan!

As you guessed, these are due to the pseudoautosomal regions (PARs) at the ends of the X and Y chromosomes.

Galt Barber
UCSC Genome Browser staff.
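
The PAR explanation can be checked arithmetically against the two rows quoted earlier in the thread. The sketch below assumes the commonly cited GRCh37/hg19 PAR boundaries (PAR1 at chrX:60,001-2,699,520 and chrY:10,001-2,649,520; PAR2 at chrX:154,931,044-155,260,560 and chrY:59,034,050-59,363,566), written here as 0-based half-open intervals. Please verify these boundaries against the assembly documentation before relying on them. The helper names are hypothetical.

```python
# Assumed hg19/GRCh37 PAR boundaries, 0-based half-open (BED-style) coordinates.
PAR_HG19 = {
    "PAR1": {"chrX": (60000, 2699520), "chrY": (10000, 2649520)},
    "PAR2": {"chrX": (154931043, 155260560), "chrY": (59034049, 59363566)},
}


def par_position(chrom, start, end):
    """Return (PAR name, offset of `start` from the PAR start), or None.

    None means the interval is not fully contained in a PAR on `chrom`.
    """
    for par_name, spans in PAR_HG19.items():
        span = spans.get(chrom)
        if span is not None and span[0] <= start and end <= span[1]:
            return par_name, start - span[0]
    return None


def same_par_locus(locus_a, locus_b):
    """True if both (chrom, start, end) loci sit at the same offset in the same PAR."""
    pos_a = par_position(*locus_a)
    pos_b = par_position(*locus_b)
    return pos_a is not None and pos_a == pos_b


# The two ENST00000577553.1 rows from the zgrep output in this thread.
x_copy = ("chrX", 425315, 425416)
y_copy = ("chrY", 375315, 375416)
print(par_position(*x_copy), par_position(*y_copy), same_par_locus(x_copy, y_copy))
```

Under those assumed boundaries both copies fall in PAR1 at the same offset from the PAR start. The chrX and chrY coordinates differ by exactly 50,000 (425315 - 375315), which is also the difference between the two PAR1 starts (60,001 - 10,001).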




Gerardo Perez

Feb 14, 2025, 7:46:20 PM
to Adzhubey, Ivan A., gen...@soe.ucsc.edu

Hello, Ivan.

Thank you again for reaching out to the UCSC Genome Browser support team and bringing this to our attention.

We wanted to follow up and let you know that the duplicate transcripts in the hg19 knownGene track are due to a data processing issue and can be safely ignored. We have reported this issue to EBI.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Gerardo Perez
UCSC Genomics Institute

