Repeated transcript IDs

FRANCISCO RODRIGUEZ MARTIN

Feb 20, 2025, 12:00:50 PM

to gen...@soe.ucsc.edu

Good morning,

I am writing because I am having trouble using the GTF annotation file generated by the Table Browser with the following settings:

-Assembly: GRCh38/hg38

-Group: Repeats

-Track: RepeatMasker

I do not know why, but many transcript IDs are duplicated. For example, there are many copies of "AluY_dup1000", located at different genomic positions and of varying length. This is a problem for the downstream analysis: at the quantification step, reads assigned to a given transcript ID could belong to any of those copies, and it is not possible to tell which one. I also tried downloading this file: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz, but I could not import it into R in a format that the Rsubread package accepts.

Best regards and thanks for your attention,

Fran

Gerardo Perez

Mar 1, 2025, 4:14:23 PM

to FRANCISCO RODRIGUEZ MARTIN, gen...@soe.ucsc.edu

Hello Fran.

Thank you for using the UCSC Genome Browser and for sending your inquiry.

To better assist you, could you share more details about why you need a GTF file for the RepeatMasker track data? Since this track consists of repetitive elements, encountering repeated IDs is expected. The RepeatMasker track does not include transcripts. You can find more information about the RepeatMasker track on its description page: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&position=default&g=rmsk#TRACK_HTML

Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Gerardo Perez
UCSC Genomics Institute

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/DB7PR01MB4918901310B1C39DE0E0D64FF1C42%40DB7PR01MB4918.eurprd01.prod.exchangelabs.com.

FRANCISCO RODRIGUEZ MARTIN

Mar 7, 2025, 12:23:27 PM

to Gerardo Perez, gen...@soe.ucsc.edu

Good morning,

I will show you the table header. There is one "gene_id" column and one "transcript_id" column. When I look at both columns, I can see that some "gene_id" values have many "transcript_id" values, which seem to be variants of the same repetitive sequence, located at different positions in the genome and with different lengths.

head(repeat_annotation_gtf)
  seqid    source type    start      end score strand phase gene_id transcript_id
1  chr1 hg38_rmsk exon 67108754 67109046  1892      +    NA    L1P5          L1P5
2  chr1 hg38_rmsk exon  8388316  8388618  2582      -    NA    AluY          AluY
3  chr1 hg38_rmsk exon 25165804 25166380  4085      +    NA   L1MB5         L1MB5
4  chr1 hg38_rmsk exon 33554186 33554483  2285      -    NA   AluSc         AluSc
5  chr1 hg38_rmsk exon 41942895 41943205  2451      -    NA    AluY     AluY_dup1
6  chr1 hg38_rmsk exon 50331337 50332274  1587      +    NA    HAL1          HAL1

Best regards,

Francisco

From: Gerardo Perez <gpe...@ucsc.edu>
Sent: Saturday, March 1, 2025 22:14
To: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Cc: gen...@soe.ucsc.edu <gen...@soe.ucsc.edu>
Subject: Re: [genome] Repeated transcripts ids

Jairo Navarro Gonzalez

Mar 13, 2025, 6:55:21 PM

to FRANCISCO RODRIGUEZ MARTIN, Gerardo Perez, gen...@soe.ucsc.edu

Hello,

Thank you for using the UCSC Genome Browser and sending your follow-up inquiry.

Can you explain a bit more about what you are trying to do with the UCSC Genome Browser? The repeats in the RepeatMasker track are composed of different classes of repeats and are found throughout the genome. These repetitive regions are not transcripts and can be variable in length. Here is a wiki page with more information about repetitive regions in the genome:

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)

The complete list of repeat classes shown on the RepeatMasker track can be found on the track description page:

https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=rmsk

If you are trying to map reads to transcripts, we recommend that you use a gene track such as NCBI RefSeq or GENCODE for hg38.

https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=refSeqComposite
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=knownGene

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.

If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser

FRANCISCO RODRIGUEZ MARTIN

Apr 8, 2025, 12:05:59 PM

to gen...@soe.ucsc.edu

From: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Sent: Tuesday, April 8, 2025 14:35
To: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Subject: RE: [genome] Repeated transcripts ids

Good afternoon Jairo,

I forgot to reply. The GTF files I downloaded contain a specific transcript_id for each feature. In other words, each feature has an associated gene_id and transcript_id.

I have another question about the format: are all the sequences associated with exon regions?

Best regards,

Fran

From: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Sent: Wednesday, March 19, 2025 23:26
To: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>; genome-www list <genom...@soe.ucsc.edu>
Subject: Re: [genome] Repeated transcripts ids

Hello,

Thank you for sending your follow-up with more information.

It sounds like you want to map the RNA-seq reads to the various repeats in the genome using the RepeatMasker track. The GTF that we provide via the Table Browser will not have unique IDs. However, you can make the IDs unique by adding the chromosome position to the ID using a Python script or command-line tools like AWK. For example, you can create a script to edit and generate the following GTF:

  seqid    source type    start      end score strand phase gene_id                    transcript_id
1  chr1 hg38_rmsk exon 67108754 67109046  1892      +    NA    L1P5      L1P5_chr1_67108754_67109046
2  chr1 hg38_rmsk exon  8388316  8388618  2582      -    NA    AluY        AluY_chr1_8388316_8388618
3  chr1 hg38_rmsk exon 25165804 25166380  4085      +    NA   L1MB5     L1MB5_chr1_25165804_25166380
4  chr1 hg38_rmsk exon 33554186 33554483  2285      -    NA   AluSc     AluSc_chr1_33554186_33554483
5  chr1 hg38_rmsk exon 41942895 41943205  2451      -    NA    AluY AluY_dup1_chr1_41942895_41943205
6  chr1 hg38_rmsk exon 50331337 50332274  1587      +    NA    HAL1      HAL1_chr1_50331337_50332274

Unfortunately, assistance with scripting is beyond the scope of this mailing list. You may also want to post your question on other bioinformatics forums, such as BioStars (https://www.biostars.org/), to get assistance from other members of the scientific community. You can also ask ChatGPT to generate a script that combines the fields into the transcript ID to make them unique.
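As an illustration only, here is a minimal Python sketch of the renaming described above. It assumes the Table Browser output is a standard nine-column, tab-separated GTF whose last column holds attributes of the form gene_id "..."; transcript_id "...";. The function name make_ids_unique and the two sample lines are hypothetical, and the result is only unique as long as no two rows share the same name, chromosome, start and end.

```python
import re

# Matches the transcript_id attribute in the ninth (attributes) column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def make_ids_unique(gtf_lines):
    """Yield GTF lines with chromosome, start and end appended to transcript_id."""
    for line in gtf_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        # Comments, track lines and anything without 9 columns pass through unchanged.
        if line.startswith("#") or len(fields) != 9:
            yield line
            continue
        chrom, start, end = fields[0], fields[3], fields[4]
        fields[8] = TRANSCRIPT_ID.sub(
            lambda m: f'transcript_id "{m.group(1)}_{chrom}_{start}_{end}"',
            fields[8],
        )
        yield "\t".join(fields)


# Two made-up GTF lines modeled on the rows above (tab-separated).
example = [
    'chr1\thg38_rmsk\texon\t8388316\t8388618\t2582\t-\t.\tgene_id "AluY"; transcript_id "AluY";',
    'chr1\thg38_rmsk\texon\t41942895\t41943205\t2451\t-\t.\tgene_id "AluY"; transcript_id "AluY_dup1";',
]
for fixed in make_ids_unique(example):
    print(fixed)
```

For a real file, the same function can be fed an open file handle and each yielded line written back out with a trailing newline. The gene_id attribute is left untouched, so counts can still be summarized per repeat name if that is what the analysis needs.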

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser

On Fri, Mar 14, 2025 at 1:04 AM FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es> wrote:

I did not mention it, but my data comes from an RNA-seq study.

Fran

From: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Sent: Friday, March 14, 2025 8:27
To: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Subject: RE: [genome] Repeated transcripts ids

Good morning,

What I am trying to do is map the repetitive sequences present in my FASTQ files, using the human genome as the reference. When I use the RepeatMasker (rmsk) track as a GTF annotation file to quantify these reads after mapping, I get a "gene_id" and a "transcript_id", as I showed in my previous email. My guess is that AluY, for example, can have different lengths and occur at different locations in the genome, and that those copies are the "AluY_dupX" entries. The following table shows AluY_dup1, for example:

head(repeat_annotation_gtf)
  seqid    source type    start      end score strand phase gene_id transcript_id
1  chr1 hg38_rmsk exon 67108754 67109046  1892      +    NA    L1P5          L1P5
2  chr1 hg38_rmsk exon  8388316  8388618  2582      -    NA    AluY          AluY
3  chr1 hg38_rmsk exon 25165804 25166380  4085      +    NA   L1MB5         L1MB5
4  chr1 hg38_rmsk exon 33554186 33554483  2285      -    NA   AluSc         AluSc
5  chr1 hg38_rmsk exon 41942895 41943205  2451      -    NA    AluY     AluY_dup1
6  chr1 hg38_rmsk exon 50331337 50332274  1587      +    NA    HAL1          HAL1

Thanks for your attention and best regards,

Fran

From: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Sent: Thursday, March 13, 2025 23:55
To: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Cc: Gerardo Perez <gpe...@ucsc.edu>; gen...@soe.ucsc.edu <gen...@soe.ucsc.edu>

FRANCISCO RODRIGUEZ MARTIN

Apr 8, 2025, 2:27:02 PM

to gen...@soe.ucsc.edu

I sent this question to Jairo, but he asked me to resend it to this address.

Best regards,

Francisco

Matthew Speir

Apr 9, 2025, 12:47:57 PM

to FRANCISCO RODRIGUEZ MARTIN, gen...@soe.ucsc.edu

Hello, Francisco.

You can safely ignore that column in this case. This is a consequence of how the data for the RepeatMasker track is structured in our database. The GTF output from the Table Browser simply labels every item as an "exon"; the label does not mean that those items are associated with exons.

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---

Matthew Speir

UCSC Genome Browser, User Support
