Repeated transcript IDs


FRANCISCO RODRIGUEZ MARTIN

Feb 20, 2025, 12:00:50 PM
to gen...@soe.ucsc.edu
Good morning,

I am writing because I am having problems using the GTF annotation file generated by the Table Browser with the following settings:

-Assembly: GRCh38/hg38
-Group: Repeats
-Track: RepeatMasker

I do not know why, but many transcript IDs are duplicated. For example, there are many copies of "AluY_dup1000", located at different genomic positions and varying in length. This is a problem for the downstream analysis: at the quantification step, reads assigned to a given transcript ID could come from any of those copies, and it is not possible to tell which one. I also tried downloading this file: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz, but I could not import it into R in a format that the Rsubread package accepts.

Best regards and thanks for your attention,

Fran

Gerardo Perez

Mar 1, 2025, 4:14:23 PM
to FRANCISCO RODRIGUEZ MARTIN, gen...@soe.ucsc.edu

Hello Fran.

Thank you for using the UCSC Genome Browser and for sending your inquiry.

To better assist you, could you share more details about why you need a GTF file for the RepeatMasker track data? Since this track consists of repetitive elements, encountering repeated IDs is expected. The RepeatMasker track does not include transcripts. You can find more information about the RepeatMasker track on its description page: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&position=default&g=rmsk#TRACK_HTML

Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Gerardo Perez
UCSC Genomics Institute



FRANCISCO RODRIGUEZ MARTIN

Mar 7, 2025, 12:23:27 PM
to Gerardo Perez, gen...@soe.ucsc.edu
Good morning,

Here is the table header. There is one "gene_id" column and one "transcript_id" column. When I analyse both columns, I can see that some "gene_id" values have many "transcript_id" values, and these seem to be variants of the same repetitive sequence, located at different positions in the genome and with different lengths.

head(repeat_annotation_gtf)
  seqid    source type    start      end score strand phase gene_id transcript_id
1  chr1 hg38_rmsk exon 67108754 67109046  1892      +    NA    L1P5          L1P5
2  chr1 hg38_rmsk exon  8388316  8388618  2582      -    NA    AluY          AluY
3  chr1 hg38_rmsk exon 25165804 25166380  4085      +    NA   L1MB5         L1MB5
4  chr1 hg38_rmsk exon 33554186 33554483  2285      -    NA   AluSc         AluSc
5  chr1 hg38_rmsk exon 41942895 41943205  2451      -    NA    AluY     AluY_dup1
6  chr1 hg38_rmsk exon 50331337 50332274  1587      +    NA    HAL1          HAL1


Best regards,

Francisco 



From: Gerardo Perez <gpe...@ucsc.edu>
Sent: Saturday, March 1, 2025 22:14
To: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Cc: gen...@soe.ucsc.edu <gen...@soe.ucsc.edu>
Subject: Re: [genome] Repeated transcripts ids
 

Jairo Navarro Gonzalez

Mar 13, 2025, 6:55:21 PM
to FRANCISCO RODRIGUEZ MARTIN, Gerardo Perez, gen...@soe.ucsc.edu

Hello,

Thank you for using the UCSC Genome Browser and sending your follow-up inquiry.

Can you explain a bit more about what you are trying to do with the UCSC Genome Browser? The repeats in the RepeatMasker track are composed of different classes of repeats and are found throughout the genome. These repetitive regions are not transcripts and can be variable in length. Here is a wiki page with more information about repetitive regions in the genome:

https://en.wikipedia.org/wiki/Repeated_sequence_(DNA)

The complete list of repeat classes shown on the RepeatMasker track can be found on the track description page:

https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=rmsk

If you are trying to map reads to transcripts, we recommend that you use a gene track such as NCBI RefSeq or GENCODE for hg38.

https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=refSeqComposite
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=knownGene

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.


If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser


FRANCISCO RODRIGUEZ MARTIN

Apr 8, 2025, 12:05:59 PM
to gen...@soe.ucsc.edu


From: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Sent: Tuesday, April 8, 2025 14:35
To: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Subject: RE: [genome] Repeated transcripts ids
 
Good afternoon Jairo,

I forgot to answer you. The GTF files I have downloaded contain a specific transcript_id for each feature. In other words, each feature has an associated gene_id and transcript_id.

I have another question regarding the format: are all the sequences associated with exon regions?

Best regards,

Fran

From: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Sent: Wednesday, March 19, 2025 23:26
To: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>; genome-www list <genom...@soe.ucsc.edu>
Subject: Re: [genome] Repeated transcripts ids
 

Hello,

Thank you for sending your follow-up with more information.

It sounds like you want to map the RNA-seq reads to the various repeats in the genome using the RepeatMasker track. The GTF that we provide via the Table Browser will not have unique IDs. However, you can make the IDs unique by adding the chromosome position to the ID using a Python script or command-line tools like AWK. For example, you can create a script to edit and generate the following GTF:

  seqid    source type    start      end score strand phase gene_id                    transcript_id
1  chr1 hg38_rmsk exon 67108754 67109046  1892      +    NA    L1P5      L1P5_chr1_67108754_67109046
2  chr1 hg38_rmsk exon  8388316  8388618  2582      -    NA    AluY        AluY_chr1_8388316_8388618
3  chr1 hg38_rmsk exon 25165804 25166380  4085      +    NA   L1MB5     L1MB5_chr1_25165804_25166380
4  chr1 hg38_rmsk exon 33554186 33554483  2285      -    NA   AluSc     AluSc_chr1_33554186_33554483
5  chr1 hg38_rmsk exon 41942895 41943205  2451      -    NA    AluY AluY_dup1_chr1_41942895_41943205
6  chr1 hg38_rmsk exon 50331337 50332274  1587      +    NA    HAL1      HAL1_chr1_50331337_50332274

Unfortunately, assistance with scripting is beyond the scope of this mailing list. You may also want to post your question on other bioinformatics forums, such as BioStars (https://www.biostars.org/), to get assistance from other members of the scientific community. You can also ask ChatGPT to generate a script that combines the fields into the transcript ID to make them unique.
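
As a rough illustration of the ID rewrite described above (a sketch only, not an official UCSC script), the following Python code appends the chromosome, start, and end to each transcript_id. It assumes a standard tab-separated, nine-column GTF whose last column contains a transcript_id "..." attribute, and it assumes that name plus position is enough to make each ID unique. The function names and the example line are hypothetical.

```python
import re

# Matches the transcript_id attribute in a GTF attribute column,
# e.g.  transcript_id "AluY_dup1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def make_transcript_id_unique(gtf_line: str) -> str:
    """Append _<seqid>_<start>_<end> to the transcript_id of one GTF line."""
    if gtf_line.startswith("#") or not gtf_line.strip():
        return gtf_line  # leave comment and blank lines untouched
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return gtf_line  # assumption: anything else is not a GTF record
    seqid, start, end = fields[0], fields[3], fields[4]
    fields[8] = TRANSCRIPT_ID.sub(
        lambda m: f'transcript_id "{m.group(1)}_{seqid}_{start}_{end}"',
        fields[8],
    )
    return "\t".join(fields) + "\n"


def rewrite_gtf(in_path: str, out_path: str) -> None:
    """Copy a GTF file, rewriting the transcript_id on every record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(make_transcript_id_unique(line))


# Hypothetical example record, modeled on the AluY_dup1 row above.
example = (
    "chr1\thg38_rmsk\texon\t41942895\t41943205\t2451\t-\t.\t"
    'gene_id "AluY"; transcript_id "AluY_dup1";\n'
)
print(make_transcript_id_unique(example), end="")
```

To process a whole file, rewrite_gtf("input.gtf", "output.gtf") could be called with your own paths (both file names here are placeholders). The counting tool then needs to group reads by transcript_id rather than gene_id; in Rsubread's featureCounts this is, to the best of my understanding, controlled by the GTF.attrType argument.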

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser


On Fri, Mar 14, 2025 at 1:04 AM FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es> wrote:
I did not mention it, but my data is derived from an RNA-seq study.

Fran

From: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Sent: Friday, March 14, 2025 8:27
To: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Subject: RE: [genome] Repeated transcripts ids
 
Good morning, 

What I am trying to do is map the repetitive sequences present in my FASTQ files using the human genome as the reference. When I use the GTF annotation file from the rmsk (RepeatMasker) track to quantify these reads after mapping them, I obtain "gene_id" and "transcript_id" values, as I showed in my previous email. I guess that AluY, for example, can have different lengths and occur at different locations in the genome, and that these copies are the "AluY_dupX" entries. The following table shows AluY_dup1, for example:

head(repeat_annotation_gtf)
  seqid    source type    start      end score strand phase gene_id transcript_id
1  chr1 hg38_rmsk exon 67108754 67109046  1892      +    NA    L1P5          L1P5
2  chr1 hg38_rmsk exon  8388316  8388618  2582      -    NA    AluY          AluY
3  chr1 hg38_rmsk exon 25165804 25166380  4085      +    NA   L1MB5         L1MB5
4  chr1 hg38_rmsk exon 33554186 33554483  2285      -    NA   AluSc         AluSc
5  chr1 hg38_rmsk exon 41942895 41943205  2451      -    NA    AluY     AluY_dup1
6  chr1 hg38_rmsk exon 50331337 50332274  1587      +    NA    HAL1          HAL1

Thanks for your attention and best regards,

Fran

From: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Sent: Thursday, March 13, 2025 23:55
To: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>

FRANCISCO RODRIGUEZ MARTIN

Apr 8, 2025, 2:27:02 PM
to gen...@soe.ucsc.edu
I sent this question to Jairo, but he told me to resend the question to this email.

Best regards,

Francisco


Matthew Speir

Apr 9, 2025, 12:47:57 PM
to FRANCISCO RODRIGUEZ MARTIN, gen...@soe.ucsc.edu
Hello, Francisco.

You can safely ignore that column in this case. This happens because of the way the data for the RepeatMasker track is structured in our database. The GTF output from the Table Browser simply labels each item as an "exon"; this does not necessarily mean that those items are associated with exons.

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.


---

Matthew Speir

UCSC Genome Browser, User Support

