Hello Fran.
Thank you for using the UCSC Genome Browser and for sending your inquiry.
To better assist you, could you share more details about why you need a GTF file for the RepeatMasker track data? Since this track consists of repetitive elements, encountering repeated IDs is expected. The RepeatMasker track does not include transcripts. You can find more information about the RepeatMasker track on its description page: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&position=default&g=rmsk#TRACK_HTML
Please include gen...@soe.ucsc.edu in any replies so that the whole team can see them. All messages sent to that address are archived on a publicly accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Gerardo Perez
UCSC Genomics Institute
head(repeat_annotation_gtf)
  seqid source    type start    end      score strand phase gene_id transcript_id
1 chr1  hg38_rmsk exon 67108754 67109046 1892  +      NA    L1P5    L1P5
2 chr1  hg38_rmsk exon 8388316  8388618  2582  -      NA    AluY    AluY
3 chr1  hg38_rmsk exon 25165804 25166380 4085  +      NA    L1MB5   L1MB5
4 chr1  hg38_rmsk exon 33554186 33554483 2285  -      NA    AluSc   AluSc
5 chr1  hg38_rmsk exon 41942895 41943205 2451  -      NA    AluY    AluY_dup1
6 chr1  hg38_rmsk exon 50331337 50332274 1587  +      NA    HAL1    HAL1
Hello,
Thank you for using the UCSC Genome Browser and sending your follow-up inquiry.
Can you explain a bit more about what you are trying to do with the UCSC Genome Browser? The repeats in the RepeatMasker track are composed of different classes of repeats and are found throughout the genome. These repetitive regions are not transcripts and can be variable in length. Here is a wiki page with more information about repetitive regions in the genome:
The complete list of repeat classes shown on the RepeatMasker track can be found on the track description page:
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=rmsk
If you are trying to map reads to transcripts, we recommend that you use a gene track such as NCBI RefSeq or GENCODE for hg38.
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=refSeqComposite
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr7&g=knownGene
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Jairo Navarro
UCSC Genome Browser
Hello,
Thank you for sending your follow-up with more information.
It sounds like you want to map the RNA-seq reads to the various repeats in the genome using the RepeatMasker track. The GTF that we provide via the Table Browser will not have unique IDs. However, you can make the IDs unique by adding the chromosome position to the ID using a Python script or command-line tools like AWK. For example, you can create a script to edit and generate the following GTF:
Unfortunately, assistance with scripting is beyond the scope of this mailing list. You may also want to post your question on other bioinformatics forums, such as BioStars (https://www.biostars.org/), to get assistance from other members of the scientific community. You can also ask ChatGPT to generate a script that combines the fields into the transcript ID to make them unique.
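As a rough illustration of the idea described above, here is a minimal Python sketch. It assumes a standard nine-column, tab-separated GTF whose last column contains an attribute of the form transcript_id "..."; the function name make_unique and the chrom:start-end:strand suffix format are hypothetical choices for this example, not something the Table Browser or any particular quantification tool prescribes.

```python
import re

# Matches the transcript_id attribute in GTF column 9,
# e.g. transcript_id "AluY_dup1";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def make_unique(line):
    """Append chrom:start-end:strand to the transcript_id of one GTF line.

    Hypothetical helper: the suffix format is an assumption chosen so that
    each repeat copy gets an ID tied to its genomic position.
    """
    # Pass comment lines and blank lines through unchanged.
    if line.startswith("#") or not line.strip():
        return line
    fields = line.rstrip("\n").split("\t")
    # Leave malformed lines (fewer than nine columns) untouched.
    if len(fields) < 9:
        return line
    chrom, start, end, strand = fields[0], fields[3], fields[4], fields[6]
    suffix = f"{chrom}:{start}-{end}:{strand}"
    # Rewrite only the first transcript_id attribute found in column 9.
    fields[8] = TRANSCRIPT_ID.sub(
        lambda m: f'transcript_id "{m.group(1)}_{suffix}"',
        fields[8],
        count=1,
    )
    return "\t".join(fields) + "\n"


def rewrite_gtf(in_path, out_path):
    """Read a GTF file and write a copy with position-tagged transcript IDs."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(make_unique(line))
```

Calling rewrite_gtf("rmsk.gtf", "rmsk.unique.gtf") (both file names are placeholders) would write a copy in which, for instance, the AluY_dup1 record on chr1 gets its coordinates and strand appended to its transcript ID. The gene_id is left as it is, so counts can still be summarized per repeat family if the downstream tool groups by gene.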
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Jairo Navarro
UCSC Genome Browser
I did not mention it, but my data is derived from an RNA-seq study.
Fran
From: FRANCISCO RODRIGUEZ MARTIN <frodrig...@us.es>
Sent: Friday, March 14, 2025 8:27
To: Jairo Navarro Gonzalez <jnav...@ucsc.edu>
Subject: RE: [genome] Repeated transcripts ids
Good morning,
What I am trying to do is map the repetitive sequences present in my FASTQ files, using the human genome as the reference. When I use the rmsk track as a RepeatMasker GTF annotation file to quantify these reads after mapping them, I obtain "gene_id" and "transcript_id" values, as I showed in my previous email. I guess that AluY, for example, can have different lengths and can occur at different locations in the genome, and that those copies are the ones labeled "AluY_dupX". The following table shows AluY_dup1, for example:
head(repeat_annotation_gtf)
  seqid source    type start    end      score strand phase gene_id transcript_id
1 chr1  hg38_rmsk exon 67108754 67109046 1892  +      NA    L1P5    L1P5
2 chr1  hg38_rmsk exon 8388316  8388618  2582  -      NA    AluY    AluY
3 chr1  hg38_rmsk exon 25165804 25166380 4085  +      NA    L1MB5   L1MB5
4 chr1  hg38_rmsk exon 33554186 33554483 2285  -      NA    AluSc   AluSc
5 chr1  hg38_rmsk exon 41942895 41943205 2451  -      NA    AluY    AluY_dup1
6 chr1  hg38_rmsk exon 50331337 50332274 1587  +      NA    HAL1    HAL1
Thanks for your attention and best regards,
Fran
---
Matthew Speir
UCSC Genome Browser, User Support