rmsk annotation GRCh38.p13

Schorn, Andrea

Feb 8, 2022, 2:51:03 PM
to gen...@soe.ucsc.edu
Hi there,

Is the latest RepeatMasker annotation
https://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/rmskOutCurrent.txt.gz

compatible with GRCh38.p13?
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/

I understand that the repeat mask coordinates on the main chromosomes have not changed; however, rmskOutCurrent has 455 chromosome entries, whereas GRCh38.p13 has 640. I found this Google Groups entry
https://groups.google.com/a/soe.ucsc.edu/g/genome/c/59-sqdMHnCs

but couldn't figure out from it whether there is a complete annotation for the GRCh38.p13 assembly, or how to patch it up.

Thank you,
Andrea

Brian Lee

Feb 10, 2022, 10:04:07 AM
to Schorn, Andrea, gen...@soe.ucsc.edu

Dear Andrea,

Thank you for using the UCSC Genome Browser and for your question about rmsk annotations that include GRCh38.p13.

The short answer is that you can find the most recent patch 13 annotation data in the rmsk track here: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz

This file has the annotations for patch sequences. For instance, if you load this session, you will see an example: http://genome.ucsc.edu/s/brianlee/chr19_ML143376v1_fix

This session has chr19_ML143376v1_fix for the sequence https://www.ncbi.nlm.nih.gov/nuccore/ML143376.1 found in GCA_000001405.28_GRCh38.p13_genomic.fna.gz highlighted in the middle, where annotations for both genes and the Repeating Elements by RepeatMasker are highlighted.

While the file you referenced, rmskOutCurrent.txt.gz (2018-10-28 03:34, 162M), is larger, the newer rmsk.txt.gz file (2021-09-03 14:58, 147M) has 640 sequence names, reflecting the new patches. Your question has prompted us to create a work ticket to update the rmskOutCurrent file.
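
As a quick check of the sequence-name counts mentioned above (455 versus 640), something like the following sketch could be run on each downloaded file. The function name count_seq_names is hypothetical, and the sketch assumes that genoName is the sixth tab-separated column in both files.

```shell
# Count the distinct sequence names in a gzipped rmsk-style table.
# Assumption: genoName is the 6th tab-separated column.
# count_seq_names is a hypothetical helper name, not a UCSC tool.
count_seq_names() {
    gunzip -c "$1" | cut -f 6 | sort -u | wc -l | tr -d ' '
}

# Usage, after downloading both files:
#   count_seq_names rmskOutCurrent.txt.gz
#   count_seq_names rmsk.txt.gz
```

The sort -u step removes duplicate names even when rows for one sequence are not adjacent in the file.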

To get annotations for the whole genome, we recommend extracting the new patch sequence annotations from the rmsk file referenced above and adding them to the existing rmskOutCurrent file, which gives combined and relatively up-to-date annotations. This requires filtering out rows from rmsk whose genoName (chrom) is also found in rmskOutCurrent.

Here are some steps that could be used:

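# List the sequence names (genoName, column 6) already present in rmskOutCurrent.
gunzip -c rmskOutCurrent.txt.gz | cut -f 6 | sort -u > seqsInRmskOutCurrent.txt
# Start the combined file from the existing rmskOutCurrent rows.
gunzip -c rmskOutCurrent.txt.gz > rmskOutCurrentCombined.txt
# Append the rmsk rows that do not mention any of the listed names.
gunzip -c rmsk.txt.gz | grep -vFwf seqsInRmskOutCurrent.txt >> rmskOutCurrentCombined.txt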
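
One caveat with the grep step is that it looks for the listed names anywhere in a row, not only in the genoName column. With real rmsk data this is unlikely to matter, but a variant that compares column 6 only could be sketched as below. The function name merge_rmsk and its argument order are hypothetical, and the sketch assumes genoName is the sixth tab-separated column in both files.

```shell
# Merge rmsk rows for sequences that are missing from rmskOutCurrent,
# comparing the genoName column (assumed to be column 6) exactly.
# merge_rmsk is a hypothetical helper: merge_rmsk CURRENT.gz NEWER.gz OUTPUT.txt
merge_rmsk() {
    current_gz=$1
    newer_gz=$2
    out=$3
    names=$(mktemp)

    # Sequence names already annotated in the older file.
    gunzip -c "$current_gz" | cut -f 6 | sort -u > "$names"

    # Start the combined file from the existing rows.
    gunzip -c "$current_gz" > "$out"

    # Append rows from the newer file whose column 6 is not a known name.
    gunzip -c "$newer_gz" | awk -F'\t' -v names="$names" '
        BEGIN { while ((getline line < names) > 0) seen[line] = 1 }
        !($6 in seen)
    ' >> "$out"

    rm -f "$names"
}

# Usage:
#   merge_rmsk rmskOutCurrent.txt.gz rmsk.txt.gz rmskOutCurrentCombined.txt
```

Loading the names in the BEGIN block keeps the filter correct even if the names list is empty, in which case every row of the newer file is appended.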

If you ever wish to see the steps taken to build these files, we document them in our source tree and can provide a link if that would be helpful. However, you may be mainly interested in the RepeatMasker and library versions used, which are summarized below. For rmskOutCurrent, and for sequences added to rmsk in patch releases through p12, these versions of RepeatMasker and its libraries were used:

  • February 01 2017 (open-4-0-7) 1.331 version of RepeatMasker
  • Dfam_Consensus RELEASE 20170127
  • RepBase RELEASE 20170127

For sequences added to rmsk in patch release p13, these versions were used:

  • February 01 2017 (open-4-0-8) 1.332 version of RepeatMasker
  • Dfam_Consensus RELEASE 20181026
  • RepBase RELEASE 20181026

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further public questions, please send new questions to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum to help others find answers to similar questions. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu, which is a private internal list to our support team.

All the best,

