A Few Questions Regarding 100 Vertebrate Alignments Dataset

Jordan Twynham

Jul 27, 2021, 9:52:08 AM
to gen...@soe.ucsc.edu, Mary O'Connell
Hi team,

I am currently running a comparative genomics project and plan to use your 100 vertebrates dataset to identify conserved non-protein-coding regions to focus on. I have a few questions about the dataset that I would like some clarification on, if possible.
  • Does the dataset include protein-coding regions?
  • Some sequence blocks in MAF files contain a mixture of lower-case and upper-case nucleotides - what is the reason for this?
  • The alignment folder contains knownGene, refGene and knownCanonical files. What do these file names refer to? Could you direct me towards some documentation regarding these files?
Thank you for your great work - I look forward to hearing from you.

Best wishes,
Jordan

This message and any attachment are intended solely for the addressee
and may contain confidential information. If you have received this
message in error, please contact the sender and delete the email and
attachment. 

Any views or opinions expressed by the author of this email do not
necessarily reflect the views of the University of Nottingham. Email
communications with the University of Nottingham may be monitored 
where permitted by law.



Matthew Speir

Jul 27, 2021, 3:52:41 PM
to Jordan Twynham, gen...@soe.ucsc.edu, Mary O'Connell
Hello, Jordan.

Thank you for your questions about the 100-way vertebrate alignment data for the human assembly hg19.

The multiple alignment consists of pairwise whole-genome alignments of each species to the hg19 assembly, which have then been stitched together into one large, multi-species alignment. The description page for this track contains more details about how it was generated, as well as references for some of the tools we use (e.g., lastz and multiz). Since these are whole-genome alignments, both protein-coding and non-protein-coding regions are included.
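For anyone reading these blocks programmatically, here is a minimal Python sketch of splitting MAF text into alignment blocks. It relies only on the standard MAF layout ("a" lines open a block, "s" lines carry one aligned sequence each). The function name `read_maf_blocks`, the `MafRow` type, and the example coordinates and sequences are all made up for illustration and are not taken from the actual hg19 files.

```python
from typing import Iterable, Iterator, List, NamedTuple


class MafRow(NamedTuple):
    src: str       # "assembly.chromosome", e.g. "hg19.chr1"
    start: int     # zero-based start in the source sequence
    size: int      # number of non-gap bases in this row
    strand: str    # "+" or "-"
    src_size: int  # total length of the source sequence
    text: str      # aligned bases, with "-" marking gaps


def read_maf_blocks(lines: Iterable[str]) -> Iterator[List[MafRow]]:
    """Yield each alignment block as a list of its "s" rows."""
    block: List[MafRow] = []
    for line in lines:
        fields = line.split()
        # A blank line or a new "a" line closes the current block.
        if not fields or fields[0] == "a":
            if block:
                yield block
            block = []
        elif fields[0] == "s":
            _, src, start, size, strand, src_size, text = fields
            block.append(
                MafRow(src, int(start), int(size), strand, int(src_size), text)
            )
        # Other line types (comments, "i", "e", "q") are ignored here.
    if block:
        yield block


# Invented example data, for illustration only.
EXAMPLE = """##maf version=1
a score=100.0
s hg19.chr1 100 8 + 249250621 ACGTacgt
s panTro4.chr1 200 7 + 228333871 ACGT-cgt

a score=50.0
s hg19.chr1 300 4 + 249250621 ttAA
"""

blocks = list(read_maf_blocks(EXAMPLE.splitlines()))
for block in blocks:
    print([row.src for row in block])
```

The first row of each block is the reference (hg19 here), so coding and non-coding regions are distinguished by comparing its coordinates against a gene annotation rather than by anything in the MAF file itself.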

The differences in capitalization in these files are due to "soft repeat masking", which uses lower-case letters to indicate that a region was predicted to be a repeat. This is as opposed to "hard repeat masking", where repeats are replaced with Ns.
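As a small illustration of the two masking styles, the Python sketch below converts a soft-masked aligned sequence to a hard-masked one and reports the fraction of bases flagged as repeat. The helper names `hard_mask` and `repeat_fraction` are hypothetical and not part of any UCSC tool, and the input string is invented.

```python
def hard_mask(seq: str) -> str:
    """Replace soft-masked (lower-case) bases with N; gaps are left as-is."""
    return "".join("N" if base.islower() else base for base in seq)


def repeat_fraction(seq: str) -> float:
    """Fraction of non-gap bases that are lower-case (soft-masked)."""
    bases = [base for base in seq if base != "-"]
    if not bases:
        return 0.0
    return sum(base.islower() for base in bases) / len(bases)


soft = "ACGTacgt-AC"  # invented example row from an alignment block
print(hard_mask(soft))
print(repeat_fraction(soft))
```

If case does not matter for an analysis, calling `seq.upper()` on the text discards the masking information without losing any bases.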

Finally, these files are described in the README in that directory, which states:

The "alignments" directory contains compressed FASTA alignments
for the UCSC Known Gene CDS regions of the human genome (hg19/GRCh37, Feb. 2009)
aligned to the assemblies.

So, we extract the alignments in the original MAF files corresponding to the coding regions in some of the gene tracks we host. In this case:
  • "knownGene" corresponds to our UCSC Genes track
  • "knownCanonical" is a set of transcripts derived from knownGene that attempts to identify one "canonical" transcript per gene locus (typically the longest isoform at that locus)
  • "refGene" corresponds to the UCSC RefSeq track which is UCSC's realignment of certain RefSeq RNAs to the genome using BLAT
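To make the "longest isoform per locus" idea concrete, here is a rough Python sketch. It shows only that one heuristic and should not be read as the actual procedure used to build knownCanonical. The function name `pick_longest` and the gene and transcript identifiers are invented for the example.

```python
from typing import Dict, List, Tuple


def pick_longest(transcripts: List[Tuple[str, str, int]]) -> Dict[str, str]:
    """Map each gene to its longest transcript.

    Each input item is (gene, transcript_id, length). Ties keep the
    transcript that appears first in the input.
    """
    best: Dict[str, Tuple[str, int]] = {}
    for gene, transcript_id, length in transcripts:
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript_id, length)
    return {gene: transcript_id for gene, (transcript_id, _) in best.items()}


# Hypothetical transcripts, for illustration only.
example = [
    ("geneA", "txA1", 1200),
    ("geneA", "txA2", 3400),
    ("geneB", "txB1", 900),
]
print(pick_longest(example))
```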
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

---

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.


