A Few Questions Regarding 100 Vertebrate Alignments Dataset

Jordan Twynham

Jul 27, 2021, 9:52:08 AM
to gen...@soe.ucsc.edu, Mary O'Connell
I am currently running a comparative genomics project and plan to use your 100 vertebrates dataset to identify conserved non-protein-coding regions for us to focus on. I have a few questions about the dataset that I would like some clarification on, if possible.
  • Does the data set include protein coding regions?
  • Some sequence blocks in the MAF files contain a mixture of lower-case and upper-case nucleotides. What is the reason for this?
  • The alignment folder contains knownGene, refGene and knownCanonical files. What do these file names refer to? Could you direct me towards some documentation regarding these files?
Matthew Speir

Jul 27, 2021, 3:52:41 PM
to Jordan Twynham, gen...@soe.ucsc.edu, Mary O'Connell
Thank you for your questions about the 100-way vertebrate alignment data for the human assembly hg19.

The multiple alignment consists of various pairwise whole-genome alignments to the hg19 assembly that have then been stitched together into one large, multi-species alignment. The description page for this track contains more details about how it was generated, as well as references for some of the tools we use (e.g., lastz and multiz). Since these are whole-genome alignments, both protein-coding and non-protein-coding regions are included.
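
As a rough illustration of what one block of such a multi-species alignment looks like, here is a minimal Python sketch that reads the "s" (sequence) lines of a single MAF block. The field order (source, start, size, strand, source size, aligned text) follows the MAF format; the function name `parse_maf_block`, the `MafSeq` type, and the example coordinates and bases are made up for illustration and are not taken from the actual 100-way files.

```python
from typing import List, NamedTuple


class MafSeq(NamedTuple):
    """One 's' line of a MAF alignment block (hypothetical helper type)."""
    src: str       # e.g. "hg19.chr1" (assembly.chromosome)
    start: int     # zero-based start of the aligned region in the source
    size: int      # number of non-gap bases in the aligned text
    strand: str    # "+" or "-"
    src_size: int  # total length of the source sequence
    text: str      # aligned bases, with "-" for gaps


def parse_maf_block(block: str) -> List[MafSeq]:
    """Return the sequence rows of one MAF block, ignoring all other line types."""
    rows = []
    for line in block.splitlines():
        fields = line.split()
        if len(fields) == 7 and fields[0] == "s":
            _, src, start, size, strand, src_size, text = fields
            rows.append(MafSeq(src, int(start), int(size), strand, int(src_size), text))
    return rows


# Made-up block for illustration only; coordinates and bases are not real data.
example_block = """a score=1234.0
s hg19.chr1    10917 6 + 249250621 ACgt-AA
s panTro4.chr1 11000 7 + 200000000 ACgtTAA
"""

rows = parse_maf_block(example_block)
for row in rows:
    print(row.src, row.start, row.size, row.text)
```

Each row covers the same alignment columns, so the aligned text has the same length in every row even when the number of non-gap bases differs.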

The differences in capitalization in these files are due to "soft repeat masking", which uses lower-case letters to indicate that a region was predicted to be a repeat. This is as opposed to "hard repeat masking", where repeats are replaced with Ns.
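
To make the distinction concrete, here is a small Python sketch, assuming you already have one sequence as a string. It lists the soft-masked (lower-case) runs and shows what the hard-masked equivalent would look like. The function names `soft_masked_intervals` and `hard_mask` are hypothetical helpers, not part of any UCSC tool, and the example sequence is made up.

```python
import re
from typing import List, Tuple


def soft_masked_intervals(seq: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) string positions of lower-case runs."""
    return [m.span() for m in re.finditer(r"[a-z]+", seq)]


def hard_mask(seq: str) -> str:
    """Replace every lower-case (soft-masked) base with N; gaps and upper case are untouched."""
    return re.sub(r"[a-z]", "N", seq)


# Made-up sequence for illustration only.
seq = "ACGTacgt-NNacGT"
intervals = soft_masked_intervals(seq)
masked = hard_mask(seq)
print(intervals)
print(masked)
```

Note that the positions are offsets into the string you pass in. For aligned MAF text that includes "-" gap characters, they are alignment columns, not genome coordinates.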

Finally, these files are described in the README in that directory, which states:

The "alignments" directory contains compressed FASTA alignments
for the UCSC Known Gene CDS regions of the human genome (hg19/GRCh37, Feb. 2009)
aligned to the assemblies.

So, we extract the alignments in the original MAF files corresponding to the coding regions in some of the gene tracks we host. In this case:
  • "knownGene" corresponds to our UCSC Genes track
  • "knownCanonical" is a set of transcripts derived from knownGene that attempts to identify one "canonical" transcript per gene locus (typically the longest isoform at that locus)
  • "refGene" corresponds to the UCSC RefSeq track, which is UCSC's realignment of certain RefSeq RNAs to the genome using BLAT
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.

