Thank you for your questions about the 100-way vertebrate alignment data for the human assembly hg19.
The multiple alignment consists of pairwise whole-genome alignments to the hg19 assembly that were then stitched together into one large, multi-species alignment. The description page for this track contains more details about how it was generated, as well as references for some of the tools we use (e.g., lastz and multiz). Since these are whole-genome alignments, both protein-coding and non-protein-coding regions are included.
The differences in capitalization in these files are due to "soft repeat masking", which uses lowercase letters to indicate that a region was predicted to be a repeat. This is in contrast to "hard repeat masking", where repeats are replaced with Ns.
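To make the distinction concrete, here is a minimal Python sketch that converts a soft-masked sequence to a hard-masked one and reports where the lowercase (repeat) runs are. The function names and the example sequence are made up for illustration; they are not part of any UCSC tool.

```python
import re


def hard_mask(seq: str) -> str:
    """Replace every lowercase (soft-masked) base with N.

    Uppercase bases and alignment gap characters such as '-' are left as-is.
    """
    return re.sub(r"[a-z]", "N", seq)


def repeat_intervals(seq: str) -> list[tuple[int, int]]:
    """Return 0-based, half-open (start, end) spans of lowercase runs."""
    return [m.span() for m in re.finditer(r"[a-z]+", seq)]


# Hypothetical soft-masked sequence, for illustration only.
example = "ACGTacgtTTGAccA"

print(hard_mask(example))         # ACGTNNNNTTGANNA
print(repeat_intervals(example))  # [(4, 8), (12, 14)]
```

If the masking is not relevant to your analysis, calling seq.upper() discards it while keeping the bases themselves.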
Finally, these files are described in the README in that directory, which states:
The "alignments" directory contains compressed FASTA alignments
for the UCSC Known Gene CDS regions of the human genome (hg19/GRCh37, Feb. 2009)
aligned to the assemblies.
So, we extract the alignments in the original MAF files corresponding to the coding regions in some of the gene tracks we host. In this case:
- "knownGene" corresponds to our UCSC Genes track
- "knownCanonical" is a set of transcripts derived from knownGene that attempts to identify one "canonical" transcript per gene locus (typically the longest isoform at that locus)
- "refGene" corresponds to the UCSC RefSeq track, which is UCSC's realignment of certain RefSeq RNAs to the genome using BLAT
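Since the README describes these files as compressed FASTA, a generic reader along the following lines should be enough to iterate over the records. This is only a sketch: it assumes gzip compression, the file name in the usage note is a placeholder, and it makes no assumptions about how the header lines are laid out.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    The header is returned without the leading '>' and is not parsed further.
    Blank lines between records are skipped.
    """
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                # A new record begins, so emit the one collected so far.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    # Emit the final record, if the file contained any.
    if header is not None:
        yield header, "".join(chunks)
```

For example, `for header, seq in read_fasta("example.fa.gz"): print(header, len(seq))` would list each record with its aligned length, where example.fa.gz stands in for whichever file you downloaded.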
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
UCSC Cell Browser, Quality Assurance and Data Wrangler
Human Cell Atlas, User Experience Researcher
UCSC Genome Browser, User Support
UC Santa Cruz Genomics Institute
Revealing life’s code.