T2T CHM13 v2.0 now available in the Genome Browser

Luis Nassar

Apr 12, 2022, 6:56:49 PM
to genome-...@soe.ucsc.edu
Click here to see this announcement on our website.

The Genome Browser has a rich history intricately connected to human genomic research. We have provided displays for almost two dozen human genomes, beginning with the first drafts in the year 2000. Nearly 22 years later, the T2T consortium has published the most complete human haploid genome sequence to date, adding nearly all of the 200 million bases (8%) missing from the current reference. We are proud of all the scientists involved, including our colleagues in the UCSC Genomics Institute, who played a role in this release. We strive to facilitate omics research and are therefore pleased to announce our expanded support for the T2T-CHM13 v2.0 browser.

What is T2T-CHM13 v2.0?

T2T-CHM13 v2.0 was produced by sequencing the CHM13hTERT human cell line, which was derived from a hydatidiform mole and is effectively haploid, meaning it is nearly uniformly homozygous. The assembly also relied on recent technologies such as HiFi and nanopore sequencing. The result is a 3.055 billion base pair genome that includes gapless assemblies for all main chromosomes and introduces nearly 200 Mbp of novel sequence containing 1,956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes. A Y chromosome was added from Genome in a Bottle's HG002 sample.

CHM13 removes 1.2 Mbp of duplicated sequence in hg38. In addition, 263 GENCODE genes from hg38 are absent in CHM13, while 3,604 genes in CHM13 are absent in hg38, mostly in the centromeres. Variant calling using CHM13 reduces the number of false positives in certain medically relevant genes. CHM13 also resolves duplications collapsed in hg38 that affect 48 protein-coding genes (e.g. KCNJ18, KCNJ12, KMT2C, MAP2K3), so it is more representative of human copy-number variation than hg38.

It is also important to recognize, however, that while this assembly's chromosome sequences are more complete than the main chromosomes of the hg38 reference genome, it is not "hg39": it is an alternate or companion assembly, not a primary reference assembly of the Genome Reference Consortium and NCBI. It does not contain any alternative haplotypes, and most genome annotation tracks are currently based on hg19 and hg38 coordinates. Hundreds of human genomes of similar accuracy to CHM13 are expected to be released over the next 1-2 years, and T2T CHM13 is therefore the foundation of the future human pangenome reference genome.

How to access this assembly in the Genome Browser?

As with many of our assemblies, there are a few different ways to gain access. We have added CHM13 to our Genomes drop-down menu, which provides direct access from almost anywhere on our site. Also, like most of our other genomes, it can be found by searching our Gateway page.

Figure: Finding CHM13 in the Genomes menu dropdown.
Figure: Searching CHM13 on the Gateway page.

CHM13 is a part of our Genome Archive (GenArk) system, and thus exists as an assembly hub. GenArk assemblies can always be reached directly via their shortlink URL corresponding to their GCA accession, e.g. CHM13: https://genome.ucsc.edu/h/GCA_009914755.4
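
The shortlink pattern above can be built programmatically. Below is a minimal Python sketch that assumes only the https://genome.ucsc.edu/h/<accession> form shown for CHM13. The function name genark_shortlink and the accession format check are illustrative and not part of any UCSC API.

```python
import re

# Assumed accession shape: "GCA_", nine digits, a dot, then a version number.
ACCESSION_RE = re.compile(r"GCA_\d{9}\.\d+")


def genark_shortlink(accession: str) -> str:
    """Return the Genome Browser shortlink for a GCA accession (hypothetical helper)."""
    if not ACCESSION_RE.fullmatch(accession):
        raise ValueError(f"not a GCA accession: {accession!r}")
    return f"https://genome.ucsc.edu/h/{accession}"


print(genark_shortlink("GCA_009914755.4"))
```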

What annotations are currently available on the CHM13 browser?

Some notable annotations currently available on the CHM13 browser are listed below. Additional annotations will continue to be added as they become available.

Gene and mRNA annotations:

  • CAT/Liftoff Genes - Gene models generated using the CAT software, filled in from the Liftoff mappings where needed. The reference annotations are from GENCODE V35.
  • CHM13 PROseq - CHM13 cell line PRO-seq Bowtie2 alignments to CHM13v2.0 (minus chrY) and unique genome-wide 21mer filtering (stranded).
  • CHM13 RNA-Seq - CHM13 cell line RNA-seq Bowtie2 alignments to CHM13v2.0 (minus chrY) and unique genome-wide 21mer filtering (unstranded).

Clinical annotations:

  • ClinVar Variants - Lifted ClinVar data from the hg38 March 13th, 2022 release.
  • dbSNP 155 - Lifted dbSNP 155 variants from the hg38 release.
  • GWAS Variants - GWAS catalog variants lifted from hg38.

Comparative genomics:

  • CHM13 unique - Regions unique to the T2T-CHM13 v2.0 assembly as compared to the hg38 and hg19 reference assemblies.
  • Human liftOver - Contains one-to-one Nextflow LiftOver pipeline alignments between CHM13 and hg19/hg38.
  • Chain/Net Track - Alignment track between CHM13 and four other human genomes that shows rearrangements in our usual chain (=alignable) and net (=synteny) display formats. The other genomes are hg19, hg38, HG002pat, and HG002mat.

How to display my data in CHM13?

We have added support for CHM13 to our hgConvert tool. This allows conversion of the current viewing window between hg19/hg38 and CHM13, in either direction. We will also be adding support for conversion of data using our hgLiftOver tool in our next version release on May 3rd. In the meantime, the command line version of liftOver, in combination with the proper chain file, can be used to lift annotations.

Figure: Using the hgConvert tool to see coordinates between hg38 and CHM13.
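
For the command line route, the liftOver utility takes four positional arguments: the input file, the chain file, the output file, and a file for records that could not be mapped. Below is a minimal Python sketch that assembles that command and optionally runs it. The chain file name hg38-chm13v2.over.chain.gz and the BED file names are assumptions for illustration only, so check the liftOver download directory for the actual file names.

```python
import shutil
import subprocess


def liftover_command(in_bed, chain, out_bed, unmapped):
    """Build the argument list in the order: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", in_bed, chain, out_bed, unmapped]


def run_liftover(in_bed, chain, out_bed, unmapped):
    """Run liftOver if the binary is on PATH; raises CalledProcessError on failure."""
    if shutil.which("liftOver") is None:
        raise FileNotFoundError("liftOver binary not found on PATH")
    subprocess.run(liftover_command(in_bed, chain, out_bed, unmapped), check=True)


# All file names below are hypothetical placeholders.
cmd = liftover_command(
    "my_hg38.bed",
    "hg38-chm13v2.over.chain.gz",  # assumed chain file name
    "my_chm13.bed",
    "unmapped.bed",
)
print(" ".join(cmd))
```

Records that fail to lift are written to the fourth file, which is worth inspecting because regions that differ between hg38 and CHM13 may not map.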

Custom tracks and track hubs can also be used to display annotations on CHM13. In the case of track hubs, using genome GCA_009914755.4 is sufficient to declare the assembly. We have also expanded our support of variable chromosome names, so data can be loaded using either UCSC ("chr1"), NCBI ("CP068277.2") or Ensembl ("1") sequence identifiers. There should no longer be a need to convert sequence names.
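
As a rough illustration of declaring the assembly in a track hub, the Python sketch below generates a minimal hub.txt, genomes.txt, and trackDb.txt. The hub name, email address, track name, and bigDataUrl are hypothetical placeholders. Only the genome GCA_009914755.4 line comes from the text above, and the remaining fields follow the standard track hub layout.

```python
def hub_files(hub_name, email, track, big_data_url, genome="GCA_009914755.4"):
    """Return the text of a minimal three-file track hub as a dict (illustrative sketch)."""
    hub_txt = (
        f"hub {hub_name}\n"
        f"shortLabel {hub_name}\n"
        f"longLabel {hub_name} annotations on T2T-CHM13 v2.0\n"
        "genomesFile genomes.txt\n"
        f"email {email}\n"
    )
    # The genome line is what ties the hub to the CHM13 GenArk assembly.
    genomes_txt = f"genome {genome}\ntrackDb trackDb.txt\n"
    track_db_txt = (
        f"track {track}\n"
        f"bigDataUrl {big_data_url}\n"
        f"shortLabel {track}\n"
        f"longLabel {track} on CHM13\n"
        "type bigBed\n"
    )
    return {"hub.txt": hub_txt, "genomes.txt": genomes_txt, "trackDb.txt": track_db_txt}


# Every argument here is a placeholder, not a real hub.
files = hub_files("myChm13Hub", "user@example.org", "myPeaks", "myPeaks.bb")
for name, text in files.items():
    print(f"--- {name} ---\n{text}")
```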

It is worth noting that GenArk assemblies are functionally hubs, which means all data is stored in binary files, not MySQL databases. If your existing data pipelines do not work because our data formats have changed compared to hg19/hg38, please do not hesitate to contact us. Most formats are very similar to the MySQL tables and we have command line tools that can perform the conversions.

Where to download CHM13 data?

All GenArk hubs are hosted on our download server. This means that all settings information and data for displaying this browser can be found there: https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/
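
The directory URL above appears to split the nine accession digits into three groups of three. Below is a minimal Python sketch of that layout, inferred from this single URL. Treat the pattern as an assumption for other GenArk assemblies; the helper name genark_hub_url is hypothetical.

```python
def genark_hub_url(accession, base="https://hgdownload.soe.ucsc.edu/hubs"):
    """Derive the hub directory URL from an accession such as GCA_009914755.4."""
    prefix, rest = accession.split("_", 1)   # "GCA", "009914755.4"
    digits = rest.split(".", 1)[0]           # "009914755"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession!r}")
    # Split the nine digits into three path components: 009/914/755.
    groups = [digits[i:i + 3] for i in range(0, 9, 3)]
    return f"{base}/{prefix}/{'/'.join(groups)}/{accession}/"


print(genark_hub_url("GCA_009914755.4"))
```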

We also provide FASTA files there with two different sequence identifiers (the "chr1" format and GenBank accessions), gene annotations in GFF and other formats, and assembly indexes with either GenBank or "chr1" sequence names for the aligners bwa-mem2, bowtie2, hisat2, and minimap2. Detailed download instructions can be found in the README and on our assembly description page.

All liftOver files, including files to/from hg19/hg38 and CHM13, can also be found on our download server: https://hgdownload.gi.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/liftOver/

Acknowledgements

We would like to thank the T2T Consortium for this landmark accomplishment. We would like to extend additional kudos to our fellow UCSC Genomics Institute members who are part of the consortium: Karen Miga, Benedict Paten, Kishwar Shafin, Mark Diekhans, and Miten Jain. Lastly, we thank the engineers and QA members of the Genome Browser for the rapid development and release of CHM13 data and features.
