New hg38 HPRC track group and data

Luis Nassar

Jan 22, 2024, 6:39:01 PM

to genome-...@soe.ucsc.edu

We are proud to announce the release of four new tracks and a new track group on hg38 dedicated to the NIH's Human Pangenome Reference Consortium (HPRC) data.

No single reference genome, such as hg19 or hg38, can accurately represent human genetic diversity. The HPRC's goal is to address this by sequencing thousands of human genomes at high quality and building new tools for working with them. The first data release from this project consists of 47 phased, diploid assemblies, more than 99% accurate at the structural and base pair levels. We obtained alignments of these new genomes to hg38 from the HPRC analysis groups and have created new Genome Browser annotation tracks that visualize the differences between the established hg38 reference and the 94 new pangenome assemblies (two haplotypes for each of the 47 diploid assemblies). The new tracks are grouped into short and structural variants, with the latter further split by type (insertion, deletion, inversion, duplication, etc.). We plan to update these and add other tracks as soon as more HPRC data is released.

[Image: hg38 session visualizing the new HPRC tracks.]

Session on hg38 near the MHC region, where the reference sequence upstream of HLA-F contains an insertion present in only a few of the HPRC populations. Few HPRC variants and no common dbSNP155 variants are annotated there, further evidence that this region of the reference is likely an uncommon insertion in the global population. Click on the image to explore the session further.

In this first HPRC data release, we are adding four new tracks to this new track group. Details on each of the tracks are as follows:

Feature and Variation Tracks

The Short Variants container track shows short nucleotide variants, a few base pairs in length, identified by aligning HPRC genomes to the hg38 reference assembly using the Minigraph-Cactus approach. Short variants have been used in population genetics to investigate population-specific allele frequencies and genetic diversity, and in disease association studies. The track consists of three subtracks:

HPRC All Variants: HPRC variants decomposed from hprc-v1.0-mc.grch38.vcfbub.a100k.wave.vcf.gz (Liao et al 2023), no size filtering
HPRC Variants ≤ 3bp: HPRC VCF variants filtered for items size ≤ 3bp
HPRC Variants > 3bp: HPRC VCF variants filtered for items size > 3bp
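
As a rough illustration of how variants can be split into two size classes at a 3 bp threshold, the sketch below partitions VCF records by size. It is not the pipeline used to build these tracks: the definition of item size (here, the length of the longest REF or ALT allele) and the function names are assumptions made for the example, and the records are made up.

```python
def variant_size(ref, alts):
    """Assumed size of a variant: length of its longest allele (REF or any ALT)."""
    return max(len(allele) for allele in [ref, *alts])


def split_by_size(vcf_lines, threshold=3):
    """Split VCF data lines into (small, large) lists at `threshold` bp.

    small holds records with size <= threshold, large holds the rest.
    Header lines (starting with '#') and blank lines are skipped.
    """
    small, large = [], []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        if variant_size(ref, alts) <= threshold:
            small.append(line)
        else:
            large.append(line)
    return small, large


# Made-up records for illustration only (not taken from the HPRC VCF).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t200\t.\tAT\tA,ATTTT\t.\t.\t.",
]
small, large = split_by_size(example)
print(len(small), len(large))
```

Symbolic alleles such as `<DEL>` are not handled here; a real filter would need to read their length from the INFO column instead.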

The Rearrangements container track shows various rearrangements in the HPRC assemblies with respect to hg38. The types include indels, duplications, inversions, and other more complicated rearrangements.

There are five tracks in the Rearrangements container track:

Insertions: Deletions in hg38 = Insertion in the HPRC assemblies
Deletions: Insertions in hg38 = Deletion in the HPRC assemblies
Inversions: Inversions with respect to hg38 in HPRC assemblies
Duplications: Duplications with respect to hg38 in HPRC assemblies
Other Rearrangements: Unalignable sequences in both assemblies (inversions, partial transpositions)
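
The insertion and deletion definitions above depend on which assembly is taken as the point of view. The sketch below makes that concrete by labeling the gap between two consecutive aligned blocks from the HPRC assembly's perspective. It is only an illustration, not the method used to produce these tracks; the function name and the gap-based rule are assumptions, and inversions and duplications cannot be detected from a single gap.

```python
def classify_gap(ref_gap, query_gap):
    """Label the gap between two aligned blocks, from the HPRC assembly's point of view.

    ref_gap:   unaligned bases on hg38 between the blocks
    query_gap: unaligned bases on the HPRC assembly between the blocks
    """
    if ref_gap < 0 or query_gap < 0:
        raise ValueError("gap lengths must be non-negative")
    if ref_gap == 0 and query_gap == 0:
        return "none"
    if ref_gap == 0:
        return "insertion"  # extra sequence in the HPRC assembly (a deletion in hg38)
    if query_gap == 0:
        return "deletion"   # sequence present in hg38 but absent from the assembly
    return "other"          # unaligned sequence on both sides


# Hypothetical gap sizes (hg38 gap, HPRC assembly gap) for illustration.
for gaps in [(0, 250), (1200, 0), (300, 80)]:
    print(gaps, classify_gap(*gaps))
```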

Many of these features are unique to this dataset, although overlap can be found with other structural variant databases such as DGV. Potential applications of these rearrangements could be data validation for new and existing data and a better understanding of the prevalence of rearrangements in diverse populations, many of which are underrepresented in current clinical and genomic databases.

Alignment and Conservation tracks

The Chain/Net track shows regions of the human genome that are alignable between the HPRC genomes as well as hg38 and T2T-CHM13. A total of 176 maternal and paternal haplotypes were used in this analysis. The configuration page for this track sorts the haplotypes into 14 subpopulations as follows:

T2T
HAPMAP
Yoruba Nigeria
Esan Nigeria
Gambian
Mende Sierra Leone
Afr Carib Barbados
African SW USA
Puerto Rico
Peru Lima
Colombia Medellin
Han SoChina
Vietnam Kinh
Punjabi Pakis

The 90-way Multiple Alignment track contains multiple alignments of 90 human genomes generated by the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments. This method builds graphs containing all forms of genetic variation while allowing the use of current mapping and genotyping tools. The configuration page sorts the maternal and paternal haplotypes by the same 14 subpopulations described above.

Acknowledgments

We are always looking for feedback. If you would like to see other HPRC data, or the data presented differently, please contact us at gen...@soe.ucsc.edu. Likewise, if you find this data useful and see potential improvements, we would be interested in hearing from you.

We would like to thank the Human Pangenome Reference Consortium for taking on this genomics challenge and providing these data. In particular, we would like to thank Benedict Paten, Heng Li, and Glenn Hickey for their help in putting these Browser tracks together.
