[genome-announce] Non-canonical ORFs track collection on hg38

Jairo Navarro Gonzalez

Jun 23, 2026, 5:17:57 PM
to genome-...@soe.ucsc.edu
Hello everyone,

We are pleased to announce a new Non-canonical ORFs track collection for the human genome assembly (GRCh38/hg38), bringing together several public databases of open reading frames (ORFs) that fall outside annotated protein-coding genes. While the human genome has roughly 20,000 annotated protein-coding genes, ribosome profiling (Ribo-seq) and proteomics have revealed widespread translation of ORFs in regions long considered non-coding, including 5' and 3' UTRs, long non-coding RNAs, pseudogenes, and alternative reading frames of known genes.

These non-canonical ORFs include upstream ORFs (uORFs) in 5' UTRs, which can regulate translation of the downstream coding sequence; small ORFs (sORFs), generally under 100 codons, many of which produce functional micropeptides; downstream ORFs (dORFs) in 3' UTRs; out-of-frame ORFs that overlap known coding sequence in an alternative frame; and ORFs in transcripts annotated as non-coding RNAs or pseudogenes. The collection gathers the following datasets as individual subtracks:

  • UTRannotator uORFs – 44,435 curated uORFs in human 5' UTRs from the UTRannotator VEP plugin (Whiffin lab), useful for placing a VEP prediction in genomic context.
  • GENCODE ncORFs – the GENCODE / TransCODE Phase I reference set (7,264 ATG-initiated ncORFs with Ribo-seq and peptide evidence), plus Phase II primary and comprehensive sets that extend the catalog to shorter and non-ATG ORFs.
  • 5ULTRA uORFs – 22,567 ATG-initiated uORFs mapped to MANE Select transcripts, compiled by the 5ULTRA project for prioritizing 5' UTR variants.
  • nuORFdb – 229,251 non-canonical ORFs with ribosome-profiling evidence from the Broad Institute's nuORFdb v1.2.
  • MetamORF – 664,558 small ORFs consolidated from many primary sources by the MetamORF meta-database.
  • OpenProt – 921,170 reference proteins, isoforms, and alternative proteins from OpenProt v2.2, with a mass-spectrometry-supported subset (≥2 unique peptides).
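As a rough illustration of how the ORF categories described above relate to an annotated coding sequence, here is a minimal Python sketch. It is not how any of the listed databases assign ORF types. The function names, the label strings, and the use of 0-based, half-open transcript coordinates are all assumptions made for this example, and cases such as uORFs that run into the CDS are simplified.

```python
from typing import Optional, Tuple

# "generally under 100 codons" in the announcement; treated here as a hard cutoff (assumption).
SORF_MAX_CODONS = 100

Interval = Tuple[int, int]  # (start, end) on the transcript, 0-based, half-open (assumption)


def classify_orf(orf: Interval, cds: Optional[Interval]) -> str:
    """Assign a coarse, hypothetical category to an ORF relative to the annotated CDS."""
    orf_start, orf_end = orf
    if cds is None:
        # Transcript has no annotated CDS, e.g. a lncRNA or pseudogene.
        return "noncoding-transcript ORF"
    cds_start, cds_end = cds
    if orf_end <= cds_start:
        return "uORF"  # lies entirely in the 5' UTR
    if orf_start >= cds_end:
        return "dORF"  # lies entirely in the 3' UTR
    if (orf_start - cds_start) % 3 != 0:
        return "out-of-frame overlapping ORF"
    return "in-frame with annotated CDS"


def is_small_orf(orf: Interval) -> bool:
    """True if the ORF spans fewer than SORF_MAX_CODONS codons (stop codon counted, by assumption)."""
    return (orf[1] - orf[0]) // 3 < SORF_MAX_CODONS
```

For example, with a CDS at (100, 400), an ORF at (10, 40) would be labeled a uORF and an ORF at (101, 161) would be labeled out-of-frame.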

Every ORF in every subtrack is annotated with the strength of its Kozak sequence, the sequence context around the start codon that governs how efficiently translation initiates. Features are colored by a categorical Kozak label:

  • strong – ATG start
  • moderate – ATG start
  • weak – ATG start
  • near-cognate – non-ATG start, shown separately
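The announcement does not spell out how the categorical label is computed. One widely used convention looks only at the two most influential positions around the start codon: a purine (A or G) at -3 and a G at +4. The sketch below assumes that convention, and it assumes that every non-ATG start is labeled near-cognate, as the list above suggests. The function name and the 7-nucleotide window layout are hypothetical, and the tracks' actual implementation may differ.

```python
def kozak_strength(context: str) -> str:
    """Classify a 7-nt window covering positions -3..+4 around a start codon.

    Window layout (assumption): index 0 is position -3, indices 3..5 are the
    start codon, and index 6 is position +4.
    """
    context = context.upper()
    if len(context) != 7:
        raise ValueError("expected a 7-nt window: 3 nt upstream, start codon, 1 nt downstream")
    if context[3:6] != "ATG":
        # Non-ATG starts are reported separately rather than graded (assumption).
        return "near-cognate"
    purine_at_minus3 = context[0] in "AG"
    g_at_plus4 = context[6] == "G"
    # Both features present: strong; exactly one: moderate; neither: weak.
    matches = int(purine_at_minus3) + int(g_at_plus4)
    return ("weak", "moderate", "strong")[matches]
```

Under these assumptions, the window ACCATGG would be called strong, TCCATGG moderate, and TCCATGT weak.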

Each subtrack offers filters for the start codon, Kozak strength, and a numeric Kozak translational efficiency score, along with dataset-specific filters such as ORF type and evidence category.

See the Non-canonical ORFs collection page and the individual subtrack description pages for per-dataset methods, item counts, download URLs, and references.

We would like to thank the data providers who made these resources publicly available: Xiaolei Zhang, Nicola Whiffin, and the UTRannotator team at Imperial College London; Jonathan Mudge, Jorge Ruiz-Orera, John Prensner, Sebastiaan van Heesch, and the GENCODE / TransCODE consortium; Matthieu Chaldebas and the 5ULTRA team; Tamara Ouspenskaia, Travis Law, Karl Clauser, and colleagues at the Broad Institute of MIT and Harvard for nuORFdb; the MetamORF team at the TAGC laboratory, Aix-Marseille University; and Xavier Roucou and the OpenProt team at the Université de Sherbrooke. We also thank Eric Malekos (UCSC) for suggesting nuORFdb, and the VuTR authors (Whiffin lab) for the Kozak-strength implementation. Finally, we would like to thank Max Haeussler and Jairo Navarro for creating and releasing these UCSC Genome Browser tracks.
