How different is hg38.fa.gz from GCA_000001405.15_GRCh38_full_analysis_set.fna.gz?


Ahuno, Samuel

Mar 8, 2024, 3:10:27 PM
to gen...@soe.ucsc.edu
Dear UCSC genomes,

My name is Sam. I'm from Ghana and currently a student at MSKCC.

Question: How different is `hg38.fa.gz` from the `GCA_000001405.15_GRCh38_full_analysis_set.fna.gz` human reference genome?

Are the coordinates and sequences the same?

In other words, if I'm comparing two sets of files (ONT samples mapped to GCA_000001405.15_GRCh38_full_analysis_set.fna.gz and whole-genome sequencing files mapped to hg38.fa.gz), do I need to realign everything to the same reference genome, or can I just go ahead?
 
Thank you in advance,
Sam

Luis Nassar

Mar 12, 2024, 6:59:03 PM
to Ahuno, Samuel, gen...@soe.ucsc.edu

Hi, Sam.

There are a few differences.

You can see the two README files here:

https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/README_analysis_sets.txt

Here are the two relevant parts:

The no_alt_analysis_set contains the sequences, in FASTA format, of
the chromosomes, mitochondrial genome, unlocalized scaffolds, and
unplaced scaffolds. The alternate locus scaffolds are omitted because
many Next Generation Sequence read alignment pipelines are
incompatible with the full assembly model. The two PAR regions on
chromosome Y, and duplicate copies of centromeric arrays and WGS on
chromosomes 5, 14, 19, 21 & 22, have been hard-masked with Ns for the
same reason. The sequences for chromosomes 5, 14, 19, 21, 22 & Y
therefore differ from the INSDC sequences, although the coordinates
are preserved. An Epstein-Barr virus (EBV) sequence is included in the
analysis set, even though it is not part of the genome assembly, to
act as a sink for alignment of reads that are often present in
sequencing samples.
The definition line has a UCSC-style sequence
identifier and contains metadata in a series of space-separated
tag-value pairs.

The full_analysis_set contains the alternate locus scaffolds in
addition to all the sequences present in the no_alt_analysis_set.

So the full analysis from NCBI includes patch sequences (https://genome-blog.gi.ucsc.edu/blog/2019/02/22/patches/), hard-masked PAR and some centromeric regions, and EBV sequence as a decoy. The UCSC file you reference does not contain any of these sequences (or hard-masked PAR/centromeric regions).

That being said, the base assembly sequence is all the same, so the coordinates are all correct. The only note to consider here is that the NCBI full_analysis_set was used vs. the no_alt_analysis_set. If you are not interested in these alt sequences, or if you are not running your aligner in a way to tolerate multiple hits, this can lead to matches on alts instead of the often similar region on the main chromosome. Also, that file from UCSC (hg38.fa.gz) does not contain these sequences so those alt results would not be found there. We do offer the same file with alts also, which can be found here.
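
One way to check this on your own copies is to summarize every sequence in both files and compare names, lengths, and content. The following is a minimal Python sketch, not a UCSC or NCBI tool. The function names (`summarize_fasta`, `compare_references`, `open_fasta`) are hypothetical, and it assumes both files use UCSC-style names (chr1, chrY, ...) as the first word of each header line. Sequences are upper-cased before hashing so that any soft-masking differences are ignored, while hard-masked (N) regions still show up as differences.

```python
import gzip
import hashlib
from typing import Dict, Iterable, List, NamedTuple


class SeqSummary(NamedTuple):
    length: int   # total bases, including Ns
    n_count: int  # number of N bases (gaps or hard-masked regions)
    md5: str      # MD5 of the upper-cased sequence


def open_fasta(path: str):
    """Open a plain or gzip-compressed FASTA file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def summarize_fasta(lines: Iterable[str]) -> Dict[str, SeqSummary]:
    """Summarize each FASTA record, keyed by the first word of its header."""
    summaries: Dict[str, SeqSummary] = {}
    name = None
    length = n_count = 0
    digest = hashlib.md5()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                summaries[name] = SeqSummary(length, n_count, digest.hexdigest())
            name = line[1:].split()[0]
            length, n_count, digest = 0, 0, hashlib.md5()
        elif name is not None:
            seq = line.upper()
            length += len(seq)
            n_count += seq.count("N")
            digest.update(seq.encode("ascii"))
    if name is not None:
        summaries[name] = SeqSummary(length, n_count, digest.hexdigest())
    return summaries


def compare_references(
    a: Dict[str, SeqSummary], b: Dict[str, SeqSummary]
) -> Dict[str, List[str]]:
    """Report names unique to each reference and shared names that differ."""
    shared = sorted(set(a) & set(b))
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        # Different lengths mean coordinates are not comparable.
        "length_mismatch": [s for s in shared if a[s].length != b[s].length],
        # Same length but different bases, e.g. hard-masking with Ns.
        "sequence_differs": [
            s for s in shared
            if a[s].length == b[s].length and a[s].md5 != b[s].md5
        ],
    }
```

An empty `length_mismatch` list for the shared sequences would support the point that coordinates are preserved. Entries under `sequence_differs` would be worth inspecting via `n_count` to see whether hard-masking accounts for them.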

In conclusion, if the ONT samples were mapped with an aligner that handles these patch sequences, then it should not be a problem. Otherwise, you may want to realign using the no_alt_analysis_set or using UCSC for both.
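
Before deciding whether to realign, it may help to compare the sequence dictionaries recorded in the two sets of alignment files. The sketch below is only an illustration under stated assumptions: it takes the header text as a string (for example, the output of `samtools view -H file.bam`), reads the standard `@SQ` lines with their `SN` (name) and `LN` (length) tags, and uses hypothetical function names (`parse_sq_lines`, `headers_compatible`, `extra_contigs`).

```python
from typing import Dict, List


def parse_sq_lines(header_text: str) -> Dict[str, int]:
    """Return {sequence name: length} from the @SQ lines of a SAM header."""
    contigs: Dict[str, int] = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each tab-separated field after the record type looks like TAG:value.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def headers_compatible(a: Dict[str, int], b: Dict[str, int]) -> bool:
    """True when every contig present in both headers has the same length."""
    return all(a[name] == b[name] for name in a.keys() & b.keys())


def extra_contigs(a: Dict[str, int], b: Dict[str, int]) -> List[str]:
    """Contigs present in header `a` but absent from header `b`."""
    return sorted(a.keys() - b.keys())
```

Matching lengths on the shared contigs suggest the coordinates line up. The contigs returned by `extra_contigs` show which sequences (alt scaffolds or a decoy, for instance) could have attracted reads in one data set but not the other.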

I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute



Ahuno, Samuel

Mar 14, 2024, 12:11:35 PM
to Luis Nassar, gen...@soe.ucsc.edu

Thank you for the clarification!

-S

Terence Murphy

Nov 12, 2024, 7:25:52 PM
to UCSC Genome Browser Public Support, Ahuno, Samuel, gen...@soe.ucsc.edu, Luis Nassar
Just adding a small clarification to the above:

> So the full analysis from NCBI includes patch sequences
The full analysis set does NOT include any patches, but does include alternate loci scaffolds. It represents sequences found in the original GRCh38 release in January 2014, prior to any patch releases.

That said, the conclusions are the same. If you are using an alt-aware aligner, then use the full_analysis_set sequences. Otherwise you likely want to use one of the "no_alt" sets (with or without the addition of the decoy sequences).
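
To see which kind of set a given reference is, one option is to classify its sequence names. This is a rough sketch that assumes the UCSC-style naming conventions (`_alt` for alternate loci, `_fix` for fix patches, `_random` for unlocalized scaffolds, `chrUn_` for unplaced scaffolds, `chrEBV` for the EBV sequence). The category labels and function names are hypothetical, and names following other conventions fall into "other".

```python
import re
from collections import Counter
from typing import Iterable

# Primary chromosomes: chr1..chr22, chrX, chrY, chrM.
PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def classify(name: str) -> str:
    """Assign a UCSC-style sequence name to a coarse category."""
    if PRIMARY.match(name):
        return "primary"
    if name == "chrEBV":
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_fix"):
        return "fix_patch"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "other"


def summarize_names(names: Iterable[str]) -> Counter:
    """Count how many sequence names fall into each category."""
    return Counter(classify(n) for n in names)


def has_alt_scaffolds(names: Iterable[str]) -> bool:
    """True if any name looks like an alternate locus scaffold."""
    return any(classify(n) == "alt" for n in names)
```

If `has_alt_scaffolds` returns True for the reference behind a set of alignments, the advice above about using an alt-aware aligner applies to that data set.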

Ahuno, Samuel

Nov 14, 2024, 12:26:13 PM
to Terence Murphy, UCSC Genome Browser Public Support, gen...@soe.ucsc.edu, Luis Nassar

Thank you so much for adding this information.

Best

Sam
