Hi, Sam.
There are a few differences.
You can see the two README files here:
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/README_analysis_sets.txt
Here are the two relevant parts:
The no_alt_analysis_set contains the sequences, in FASTA format, of
the chromosomes, mitochondrial genome, unlocalized scaffolds, and
unplaced scaffolds. The alternate locus scaffolds are omitted because
many Next Generation Sequence read alignment pipelines are
incompatible with the full assembly model. The two PAR regions on
chromosome Y, and duplicate copies of centromeric arrays and WGS on
chromosomes 5, 14, 19, 21 & 22, have been hard-masked with Ns for the
same reason. The sequences for chromosomes 5, 14, 19, 21, 22 & Y
therefore differ from the INSDC sequences, although the coordinates
are preserved. An Epstein-Barr virus (EBV) sequence is included in the
analysis set, even though it is not part of the genome assembly, to
act as a sink for alignment of reads that are often present in
sequencing samples. The definition line has a UCSC-style sequence
identifier and contains metadata in a series of space-separated
tag-value pairs.

The full_analysis_set contains the alternate locus scaffolds in
addition to all the sequences present in the no_alt_analysis_set.
So the full_analysis_set from NCBI includes patch sequences (https://genome-blog.gi.ucsc.edu/blog/2019/02/22/patches/), hard-masked PAR and some centromeric regions, and an EBV sequence as a decoy. The UCSC file you reference does not contain any of these sequences (or the hard-masked PAR/centromeric regions).
That being said, the base assembly sequence is the same in both, so the coordinates are all correct. The one thing to consider is that the NCBI full_analysis_set was used rather than the no_alt_analysis_set. If you are not interested in these alt sequences, or if you are not running your aligner in a way that tolerates multiple hits, this can lead to matches on alts instead of the often similar region on the main chromosome. Also, that file from UCSC (hg38.fa.gz) does not contain these sequences, so those alt results would not be found there. We do offer the same file with alts also, which can be found here.
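If it helps to see which kinds of sequences a given FASTA file actually contains, here is a minimal Python sketch that groups sequence names by the usual UCSC-style hg38 naming patterns. The function names (classify_contig, fasta_names, summarize) are made up for illustration, and the suffix rules (_alt, _fix, _random, chrUn_, chrEBV, chrM) are an assumption about the naming in your files, so please check them against the actual headers.

```python
import gzip
from collections import Counter


def classify_contig(name):
    """Classify a UCSC-style hg38 sequence name by its naming pattern.

    Assumption: names follow the usual hg38 conventions
    (e.g. chr1_KI270762v1_alt, chr1_KI270706v1_random, chrUn_KI270302v1).
    """
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_fix"):
        return "fix_patch"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    if name == "chrEBV":
        return "decoy"
    if name == "chrM":
        return "mitochondrion"
    return "chromosome"


def fasta_names(path):
    """Return the sequence identifiers from a FASTA file (plain or gzipped).

    Only the first whitespace-separated token of each definition line is
    kept, so trailing tag-value metadata is ignored.
    """
    opener = gzip.open if path.endswith(".gz") else open
    names = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">") and line[1:].strip():
                names.append(line[1:].split()[0])
    return names


def summarize(names):
    """Count how many sequences fall into each class."""
    return Counter(classify_contig(n) for n in names)
```

Running summarize(fasta_names(path)) on each of the two files and comparing the counts should show at a glance whether one has alt or decoy sequences that the other lacks.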
In conclusion, if the ONT samples were mapped with an aligner that handles these patch sequences, then it should not be a problem. Otherwise, you may want to realign using the no_alt_analysis_set or using the UCSC file for both.
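One rough way to judge whether realignment is worth it is to check how many of the existing alignments landed on alt sequences. The following is a sketch only: alt_hit_fraction is a hypothetical helper, it assumes alt sequences are named with an _alt suffix, and it counts every mapped SAM record (including secondary and supplementary ones) rather than unique reads.

```python
def alt_hit_fraction(sam_lines):
    """Return the fraction of mapped SAM records whose reference is an alt.

    sam_lines is any iterable of SAM text lines (header lines are skipped).
    Assumption: alt sequences carry the UCSC-style "_alt" suffix.
    """
    mapped = 0
    on_alt = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete alignment record
        flag = int(fields[1])   # column 2: bitwise FLAG
        rname = fields[2]       # column 3: reference sequence name
        if flag & 0x4 or rname == "*":
            continue  # unmapped record
        mapped += 1
        if rname.endswith("_alt"):
            on_alt += 1
    return on_alt / mapped if mapped else 0.0
```

The lines could come from a SAM file opened in Python or from the text output of samtools view. A value near zero would suggest the alts made little practical difference for your data.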
I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.
Lou Nassar
UCSC Genomics Institute
Thank you for the clarification!
-S
Thank you so much for adding this information.
Best,
Sam