metadata tags in hg38 analysisSet FASTA file header lines


MISHIMA, Hiroyuki

Jul 15, 2016, 11:40:42 AM
to gen...@soe.ucsc.edu
Dear UCSC Genome Bioinformatics Group,

I have a question about "metadata tag-value pairs" in header lines of
hg38 FASTA files.

According to a document at
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/ ,
zipped FASTA files in this directory, such as files in
hg38.analysisSet.chroms.tar.gz, are supposed to have metadata, such as
AC, gi, LN, rg, rl, M5, AS, hm, and tp tags in their header lines. They
do not, however, seem to have tags.

Is it possible to download hg38 FASTA files (or only header lines)
containing metadata tags somewhere else?

I am sorry if this is a frequently asked question or if I made a mistake
in handling the downloaded files.

Best wishes,
Hiro.

--
MISHIMA, Hiroyuki, DDS, Ph.D.
Assistant Professor
Department of Human Genetics, Nagasaki University

Brian Lee

Jul 15, 2016, 7:26:19 PM
to MISHIMA, Hiroyuki, gen...@soe.ucsc.edu
Dear Hiro,

Thank you for using the UCSC Genome Browser and for your question about the metadata in the header lines of the FASTA files.

The README_ANALYSIS_SETS section is a copy of the file located at NCBI, and some processing is indeed applied to generate the versions used at UCSC. For example, the copied text describes a file named GCA_000001405.15_GRCh38_full_analysis_set, which is a file at NCBI; the corresponding UCSC version is hg38.fullAnalysisSet.chroms.tar.gz.

If you navigate to NCBI's file location, you can access the original files, such as GCA_000001405.15_GRCh38_full_analysis_set. A URL is given after "UCSC copy of this file from:", but please note that in December 2015 NCBI moved these files to the following archive/old_genbank location: ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

If you access the NCBI files mentioned in the copied README_ANALYSIS_SETS, you will find the referenced metadata in their header lines. For example:
cat GCA_000001405.15_GRCh38_full_analysis_set.fna | grep "chr22"
">chr22 AC:CM000684.2 gi:568336002 LN:50818468 rl:Chromosome M5:ac37ec46683600f808cdd41eac1d55cd AS:GRCh38 hm:multiple"
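
Once a header line like the one above is in hand, the tag-value pairs can be split apart with a few lines of Python. This is only a rough sketch: it assumes that fields are separated by whitespace and that each tag is separated from its value by the first colon, as in the chr22 example. The function names parse_fasta_header and headers_with_tags are made up for illustration and are not part of any UCSC or NCBI tool.

```python
def parse_fasta_header(line):
    """Split one FASTA header line into (sequence name, {tag: value}).

    Assumes whitespace-separated fields and "TAG:value" pairs, as in the
    chr22 example from the NCBI analysis set file.
    """
    fields = line.strip().lstrip(">").split()
    name = fields[0]
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only, so values may contain colons.
        key, _, value = field.partition(":")
        tags[key] = value
    return name, tags


def headers_with_tags(lines):
    """Yield parsed headers from any iterable of FASTA lines."""
    for line in lines:
        if line.startswith(">"):
            yield parse_fasta_header(line)


# The header line quoted in this message.
header = (">chr22 AC:CM000684.2 gi:568336002 LN:50818468 rl:Chromosome "
          "M5:ac37ec46683600f808cdd41eac1d55cd AS:GRCh38 hm:multiple")

name, tags = parse_fasta_header(header)
print(name, tags["AC"], tags["LN"])
```

Passing an open file object for the uncompressed .fna file to headers_with_tags would walk through every header without holding the sequence data in memory.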

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to geno...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute