Hello, Scott.
Thank you for your question about SARS-CoV-2 variation data in the UCSC Genome Browser.
We had to stop offering the variant file for download. SARS-CoV-2 variants displayed by Nextstrain are derived from a subset of GISAID sequences, and the GISAID Terms and Conditions prohibit the redistribution of GISAID-derived data. They also require that the submitters of all sequences be acknowledged when the variants are used.
If you are registered with GISAID, you can access GISAID sequences and other downloadable data directly from them. The alignment files are labeled on their site as "msa_<date>". We have a program, faToVcf, that can extract VCF from a multi-sequence FASTA alignment such as the "nextfasta" download from GISAID. faToVcf is available for Linux and macOS here:
https://hgdownload.soe.ucsc.edu/admin/exe/. It requires at least 4GB of memory. You can run the program without any arguments to see the usage statement and options. Here are some steps to get started with the tool:
# This command enables faToVcf to be run as a program (otherwise the command would say "Permission denied")
chmod a+x faToVcf
# This command shows basic usage instructions and describes the options:
./faToVcf
# This command converts the msa FASTA file to VCF without per-sample genotype columns:
./faToVcf -includeRef \
-ref='hCoV-19/Wuhan/Hu-1/2019|EPI_ISL_402125|2019-12-31|Asia' \
-vcfChrom=NC_045512.2 \
-noGenotypes \
msa_0925.fasta msa_0925.sites.vcf
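If the conversion gives an unexpected result, a couple of quick checks may help. This is only a sketch: the helper names fasta_has_ref and count_vcf_sites are hypothetical (they are not part of faToVcf), and the first check assumes the FASTA header line is exactly ">" followed by the name passed to -ref, with no trailing text or whitespace.

```shell
# Hypothetical helper: succeeds if FILE contains a header line exactly equal to ">NAME".
# Usage: fasta_has_ref FILE NAME
fasta_has_ref() {
    # -F: treat the name as a fixed string (it contains '|' and '/')
    # -x: match the whole line, -q: print nothing, just set the exit status
    grep -qFx -- ">$2" "$1"
}

# Hypothetical helper: prints the number of VCF data lines (lines not starting with '#').
# Usage: count_vcf_sites FILE
count_vcf_sites() {
    # grep exits non-zero when the count is 0, so ignore its exit status
    grep -vc '^#' "$1" || true
}

# Example usage with the file names from the command above (uncomment to run):
# fasta_has_ref msa_0925.fasta 'hCoV-19/Wuhan/Hu-1/2019|EPI_ISL_402125|2019-12-31|Asia' \
#     || echo "Reference name not found in FASTA headers"
# count_vcf_sites msa_0925.sites.vcf
```

If the first check reports that the reference name is missing, compare the -ref value against the actual header lines in your download (for example with grep '^>' on the FASTA file).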
Matthew Speir
UCSC Cell Browser, Quality Assurance and Data Wrangler
Human Cell Atlas, User Experience Researcher
UCSC Genome Browser, User Support
UC Santa Cruz Genomics Institute
Revealing life’s code.