Query about combining hg38 snp and gene data


Viv

Jun 5, 2017, 2:41:54 PM
to gen...@soe.ucsc.edu
Hi,

I am trying to put together a dataset of hg38 human genome SNP data (CDS only) that contains information on:

  • SNP chromosomal location
  • Transcript coordinates
  • Allele frequencies
  • Allele change 
  • Codon change
  • Peptide change
  • Variant consequence
  • Gene within which SNP is located
  • Allele count
  • Allele frequency

Using the Table Browser, I can obtain everything I need from the variation tables, except which gene each SNP is located in. I have tried to do this manually with a bit of code that maps SNP coordinates to transcript coordinates. This has led to each SNP being found in multiple genes (possibly due to transcript overlap?).

I would like to know whether it is possible to get this information straight from the Table Browser and, if not, what the best approach to this problem would be. Many thanks.

Kind regards,

Viv


Christopher Lee

Jun 15, 2017, 1:20:30 PM
to Viv, gen...@soe.ucsc.edu

Hi Viv,

Thank you for following up from your original BioStars question and for asking about obtaining hg38 gene and SNP info. My apologies for taking so long to get back to you. Unfortunately, there is no easy way to get the info you want. The Variant Annotation Integrator (VAI) can produce all of the information you want, but it is limited to 100,000 variants at a time, which means that not even all of chromosome 1 can be annotated in a single run:
http://genome.ucsc.edu/cgi-bin/hgVai

If you were to go with the VAI approach, the best way to get this info is to load one of the VCFs from NCBI:
https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/

as a custom track, then navigate to the VAI and limit your output to one chromosome (or sections of a chromosome) at a time. However, this approach is time-consuming and not really the best option.
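
As an illustration of working around the 100,000-variant limit, the sketch below splits an uncompressed VCF into pieces of at most 100,000 records each, repeating the header lines in every piece. The function name split_vcf, the chunk_NNN.vcf output naming, and the idea of loading each piece as its own custom track are assumptions made for this example, not a UCSC-provided tool or a documented VAI workflow.

```shell
# split_vcf INPUT.vcf [MAX_RECORDS] [PREFIX]
# Hypothetical helper: writes PREFIX_001.vcf, PREFIX_002.vcf, ...
# Each output file holds the full VCF header plus at most MAX_RECORDS
# variant lines (default 100000, the VAI limit mentioned above).
split_vcf() {
  awk -v max="${2:-100000}" -v prefix="${3:-chunk}" '
    # Collect header lines (they start with "#") so they can be repeated.
    /^#/ { header = header $0 "\n"; next }
    # Start a new output file every max records.
    (n % max) == 0 {
      if (out) close(out)
      out = sprintf("%s_%03d.vcf", prefix, ++k)
      printf "%s", header > out
    }
    # Write the current variant line to the open chunk.
    { print > out; n++ }
  ' "$1"
}
```

For example, after decompressing a downloaded file, `split_vcf variants.vcf 100000 part` would write part_001.vcf, part_002.vcf, and so on (variants.vcf is a placeholder file name).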

You can run queries against our public MySQL server but, in my experience while investigating this question, a genome-wide query will time out. You could, however, put the query in a bash loop over the chromosome names, which works but may still time out depending on how far you are from our MySQL server:

# Query one chromosome at a time and append each result to hg38.snpData.
for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y M; do
  mysql --host=genome-mysql.soe.ucsc.edu --user=genome -A -Ne "
    select a.chrom, a.chromStart, a.chromEnd, a.name, a.transcript,
           a.alleles, a.codons, a.peptides,
           b.refNCBI, b.observed, b.func, b.alleleFreqs, refGene.name2
    from snp147CodingDbSnp a
    join snp147 b on b.name = a.name
    join refGene on a.transcript = refGene.name
    where a.chrom = 'chr${chr}'" hg38 >> hg38.snpData
done
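
Because the query returns one row per SNP and transcript, a SNP that falls in several transcripts (or in overlapping genes) appears on several lines. A minimal sketch of one way to see which gene symbols each SNP maps to is given below. It assumes the output file is tab-separated with the columns in the order selected above, so that column 4 is the rs ID (a.name) and column 13 is the gene symbol (refGene.name2). The function name collapse_genes is made up for this example.

```shell
# collapse_genes FILE
# Hypothetical helper: prints one line per SNP in the form
#   rsID <TAB> comma-separated unique gene symbols
# Assumes column 4 = rs ID and column 13 = gene symbol (refGene.name2).
collapse_genes() {
  # Keep only (rs ID, gene) pairs, drop duplicates, and sort bytewise so
  # that all rows for the same rs ID end up adjacent.
  cut -f4,13 "$1" | LC_ALL=C sort -u | awk -F'\t' '
    # A new rs ID: print the previous group, then start a new one.
    $1 != prev { if (NR > 1) print prev "\t" genes; prev = $1; genes = $2; next }
    # Same rs ID as the previous line: append the additional gene symbol.
    { genes = genes "," $2 }
    END { if (NR > 0) print prev "\t" genes }'
}
```

Running `collapse_genes hg38.snpData` then lists each SNP once. SNPs with more than one gene symbol are the ones affected by overlapping gene annotations rather than by multiple transcripts of the same gene.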

Given these caveats, I think the approach from your BioStars question is probably the best for you since you already know it works. One of our engineers notes that dbNSFP has precomputed all sorts of things for every possible CDS SNP in the GENCODE v22 gene set, but it is enormous and would require some work to intersect SNP coords and alleles with its data:
https://sites.google.com/site/jpopgen/dbNSFP

I hope this is helpful. Please let us know if you have any other questions.

Thank you again for your inquiry and for using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute


