Looking for all the SNPs in a gene

Alexander Floren

Mar 4, 2020, 7:01:47 PM
to gen...@soe.ucsc.edu
Hi,

I’m working on a program for my final project for Dr. Bernick’s BME 160 course involving using personal genomic data (like you would get from a 23andMe ‘raw data’ download) in order to create alignments against the reference genome. I’m hoping for a little bit of assistance in obtaining the information I need in order to make this project work.

I want to take in the name of a gene (say, “FTO”) and obtain a list of all the known SNPs (in rsID form) in that gene. I was surprised to find that the NCBI Entrez E-Utilities do not have an easy way to do this. If you have any ideas, I would greatly appreciate the help.

Once I have a list of the rsIDs in a given gene, I should be able to find the individual’s genotype for each SNP and reconstruct a unique sequence for that individual, which could then be aligned to the reference genome for comparison.

Thanks,
Alexander Floren

Brian Lee

Mar 5, 2020, 6:53:03 PM
to Alexander Floren, UCSC Genome Browser Mailing List

Dear Alexander,

If you are looking for all the SNPs in a gene, such as FTO, one option is to use the coordinate range to extract the data from the Table Browser.

For example, the RefSeq MANE transcript NM_001080432.3 for FTO spans chr16:53,704,156-54,121,941 (hg38). If you go to the Table Browser, limit the "position" to "chr16:53,704,156-54,121,941", set the group to "Variation" and the track to "dbSNP 153", and click "get output", you can get all the SNPs for that region.
 
For readers of this mailing list who might wish to do this extraction programmatically, you can use our tool bigBedToBed with the same coordinate range to extract the data from a binary-indexed data file into an output file called FTO_SNPS:

bigBedToBed -chrom=chr16 -start=53704156 -end=54121941 http://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/dbSnp153Common.bb FTO_SNPS
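
For anyone scripting the next step, here is a minimal Python sketch of pulling the rsIDs out of that output file. It assumes the bigBedToBed output is tab-separated BED with the rsID in the fourth (name) column, as in standard BED. The function name `rsids_from_bed` and the sample rows are made up for illustration; the rsIDs in them are placeholders, not real FTO variants.

```python
def rsids_from_bed(lines):
    """Return the BED name column (4th field) of each row, in order, without repeats."""
    seen = set()
    rsids = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and any comment/header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # not a BED4+ row, ignore it
        name = fields[3]
        if name not in seen:
            seen.add(name)
            rsids.append(name)
    return rsids


# Placeholder rows (hypothetical IDs), only to show the expected shape.
sample = [
    "chr16\t53704200\t53704201\trsPLACEHOLDER1\n",
    "chr16\t53704300\t53704301\trsPLACEHOLDER2\n",
]
print(rsids_from_bed(sample))
```

With a real output file, the same function could be called as `rsids_from_bed(open("FTO_SNPS"))`.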

On rereading your question, however, it sounds like you may instead be looking for the gene names for a list of SNPs in the 23andMe data format, which looks like the following:

# rsid  chromosome      position        genotype
rs4477212       1       82154   AA
rs3094315       1       752566  AA
rs1365874384    10      69644601        AA

If that is the case, you can convert these rows into our BED format and then intersect them with a genes track to find the gene name for each of these SNPs.

Here is a script to generate the four-column BED data from the 23andMe file:

unzip -p genome_Me.zip | grep -v '^#' | awk '{print "chr" $2, $3-1, $3, $1;}' > 23AndMe.bed
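
For anyone who would rather do this conversion inside a Python program, the same logic can be sketched as follows. 23andMe positions are 1-based, while BED uses a 0-based start and an exclusive end, which is why the start is the position minus one. The function name `row_to_bed` is hypothetical. The MT-to-chrM renaming is an extra step that the one-liner above does not perform; it assumes mitochondrial rows are labeled "MT" in the 23andMe file.

```python
def row_to_bed(line):
    """Convert one 23andMe row to a BED4 tuple, or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 3:
        return None  # malformed row, skip it
    rsid, chrom, pos = fields[0], fields[1], int(fields[2])
    # Assumption: 23andMe writes "MT" for mitochondria; UCSC names it chrM.
    chrom = "chrM" if chrom == "MT" else "chr" + chrom
    # 1-based position -> 0-based, half-open BED interval of length 1.
    return (chrom, pos - 1, pos, rsid)


# Rows taken from the example in this thread.
rows = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs4477212\t1\t82154\tAA",
    "rs3094315\t1\t752566\tAA",
]
for row in rows:
    bed = row_to_bed(row)
    if bed is not None:
        print(*bed)
```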

You can then load this file as a custom track and use the Data Integrator tool to discover where your SNPs overlap with genes. First, add a Custom Track under the "My Data" menu, where you can paste your transformed rows as BED custom track data:

chr1 82153 82154 rs4477212
chr1 752565 752566 rs3094315
chr10 69644600 69644601 rs1365874384

Then use the Tools menu to navigate to the "Data Integrator" and click "Add" next to your new Custom Track (use the hg19 assembly for this example). Next, add a genes track to intersect your custom rsIDs against: select the track group "Genes and Gene Predictions" and click "Add" for "UCSC Genes (knownGene)" so that it appears after the Custom Track in the section titled "Configure Data Sources". Then click "Choose fields", select only "name" for your custom track, and clear all but "name" for the knownGene table as well. Click "Add table" and select "geneSymbol" from the hg19.kgXref table. Click "Done" and then "Get output" to get the rsIDs together with the names of the genes they intersect. If you get no results, check that "region to annotate" is set to "genome".

Here is some example output, where the gene columns are filled in only for the SNPs that overlap a gene:

#ct_UserTrack_3545.name    knownGene.name    knownGene.kgXref_geneSymbol
rs4477212        
rs3094315        
rs1365874384    uc001jnd.3    SIRT1
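
If you want to consume this output in a program, a minimal Python sketch for turning it into an rsID-to-gene-symbol lookup is below. It assumes the Data Integrator output is tab-separated with the columns in the order shown above (rsID, knownGene name, gene symbol), and that empty gene columns mean no overlap. The function name `genes_by_rsid` is hypothetical. A set is used per rsID because a SNP may overlap more than one transcript and so appear on several rows.

```python
def genes_by_rsid(lines):
    """Map each rsID to the set of gene symbols it overlaps (empty set if none)."""
    result = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and the "#..." header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("\t")]
        rsid = fields[0]
        symbol = fields[2] if len(fields) > 2 else ""
        genes = result.setdefault(rsid, set())
        if symbol:
            genes.add(symbol)
    return result


# The example output from this thread, with tabs assumed as separators.
output = [
    "#ct_UserTrack_3545.name\tknownGene.name\tknownGene.kgXref_geneSymbol",
    "rs4477212\t\t",
    "rs3094315\t\t",
    "rs1365874384\tuc001jnd.3\tSIRT1",
]
lookup = genes_by_rsid(output)
print(sorted(rsid for rsid, genes in lookup.items() if genes))
```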

There are also conversion tools out there that will turn 23andMe files into VCF files that can be loaded onto the UCSC Genome Browser. Other tools will convert a VCF file to a FASTA file, which you may want to investigate for your project's goals.

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further public questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
