Dear Chris and whom it may concern at UCSC,
Thanks so much for your reply on Biostars—(see below Re: finding matching rs numbers)—
I am new to data matching and so would really appreciate further instruction—I’ve attached the data I have available,
Thanking-you in advance,
Kind regards,
Fleur Garton
Question: Find matching rs number for a list of gene/HGVS/cDNA data
HI! I would like to get an rs number using (any) of the following information that I have in a data spreadsheet Gene name (e.g. EXOSC3)/ HGVS cDNA (old nomenclature) (e.g. c.92G>C)/ HGVS protein (p.G31A for protein)/ cDNA_reference (e.g. NM_016042.2) I was thinking that this would be relatively straightforward - i.e. download similar data with rs number - then merge the files - but can't find this! Would be extremely grateful if anyone can help, Thanks so much, Fleur
Anwer: Find matching rs number for a list of gene/HGVS/cDNA data
You can use the UCSC Genome Browser to get this information, but the process depends on the specific HGVS terms you have.
The general idea is to split your list into two lists, one with just your different HGVS terms, and one with just your gene names. Then you can use a combination of the Table Browser and Data Integrator to get the rsID's (if any) corresponding to your gene names, and the Variant Annotation Integrator to get the rsID's (if any) corresponding to your HGVS terms.
We can provide more specific steps if you email this question along with some example data to our mailing list gen...@soe.ucsc.edu, that way our entire team will see the question and can help with a solution.
Thanks,
ChrisL from the UCSC Genome Browser
Fleur Garton, PhD
Institute for Molecular Bioscience | University of Queensland | St Lucia, QLD, Australia, 4072
E: f.ga...@uq.edu.au P: 0423 210 114
This email (including any attached files) is intended solely for the addressee. It may contain private or confidential information. If you are not the intended addressee, you must take no action based on it, nor show a copy to anyone. Kindly notify the sender by reply email. Opinions and information in this email which do not relate to the official business of The University of Queensland shall be understood as neither given nor endorsed by the University. CRICOS Provider Number 00025B.
Hi Fleur,
Thanks for sending your message to our list and sharing some example data. There are a couple problems with the format of your HGVS terms. Firstly, it appears that your HGVS terms are gene specific rather than transcript specific, which presents a problem, as something like "c.127G>A" can indicate different coordinates depending on which transcript of the ACADM gene you are referring to. Secondly, a term like ABCC8:c.3989-9G>A(3992-9G>A), even if formatted into a transcript specific term like NM_000352:c.3989-9G>A(3992-9G>A), is not valid HGVS because of the "(3992-9G>A)" at the end.
Thus you will need to first clean up your list of HGVS terms before using the Variant Annotation Integrator (http://genome.ucsc.edu/cgi-bin/hgVai)
in order to find the associated rsID's for your terms. To start, here
is an example command to clean up HGVS terms like the one above
(NM_000352:c.3989-9G>A(3992-9G>A)):sed -re 's/\([^)]+\)$//;'
Secondly, you will need to convert your gene names into transcript
ID's so they can be used in the VAI. You can find the accession's
associated with your genes with the Table Browser:
1. Navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables, or Tools->Table Browser from the top blue menu bar.
2. Choose your assembly of interest (either hg19 or hg38)
3. Make the following selections from the "group", "track", and "table" dropdowns:
- Genes and Gene Predictions
- If you are working with hg19 data, choose the RefSeq Genes track and the refGene table
- If you are working with hg38 data, choose the NCBI RefSeq track, and the RefSeq Curated (ncbiRefSeqCurated) table
4.
Next to "identifiers", click "paste list", and copy and paste only the
gene names (the gene column from Gene_HGVS_GARTON.xlsx) you are
interested in into the box. Alternatively, if you have a file of only
gene names, you can upload the list with the "upload list" button.
5. In the output format dropdown, choose "selected fields from primary and related tables"
6. Click "get output"
7. On the "Select Fields from " page, uncheck everything except for the name and name2 fields.
8. Click "get output"
This leads to output like the following:
NM_001171811.1 GBA NM_001171812.1 GBA NM_001177520.2 ALPL NM_001243564.1 DHDDS NM_001243565.1 DHDDS NM_001243766.1 POMGNT1 NM_001278188.1 TPM3 NM_001278189.1 TPM3 NM_001278190.1 TPM3 NM_001278191.1 TPM3
If you save the above Table Browser result to a file (called
nmToGene.txt below), you can use the following commands to find the
genes with multiple transcripts so you can more easily choose the
appropriate one:
$ tail -n +2 nmToGene.txt | awk '{print $2 "\t" $1;}' | sort > geneToNm.txt $ cut -f 1 geneToNm.txt | uniq -c | awk '$1 > 1' > multiNmGenes.txt
You can then replace your list of gene names with NM transcript accessions, form a file with valid HGVS terms like "NM_000352:c.3989-9G>A", and use the VAI to get the rsID's associated with your HGVS terms:
1. Navigate to the Variant Annotation Integrator: http://genome.ucsc.edu/cgi-bin/hgVai, or Tools->Variant Annotation Integrator from the top blue menu bar.
2. Choose your assembly of interest from the "clade", "genome", and "assembly" dropdowns.
3. Select "genome" under the "region to annotate" dropdown if not already selected.
4.
In the "Select Variants" section, select "HGVS terms" from the
"variants" dropdown. Once the text box pops up, paste your newly formed
HGVS terms into the text box.
5. In the "Select Genes" section,
select either "RefSeq Genes" or "NCBI RefSeq genes, curated subset
(NM_*, NR_*, and YP_*)", depending on your assembly of interest.
6. Click the "clear all" button in the "Select More Annotations (optional)" section.
7. Ensure "Include dbSNP rs# ID if one exists" is selected in the "Known variation" section.
8.
Finally, make sure "Variant Effect Predictor (tab-separated text)" is
selected from the output format dropdown, and enter a name for the
output file in the "output file" text box, and click "Get results".
9. Agree to the disclaimer that the VAI does not constitute medical advice.
Once the file finishes downloading, you will have output like the following (I only used one HGVS term in this example):
## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator) ## Output produced at 2017-04-25 12:14:59 ## Connected to UCSC database hg38 ## Variants: HGVS terms ## Transcripts: GENCODE v24 Comprehensive Transcript Set (only Basic displayed by default) (hg38.knownGene) ## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 147) (/gbdb/hg38/vai/snp147.bed4.bb) Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra NM_000352:c.3989-9G>A chr11:17397055 T ABCC8 uc057zmr.1 Transcript upstream_gene_variant - - - - - rs151344623 DISTANCE=1275 NM_000352:c.3989-9G>A chr11:17397055 T ABCC8 uc057zms.1 Transcript upstream_gene_variant - - - - - rs151344623 DISTANCE=1170 NM_000352:c.3989-9G>A chr11:17397055 T ABCC8 uc001mnc.4 Transcript intron_variant - - - - - rs151344623 INTRON=32/38 ... ... ...
$ awk 'FNR > 7 {print $1, $13}' outputFileName | uniq > termAndRsID.txt NM_000352:c.3989-9G>A rs151344623
Now that we have the associated rsID's for your HGVS terms, we can move on to finding all the rsID's associated with your list of Gene Names and Entrez ID's via the Table Browser and Data Integrator.
1. Follow the same Table Browser steps as previously mentioned,
except this time paste in your list of gene names (the gene column) from
the genelist_GARTON.xlsx file as identifiers, and on the "Select fields
from " page, choose chrom, txStart, txEnd, and name2.
2. You should now have output like the following:
#chrom txStart txEnd name2 chr1 201283451 201332993 PKP1 chr1 201283451 201332993 PKP1 chr1 92246397 92299009 GLMN chr1 92246397 92299009 GLMN
This output will be of the following form (depending on the fields you selected in step 10):
chr1 1020122 1056119 AGRN rs115173026 chr1 1020122 1056119 AGRN rs201073369 chr1 1020122 1056119 AGRN rs199682825
A genome-wide query on Common SNPs 147 will return a ~3MB gzipped file, while a chromosome 1 only search on All SNPs 147 will return a ~17MB gzipped file, so you can understand how things would time out on a genome-wide query, and how many more SNPs are in the snp147 table compared to the snp147Common table.
Please let us know if you have any further questions!
Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.
Christopher Lee
UCSC Genomics Institute
--
---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/85F5103C-4D61-44E1-AF16-E05C05FB091E%40uq.edu.au.
For more options, visit https://groups.google.com/a/soe.ucsc.edu/d/optout.
HI Chris,
Thanks so much for this,
Stuck at step 3….
Step 1. Convert your gene names into transcript ID's so they can be used in the VAI.
Using hg19, the transcript does not come with a reference number..but that is ok— I have both--
Step 2. Ok
Step 3. Use the VAI to get the rsID's associated with your HGVS terms:
I just get a blank output.. even if I use one example that you were able to do ‘NM_000352:c.3989-9G>A’
Maybe it is something obvious—once I select everything – do I wait ? or click get results? – there seems to be a ‘thinking bar’ still going when the text asks me if I want to view..
Additionally if I have the version number of the transcript—do need to use hg38?
Thanks again, really appreciate all the help,
Sorry couldn’t get it(!)
Fleur
## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-05-04 23:48:49
## Connected to UCSC database hg19
## Variants: HGVS terms
## Transcripts: RefSeq Genes (hg19.refGene)
## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 147) (/gbdb/hg19/vai/snp147.bed4.bb)
## Keys for Extra column items:
## REFSEQ_STATUS: <A HREF="http://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_status_codes" TARGET=_BLANK>RefSeq status</A> of the transcript
## GENCODE_TAG: <A HREF="http://www.gencodegenes.org/gencode_tags.html" TARGET=_BLANK>GENCODE tags</A> for the transcript
Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra
NM_000352.3 |
: |
c.560T>A |
NM_000016.4 |
: |
c.127G>A |
NM_000016.4 |
: |
c.199T>C |
NM_000016.4 |
: |
c.250C>T |
This email (including any attached files) is intended solely for the addressee. It may contain private or confidential information. If you are not the intended addressee, you must take no action based on it, nor show a copy to anyone. Kindly notify the sender by reply email. Opinions and information in this email which do not relate to the official business of The University of Queensland shall be understood as neither given nor endorsed by the University. CRICOS Provider Number 00025B.
Hello Fleur,
Thank you for using the UCSC Genome Browser and reporting your issues.
It looks like the reason you are not seeing any output for your HGVS terms on hg19 is because our database has only the latest versions of RefSeq transcripts, and the transcripts in the four example terms you included are all for older versions of RefSeq transcripts. For example, a term uses NM_000352.3 but the latest RefSeq version is NM_000352.4. You could strip the versions (e.g. "NM_000352.3" --> "NM_000352") to use the RefSeq version we have for your given assembly. However, there is the possibility that there are differences between older and newer NM_ versions such as changes to sequence length or the addition of indels.
Please try the example HGVS term again using the following settings in VAI:
clade: Mammal
genome: Human
assembly: Feb 2009 (GRCh37/hg19)
region to annotate: genome
variants: HGVS terms
Inside the text box, paste the example HGVS term: NM_000352:c.3989-9G>A.
Under the Select More Annotations (optional) section, make sure only the Include dbSNP rs# ID if one exists option checked.
Leaving the output file: option empty, click Get results.
Here is a session with all of these inputs already selected for you.
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly-accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Jairo Navarro
UCSC Genomics Institute
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/C88E2446-741C-4FD3-BA88-D370F151F901%40uq.edu.au.