Find matching rs number for a list of gene/HGVS/cDNA data

Fleur Garton

unread,

Apr 25, 2017, 10:08:30 AM4/25/17

to gen...@soe.ucsc.edu

Dear Chris and whom it may concern at UCSC,

Thanks so much for your reply on Biostars—(see below Re: finding matching rs numbers)—

I am new to data matching and so would really appreciate further instruction—I’ve attached the data I have available,

Thanking-you in advance,

Kind regards,

Fleur Garton

Question: Find matching rs number for a list of gene/HGVS/cDNA data

HI! I would like to get an rs number using (any) of the following information that I have in a data spreadsheet Gene name (e.g. EXOSC3)/ HGVS cDNA (old nomenclature) (e.g. c.92G>C)/ HGVS protein (p.G31A for protein)/ cDNA_reference (e.g. NM_016042.2) I was thinking that this would be relatively straightforward - i.e. download similar data with rs number - then merge the files - but can't find this! Would be extremely grateful if anyone can help, Thanks so much, Fleur

Anwer: Find matching rs number for a list of gene/HGVS/cDNA data

You can use the UCSC Genome Browser to get this information, but the process depends on the specific HGVS terms you have.

The general idea is to split your list into two lists, one with just your different HGVS terms, and one with just your gene names. Then you can use a combination of the Table Browser and Data Integrator to get the rsID's (if any) corresponding to your gene names, and the Variant Annotation Integrator to get the rsID's (if any) corresponding to your HGVS terms.

We can provide more specific steps if you email this question along with some example data to our mailing list gen...@soe.ucsc.edu, that way our entire team will see the question and can help with a solution.

Thanks,

ChrisL from the UCSC Genome Browser

Fleur Garton, PhD

Institute for Molecular Bioscience | University of Queensland | St Lucia, QLD, Australia, 4072

E: f.ga...@uq.edu.au P: 0423 210 114

This email (including any attached files) is intended solely for the addressee. It may contain private or confidential information. If you are not the intended addressee, you must take no action based on it, nor show a copy to anyone. Kindly notify the sender by reply email. Opinions and information in this email which do not relate to the official business of The University of Queensland shall be understood as neither given nor endorsed by the University. CRICOS Provider Number 00025B.

Gene_HGVS_GARTON.xlsx

genelist_GARTON.xlsx

Christopher Lee

unread,

May 1, 2017, 10:30:03 AM5/1/17

to Fleur Garton, gen...@soe.ucsc.edu

Hi Fleur,

Thanks for sending your message to our list and sharing some example data. There are a couple problems with the format of your HGVS terms. Firstly, it appears that your HGVS terms are gene specific rather than transcript specific, which presents a problem, as something like "c.127G>A" can indicate different coordinates depending on which transcript of the ACADM gene you are referring to. Secondly, a term like ABCC8:c.3989-9G>A(3992-9G>A), even if formatted into a transcript specific term like NM_000352:c.3989-9G>A(3992-9G>A), is not valid HGVS because of the "(3992-9G>A)" at the end.

Thus you will need to first clean up your list of HGVS terms before using the Variant Annotation Integrator (http://genome.ucsc.edu/cgi-bin/hgVai) in order to find the associated rsID's for your terms. To start, here is an example command to clean up HGVS terms like the one above (NM_000352:c.3989-9G>A(3992-9G>A)):
sed -re 's/$[^)]+$$//;'

Secondly, you will need to convert your gene names into transcript ID's so they can be used in the VAI. You can find the accession's associated with your genes with the Table Browser:
1. Navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables, or Tools->Table Browser from the top blue menu bar.
2. Choose your assembly of interest (either hg19 or hg38)
3. Make the following selections from the "group", "track", and "table" dropdowns:
- Genes and Gene Predictions
- If you are working with hg19 data, choose the RefSeq Genes track and the refGene table
- If you are working with hg38 data, choose the NCBI RefSeq track, and the RefSeq Curated (ncbiRefSeqCurated) table
4. Next to "identifiers", click "paste list", and copy and paste only the gene names (the gene column from Gene_HGVS_GARTON.xlsx) you are interested in into the box. Alternatively, if you have a file of only gene names, you can upload the list with the "upload list" button.
5. In the output format dropdown, choose "selected fields from primary and related tables"
6. Click "get output"
7. On the "Select Fields from " page, uncheck everything except for the name and name2 fields.
8. Click "get output"

This leads to output like the following:

NM_001171811.1    GBA
NM_001171812.1    GBA
NM_001177520.2    ALPL
NM_001243564.1    DHDDS
NM_001243565.1    DHDDS
NM_001243766.1    POMGNT1
NM_001278188.1    TPM3
NM_001278189.1    TPM3
NM_001278190.1    TPM3
NM_001278191.1    TPM3

You will need to sort this file and choose the appropriate transcript you are interested in for your terms. If there is any confusion over which transcript to choose for an associated gene, you can paste your HGVS term into the Genome Browser for your assembly of interest, turn accession labels on for the NCBI RefSeq/RefSeq Genes track, and choose the correct transcript that displays the correct variation. Here is a session showing the results of a search for ACADM:c.127G>A on the hg38 assembly with NM labels turned on:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg38trackuibugs

If you save the above Table Browser result to a file (called nmToGene.txt below), you can use the following commands to find the genes with multiple transcripts so you can more easily choose the appropriate one:

$ tail -n +2 nmToGene.txt | awk '{print $2 "\t" $1;}' | sort > geneToNm.txt 
$ cut -f 1 geneToNm.txt | uniq -c | awk '$1 > 1' > multiNmGenes.txt

You can then replace your list of gene names with NM transcript accessions, form a file with valid HGVS terms like "NM_000352:c.3989-9G>A", and use the VAI to get the rsID's associated with your HGVS terms:

1. Navigate to the Variant Annotation Integrator: http://genome.ucsc.edu/cgi-bin/hgVai, or Tools->Variant Annotation Integrator from the top blue menu bar.
2. Choose your assembly of interest from the "clade", "genome", and "assembly" dropdowns.
3. Select "genome" under the "region to annotate" dropdown if not already selected.
4. In the "Select Variants" section, select "HGVS terms" from the "variants" dropdown. Once the text box pops up, paste your newly formed HGVS terms into the text box.
5. In the "Select Genes" section, select either "RefSeq Genes" or "NCBI RefSeq genes, curated subset (NM_*, NR_*, and YP_*)", depending on your assembly of interest.
6. Click the "clear all" button in the "Select More Annotations (optional)" section.
7. Ensure "Include dbSNP rs# ID if one exists" is selected in the "Known variation" section.
8. Finally, make sure "Variant Effect Predictor (tab-separated text)" is selected from the output format dropdown, and enter a name for the output file in the "output file" text box, and click "Get results".
9. Agree to the disclaimer that the VAI does not constitute medical advice.

Once the file finishes downloading, you will have output like the following (I only used one HGVS term in this example):

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)
## Output produced at 2017-04-25 12:14:59
## Connected to UCSC database hg38
## Variants: HGVS terms
## Transcripts: GENCODE v24 Comprehensive Transcript Set (only Basic displayed by default) (hg38.knownGene)
## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 147) (/gbdb/hg38/vai/snp147.bed4.bb)
Uploaded Variation    Location    Allele    Gene    Feature    Feature type    Consequence    Position in cDNA    Position in CDS    Position in protein    Amino acid change    Codon change    Co-located Variation    Extra
NM_000352:c.3989-9G>A    chr11:17397055    T    ABCC8    uc057zmr.1    Transcript    upstream_gene_variant    -    -    -    -    -    rs151344623    DISTANCE=1275
NM_000352:c.3989-9G>A    chr11:17397055    T    ABCC8    uc057zms.1    Transcript    upstream_gene_variant    -    -    -    -    -    rs151344623    DISTANCE=1170
NM_000352:c.3989-9G>A    chr11:17397055    T    ABCC8    uc001mnc.4    Transcript    intron_variant    -    -    -    -    -    rs151344623    INTRON=32/38
...
...
...

The Co-located Variation field indicates the rsID's found for this HGVS term (if any). You can use standard file manipulation tools like awk or grep to extract the terms and rsID's:

$ awk 'FNR > 7 {print $1, $13}' outputFileName | uniq > termAndRsID.txt
NM_000352:c.3989-9G>A rs151344623

Now that we have the associated rsID's for your HGVS terms, we can move on to finding all the rsID's associated with your list of Gene Names and Entrez ID's via the Table Browser and Data Integrator.

1. Follow the same Table Browser steps as previously mentioned, except this time paste in your list of gene names (the gene column) from the genelist_GARTON.xlsx file as identifiers, and on the "Select fields from " page, choose chrom, txStart, txEnd, and name2.
2. You should now have output like the following:

#chrom    txStart    txEnd    name2
chr1    201283451    201332993    PKP1
chr1    201283451    201332993    PKP1
chr1    92246397    92299009    GLMN
chr1    92246397    92299009    GLMN

3. Open a new tab in your web browser, then navigate to the custom tracks page: http://genome.ucsc.edu/cgi-bin/hgCustom.
4. Make sure the correct assembly is selected, and then paste the following line: 'track name="bed4 of gene IDs" type=bed', followed by the Table Browser output into the "Paste URLs or data: " section, and click "submit".
5. From the Manage Custom Tracks page, change the "view in" dropdown to "Data Integrator" and click "go".
6. From the Data Integrator page, again make sure the correct assembly is selected, then change the "region to annotate" dropdown to "genome".
7. The custom track we just created should be selected, but just in case, make sure "Custom Tracks" and "bed4 of gene IDs (ct_bed4ofgeneids_someNumber)" are selected in the "track group" and "track" dropdowns and click "add".
8. Now select "Variation" and "Common SNPs(147) (snp147Common)" from the "track group" and "track" dropdowns and click "add".
9. Under Output Options, check the box for "Send output to file", enter a name, and then you will probably want to check the option for gzip compression, but it is optional.
10. Click the "Choose fields..." button and select your fields of interest, which is likely all the fields from our custom track and at least the name field from Common SNPs 147. When you are finished choosing fields click "Done"
11. Lastly click "Get output".

This output will be of the following form (depending on the fields you selected in step 10):

chr1    1020122    1056119    AGRN    rs115173026
chr1    1020122    1056119    AGRN    rs201073369
chr1    1020122    1056119    AGRN    rs199682825

Each line will contain the line from our custom track, and, if it exists, an rsID. There is an unfortunate caveat to this procedure, which is that we cannot use the All SNPs 147 table for annotation, since there are too many SNPs, which causes the Data Integrator to time out for a genome wide query. You can however, run the above procedure in chromosome length batches, and obtain all SNPs that overlap your gene set that way. The only change to the steps is that in Step 6, change the "region to annotate" dropdown to "position of search term", and enter chr1 (or chr2, or chr3,...).

A genome-wide query on Common SNPs 147 will return a ~3MB gzipped file, while a chromosome 1 only search on All SNPs 147 will return a ~17MB gzipped file, so you can understand how things would time out on a genome-wide query, and how many more SNPs are in the snp147 table compared to the snp147Common table.

Please let us know if you have any further questions!

Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/85F5103C-4D61-44E1-AF16-E05C05FB091E%40uq.edu.au.
For more options, visit https://groups.google.com/a/soe.ucsc.edu/d/optout.

Fleur Garton

unread,

May 5, 2017, 12:03:13 PM5/5/17

to Christopher Lee, gen...@soe.ucsc.edu

HI Chris,

Thanks so much for this,

Stuck at step 3….

Step 1. Convert your gene names into transcript ID's so they can be used in the VAI.

Using hg19, the transcript does not come with a reference number..but that is ok— I have both--

Step 2. Ok

Step 3. Use the VAI to get the rsID's associated with your HGVS terms:

I just get a blank output.. even if I use one example that you were able to do ‘NM_000352:c.3989-9G>A’

Maybe it is something obvious—once I select everything – do I wait ? or click get results? – there seems to be a ‘thinking bar’ still going when the text asks me if I want to view..

Additionally if I have the version number of the transcript—do need to use hg38?

Thanks again, really appreciate all the help,

Sorry couldn’t get it(!)

Fleur

## ENSEMBL VARIANT EFFECT PREDICTOR format (UCSC Variant Annotation Integrator)

## Output produced at 2017-05-04 23:48:49

## Connected to UCSC database hg19

## Variants: HGVS terms

## Transcripts: RefSeq Genes (hg19.refGene)

## dbSNP: Simple Nucleotide Polymorphisms (dbSNP 147) (/gbdb/hg19/vai/snp147.bed4.bb)

## Keys for Extra column items:

## REFSEQ_STATUS: <A HREF="http://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_status_codes" TARGET=_BLANK>RefSeq status</A> of the transcript

## GENCODE_TAG: <A HREF="http://www.gencodegenes.org/gencode_tags.html" TARGET=_BLANK>GENCODE tags</A> for the transcript

Uploaded Variation Location Allele Gene Feature Feature type Consequence Position in cDNA Position in CDS Position in protein Amino acid change Codon change Co-located Variation Extra

NM_000352.3	:	c.560T>A
NM_000016.4	:	c.127G>A
NM_000016.4	:	c.199T>C
NM_000016.4	:	c.250C>T

 

This email (including any attached files) is intended solely for the addressee. It may contain private or confidential information. If you are not the intended addressee, you must take no action based on it, nor show a copy to anyone. Kindly notify the sender by reply email. Opinions and information in this email which do not relate to the official business of The University of Queensland shall be understood as neither given nor endorsed by the University. CRICOS Provider Number 00025B.

Jairo Navarro Gonzalez

unread,

May 5, 2017, 6:35:24 PM5/5/17

to Fleur Garton, UCSC Genome Browser Mailing List

Hello Fleur,

Thank you for using the UCSC Genome Browser and reporting your issues.

It looks like the reason you are not seeing any output for your HGVS terms on hg19 is because our database has only the latest versions of RefSeq transcripts, and the transcripts in the four example terms you included are all for older versions of RefSeq transcripts. For example, a term uses NM_000352.3 but the latest RefSeq version is NM_000352.4. You could strip the versions (e.g. "NM_000352.3" --> "NM_000352") to use the RefSeq version we have for your given assembly. However, there is the possibility that there are differences between older and newer NM_ versions such as changes to sequence length or the addition of indels.

Please try the example HGVS term again using the following settings in VAI:

clade: Mammal
genome: Human
assembly: Feb 2009 (GRCh37/hg19)
region to annotate: genome
variants: HGVS terms

Inside the text box, paste the example HGVS term: NM_000352:c.3989-9G>A.
Under the Select More Annotations (optional) section, make sure only the Include dbSNP rs# ID if one exists option checked.
Leaving the output file: option empty, click Get results.

Here is a session with all of these inputs already selected for you.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly-accessible Google Groups forum.

If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genomics Institute

To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/C88E2446-741C-4FD3-BA88-D370F151F901%40uq.edu.au.

Reply all

Reply to author

Forward