Hello, Shanker.
Could you please send me some sample lines from your data file, including some that have the correct rs IDs and some that do not?
Please contact us again at gen...@soe.ucsc.edu if you have any further questions.
---
Steve Heitner
UCSC Genome Bioinformatics Group
--
Hello, Shanker.
It is possible to replace the kgp IDs in your file with rs IDs, but the solution is slightly complicated. The general strategy is the following:
1. Create a BED file of genomic coordinates from your map file
2. Use the UCSC Table Browser and the BED file created in step 1 to create a new BED file with the proper rs IDs
3. Use the new BED file created in step 2 to identify which rs IDs correspond to which kgp IDs
From the sample data you have shown me from your map file, it appears that the “1” preceding the SNP ID is the chromosome number. The last number of the second line is the chromosomal coordinate. So your first entry is:
1 kgp499505
5.8106 4158540
A G
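BED files use a zero-based start coordinate and a one-based end coordinate, so the start is the map position minus 1 and the end is the map position itself. The corresponding entry in your resulting BED file should be: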
chr1 4158539 4158540 kgp499505
Using this convention, the first 3 lines of your BED file should be:
chr1 4158539 4158540 kgp499505
chr1 4158954 4158955 rs10915428
chr1 4160903 4160904 rs7523426
If you are comfortable writing basic scripts, you can write one to parse your map file and output the appropriate BED file. If not, there are tools available at Galaxy (https://main.g2.bx.psu.edu/) that can help you with this. See specifically the tools under “Text Manipulation”. Please contact Galaxy support (http://wiki.galaxyproject.org/Support) with any questions related to Galaxy tools.
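If you go the scripting route, the conversion could look roughly like the Python sketch below. It is only an illustration and makes assumptions I cannot verify from your sample: that each record sits on a single whitespace-separated line in the order chromosome, SNP ID, genetic distance, position (followed by the alleles), and that the chromosome column only needs a "chr" prefix. The function names and file names are hypothetical.

```python
def map_line_to_bed(line):
    """Convert one map record to a BED line; return None for short or blank lines."""
    fields = line.split()
    if len(fields) < 4:
        return None
    # Assumed column order: chromosome, SNP ID, genetic distance, position
    chrom, snp_id, _genetic_distance, position = fields[:4]
    end = int(position)
    # BED start is zero-based, so a SNP at position p spans p-1 to p
    return f"chr{chrom}\t{end - 1}\t{end}\t{snp_id}"


def map_to_bed(map_path, bed_path):
    """Write a BED file with one line per usable record in the map file."""
    with open(map_path) as src, open(bed_path, "w") as dst:
        for line in src:
            bed_line = map_line_to_bed(line)
            if bed_line is not None:
                dst.write(bed_line + "\n")
```

Calling map_to_bed("mydata.map", "mydata.bed") would then write the BED file; both file names are placeholders. If your file really does spread one record over several lines, the lines would need to be rejoined first.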
Once you have obtained this BED file, you will use our Table Browser to obtain the list of SNP IDs corresponding with the regions defined in the BED file. If you are unfamiliar with the Table Browser, please see the User’s Guide at http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html.
Perform the following steps:
1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Variation and Repeats
Track: All SNPs(137)
Table: snp137
Output format: selected fields from primary and related tables
3. On the “region” line, click the “define regions” button. The only caveat with defining regions is that you are limited to 1,000 regions at a time.
4. Click the “Browse” button to select the BED file you created earlier
5. Click the “submit” button
6. You can enter a filename on the “output file” line or leave it blank to see the results on the screen
7. Click the “get output” button
8. In the “Select Fields from hg19.snp137” section, check the chrom, chromStart, chromEnd and name checkboxes
9. Click the “get output” button
At this point, if you only care about having the correct rs IDs and you don’t care about which rs ID goes with which kgp ID, you are done. If you want to associate rs IDs with kgp IDs, again, you can either write a custom script to compare the two files or you can use Galaxy tools. See specifically the “Operate on Genomic Intervals/Join” tool.
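If you prefer a custom script for this comparison, here is a minimal Python sketch. It assumes that both files carry chrom, chromStart, chromEnd and name as their first four whitespace-separated columns, and that any header line in the Table Browser output starts with "#". It pairs entries only when the coordinates match exactly, so variants in snp137 that merely overlap a position (indels, for example) are not reported. The function names are hypothetical.

```python
from collections import defaultdict


def read_intervals(lines):
    """Yield ((chrom, start, end), name) for each data line, skipping '#' headers."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, start, end, name = line.split()[:4]
        yield (chrom, int(start), int(end)), name


def pair_ids(original_bed_lines, snp_lines):
    """Return (original ID, [rs IDs at the same coordinates]) for each original entry."""
    rs_by_interval = defaultdict(list)
    for key, name in read_intervals(snp_lines):
        rs_by_interval[key].append(name)
    pairs = []
    for key, name in read_intervals(original_bed_lines):
        # An empty list means no snp137 entry had exactly these coordinates
        pairs.append((name, rs_by_interval.get(key, [])))
    return pairs
```

Each returned pair holds the ID from your original BED file and a list of the rs IDs found at the same coordinates. The list can be empty, or can hold more than one ID if several snp137 entries share a position.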
Because of the 1,000-region limit, if you have a very large number of regions, you may want to include only the kgp entries in your initial BED file, since the entries that already have rs IDs do not need to be looked up. If that is still prohibitive, we may need to consider alternate solutions.
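To work within the 1,000-region limit, one option is to keep only the kgp entries and split them into batches that can be uploaded one at a time. The Python sketch below illustrates the idea. It assumes the ID is the fourth column of the BED file and that the IDs needing replacement start with "kgp". The function names and the output file prefix are hypothetical.

```python
def kgp_only(bed_lines):
    """Keep only BED lines whose fourth column starts with 'kgp'."""
    kept = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4 and fields[3].startswith("kgp"):
            kept.append(line)
    return kept


def chunk(items, size=1000):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_batches(bed_path, prefix, size=1000):
    """Write the kgp-only lines of bed_path to numbered files; return their paths."""
    with open(bed_path) as src:
        lines = kgp_only(src.read().splitlines())
    paths = []
    for number, batch in enumerate(chunk(lines, size), start=1):
        path = f"{prefix}{number:03d}.bed"
        with open(path, "w") as dst:
            dst.write("\n".join(batch) + "\n")
        paths.append(path)
    return paths
```

Each resulting file can then be uploaded through “define regions” separately, and the outputs concatenated afterwards.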
Please contact us again at gen...@soe.ucsc.edu if you have any further questions.
---
Steve Heitner
UCSC Genome Bioinformatics Group