from hg18 to GRCh38

1,010 views
Skip to first unread message

Alexandra Vatsiou

unread,
Feb 26, 2014, 11:34:00 AM2/26/14
to gen...@soe.ucsc.edu
Hello,

I would like to ask one question about conversion from hg18 to GRCh38. Is it possible to convert SNP coordinates from hg18 to GRCh38 with the chain file http://hgdownload-test.sdsc.edu/goldenPath/hg18/liftOver/hg18ToHg38.over.chain.gz. And if yes, what exactly is the format of the chain file?

Regards,
Alexandra


Matthew Speir

unread,
Feb 26, 2014, 5:36:26 PM2/26/14
to Alexandra Vatsiou, gen...@soe.ucsc.edu
Hi Alexandra,

Thank you for your question about mapping SNP coordinates from hg18 to hg38. You may find the LiftOver utility useful for this task. This utility is available for UNIX-based OSes, and can be downloaded from the following location: http://hgdownload.soe.ucsc.edu/admin/exe/. Once you have downloaded the utility, you can type "./liftOver" with no arguments at the command line to see the usage statement. This usage statement also contains information on acceptable input formats for liftOver. More information on the chain format can be found on the following help page: http://genome.ucsc.edu/goldenPath/help/chain.html.

One of our engineers had the following suggestions to consider when carrying out these Lift Overs:

"Mapping polymorphisms correctly is challenging even when directly mapping sequence to a reference assembly. No guarantees, but something like this might help to increase the confidence in mappings: instead of mapping single-nucleotide positions, widen out the hg18 positions (for example by 50 or 100 bases upstream and downstream of the SNP) and require that the whole surrounding region maps to a single location in hg38. As with any bioinformatics process, comparing the results of different methods might provide some insight.

If your SNP coordinates are from dbSNP, here are a couple additional suggestions:
1) Use hg19 SNP 138 (hg18 SNP 130 is very old now).
2) Filter out any dbSNP items that map to multiple locations in hg18/hg19.
3) Take a look at the red flags described in the snp138ExceptionsDesc table and consider excluding SNPs that have some of those conditions as well."

If you choose to use the hg18 SNP coordinates, then the file referenced in your email, http://hgdownload-test.sdsc.edu/goldenPath/hg18/liftOver/hg18ToHg38.over.chain.gz, will work just fine. If you decide to use the hg19 snp138 coordinates, you will need an hg19 to hg38 liftOver file which can be found at http://hgdownload-test.sdsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz. Please note that these files are on our test server and have not undergone our standard QA process yet.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--
 

Alexandra Vatsiou

unread,
Feb 27, 2014, 10:52:10 AM2/27/14
to gen...@soe.ucsc.edu


---------- Forwarded message ----------
From: Alexandra Vatsiou <alex.v...@gmail.com>
Date: Thu, Feb 27, 2014 at 4:37 PM
Subject: Re: [genome] from hg18 to GRCh38
To: Matthew Speir <msp...@soe.ucsc.edu>


Hi Matthew,

Thank you for your very informative email. 
I have though two more question, if it is possible to answer me. 
You mentioned that the chain file is sufficient to convert from one assembly to another the SNP coordinates. 

1. I suppose the reference sequence is the hg18 and the query is the hg38 in the hg19ToHg38.over.chain file. Right?
  
2. However, I can't see how exactly I will do the conversion of SNP coordinates between the two assemblies. The chain file contains 
a) the size of the ungapped alignment and
b) the difference between the end of this block and the beginning of the next block in the reference and the query sequence. 

What I have is something like the following:
SNP              chr      genetic dist     physical dist
rs12565286 1 0.000881697 711153 C G
rs11804171 1 0.0058782819 713682 A T
rs2977670 1 0.0060252705 713754 C G
rs2977656 1 0.018904623 719811 C T
rs12138618 1 0.0633989027 740098 A G
rs3094315 1 0.0665172688 742429 A G
rs3131968 1 0.0679866253 744055 A G

where the fourth column is the SNP coordinates of each SNP.
For example according to the chain file what will be the coordinate of the first SNP?

Look forward to your reply.

Regards,
Alexandra

Pauline Fujita

unread,
Mar 6, 2014, 6:05:13 PM3/6/14
to Alexandra Vatsiou, UCSC Genome Browser discussion list
Hello Alexandra,

What operating system are you using? We recommend the liftOver utility
suggested by my colleague Matt. If you need help converting your SNP
data to BED format for liftOver, here is a suggestion from one of our
developers:

awk '$1 ~ /^rs/ {print "chr" $2, $4-1, $4, $1;}' myFile.txt > myFile.hg18.bed

Then myFile.hg18.bed will look like this:
chr1 711152 711153 rs12565286
chr1 713681 713682 rs11804171
chr1 713753 713754 rs2977670

If you want to keep the alleles, they can be concatenated to the name
field like this:

awk '$1 ~ /^rs/ {print "chr" $2, $4-1, $4, $1 "_" $5 "_" $6;}'
myFile.txt > myFile.hg18.bed

Then myFile.hg18.bed will look like this:
chr1 711152 711153 rs12565286_C_G
chr1 713681 713682 rs11804171_A_T
chr1 713753 713754 rs2977670_C_G

Then you can run liftOver like this:

./liftOver myFile.hg18.bed hg18ToHg38.over.chain.gz myFile.hg19.bed
myFile.hg18.cantMap

Or... if you don't have too many rsIDs, then you can upload
myFile.hg19.bed to http://genome-preview.ucsc.edu/cgi-bin/hgLiftOver

Finally, if you would like to try the suggestion of adding bases
upstream and downstream to require a larger region to map cleanly, you
can try this sequence:

awk '$1 ~ /^rs/ {print "chr" $2, $4-1-50, $4+50, $1 "_" $5 "_" $6;}'
myFile.txt > myFile.flank50.hg18.bed

./liftOver myFile.flank50.hg18.bed hg18ToHg38.over.chain.gz
myFile.flank50.hg19.bed myFile.flank50.hg18.cantMap

This command turns the mapped 101-bp regions back into 1-bp regions,
and discards mappings to
regions smaller than 101 bp:

awk '{s = $2+50; e = $3-50; if (e > s) {print $1, s, e, $4;} }'
myFile.flank50.hg19.bed > myFile.mapped50.hg19.bed

Hopefully this gets you the data you are looking for. If you have any
further questions, please reply to gen...@soe.ucsc.edu. All messages
sent to that address are archived on a publicly-accessible Google
Groups forum. If your question includes sensitive data, you may send
it instead to genom...@soe.ucsc.edu.

Best regards,

Pauline Fujita
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu
> --
>
Reply all
Reply to author
Forward
0 new messages