genomic position lifted from hg38 to hg19

Kae Koganebuchi

Apr 9, 2018, 12:21:14 PM
to gen...@soe.ucsc.edu
Hi,

I am trying to lift over a VCF file from hg38 to hg19 using the chain file hg38ToHg19.over.chain.gz and a Picard tool. I have a question about a discordance between the lifted positions and the dbSNP positions.

For example, rs758162168 is lifted from chr2:91432412 (hg38) to chr2:90403482 (hg19). However, dbSNP places rs758162168 at chr2:91624970 in hg19, and chr2:90403482 (hg19) carries a different rs ID (rs4005751). UCSC liftOver outputs the same position as the VCF lifted with Picard.

I want to get a file that has the correct genomic positions. Could you let me know how to solve this problem? I am quite new to all of this, and I would appreciate any advice.

Sincerely,

Kae Koganebuchi

Hiram Clawson

Apr 9, 2018, 5:25:31 PM
to Kae Koganebuchi, gen...@soe.ucsc.edu
Good Afternoon Kae:

When a liftOver position is uncertain, there are several considerations
to take into account to evaluate the uncertainty.

In this case, on your hg38 genome browser, there is a track named
"Hg19 Diff" in the first section of browser tracks under
"Mapping and Sequencing". If you turn this track on, at your
position of interest, you will discover that the contig used to
construct the hg38 assembly for this area is a new contig to the
assembly and was not used in the hg19 assembly. This does not
necessarily mean the actual sequence cannot be found in hg19,
but it does raise concern about the mappability from this region
to hg19.

To determine if your region of interest is actually in the hg19
assembly, move to a position in the browser that includes approximately
100 or 200 bases around the single SNP of interest. For example,
position: chr2:91,432,313-91,432,512
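
As a minimal sketch, the window above can be computed from the SNP
position. The function names and the 99/100 flank split are
illustrative assumptions, chosen only because they reproduce the
200-base example position; coordinates are 1-based and inclusive,
as displayed in the browser.

```python
def flank_window(pos, upstream=99, downstream=100):
    """Return a 1-based inclusive (start, end) window around pos.

    The 99/100 split is an assumption picked to match the example
    position in this message; any similar flank size would do.
    """
    return pos - upstream, pos + downstream


def browser_position(chrom, start, end):
    """Format coordinates the way the browser position box shows them."""
    return f"{chrom}:{start:,}-{end:,}"


snp_hg38 = 91432412  # rs758162168 position on hg38, from the question
start, end = flank_window(snp_hg38)
print(browser_position("chr2", start, end))
print("window length:", end - start + 1)
```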

Use the pull-down menu 'View' on the blue navigation bar at the top
to select 'DNA', then press the 'get DNA' button to see:

>hg38_dna range=chr2:91432313-91432512 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AAAATAGTCTTGAAAAAAGAAAACATATTAGGATAATTCACACCCCCGTG
CTCCAAACCTTACGGCAAGGCATCAGTAATCAAGACAACACAATACTGAT
GAAAGAAAAATATATAGATTGATGGAAGAGAATTGAGAGTCCATATATAA
AACTATGTGTCTATAGTCAATGGATTCTTACAGTGGTGCCATGTGCAATT
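
Before pasting the sequence into Blat, it can be worth checking that
the record is intact. The sketch below is only an illustration: it
parses the header shown above, confirms the sequence length matches
the stated range, and reads the base at the SNP position. The helper
name parse_record is hypothetical, and the header layout is assumed
to be exactly as printed here.

```python
import re

# The 'get DNA' output quoted above, copied verbatim.
FASTA = """>hg38_dna range=chr2:91432313-91432512 5'pad=0 3'pad=0 strand=+ repeatMasking=none
AAAATAGTCTTGAAAAAAGAAAACATATTAGGATAATTCACACCCCCGTG
CTCCAAACCTTACGGCAAGGCATCAGTAATCAAGACAACACAATACTGAT
GAAAGAAAAATATATAGATTGATGGAAGAGAATTGAGAGTCCATATATAA
AACTATGTGTCTATAGTCAATGGATTCTTACAGTGGTGCCATGTGCAATT
"""


def parse_record(text):
    """Return (chrom, start, end, sequence) from a single FASTA record."""
    header, *seq_lines = text.strip().splitlines()
    match = re.search(r"range=(\w+):(\d+)-(\d+)", header)
    if match is None:
        raise ValueError("no range=chrom:start-end field in header")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    return chrom, start, end, "".join(seq_lines).upper()


chrom, start, end, seq = parse_record(FASTA)
if len(seq) != end - start + 1:
    raise ValueError("sequence length does not match the header range")

snp_hg38 = 91432412  # 1-based position from the question
base = seq[snp_hg38 - start]  # 0-based index into the window
print(f"{chrom}:{snp_hg38} lies {snp_hg38 - start} bases into the window; base = {base}")
```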

If you copy and paste this sequence into the Tools->Blat service,
select either the hg38 or hg19 assembly, and press 'submit',
blat will show you where these 200 bases can be found in the
selected assembly. For example, in hg38, note the two locations
found: chr2:91432313-91432512 (this identical sequence) and
chr2:90370455-90370654, an almost identical match nearby, indicating
that this 200 base sequence lies in a duplicated region. There are
many other near perfect match regions, indicating this is a repeated
sequence.

On hg19, the result is similar: many near perfect match regions,
the two highest scoring being chr2:91624871-91625070 and
chr2:90403383-90403582.
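
To relate these two hits back to the SNP, the offset of the SNP
within the 200-base query can be carried over to each hit. This is
a rough sketch, not a replacement for inspecting the alignments: it
assumes each hit is an ungapped, plus-strand match of the full
query, which the blat details page would need to confirm. The
function name project is hypothetical.

```python
# Coordinates quoted in this thread (1-based, inclusive).
SNP_HG38 = 91432412
QUERY_HG38 = ("chr2", 91432313, 91432512)
HG19_HITS = [
    ("chr2", 91624871, 91625070),
    ("chr2", 90403383, 90403582),
]


def project(pos, query, hit):
    """Carry pos from the query window onto a hit by offset arithmetic.

    ASSUMPTION: the hit is an ungapped, plus-strand alignment covering
    the whole query. With indels or a minus-strand hit this is wrong.
    """
    _, q_start, q_end = query
    chrom, h_start, h_end = hit
    if not q_start <= pos <= q_end:
        raise ValueError("position is outside the query window")
    if h_end - h_start != q_end - q_start:
        raise ValueError("hit and query lengths differ; offsets unreliable")
    return chrom, h_start + (pos - q_start)


candidates = [project(SNP_HG38, QUERY_HG38, hit) for hit in HG19_HITS]
for chrom, pos in candidates:
    print(f"{chrom}:{pos}")
```

Under those assumptions, the two projected positions are
chr2:91624970, the dbSNP placement mentioned in the question, and
chr2:90403482, the position reported by liftOver and Picard. Each
tool appears to have picked a different copy of the duplicated
sequence.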

Therefore, SNPs of interest in this region cannot necessarily
be mapped 'correctly' to the other genome due to the repetitive
nature of this region of sequence. You would need to find other
supporting evidence of how this SNP relates to other genome
annotations to determine exactly where it may be found in
other assemblies.

I hope this is helpful. If you have any further questions, please reply to
gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly-accessible
Google Groups forum.
If your question includes sensitive data, you may send it instead to
genom...@soe.ucsc.edu.

Hiram Clawson
U.C. Santa Cruz Genomics Institute

Want to share the Browser with colleagues?
Host a workshop: http://bit.ly/ucscTraining