HG38 Update

Stephen Tsou

unread,

Jan 24, 2023, 2:43:52 AM1/24/23

to Jairo Navarro Gonzalez, cc: Gerardo Perez, Luis Nassar, UCSC Genome Browser Discussion List

Hello all,

So thank all you for the very kind correspondence and help last year. Our biostatistics paper on Alzheimer's was accepted and we are now submitted into ISMB on the strength of our machine learning work with FMRI images. We hope that it will help the biostatistics and epidemiology community.

So now, the next project turns to actually getting the RSIDs updated in the correct format for our next submission. Even after some continued correspondence with NIH and UCSC Genome Browser, I have been unable to update successfully the plink file rsids in HG18 assembly to HG38 based on chromosome positions. We corresponded some last year and unfortunately even after your courteous correspondence along with NIH, my PI relayed the message that the results in the updated plink files did not change any of the results. From that, I've decided to start over.

Would anybody be able to walk me through how to update the rsids to hg38? I have culled these lines from our correspondence last semester which certainly seem to be correct in getting a list of unique ids.

Start quote:

The epsilon 4 allele of APOE is the strongest known genetic risk factor for AD with a two- to three-fold increased risk for AD in people with one epsilon 4 allele rising to about 12-fold in those with two alleles. APOE genotyping was performed at the time of participant enrollment and included in the ADNI database. The two SNPs (rs429358, rs7412) that define the epsilon 2, 3, and 4 alleles, are not on the Human610-Quad BeadChip, and therefore were genotyped using DNA extracted by Cogenics from a 3 mL aliquot of EDTA blood.

If you are looking at the BeadChip data, then you won't find the two SNPs (rs429358, rs7412) because they are not on the BeadChip.

The problem with the error you shared is that rs28615667 is on both chrX and chrY and plink --update-name wants everything in the file to be unique. Our engineer shares some bash commands to get only single-nucleotide variants (not insertions/deletions) and to remove duplicates:

curl -O https://hgdownload.gi.ucsc.edu/goldenPath/hg38/database/snp151Common.txt.gz
gunzip -c snp151Common.txt.gz | cut -f 2,4,5,12,17 | grep single.exact | cut -f 1-3 > onlySNPs.tsv
sort -k3 -u onlySNPs.tsv | sort -k1,2 -u > onlySNPs.uniqLocAndId.tsv

You can check for the locations of rs429358 and rs7412 in that file:

grep -w rs429358 onlySNPs.uniqLocAndId.tsv
chr19   44908684        rs429358
grep -w rs7412 onlySNPs.uniqLocAndId.tsv
chr19   44908822        rs7412

:End quote.

Would anybody here be able to step me through this? I have an HG18 .bim, .bed and .map file. I believe (although I may be mistaken) that there are only chromosome end positions and no start positions. I was informed to assume that all chromosome start positions should then be 1. I was also told that an initial update should be performed, followed by a liftover for rsids who couldn't be updated initially. Please correct me if any of those instructions are incorrect and if they are incorrect, please let me know the correct instructions.

I would deeply appreciate it.

best,

Stephen

Gerardo Perez

unread,

Jan 26, 2023, 8:36:03 PM1/26/23

to Stephen Tsou, UCSC Genome Browser Discussion List

Hello, Stephen.

Thank you for your interest in the UCSC Genome Browser.

Regarding the updated plink files results not changing, what changes were you expecting? The rs# IDs are meant to be stable across human genome assembly versions, although in practice there can be exceptions, especially in indel variants. For extremely well-studied common SNPs, it is not surprising for the rsID in hg18 to match the rsID in hg38. Could you provide an example of an expected change in rs# so we could look into it more closely?

I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.

Gerardo Perez
UCSC Genomics Institute

Stephen Tsou

unread,

Jan 27, 2023, 12:15:01 PM1/27/23

to Gerardo Perez, UCSC Genome Browser Discussion List

Thanks Gerardo,

I was under the impression that the chromosome start and end positions corresponded with different rsids in different assemblies (HG18 and HG38 for instance here). I was talking with a student in statistical genetics who believed this to be true as well. Unfortunately, as I am obviously not trained in genetic preprocessing I am unfortunately at the mercy of the help of others and what they say. Is this not the case? Any help would be appreciated. A student mentioned QC, multiple imputation and updating all being necessary when all I was tasked to do was updating rsids.

I am also happy to work on this on my own if you know of a good resource I can look at so I can teach myself the standard pipeline (hopefully there is one?). I have preprocessed microelectrode data before so the idea of proper preprocessing is not foreign to me. Preprocessing SNP data, what I have been tasked with, unfortunately is new to me though.

best,

Stephen

Galt Barber

unread,

Jan 30, 2023, 2:33:30 AM1/30/23

to Stephen Tsou, Gerardo Perez, UCSC Genome Browser Discussion List

Hi, Stephen!

RSIds are stable across different major human genome releases, and minor ones (patch releases),

so the RSIds should exist in all versions. There is a database table or bigBed file for both hg19 and for hg38 that will have

the position keyed by SNP RSIds. These RSIds have already been aligned to the reference genome

and have a well known position included in those variant records. Rather than trying to lift them or re-align them, the position is given.

Every once in a while you may see that dbSNP, the organization that provides these, have merged or renamed

some SNPs while cleaning up their system and avoiding duplicates which were accidentally created for

historical reasons, although it is not common. The system they have now is a nice improvement.

There may well be some hg18 SNP IDs which were found in hg18 but you cannot find them in hg38.

You might have to try to align these based on liftover or re-alignment, but it should be a minority.

Galt Barber

UCSC Genome Browser Staff.

Ar Aoine 27 Ean 2023 ag 09:15, scríobh Stephen Tsou <stephe...@gmail.com>:

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAF%3DP_nJKywNweaMOYGy477ymrb82UWKC7P0MerqdtNpOVTX3cw%40mail.gmail.com.

Jairo Navarro Gonzalez

unread,

Feb 9, 2023, 8:11:50 PM2/9/23

to Galt Barber, Stephen Tsou, Gerardo Perez, UCSC Genome Browser Discussion List

Hello,

Thank you for sending your follow-up inquiry.

The genomic sequence changes from one assembly version to the next, so we expect the same genomic position, like "chr1:1234567", to represent a different sequence in hg18 vs. hg38. However, rs# IDs are supposed to be stable across genome assemblies despite the changing genomic positions. Regardless of the genome assembly version, "rs429358" refers to the same polymorphic variant in the 4th exon of the ApoE gene with T as the major allele and C as the minor allele. Its genomic position in hg18 is chr19:50103781, and its genomic position in hg38 is chr19:44908684.

If you start with the hg18 position chr19:50103781 and successfully map it to the corresponding position in hg38, then the result should be chr19:44908684. You should find the same rs# ID (rs429358) in both assemblies at the corresponding positions. For more context, dbSNP started assigning rs#s in the 1990s, before the human genome was assembled. At that point, SNPs were defined by flanking sequences with the varying base(s) in the middle -- local sequence context, with no genomic location because there was no genome.

The mapping from one genome assembly to the next is automatically generated from sequence alignments and is not complete nor perfect, just a best attempt. The stability of rs# IDs is why we recommend mapping across assemblies using rs# IDs first and then falling back to using cross-assembly mappings to look up the positions of items that didn't map using that method. When an rs# ID in hg18 is not found in the hg38 dbSNP data, that rs# ID has been withdrawn or replaced by a different rs# ID. In that case, mapping the hg18 position of the disappeared rs# to hg38 and looking for a new rs# at the corresponding position may help to find the new rs# (if there is one).

To help summarize the previous steps in this email thread:

1. Translate the plink file that has hg18 genomic coordinates and rs IDs to UCSC BED4 (https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
2. Make a file of all rs IDs
3. Use the Table Browser to map that file of rs IDs to hg38 coordinates and rs IDs (BED4 output fields)
4. Make a file containing rsIDs that were not mapped by the Table Browser
5. Use the BED file from step 4 of rs IDs to extract (from the BED4 file created in the first step) a UCSC BED4 file of hg18 positions of rs IDs that were not mapped by the Table Browser
6. Run LiftOver on that file to get hg38 positions (BED4) of rs IDs that were not mapped by the Table Browser
7. Use the Data Integrator to map those hg38 positions to new rs IDs where possible (BED4 but discard the original name/rs ID and use the name/rs ID from hg38 dbSNP)
8. Combine the earlier Table Browser rs ID-mapped hg38 BED4 with the liftOver/Data Integrator-mapped BED4
9. Beware duplicates that will cause downstream problems with plink -- you need to decide whether to remove duplicates as unreliable or resolve duplicates.
10. After resolving/removing duplicates and carefully checking, translate the final BED4 back to the plink format with hg38 genomic coordinates

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro 
UCSC Genome Browser

To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAJQGhA1OUnvzs3w_wxh25wEvyYYX1QDPrZzegfXJ-fEFASb%3D4A%40mail.gmail.com.

Reply all

Reply to author

Forward