awk '{print $1"\t"$4-1"\t"$4"\t"$2"\t.\t."}' plink.map > plink_hg18.bed
liftOver plink_hg18.bed hg18ToHg38.over.chain.gz plink_hg38.bed plink_nomap
or ./liftOver plink_hg18.bed hg18ToHg38.over.chain.gz plink_hg38.bed plink_nomap
.... I am getting the following error messages:
-bash: liftOver: command not found
OR
-bash: ./liftOver: Permission denied
Would you know what I may be doing wrong? I ran this successfully previously (though with empty lifted over files previously).
Thank you.
best,
Stephen
awk '{print $1"\t"$4"\t0\t"$3}' plink_hg38.bed > plink_hg38.map
liftOver plink_hg18.bed hg18ToHg38.over.chain.gz plink_hg38.bed plink_nomap
#Deleted in new
1 696230 696231 rs12029736 . .
#Deleted in new
1 742428 742429 rs3094315 . .
#Deleted in new
1 743267 743268 rs3115860 . .
#Deleted in new
1 744196 744197 rs3131967 . .
#Deleted in new
1 751009 751010 rs3115850
"head plink_hg18.bed" gives me this:
1 696230 696231 rs12029736 . .
1 742428 742429 rs3094315 . .
1 743267 743268 rs3115860 . .
1 744196 744197 rs3131967 . .
1 751009 751010 rs3115850 . .
1 758310 758311 rs12562034 . .
1 766408 766409 rs12124819 . .
1 794402 794403 rs11240778 . .
1 820043 820044 rs28444699 . .
1 836670 836671 rs4475691 .
The endgame after all of this is to replace the rsids from hg18 files with updated rsids in hg38 based on chromosome position.
Hello, Stephen.
Unfortunately, we do not support the plink software, so we cannot give you advice on how to use the tool. You may want to contact the plink support team for any issues with running the software (https://zzz.bwh.harvard.edu/plink/contact.shtml#probs). You could also post your question to other bioinformatic forums, such as BioStars (https://www.biostars.org/), for advice from other bioinformaticians and scientists in your field.
We do not recommend using LiftOver to convert the SNP, and you can read more about this in our FAQ:
https://genome.ucsc.edu/FAQ/FAQreleases.html#snpConversion
I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.
Gerardo Perez
UCSC Genomics Institute
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAF%3DP_n%2BCdqCGki4-rc%2BCPjO2F0X8Qbm3yX8Jg1t_r%2B5kNtxq5A%40mail.gmail.com.
Hi, Stephen.
In order to convert from hg18 to hg38 using the identifiers, you will want to create a stripped list of all of the rsIDs. You can use awk for this to extract the single field in the file.
As we describe in the FAQ post (https://genome.ucsc.edu/FAQ/FAQreleases.html#snpConversion) you will then go to the Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) and make the following selections:
Make sure the region is set to genome, and select paste list on the identifiers option. On this page you can paste all of your rsIDs (or alternatively you can choose to upload a file of just these rsIDs).
From there you can get output or make a few additional modifications such as which specific fields you would like to extract (by default you get all the fields in dbSNP track).
As an example, if I follow these steps with the rsID rs12029736 I see the following result:
If you have some non-matching rsIDs you can also change the table from Common dbSNP155 to All dbSNP155, although we find most variants of interest are present in the common set.
I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.
Lou Nassar
UCSC Genomics Institute
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAF%3DP_nKZVQTxH5N1iHonXLc4gaf2YkDXHXyK8PNN8sg%2BUFZ5EQ%40mail.gmail.com.
Hello, Stephen.
We will address your questions below:
I have tried inputting an hg18 .txt file with names including ones that begin with substrings 'rs' and 'cnvi' (620901 long).
The rs# means a dbSNP ID, which usually but not always will be found in the dbSNP track. A few rs# IDs have been removed or merged into other rs# IDs. 'cnvi' is not from dbSNP and definitely will not be found in the dbSNP track.
The returned .tsv file from UCSC is longer and the same with either input file (626956 long).
The rs# IDs in the pseudoautosomal regions (PARs, https://en.wikipedia.org/wiki/Pseudoautosomal_region) may be mapped to both chrX and chrY, and some rs# IDs might be mapped to both a main chromosome (e.g. chr10) and an alternate haplotype sequence (e.g. chr10_GL383545v1_alt), which would explain why there would be more lines of output than number of distinct rs# IDs.
I am also confused how the Table Browser automatically knows that the assembly of the input .txt file is hg18. The settings only asks for the target assembly.
If the input .txt file only has IDs (like rs#), then it is not tied to any particular assembly. Only genome coordinates are tied to a particular assembly.
Here is a quick way to find some example rs# IDs that are duplicated in the output (replace the imaginary file names with your real file names):
That should show the first 10 alphabetically sorted rs# IDs that appear more than once in the name column of BED. Then you can pick one of them (replace "rs???" with an actual ID) to investigate:
You will probably see one line of output for a regular chromosome and one or more lines of output for an alternate assembly sequence. If you want to discard all of the alternate assembly sequences, grep can help:
Then you can look for duplicates in the new file. There should be fewer duplicates, but there may still be some in chrX and chrY due to PARs.
After this update, how would I go about updating dropped ids?
Here is a way to make a list of the IDs that were not found by the Table Browser query on dbSNP 155 -- those are the ones that will require liftOver. Assuming that hg18.txt contains one ID on each line:
Then you can use grep with your file that includes hg18 coordinates (chrom, chromStart, chromEnd) to get only the lines that should be fed to liftOver:
I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.
Gerardo Perez
UCSC Genomics Institute
Hello, Stephen.
One of our engineers has provided the following information:
Is the tablebrowseroutput.bed and hg18.bed that you mention just in the form of the chr, chromeStart and chromeEnd columns of tableBrowserOutput.tsv?
Yes: in particular the first 4 columns of BED format (https://genome.ucsc.edu/FAQ/FAQformat.html#format1): chrom, chromStart, chromEnd and name. "cut -f 4 TableBrowserOutput.bed" outputs only the 4th column, which is expected to contain rs# IDs found in dbSnp155 in this case.
hg18.bed is assumed to be 4-column BED (chrom, chromStart, chromEnd, name) containing hg18 genomic coordinates and IDs from your original data that you want to convert to hg38.
One issue that has come up in the plink files is that there is no chromeStart field in the .bim file (I believe one is to assume a chromeStart at 1 in this instance?). Is this okay or does some change have to be made?
liftOver cannot read .bim files. In order to run liftOver, the .bim must first be converted to BED. If you can send the first few lines of your .bim file then we can suggest a command pipe to convert it to BED.
Also for "comm -23 hg18.sorted.txt TableBrowserOutput.IDs.sorted.txt > IDs.notFound.txt" shouldn't rsids in HG18 and rsids in the HG38 TableBrowserOutput.bed be associated with different positions, thus making unique rsids across the two files not necessarily "not found"?
Yes, the same rsID will most likely have different genomic positions in hg18 vs. hg38. However, hg18.txt is assumed to contain one ID per line, not genomic coordinates. And due to the "cut -f 4" command, TableBrowserOutput.IDs.sorted.txt should also have one rs# ID per line with no genomic coordinates. Only the rs# IDs are being compared. "comm -23" means "tell me the values that are in the first file but not in the second file". Those are the IDs that were not found by the Table Browser, requiring the fallback approach, i.e. liftOver.
Also downstream, should I worry about the alternate haplotype rsids? I suppose it is okay to drop them or should I just worry about them later? Thank you so much.
That depends on what you will do with the hg38 genomic coordinates. For many purposes, it will be fine to drop the mappings to alternate haplotypes.
I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.
Gerardo Perez
UCSC Genomics Institute
Hello Stephen,
From that head() command (which we see is from the R session, so it is returning the data as an array) we see that your chromosome numbers do not have 'chr' appended before them, as the Genome Browser requires. It also requires a few column rearrangements to conform to BED.
With that in mind, you should be able to run the following command directly on the command line on your bim file to output the proper format:
That file can then be used with the previous grep command to pull out only the IDs that were not found by dbSNP:
Finally, you can use the resulting hg19.forLiftOver.bed file on the liftOver page or utility. Let us know if this works. The final step will be a command to merge the lifted file with the original rsIDs found in dbSNP.
I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.
Lou Nassar
UCSC Genomics Institute
Hello,
Thank you for using the UCSC Genome Browser and sending your follow-up question.
Yes, the process may take a long time, depending on the hardware you use to run the command. You should be able to see some progress if, in another terminal window, you go to the same directory and look at the number of lines in hg18.forLiftOver.bed like this:
If you rerun the command in a few minutes (or hours), you should see the number of lines increase. When done, the number of lines should be similar to the number of lines in IDs.notFound.txt.
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Jairo Navarro
UCSC Genome Browser
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAF%3DP_nJMK0TAMs%3DP4kLa%3Dn8Jugng_sdzbpvXrieUvPttmPCzFw%40mail.gmail.com.
Hi Stephen,
A solitary line is a bit strange. We would expect either 0 (while the computer is reading IDs.notFound.txt into memory) and then a steady increase in output lines after. If the shortcoming is RAM then a GPU would not help.
What are the outputs of the following commands?
Lou Nassar
UCSC Genomics Institute
wc -l hg18.forLiftOver2.bed
two days ago vs. today.
47414 hg18.forLiftOver2.bed
The second reading today read:
75556 hg18.forLiftOver2.bed
Below are the command lines you suggested. Please let me know what I might be doing wrong or if there are other methods. Thank you again.
For....
I get:
598821 IDs.notFound.txt
620763 todayhg182.bed
76022 hg18.forLiftOver2.bed
1295606 total
For....
I get:
==> IDs.notFound.txt <==
rs1000000
rs10000010
==> todayhg182.bed <==
chr1 10003 10004 rs12354060
chr1 46843 46844 rs2691310
==> hg18.forLiftOver2.bed <==
chr1 10003 10004 rs12354060
chr1 97214 97215 rs4124251
For.....
I get:
Processes: 417 total, 4 running, 413 sleeping, 3177 threads
2022/10/08 15:44:02
Load Avg: 4.50, 4.54, 6.62
CPU usage: 16.13% user, 33.45% sys, 50.41% idle
SharedLibs: 286M resident, 73M data, 25M linkedit.
MemRegions: 1884086 total, 4854M resident, 43M private, 738M shared.
PhysMem: 16G used (3163M wired), 76M unused.
VM: 113T vsize, 2318M framework vsize, 585194956(0) swapins, 588440259(0) swapouts.
Swap: 9203M + 2061M free.
Purgeable: 5500K 82525(0) pages purged.
Networks: packets: 9331650/8671M in, 3377893/1028M out.
Disks: 22680311/2385G read, 23990021/2337G written.
PID COMMAND %CPU TIME #TH #WQ #PORTS MEM PURG CMPRS PGRP PPID STATE BOOSTS %CPU_ME %CPU_OTHRS UID FAULTS COW MSGSENT MSGRECV SYSBSD SYSMACH CSW PAGEINS IDLEW POWER INSTRS CYCLES USER #MREGS RPRVT VPRVT VSIZE KPRVT KSHRD
3080 Google Chrome He 0.0 20:24.02 21 1 8227 1615M 0B 1571M 474 474 sleeping *0[8] 0.00000 0.00000 501 5524905 7714 1808421 1421419 4368288 4418308 3109358 8823 95 0.0 0 0 stephentsou N/A N/A N/A N/A N/A N/A
513 Google Chrome He 0.0 04:32:42 12 1 1465 1443M 76K 381M 474 474 sleeping *23436[5] 0.00000 0.00000 501 33516775 3235 111356385 44348284 48790244 183115788 91915423 35594 550 0.0 0 0 stephentsou N/A N/A N/A N/A N/A N/A
1025 Google Chrome He 0.0 44:46.36 22 2 1002 1189M 0B 1039M 474 474 sleeping *0[9] 0.00000 0.00000 501 15556433 165452 4062463 1888438 8173050 14199963 10116014 3362 101 0.0 0 0 stephentsou N/A N/A N/A N/A N/A N/A
1103 grep 0.0 30:48:26 1/1 0 11 1078M 0B 36M 1103 990 running *0[1] 0.00000 0.00000 501 10692633 91 1057 15 2414 453 119361194 0 1 0.0 0 0 stephentsou N/A N/A N/A N/A N/A N/A
1067 Google Chrome He 0.0 49:50.96 20 2 821 952M 0B 851M 474 474 sleeping *0[9] 0.00000 0.00000 501 19604945 170533 3403874 1727320 7679152 11966849 9487312 5482 85 0.0 0 0 stephentsou N/A N/A N/A N/A N/A N/A
482 AdobeReader 0.0 21:46.75 28 5 542 907M 0B 860M 482 1 sleeping 0[2608] 0.00000 0.00000 501 19969502 94558 3698840 488920 3872863 14728605 5691182 1673354 79 0.0 0 0 stephentsou N/A N/A N/A N/A N/A N/A
Hello Stephen,
Thanks for the additional info. We are still investigating this situation. It may be related to certain commands timing-out or running out of memory. This does not normally happen, so I will ask you for some more information that may help us fix it.
Could you please send us the file containing rs# IDs that you uploaded to the Table Browser? Could you also send us the files created by the Table Browser identifier matching query and a short description of each, including any subsequent commands? If you cannot send the files, a line count of any remaining files would be helpful.
All the best,
Daniel Schmelter
UCSC Genome Browser
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAF%3DP_nJE7-HqDyj6ZDTrd3f6tQOZyJrJ4Xofx5okLmrTMW3exg%40mail.gmail.com.
Hello,
Thank you again for using the UCSC Genome Browser and sending your inquiry.
From looking at some of the rsIDs that you shared, we recommend the "All" subset of dbSNP instead of the default "Common" subset on the Table Browser.
Could you share the name of the file that has both rs# IDs and non-rs# IDs (the unfiltered version of forUCSCFiltered.txt), and the name of the file that has UCSC BED format for all 626,956 SNPs? We don't need the whole files, just the file names, but it would also be nice to see the first few lines of each file just to be sure.
With those filenames, we can suggest a sequence of commands for you to run to generate a 43k-line input file for liftOver.
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Jairo Navarro
UCSC Genome Browser
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAF%3DP_nJxjEtwPPxfzj4MLRHB6hWHhuR8T7HtwXMtmCM3o0hahA%40mail.gmail.com.