I can't find a single hg19 fasta file - necessary for CrossMap conversion

Brian Hanley

Aug 15, 2018, 7:32:54 PM
to gen...@soe.ucsc.edu
Hello,

I've been working on this for a few days. To load a track into the UCSC
Genome Browser, it needs to be in hg38 coordinates. The file I need to
convert is an exome file in VCF format.

I've got CrossMap installed. I have the hg19ToHg38.over.chain.gz file. I
have my vcf file of hg19 data.

The only thing I could find for an hg19 reference file is a directory of
per-chromosome files from UCSC. I looked on forums, and this appears to
be a common problem. Following the directions on forums, I unzipped all
the .fa files. Then I used cat to concatenate all the files in that
directory (listed below). Then I used bgzip to compress the resulting
file.

(First I followed the directions that said to just concatenate the .gz
files. But CrossMap needs files compressed with bgzip, and complained.
Apparently the files I downloaded from UCSC were not compressed with
bgzip? Did someone compress them with gzip instead? So I unzipped,
concatenated, and rezipped.)
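
One way to tell whether a compressed file is bgzip (BGZF) or plain gzip is to look at its first bytes. The sketch below assumes the BGZF layout that bgzip writes: a gzip member with the extra-field flag set and a 'BC' subfield directly after the 12-byte header. The function name is_bgzf and the demo file names are hypothetical and not part of any tool.

```sh
#!/bin/sh
# Return 0 if the file starts with a BGZF (bgzip) block header, 1 otherwise.
# Assumed layout: bytes 1-4 are 1f 8b 08 04 (gzip magic, deflate, FEXTRA flag)
# and bytes 13-14 are 42 43 (the 'BC' extra subfield identifier).
is_bgzf() {
    header=$(od -An -tx1 -N14 "$1" | tr -d ' \n')
    case "$header" in
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}

# Demo on two tiny files built on the fly (hypothetical names).
tmpdir=$(mktemp -d)
printf '>chrTest\nACGT\n' | gzip -c > "$tmpdir/plain.fa.gz"
# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
printf '\037\213\010\004\000\000\000\000\000\377\006\000\102\103\002\000\033\000\003\000\000\000\000\000\000\000\000\000' > "$tmpdir/bgzf_eof.gz"

for f in "$tmpdir/plain.fa.gz" "$tmpdir/bgzf_eof.gz"; do
    if is_bgzf "$f"; then
        echo "$(basename "$f"): looks like BGZF (bgzip)"
    else
        echo "$(basename "$f"): not BGZF (plain gzip or something else)"
    fi
done
rm -rf "$tmpdir"
```

If the check reports plain gzip, decompressing the file and recompressing it with bgzip, as described above, is the usual fix.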

The resulting output is:

@ 2018-08-15 15:52:07: Read chain_file:  hg19ToHg38.over.chain.gz
@ 2018-08-15 15:52:07: Creating index for allHG19files.fa.gz
@ 2018-08-15 15:52:28: Updating contig field ...
@ 2018-08-15 15:52:29: Total entries: 351912
@ 2018-08-15 15:52:29: Failed to map: 351912

I think this means that the file I made, allHG19files.fa.gz, has
something wrong with it. But I don't know what is wrong, and the forums
are no help. I have no idea how to proceed.

Can you tell me how to create an hg19 fasta reference file for the human
genome build? (Or is it available somewhere I couldn't find?)

- This is the command I used to create my allHG19files.fa file.

cat chr1.fa    chr1_gl000191_random.fa chr1_gl000192_random.fa   
chr2.fa    chr3.fa    chr4.fa chr4_ctg9_hap1.fa   
chr4_gl000193_random.fa chr4_gl000194_random.fa    chr5.fa    chr6.fa
chr6_apd_hap1.fa    chr6_cox_hap2.fa    chr6_dbb_hap3.fa
chr6_mann_hap4.fa    chr6_mcf_hap5.fa    chr6_qbl_hap6.fa
chr6_ssto_hap7.fa    chr7.fa    chr7_gl000195_random.fa chr8.fa   
chr8_gl000196_random.fa    chr8_gl000197_random.fa chr9.fa   
chr9_gl000198_random.fa    chr9_gl000199_random.fa
chr9_gl000200_random.fa    chr9_gl000201_random.fa    chr10.fa
chr11.fa    chr11_gl000202_random.fa    chr12.fa    chr13.fa chr14.fa   
chr15.fa    chr16.fa    chr17.fa chr17_ctg5_hap1.fa   
chr17_gl000203_random.fa chr17_gl000204_random.fa   
chr17_gl000205_random.fa chr17_gl000206_random.fa    chr18.fa
chr18_gl000207_random.fa    chr19.fa chr19_gl000208_random.fa   
chr19_gl000209_random.fa chr20.fa    chr21.fa   
chr21_gl000210_random.fa    chr22.fa chrM.fa    chrUn_gl000211.fa   
chrUn_gl000212.fa chrUn_gl000213.fa    chrUn_gl000214.fa   
chrUn_gl000215.fa chrUn_gl000216.fa    chrUn_gl000217.fa   
chrUn_gl000218.fa chrUn_gl000219.fa    chrUn_gl000220.fa   
chrUn_gl000221.fa chrUn_gl000222.fa    chrUn_gl000223.fa   
chrUn_gl000224.fa chrUn_gl000225.fa    chrUn_gl000226.fa   
chrUn_gl000227.fa chrUn_gl000228.fa    chrUn_gl000229.fa   
chrUn_gl000230.fa chrUn_gl000231.fa    chrUn_gl000232.fa   
chrUn_gl000233.fa chrUn_gl000234.fa    chrUn_gl000235.fa   
chrUn_gl000236.fa chrUn_gl000237.fa    chrUn_gl000238.fa   
chrUn_gl000239.fa chrUn_gl000240.fa    chrUn_gl000241.fa   
chrUn_gl000242.fa chrUn_gl000243.fa    chrUn_gl000244.fa   
chrUn_gl000245.fa chrUn_gl000246.fa    chrUn_gl000247.fa   
chrUn_gl000248.fa chrUn_gl000249.fa    chrX.fa    chrY.fa    >
allHG19files.fa


Thank you kindly, Brian

--
Brian Hanley, PhD Davis, CA 95616 (415)518-8153

Luis Nassar

Aug 16, 2018, 2:36:54 PM
to brian.pa...@gmail.com, gen...@soe.ucsc.edu

Hello Brian.

Thank you for submitting your question, I think we can help you out.

First I’d like to comment that if you’re lifting your file from hg19 to hg38 just to view the file in the browser, you can select the hg19 assembly from the "assembly" drop-down at the top of the hgCustom page: http://genome.ucsc.edu/cgi-bin/hgCustom, and load your VCF onto hg19 directly, without lifting first.

As for using CrossMap, the reference genome you supply should be the sequence of the target assembly, in this case hg38. You can see some examples here (also see note #2 below the example):

http://crossmap.sourceforge.net/#convert-vcf-format-files

This would mean that instead of using the “allHG19files.fa” you were putting together, you would use the equivalent for hg38, which you can find on our servers as “hg38.fa.gz”:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

This should allow CrossMap to successfully lift your VCF to hg38.

If this doesn’t work, there is another possibility: the sequence names in your VCF CHROM column may be formatted ‘1’, ‘2’, ‘3’, etc., while our files use the ‘chr1’, ‘chr2’, ‘chr3’ format. If so, you would simply need to change that first column to match.
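
If the CHROM column does turn out to use bare names, a rename along these lines may be enough. This is a minimal sketch with awk, assuming a tab-separated, uncompressed VCF; the function name add_chr_prefix and the sample file are hypothetical. It only adds the "chr" prefix, so a mitochondrial contig named MT would become chrMT and would need separate handling.

```sh
#!/bin/sh
# Print a copy of a VCF with "chr" prefixed to bare sequence names, both in
# the CHROM column and in ##contig header lines. Names that already start
# with "chr" are left unchanged.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^##contig=<ID=/ {
             if ($0 !~ /^##contig=<ID=chr/)
                 sub(/^##contig=<ID=/, "##contig=<ID=chr")
             print; next
         }
         /^#/ { print; next }              # other header lines pass through
         $1 !~ /^chr/ { $1 = "chr" $1 }    # data lines: fix the CHROM column
         { print }' "$1"
}

# Demo on a made-up two-record VCF (values are illustrative only).
sample=$(mktemp)
printf '##fileformat=VCFv4.1\n##contig=<ID=1,length=249250621>\n' > "$sample"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$sample"
printf '1\t13116\t.\tT\tG\t144\tPASS\t.\n' >> "$sample"
printf '1\t13118\t.\tA\tG\t144\tPASS\t.\n' >> "$sample"
add_chr_prefix "$sample"
rm -f "$sample"
```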

If the problem persists or any other issues arise, please feel free to message us back and include a snippet of your VCF file as well to help us troubleshoot. Please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.


Lou Nassar
UCSC Genomics Institute

Training videos & resources: http://genome.ucsc.edu/training/index.html
Want to share the Browser with colleagues?
Host a workshop: http://bit.ly/ucscTraining


Brian Hanley

Aug 16, 2018, 6:41:53 PM
to Luis Nassar, gen...@soe.ucsc.edu

Sorry, I didn't explain why I need the hg19 exome file in hg38 format. The human genome file that matches it was done later, with the new build. I want to have both tracks together, so I have to convert one of them.

The first run gave me the error below. (I got this error when I concatenated the hg19 files I downloaded; I had to unpack them and then use bgzip. I think whoever put those files on your server made a mistake and didn't use bgzip.)

[E::fai_build3_core] Cannot index files compressed with gzip, please use bgzip ... Could not build fai index hg38.fa.gz.fai\n'

So I unzipped, then rezipped with bgzip. Same problem.

@ 2018-08-16 15:13:09: Read chain_file:  hg19ToHg38.over.chain.gz
@ 2018-08-16 15:13:09: Creating index for hg38.fa.gz
@ 2018-08-16 15:13:31: Updating contig field ...
@ 2018-08-16 15:13:32: Total entries: 351912
@ 2018-08-16 15:13:32: Failed to map: 351912

I checked the CHROM column, and it uses the chr(n) format.

Example of format:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    1657261
chr1    13116    .    T    G    144    PASS    AC=4;AC1=4;AF=1.00;AF1=1;AN=4;DP=40;DP4=0,0,25,0;FQ=-66;FS=0.000;MLEAC=4;MLEAF=1.00;MQ0=0;QD=30.67;VDB=2.371346e-01;VQSLOD=-3.213e+01;culprit=MQ;set=Samtools-filterInHaplotypeCaller    GT:DP:GQ:PL    1/1:12:75:85,36,0
chr1    13118    .    A    G    144    PASS    AC=4;AC1=4;AF=1.00;AF1=1;AN=4;DP=41;DP4=0,0,25,0;FQ=-66;FS=0.000;MLEAC=4;MLEAF=1.00;MQ0=0;QD=26.61;VDB=2.297144e-01;VQSLOD=-3.170e+01;culprit=MQ;set=Samtools-filterInHaplotypeCaller    GT:DP:GQ:PL    1/1:12:75:85,36,0

VCF snippet attached.

VCF-snippet.vcf

Luis Nassar

Aug 17, 2018, 1:58:25 PM
to Brian Hanley, gen...@soe.ucsc.edu
Hello Brian,

We seem to have narrowed the issue down to CrossMap. Using your snippet test file, all entries fail to map with the latest CrossMap version. However, using your exact command with a previous release works. You can find previous releases at the following site: https://sourceforge.net/projects/crossmap/files/

We were able to successfully lift the files over with CrossMap-0.2.6.tar.gz, though I would recommend trying CrossMap-0.2.7.tar.gz as well.

If this fixes the issue, I encourage you to report the problem to the CrossMap authors at Wang....@mayo.edu. If you continue to encounter problems, could you run these two commands and send us the output?

$ gunzip -c hg19ToHg38.over.chain.gz | head
$ head hg38.fa
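
Beyond eyeballing that output, one rough check is to list any CHROM values in the VCF that never appear as a source sequence name in the chain file. The sketch below assumes the UCSC chain layout, where each header line starts with the word "chain" and its third field is the source (here hg19) sequence name. The function name unmatched_chroms and the tiny demo files are hypothetical, with made-up values.

```sh
#!/bin/sh
# Print every CHROM value in a VCF that is not a source sequence name in a
# gzip-compressed chain file. No output means every name has a match.
unmatched_chroms() {
    vcf=$1
    chain=$2
    names=$(mktemp)
    # Chain header lines look like: chain score tName tSize tStrand ...
    gunzip -c "$chain" | awk '$1 == "chain" { print $3 }' | sort -u > "$names"
    # Keep VCF sequence names that match none of the chain names exactly.
    grep -v '^#' "$vcf" | cut -f1 | sort -u | grep -v -x -F -f "$names"
    rm -f "$names"
}

# Demo with a one-chain file and a VCF that mixes "chr1" and "1".
tmpdir=$(mktemp -d)
printf 'chain 1000 chr1 249250621 + 10000 20000 chr1 248956422 + 10000 20000 1\n10000\n\n' \
    | gzip -c > "$tmpdir/demo.chain.gz"
printf '#CHROM\tPOS\tID\tREF\tALT\n' > "$tmpdir/demo.vcf"
printf 'chr1\t13116\t.\tT\tG\n1\t13118\t.\tA\tG\n' >> "$tmpdir/demo.vcf"
unmatched_chroms "$tmpdir/demo.vcf" "$tmpdir/demo.chain.gz"
rm -rf "$tmpdir"
```

An empty result on the real files would suggest the naming is consistent and that the failure lies elsewhere.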


Hopefully that won't be necessary though! If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute
