Creating an ooc file for chain file generation

Batra, Sajeev

Apr 25, 2018, 11:28:27 AM

to gen...@soe.ucsc.edu

Dear UCSC Genome Support team,

I'd like to generate a chainfile for liftover and would like to learn about the process that you use. My understanding is that you use the following script in your pipeline: https://github.com/ENCODE-DCC/kentUtils/blob/master/src/hg/utils/automation/doSameSpeciesLiftOver.pl. Is that right?

I have a similar pipeline to generate chain files, but I would like to know all the blat parameters that you use to generate your ooc file. Since the script above refers to 11.ooc, I believe you use tileSize=11. Which other blat parameters are used for ooc file generation? I'd like to replicate this step if possible.

Thanks for all your good work!

Sajeev Batra

Thermo Fisher Scientific

Hiram Clawson

Apr 25, 2018, 12:08:43 PM

to Batra, Sajeev, gen...@soe.ucsc.edu

Good Morning Sajeev:

Please do not use any source from the ENCODE-DCC GitHub; it is obsolete.

For commercial use, please note the license considerations for the
kent source tools:

https://genome-store.ucsc.edu/

The doSameSpeciesLiftOver procedure uses blat:

http://www.kentinformatics.com/

You want to use the doSameSpeciesLiftOver.pl script from the UCSC github:

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/doSameSpeciesLiftOver.pl

which has recently been improved to function outside the UCSC environment.

Use doSameSpeciesLiftOver.pl only if you are working with assemblies
of the same organism, where one assembly is an improvement of a previous assembly.
If you are working with different species, use the doBlastzChainNet.pl script instead:
http://genomewiki.ucsc.edu/index.php/DoChainNetBlastz.pl

The procedure for using doSameSpeciesLiftOver.pl is similar to
the doBlastzChainNet.pl description, in the sense of setting up parasol and so forth.

Use the doSameSpeciesLiftOver.pl script with these types of arguments:

export target="targetSequenceName"
export query="querySequenceName"

doSameSpeciesLiftOver.pl -verbose=2 -buildDir=`pwd` \
-ooc=/path/to/${target}/${target}.ooc -localTmp="/dev/shm" \
-bigClusterHub=localhost -dbHost=localhost -workhorse=localhost \
-fileServer=localhost -query2Bit=/path/to/${query}/${query}.2bit \
-querySizes=/path/to/${query}/${query}.chrom.sizes \
-target2Bit=/path/to/${target}/${target}.2bit \
-targetSizes=/path/to/${target}/${target}.chrom.sizes ${target} ${query}

Your .ooc file is constructed with blat:
blat ${target}.2bit /dev/null /dev/null -tileSize=11 \
-makeOoc=/path/to/${target}/${target}.ooc -repMatch=1000

The value for repMatch is calculated based on the size of your target
genome sequence. To calculate your target genome size:
twoBitToFa ${target}.2bit stdout | faSize stdin
Note the number of 'real' bases (== bases that are not 'N')
Use this number to calculate this ratio:
calc \( realBases / 2861349177 \) \* 1024
(calc is a kent utility command)
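
As a rough sketch, the 'real' base count could be pulled out of the faSize
summary with a small helper. The function name real_bases is hypothetical,
and the sketch assumes the first line of faSize output has the form shown in
the comment (the sample numbers there are made up for illustration):

```shell
# Hypothetical helper: print the number that precedes the word "real"
# on the first line where that word appears.
# Assumed faSize summary format (illustrative numbers only):
#   2900000000 bases (96362325 N's 2803637675 real 2803637675 upper 0 lower) in 25 sequences in 1 files
real_bases() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "real") { print $(i - 1); exit }
        }
    }'
}

# Usage, assuming the kent utilities are on your PATH:
#   twoBitToFa ${target}.2bit stdout | faSize stdin | real_bases
```

If your faSize version prints a different summary layout, adjust the field
matching accordingly.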

This ratio normalizes to the number of 'real' bases in the
human genome sequence hg19, where we used 1024 for repMatch.

For example, let's say you measured your genome and found 2803637675
'real' bases, then:
calc \( 2803637675 / 2861349177 \) \* 1024
( 2803637675 / 2861349177 ) * 1024 = 1003.346604

Round that down to the nearest 50, giving 1000, and use -repMatch=1000.
This procedure tries to obtain a reasonable number of overused 11-mers
in the .ooc file. You don't want too many, nor too few. What counts as
'many' or 'few' depends upon the size of the genome sequence.
For a 3 billion base sized genome, this is on the order of 30,000 to 40,000
overused 11-mer tiles.
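
The scaling and rounding described above can be sketched without the kent
calc utility by using awk. The function name rep_match is hypothetical; the
constants 2861349177 and 1024 are the hg19 values quoted in this message:

```shell
# Hypothetical helper: scale repMatch from the hg19 baseline, then
# round down to a multiple of 50.
rep_match() {
    awk -v real="$1" 'BEGIN {
        hg19_real = 2861349177   # real (non-N) bases in hg19
        hg19_rep  = 1024         # repMatch used for hg19
        scaled = real / hg19_real * hg19_rep
        printf "%d\n", int(scaled / 50) * 50
    }'
}

rep_match 2803637675   # prints 1000
```

Note that for small genomes (under roughly 140 million real bases) this
sketch rounds down to 0, so a value would have to be chosen by hand in that
case.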

On 4/24/18 9:34 PM, Batra, Sajeev wrote:
> Dear UCSC Genome Support team,
>
> I'd like to generate a chainfile for liftover and would like to learn about the process that you use. My understanding is that you use the following script in your pipeline: https://github.com/ENCODE-DCC/kentUtils/blob/master/src/hg/utils/automation/doSameSpeciesLiftOver.pl. Is that right?

Hiram Clawson

Apr 26, 2018, 11:42:17 AM

to Batra, Sajeev, gen...@soe.ucsc.edu

Good Morning Sajeev:

See also:

http://genomewiki.ucsc.edu/index.php/DoSameSpeciesLiftOver.pl

--Hiram
