Good Morning Sajeev:
Please do not use any source from the ENCODE-DCC GitHub repository; it is obsolete.
For commercial use, please note the license considerations for the
kent source tools:
https://genome-store.ucsc.edu/
The doSameSpeciesLiftOver procedure uses blat, which is licensed separately:
http://www.kentinformatics.com/
You want to use the doSameSpeciesLiftOver.pl script from the UCSC github:
http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/doSameSpeciesLiftOver.pl
which has recently been improved to function outside the UCSC environment.
Use doSameSpeciesLiftOver.pl only if you are working with assemblies
of the same organism, where one assembly is an improvement on a previous assembly.
If you are working with different species, you want to use the doBlastzChainNet.pl script:
http://genomewiki.ucsc.edu/index.php/DoChainNetBlastz.pl
The procedure for using doSameSpeciesLiftOver.pl is similar to
the doBlastzChainNet.pl description, for example in the setup of parasol.
Use the doSameSpeciesLiftOver.pl script with arguments like these:
export target="targetSequenceName"
export query="querySequenceName"
doSameSpeciesLiftOver.pl -verbose=2 -buildDir=`pwd` \
-ooc=/path/to/${target}/${target}.ooc -localTmp="/dev/shm" \
-bigClusterHub=localhost -dbHost=localhost -workhorse=localhost \
-fileServer=localhost -query2Bit=/path/to/${query}/${query}.2bit \
-querySizes=/path/to/${query}/${query}.chrom.sizes \
-target2Bit=/path/to/${target}/${target}.2bit \
-targetSizes=/path/to/${target}/${target}.chrom.sizes ${target} ${query}
Your .ooc file is constructed with blat:
blat ${target}.2bit /dev/null /dev/null -tileSize=11 \
-makeOoc=/path/to/${target}/${target}.ooc -repMatch=1000
The value for repMatch is calculated based on the size of your target
genome sequence. To calculate your target genome size:
twoBitToFa ${target}.2bit stdout | faSize stdin
Note the number of 'real' bases (== bases that are not 'N')
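If you would rather not read the number off by eye, something like the following sketch can pull it out of the faSize output. The helper name real_bases is made up for illustration, and the sketch assumes the first faSize line has the layout shown in the comment, so please check that against your own output first.

```shell
# Assumed layout of the first faSize output line (verify on your own output):
#   <total> bases (<n> N's <real> real <upper> upper <lower> lower) in ...
# real_bases is a hypothetical helper, not a kent utility.
real_bases() {
  # Print the field that immediately precedes the word "real" on line 1.
  awk 'NR == 1 { for (i = 2; i <= NF; i++) if ($i == "real") { print $(i - 1); exit } }'
}

# Typical use, with the same file naming as the commands above:
#   twoBitToFa ${target}.2bit stdout | faSize stdin | real_bases
# Demonstration on a made-up sample line (not real data):
echo "3000 bases (200 N's 2800 real 2800 upper 0 lower) in 2 sequences in 1 files" | real_bases
```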
Use this number to calculate this ratio:
calc \( realBases / 2861349177 \) \* 1024
(calc is a kent utility command)
This ratio normalizes to the number of 'real' bases in the
human genome sequence hg19, where we used 1024 for repMatch.
For example, suppose you measured your genome and found
2803637675 'real' bases, then:
calc \( 2803637675 / 2861349177 \) \* 1024
( 2803637675 / 2861349177 ) * 1024 = 1003.346604
Round that down to the nearest multiple of 50, which gives 1000, and use -repMatch=1000
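The scaling and rounding can also be done in one step. This is only a sketch using awk instead of calc; the function name rep_match is made up for illustration, and the constants 2861349177 and 1024 are the hg19 values mentioned above.

```shell
# rep_match is a hypothetical helper, not a kent utility.
# Argument: the number of 'real' (non-N) bases in the target genome.
rep_match() {
  awk -v real="$1" 'BEGIN {
    hg19Real = 2861349177           # real bases in hg19
    scaled = real / hg19Real * 1024 # hg19 used repMatch=1024
    print int(scaled / 50) * 50     # round down to a multiple of 50
  }'
}

rep_match 2803637675   # prints 1000
```

The printed value is what you would pass as -repMatch when building the .ooc file.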
This procedure tries to obtain a reasonable number of overused 11-mers
in the .ooc file. You don't want too many, nor too few. What counts
as 'many' or 'few' depends upon the size of the genome sequence.
For a 3 billion base genome, this is on the order of 30,000 to 40,000
overused 11-mer tiles.
On 4/24/18 9:34 PM, Batra, Sajeev wrote:
> Dear UCSC Genome Support team,
>
>
> I'd like to generate a chainfile for liftover and would like to learn about the process that you use. My understanding is that you use the following script in your pipeline:
> https://github.com/ENCODE-DCC/kentUtils/blob/master/src/hg/utils/automation/doSameSpeciesLiftOver.pl. Is that right?