Hello Liu,
Thank you for your question about locating regions of high sequence identity. It is somewhat awkward to extract this information directly from the chain files themselves, as chains do not include information about the number of matching and mismatching bases within blocks. Fortunately, it is easy enough to convert alignments from the chain format to PSL, which includes a count of how many bases in the alignment were matches and mismatches. More information about the structure of the PSL format is available at http://genome.ucsc.edu/FAQ/FAQformat.html#format2.
In addition to the hg19.hg19.all.chain.gz file, you will also need a "sizes" file containing the sizes of the chromosomes of the hg19 assembly. You can find such a file on our download server at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes. You will also need our "chainToPsl" conversion tool, available from http://hgdownload.soe.ucsc.edu/admin/exe/ for several computer architectures. If we do not provide a precompiled program for your computer architecture, you can download our userApps package and build the utility from the program source code. Finally, you will need a twoBit file containing the sequence of the hg19 genome assembly. That file is available from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit.
Once you have those files, you should be able to run the following command to obtain a PSL alignment file (named output.psl) that corresponds to the self chain alignments. That PSL file will include a count of the matching bases for each alignment, from which you should be able to calculate the sequence identity.
chainToPsl hg19.hg19.all.chain.gz hg19.chrom.sizes hg19.chrom.sizes hg19.2bit hg19.2bit output.psl
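As a rough illustration of the identity calculation, here is a minimal Python sketch that reads a PSL file and reports alignments above a chosen threshold. It is only a sketch under a few assumptions: the function names (psl_identity, high_identity_regions) and the 90% default cutoff are made up for this example, and identity is defined here as (matches + repMatches) / (matches + repMatches + misMatches), which ignores gaps. You may prefer a different definition for your analysis.

```python
def psl_identity(line):
    """Parse one PSL line; return (qName, qStart, qEnd, tName, tStart, tEnd, identity)
    or None for header, blank, or malformed lines.

    Assumption: identity = (matches + repMatches) / aligned bases, gaps ignored.
    """
    fields = line.rstrip("\n").split("\t")
    # A PSL data line has 21 tab-separated columns and starts with the match count.
    if len(fields) < 21 or not fields[0].isdigit():
        return None
    matches = int(fields[0])
    mismatches = int(fields[1])
    rep_matches = int(fields[2])
    aligned = matches + mismatches + rep_matches
    if aligned == 0:
        return None
    identity = 100.0 * (matches + rep_matches) / aligned
    return (fields[9], int(fields[11]), int(fields[12]),
            fields[13], int(fields[15]), int(fields[16]), identity)


def high_identity_regions(path, min_identity=90.0):
    """Yield parsed records from the PSL file at `path` whose identity is at
    least `min_identity` percent (hypothetical default of 90)."""
    with open(path) as handle:
        for line in handle:
            record = psl_identity(line)
            if record is not None and record[-1] >= min_identity:
                yield record


# Example use, assuming output.psl was produced by the chainToPsl command above:
#   for rec in high_identity_regions("output.psl", min_identity=95.0):
#       print(*rec, sep="\t")
```

Each reported record gives the query and target coordinates of the alignment followed by its percent identity, so the output can be sorted or filtered further as needed.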
You may also be interested in our Segmental Duplications track (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=genomicSuperDups), which lists regions of 1000+ bases that have 90% or higher sequence identity to another part of the assembly. The identity percentage of each region is included in the track data.
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.
--
Jonathan Casper
UCSC Genome Bioinformatics Group