Get Identity Based on Self Chain Track


l...@pku.edu.cn

May 20, 2015, 1:03:08 PM
to Genome
Hello,

I want to extract regions that have 95% sequence identity to another genomic region, based on the Self Chain track (hg19). I have downloaded chainSelfLink.sql, chainSelfLink.txt, chainSelf.sql, chainSelf.txt, and hg19.hg19.all.chain.gz, but I could not find any information in them about the identity of the regions. Could you tell me how to get the identity information?

Thank you for your time. I look forward to your reply.

Regards,
Liu Fenglin


Liu Fenglin
Ph.D. candidate in Bioinformatics
BIOPIC, College of Life Sciences,
Peking University
Beijing, 100871
China

Jonathan Casper

May 26, 2015, 8:24:36 PM
to l...@pku.edu.cn, Genome

Hello Liu,

Thank you for your question about locating regions of high sequence identity. It is somewhat awkward to extract this information directly from the chain files themselves, as chains do not include information about the number of matching and mismatching bases within blocks. Fortunately, it is easy enough to convert alignments from the chain format to PSL, which includes a count of how many bases in the alignment were matches and mismatches. More information about the structure of the PSL format is available at http://genome.ucsc.edu/FAQ/FAQformat.html#format2.

In addition to the hg19.hg19.all.chain.gz file, you will also need a "sizes" file containing the sizes of the chromosomes of the hg19 assembly. You can find such a file on our download server at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes. You will also need our "chainToPsl" conversion tool, available from http://hgdownload.soe.ucsc.edu/admin/exe/ for several computer architectures. If we do not provide a precompiled program for your computer architecture, you can also download our userApps package and build the utility from the program source code. Finally, you will need a twoBit file containing the sequence of the hg19 genome assembly. That file is available from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit.

Once you have those files, you should be able to run the following command to obtain a PSL alignment file (named output.psl) that corresponds to the self chain alignments. That PSL file will include a count of the matching bases for each alignment, from which you should be able to calculate the sequence identity.

  chainToPsl hg19.hg19.all.chain.gz hg19.chrom.sizes hg19.chrom.sizes hg19.2bit hg19.2bit output.psl
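
As a rough illustration of the last step, here is a minimal Python sketch that reads PSL lines and keeps the alignments at or above an identity threshold. It assumes the standard 21-column, tab-separated PSL layout described in the format FAQ linked above (matches, misMatches, and repMatches in the first three columns). The identity formula used here, (matches + repMatches) / (matches + repMatches + misMatches), is only one possible definition: it ignores gaps, and you may prefer a different one. The names PslHit, psl_identity, and high_identity_hits are made up for this example and are not part of any UCSC tool.

```python
from typing import Iterable, Iterator, List, NamedTuple


class PslHit(NamedTuple):
    """Coordinates of one PSL alignment plus its computed identity (hypothetical helper type)."""
    q_name: str
    q_start: int
    q_end: int
    t_name: str
    t_start: int
    t_end: int
    identity: float


def psl_identity(fields: List[str]) -> float:
    """Percent identity over aligned bases, ignoring gaps (an assumed definition)."""
    matches = int(fields[0])      # PSL column 1: matches
    mismatches = int(fields[1])   # PSL column 2: misMatches
    rep_matches = int(fields[2])  # PSL column 3: repMatches (matches inside repeats)
    aligned = matches + mismatches + rep_matches
    if aligned == 0:
        return 0.0
    return 100.0 * (matches + rep_matches) / aligned


def high_identity_hits(lines: Iterable[str], min_identity: float = 95.0) -> Iterator[PslHit]:
    """Yield alignments whose identity is at least min_identity percent."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip header lines, blank lines, and anything that is not a full PSL row.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        identity = psl_identity(fields)
        if identity >= min_identity:
            yield PslHit(
                q_name=fields[9], q_start=int(fields[11]), q_end=int(fields[12]),
                t_name=fields[13], t_start=int(fields[15]), t_end=int(fields[16]),
                identity=identity,
            )
```

To use it, open output.psl as a text file and pass the file object to high_identity_hits, then print or save the fields of each PslHit. The all-chain file for hg19 is large, so reading it line by line as above avoids holding the whole file in memory.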

You may also be interested in our Segmental Duplications track (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=genomicSuperDups), which lists regions of 1000+ bases that have 90% or higher sequence identity to another part of the assembly. The identity percentage of each region is included in the track data.
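
If you go the Segmental Duplications route, a small filter along these lines could narrow the track data to the 95% threshold. This is only a sketch under several assumptions: that you have exported the table as tab-separated text, that the identity is stored as a fraction between 0 and 1, and that you have looked up the position of the identity column in the table schema yourself. The function name filter_by_fraction and the frac_col parameter are hypothetical, and no column position is hard-coded here because it should be checked against the schema.

```python
from typing import Iterable, Iterator, List


def filter_by_fraction(lines: Iterable[str], frac_col: int,
                       min_frac: float = 0.95) -> Iterator[List[str]]:
    """Yield rows whose value in column frac_col (0-based) is at least min_frac.

    frac_col must be taken from the table schema; it is an input, not a known constant.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= frac_col:
            continue  # row too short to contain the identity column
        try:
            frac = float(fields[frac_col])
        except ValueError:
            continue  # header line or non-numeric value
        if frac >= min_frac:
            yield fields
```

Each yielded row is the full list of fields, so the region coordinates and the coordinates of its duplicate partner remain available for downstream use.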

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

