[Genome] How to extract a subregion from an alignment file

张振国

Jul 8, 2008, 9:40:55 AM
to gen...@soe.ucsc.edu
Dear Colleagues,

I would like to extract specific regions from an alignment file. I have downloaded the MAF file multiz28wayAnno.tar.gz from your FTP site, and I can extract regions using the tool mafsInRegion. For example, with the command 'mafsInRegion chr22.bed chr22.out chr22.maf', I obtain the parts of the alignment that fall in the regions listed in chr22.bed. The file chr22.bed contains:
chr22 14430000 14430020
chr22 14430212 14430292

The problem is that this method takes much longer than the same extraction on your website. This is impractical because I have thousands of regions to extract.

I searched the mailing list for similar problems and found that 'mafFrags' can also extract a region from the multiple alignments. Is it faster than 'mafsInRegion'? Also, I do not know the proper values for the 'database' and 'track' inputs. Could you give me some examples for these two inputs, please?

By the way, for pairwise alignment files such as hg18.panTro2.all.chain.gz or hg18.panTro2.net.gz, how can I extract a certain region from these files (in chain or net format)?

Best wishes!

Sincerely,
Zhenguo
-----------------------------
张振国
中国科学院上海生命科学院健康科学研究所
中国上海市瑞金二路197号瑞金医院科教大楼1206室 200025
电话:02164370045--611206
Zhenguo Zhang
Institute of Health and Science,SIBS,CAS
Room 1206,Technological and Educational Building,Ruijin Hospital
Ruijin No. 2 Road 197,Shanghai,China.200025
Tel:02164370045-611206

Brooke Rhead

Jul 8, 2008, 7:42:20 PM
to 张振国, gen...@soe.ucsc.edu
Hello Zhenguo,

The mafsInRegion utility can be quite slow, but it can be made faster by
requiring that the bed and maf files are in the correct order.

One of our developers has just added a utility called mafBedSubset that
takes the same arguments as mafsInRegion, but requires that the bed file
be non-overlapping and sorted by chrom and chromStart, and that the maf
file have the reference sequence in the correct order and orientation
and also non-overlapping. The maf files we provide are already in the
right condition. The mafBedSubset utility will give an error if the bed
files overlap or aren't sorted.
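
As an illustration of the bed requirements described above, here is a minimal
Python sketch that sorts regions by chrom and chromStart and rejects
overlaps before the file is handed to mafBedSubset. The function name
prepare_bed is hypothetical and not part of the UCSC source tree, and the
sketch assumes plain lexical ordering of chromosome names is acceptable
(this does not matter for a single-chromosome file such as chr22.bed).

```python
def prepare_bed(lines):
    """Return BED lines sorted by chrom, then chromStart; raise on overlap.

    Hypothetical helper, not a UCSC utility. BED intervals are half-open,
    so two regions overlap when the next start is below the previous end.
    """
    records = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 3 or line.startswith("#"):
            continue
        if fields[0] in ("track", "browser"):
            continue
        records.append((fields[0], int(fields[1]), int(fields[2]), fields[3:]))

    # Assumption: lexical chromosome order is good enough for the maf in use.
    records.sort(key=lambda rec: (rec[0], rec[1]))

    for prev, cur in zip(records, records[1:]):
        if prev[0] == cur[0] and cur[1] < prev[2]:
            raise ValueError(
                "overlapping regions: %s:%d-%d and %s:%d-%d"
                % (prev[0], prev[1], prev[2], cur[0], cur[1], cur[2])
            )

    return [
        "\t".join([chrom, str(start), str(end)] + rest)
        for chrom, start, end, rest in records
    ]
```

To use it, read the lines of the existing bed file, pass them to
prepare_bed, and write the returned lines to a new file for mafBedSubset.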

You will need to update your source code to get the mafBedSubset
utility. See:
http://genome.ucsc.edu/admin/cvs.html
If you leave off the -rbeta flag, you will get the most up-to-date
source tree.

> I searched the mailing list for similar problems and found that 'mafFrags' can also extract a region from the multiple alignments. Is it faster than 'mafsInRegion'? Also, I do not know the proper values for the 'database' and 'track' inputs. Could you give me some examples for these two inputs, please?
>
Thank you for searching the mailing list first! Example values for the
mafFrags inputs are hg18 (for database) and multiz28way (for track). The
mafFrags utility expects a MySQL database with some of the data loaded
into tables. Some relevant information on mafFrags is in
this previously answered question:
http://www.soe.ucsc.edu/pipermail/genome/2007-May/013647.html

Regarding extracting data from pairwise alignments: there are two
utilities in the source tree called chainFilter and netFilter that may
be close to what you are looking for.
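
For readers who want to see what such a selection involves, here is a
rough Python sketch that keeps every chain whose target (reference) range
overlaps a given region. It is not how chainFilter works internally and
does not reproduce its options; the function name chains_overlapping is
hypothetical. It relies only on the chain header layout
"chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id"
and assumes the target strand is "+", so tStart and tEnd are forward
coordinates. Whole chains are kept; they are not trimmed to the region.

```python
def chains_overlapping(lines, chrom, start, end):
    """Return the lines of every chain whose target range overlaps
    chrom:start-end (0-based, half-open).

    Hypothetical helper, not a UCSC utility.
    """
    kept = []
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # skip comment lines
        if line.startswith("chain"):
            fields = line.split()
            t_name = fields[2]
            t_start, t_end = int(fields[5]), int(fields[6])
            # Half-open interval overlap test on the target sequence.
            keep = t_name == chrom and t_start < end and t_end > start
        if keep:
            # Header, alignment block lines, and the blank separator line.
            kept.append(line)
    return kept
```

The lines can come from a compressed file opened in text mode, for
example with gzip.open("hg18.panTro2.all.chain.gz", "rt").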

I hope this is helpful. Please feel free to email us again at
gen...@soe.ucsc.edu if you have further questions.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

