Re: [Bioperl-l] Bio::TreeIO

12 views
Skip to first unread message

Jason Stajich

unread,
Jan 25, 2011, 4:46:16 PM1/25/11
to Komal Jain, BioPerl List
Hi Komal -

You would just iterate over all the nodes, unset the bootstrap value if
it is < that your cutoff.

for my $node ( $tree->get_nodes ) {
if($node->bootstrap < $CUTOFF ){
$node->bootstrap('');
}
}

The subtly between whether values get moved to boostrap value or just as
an id can be set when you do Bio::TreeIO reading in, you can also call
$tree->move_id_to_bootstrap
which moves them over to the bootstrap slot.

Alternatively you can just know that ids for internal nodes are for
boostraps and use that field, e.g.
for my $node ( $tree->get_nodes ) {
if(! $node->is_Leaf && $node->id < $CUTOFF ){
$node->id('');
}
}

And write the tree back out. The HOWTO on Trees shows I/O for trees in
more detail.

Best wishes.


Komal Jain wrote:
> Dear Jason,
>
>
>
> I went through the documentation of the Bio::Tree module and it is a very
> useful tool. I would like to know from you if it is possible to remove the
> bootstrap values smaller than a cutoff from the newick tree.
>
>
>
> Thanks in advance for your reply,
>
> Komal.
>

--
Jason Stajich
ja...@bioperl.org
http://bioperl.org/wiki

_______________________________________________
Bioperl-l mailing list
Biop...@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l

Komal Jain

unread,
Jan 25, 2011, 4:50:39 PM1/25/11
to Jason Stajich, BioPerl List
Thanks very much for your prompt reply. I greatly appreciate it.

Komal.

Juan Jovel

unread,
Feb 15, 2011, 1:36:11 PM2/15/11
to Jason Stajich, kj2...@columbia.edu, bioperl

Good Morning guys,
sorry for the naive question: What's the simplest way to fish redundant sequences (complete or partial) between two (or more) fasta files.
I was thinking just to do it with SeqIO, opening two files, and compare each sequence of file_1 to each record of file_2, like:
# Read each record of file 1 and compare to each read of file 2while(my $dna1 = $seqin1->next_seq){ my $seq1 = $dna1->seq; my $id1 = $dna1->id;
# Iterate inside de second fasta file while(my $dna2 = $seqin2->next_seq){ my $seq2 = $dna2->seq; my $id2 = $dna2->id;
if(($seq1 =~ /$seq2/)||($seq2 =~ /$seq1/)){ print "Match found \n"; print OUT "Records $id1 and $id2 are redundants";
...
I am afraid it is going to be slow for large files. AND, more importantly, how do I reset the object containing the second file to the first line, as done in Perl with (SEEK(IN, 0,0)) for example. Does SeqIO allows that (sorry, I am not a frequent user of SeqIO).
If there is another more-elaborated module to fish such redundant sequences, I will appreciate to know.
Thanks,
JUAN

Juan Jovel

unread,
Feb 15, 2011, 2:34:58 PM2/15/11
to bioperl

Good Morning guys,
> sorry for the naive question: What's the simplest way to fish redundant sequences (complete or partial) between two (or more) fasta files.
> I was thinking just to do it with SeqIO, opening two files, and compare each sequence of file_1 to each record of file_2, like:
# Read each record of file 1 and compare to each read of
file 2

while(my $dna1 = $seqin1->next_seq){

my $seq1 =
$dna1->seq;

my $id1 =
$dna1->id;

# Iterate
inside de second fasta file

while(my $dna2
= $seqin2->next_seq){

my $seq2 = $dna2->seq;

my
$id2 = $dna2->id;


if(($seq1 =~ /$seq2/)||($seq2 =~ /$seq1/)){


print "Match found \n";


print OUT "Records $id1 and $id2 are redundants";

I am afraid it is going to be slow for large files. AND, more importantly, how do I reset the object containing the second file to the first line, as done in Perl with (SEEK(IN, 0,0)) for example. Does SeqIO allows that (sorry, I am not a frequent user of SeqIO). If there is another more-elaborated module to fish such redundant sequences, I will appreciate to know.

Chris Fields

unread,
Feb 15, 2011, 3:02:38 PM2/15/11
to Juan Jovel, bioperl
Juan,

If you are checking for simple complete matches, I would suggest using a hash. However, you are also looking for partial matches as well. In this case it seems like you should be (ab)using something akin to mcl to cluster like sequences together; you're essentially performing an all-v-all comparison anyway, at least take advantage of faster tools.

So, basically:

1) Run an all-v-all comparison, filtering on 100% identity, no gaps
2) cluster using mcl

Note the BLAST-related programs here for that purpose:

http://www.micans.org/mcl/man/mclfamily.html

I think you can also use other tools instead of BLAST, just can't recall the mcl pipeline at the moment to use.

chris

Dave Messina

unread,
Feb 15, 2011, 3:09:35 PM2/15/11
to Juan Jovel, bioperl
Hi Juan,

There's a nice example script in the BioPerl distribution that Jason Stajich
wrote which uses MD5 checksums to do the sequence comparison:


https://github.com/bioperl/bioperl-live/blob/master/scripts/utilities/bp_nrdb.PLS


There are also faster, nonBioPerl tools for this, such as the one that comes
with UCLUST:

http://www.drive5.com/usearch/


Dave

Chris Fields

unread,
Feb 15, 2011, 3:25:07 PM2/15/11
to Dave Messina, Juan Jovel, bioperl
SHA should work as well, didn't think of that (though I suppose the encoding step for either would be rate-limiting?).

Will have to keep an eye on UCLUST, didn't know about that one.

chris

Cook, Malcolm

unread,
Feb 15, 2011, 3:28:09 PM2/15/11
to Chris Fields, Dave Messina, Juan Jovel, bioperl
there there is CD-HIT

and blastclust from ncbi (which I think still gets installed as part of installed NCBI blast suite)


Malcolm Cook
Stowers Institute for Medical Research - Bioinformatics
Kansas City, Missouri USA

Dave Messina

unread,
Feb 15, 2011, 3:47:20 PM2/15/11
to Chris Fields, Juan Jovel, bioperl
SHA should work as well, didn't think of that (though I suppose the encoding
> step for either would be rate-limiting?).
>

I haven't tested it, but I suspect that encoding either MD5 or SHA would be
relatively quick compared to the sequence I/O, no?

Will have to keep an eye on UCLUST, didn't know about that one.


As it happens, my current pipeline uses MCL but I'm testing UCLUST as a
replacement since it's waaay faster. I'll let you know how the comparison
turns out.

And for that matter, if anyone listening has experience with UCLUST or
CD-HIT or other clustering methods (ideally in the context of metagenomic
next-gen sequence), please chime in with your thoughts.

Chris Fields

unread,
Feb 15, 2011, 4:01:49 PM2/15/11
to Dave Messina, Juan Jovel, bioperl
On Feb 15, 2011, at 2:47 PM, Dave Messina wrote:

> SHA should work as well, didn't think of that (though I suppose the encoding
>> step for either would be rate-limiting?).
>>
>
> I haven't tested it, but I suspect that encoding either MD5 or SHA would be
> relatively quick compared to the sequence I/O, no?

Possibly. But one nice thing is clustering allows for partial matches (which I think is the original criterion). I don't believe SHA/MD5 would work for that purpose.

> Will have to keep an eye on UCLUST, didn't know about that one.
>
>
> As it happens, my current pipeline uses MCL but I'm testing UCLUST as a
> replacement since it's waaay faster. I'll let you know how the comparison
> turns out.
>
> And for that matter, if anyone listening has experience with UCLUST or
> CD-HIT or other clustering methods (ideally in the context of metagenomic
> next-gen sequence), please chime in with your thoughts.

As malcolm pointed out, blastclust is also available with legacy BLAST, though I'm not sure it's available with BLAST+ (didn't see anything obvious with BLAST+ for that purpose).

chris

Dave Messina

unread,
Feb 15, 2011, 4:22:58 PM2/15/11
to Chris Fields, Juan Jovel, bioperl
>
> But one nice thing is clustering allows for partial matches (which I think
> is the original criterion). I don't believe SHA/MD5 would work for that
> purpose.


Yep, for sure. Checksums will find full-length exact matches only.


Dave

Jason Stajich

unread,
Feb 15, 2011, 5:21:34 PM2/15/11
to Dave Messina, Chris Fields, Juan Jovel, bioperl
also see cd-hit which allows you to tune the %id matching.

--

_______________________________________________

Adam Sjøgren

unread,
Feb 16, 2011, 8:01:31 AM2/16/11
to biop...@bioperl.org
On Tue, 15 Feb 2011 14:25:07 -0600, Chris wrote:

> SHA should work as well, didn't think of that (though I suppose the
> encoding step for either would be rate-limiting?).

Disk I/O might be the bottleneck - on a 3+ year old desktop I get ~144
MB/s for sha1 and ~217 MB/s for md5 in a simple test:

$ dd if=/dev/zero bs=1M count=1024 | sha1sum -
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 7.44032 s, 144 MB/s
2a492f15396a6768bcbca016993f4b4c8b0b5307 -
$ dd if=/dev/zero bs=1M count=1024 | md5sum -
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.94205 s, 217 MB/s
cd573cfaace07e7949bc0c46028904ff -

On a reasonably new standard Dell desktop I get ~249 MB/s and ~410 MB/s
respectively.


Best regards,

Adam

--
Adam Sjøgren
ad...@novozymes.com

Reply all
Reply to author
Forward
0 new messages