You would just iterate over all the nodes, unset the bootstrap value if
it is < that your cutoff.
for my $node ( $tree->get_nodes ) {
if($node->bootstrap < $CUTOFF ){
$node->bootstrap('');
}
}
The subtly between whether values get moved to boostrap value or just as
an id can be set when you do Bio::TreeIO reading in, you can also call
$tree->move_id_to_bootstrap
which moves them over to the bootstrap slot.
Alternatively you can just know that ids for internal nodes are for
boostraps and use that field, e.g.
for my $node ( $tree->get_nodes ) {
if(! $node->is_Leaf && $node->id < $CUTOFF ){
$node->id('');
}
}
And write the tree back out. The HOWTO on Trees shows I/O for trees in
more detail.
Best wishes.
Komal Jain wrote:
> Dear Jason,
>
>
>
> I went through the documentation of the Bio::Tree module and it is a very
> useful tool. I would like to know from you if it is possible to remove the
> bootstrap values smaller than a cutoff from the newick tree.
>
>
>
> Thanks in advance for your reply,
>
> Komal.
>
--
Jason Stajich
ja...@bioperl.org
http://bioperl.org/wiki
_______________________________________________
Bioperl-l mailing list
Biop...@lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/bioperl-l
Komal.
while(my $dna1 = $seqin1->next_seq){
my $seq1 =
$dna1->seq;
my $id1 =
$dna1->id;
# Iterate
inside de second fasta file
while(my $dna2
= $seqin2->next_seq){
my $seq2 = $dna2->seq;
my
$id2 = $dna2->id;
if(($seq1 =~ /$seq2/)||($seq2 =~ /$seq1/)){
print "Match found \n";
print OUT "Records $id1 and $id2 are redundants";
I am afraid it is going to be slow for large files. AND, more importantly, how do I reset the object containing the second file to the first line, as done in Perl with (SEEK(IN, 0,0)) for example. Does SeqIO allows that (sorry, I am not a frequent user of SeqIO). If there is another more-elaborated module to fish such redundant sequences, I will appreciate to know.
If you are checking for simple complete matches, I would suggest using a hash. However, you are also looking for partial matches as well. In this case it seems like you should be (ab)using something akin to mcl to cluster like sequences together; you're essentially performing an all-v-all comparison anyway, at least take advantage of faster tools.
So, basically:
1) Run an all-v-all comparison, filtering on 100% identity, no gaps
2) cluster using mcl
Note the BLAST-related programs here for that purpose:
http://www.micans.org/mcl/man/mclfamily.html
I think you can also use other tools instead of BLAST, just can't recall the mcl pipeline at the moment to use.
chris
There's a nice example script in the BioPerl distribution that Jason Stajich
wrote which uses MD5 checksums to do the sequence comparison:
https://github.com/bioperl/bioperl-live/blob/master/scripts/utilities/bp_nrdb.PLS
There are also faster, nonBioPerl tools for this, such as the one that comes
with UCLUST:
http://www.drive5.com/usearch/
Dave
Will have to keep an eye on UCLUST, didn't know about that one.
chris
and blastclust from ncbi (which I think still gets installed as part of installed NCBI blast suite)
Malcolm Cook
Stowers Institute for Medical Research - Bioinformatics
Kansas City, Missouri USA
I haven't tested it, but I suspect that encoding either MD5 or SHA would be
relatively quick compared to the sequence I/O, no?
Will have to keep an eye on UCLUST, didn't know about that one.
As it happens, my current pipeline uses MCL but I'm testing UCLUST as a
replacement since it's waaay faster. I'll let you know how the comparison
turns out.
And for that matter, if anyone listening has experience with UCLUST or
CD-HIT or other clustering methods (ideally in the context of metagenomic
next-gen sequence), please chime in with your thoughts.
> SHA should work as well, didn't think of that (though I suppose the encoding
>> step for either would be rate-limiting?).
>>
>
> I haven't tested it, but I suspect that encoding either MD5 or SHA would be
> relatively quick compared to the sequence I/O, no?
Possibly. But one nice thing is clustering allows for partial matches (which I think is the original criterion). I don't believe SHA/MD5 would work for that purpose.
> Will have to keep an eye on UCLUST, didn't know about that one.
>
>
> As it happens, my current pipeline uses MCL but I'm testing UCLUST as a
> replacement since it's waaay faster. I'll let you know how the comparison
> turns out.
>
> And for that matter, if anyone listening has experience with UCLUST or
> CD-HIT or other clustering methods (ideally in the context of metagenomic
> next-gen sequence), please chime in with your thoughts.
As malcolm pointed out, blastclust is also available with legacy BLAST, though I'm not sure it's available with BLAST+ (didn't see anything obvious with BLAST+ for that purpose).
chris
Yep, for sure. Checksums will find full-length exact matches only.
Dave
--
Jason Stajich
ja...@bioperl.org
http://bioperl.org/wiki
_______________________________________________
> SHA should work as well, didn't think of that (though I suppose the
> encoding step for either would be rate-limiting?).
Disk I/O might be the bottleneck - on a 3+ year old desktop I get ~144
MB/s for sha1 and ~217 MB/s for md5 in a simple test:
$ dd if=/dev/zero bs=1M count=1024 | sha1sum -
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 7.44032 s, 144 MB/s
2a492f15396a6768bcbca016993f4b4c8b0b5307 -
$ dd if=/dev/zero bs=1M count=1024 | md5sum -
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 4.94205 s, 217 MB/s
cd573cfaace07e7949bc0c46028904ff -
On a reasonably new standard Dell desktop I get ~249 MB/s and ~410 MB/s
respectively.
Best regards,
Adam
--
Adam Sjøgren
ad...@novozymes.com