How does PHLAWD work?

Rutger Vos

Sep 28, 2015, 6:20:49 AM

to phlawd

Hi all,

I'm trying to wrap my head around how PHLAWD works. Is there a recent paper that explains this? My current guess is that it roughly does this:

1. user provides a reference set of curated, "good" sequences
2. user provides one or more gene names as they occur in GenBank records
3. PHLAWD looks for records that match the gene name annotations
4. PHLAWD then checks if the candidate records actually BLAST well against the reference set

Is that it? Or are candidate sequences located some other way (i.e. by using the reference set as BLAST query sequences)?

Thanks!

Rutger

Stephen Smith

Sep 28, 2015, 9:36:25 AM

to phl...@googlegroups.com

Hey Rutger

There are a bunch of papers that use it, but Smith et al. 2009 in BMC is where things are described.

That is all correct. Steps 2 and 3 are simply there to save time (which is reasonable considering the kinds of searches that are done). In addition to step 4, PHLAWD will break up bad alignments into subsets, align the subsets, and then profile-align the subsets back together.
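As a rough illustration of steps 2 to 4 (not PHLAWD's actual code, which is written in C++), the two-stage filter could be sketched as below. Everything here is an assumption made for the example: the record layout, the function names, the 0.8 threshold, and especially the use of Python's `difflib.SequenceMatcher` as a stand-in for a real BLAST comparison.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for a BLAST score: a crude 0..1 string similarity (assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def select_candidates(records, gene_names, references, min_similarity=0.8):
    """Keep records whose gene annotation matches and that resemble a reference."""
    wanted = {name.lower() for name in gene_names}
    kept = []
    for rec in records:
        # Steps 2-3: cheap pre-filter on the gene name annotation.
        if rec["gene"].lower() not in wanted:
            continue
        # Step 4: compare the candidate against the curated reference set.
        best = max(similarity(rec["seq"], ref) for ref in references)
        if best >= min_similarity:
            kept.append(rec["id"])
    return kept


# Hypothetical toy data, not real GenBank records.
records = [
    {"id": "A1", "gene": "rbcL", "seq": "ATGTCACCACAAACAGAGACT"},
    {"id": "B2", "gene": "rbcL", "seq": "GGGGGGGGGGGGGGGGGGGGG"},
    {"id": "C3", "gene": "matK", "seq": "ATGTCACCACAAACAGAGACT"},
]
references = ["ATGTCACCACAAACAGAGACTAAAGC"]

selected = select_candidates(records, ["rbcL"], references)
```

The name filter runs first because it is cheap; only the records that survive it pay for the more expensive sequence comparison, which is the time saving described above.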

Also, I should say that we are currently looking at ways to improve the homology searching. It does a good job now but it can be improved with some intervention (this is of course the case with virtually all homology searching). So if you are interested in that stay tuned.

Take care

Rutger Vos

Sep 28, 2015, 10:34:34 AM

to phlawd

Hi Stephen,

Thanks for the snappy reply! My experiences with searching on gene names lead me to believe that this approach really only works with markers that have very stable names, for example as enforced by initiatives such as BOLD. So rbcL, matK and a couple of other ones work very well, but outside of that limited set it's basically the wild west w.r.t. what people put in as gene names. Is that something that is a hindrance for PHLAWD? I mean, do you end up missing sequences that are annotated as CO1 instead of COI (for example)? Is that why you are working on the homology searching?

Thanks,

Rutger

Stephen Smith

Sep 28, 2015, 10:39:29 AM

to phl...@googlegroups.com

Well, you can list multiple variants of the names, and in one branch (that Cody Hinchliff was working on) you can use wildcards, which gets around COI, coi, CO1 and all those. You can list them or put CO*, etc.
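To make the wildcard idea concrete, here is a minimal sketch of case-insensitive name matching using Python's standard `fnmatch` module. It is only an illustration of the matching behaviour, not PHLAWD's configuration syntax or implementation; the function name `matches_gene` and the sample annotations are made up for the example.

```python
from fnmatch import fnmatchcase


def matches_gene(annotation, patterns):
    """True if the annotation matches any pattern, ignoring case."""
    name = annotation.strip().lower()
    return any(fnmatchcase(name, pattern.lower()) for pattern in patterns)


# Hypothetical gene annotations as they might appear in sequence records.
annotations = ["COI", "coi", "CO1", "COII", "rbcL"]

# An explicit list of variants only catches what was listed.
listed = [a for a in annotations if matches_gene(a, ["COI", "CO1"])]

# A wildcard catches unlisted variants, but also other genes such as COII.
wildcard = [a for a in annotations if matches_gene(a, ["CO*"])]
```

Note that `CO*` also picks up `COII`, so a broad wildcard relies on the later check against the reference set to throw out records that are not actually the target gene.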

For the homology searching, that is one reason, but a simpler one is just to make sure that we aren't missing sequences. Sometimes things get kicked out when they shouldn't. Granted, these are mostly in really large searches where it is much harder anyway, but still, we are striving for the best homologs we can get.

Also, I am merging my homologs for phylogenomics and phylogenetics (PHLAWD being the one for phylogenetics).

Take care
