Clustering proteins in multiple sequence alignments?

Edward Wallace

May 24, 2021, 10:48:45 AM

to ashworth-c...@googlegroups.com

Hi Code Monkeys,
Any recommendations on best practice for clustering analysis of protein sequences? My MSc student Yuxuan says "I read something about doing the clustering and found there are many ways to do that (e.g. blastclust, CLUSS, AP)." Are there any particular recommendations for methods to use or avoid?

I'm not convinced that we need to do clustering rather than proper phylogenetic analysis, but it would be nice to get any advice!

For context, our proteins of interest have multiple conserved structured domains so we expect multiple sequence alignments, and clustering and phylogeny, to be quite informative. My major goal is to understand the evolution of the protein and especially of its RNA-binding sites, looking for clues to function.

Thanks!

Edward

--

Edward Wallace

Tel: +44-777-914-7542

OBBARD Darren

May 24, 2021, 11:26:59 AM

to ashworth-c...@googlegroups.com

Hi!

> I'm not convinced that we need to do clustering rather than proper
> phylogenetic analysis, but it would be nice to get any advice!

As always, I'd start out by asking "What's the question?"

(1) If you want to ask 'How similar are these proteins?' then clustering on a distance metric describes similarity.
(2) If you want to ask 'How are these proteins related?' then formal alignment and model-based phylogenetic inference are the way forward.

Of course, (2) is quite likely to answer (1) as well, unless there's a lot of convergence. And you might use (1) on the way to (2), as a rough approximation to define the limits of the analysis (e.g. to guess at orthology).

I would only cluster based on distance if you explicitly expect similarity and phylogeny to be decoupled (e.g. lots of convergence, or many gained or lost domains), or if phylogenetic analysis is intractable, either because of computational requirements or because evolutionary models can't be usefully applied (i.e. no meaningful alignment due to domain gains/losses or re-ordering).
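
To make "clustering on a distance metric" concrete, here is a minimal sketch in Python. It assumes the sequences are already aligned, uses the simplest possible distance (the fraction of mismatched residues over columns where neither sequence has a gap), and groups sequences by single-linkage at a fixed threshold. The function names, the toy sequences, and the 0.3 threshold are all made up for illustration; real tools use more careful distances and the threshold would need to be chosen for the protein family in question.

```python
from itertools import combinations
from typing import Dict, List, Set


def p_distance(a: str, b: str) -> float:
    """Fraction of mismatches over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # no shared columns: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def single_linkage_clusters(aligned: Dict[str, str], threshold: float) -> List[Set[str]]:
    """Group sequences linked by any chain of pairwise distances <= threshold."""
    parent = {name: name for name in aligned}

    def find(name: str) -> str:
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for n1, n2 in combinations(aligned, 2):
        if p_distance(aligned[n1], aligned[n2]) <= threshold:
            parent[find(n1)] = find(n2)  # merge the two clusters

    clusters: Dict[str, Set[str]] = {}
    for name in aligned:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Toy, made-up aligned sequences (hypothetical, not real proteins).
toy_alignment = {
    "protA": "MKTAYIAKQR",
    "protB": "MKTAYIAKQK",
    "protC": "MKSAY-AKQR",
    "protD": "GLWPPEDVNC",
}
clusters = single_linkage_clusters(toy_alignment, threshold=0.3)
print(clusters)
```

Note that this only describes similarity: two sequences that end up in the same cluster need not be each other's closest relatives, which is exactly the distinction between (1) and (2).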

For (2), my current favourite would be T-Coffee (t_coffee -mode accurate), followed by IQ-TREE 2.
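
As a rough sketch of how that two-step workflow might be scripted, the Python below builds the two command lines and runs them in order. The file names and the helper functions are hypothetical. The T-Coffee output options (-output fasta_aln, -outfile) and the IQ-TREE 2 options (-m MFP for model selection, -B 1000 for ultrafast bootstrap, --prefix) are my own assumptions about sensible settings, not part of the recommendation above, so check them against `t_coffee -help` and `iqtree2 --help` for your installed versions.

```python
import shutil
import subprocess
from typing import List


def tcoffee_command(fasta: str, aln_out: str) -> List[str]:
    """Command line for a T-Coffee alignment in 'accurate' mode (FASTA output assumed)."""
    return ["t_coffee", fasta, "-mode", "accurate",
            "-output", "fasta_aln", "-outfile", aln_out]


def iqtree_command(aln: str, prefix: str) -> List[str]:
    """Command line for IQ-TREE 2 with model selection and ultrafast bootstrap (assumed settings)."""
    return ["iqtree2", "-s", aln, "-m", "MFP", "-B", "1000", "--prefix", prefix]


def run_pipeline(fasta: str, aln_out: str, prefix: str) -> None:
    """Align, then infer a tree; stops with an error if either tool is missing or fails."""
    for cmd in (tcoffee_command(fasta, aln_out), iqtree_command(aln_out, prefix)):
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        subprocess.run(cmd, check=True)


# Hypothetical file names, for illustration only.
print(" ".join(tcoffee_command("proteins.fasta", "proteins.aln.fasta")))
print(" ".join(iqtree_command("proteins.aln.fasta", "proteins_tree")))
```

Calling run_pipeline("proteins.fasta", "proteins.aln.fasta", "proteins_tree") would then run both steps, provided both programs are installed.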

Regards!

D

--

Darren Obbard
darren...@ed.ac.uk

Institute of Evolutionary Biology
University of Edinburgh
Ashworth Laboratories, Charlotte Auerbach Road
Edinburgh EH9 3FL

Office 0131 651 7781
Mobile: 07968 838 635

http://obbard.bio.ed.ac.uk/


Edward Wallace

May 24, 2021, 2:07:52 PM

to ashworth-c...@googlegroups.com

Hi Darren,

Thanks, that is pretty much what I suspected, only clearer! I'll pass the message on to the student.

In the last phylogenetic analysis that I did, successfully coached by Daniel Barker and Gemma Atkinson, MAFFT and IQ-TREE 2 worked extremely well. But that was a nice, large, well-behaved protein.

Best wishes,

Edward
