Outlier detection and treatment


Otávio Lovison

May 25, 2023, 10:01:41 AM
to Phylogenetic Placement
Hello community!

Any suggestions on how to detect and treat outlier samples in V3-V4 16S amplicon data using phylogenetic placement? This is my first time performing phylogenetic placement. I know it is very powerful for detecting outliers, but I am having difficulty establishing a workflow for the data I am analysing. The data are highly heterogeneous with elevated dispersion, which makes it a nightmare to detect outliers using a clustering strategy.


Any suggestions will be welcome!

Thanks in advance!

Lucas Czech

May 29, 2023, 7:59:20 PM
to Phylogenetic Placement
Hi Otavio,

Are you looking for whole samples that are outliers? That is interesting. An open question in phylogenetic placement is how to reliably detect individual outlier query sequences, such as chimeras, wrongly aligned sequences, or anything that does not have a close relative in the reference tree. See our review paper for some more insights on that problem.

But whole samples as outliers are an interesting new problem. I guess it boils down to the question: How would you define an outlier with respect to your particular dataset? What does it mean for a sample to be an outlier? Once that is settled, a way to detect them might already suggest itself.

For instance: An outlier in that sense would be a sample that behaves somewhat differently from your expectation or from the other samples? I'd have suggested clustering as a first attempt: Squash Clustering will show you distances between samples; furthermore, Edge PCA or Phylofactorization could give you an ordination plot (scatter plot of samples) that can tell you more about outliers. What "clustering strategy" were you referring to, and why did that not help?
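
As a rough illustration of the distance-based idea, here is a minimal Python sketch. It assumes you have already exported a samples-by-edges matrix of placement masses (the `edge_masses` layout, the function name, and the threshold are all hypothetical). It uses plain Manhattan distances between normalized mass profiles as a stand-in, not the phylogenetic distance that Squash Clustering uses, and flags samples whose typical distance to all others is unusually large by a robust z-score.

```python
import numpy as np

def flag_outlier_samples(edge_masses, threshold=3.5):
    """Flag samples that are far from most other samples.

    edge_masses: (n_samples, n_edges) array of placement mass per edge
    (hypothetical layout; assumed to be exported from the placement results).
    """
    masses = np.asarray(edge_masses, dtype=float)
    # Normalize each sample to unit total mass so sequencing depth does not matter.
    masses = masses / masses.sum(axis=1, keepdims=True)
    # Pairwise Manhattan distances between mass profiles (a simple stand-in).
    dists = np.abs(masses[:, None, :] - masses[None, :, :]).sum(axis=2)
    n = dists.shape[0]
    # For each sample, the median distance to all *other* samples.
    off_diag = dists[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    typical = np.median(off_diag, axis=1)
    # Robust z-score based on the median absolute deviation (MAD).
    med = np.median(typical)
    mad = np.median(np.abs(typical - med))
    if mad == 0:
        return np.zeros(n, dtype=bool), typical
    robust_z = 0.6745 * (typical - med) / mad
    return robust_z > threshold, typical

masses = [
    [50, 30, 20],
    [48, 32, 20],
    [52, 27, 21],
    [45, 33, 22],
    [55, 30, 15],
    [1, 1, 98],   # almost all mass on one edge
]
flags, typical = flag_outlier_samples(masses)
print(flags)
```

With highly dispersed data the threshold will need tuning, and whether a flagged sample is really an outlier still depends on how you define one for your dataset.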

Cheers
Lucas

Otávio Lovison

May 30, 2023, 10:41:40 AM
to Lucas Czech, Phylogenetic Placement
Hello! Thanks for the response!

Yes, there's no consensus on outlier sample detection; it is more than a philosophical debate in the academic community. I tried a lot of clustering strategies, but this project has a design problem and elevated dispersion, so I don't think I will be able to cluster the samples the way I would like to.

I generated individual heat trees for every sample, and that helped me a lot in visualizing outliers (I am still analysing, but if you have any suggestions on how to do that, I am open to hearing them). For example, by analysing individual heat trees I saw that a few (supposed) outlier samples had an elevated relative abundance on a single branch; some of these were probably caused by contaminants, others by infection. I realize that phylogenetic placement is powerful for detecting outliers; it is just a question of how to do it. I am thinking about machine learning, but I am pretty sure I will have difficulties managing the data.
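
The "elevated relative abundance on a single branch" pattern can be screened for without inspecting every heat tree by eye. The following is a minimal Python sketch under the assumption that the per-sample, per-branch placement masses are available as a samples-by-edges matrix; the function names and the 0.9 cutoff are hypothetical and would need adjusting to the data.

```python
import numpy as np

def dominant_edge_fraction(edge_masses):
    """Return, per sample, the index of the most loaded edge and its share of the mass.

    edge_masses: (n_samples, n_edges) array (hypothetical layout).
    """
    masses = np.asarray(edge_masses, dtype=float)
    fractions = masses / masses.sum(axis=1, keepdims=True)
    return fractions.argmax(axis=1), fractions.max(axis=1)

def flag_dominated_samples(edge_masses, cutoff=0.9):
    """List (sample index, edge index, fraction) for samples dominated by one edge."""
    top_edge, top_fraction = dominant_edge_fraction(edge_masses)
    return [
        (i, int(e), float(f))
        for i, (e, f) in enumerate(zip(top_edge, top_fraction))
        if f >= cutoff
    ]

masses = [
    [30, 40, 30],
    [2, 95, 3],    # one branch carries nearly everything
    [25, 25, 50],
]
print(flag_dominated_samples(masses))
```

A flagged sample is only a candidate: whether the dominant branch reflects a contaminant or an infection still has to be judged from the taxon on that branch and the sample metadata.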

About chimeras: how do you manage them? Do you filter them before placement, or is there a way to treat them after placement?
About taxonomic assignment: any suggestions on how I can work with the generated data? I haven't figured out what to do with the 'per query table', for example.

I am really enjoying playing around with GAPPA! 

MSc. Otávio von Ameln Lovison
CRF/RS 12363
Biochemical Pharmacist
Specialist in Clinical Cytology
Specialist in Clinical Microbiology
MSc in Pharmaceutical Sciences (CAPES 7), Universidade Federal do Rio Grande do Sul (PPGCF/UFRGS)
PhD candidate in Pharmaceutical Sciences (CAPES 7), Universidade Federal do Rio Grande do Sul (PPGCF/UFRGS)
Instituto Nacional de Pesquisa em Resistência Antimicrobiana - INPRA
Laboratório de Pesquisa em Resistência Bacteriana - LABRESIS
Laboratório de Microbiologia e Saúde Única - ICBS/UFRGS
Núcleo de Bioinformática (Bioinformatics Core) do Hospital de Clínicas de Porto Alegre 



Lucas Czech

Jun 6, 2023, 3:34:03 PM
to Otávio Lovison, Lucas Czech, Phylogenetic Placement

Hi Otávio,

for sample outliers like that, I think my PhD thesis might help: https://publikationen.bibliothek.kit.edu/1000105237

It contains a more thorough examination of the methods of gappa (more than the corresponding papers). See in particular Figure 4.5(a). There is a black arrow pointing at one particular edge - which comes from an outlier sample.

From the text:

Further examples of variants of Edge Dispersion on the BV dataset are shown in Figure 4.5. In Figure 4.5(a), which is linearly scaled, it is striking that one outlier edge, marked with an arrow, is dominating the values, and thereby hiding the values on less variable edges. This outlier occurs for the species Prevotella bivia in one of the 220 samples, where 2781 out of 2782 sequences in the sample have some placement mass on that branch. Upon close examination, this outlier can also be seen in Figure 1D of Srinivasan et al. (2012) [339], but is less apparent there. Thus, our novel visualization can help to detect such outlier samples.
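
To make the idea concrete, here is a small Python sketch in the spirit of that analysis. It is not gappa's Edge Dispersion implementation: it simply assumes a hypothetical samples-by-edges matrix of placement masses, takes the standard deviation of the normalized mass on each edge across samples, and then asks which sample deviates most on the most variable edge.

```python
import numpy as np

def most_variable_edge(edge_masses):
    """Find the edge with the largest spread across samples and the sample driving it.

    edge_masses: (n_samples, n_edges) array (hypothetical layout).
    Returns (edge index, sample index, per-edge standard deviations).
    """
    masses = np.asarray(edge_masses, dtype=float)
    fractions = masses / masses.sum(axis=1, keepdims=True)
    # Per-edge dispersion across samples (plain standard deviation here).
    spread = fractions.std(axis=0)
    edge = int(spread.argmax())
    # The sample furthest from the median mass on that edge.
    deviation = np.abs(fractions[:, edge] - np.median(fractions[:, edge]))
    sample = int(deviation.argmax())
    return edge, sample, spread

masses = [
    [40, 35, 25],
    [42, 33, 25],
    [38, 36, 26],
    [1, 2, 97],   # nearly all mass on the last edge
]
edge, sample, spread = most_variable_edge(masses)
print(edge, sample)
```

One dominant edge can hide everything else, as in the linearly scaled figure, so it may be worth removing a flagged sample and repeating the calculation.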

Hope that gives you some idea or direction ;-)

Other than that, this is up to you now - you might be able to work out a novel method there!

As for chimeras: No idea, that's unsolved as far as I am aware. If you have an idea, let me know!

For the taxonomic assignment: That depends on what you want to achieve there. Just assignment of your queries to the most likely taxonomic labels? Or more?
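
For the simple case of one label per query, something like the following Python sketch might be a starting point. It assumes the per-query table is tab-separated with columns named `name`, `aLWR` (accumulated likelihood weight ratio), and `taxopath` (ranks separated by `;`); please check these names against your actual file, as they are assumptions here, as are the function name and the 0.8 support cutoff. For each query it keeps the deepest taxonomic path whose accumulated support reaches the cutoff.

```python
import csv
import io

def best_labels(per_query_tsv, min_support=0.8):
    """Pick, per query, the deepest taxonomic path with enough accumulated support.

    per_query_tsv: text of a tab-separated table with (assumed) columns
    'name', 'aLWR', and 'taxopath'.
    """
    reader = csv.DictReader(io.StringIO(per_query_tsv), delimiter="\t")
    best = {}
    for row in reader:
        if float(row["aLWR"]) < min_support:
            continue  # not enough support at this rank
        depth = len(row["taxopath"].split(";"))
        current = best.get(row["name"])
        if current is None or depth > current[0]:
            best[row["name"]] = (depth, row["taxopath"])
    return {name: path for name, (depth, path) in best.items()}

# Made-up example rows, only to show the expected shape of the input.
table = "\n".join([
    "name\tLWR\tfract\taLWR\tafract\ttaxopath",
    "asv1\t0\t0\t1.0\t1.0\tBacteria",
    "asv1\t0.1\t0.1\t1.0\t1.0\tBacteria;Bacteroidota",
    "asv1\t0.9\t0.9\t0.9\t0.9\tBacteria;Bacteroidota;Prevotella",
    "asv2\t0.4\t0.4\t1.0\t1.0\tBacteria",
    "asv2\t0.6\t0.6\t0.6\t0.6\tBacteria;Firmicutes",
])
print(best_labels(table))
```

Queries that only get a shallow label this way are ones whose placement is spread over several clades, which is itself useful information.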

Cheers
Lucas

Otávio Lovison

Jun 6, 2023, 4:30:32 PM
to Lucas Czech, Phylogenetic Placement
Very interesting!

About chimeras: I will perform some tests.

About taxonomic assignment: I was thinking about its potential to annotate unannotated features from the ASV workflow. Beyond that, I need suggestions; I don't know what to do with that data.

Thanks!

