Sorting samples into groups of similar region of origin?

cdittyinc

Apr 20, 2021, 3:22:26 PM

to IsoriX

Hi all!

For context, my project aims to establish where army cutworm moths originate, using stable hydrogen isotopes. Army cutworm moths migrate east-west (interestingly enough) from the Great Plains in the US to the Rocky Mountains and back.

Based on my first run-through, which used precipitation as the base isoscape and known-origin samples calibrated against that isoscape, it looks like the moths are coming from both sides of the Rocky Mountains, i.e. from all over. Is there a way to automatically sort moths into cohorts according to the rough region they came from, instead of assessing the whole group under the assumption that they all originated from the same area? I have hundreds of samples from the high-elevation areas that need to be assigned to an origin, and it would take a while to tease out manually which ones are from similar regions. It would be nice to be able to state which areas most migrants probably come from, which areas some migrants probably come from, and which areas only a select few probably come from.

Thanks! Please let me know your thoughts.

Alexandre Courtiol

Apr 21, 2021, 5:07:55 AM

to iso...@googlegroups.com

Hi Clare,

OK, so as I understand your message, you want to assign moths by sub-groups.

This should be doable in IsoriX if you are willing to write some small custom code around it.

I would start by running a clustering algorithm on the samples you want to assign.

There are plenty of options out there for how to do that, but if your input is univariate (i.e. a single isotope), simple clustering methods should do.

I am not an expert on clustering, but I think that kmeans(), for example, should be OK.

Here is a detailed tutorial on kmeans: https://statsandr.com/blog/clustering-analysis-k-means-and-hierarchical-clustering-by-hand-and-in-r/

There is also this package: Ckmeans.1d.dp, which has nice how-to vignettes and implements things a little differently.

If I were you, I would run several clustering methods, compare them, and probably go for one of the most conservative ones (i.e. one that does not identify too many clusters).
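As a minimal sketch of such a comparison, here is one way to check whether two methods agree. It uses simulated values and base R only; the object names, the simulated means and the choice of Ward linkage are assumptions made for illustration, not something IsoriX prescribes.

```r
## Simulated univariate isotopic values from two hypothetical origins
set.seed(1)
vals <- c(rnorm(30, mean = -120, sd = 5), rnorm(30, mean = -80, sd = 5))

## Method 1: k-means with several random starts
km <- kmeans(vals, centers = 2, nstart = 25)

## Method 2: hierarchical clustering (Ward linkage), tree cut into 2 groups
hc <- cutree(hclust(dist(vals), method = "ward.D2"), k = 2)

## Cross-tabulate the two partitions; cluster labels are arbitrary,
## so agreement shows up as one large count per row
agreement <- table(kmeans = km$cluster, hclust = hc)
print(agreement)

## Rough proportion of samples on which the two methods agree
## (largest count in each row, summed, divided by the sample size)
prop_agree <- sum(apply(agreement, 1, max)) / length(vals)
print(prop_agree)
```

If the two partitions disagree on many samples, that is a hint that the grouping is not well supported by the data.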

Once this is done, you just need to write a little loop to run group assignments separately for each cluster.

Here is a simple example of all the steps required, based on the example from the IsoriX documentation for ?isofind:

library(IsoriX)

GNIPDataDEagg <- prepsources(data = GNIPDataDE)

GermanFit <- isofit(data = GNIPDataDEagg,
                    mean_model_fix = list(elev = TRUE, lat_abs = TRUE))

## We build the isoscape
GermanScape <- isoscape(raster = ElevRasterDE,
                        isofit = GermanFit)

## We fit the calibration model
CalibAlien <- calibfit(data = CalibDataAlien,
                       isofit = GermanFit)

## We create a made-up dataset with 2 clusters (group 2 = group 1 shifted by +3):
AssignDataAlien_doubled <- rbind(AssignDataAlien, AssignDataAlien)
AssignDataAlien_doubled$sample_ID <- as.character(AssignDataAlien_doubled$sample_ID)
AssignDataAlien_doubled$sample_ID[1:10] <- paste0(AssignDataAlien_doubled$sample_ID[1:10], "_group1")
AssignDataAlien_doubled$sample_ID[11:20] <- paste0(AssignDataAlien_doubled$sample_ID[11:20], "_group2")
AssignDataAlien_doubled$sample_value[11:20] <- AssignDataAlien_doubled$sample_value[11:20] + 3
AssignDataAlien_doubled$grp_true <- gl(2, 10) ## gl() already returns a factor

## We perform the clustering
library(factoextra)
set.seed(1) ## kmeans() uses random starts; any fixed seed makes a run reproducible (results may differ slightly from those quoted below)
fviz_nbclust(AssignDataAlien_doubled[, "sample_value", drop = FALSE], kmeans, method = "silhouette")
## in my run it selects 8, but 2 looks close to as good, so I pick 2
clust <- kmeans(AssignDataAlien_doubled[, "sample_value", drop = FALSE], centers = 2) ## 2 is what we chose
AssignDataAlien_doubled$grp_inferred <- factor(clust$cluster)

table(AssignDataAlien_doubled$grp_inferred, AssignDataAlien_doubled$grp_true)
## results not terrific, but not too bad: mind that the numbering of the clusters is arbitrary,
## so in my run we have 14 good assignments and 6 wrong ones

## We perform the assignment
results <- list()

for (grp in unique(AssignDataAlien_doubled$grp_inferred)) {
  results[[grp]] <- isofind(data = subset(AssignDataAlien_doubled, grp_inferred == grp),
                            isoscape = GermanScape,
                            calibfit = CalibAlien)
}

## We plot assignment for group 1
## (indexing by name: results[[1]] would be whichever group the loop met first)
plot(results[["1"]])

## We create all pdfs of all assignment plots
dir.create("temp", showWarnings = FALSE) ## pdf() fails if the folder does not exist
for (grp in unique(AssignDataAlien_doubled$grp_inferred)) {
  pdf(file = paste0("temp/group_", grp, ".pdf"))
  plot(results[[grp]])
  dev.off()
}

Many improvements could be made to this toy example.

One specific to the task at hand is that you could include any other variables about the moths to cluster them.

In fact, it would probably be best if you did not have to base the clustering on the isotopic values.

You could use any other info you collected on the moths, including things such as when you collected them.

The reason is that clustering based on the isotopic values will probably truncate the natural distributions by assigning extreme individuals to other clusters.

Yet, if you only have the isotopic values, you don't really have an option.
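To illustrate the truncation issue mentioned above, here is a small simulation sketch. The two origins, their means and their standard deviation are made-up values chosen only so that the distributions overlap.

```r
## Two simulated origins with overlapping isotopic distributions
set.seed(42)
true_grp <- rep(1:2, each = 500)
vals <- rnorm(1000, mean = c(-100, -90)[true_grp], sd = 8)

## k-means on a single variable amounts to splitting the values at a threshold
km <- kmeans(vals, centers = 2, nstart = 25)

## Spread within the true groups vs. within the inferred clusters
sd_true <- tapply(vals, true_grp, sd)
sd_inferred <- tapply(vals, km$cluster, sd)
print(sd_true)
print(sd_inferred)

## Range of each inferred cluster: the inferred clusters should not overlap,
## although the true distributions do
print(tapply(vals, km$cluster, range))
```

The inferred clusters are narrower than the true groups because individuals in the tails of one origin get attributed to the other cluster.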

In any case, try to compare the entire workflow based on different clustering methods or choices for the number of clusters.

Ideally, your results should be reasonably robust to small changes.
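One possible way to look at this robustness is sketched below, again on simulated values. It assumes the cluster package (a recommended package distributed with R) for silhouette(); the candidate numbers of clusters and the object names are arbitrary choices for illustration.

```r
library(cluster) ## provides silhouette()

## Simulated isotopic values from two hypothetical origins
set.seed(7)
vals <- c(rnorm(40, mean = -130, sd = 4), rnorm(40, mean = -95, sd = 4))
d <- dist(vals)

## Average silhouette width for several candidate numbers of clusters
ks <- 2:5
avg_sil <- sapply(ks, function(k) {
  cl <- kmeans(vals, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- ks
print(round(avg_sil, 2))

## How samples move between clusters when going from k = 2 to k = 3
cl2 <- kmeans(vals, centers = 2, nstart = 25)$cluster
cl3 <- kmeans(vals, centers = 3, nstart = 25)$cluster
print(table(k2 = cl2, k3 = cl3))
```

The same loop could wrap the isofind() call, so that the assignment maps obtained under each choice can be compared side by side.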

I could integrate this directly into IsoriX, but I am not super keen since I think clustering should be done carefully.

If I integrate that, it may become a black box that people use without always paying enough attention to the outcome of the clustering.

If anyone has know-how on this topic, feel free to contribute!

Also, Clare, if you could post a picture of your moths, the landscape around you, or anything else related to the project, it would let us travel a bit virtually :-)

It would also help the people on this list get an accurate idea of the diversity of questions, species and people working with IsoriX.

Many thanks,

Alex
