Hi Jean,
Convert to MEGA format
You can convert your data to FASTA format and then open it in MEGA; see the code below.
devtools::install_github("green-striped-gecko/dartR.base@dev")
library(dartRverse)
t1 <- readRDS("glmg.Rdata")
t1$other$loc.metrics$TrimmedSequence <- t1$other$loc.metrics$TrimmedSequenceSnp
t1 <- t1[1:20,1:100]
indNames(t1) <- gsub("-","_",indNames(t1))
indNames(t1) <- gsub("'","",indNames(t1))
indNames(t1) <- gsub(" ","_",indNames(t1))
gl2fasta(t1,
method = 1,
outpath = getwd())
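The three gsub() calls above can be wrapped in a small helper so that the same clean-up is applied before every export or Structure run. This is only a sketch in base R: the function name clean_ind_names and the example names are hypothetical and not part of dartR.

```r
# Hypothetical helper (not part of dartR): applies the same three
# substitutions used above to a character vector of individual names.
clean_ind_names <- function(x) {
  x <- gsub("-", "_", x, fixed = TRUE)  # dashes -> underscores
  x <- gsub("'", "", x, fixed = TRUE)   # drop apostrophes
  x <- gsub(" ", "_", x, fixed = TRUE)  # spaces -> underscores
  x
}

# Made-up names, for illustration only
example_names <- c("pop-A 01", "O'Brien-2", "site 3-b")
cleaned <- clean_ind_names(example_names)
print(cleaned)
```

With a genlight object loaded, the equivalent call would be indNames(t1) <- clean_ind_names(indNames(t1)).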
Run Structure
Based on the data you sent me, the issue is that your individual names are no longer unique after Structure truncates them to 11 characters. This behaviour is described in the documentation of gl.run.structure. You can solve it by following the code below:
library(dartRverse)
t1 <- readRDS("glmg.Rdata")
t1$other$loc.metrics$TrimmedSequence <- t1$other$loc.metrics$TrimmedSequenceSnp
t1 <- t1[1:100,1:100]
indNames(t1) <- gsub("-","_",indNames(t1))
indNames(t1) <- gsub("'","",indNames(t1))
indNames(t1) <- gsub(" ","_",indNames(t1))
res <- gl.run.structure(t1, k.range = 2:4, num.k.rep = 1,
exec ='C:/Users/JEAN/Documents/final/Pop/console/structure.exe')
res2 <- gl.plot.structure(res, K=2:4)
library(dartRverse)
library(stringr)
tn <- readRDS("glmg.Rdata")
tn$other$loc.metrics$TrimmedSequence <- tn$other$loc.metrics$TrimmedSequenceSnp
# replace dashes with underscores in individual names
indNames(tn) <- gsub("-","_",indNames(tn))
# remove apostrophes from individual names
indNames(tn) <- gsub("'","",indNames(tn))
# replace spaces with underscores in individual names
indNames(tn) <- gsub(" ","_",indNames(tn))
# truncate individual names to 8 characters and make them unique
indNames(tn) <- make.unique(str_sub(indNames(tn),1,8), sep = "_")
# test that individual names have at most 11 characters
max(nchar(indNames(tn))) <= 11
# test that individual names are unique
nInd(tn) == length(unique(indNames(tn)))
res <- gl.run.structure(tn, k.range = 2:4, num.k.rep = 2)
res2 <- gl.plot.structure(res, K=2:4)
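To see why the truncation matters, here is a toy example in base R with made-up names (not your data). It assumes, as described above, that Structure keeps only the first 11 characters of each name; substr() stands in for stringr::str_sub().

```r
# Made-up individual names that differ only after the 11th character
ind_names <- c("Population1_ind01", "Population1_ind02", "Population2_ind01")

# Mimic truncation to 11 characters
truncated <- substr(ind_names, 1, 11)
print(truncated)
print(anyDuplicated(truncated) > 0)  # do any names collide after truncation?

# Fix used above: shorten to 8 characters, then let make.unique() add a suffix
fixed <- make.unique(substr(ind_names, 1, 8), sep = "_")
print(fixed)

# Same two checks as in the code above
stopifnot(max(nchar(fixed)) <= 11, !anyDuplicated(fixed))
```

Note that the suffix added by make.unique() grows with the number of duplicates: if more than 100 individuals share the same 8-character prefix, some names will exceed 11 characters, which is exactly what the length check is there to catch.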
PS
Although Structure remains the most widely used software for investigating genetic structure, it was developed over 20 years ago, when microsatellites or a few dozen SNPs were the standard input data. To give reliable results, Structure requires:
- A long burn-in period (burnin),
- A large number of MCMC iterations after burn-in (numreps), and
- Multiple independent runs for each value of K (num.k.rep).
In the function gl.run.structure, the default values for these parameters are insufficient and should be increased 10 to 100 times (see for example: ). For your dataset (564 genotypes, 26,169 SNPs), running Structure as recommended could take several weeks on a personal computer.
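As a rough sketch of what "increased 10 to 100 times" could look like in practice, the snippet below builds the argument list for a longer run. The baseline values of 1000 and the choice of 10 independent runs are assumptions for illustration only; please check ?gl.run.structure for the actual defaults before relying on them.

```r
# Assumed baseline values; check ?gl.run.structure for the actual defaults
base_burnin  <- 1000
base_numreps <- 1000
scale_factor <- 10   # the text above suggests 10 to 100

structure_args <- list(
  k.range   = 2:4,
  num.k.rep = 10,                           # illustrative number of runs per K
  burnin    = base_burnin * scale_factor,   # longer burn-in
  numreps   = base_numreps * scale_factor   # more MCMC iterations after burn-in
)
str(structure_args)

# With dartRverse loaded and 'tn' prepared as above, the call would look like:
# res <- do.call(gl.run.structure, c(list(tn), structure_args))
```

Keeping the settings in a list makes it easy to try a short test run first (scale_factor <- 1) and then scale up once everything works.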
Instead, I suggest exploring alternative methods already implemented in dartR, which are more efficient for large-scale SNP datasets:
- gl.run.popcluster (see attached paper).
- gl.run.faststructure (Mac/Linux only)
- gl.run.snmf
- gl.pcoa / gl.pcoa.plot
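Of these, the PCoA route is probably the quickest first look. A minimal sketch is given below; the wrapper name pcoa_overview is hypothetical, and it assumes library(dartRverse) has been loaded and that the default settings of gl.pcoa and gl.pcoa.plot are acceptable for a first pass.

```r
# Hypothetical wrapper (not part of dartR); assumes library(dartRverse)
# has been loaded so that gl.pcoa() and gl.pcoa.plot() are available.
pcoa_overview <- function(gl) {
  pc <- gl.pcoa(gl)      # ordination of the genlight object
  gl.pcoa.plot(pc, gl)   # plot individuals on the first two axes
  invisible(pc)          # return the ordination for further use
}

# Usage, with 'tn' prepared as above:
# library(dartRverse)
# pc <- pcoa_overview(tn)
```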
If you still wish to use Structure, here is an article detailing a method to run it in parallel:
Cheers,
Luis