Converting GL object into HapMap and MEGA X format

Jean Rodrigue Sangaré

Feb 1, 2025, 8:08:27 AM
to dartR
Dear Jose
I am still encountering issues with my genlight object. The objective of my research is to assess the genetic diversity of rice landraces in my home country of Mali using several diversity estimators, including the number of segregating sites, the proportion of polymorphic sites, theta, and genetic similarity. To accomplish this, I need to convert the data into MEGA X format. I know that a HapMap file can be exported to PHYLIP interleaved format using TASSEL and then converted to MEGA X format using PGDSpider. However, I have been unable to convert the genlight object into HapMap format. Additionally, I have tried converting my genlight object into Structure format, but I have been unable to run the analysis in the Structure software (see attached images). I understand you have a lot on your plate, but I would greatly appreciate your assistance in finding a solution to these challenges.
Regards
error MEGAX.png
structure error.png
structure data format.png

Jose Luis Mijangos

Feb 10, 2025, 11:25:53 PM
to dartR
Hi Jean,

Convert to MEGA format

You can convert your data to FASTA format and then read it into MEGA; see the code below.

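# install the development version of dartR.base
devtools::install_github("green-striped-gecko/dartR.base@dev")
library(dartRverse)
t1 <- readRDS("glmg.Rdata")
# copy the sequences into the TrimmedSequence column of the locus metrics
t1$other$loc.metrics$TrimmedSequence <- t1$other$loc.metrics$TrimmedSequenceSnp
# keep only the first 20 individuals and the first 100 loci
t1 <- t1[1:20,1:100]
# clean individual names: dashes to underscores, drop single quotes, spaces to underscores
indNames(t1) <- gsub("-","_",indNames(t1))
indNames(t1) <- gsub("'","",indNames(t1))
indNames(t1) <- gsub(" ","_",indNames(t1))
# write the FASTA file to the working directory
gl2fasta(t1,
         method = 1,
         outpath = getwd())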
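
Before opening the file in MEGA, it may help to confirm that the FASTA file parses cleanly. The following is only a base-R sketch, assuming a plain FASTA layout (header lines starting with ">" followed by one or more sequence lines). The helper name check_fasta and the toy file are hypothetical and are not part of dartR.

```r
# Hypothetical helper (not part of dartR): summarise a plain FASTA file
check_fasta <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)              # record index for every line
  ids <- sub("^>", "", lines[is_header])
  # join the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  list(
    n_seq       = length(ids),
    unique_ids  = anyDuplicated(ids) == 0,
    seq_lengths = as.vector(nchar(seqs))
  )
}

# Toy file standing in for the real output (made-up names and sequences)
toy <- tempfile(fileext = ".fasta")
writeLines(c(">ind_1", "ACGT", "ACGT", ">ind_2", "ACGTACGA"), toy)
res <- check_fasta(toy)
res$n_seq                             # number of sequences found
res$unique_ids                        # TRUE if no name is repeated
length(unique(res$seq_lengths)) == 1  # TRUE if all sequences have equal length
```

To use it on the real data, point check_fasta at the file that gl2fasta wrote to the working directory. Repeated names or unequal sequence lengths are worth resolving before importing into MEGA.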

Run Structure

Based on the data you sent me, I found that the issue is that your individual names are not unique after Structure truncates individual names to 11 characters. This issue is described in the documentation of gl.run.structure. You can solve this issue by following the code below:

library(dartRverse)
t1 <- readRDS("glmg.Rdata")
t1$other$loc.metrics$TrimmedSequence <- t1$other$loc.metrics$TrimmedSequenceSnp
t1 <- t1[1:100,1:100]
indNames(t1) <- gsub("-","_",indNames(t1))
indNames(t1) <- gsub("'","",indNames(t1))
indNames(t1) <- gsub(" ","_",indNames(t1))

res <- gl.run.structure(t1, k.range = 2:4, num.k.rep = 1,
                        exec ='C:/Users/JEAN/Documents/final/Pop/console/structure.exe')
res2 <- gl.plot.structure(res, K=2:4)

library(dartRverse)
library(stringr)
tn <- readRDS("glmg.Rdata")
tn$other$loc.metrics$TrimmedSequence <- tn$other$loc.metrics$TrimmedSequenceSnp
# replace dashes with underscores in individual names
indNames(tn) <- gsub("-","_",indNames(tn))
# remove single quotes from individual names
indNames(tn) <- gsub("'","",indNames(tn))
# replace spaces with underscores in individual names
indNames(tn) <- gsub(" ","_",indNames(tn))
# truncate individual names to 8 characters and make them unique
indNames(tn) <- make.unique(str_sub(indNames(tn),1,8), sep = "_")
# check that individual names have at most 11 characters (should return TRUE)
max(nchar(indNames(tn))) <= 11
# check that individual names are unique (should return TRUE)
nInd(tn) == length(unique(indNames(tn)))

res <- gl.run.structure(tn, k.range = 2:4, num.k.rep = 2)
res2 <- gl.plot.structure(res, K=2:4)
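
The name collision can also be reproduced in base R without dartR. This is a minimal sketch, assuming (as described above) that only the first 11 characters of each name are kept; the sample names are made up for illustration.

```r
# Hypothetical individual names; the first two differ only after character 11
ids <- c("Mali_landrace_001", "Mali_landrace_002", "Kayes_01")

# What remains after truncation to 11 characters
truncated <- substr(ids, 1, 11)
anyDuplicated(truncated) > 0   # TRUE means at least two names collide

# Truncate to 8 characters, then let make.unique() append _1, _2, ... to repeats
fixed <- make.unique(substr(ids, 1, 8), sep = "_")

# Both checks should be TRUE before running Structure
all(nchar(fixed) <= 11)
anyDuplicated(fixed) == 0
```

Truncating to 8 characters leaves room for a suffix such as "_1" to "_99". If one prefix repeats more than 99 times the suffix grows to four characters and the name exceeds 11, which is why the length check is worth keeping.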

PS 

Although Structure remains the most widely used software for investigating genetic structure, it was developed over 20 years ago when microsatellites and only a few dozen SNPs were standard input data. To ensure reliable results, Structure requires:

- A long burn-in period (burnin),
- A large number of MCMC replicates (numreps), and
- Multiple independent runs (num.k.rep).

In the function gl.run.structure, the default values for these parameters are insufficient and should be increased 10 to 100 times (see for example: . For your dataset (564 genotypes, 26,169 SNPs), running Structure as recommended could take several weeks on a personal computer.

Instead, I suggest exploring alternative methods already implemented in dartR, which are more efficient for large-scale SNP datasets:

- gl.run.popcluster (see attached paper).
- gl.run.faststructure (Mac/Linux only)
- gl.run.snmf
- gl.pcoa / gl.pcoa.plot

If you still wish to use Structure, here is an article detailing a method to run it in parallel:


Cheers,
Luis 

Wang_2024.pdf