converting a VCF to a genlight and keeping the alleles.

206 views
Skip to first unread message

renee....@gmail.com

unread,
Jul 17, 2024, 1:47:23 AM7/17/24
to dartR
Hi All,

When I have a VCF file that I have read in and done some filtering on with vcfR (such as allele balance filtering), I then want to turn it into a genlight for use in dartR.

When I do this:

gl.1 <- vcfR2genlight(data.7)

The actual alleles for each locus are not sent over. I want these actual alleles kept for some downstream analyses.

So then I thought I would write out the vcf file and then read it back in:

write.vcf(data.7, file = "data7.vcf")
gl.1 <- gl.read.vcf("data7.vcf")

This last line of code gives me this error:
Error in .seppop_internal(x = x, pop = pop, treatOther = treatOther, keepNA = keepNA,  :
  pop not provided and pop(x) is empty

But VCF files don't have populations assigned and the instructions don't say to include a metadata file? I would assume we add the metadata after we read it in. Any ideas on what I'm doing wrong would be appreciated.

Or how to get a VCF to a genlight and keep the information on the REF/SNP alleles.

Thanks,

Renee

renee....@gmail.com

unread,
Jul 17, 2024, 1:52:04 AM7/17/24
to dartR
I now realise that the gl.read.vcf function uses vcfR, so that doesn't solve the allele problem even if I get it to work.

Jose Luis Mijangos

unread,
Jul 17, 2024, 2:00:29 AM7/17/24
to dartR
Hi Renee,

Give it a try using the developing version of dartR.base, as shown below using the attached testing file (split..1.vcf). gl.read.vcf has also the option of including metadata information using the parameter "ind.metafile"

Cheers,
Luis 

devtools::install_github("green-striped-gecko/dartR.base@dev")
library(dartRverse)
t1 <- gl.read.vcf("split.1.vcf")
#here are the alleles
t1$loc.all

split.1.vcf

Renee Catullo

unread,
Jul 19, 2024, 12:11:16 AM7/19/24
to da...@googlegroups.com
Thanks Luis. I ended up doing a bit of a hack that seems to have worked.

First I made a data frame of the info from the VCF file I needed, including making a SNP column:

alleles_info <- data.frame(
  CHROM = data.7@fix[, "CHROM"],
  POS = data.7@fix[, "POS"],
  REF = data.7@fix[, "REF"],
  ALT = data.7@fix[, "ALT"],
  row.names = NULL
)

#create the right column for adding back later
alleles_info$SNP <- paste0(alleles_info$POS, ":", alleles_info$REF, ">", alleles_info$ALT)

Then I converted to a genlight and added all the necessary info, including the SNP data

gl.1 <- vcfR2genlight(data.7)

#Tell dartr that it is a binary SNP data
ploidy(gl.1) <- 2 

# Add pop info and other individual metrics ### make sure individuals are in same order in both genlight and strata file before reading in. Get individual order from genlight using indNames(gl). Write to file and check against stratafile
# Export sample names so can check order of strata file
#names <- as.data.frame(indNames(gl))
#write.csv(names, paste0(filename, "_samplenames.csv"))
gl.1@other$ind.metrics <- meta[,c(1:6)]

# Assign population
gl.1 <- gl.reassign.pop(gl.1, as.pop = "pop")
# check
table(pop(gl.1))

# Generate empty columns in loc.metrics
gl.1 <- gl.compliance.check(gl.1)

# Calculate loc.metrics
gl.1 <- gl.recalc.metrics(gl.1)

#Add back the allele info
gl.1@other$loc.metrics$SNP <- alleles_info$SNP

#assign lat/long
gl.1@other$latlong <- gl.1@other$ind.metrics[,c(4:5)]


Then did a big pile of filtering etc suitable to my phylogeny and then stole code from gl2fasta to be able to output the file:

###do this things so a fasta can be output
#add the alleles
alleles <- paste0(gl.5$other$loc.metrics$REF, "/", gl.5$other$loc.metrics$ALT)
gl...@loc.all <- alleles

#add the SNP postiion
position <- as.integer(gl.5$other$loc.metrics$POS)
gl.5@position <- position

gl.5$other$loc.metrics$position <- gl.5$other$loc.metrics$POS

###add fake trimmed sequence
gl.5@other$loc.metrics$TrimmedSequence <- "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"

####trash version of gl2fasta code that works with VCF file origin data, with output suitable to IQtree.
x <- gl.5
outfile = "gl5.fasta"

#just run all of this
method = 3
allnames <- locNames(x)
snp <- as.character(x...@loc.all)
trimmed <- as.character(x@other$loc.metrics$TrimmedSequence)
snpmatrix <- as.matrix(x)
conversion <- matrix(c("A", "W", "R", "M", "W", "T", 
                       "K", "Y", "R", "K", "G", "S", "M", "Y", "S", "C"), 
                     nrow = 4, ncol = 4)
colnames(conversion) <- c("A", "T", "G", "C")
rownames(conversion) <- colnames(conversion)
allelepos <- x@position
allele1 <- gsub("(.)/(.)", "\\1", snp, perl = T)
allele2 <- gsub("(.)/(.)", "\\2", snp, perl = T)

# Open the sink to start writing to the file to your working directory
sink(outfile)
#run the for loop
for (i in 1:nInd(x)) {
  seq <- NA
  for (j in 1:nLoc(x)) {
    if (is.na(snpmatrix[i, j])) {
      code <- "N"
    }
    else {
      if (snpmatrix[i, j] == 0) {
        a1 <- allele1[j]
        a2 <- allele1[j]
      }
      if (snpmatrix[i, j] == 1) {
        a1 <- allele1[j]
        a2 <- allele2[j]
      }
      if (snpmatrix[i, j] == 2) {
        a1 <- allele2[j]
        a2 <- allele2[j]
      }
      code <- conversion[a1, a2]
    }
    snppos <- allelepos[j]
    if (method == 1) {
      if (code != "N") {
        seq[j] <- paste0(substr(as.character(x@other$loc.metrics$TrimmedSequence[j]), 
                                1, snppos), code, substr(x@other$loc.metrics$TrimmedSequence[j], 
                                                         snppos + 2, 500))
      }
      else {
        seq[j] <- paste(rep("N", nchar(as.character(x@other$loc.metrics$TrimmedSequence[j]))), 
                        collapse = "")
      }
    }
    else if (method == 3) {
      seq[j] <- code
    }
  }
  result <- paste(seq, sep = "", collapse = "")
  cat(paste0(">", indNames(x)[i], "_", pop(x)[i], "\n"))
  cat(result, " \n")
}
# Close the sink to stop writing to the file
sink()


There is a whole lot more to this if someone wants it they can email me.

-- 
You received this message because you are subscribed to the Google Groups "dartR" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dartr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dartr/7712398a-e438-42d7-b5ec-cadbd8578881n%40googlegroups.com.
<split.1.vcf>

Reply all
Reply to author
Forward
0 new messages