Which analysis could be performed Poppr if I have Genomewide SNP data?

892 views
Skip to first unread message

jigar trivedi

unread,
Nov 2, 2015, 5:15:39 PM11/2/15
to poppr

This would be a naive question to ask in this group, but it is important to me. I am doing a genomewide SNP analysis on a clonal organism and I am using GATK for the same. My output file contains SNPs is in the form of VCF file and I would like to perform various analysis mentioned in the paper. In that case how should I proceed or start with?

Zhian Kamvar

unread,
Nov 2, 2015, 5:51:59 PM11/2/15
to poppr
Hi,

This would be a naive question to ask in this group, but it is important to me.
 
Naïve questions are quite welcome in this group, as it provides a resource for others who will likely have the same question :)


I am doing a genomewide SNP analysis on a clonal organism and I am using GATK for the same. My output file contains SNPs is in the form of VCF file and I would like to perform various analysis mentioned in the paper. In that case how should I proceed or start with?


This is a good question. One important factor we left out of the Frontiers paper was how exactly to import genome-wide data from VCF format. There are two packages that you can use to extract genotypes from a VCF file, pegas (https://github.com/emmanuelparadis/pegas) and vcfR (https://github.com/knausb/vcfR). 
Both of these packages have valuable tools for analysis and visualization of genomic data. From either of these packages, you can convert to a matrix, which you can then convert to a genlight object (see the adegenet "genomics" tutorial: https://github.com/thibautjombart/adegenet/blob/master/tutorials/tutorial-genomics.pdf). 

Before you start, I would recommend at this time to install the github versions of the packages as there are many bug fixes and improvements that aren't on CRAN as of yet. You can do this using devtools (you must also have a C compiler working on your computer, see here for links to options https://github.com/grunwaldlab/poppr/#stable-and-development-versions):

if (!require("devtools")) install.packages("devtools")
devtools
::install_github("emmanuelparadis/pegas/pegas")
devtools
::install_github("knausb/vcfR@devel")
devtools
::install_github("thibautjombart/adegenet")
devtools
::install_github("grunwaldlab/poppr@devel")


Once you have your data in a genlight object, convert it to a snpclone object with the function as.snpclone() (yes, I know there are quite a few steps to get here). This will make sure that you can have mutlilocus genotype definitions travel with your data. From here, you can:


  • Calculate raw genetic distance with bitwise.dist()
  • Define multilocus lineages with mlg.filter()
  • Calculate sliding windows of the standardized index of association with win.ia() 
  • Randomly sample loci for the standardized index of association with samp.ia()
  • Construct minimum spanning networks with poppr.msn()
  • Create boostrapped dendrograms with aboot()


Hope that helps.


Best,

Zhian


Brian

unread,
Nov 2, 2015, 6:21:19 PM11/2/15
to Zhian Kamvar, poppr
Hello,

The vcfR package is still under development (by me), but I think It might have what you need. There is a read.vcf function to read in vcf files and a vcfR2genind function to make a genind object.

There are a few limitations to working on genomic data in R. The big thing is that memory may be a limiting factor. Because of this I advocate working on a single chromosome (supercontig or contig depending on your genome). And some functions in other packages may not yet scale well to genome size problems. For example,  my vcfR2genind calls adegenet's df2genind, the later of which appears to have problems with large data sets. Although I have not tried the github versions as Zhian has recommended.

Good luck!
Brian
--
You received this message because you are subscribed to the Google Groups "poppr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
To post to this group, send email to po...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/poppr/b1304e01-a94e-40dd-aa57-d5cd3fc9ebeb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jigar trivedi

unread,
Nov 4, 2015, 1:30:23 PM11/4/15
to poppr
Hi Zhian,

Thank you very much for this. It will help me a lot. Will get back to you in case I experience any difficulty in my analysis.

jigar trivedi

unread,
Nov 6, 2015, 5:29:04 PM11/6/15
to poppr
Hi Zhian,

I am not able to install devtools and Pegas. I am doing that through R tools.

Zhian Kamvar

unread,
Nov 6, 2015, 5:55:24 PM11/6/15
to jigar trivedi, poppr
Can you elaborate on this? Installing devtools at the moment is a necessary step until the packages are updated on CRAN. R tools is not a platform from which to install packages, it is a set of tools that plug into R to allow you to build packages that contain compiled code. What is the error you get when you try to install devtools?

Zhian
> --
> You received this message because you are subscribed to the Google Groups "poppr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
> To post to this group, send email to po...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/poppr/0d2aa140-3a64-4e7e-a837-8381dbbc3b33%40googlegroups.com.

jigar trivedi

unread,
Nov 7, 2015, 12:44:34 AM11/7/15
to poppr, jig...@gmail.com
Hi Zhian,

Maybe I was doing it the wrong way. How should I install devtools on Mac OS? 

Zhian Kamvar

unread,
Nov 7, 2015, 1:34:50 PM11/7/15
to jigar trivedi, poppr
For OSX, things should be a bit easier. As I've written here (https://github.com/grunwaldlab/poppr/tree/devel#stable-and-development-versions), you will need Xcode's command line tools. You need to open your terminal and type xcode-select --install to install them. This tutorial will show you how to do that (http://osxdaily.com/2014/02/12/install-command-line-tools-mac-os-x/). 

After that, you can open R or Rstudio (https://www.rstudio.com/products/rstudio/download/) and use 

install.packages('devtools')

From there, you can use the instructions as I've outlined in my initial reply

Cheers,
Zhian

On Nov 6, 2015, at 21:44 , jigar trivedi <jig...@gmail.com> wrote:

Hi Zhian,

Maybe I was doing it the wrong way. How should I install devtools on Mac OS? 

On Friday, 6 November 2015 17:55:24 UTC-5, Zhian Kamvar wrote:
Can you elaborate on this? Installing devtools at the moment is a necessary step until the packages are updated on CRAN. R tools is not a platform from which to install packages, it is a set of tools that plug into R to allow you to build packages that contain compiled code. What is the error you get when you try to install devtools? 

Zhian 


> On Nov 6, 2015, at 14:29 , jigar trivedi <jig...@gmail.com> wrote: 

> Hi Zhian, 

> I am not able to install devtools and Pegas. I am doing that through R tools. 

> On Monday, 2 November 2015 17:15:39 UTC-5, jigar trivedi wrote: 

> This would be a naive question to ask in this group, but it is important to me. I am doing a genomewide SNP analysis on a clonal organism and I am using GATK for the same. My output file contains SNPs is in the form of VCF file and I would like to perform various analysis mentioned in the paper. In that case how should I proceed or start with? 

> -- 
> You received this message because you are subscribed to the Google Groups "poppr" group. 
> To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com
> To post to this group, send email to po...@googlegroups.com
> To view this discussion on the web visit https://groups.google.com/d/msgid/poppr/0d2aa140-3a64-4e7e-a837-8381dbbc3b33%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout


-- 
You received this message because you are subscribed to the Google Groups "poppr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
To post to this group, send email to po...@googlegroups.com.

jigar trivedi

unread,
Nov 10, 2015, 1:17:50 PM11/10/15
to poppr, jig...@gmail.com
> devtools::install_github(repo = "grunwaldlab/poppr", build_vignettes = TRUE)
Downloading GitHub repo grunwaldlab/poppr@master
Installing poppr
Skipping 3 packages ahead of CRAN: knitr, rmarkdown, sp
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore CMD  \
  build  \
  '/private/var/folders/lg/lkd74c_d1039_b4_vl4tmqr80000gn/T/RtmpLnZfTq/devtools5ad584294ad/grunwaldlab-poppr-33e48fc'  \
  --no-resave-data --no-manual

* checking for file ‘/private/var/folders/lg/lkd74c_d1039_b4_vl4tmqr80000gn/T/RtmpLnZfTq/devtools5ad584294ad/grunwaldlab-poppr-33e48fc/DESCRIPTION’ ... OK
* preparing ‘poppr’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
* creating vignettes ... ERROR
Loading required package: adegenet
Loading required package: ade4

   /// adegenet 2.0.0 is loaded ////////////

   > overview: '?adegenet'
   > tutorials/doc/questions: 'adegenetWeb()'
   > bug reports/feature resquests: adegenetIssues()


This is poppr version 2.0.2.99.17. To get started, type package?poppr
OMP parallel support: unavailable
Loading required package: ape
Error : cannot evaluate distance function, it might be missing.
sh: kpsewhich: command not found
Warning in test_latex_pkg("framed", system.file("misc", "framed.sty", package = "knitr")) :
  unable to find LaTeX package 'framed'; will use a copy from knitr
Warning in infinite_vals_replacement(D, warning) :
  Infinite values detected.
Error in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  :
  Running 'texi2dvi' on 'algo.tex' failed.
Calls: <Anonymous> -> texi2pdf -> texi2dvi
Execution halted
Error: Command failed (1)

jigar trivedi

unread,
Nov 10, 2015, 1:18:45 PM11/10/15
to poppr, jig...@gmail.com
Hi Zhian,

I am getting this error when I am trying to install the Github version of Poppr.

Zhian Kamvar

unread,
Nov 10, 2015, 1:23:07 PM11/10/15
to poppr, jig...@gmail.com
Hi,

This should work:

devtools::install_github(repo = "grunwaldlab/poppr")

I believe that it's the build_vignettes option. You have to have LaTeX support to be able to do that.

Let me know if that helps.

Zhian

jigar trivedi

unread,
Nov 10, 2015, 2:52:26 PM11/10/15
to poppr, jig...@gmail.com
Hi Zhian,

That worked.

Thank you!!

jigar trivedi

unread,
Nov 10, 2015, 3:26:05 PM11/10/15
to poppr, zka...@gmail.com
Hi Brian,

I have installed 'VCF R' to covert VCF to genid. Would it be possible to do it for the entire genome of size 30 MB?  I cannot do it through individual supercontigs as they are around 1800.

Brian

unread,
Nov 10, 2015, 4:23:01 PM11/10/15
to jigar trivedi, poppr, zka...@gmail.com
Hi Jigar,

I'm afraid your genome size is not very informative as to how large an object your data set will require. The real issue is the number of variants (rows) and the number of samples (columns). I'd suggest you give it a try though. The function read.vcf includes a parameter 'limit' to try and keep you from using all your memory. If you exceed this limit it should give you a message with a guess as to how much memory you might need. You can then try again with a new parameterization for limit.

The number of supercontigs you have is not really so great. I worked through a project this morning with ~5,000 contigs. I use something like:

Files <- list.files("./directory_with_vcfs/", pattern="vcf$", full.names=TRUE)
for(i in 1:length(Files)){
  x <- read.vcf(Files[i])
  # some sort of processing of each vcf
}

To cycle through large numbers of files.

Also, vcfR2genind does not appear to be working well for large objects. I believe this has to do with an adegenet function. I have yet to try the version on github though. The function vcfR2genlight may be a better option.

Good luck!
Brian

jigar trivedi

unread,
Nov 10, 2015, 6:12:33 PM11/10/15
to poppr, jig...@gmail.com, zka...@gmail.com
Hi Brian,

Thank you!!

I would try and see if it works out. Though, I am not expecting lot of variants as the organism is clonal.

jigar trivedi

unread,
Nov 11, 2015, 5:52:03 PM11/11/15
to poppr, jig...@gmail.com, zka...@gmail.com
Hi Brian,

Could you send me the commands or your workflow which you use to convert VCF2genid. I am first time user of your tool and R?

On Tuesday, 10 November 2015 16:23:01 UTC-5, Brian Knaus wrote:

Brian Knaus

unread,
Nov 12, 2015, 6:37:08 PM11/12/15
to jigar trivedi, poppr, Zhian Kamvar
Hi Jigar,

You presumably have a vcf format file.  This may either be *.vcf or *.vcf.gz format. If we call this file 'mydata.vcf' your workflow could be as simple as:

library(vcfR)
vcf <- read.vcf('mydata.vcf')
mygenind <- vcfR2genind(vcf)

At which point you should have a genind object which can be used with adegenet or poppr.

In order to find information about a command in R you can use:

?command

Where 'command' is the name of the command you are looking for information on. Many packages also come with vignettes to describe their use. I would suggest you spend some time with these documents and, perhaps some introductory materials on R, so you can find answers if they already exist in the documentation.

As stated previously, the function vcfR2genind may not perform well depending on how large your dataset is. If you have problems with it, let me know. I have another path I can suggest.

Good luck!
Brian



For more options, visit https://groups.google.com/d/optout.



--
Brian J. Knaus, Ph.D.
Corvallis, Oregon, USA
brianknaus.com
http://grunwaldlab.cgrb.oregonstate.edu/

Melanie Montes

unread,
Dec 5, 2016, 11:28:48 AM12/5/16
to poppr
Hi,
I'm using poppr for the first time on a data set of about 70 individuals x 55 000 snps, and also on a thinned data set where I only selected snps that are more than 1kb apart for running the Index of association to avoid physical linkage. I have a couple of questions: 
(1) Is it WRONG to import SNP data as a genind/genclone object rather than genlight/snpclone, or just slower?
(2) Is there any way to add my population data (which individuals belong to which population) after I've created the genclone/snpclone object? It seems that having the pop data is essential to a lot of the analyses. 

Thanks for any advice you can give! 
Best,
Melanie 

Zhian Kamvar

unread,
Dec 5, 2016, 11:52:09 AM12/5/16
to Melanie Montes, poppr
Hi Melanie,

1) No it is not wrong to import your data as a genind/genclone object. The downside is that the genind/genclone will take up a lot of memory. There is a vcfR2genlight() function in vcfR, however if you are interested.

2) Population data can be added in two ways: If you have multiple population strata, you can add a data frame with individuals in rows and strata in columns using the strata() function and then select the population with the setPop() function. If you have a single population definition, you can use the pop() function to set the population. You can find an example of that for micorsatellite data (it will work for all data types) here: http://popgen.nescent.org/2015-12-15-microsatellite-differentiation.html

Hope that helps! Please feel free to ask if you have any further questions.

Best,
Zhian

P.S. For future readers: all of the bug fixes mentioned earlier on this thread were fixed and have been on CRAN for some time now.


On Dec 5, 2016, at 03:17 , Melanie Montes <melanie...@gmail.com> wrote:

Hi,
I'm using poppr for the first time on a data set of about 70 individuals x 55 000 snps, and also on a thinned data set where I only selected snps that are more than 1kb apart for running the Index of association to avoid physical linkage. I have a couple of questions: 
(1) Is it WRONG to import SNP data as a genind/genclone object rather than genlight/snpclone, or just slower?
(2) Is there any way to add my population data (which individuals belong to which population) after I've created the genclone/snpclone object? It seems that having the pop data is essential to a lot of the analyses. 

Thanks for any advice you can give! 
Best,
Melanie 

On Monday, November 2, 2015 at 11:51:59 PM UTC+1, Zhian Kamvar wrote:
Hi,



This would be a naive question to ask in this group, but it is important to me.
 
Naïve questions are quite welcome in this group, as it provides a resource for others who will likely have the same question :)


I am doing a genomewide SNP analysis on a clonal organism and I am using GATK for the same. My output file contains SNPs is in the form of VCF file and I would like to perform various analysis mentioned in the paper. In that case how should I proceed or start with?


This is a good question. One important factor we left out of the Frontiers paper was how exactly to import genome-wide data from VCF format. There are two packages that you can use to extract genotypes from a VCF file, pegas(https://github.com/emmanuelparadis/pegas) and vcfR (https://github.com/knausb/vcfR). 
Both of these packages have valuable tools for analysis and visualization of genomic data. From either of these packages, you can convert to a matrix, which you can then convert to a genlight object (see the adegenet "genomics" tutorial: https://github.com/thibautjombart/adegenet/blob/master/tutorials/tutorial-genomics.pdf). 

Before you start, I would recommend at this time to install the github versions of the packages as there are many bug fixes and improvements that aren't on CRAN as of yet. You can do this using devtools (you must also have a C compiler working on your computer, see here for links to options https://github.com/grunwaldlab/poppr/#stable-and-development-versions):

if (!require("devtools")) install.packages("devtools")
devtools
::install_github("emmanuelparadis/pegas/pegas")
devtools
::install_github("knausb/vcfR@devel")
devtools
::install_github("thibautjombart/adegenet")
devtools
::install_github("grunwaldlab/poppr@devel")


Once you have your data in a genlight object, convert it to a snpclone object with the function as.snpclone() (yes, I know there are quite a few steps to get here). This will make sure that you can have mutlilocus genotype definitions travel with your data. From here, you can:



  • Calculate raw genetic distance with bitwise.dist()
  • Define multilocus lineages with mlg.filter()
  • Calculate sliding windows of the standardized index of association with win.ia() 
  • Randomly sample loci for the standardized index of association with samp.ia()
  • Construct minimum spanning networks with poppr.msn()
  • Create boostrapped dendrograms with aboot()


Hope that helps.


Best,

Zhian


-- 
You received this message because you are subscribed to the Google Groups "poppr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
To post to this group, send email to po...@googlegroups.com.

Melanie Montes

unread,
Dec 6, 2016, 8:20:21 AM12/6/16
to poppr, melanie...@gmail.com
Thanks for your help! I do only have a single population definition, but for some reason I am having trouble with the pop() function. Here is what I did:
1) created my genlight object and converted it to a snpclone object named gl

> gl

 ||| SNPCLONE OBJECT |||||||||


 || 68 genotypes,  55,288 binary SNPs, size: 7 Mb

 1201510 (31.96 %) missing data


 || Basic content

   @gen: list of 68 SNPbin

   @mlg: 68 original multilocus genotypes

   @ploidy: ploidy of each individual  (range: 2-2)


 || Optional content

   @ind.names:  68 individual labels

   @loc.names:  55288 locus labels

   @chromosome: factor storing chromosomes of the SNPs

   @position: integer storing positions of the SNPs

   @other: a list containing: elements without names


2)

popmap <- read.table("rad_pop.csv", header = TRUE, sep = ","

pop(gl) <- popmap


this returned the error 

Error in `pop<-`(`*tmp*`, value = list(X = c(11L, 10L, 12L, 13L, 14L,  : 

  Vector length does no match number of individuals


However, I have 68 individuals and 68 lines (excluding the header) in my popmap file. My guess is that maybe I'm formatting the .csv file incorrectly? I made a file that had the individual name in the first column, and the population it belongs to in the second. Is this sufficient or do they have to be in the same order as the vcf file somehow? 

Sorry, I know this should be very simple, feeling very ignorant at the moment! 
Thanks again for your help, and for the awesome software :) 
Melanie 

Zhian Kamvar

unread,
Dec 6, 2016, 10:23:43 AM12/6/16
to Melanie Montes, poppr
Hi Melanie,

The following should work:

pop(gl) <- popmap[[1]]

popmap is a data frame, but pop() expects a vector. 

Hope that helps!
Zhian

Melanie Montes

unread,
Dec 6, 2016, 10:49:56 AM12/6/16
to poppr, melanie...@gmail.com
Thanks, it did work, BUT I don't think it assigned the populations correctly. 
This is how my popmap file is formatted:



> head(popmap)

      X pop

1 196-2  M4

2   196  M4

3   219  M4

4   220  M4

5 220-2  M4

6   233  M4


And now when I look at the snpclone object is says:

   @pop: population of each individual (group size range: 1-1)


So it is assigning each individual to a population of 1, correct? Should I not have a column with the individual names? And how do I tell it which individual belongs to which population otherwise? Or should it be a list of "196-2=M4", "196=M4"... etc? 

I apologize again for the simple questions... quite new to R! 
Thanks, 
Melanie 

Zhian Kamvar

unread,
Dec 6, 2016, 6:19:29 PM12/6/16
to Melanie Montes, poppr
Hi Melanie,

The previous code I sent was assuming that your population column was the first column.
If your population data is in the same order as your genotype data (indNames(gl) == popmap$X), 
they you can use the second column (pop(gl) <- popmap$pop).

If your population data is not in the same order as your genetic data, you will need to resort your
population data first and then set the population.

Best,
Zhian
Reply all
Reply to author
Forward
0 new messages