poppr function memory issues

David Tork

Feb 10, 2025, 6:44:42 PM
to poppr
Hello,

I am having issues running my SNP dataset through the poppr() function. The dataset contains 995 individuals across 6 populations and 10,736 SNPs. On my local machine (16 GB of RAM) the function runs for ~10 min, lists all population names in the console, and then returns "Error: vector memory limit of 16.0 Gb reached, see mem.maxVSize()".
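For reference, the limit named in the error can be inspected (and, in principle, raised) on macOS builds of R; if I read the docs right, the value is in Mb:

```r
mem.maxVSize()       # current vector heap limit, in Mb
mem.maxVSize(32768)  # raise it to ~32 GB (only helps if the machine can back it)
```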

I figured this was a memory issue, so I then tried it on RStudio servers, first with 60 GB and then 500 GB of RAM. The function similarly lists all populations before aborting R: "the R session was abnormally terminated due to an unexpected crash." Each time I increased the RAM, the function ran for longer, but the outcome was the same.

Could this be something other than a memory issue? One problem with my dataset is a relatively high amount of missing data (~60% call rate overall). Could this be interfering with one of the calculations in the function? 

Appreciate any assistance,
David 


Zhian Kamvar

Feb 11, 2025, 12:08:29 AM
to David Tork, poppr
The most recent version of adegenet (released late last week) might fix this issue. I would recommend installing it and trying again.
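For example, to update from CRAN and confirm the version:

```r
install.packages("adegenet")
packageVersion("adegenet")
```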


Zhian

David Tork

Feb 11, 2025, 6:42:45 PM
to poppr
Hello Zhian,

I believe I am using the most recent version of adegenet (2.1.11), but the issue persists. When the final population appears in the console, RStudio shows a memory usage of 70.49 GB. That is high, but nowhere near the 500 GB available, although I suspect the estimate may be inaccurate.

I gradually removed populations to see if I could get the function to yield an output. It finally worked when I excluded my two largest populations, which contain hundreds of individuals each. I was able to run the large populations independently to obtain results, but when they were combined it crashed again, so it does appear to be a data size issue. Is there any downside to breaking up the analysis in this manner, other than the lack of a "total" group? If I wanted to run everything together, it seems my only option would be to apply stringent filtering to reduce the overall file size.
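For reference, the split-by-population runs looked roughly like this (`my_gid` standing in for my genind object):

```r
library(poppr)  # attaches adegenet, which provides seppop()

# Run poppr() on each population separately instead of the full dataset
per_pop <- lapply(seppop(my_gid), poppr)
```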

Thanks,
David

Zhian Kamvar

Feb 11, 2025, 7:26:24 PM
to David Tork, poppr
I do not think the missing data is necessarily a problem. The problem is that the poppr() function was originally designed to handle microsatellite data, and it must create a matrix that is the square of the number of loci just to calculate the denominator of the index of association.
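For a sense of scale, a sketch of that allocation with your data (ignoring the additional copies R tends to make along the way):

```r
n_loci <- 10736
n_loci^2 * 8 / 2^30  # a dense double-precision loci-by-loci matrix: ~0.86 GiB
```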

What you might consider doing is reading your data in as a genlight object (you can use the vcfR package for this if your data are in VCF format), which reduces the size of the data in memory by up to 8-fold. You can get most of the way there by using the `diversity_stats()` and `bitwise.ia()` functions. To get the expected heterozygosity, you could roll your own by using `glSum()` to get allele counts and frequencies.
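A rough sketch of that route (untested; the file name and population factor are placeholders, and it assumes diploid, biallelic SNPs):

```r
library(vcfR)
library(poppr)  # attaches adegenet (genlight, glSum, glMean)

# Read the VCF and convert to the compact bit-level genlight representation
vcf <- read.vcfR("my_snps.vcf.gz")  # placeholder file name
gl  <- vcfR2genlight(vcf)
ploidy(gl) <- 2
pop(gl) <- my_pop_factor            # placeholder factor of population labels

# Index of association straight from the bit-level data
ia_all <- bitwise.ia(gl, missing_match = TRUE)

# Per-population diversity statistics via the multilocus genotype table
sc  <- as.snpclone(gl)
tab <- mlg.table(sc, plot = FALSE)
stats <- diversity_stats(tab)

# Expected heterozygosity rolled by hand from allele counts/frequencies
counts <- glSum(gl)   # alternate-allele counts per locus
p      <- glMean(gl)  # alternate-allele frequency per locus
He     <- 2 * p * (1 - p)  # per-locus He; mean(He, na.rm = TRUE) for a summary
```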

hope that helps,
Zhian
