gl.pcoa crashes with large dataset - parallel processing not working?


Gabriella Scatà

Nov 29, 2022, 7:19:14 PM
to dartR
Hi everyone,
I am having some issues running the function gl.pcoa() on my current dataset.
Normally it works beautifully, but my current dataset is quite large (40,000 loci), so R starts the computation but then just freezes... I waited for hours before realising it was probably stuck.

I have also tried parallel processing with n.cores = 16 (the default) and parallel = TRUE, but it seems to produce the same issue. In addition, the gl.pcoa() documentation states that this option (parallel = TRUE) fails on Windows... is that still the case?

Is there any other way around this problem?

I would really appreciate some advice.
Thanks a lot!
Best,
Gabriella

Michael Sandel

Nov 29, 2022, 7:36:04 PM
to da...@googlegroups.com
Hi Gabriella, 

In my experience, there is no real benefit to running these analyses with large datasets. I've pushed the loci limit on multiple datasets, and there are no meaningful changes above 2,000-3,000 SNPs. You *can* run it with 40k loci, but it is highly unlikely to reveal anything different from 4k.
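[Editor's note: one way to check this empirically is to subsample loci before calling gl.pcoa(). A minimal sketch, assuming `gl` is your genlight object; genlight objects support standard `[individuals, loci]` indexing via adegenet, which dartR loads:]

```r
library(dartR)

# gl is assumed to be your genlight object with ~40,000 loci
set.seed(42)                              # make the random subsample reproducible
keep <- sample(nLoc(gl), 4000)            # draw 4,000 loci at random
gl.small <- gl[, keep]                    # genlight supports [ind, loc] indexing
pc <- gl.pcoa(gl.small)                   # runs in a fraction of the time
gl.pcoa.plot(pc, gl.small)                # compare the plot against a second draw
```

If two or three independent draws of 4k loci give essentially the same ordination, the full 40k run is unlikely to add anything.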

Cheers,
Mike Sandel




Bernd.Gruber

Nov 29, 2022, 9:04:59 PM
to da...@googlegroups.com

Hi Gabriella,

My quick advice would be to run it on a Linux machine, if you have one available; R does not impose a memory limit on Linux.

You could also try to increase the memory limit on your Windows machine via memory.limit() (though this function is no longer available from R 4.2 onwards).

So, depending on the R version you are using (and assuming it is the 64-bit version), you can either upgrade to 4.2, where memory is managed automatically, or use memory.limit() for all R versions < 4.2.

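[Editor's note: for R < 4.2 on 64-bit Windows, raising the limit looks roughly like this; sizes are in MB, and the value shown is illustrative — it must not exceed your physical RAM plus swap:]

```r
# Windows only, R < 4.2 (memory.limit() was removed in R 4.2)
memory.limit()               # report the current limit in MB
memory.limit(size = 32000)   # raise the ceiling to ~32 GB (illustrative value)
```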
Regarding your dataset:

40,000 loci is quite a lot, but it should not be too big an issue. Maybe you can filter a bit "harder", e.g. on call rate, to have less missing data and therefore fewer loci.
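[Editor's note: filtering harder on call rate in dartR could look like the following; the 0.95 threshold is chosen for illustration, and `gl` is assumed to be your genlight object:]

```r
library(dartR)

# keep only loci scored in at least 95% of individuals (illustrative threshold)
gl.filtered <- gl.filter.callrate(gl, method = "loc", threshold = 0.95)
nLoc(gl.filtered)   # check how many loci remain after filtering
```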

How many individuals do you have in your dataset?

Regards, Bernd

==============================================================================

Dr Bernd Gruber                                              )/_         

                                                         _.--..---"-,--c_    

Professor Ecological Modelling                      \|..'           ._O__)_     

Tel: (02) 6206 3804                         ,=.    _.+   _ \..--( /          

Fax: (02) 6201 2328                           \\.-''_.-' \ (     \_          

Institute for Applied Ecology                  `'''       `\__   /\          

Faculty of Science and Technology                          ')                

University of Canberra   ACT 2601 AUSTRALIA

Email: bernd....@canberra.edu.au

WWW: bernd-gruber

Australian Government Higher Education Provider Number CRICOS #00212K 

==============================================================================

Gabriella Scatà

Nov 30, 2022, 12:20:57 AM
to dartR
Hi Bernd,
thank you for your feedback.

This is just a preliminary analysis with multiple species together, so it's a bit hard to apply more stringent thresholds at the moment to filter out more loci... as I may actually miss species-specific information. I have over 500 individuals.

I managed to obtain a PCoA after R crashed twice... so for now it's OK, but I may need to run additional analyses with slightly larger datasets, so I might try the approach you suggested. However, I'm already using R version 4.2.1... so I'm not sure I can do more there.
Maybe it's possible to assign more memory to R externally, from outside R...

Worst case, I will try to cut out more loci as suggested... hoping I'm not losing anything major.
Linux may be a good alternative in that case. Thank you for the suggestion!
Thank you Mike as well for your comment.

Best,
Gabriella
