rbind.genlight if objects have different numbers of SNPs

Kate Quigley

Feb 25, 2021, 5:34:37 PM
to da...@googlegroups.com
Hi DART-ers,
Does anyone know of a function within dartR or adegenet for binding together multiple genlight objects that have different numbers of SNPs and individuals?
For example, the following work well if the number of SNPs is the same:
"cbind"
"rbind" 
"cbind.genlight"
"rbind.genlight"  
However, if these vary, the function will not work.
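
A minimal sketch of why the dimensions matter, using plain base R matrices as stand-ins for genlight objects (rows are individuals, columns are SNPs). The object names and values are made up for illustration. Row-binding adds individuals and needs matching SNPs; column-binding adds SNPs and needs matching individuals.

```r
# Hypothetical genotype matrices: rows = individuals, columns = SNPs
a <- matrix(0L, nrow = 2, ncol = 3,
            dimnames = list(c("ind1", "ind2"), c("snp1", "snp2", "snp3")))
b <- matrix(1L, nrow = 4, ncol = 3,
            dimnames = list(paste0("ind", 3:6), colnames(a)))
c2 <- matrix(2L, nrow = 2, ncol = 5,
             dimnames = list(rownames(a), paste0("snp", 4:8)))

more_inds <- rbind(a, b)   # same SNPs, more individuals
more_snps <- cbind(a, c2)  # same individuals, more SNPs

dim(more_inds)
dim(more_snps)

# Row-binding objects with different numbers of SNPs fails
mismatch <- try(rbind(a, c2), silent = TRUE)
inherits(mismatch, "try-error")
```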

One workaround I have found is to convert these to matrices, do cbind, then use df2genind, but I lose genlight object metadata that way. 

Any hints?

cheers,
Kate

Arthur Georges

Feb 25, 2021, 5:46:24 PM
to da...@googlegroups.com
I guess add missing data (-) for individuals and loci not present in all genlight objects to bring them up to the same dimension, then use gl.join.

Not sure how to add the missing data, would have to fiddle. There would be a number of ways of doing that depending on your skills in R and adegenet.

Arthur
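
One possible way to do the padding described above, sketched on plain matrices rather than genlight objects. The helper name `pad_and_merge` and the example data are hypothetical. It assumes unique individual and locus names and that both inputs are individual-by-locus matrices, such as `as.matrix()` returns for a genlight object.

```r
# Pad two individual-by-locus genotype matrices (0/1/2/NA) to the union of
# their individuals and loci, filling cells that were never scored with NA.
pad_and_merge <- function(m1, m2) {
  inds <- union(rownames(m1), rownames(m2))
  loci <- union(colnames(m1), colnames(m2))
  out <- matrix(NA_integer_, nrow = length(inds), ncol = length(loci),
                dimnames = list(inds, loci))
  out[rownames(m1), colnames(m1)] <- m1
  # Only copy non-missing calls from m2, so an NA in m2 never overwrites
  # a genotype already taken from m1 for a shared individual and locus.
  block <- out[rownames(m2), colnames(m2), drop = FALSE]
  keep <- !is.na(m2)
  block[keep] <- m2[keep]
  out[rownames(m2), colnames(m2)] <- block
  out
}

m1 <- matrix(c(0L, 1L, 2L, 0L), nrow = 2,
             dimnames = list(c("indA", "indB"), c("snp1", "snp2")))
m2 <- matrix(c(2L, 1L, NA, 0L, 1L, 1L), nrow = 2,
             dimnames = list(c("indC", "indD"), c("snp2", "snp3", "snp4")))

merged <- pad_and_merge(m1, m2)
merged
```

With adegenet loaded, the padded matrix could in principle be turned back into a genlight object with `new("genlight", merged)`, but locus and individual metadata would need to be rebuilt separately.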



Bernd.Gruber

Feb 25, 2021, 6:08:00 PM
to da...@googlegroups.com

Hi Kate,

Not sure. What would you like to fill in for the missing loci? NAs, I assume, and that would most likely introduce a lot of NAs.

My idea would be to reduce the genlight objects to the same number of loci and then combine them, but maybe I am just missing the reason why joining two genlight objects with different numbers of loci is beneficial.

To subset, you can simply use the gl.keep.loc function.

Cheers, Bernd

--
Dr Bernd Gruber
Professor Ecological Modelling
Institute for Applied Ecology, Faculty of Science and Technology
University of Canberra   ACT 2601 AUSTRALIA


Kate Quigley

Mar 17, 2021, 2:37:16 AM
to da...@googlegroups.com
Hi Bernd and Arthur, 

Thank you very much for the quick responses. It took me a bit to get back to this, but I wanted to put the solution here (based on your advice), in case others would like to know.

# gl.test1: 67 SNPs
# gl.test2: 831 SNPs
# gl.test3: 76 SNPs

# Reduce each larger gl object to the minimum number of SNPs (67)
gl.test2_reduced <- gl.keep.loc(gl.test2, first = 1, last = 67)
gl.test3_reduced <- gl.keep.loc(gl.test3, first = 1, last = 67)

postStrictclean <- rbind.genlight(gl.test1, gl.test2_reduced)
# Daisy-chain the third object
postStrictclean2 <- rbind.genlight(postStrictclean, gl.test3_reduced)

I ended up using rbind.genlight instead of gl.join because each genlight object has different sample names, which leads to the following error with gl.join:
# Fatal Error: the two genlight objects do not have data for the same individuals in the same order
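
Keeping the first 67 loci by position gives equal counts but does not by itself guarantee that the same loci line up across objects. If the locus names are comparable across objects, a possible alternative is to keep only the loci they share by name. A minimal sketch, in which the three character vectors are hypothetical stand-ins for what `locNames()` would return:

```r
# Hypothetical locus names standing in for locNames(gl.test1), etc.
loci1 <- c("snpA", "snpB", "snpC")
loci2 <- c("snpC", "snpA", "snpD", "snpE")
loci3 <- c("snpA", "snpC", "snpF")

# Loci present in every object
shared <- Reduce(intersect, list(loci1, loci2, loci3))
shared

# With dartR loaded, each object could then be subset on the same names,
# for example: gl.keep.loc(gl.test2, loc.list = shared)
```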

cheers,
Kate






Michael Sandel

Jun 30, 2021, 7:37:36 AM
to dartR
Thank you to everyone for providing this solution. 

I was able to combine my ...SNP_3 and ...SNP_4 files using cbind.genlight(). I wanted to check the combined object against dartR expectations using gl.compliance.check(); however, I received the following output:

Starting gl.compliance.check 
  Processing a SNP dataset
  Checking coding of SNPs
    SNP data scored NA, 0, 1 or 2 confirmed
  Checking locus metrics and flags
  Recalculating locus metrics
Error: cannot allocate vector of size 251.4 Mb

It appears that the function did not run properly. Is this a limitation inherent to gl.compliance.check(), or should I be concerned that my combined gl is somehow misformatted?

Thank you,
Michael Sandel 

Bernd.Gruber

Jun 30, 2021, 7:51:19 AM
to da...@googlegroups.com

Hi Michael,

This is a memory issue.

Maybe first do a garbage collection:

gc()  # this frees unused memory in case you have done a lot of calculations before

Then check your R version and whether you are running 64-bit R:

> R.version
               _
platform       x86_64-w64-mingw32
arch           x86_64

Then you could try to increase your memory limit (save everything first, as this might stall your system):

memory.size()             # check the memory currently in use
memory.limit()            # check the current limit
memory.limit(size=16000)  # raise the limit; it needs to be higher than your current limit (this depends on how many GB you have; I have 16 GB of memory, so 16000 is close to the maximum here)

The limit is in MB.
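
A small sketch of how memory use could be checked from within R before changing any limit. `object.size()` and `gc()` are base R; the 1000 x 1000 matrix is only a stand-in for a real dataset. Note that `memory.size()` and `memory.limit()` are Windows-only, and from R 4.2.0 onward they are stubs that no longer change anything.

```r
# Stand-in object; replace with your own genlight object or matrix
x <- matrix(0L, nrow = 1000, ncol = 1000)

# Report how much memory this one object occupies
print(object.size(x), units = "MB")

# Remove it and trigger garbage collection; gc() also returns a usage table
rm(x)
mem <- gc()
mem[, "used"]  # Ncells and Vcells currently in use
```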

 

Hope that helps,

Cheers, Bernd

P.S. Out of interest, how big is your data set (individuals and loci)?

Michael Sandel

Jun 30, 2021, 10:10:23 AM
to dartR
Thank you Bernd!
I should have recognized the simple memory issue... but now I have new questions.

I want to combine two gl files returned from DArT, each with about 44k loci after gl.filter.rdepth(). When checking locNames(), I find only two loci are shared between the two files, so I have removed the two duplicate reads prior to combining the files. There are 730 individuals in each file, in identical order. When using rbind.genlight(), I received the familiar error about having unequal number of SNPs in the two files, but when I used cbind.genlight() it seems to work. 
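
For a rough sense of scale, a back-of-envelope estimate of what a dataset of this size would need if it were expanded into a single dense matrix. This assumes exactly 44,000 loci per file and a plain integer or double matrix; the actual intermediate objects created by dartR functions may differ.

```r
n_ind <- 730        # individuals
n_loc <- 2 * 44000  # loci after combining both files (approximate)

cells <- n_ind * n_loc
mb_int <- cells * 4 / 1024^2  # integer storage, 4 bytes per cell
mb_dbl <- cells * 8 / 1024^2  # double storage, 8 bytes per cell

round(c(integer_MB = mb_int, double_MB = mb_dbl))
```

Under these assumptions a single dense copy comes to roughly 245 MB as integers or 490 MB as doubles, so a function that makes a few temporary copies can exhaust a modest memory limit.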

Now I am having some trouble getting downstream functions to work on the combined gl. I have tried gl.filter.callrate() and gl2fasta() with no success. These commands work fine on the two source gls, but not on the combined gl. This seems to be caused by missing or corrupted metadata, as others in the forum have suggested. How do I re-establish the metadata after combining the two gls?

Cheers,
Mike

Michael Sandel

Jun 30, 2021, 10:33:18 AM
to dartR
My apologies, Bernd. It appears that I missed one of the other solutions provided earlier in the thread:

gl.join() is working on the combined genlight object, with no metadata problems. 

Thank you for your patience.
M

Renee Catullo

Jun 30, 2021, 8:32:17 PM
to da...@googlegroups.com
Hey everyone,

I think it’s worth taking a step back from this question and asking not can you, but should you? There are big theoretical issues with this action, and it can seriously disrupt the suitability of the dataset for answering questions.

For example, let’s say your first dataset had two somewhat divergent groups. When you call SNPs across them you get a Wahlund effect: heterozygosity is reduced, so it is comparable between those groups in relative terms but not with externally called data. Then you add samples called on just one group. That group will likely have artificially higher heterozygosity than the same group from the original data. Your SNP calling is a hypothesis you are testing. You can’t just combine data without understanding how that affects the distribution of allele frequencies. What if a SNP was missing from the first dataset because it was tri-allelic, but included in the second because it was biallelic? How would that impact your analysis?

For a paper I am working on I got four sets of SNPs called: target species plus outgroup for a phylogeny, the two target species together for admixture tests, and each species separately for within-species analyses. Yes, it makes a massive difference. I also had SNPs called across everyone, and all the stats were very different.

There are basically an infinite number of artificial outcomes added to your dataset if you do this. Combining filtered datasets is probably worse. I would reject any paper where this was done without justification that it doesn’t impact the outcomes, so you should fully disclose.

Note this likely doesn’t apply where you are doing it with variant sites and not SNPs.

The proper thing to do is to have SNPs called again across the individuals of interest for your specific analysis.

Cheers,

Renee


Michael Sandel

Jun 30, 2021, 9:33:54 PM
to da...@googlegroups.com
Thanks Renee,

I share your concerns. I recently confirmed that my case is a bit different, because the two report files were provided from the same run, in the same format, for the same taxon set. In my case, the files were split simply because of the size of the report/data set. Since none of the concerns you (rightly) raised apply, I am able to move forward with gl.join()... after a little assistance from the forum.

Cheers,
Mike

Renee Catullo

Jun 30, 2021, 10:19:38 PM
to da...@googlegroups.com
Hi Mike,

I knew when I wrote that there would be genuine cases where it was acceptable. Yours is clearly one of them!

Cheers,

Renee

Bernd.Gruber

Jun 30, 2021, 10:55:00 PM
to da...@googlegroups.com, Andrzej Kilian

Hi,

Good discussion here. I enjoy reading it, as it demonstrates user needs and difficulties.

Not sure if it helps at the moment, but we are working together with DArT on a dartR function that lets you access the report directly from their server and convert it into a genlight object in R.

This would avoid the need to join files, and would also avoid duplicate names etc.

Cheers, Bernd

 

 
