remove of loci


Rita Castilho

Aug 14, 2019, 8:30:14 AM
to poppr
Hi,
I have both genind and genlight objects for a dataset of over 5000 loci. I would like to remove loci for which a population has no genotypes at all, for instance loci 3-4 and 7 in the data below.
Any idea how to do it?
Thanks,
Rita


data.png


Natalia Bayona

Aug 14, 2019, 11:17:01 AM
to Rita Castilho, poppr
You can remove loci according to their level of missing data using the function missingno:

For example, removing loci with missing values greater than 90%:

miss <- obj %>% missingno("loci", cutoff = 0.9) %>% info_table(plot = TRUE) 

Best,

--
You received this message because you are subscribed to the Google Groups "poppr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to poppr+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/poppr/b0cad310-9fcf-425d-804c-5b9fe80443ab%40googlegroups.com.


--

Natalia J. Bayona Vásquez 
Environ. Health Science | Inst. of Bioinformatics 
Postdoctoral Research Associate

150 E Green St 
Athens, GA 30602

e: njba...@gmail.com 
w: https://njbayona7.wixsite.com/natalia-bayona

University of Georgia

Natalia Bayona

Aug 14, 2019, 11:18:27 AM
to Rita Castilho, poppr
Sorry, that line only shows the loci that will be removed.
Here is the function call that creates a new genind object with these loci removed:

newobj <- missingno(obj, type = "loci", cutoff = 0.9, quiet = FALSE, freq = FALSE)
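To see which loci such a call drops, the locus names before and after filtering can be compared. This is only a sketch on the nancycats example data that ships with adegenet; the cutoff of 0.05 is an arbitrary choice for illustration, not a recommendation.

```r
library(poppr)
data(nancycats)

# Remove loci whose overall proportion of missing data exceeds the cutoff
# (0.05 is an arbitrary value chosen for this illustration)
filtered <- missingno(nancycats, type = "loci", cutoff = 0.05, quiet = TRUE)

# Loci present before filtering but absent afterwards
removed <- setdiff(locNames(nancycats), locNames(filtered))
removed
nLoc(filtered)
```

Note that the cutoff here is applied to the whole data set, not population by population.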

Rita Castilho

Aug 14, 2019, 11:21:06 AM
to poppr
Hi Natalia,
Many thanks for your reply, but that is not what I want to do. I want to eliminate loci for which, for instance, one population has no data. The function missingno allows an overall across population cut-off value. Correct me if I am wrong, please.
Thanks.
Rita




Natalia Bayona

Aug 14, 2019, 11:38:52 AM
to Rita Castilho, poppr
Rita,
You are right. However, my understanding is that because a genind object is built from a matrix, some rows (individuals) cannot have more columns (loci) than others. So you have X loci across all populations, and if some populations are not genotyped at particular loci, those genotypes are treated as missing data in further analyses.
Now, if you want to calculate population-specific parameters and you really want to omit the missing data for that population, I guess you can create a new object for just that population, filter it according to missing data, and then estimate the population-specific parameters. Hope that makes sense. Here is what you could do:

obj$pop #visualizing population names
pop_obj <- popsub(obj, sublist=c("POPNAME-FROM-INTEREST"))
pop_obj <- missingno(pop_obj, type = "loci", cutoff = 0.9, quiet = FALSE, freq = FALSE)

So you will have a pop_obj object that contains data only for the desired population, with loci that have more than 90% missing data removed.

I hope this was helpful.
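The same idea could be repeated for every population in a loop, collecting the loci that each population loses. The sketch below uses the nancycats example data; the object names (lost, drop_loci, nancycats_sub) and the 0.999 cutoff are assumptions made for illustration.

```r
library(poppr)
data(nancycats)

# For each population, filter on missing data and record which loci
# from the full data set are no longer present afterwards
lost <- lapply(popNames(nancycats), function(p) {
  one_pop <- popsub(nancycats, sublist = p)
  kept <- missingno(one_pop, type = "loci", cutoff = 0.999, quiet = TRUE)
  setdiff(locNames(nancycats), locNames(kept))
})
names(lost) <- popNames(nancycats)

# Loci that are effectively absent from at least one population
drop_loci <- unique(unlist(lost))

# Remove those loci from the full data set
nancycats_sub <- nancycats[loc = setdiff(locNames(nancycats), drop_loci)]
nLoc(nancycats_sub)
```

Comparing against the locus names of the full object, rather than those of the single-population subset, keeps the result the same whether or not subsetting has already dropped an entirely empty locus.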

Best,


Νίκος Τουρβάς

Aug 14, 2019, 11:54:02 AM
to po...@googlegroups.com
If all you want is to remove loci by specifying their names, it is possible to do. For example in order to remove loci 3,4, and 7 the following would work:

all_loci <- locNames(obj) #vector of all loci
removeloc <- c("3", "4", "7") #loci to remove
keeploc <- setdiff(all_loci, removeloc) #loci to keep
obj_new <- obj[loc = keeploc]

This would allow you to remove specific loci without applying an across-population cut-off value. However I am not sure how this will be more helpful than simply applying a cut-off value such as 90% as Natalia suggested. Any loci that are retained using the procedure I suggest, but are removed using a 90% cut-off filter are unlikely to contain any useful information.
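With thousands of loci, the names to remove do not have to be typed by hand; they can be picked from locNames() by position or by any other rule. A minimal sketch on the nancycats example data, where the positions 3, 4 and 7 simply mirror the example above:

```r
library(adegenet)
data(nancycats)

all_loci  <- locNames(nancycats)           # vector of all locus names
removeloc <- all_loci[c(3, 4, 7)]          # pick loci to remove by position
keeploc   <- setdiff(all_loci, removeloc)  # loci to keep
nancycats_new <- nancycats[loc = keeploc]

nLoc(nancycats_new)
#> [1] 6
```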

Hope this helps.

Nikolaos Tourvas, BSc
MSc Student in Genetics and Plant Breeding
Laboratory of Forest Genetics and Tree Breeding Aristotle University of Thessaloniki https://orcid.org/0000-0002-0476-4468

Rita Castilho

Aug 14, 2019, 11:57:51 AM
to poppr
Dear Νίκος,
I have over 5000 loci, so it is not practical to name them all as you suggest.
Again, I would like to be able to go population by population and remove every locus for which a particular population has only missing data.

Thanks for your reply, I will keep your suggestion in mind.
Best,
Rita


Rita Castilho

Aug 14, 2019, 1:17:57 PM
to poppr
Natalia,
Again thank you for spending time with my query. I completely understand what you say, and it is a quite helpful solution. Is there a way to merge the final pop_obj in a genind object?

If you can bear with me just one step further, how would I remove loci from the whole dataset, the ones that were not genotyped for one or more populations?

1. How to identify loci where 000000 (alleles coded with 3 digits) occurs in all individuals of one population?
2. List several loci that satisfy that condition.
3. Remove from genind.

Best,
Rita

Natalia Bayona

Aug 14, 2019, 1:40:13 PM
to Rita Castilho, poppr
Hi Rita,

Is there a way to merge the final pop_obj in a genind object?
I think you may want to give a look to this function 
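The function referred to here is not named in the message. One adegenet function that pools several genind objects into a single one is repool(); whether that is the function meant is an assumption. A minimal sketch on the nancycats example data:

```r
library(adegenet)
data(nancycats)

# Split the data set into one genind object per population
pops <- seppop(nancycats)

# Pool the first two populations back into a single genind object
merged <- repool(pops[[1]], pops[[2]])

nInd(merged)
popNames(merged)
```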

1. How to identify loci where 000000 (alleles coded with 3 digits) occurs in all individuals of one population?

If you set the cutoff to 99.9% for each of your populations you will remove loci that are missing in all individuals.

2. List several loci that satisfy that condition.

When you run the function, the console will print the name of each locus that satisfies that condition (missing data greater than 99.9%).

pop_obj %>% missingno("loci", cutoff = 0.999) %>% info_table(plot = FALSE)

3. Remove from genind.

As explained before, the function below creates a new genind object from which the loci with the highest levels of missing data have been removed. (Before I gave it the same name; below I name it differently.)
In detail:
#Below I am creating a new genind object that has only the desired population from my starting genind object
pop_obj <- popsub(obj, sublist=c("POPNAME-FROM-INTEREST"))
#Below I am removing, from the single-population genind object, those loci that have missing data in more than 99.9% of individuals
pop_obj_fil <- missingno(pop_obj, type = "loci", cutoff = 0.999, quiet = FALSE, freq = FALSE)

So you have obj: this is a genind object with all your populations and loci
You have pop_obj: this is a genind object with only genotypes from one population
You have pop_obj_fil: this is a genind object with genotypes from one population, with loci that have more than 99.9% missing data removed

Best,


Rita Castilho

Aug 14, 2019, 1:42:36 PM
to poppr
Natalia,
That's it. I think I got it. Thanks ever so much for your dedicated help!
Best,
Rita

Zhian Kamvar

Aug 15, 2019, 4:29:28 AM
to Rita Castilho, poppr
Thank you all for your quick responses! It makes me genuinely happy to see people sharing their expertise on this forum ^_^

For what it's worth, it's possible to do this with only info_table.

suppressPackageStartupMessages(library("poppr"))
data(nancycats)

# The nancycats data set has 9 loci, but population 17 is missing data at locus fca45
nLoc(nancycats)
#> [1] 9

# Get the proportion of missing data per population (rows) per locus (columns)
miss_tab <- info_table(nancycats, plot = FALSE)[, locNames(nancycats)]

# check how many populations are missing loci entirely.
colSums(miss_tab == 1)
#>  fca8 fca23 fca43 fca45 fca77 fca78 fca90 fca96 fca37
#>     0     0     0     1     0     0     0     0     0

# subset the data to include only loci that are represented in all populations
nancycats_sub <- nancycats[loc = colSums(miss_tab == 1) == 0]

# now we have 8 loci
nLoc(nancycats_sub)
#> [1] 8
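To list which population lacks which locus, the same table can be searched for cells equal to 1. This is a sketch, not part of the original answer; the name report is made up, and the "Total" row is dropped on the assumption that info_table appends it after the population rows.

```r
library(poppr)
data(nancycats)

# Proportion of missing data per population (rows) per locus (columns)
miss_tab <- info_table(nancycats, plot = FALSE)[, locNames(nancycats)]

# Keep only the per-population rows (drop the pooled "Total" row if present)
miss_tab <- miss_tab[rownames(miss_tab) != "Total", , drop = FALSE]

# Row/column positions of the cells with 100% missing data
hits <- which(miss_tab == 1, arr.ind = TRUE)

# One line per population/locus combination that is entirely missing
report <- data.frame(
  population = rownames(miss_tab)[hits[, 1]],
  locus      = colnames(miss_tab)[hits[, 2]]
)
report
```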


Hope that helps and thank you all again!

Best,
Zhian
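For the genlight object mentioned in the original question, the same per-population check could be sketched directly on the genotype matrix. The toy data below are made up for illustration: four individuals, four SNPs, two populations.

```r
library(adegenet)

# Toy genlight object: each list element is one individual's genotypes
gl <- new("genlight",
          list(a = c(0, 1, NA, 2),
               b = c(1, 2, NA, 0),
               c = c(2, NA, 1, 1),
               d = c(0, NA, 2, 2)),
          parallel = FALSE)
pop(gl) <- c("A", "A", "B", "B")

# TRUE where a genotype is missing (individuals in rows, loci in columns)
is_na <- is.na(as.matrix(gl))

# Proportion of missing genotypes per population (rows) per locus (columns)
miss_by_pop <- apply(is_na, 2, function(x) tapply(x, pop(gl), mean))

# Keep loci that are not entirely missing in any population
keep <- which(colSums(miss_by_pop == 1) == 0)
gl_sub <- gl[, keep]

nLoc(gl_sub)
#> [1] 2
```

Here locus 3 is entirely missing in population A and locus 2 in population B, so only loci 1 and 4 remain.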


Rita Castilho

Aug 15, 2019, 8:28:19 AM
to poppr
Dear Zhian,

That is exactly what I was looking for! However, in case more than one population has missing data, would this be the right way of doing it?

nancycats_sub <- nancycats[loc = colSums(miss_tab >= 1) == 0]

Thanks,
Rita

Zhian Kamvar

Aug 15, 2019, 9:06:46 AM
to Rita Castilho, poppr
Hi Rita,

That will work, but it's not necessary. Since each cell in the table represents the fraction of missing data for a single population at a single locus, the maximum value of any one cell is 1, so miss_tab == 1 explicitly flags populations with 100% missing data.
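This can be checked on the example data: every cell lies between 0 and 1, so both comparisons flag the same cells. A looser, purely illustrative rule, such as dropping loci where any population is missing more than half of its genotypes, would use a different threshold; the 0.5 below is an arbitrary choice.

```r
library(poppr)
data(nancycats)

miss_tab <- info_table(nancycats, plot = FALSE)[, locNames(nancycats)]

# Every cell is a proportion, so it cannot exceed 1
range(miss_tab)

# Therefore >= 1 and == 1 flag exactly the same cells
identical(colSums(miss_tab >= 1), colSums(miss_tab == 1))

# A looser rule: drop loci where any population is missing more than
# half of its genotypes (0.5 is an arbitrary threshold)
nancycats_half <- nancycats[loc = colSums(miss_tab > 0.5) == 0]
nLoc(nancycats_half)
```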


Rita Castilho

Aug 15, 2019, 9:59:49 AM
to poppr
Hi Zhian,
Sorry I did not realize that!
Best,
Rita