summary subset


matias.ledesmagelos

Oct 16, 2017, 2:03:36 PM
to manipulatr

Hi everyone,


I’m using the plyr and dplyr packages, and I’m having trouble attaching appropriate labels to the output and obtaining the sum of each variable for each subset.

As an example, I’m going to use the baseball dataset in the package (dplyr)....

The objective is to obtain the sum of rbi (runs batted in) per year and team. It is not the case here, but in my own data set some subsets have few observations, so I sample with replacement within each subset of data.


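library(plyr); library(dplyr)

samp_size <- 50   # rows drawn per resample
iter <- 10        # number of resamples per subset

# For one subset, return an iter x samp_size matrix of row indices
# drawn with replacement; each matrix row is one resample.
funct_df <- function(df) {
  matrix(sample(seq_len(nrow(df)), samp_size * iter, replace = TRUE),
         ncol = samp_size, byrow = TRUE)
}

model <- dlply(baseball, .(year, team), funct_df)

str(model)  # a list of 2527 matrices, each [1:10, 1:50]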


I need to transpose and then calculate the sum of each variable, but I don’t know how.

I have tried ldply, as in the example in the split apply combine..., but without success.

 

Could someone help me?

Cheers,

 Matias

Brandon Hurr

Oct 16, 2017, 2:33:46 PM
to matias.ledesmagelos, manipulatr
Matias, 

As far as I can tell, there is no baseball dataset in dplyr so I can't replicate what you're trying to do. 

There is one in Lahman so I did this... 

library(Lahman)
library(tidyverse)

glimpse(Batting)

Batting %>% group_by(yearID, teamID) %>% summarize(sumRBI = sum(RBI, na.rm = TRUE))

Why do you need the sampling with replacement? 

Could you rework your example with the Lahman::Batting dataset or give us a more representative dataset?

Thanks, 
Brandon

--
You received this message because you are subscribed to the Google Groups "manipulatr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to manipulatr+unsubscribe@googlegroups.com.
To post to this group, send email to manip...@googlegroups.com.
Visit this group at https://groups.google.com/group/manipulatr.
For more options, visit https://groups.google.com/d/optout.

David Winsemius

Oct 16, 2017, 5:47:22 PM
to Brandon Hurr, matias.ledesmagelos, manipulatr

> On Oct 16, 2017, at 11:33 AM, Brandon Hurr <brando...@gmail.com> wrote:
>
> Matias,
>
> As far as I can tell, there is no baseball dataset in dplyr so I can't replicate what you're trying to do.
>
> There is one in Lahman so I did this...
>
> library(Lahman)
> library(tidyverse)
>
> glimpse(Batting)
>
> Batting %>% group_by(yearID, teamID) %>% summarize(sumRBI = sum(RBI, na.rm = TRUE))
>
> Why do you need the sampling with replacement?
>
> Could your rework your example with the Lahman::Batting dataset or give us a better representative dataset?

I suspect the original poster was using the baseball dataset in pkg plyr. He loaded it as well.

http://finzi.psych.upenn.edu/R/library/plyr/html/baseball.html

--
David



David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law
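
For reference, the per-year, per-team sum that Brandon computed on Lahman::Batting could be written against plyr's baseball data roughly as follows. This is a minimal sketch, not code from the thread: the result name rbi_sums and the column name sum_rbi are made up for illustration, and na.rm = TRUE is there in case rbi is missing for some rows.

```r
library(plyr)

# baseball ships with plyr; data() loads it into the workspace
data(baseball, package = "plyr")

# Sum of rbi for every year and team combination.
# rbi_sums and sum_rbi are illustrative names, not from the thread.
rbi_sums <- ddply(baseball, .(year, team), summarise,
                  sum_rbi = sum(rbi, na.rm = TRUE))

head(rbi_sums)
```

ddply() adds the year and team columns to the output itself, which takes care of labelling each subset.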





Matias Ledesma

Oct 16, 2017, 9:48:42 PM
to David Winsemius, Brandon Hurr, manipulatr

Hi Brandon,


Yes, as David wrote, the data set is in pkg plyr.


The reason I sample with replacement and build a matrix is that I need a representative mean for the population. 50 is the number of crustaceans that should optimally be sampled per station each year in the monitoring program, but I often have fewer.

My dataset has 8 variables, which I want to summarize by two variables (year and station). Some variables have a large number of zeros, so I sample with replacement to get a good representation.


I know how to do it if the result is just one list, but not for many.


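library(plyr); library(dplyr)

samp_size <- 50   # rows drawn per resample
iter <- 10        # number of resamples per subset

# For one subset, return an iter x samp_size matrix of row indices
# drawn with replacement; each matrix row is one resample.
funct_df <- function(df) {
  matrix(sample(seq_len(nrow(df)), samp_size * iter, replace = TRUE),
         ncol = samp_size, byrow = TRUE)
}

model <- dlply(baseball, .(year, team), funct_df)

str(model)  # a list of 2527 matrices, each [1:10, 1:50]

# This works only if model is a single index matrix (one subset),
# with x holding the data for that subset:
y <- t(apply(model, 1, function(i) colSums(x[i, ])))

y[1:8, ]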


Matias
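
The single-subset calculation in the post above could be extended to every year and team by doing the resampling and the column sums inside one function and letting ddply() combine and label the results. This is only a sketch under assumptions: resample_sums, boot_sums and the vars vector are hypothetical names chosen for illustration, the columns in vars stand in for the 8 variables of the real dataset, and only the most recent year is used to keep it quick.

```r
library(plyr)
data(baseball, package = "plyr")

samp_size <- 50   # rows drawn per resample
iter <- 10        # resamples per subset
vars <- c("g", "ab", "r", "h", "rbi")  # illustrative choice of numeric columns

# For one year/team subset: draw iter resamples of samp_size rows each
# (with replacement) and return one row of column sums per resample.
resample_sums <- function(df) {
  idx <- matrix(sample(seq_len(nrow(df)), samp_size * iter, replace = TRUE),
                ncol = samp_size, byrow = TRUE)
  sums <- t(apply(idx, 1, function(i) {
    colSums(df[i, vars, drop = FALSE], na.rm = TRUE)
  }))
  data.frame(iteration = seq_len(iter), sums)
}

set.seed(1)  # only so the sketch is reproducible
latest <- subset(baseball, year == max(year))

# ddply splits by year and team, applies resample_sums to each piece,
# and binds the results, adding year and team as label columns.
boot_sums <- ddply(latest, .(year, team), resample_sums)

head(boot_sums)
```

Each year and team then has iter rows, one per resample, so a further ddply(boot_sums, .(year, team), summarise, mean_rbi = mean(rbi)) would give the average over resamples.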



Brandon Hurr

Oct 16, 2017, 11:27:43 PM
to Matias Ledesma, David Winsemius, manipulatr
Matias, 

I'm not sure I understand fully what you're trying to do so I'll describe what I think you're doing and you can correct me. I only play a statistician on the internet, not in real life.

The baseball data is similar enough to your crustacean data. 
You have 8 variables that have been measured over time (Year) and location (Team). You need to do resampling because sometimes you don't have enough data (I have no idea if this is OK, but it seems fishy).

You want 50 rows of data derived from your original dataset for each Year * Team combination, but I think you want to do this 10 times to get an idea of the reliability of those estimates. Sounds like you're bootstrapping. 

Best I can find is this, but it's only one of your groups at a time. 

I think you could use this code though. 

I don't have a ton of time right now, but I think I would do something like:

library(plyr)
library(tidyverse)

iterations <- as.list(1:10)

# Draw 50 rows with replacement from x, once per element of iterations.
# Notice we're iterating over iterations, but not using it in the map call.
sample_iter <- function(x) {
  map(iterations, .f = function(q) sample_n(tbl = x, size = 50, replace = TRUE))
}

baseball %>%
  group_by(year, team) %>%
  nest() %>%
  ungroup() %>%    # so slice() picks rows of the nested table, not rows within each group
  slice(1:10) %>%  # this is so it doesn't do the whole dataset
  mutate(samples = map(data, .f = function(x) sample_iter(x)))
# here we iterate over the nested data and sample it with the function we made



Now you have a nested list column called samples that holds 10 independently sampled datasets for each year and team.

This tutorial on bootstrapping might be useful. You should probably go ahead and do whatever modeling you need inside the sample_iter() function (rename it accordingly) and then summarize the results.

That's my best guess and attempt right now. 

Brandon
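
One way the "do the modeling inside the function, then summarize" suggestion above might look is sketched here. It is a guess at the intent, not Brandon's code: resample_rbi, boot_rbi, n_iter and the output column names are hypothetical, the "model" is just the sum of rbi in each resample, and only the most recent year is used to keep it quick.

```r
library(dplyr)
library(tidyr)
library(purrr)

# baseball lives in plyr; load only the data to avoid plyr/dplyr masking
data(baseball, package = "plyr")

n_iter <- 10      # resamples per group
samp_size <- 50   # rows per resample

# For one group's data: resample n_iter times and return the sum of rbi
# from each resample as a numeric vector of length n_iter.
resample_rbi <- function(x) {
  map_dbl(seq_len(n_iter), function(q) {
    s <- sample_n(x, size = samp_size, replace = TRUE)
    as.numeric(sum(s$rbi, na.rm = TRUE))
  })
}

set.seed(1)  # only so the sketch is reproducible
boot_rbi <- baseball %>%
  filter(year == max(year)) %>%
  group_by(year, team) %>%
  nest() %>%
  ungroup() %>%
  mutate(rbi_sums     = map(data, resample_rbi),
         mean_sum_rbi = map_dbl(rbi_sums, mean),
         sd_sum_rbi   = map_dbl(rbi_sums, sd)) %>%
  select(year, team, mean_sum_rbi, sd_sum_rbi)

boot_rbi
```

The result has one row per year and team, with the mean and standard deviation of the resampled sums giving a rough idea of how stable the estimate is.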

Brandon Hurr

Oct 16, 2017, 11:47:20 PM
to Matias Ledesma, David Winsemius, manipulatr
Sorry, the link to the bootstrapping example didn't make it. https://rstudio-pubs-static.s3.amazonaws.com/173375_d29231e26def4ec1ad9e504755a2995d.html