combining results of different length into a data.frame


DrorD

Mar 15, 2015, 11:28:05 AM
to israel-r-...@googlegroups.com
Hello. I'm applying a custom function on a data.frame using lapply and would like to put the result into a (new) data.frame.

The function takes a vector of integers in the range 1-5 and returns a vector containing: the relative frequency of each value (in %), the relative frequency of NAs (in %), the sum of the relative frequencies (should be 100), and the total number of observations (aka N).

The function code is:

factor.dist=function(vec) {
  res=table(vec, useNA="always")
  res.pc=100*res/sum(res)
  result= c(res.pc, sum(res.pc), sum(res))
}


I have trouble with cases in which some of the variables in the data set (the data.frame) contain only some of the values (e.g. only 2-5), while other variables contain the entire range (1-5).
This means that some applications of the function return values for the entire range:
        1          2          3          4          5       <NA>                      
  3.488372  17.441860  18.604651  17.441860  39.534884   3.488372 100.000000  86.000000
While other applications return a partial representation:
         2          3          4          5       <NA>                      
  2.325581   6.976744  23.255814  67.441860   0.000000 100.000000  86.000000
And of course these cannot be simply coerced into a data.frame (due to different length)

I'm looking for ideas on how to solve this, so that I can combine the two results into one data.frame that would look like:
         1          2          3          4          5       <NA>
  3.488372  17.441860  18.604651  17.441860  39.534884   3.488372 100.000000  86.000000
  0.000000   2.325581   6.976744  23.255814  67.441860   0.000000 100.000000  86.000000

Thank you,
dror
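The mismatch described above can be reproduced with a small made-up data set. This is only a sketch: the data.frame `df` and its columns `q1` and `q2` are hypothetical, chosen so that one variable never takes the value 1.

```r
# Hypothetical data: two items on a 1-5 scale; q2 never uses the value 1.
df <- data.frame(
  q1 = c(1, 2, 2, 3, 4, 5, 5, NA),
  q2 = c(2, 3, 4, 4, 5, 5, 5, 5)
)

# The function from the post, unchanged in behavior.
factor.dist <- function(vec) {
  res <- table(vec, useNA = "always")
  res.pc <- 100 * res / sum(res)
  c(res.pc, sum(res.pc), sum(res))
}

out <- lapply(df, factor.dist)
lengths(out)  # q1 yields 8 entries, q2 only 7, so they cannot be bound row-wise
```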

Yoni Sidi

Mar 15, 2015, 1:59:12 PM
to israel-r-...@googlegroups.com
I am confused by your example: you have more columns (8) than column names, c(seq(1,5),NA). Can you give a reproducible example?

DrorD

Mar 15, 2015, 3:33:34 PM
to israel-r-...@googlegroups.com
Thanks Yoni. It is hard for me to create a reproducible example (I tried), so here are some clarifications:
The last two columns don't have names: the result of the factor.dist function has 8 cells of data, 6 with names (1:5, NA) and two with no labels.
The unlabeled cells are the two sums (sum(res.pc), sum(res)) and are not part of the problem.
The problem is with the table() output, which sometimes returns a 6-cell result (full range 1:5, NA) and sometimes returns a partial result with 5 or 4 cells (e.g. 3-5, NA).

Does it make more sense?
Is there any idea for a solution?

Thanks,
dror

amit gal

Mar 15, 2015, 9:13:12 PM
to israel-r-...@googlegroups.com
If you know in advance all the values that might appear in your vector, you can solve it relatively easily by passing these values to the function:

factor.dist  = function(vec, values = names(table(vec))) {
  #print(vec)
  #print(values)
  res = numeric(length(values))
  names(res) = values
  tb = table(vec,useNA="always")
  res[names(tb)] = tb
  #append any other values to res, now that you have it.
  res
}

If you supply a values vector that contains all possible values for your vec (excluding NA, which is handled anyway), then you'll be fine.
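A related way to express the "known values" idea is to convert the vector to a factor with explicit levels, so that table() reports a zero count for any value that does not occur. The sketch below is not the code from this reply; the default `values = 1:5`, the labels `total.pc` and `N`, and the sample `df` are assumptions made for illustration.

```r
factor.dist <- function(vec, values = 1:5) {
  # Fixed levels force table() to report every value, even with a count of 0.
  res <- table(factor(vec, levels = values), useNA = "always")
  res.pc <- 100 * res / sum(res)
  out <- c(res.pc, sum(res.pc), sum(res))
  # "total.pc" and "N" are illustrative labels for the two sums.
  names(out) <- c(values, "NA", "total.pc", "N")
  out
}

# Hypothetical data: q2 never uses the value 1.
df <- data.frame(
  q1 = c(1, 2, 2, 3, 4, 5, 5, NA),
  q2 = c(2, 3, 4, 4, 5, 5, 5, 5)
)

# Every result now has the same length, so the rows line up.
result <- as.data.frame(t(sapply(df, factor.dist)))
```

Because each call returns a vector of the same length and names, sapply() produces a regular matrix, and transposing it gives one row per variable.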



--
You received this message because you are subscribed to the Google Groups "Israel R User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to israel-r-user-g...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

amit gal

Mar 15, 2015, 10:14:53 PM
to israel-r-...@googlegroups.com
Just thought of an even better idea that does not require knowing the values in advance:
start with the same function as you did, but give names to all the entries in the resulting vector, including the NA. For example:


factor.dist=function(vec) {
  res=table(vec, useNA="always")
  res.pc=100*res/sum(res)
  names(res.pc) = c(names(res)[-length(res)],"NA")
  result = c(res.pc, A= sum(res.pc), B = sum(res))
  result
}

# now, apply your function to all variables, suppose they are in df

library(reshape2)
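# build one long data.frame: one row per (variable, value) pair
long.dist = do.call(rbind, lapply(names(df), function(x) {
  a = factor.dist(df[[x]])
  data.frame(id=x, var=names(a), val=a)
}))
# reshape to wide; naming value.var avoids dcast having to guess the value column
my.dist = dcast(long.dist, var~id, value.var="val")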
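The same name-based alignment can also be sketched in base R, without reshape2: collect every name that occurs in any result, then fill each result into a zero vector that carries all of those names. The sample `df` and the helper names `dists`, `all.names` and `filled` are hypothetical.

```r
factor.dist <- function(vec) {
  res <- table(vec, useNA = "always")
  res.pc <- 100 * res / sum(res)
  names(res.pc) <- c(names(res)[-length(res)], "NA")
  c(res.pc, A = sum(res.pc), B = sum(res))
}

# Hypothetical data: q2 never uses the value 1.
df <- data.frame(
  q1 = c(1, 2, 2, 3, 4, 5, 5, NA),
  q2 = c(2, 3, 4, 4, 5, 5, 5, 5)
)

dists <- lapply(df, factor.dist)

# Union of all names, in order of first appearance.
all.names <- unique(unlist(lapply(dists, names)))

# One row per variable; values that never occur stay at 0.
my.dist <- t(sapply(dists, function(a) {
  filled <- setNames(numeric(length(all.names)), all.names)
  filled[names(a)] <- a
  filled
}))
```

Note that the column order follows the order in which names are first seen, so it may need reordering if the first variable is itself missing some values.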

Yoni Sidi

Mar 16, 2015, 12:21:10 AM
to israel-r-...@googlegroups.com
The function can be called, and its results ordered automatically, with plyr's mdply or ddply, which tag each call with a unique id. The extra values (the two sums) are post-processing: you don't need them in the loop and can calculate them afterwards.

vec.in can be a big data.frame or a matrix of results; it doesn't really matter.

df.out=mdply(vec.in,.fun=function(vec) as.data.frame(table(vec, useNA="always")))
df.out=ddply(vec.in,.(index),.fun=function(vec) as.data.frame(table(vec, useNA="always")))

In my experience table() is a clumsy tool: its output doesn't fit well into further calculations, as you saw. I moved to CrossTable from the gmodels package, which is much more versatile.
yoni
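The "long table with an id, post-process afterwards" idea can be sketched in base R as well. This is not the plyr call itself, only an illustration of the same approach; the sample `df` and the names `long`, `counts` and `pc` are assumptions.

```r
# Hypothetical data: q2 never uses the value 1.
df <- data.frame(
  q1 = c(1, 2, 2, 3, 4, 5, 5, NA),
  q2 = c(2, 3, 4, 4, 5, 5, 5, 5)
)

# One long data.frame of counts, tagged with the variable name as id.
long <- do.call(rbind, lapply(names(df), function(x) {
  tb <- as.data.frame(table(vec = df[[x]], useNA = "always"),
                      stringsAsFactors = FALSE)
  tb$vec[is.na(tb$vec)] <- "NA"  # label the NA category explicitly
  cbind(id = x, tb)
}))

# Cross-tabulate; combinations that never occur get a count of 0.
counts <- xtabs(Freq ~ id + vec, data = long)

# Post-processing after the fact: row percentages and N per variable.
pc <- 100 * prop.table(counts, margin = 1)
N <- rowSums(counts)
```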

DrorD

Mar 17, 2015, 6:28:51 AM
to israel-r-...@googlegroups.com
Amit and Yoni, thank you very much!

Not only did you help me solve the problem, you also introduced me to new packages and functions :-)