combining results of different length into a data.frame

DrorD

Mar 15, 2015, 11:28:05 AM

to israel-r-...@googlegroups.com

Hello. I'm applying a custom function on a data.frame using lapply and would like to put the result into a (new) data.frame.

The function gets a vector of integers in the range 1-5 and returns a vector of: the relative freq. of each value (in %), the relative freq. of NAs (in %), the sum of relative frequencies (should be 100), and the sum of all observations (aka N).

The function code is:

factor.dist=function(vec) {
  res=table(vec, useNA="always")
  res.pc=100*res/sum(res)
  result = c(res.pc, sum(res.pc), sum(res))
}

I have trouble handling the cases in which some of the variables in a dataset (the data.frame) contain only some of the values (e.g. only 2-5) while other variables cover the entire range (1-5).
This means that some applications of the function return values that represent the entire range:
        1          2          3          4          5       <NA>
3.488372 17.441860 18.604651 17.441860 39.534884   3.488372 100.000000 86.000000
While other applications return a partial representation:
         2          3          4          5       <NA>
2.325581   6.976744 23.255814 67.441860   0.000000 100.000000 86.000000
And of course these cannot simply be coerced into a data.frame (because their lengths differ).

I'm looking for ideas on how to solve this so that I can combine the two result vectors into one data.frame that would look like:
       1         2         3         4         5      <NA>
3.488372 17.441860 18.604651 17.441860 39.534884  3.488372 100.000000 86.000000
0.000000  2.325581  6.976744 23.255814 67.441860  0.000000 100.000000 86.000000

Thank you,
dror

Yoni Sidi

Mar 15, 2015, 1:59:12 PM

to israel-r-...@googlegroups.com

I am confused by your example: you have more columns (8) than column names (c(seq(1,5),NA)). Can you give a reproducible example?

DrorD

Mar 15, 2015, 3:33:34 PM

to israel-r-...@googlegroups.com

Thanks Yoni. It is hard for me to create a reproducible example (I tried), so here are some clarifications:
The last two columns don't have names: the result of the factor.dist function has 8 cells of data, 6 with names (1:5, NA) and two with no labels.
The unlabeled cells are the two sums (sum(res.pc), sum(res)) and are not part of the problem.
The problem is with the table() output, which sometimes returns a 6-cell result (full range 1:5, NA) and sometimes returns a partial result with 5 or 4 cells (e.g. 3-5, NA).
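
A minimal sketch that reproduces the length mismatch described above. The data frame `dat` and its columns `v1` and `v2` are made up for illustration (they are not the real data), and the function is the one from the first message with an explicit return value.

```r
# Hypothetical data: v1 uses the full 1-5 range, v2 never takes the value 1.
dat <- data.frame(
  v1 = c(1, 2, 3, 4, 5, NA),
  v2 = c(2, 3, 4, 5, 5, 5)
)

# Same logic as factor.dist() from the first message.
factor.dist <- function(vec) {
  res <- table(vec, useNA = "always")
  res.pc <- 100 * res / sum(res)
  c(res.pc, sum(res.pc), sum(res))
}

out <- lapply(dat, factor.dist)

# The two results have different lengths, so they cannot be row-bound as-is.
sapply(out, length)
```

Because table() only reports the values it actually sees, `v2` gets one cell fewer than `v1`.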

Does it make more sense?
Is there any idea for a solution?

Thanks,
dror

amit gal

Mar 15, 2015, 9:13:12 PM

to israel-r-...@googlegroups.com

If you know in advance all the values that might appear in your vector, you can solve it relatively easily by passing these values to the function:

factor.dist = function(vec, values = names(table(vec))) {
  #print(vec)
  #print(values)
  # start with a zero count for every possible value
  res = numeric(length(values))
  names(res) = values
  tb = table(vec, useNA="always")
  # fill in the observed counts by name
  res[names(tb)] = tb
  #append any other values to res, now that you have it.
  res
}

If you supply a values vector that contains all possible values for your vec (excluding NA, which is handled anyway), then you'll be fine.
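
A related way to get the same effect, sketched here as a variation rather than as the code above: convert the vector to a factor with fixed levels, so that table() reports a zero for every level that does not occur. The names `factor.dist.fixed`, `total.pc`, `N` and the data frame `dat` are made up for this illustration, and the values are assumed to be 1:5.

```r
# Assumption: the possible values are known in advance (here 1:5).
factor.dist.fixed <- function(vec, values = 1:5) {
  # Fixed levels make table() return a (possibly zero) count for each value.
  res <- table(factor(vec, levels = values), useNA = "always")
  res.pc <- 100 * res / sum(res)
  out <- c(res.pc, sum(res.pc), sum(res))
  # Label every cell, including the NA cell and the two sums.
  names(out) <- c(values, "NA", "total.pc", "N")
  out
}

# Hypothetical data: v2 never takes the value 1.
dat <- data.frame(
  v1 = c(1, 2, 3, 4, 5, NA),
  v2 = c(2, 3, 4, 5, 5, 5)
)

# All results now have the same length, so sapply() returns a matrix;
# t() puts one variable per row.
result <- as.data.frame(t(sapply(dat, factor.dist.fixed)))
result
```

Each row of `result` then describes one variable, with a 0 in the column of any value that variable never takes.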

--
You received this message because you are subscribed to the Google Groups "Israel R User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to israel-r-user-g...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

amit gal

Mar 15, 2015, 10:14:53 PM

to israel-r-...@googlegroups.com

Just thought of an even better idea that does not require knowing the values in advance.

Start with the same function as you did, but give names to all the entries in the resulting vector, including the NA cell and the two sums. For example:

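factor.dist = function(vec) {
  res = table(vec, useNA="always")
  res.pc = 100*res/sum(res)
  # name every entry; with useNA="always" the NA cell is the last one
  names(res.pc) = c(names(res)[-length(res)], "NA")
  result = c(res.pc, A = sum(res.pc), B = sum(res))
  result
}

# now, apply your function to all variables, suppose they are in df
library(reshape2)

# long format: one row per (variable, value) pair
long.dist = do.call(rbind, lapply(names(df), function(x) {
  a = factor.dist(df[[x]])
  data.frame(id = x, var = names(a), val = a)
}))

# wide format: one column per variable; fill = 0 for values a variable never takes
my.dist = dcast(long.dist, var ~ id, value.var = "val", fill = 0)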
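
The same alignment-by-name idea can also be sketched in base R, without reshape2. This is an alternative illustration, not the code above: the list `dists` holds made-up results of different lengths, named the way the function above names them.

```r
# Hypothetical results of differing length, one named vector per variable.
dists <- list(
  v1 = c("1" = 20, "2" = 30, "5" = 40, "NA" = 10, A = 100, B = 10),
  v2 = c("2" = 50, "5" = 50, "NA" = 0, A = 100, B = 4)
)

# Collect every name that occurs in any of the results.
all.names <- unique(unlist(lapply(dists, names)))

# Index each result by the full set of names; names it lacks come back
# as NA, which is then replaced by 0.
aligned <- lapply(dists, function(a) {
  out <- a[all.names]
  out[is.na(out)] <- 0
  names(out) <- all.names
  out
})

# One row per variable, one column per value.
my.dist <- as.data.frame(do.call(rbind, aligned))
my.dist
```

This assumes the results themselves never contain NA, which holds for the percentages and counts here; a genuine NA would also be turned into 0.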

Yoni Sidi

Mar 16, 2015, 12:21:10 AM

to israel-r-...@googlegroups.com

The function can be called, and its results ordered automatically, in plyr using mdply or ddply, which will organise each call under a unique id. The extra variables are postprocessing: you don't need them in the loop, you can calculate them after the fact.

vec.in can be a big df with results or a matrix with results; it doesn't really matter.

df.out=mdply(vec.in,.fun=function(vec) as.data.frame(table(vec, useNA="always")))

df.out=ddply(vec.in,.(index),.fun=function(vec) as.data.frame(table(vec, useNA="always")))

In my past experience table is a clumsy tool to use; it doesn't fit well with further calculations, as you saw. I moved to CrossTable from the gmodels package, which was much more versatile.

yoni


DrorD

Mar 17, 2015, 6:28:51 AM

to israel-r-...@googlegroups.com

Amit and Yoni, thank you very much!

Not only did you help me solve the problem, you also introduced me to new packages and functions :-)
