Feb 27: Data Management

Bill Evans

Feb 1, 2015, 11:36:28 PM

Date: Friday, Feb 27
Time: ~~4:30pm~~ 5:15pm
Location: MIIS, ~~actual room TBD~~ McGowan Building, room MG001
Topic: Data Management

For many, “data management” consists of basic subsetting and sorting of an R data.frame. However, when it comes time to do level-specific processing or to merge the data with another data.frame, we often resort to exporting it to a spreadsheet, manually band-aiding things together, and eventually bringing it back into R for the statistical analysis. Let’s discuss ways we can do all of this in R, more easily and more quickly.
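As a rough illustration of what staying in R can look like, here is a minimal base-R sketch of a merge followed by level-specific (group-wise) processing. The data frames `obs` and `sites` and their columns are made up for this example and are not taken from the talk materials.

```r
# Hypothetical example data: measurements per site, plus a site lookup table.
obs <- data.frame(
  site  = c("A", "A", "B", "B", "C"),
  value = c(1, 3, 10, 20, 5),
  stringsAsFactors = FALSE
)
sites <- data.frame(
  site   = c("A", "B", "C"),
  region = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# Merge on the shared key column instead of copy/pasting in a spreadsheet.
merged <- merge(obs, sites, by = "site")

# Level-specific processing: mean of 'value' within each region.
region_means <- aggregate(value ~ region, data = merged, FUN = mean)

# Deviation of each row from the mean of its own site, using ave().
merged$site_dev <- merged$value - ave(merged$value, merged$site)
```

Here `merge()` does the table join, `aggregate()` produces one row per level, and `ave()` returns a group statistic aligned with the original rows, so nothing has to leave R.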

Subtopics:

Review basic data.frame subsetting and management within R;
Row-wise operations using foreach and iterators, facilitating simple parallelization in a multi-core or cluster environment;
Use of dplyr and tidyr for row-wise operations, group-wise operations, summarizing, and complex merging;
Dealing with large numbers of data files, e.g., a directory full of CSVs for aggregate or sequential analysis.
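For the last subtopic, one common base-R pattern (a sketch only, not the script from the session) is to list the files, read each one into a data.frame, and stack the results. The directory, file names, and columns below are invented so that the example is self-contained.

```r
# Create a throwaway directory with three small CSVs (stand-ins for real data).
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = 1:2, value = i * c(10, 20)),
            file.path(csv_dir, sprintf("run%d.csv", i)),
            row.names = FALSE)
}

# Find every CSV in the directory.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a data.frame, tagging rows with the source file name.
frames <- lapply(files, function(f) {
  d <- read.csv(f, stringsAsFactors = FALSE)
  d$source <- basename(f)
  d
})

# Stack them into a single data.frame for aggregate analysis.
combined <- do.call(rbind, frames)

# One summary value per file.
totals <- tapply(combined$value, combined$source, sum)
```

Because each file is read independently, the `lapply()` step is also the natural place to swap in a parallel loop (for example with foreach) when the directory is large.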

(For those who came in January to the “(Relative) (Re)Introduction to R”, this will be an expansion of the middle section, somewhere between “apply” and “multi-plot graphic layout options”.)

Fernando DePaolis

Feb 28, 2015, 12:51:52 PM

to montere...@googlegroups.com

Bill,

Last night's session was great! There were a couple of very neat tricks, and I'd like to play with them myself. Is there a chance you can post the script you used? Much appreciated.

~fernando

Bill Evans

Feb 28, 2015, 2:15:22 PM

to montere...@googlegroups.com

Thanks, Fernando! Attached are the three scripts:

dataStructs.R provides the functions used to automate the walk-through in rglstuff.R;
rglstuff.R is a walk-through that tries to visualize different data structures. Though it’s not very self-documenting, most of the examples show the code that is being depicted. I have some thoughts on how to improve it, and will likely do so in the future on the group's GitHub site (project forthcoming);
notes.R is the remainder of the talk, walking through the *apply family of functions, dplyr, tidyr, etc.

Questions, comments, suggestions, and pull-requests are welcome!

-r2

notes.R

rglstuff.R

dataStructs.R
