Date: Friday, Feb 27
Time: 4:30pm to 5:15pm
Location: MIIS, McGowan Building, room MG001
Topic: Data Management
For many, “data management” consists of basic subsetting and sorting of an R data.frame. However, when it comes time to do level-specific processing or to merge the data with another data.frame, we often resort to exporting it to a spreadsheet, manually band-aiding things together, and eventually bringing it back into R for the statistical analysis. Let’s discuss ways to do all of this in R, more easily and faster.
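As a rough illustration of keeping level-specific processing and merging inside R, here is a minimal base-R sketch. The data frames `measurements` and `sites`, and all of their columns, are made up for this example and are not from the talk materials.

```r
# Hypothetical example data: one row per measurement, plus a lookup table.
measurements <- data.frame(
  site  = c("A", "A", "B", "B", "C"),
  value = c(1, 3, 10, 20, 7),
  stringsAsFactors = FALSE
)
sites <- data.frame(
  site   = c("A", "B", "C"),
  region = c("north", "north", "south"),
  stringsAsFactors = FALSE
)

# Level-specific processing: mean of 'value' within each level of 'site'.
site_means <- aggregate(value ~ site, data = measurements, FUN = mean)
names(site_means)[names(site_means) == "value"] <- "mean_value"

# Merge the per-site summary with the lookup table (inner join on 'site').
merged <- merge(site_means, sites, by = "site")

# Sort by descending mean, all without leaving R.
merged <- merged[order(-merged$mean_value), ]
print(merged)
```

The same steps can also be written with dplyr (`group_by()`, `summarise()`, and the join functions); base R is used here only so the sketch runs without extra packages.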
Subtopics:
Review basic data.frame subsetting and management within R;
Row-wise operations using foreach and iterators, facilitating simple parallelization in a multi-core or cluster environment;
Use of dplyr and tidyr for row-wise operations, group-wise operations, summarizing, and complex merging;
Dealing with large numbers of data files, e.g., a directory full of CSVs for aggregate or sequential analysis.
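For the last subtopic, one common base-R pattern is to list the files, read each into a data.frame, and stack the results. The sketch below is only an assumption about how this might look: it writes a few throwaway CSVs to a temporary directory first so that it is self-contained, and the file names, columns, and the `read_one` helper are hypothetical.

```r
# Create a throwaway directory with a few small CSVs (hypothetical data).
csv_dir <- file.path(tempdir(), "csv_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(id = 1:2, x = c(i, i * 10)),
            file.path(csv_dir, sprintf("run%02d.csv", i)),
            row.names = FALSE)
}

# Find every CSV in the directory.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: read one file and record which file each row came from.
read_one <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  df$source <- basename(path)
  df
}

# Read all files and stack them into a single data.frame.
all_runs <- do.call(rbind, lapply(files, read_one))

# Aggregate analysis across files: total of 'x' per source file.
totals <- aggregate(x ~ source, data = all_runs, FUN = sum)
print(totals)
```

Keeping the file name in a `source` column makes it possible to go back to per-file (sequential) analysis after the files have been combined.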
(For those who came in January to the “(Relative) (Re)Introduction to R”, this will be an expansion of the middle section, somewhere between “apply” and “multi-plot graphic layout options”.)
Thanks, Fernando! Attached are the three scripts:
dataStructs.R provides the functions used to automate the walk-through in rglstuff.R;
rglstuff.R is a walk-through that tries to visualize different data structures. Though it’s not very self-documenting, most of the examples show the code that is being depicted. I have some thoughts on how to improve this, and will likely do so in the future on the group’s GitHub site (project forthcoming);
notes.R is the remainder of the talk, walking through the R functions *apply, dplyr, tidyr, etc.
Questions, comments, suggestions, and pull requests are welcome!
-r2