Monday, 5 January 2015

Using refset in your R analyses

refset is an R package that creates subsets which refer to an original dataset. Here is a short example of how to use refset to simplify your data analysis.

I have a dataset from a laboratory experiment. In total, 532 subjects participated. Each subject was in a group of 4 and made decisions in 20 rounds. There were several treatments.

I typically create two scripts: R-create-data.R makes a data frame from my raw files, including any computed variables, and saves it to a .RData file. R-analysis.R loads that file and runs any statistics or graphs. That way, I can rerun my analyses without recreating the data.
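
As a minimal sketch of this two-script pattern, the snippet below uses a toy data frame and a temporary file. Both are stand-ins for illustration, not my real raw data or data.RData.

```r
# --- what R-create-data.R does, in miniature ---
# Toy stand-in for the real data: 2 subjects x 3 rounds (hypothetical values)
subj <- data.frame(
  subject = rep(1:2, each = 3),
  Period  = rep(1:3, times = 2)
)
data_file <- tempfile(fileext = ".RData")  # stand-in for "data.RData"
save(subj, file = data_file)

# --- what R-analysis.R does, in miniature ---
rm(subj)                   # simulate starting a fresh session
loaded <- load(data_file)  # load() returns the names of the restored objects
print(loaded)
print(nrow(subj))
```

Because load() restores objects under their saved names, the analysis script can use subj directly without rebuilding it.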

At the end of R-create-data.R, I have made a data frame called subj. A row of subj records one decision by one subject, so it contains 532 * 20 = 10640 rows. The script then creates subsets of subj to represent, e.g., the data from particular treatments:
subjhist <- subj[subj$treatment1=="H",]
subjPX <- subj[subj$treatment1=="PX",]
subjn <- subj[subj$treatment1=="N",]
subjst <- subj[subj$treatment1=="S",]
save(subj, subjhist, subjPX, subjn, subjst, file="data.RData")
In theory, the script has created all the variables I need, and now I can just run regressions using whichever subset is appropriate. In practice, I will look at the data and think of a new analysis I should do, which requires a new variable.* Hmm... did my subjects play differently in earlier rounds than in later ones?

When that happens, I need to create a new variable in all my datasets:
subj$lateround <- subj$Period > 10
subjPX$lateround <- subjPX$Period > 10
subjhist$lateround <- subjhist$Period > 10
## et cetera...
Eventually, I will get round to including lateround in R-create-data.R, rerunning that file, and removing the code above from my analysis script. But in the meantime, doing this is a hassle, and creates repetitive, verbose, unreadable code.
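
The underlying reason is that an ordinary subset is a copy, taken when the subset was created. The sketch below shows this on a toy data frame. The values are hypothetical; only the column names treatment1 and Period come from my data.

```r
# Toy stand-in for subj (hypothetical values)
subj <- data.frame(
  treatment1 = rep(c("H", "PX"), each = 4),
  Period     = rep(c(5, 10, 11, 20), times = 2)
)
subjPX <- subj[subj$treatment1 == "PX", ]  # an ordinary subset: a copy

subj$lateround <- subj$Period > 10         # new variable in the main data only

print("lateround" %in% names(subj))    # TRUE
print("lateround" %in% names(subjPX))  # FALSE: the copy was made earlier
```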

Let's replace those subsets by refsets.
library(refset)
subjhist %r% subj[subj$treatment1=="H",]
subjPX %r% subj[subj$treatment1=="PX",]
subjn %r% subj[subj$treatment1=="N",]
subjst %r% subj[subj$treatment1=="S",]
Each of these variables now refers to the main data frame subj, rather than being a copy of it.
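
To give a feel for what "refers to" means, here is a rough base R sketch of a variable that recomputes a subset each time it is read, built with makeActiveBinding. This only illustrates the idea and is not a description of how refset is implemented. The name subjPX_view and the toy values are hypothetical.

```r
# Toy stand-in for subj (hypothetical values)
subj <- data.frame(
  treatment1 = rep(c("H", "PX"), each = 4),
  Period     = rep(c(5, 10, 11, 20), times = 2)
)

# Bind the name subjPX_view to a function that rebuilds the subset from
# the current subj every time the name is read.
makeActiveBinding(
  "subjPX_view",
  function() subj[subj$treatment1 == "PX", ],
  environment()
)

subj$lateround <- subj$Period > 10  # added after the binding was created
print(subjPX_view$lateround)        # FALSE FALSE TRUE TRUE
```

This sketch is read-only: assigning to subjPX_view raises an error, because the bound function takes no argument.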

We'll put this in R-analysis.R, so that it happens after our data is loaded. Otherwise, saving and loading breaks the connection between refset and data frame, and simply creates subsets as before. The data will look just the same, and I can run my existing analyses unchanged.

Now, when I create a new variable, I just do it in the main data frame:
subj$lateround <- subj$Period > 10
The variable is then visible in all of my refsets:
head(subjPX$lateround, 20)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
## [14]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## same for subjhist, subjn...

Problem solved: I can create new variables once, not several times, and immediately use them in any analysis.

* Yes, I know... I should have prepared all my analyses ex ante and simply run them at the press of a button. Otherwise I am just doing post hoc exploration and curve fitting. I applaud the ideal and strive to live up to it, but the real world isn't always like that -- and in any case, sometimes post hoc exploration is worth doing.