Monday, 12 June 2006

R tips

I've been working with large datasets in R and thought I would share some tips. Of interest to statisticians only!

1. On Windows, get Tinn-R. You don't want to be working with the basic R editor. On Linux, use Emacs or vi as you prefer; Emacs is supposedly pretty good.

2. Record everything you do - don't rely on saving the history for this, as it will be indecipherably messy. The ideal is that anyone should be able to replicate your results, from publicly available data, just by running your file of commands. It also helps a lot when you lose data and have to go back and redo it.

3. You will probably end up with a lot of temporary variables. To know which variables you can safely delete, give temporary variables a dot at the end of their name, like:

for (yr. in dataset$years)
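One payoff of the trailing-dot convention is that cleanup becomes a one-liner. Here is a minimal sketch; the years vector and the tmp. and total variables are made up for illustration and are not from my own scripts:

```r
# Hypothetical data standing in for dataset$years
years <- c(2004, 2005, 2006)

total <- 0
for (yr. in years) {
  tmp. <- yr. - 2000      # temporary: note the trailing dot
  total <- total + tmp.
}

# Collect every variable whose name ends in a dot and delete them all
temporaries <- ls(pattern = "\\.$")
rm(list = temporaries)
```

After this, total is still there but yr. and tmp. are gone. A trailing dot works better than a leading one here, because ls() hides names that start with a dot by default.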

4. Save shortened versions of often-used commands in your Rprofile file (on Windows, this will be in c:\program files\R\R-\etc\).

For example, I like to type hs instead of the full name of a function from the utils package. If I just put

hs <-

in, I will get an error on startup. This is because the function is in the utils package, which is not loaded until after the Rprofile has been executed. So you have to be a little sneaky:

setHook(packageEvent("utils", "onLoad"), function (...) {
hs <<-

This means that when utils gets loaded, the function gets assigned to hs. The double-headed arrow (<<-), by the way, assigns outside the current function, which here means in the global workspace. Otherwise, the assignment would only happen in the body of our function, which would be useless.
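For a complete version of the hook, here is a sketch. The function being abbreviated is an assumption on my part for this example: I use utils::help.search as a stand-in, so swap in whatever you actually want to shorten.

```r
# ASSUMPTION: utils::help.search is a stand-in chosen for illustration;
# replace it with the utils function you want a short name for.
setHook(packageEvent("utils", "onLoad"), function(...) {
  # <<- makes hs visible in the global workspace, not just inside the hook
  hs <<- utils::help.search
})
```

The hook only fires when utils is loaded, so pasting this into a session where utils is already attached registers the hook but does not create hs. It is meant to live in the Rprofile, which runs before the default packages load.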

5. You can create new operators. For example, I like to be able to type
1:5 %-% 3
and get c(1, 2, 4, 5), i.e. the numbers from 1 to 5 with 3 removed. Another line in my Rprofile:
"%-%" <<- setdiff
The setdiff function does what I want, but with %-% it's quicker and more intuitive to read. Similarly

"%like%" <<- function(x, y) grep(y, x, perl=TRUE)
means I can type state[state$name %like% "Al.*",] and get data for Alabama and Alaska.
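If you want to try both operators without touching your Rprofile, here is a small self-contained sketch. It uses the built-in state.name vector from the datasets package rather than my own state data frame, and plain <- because it runs at the top level:

```r
# Define the two operators in the current session
`%-%` <- setdiff
`%like%` <- function(x, y) grep(y, x, perl = TRUE)

# Numbers from 1 to 5 with 3 removed
without_three <- 1:5 %-% 3

# State names starting with "Al"; the ^ anchors the match at the start
al_states <- datasets::state.name[datasets::state.name %like% "^Al"]

print(without_three)
print(al_states)   # "Alabama" "Alaska"
```

Note that grep returns the positions of the matches, not a logical vector, which is why it can be used directly as a row index.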

6. Statisticians tend to want to put everything in one huge table. So for example, if they have 50000 Eurobarometer respondents and they want to use respondent's nation's GDP as an independent variable, they'll create a big table with 50000 rows:
name | nation | GDP | ... other national variables
This is fine until you want to take the log of GDP and it takes the computer five minutes to compute 50000 new values, most of which are duplicates. Take a hint from database administration: keep national variables in a separate data frame, with one row for each nation. Then merge them once you have created all your independent variables, before you start running regressions.
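Here is roughly what that looks like with merge. The data frames, column names and GDP figures below are toy values invented for the example, not real Eurobarometer or GDP data:

```r
# Toy respondent-level data (invented for illustration)
respondents <- data.frame(
  name   = c("r1", "r2", "r3", "r4"),
  nation = c("FR", "DE", "FR", "IT"),
  stringsAsFactors = FALSE
)

# Toy nation-level data: one row per nation, made-up GDP numbers
nations <- data.frame(
  nation = c("DE", "FR", "IT"),
  GDP    = c(300, 200, 100),
  stringsAsFactors = FALSE
)

# Transform at the national level: 3 computations, not one per respondent
nations$logGDP <- log(nations$GDP)

# Merge only once the national variables are ready
full <- merge(respondents, nations, by = "nation")
```

The log is computed once per nation and then copied to each respondent by the merge, so the expensive step scales with the number of nations rather than the number of respondents.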

(If you want to know more about how to create good databases, here's a good guide.)

7. Tired of typing brackets all the time to run simple commands? Here's a neat hack:
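print.command <- function (x, ...) {
# Default arguments, if any, are stored as an attribute on the function
default.args <- attr(x, "default.args")
if (! length(default.args)) default.args <- list()
# Call the function in the caller's environment and print its result
print(do.call(x, default.args, envir=parent.frame()))
}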

class(ls) <- c("command", class(ls))
class(search) <- c("command", class(search))

Now you can type ls or search at the command line without brackets. The magic here is that the end result of any command line is printed, i.e. the print method is called on the object. If we give ls a class of "command", the function print.command gets called when we evaluate ls. This then runs the function in the command line environment. To set up default arguments, do, e.g.:

attr(ls, "default.args") <- list(all.names=TRUE)

It would be nice to be able to type fix foo instead of fix(foo), but I don't think it's possible. Correct me if you know better.
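To see the mechanism in isolation, here is a sketch that applies the same trick to a made-up function instead of ls. The status function and its prefix argument are hypothetical, and the use of do.call inside print.command is my assumption about how the function gets run:

```r
# Print method for "command" objects: run the function instead of showing it.
# ASSUMPTION: do.call is used to invoke the function with its stored defaults.
print.command <- function(x, ...) {
  default.args <- attr(x, "default.args")
  if (!length(default.args)) default.args <- list()
  print(do.call(x, default.args, envir = parent.frame()))
}

# Hypothetical function to turn into a bracket-free command
status <- function(prefix = "time:") paste(prefix, "ok")
class(status) <- c("command", class(status))
attr(status, "default.args") <- list(prefix = "status:")

# Typing `status` at the prompt triggers this same print call implicitly
print(status)
```

The explicit print(status) is only there so the example also works when run as a script, where nothing is auto-printed.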

NEW 8. You don't have to save everything in one workspace. This is the easiest way to go at first, but when your data becomes large and takes minutes to load, you can separate it into different workspaces and load only the bits you need. To do this, instead of save.image, use

save(foo1, file="foo1.RData")
save(foo2, file="foo2.RData")

et cetera. You then load these in the normal way.
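A quick round trip shows the idea. The objects are toy stand-ins, and I write to a temporary directory here only so the sketch does not litter your working directory:

```r
# Toy objects standing in for two large datasets
foo1 <- data.frame(x = 1:3)
foo2 <- c(a = 1, b = 2)

# Save each object to its own file (temporary directory for the demo)
f1 <- file.path(tempdir(), "foo1.RData")
f2 <- file.path(tempdir(), "foo2.RData")
save(foo1, file = f1)
save(foo2, file = f2)

# Simulate a fresh session, then load only the piece that is needed
rm(foo1, foo2)
load(f1)
```

After the load, foo1 is back under its original name and foo2 stays on disk until you ask for it.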

That's your lot! For more, check out R tips, and of course the occasionally grumpy but always enlightening R-help mailing list.