R: jackknife the coefficients of a linear regression model

19Dec08

jackknifeFor one of my statistics classes I had to do a jackknife (leave-on-out) estimation of a the parameters of simple linear regression model.

The difficulty with the jackknife method from the bootstrap package is that by default it returns a scalar only. Thus some programming is needed to return all the coefficients. Thanks to Simon Knapp for helping me out here (see r-help, dec. 2008).

###############################################################

library(bootstrap)
# to do a leave-on-out jackknife estimate for the mean of the
# data ?jackknife gives an example
# Having a look at the jackknife function we see that it demands
# two parameters: x and theta. x is supposed to contain the data
# and theta the function that is applied to the data

x <- rnorm(20)
theta <- function(x){mean(x)}
results <- jackknife(x,theta)

###############################################################

Now I want to program the estimation for one coefficient of a linear regression model.

###############################################################

DF <- as.data.frame(matrix(rnorm(250), ncol=5))   # my data
model.lm <- formula(V1 ~ V2 + V3 + V4)             # my model

# Now I need to specify the theta function. Here x is not the
# data itself but is used as the row index vector to select
# a subset from the data frame (xdata). Also the coefficient
# to be returned is specified. 

theta <- function(x, xdata, coefficient){
              coef(lm(model.lm, data=xdata[x,]))[coefficient] }

# So now at each leave-on-out run the model is calculated with
# a subset defined by the vector x (here one is left out) and one
# coefficient is returned:

results <- jackknife(1:50, theta, xdata=DF,
                     coefficient="(Intercept)")

###############################################################

To expand this code onto the estimation of all the regression coefficients is only a small step now. As the theta function is supposed to return a scalar and not a list of estimates for each coefficient, the following workaround is used: The sapply function calls the jackknife function four times prompting a different parameter estimate at each run. The prompted coeffient is passed on to the jackknife function by the three point (…) argument .



###############################################################

# The following function calculates all then coefficients

jackknife.apply <- function(x, xdata, coefs)
{
sapply(coefs,
       function(coefficient) jackknife(x, theta, xdata=xdata,
                                       coefficient=coefficient),
        simplify=F)
}

# now jackknife.apply() can be called

results <- jackknife.apply(1:50, DF, c("(Intercept)", "V2", 
                                       "V3", "V3"))

###############################################################

The output is a list containing four elements with attributes as aspecified in the jackknife function. So the last step is to bring the output into a nice format, let’s say like this:

+--------------------------------------------------------------+
|               Intercept      V1          V2         V3       |
|    1             0.34        2.2        .03         1.1      |
|    2             0.29        1.9        .11         1.2      |
|    ...           ...        ...         ...         ...      |

I will use Hadley Wickhams plyr package functions for that – quite a new package (oct. 2008) which is really worthwhile having a look at! I am just starting… The following solution sure is a kludge but does the job.


###############################################################

# ldply() takes a list, applies a function and puts the results
# into a data frame (unfortunately by cbind(), thus the
# transposition). Here the function only selects the
# $jack.values from each list element

library(plyr)

jack.values <- t(ldply(results, function(x) x$jack.values))
dimnames(jack.values)[[2]]  <- names(results)
dimnames(jack.values)[[1]]  <- 1:50

###############################################################

Now let’s do the same for the other $jack.se and $jack.bias.

###############################################################

jack.par <- t(ldply(results, function(x) cbind(x$jack.se, x$jack.bias)))
dimnames(jack.par)[[2]]  <- names(results)
dimnames(jack.par)[[1]]  <- c("jack.se", "jack.bias")

###############################################################

Although the function works perfectly its obvious disatvantage is its inefficiency. Normally only n models would be calculated leaving on data element out each time. Now the number of calculated models multiplies with number of estimated model coefficients, so its n times estimated regression parameters. This surely boosts calculation time! If anyone knows a solution here, please feel free to comment!

mh

About these ads


One Response to “R: jackknife the coefficients of a linear regression model”

  1. 1 SW

    I know this post was written some time ago, but I am trying to amend this code to accommodate jackknifing regression coefficients from mixed effect models. Within theta, using ‘coef(lme(model, random=~1|plot, data=…’ with everything else as before gives me an error that reads ‘Error in u – mean(u) : non-numeric argument to binary operator’. All of my variables are numeric save the random term.. I can retrieve coefficients from an lme model using coef… any suggestions as to what the problem is?



Follow

Get every new post delivered to your Inbox.

Join 51 other followers

%d bloggers like this: