### R: jackknife the coefficients of a linear regression model

19Dec08

For one of my statistics classes I had to do a jackknife (leave-on-out) estimation of a the parameters of simple linear regression model.

The difficulty with the jackknife method from the bootstrap package is that by default it returns a scalar only. Thus some programming is needed to return all the coefficients. Thanks to Simon Knapp for helping me out here (see r-help, dec. 2008).

```###############################################################

library(bootstrap)
# to do a leave-on-out jackknife estimate for the mean of the
# data ?jackknife gives an example
# Having a look at the jackknife function we see that it demands
# two parameters: x and theta. x is supposed to contain the data
# and theta the function that is applied to the data

x <- rnorm(20)
theta <- function(x){mean(x)}
results <- jackknife(x,theta)

###############################################################```

Now I want to program the estimation for one coefficient of a linear regression model.

```###############################################################

DF <- as.data.frame(matrix(rnorm(250), ncol=5))   # my data
model.lm <- formula(V1 ~ V2 + V3 + V4)             # my model

# Now I need to specify the theta function. Here x is not the
# data itself but is used as the row index vector to select
# a subset from the data frame (xdata). Also the coefficient
# to be returned is specified.

theta <- function(x, xdata, coefficient){
coef(lm(model.lm, data=xdata[x,]))[coefficient] }

# So now at each leave-on-out run the model is calculated with
# a subset defined by the vector x (here one is left out) and one
# coefficient is returned:

results <- jackknife(1:50, theta, xdata=DF,
coefficient="(Intercept)")

###############################################################```

To expand this code onto the estimation of all the regression coefficients is only a small step now. As the theta function is supposed to return a scalar and not a list of estimates for each coefficient, the following workaround is used: The sapply function calls the jackknife function four times prompting a different parameter estimate at each run. The prompted coeffient is passed on to the jackknife function by the three point (…) argument .

```
###############################################################

# The following function calculates all then coefficients

jackknife.apply <- function(x, xdata, coefs)
{
sapply(coefs,
function(coefficient) jackknife(x, theta, xdata=xdata,
coefficient=coefficient),
simplify=F)
}

# now jackknife.apply() can be called

results <- jackknife.apply(1:50, DF, c("(Intercept)", "V2",
"V3", "V3"))

###############################################################```

The output is a list containing four elements with attributes as aspecified in the jackknife function. So the last step is to bring the output into a nice format, let’s say like this:

```+--------------------------------------------------------------+
|               Intercept      V1          V2         V3       |
|    1             0.34        2.2        .03         1.1      |
|    2             0.29        1.9        .11         1.2      |
|    ...           ...        ...         ...         ...      |```

I will use Hadley Wickhams plyr package functions for that – quite a new package (oct. 2008) which is really worthwhile having a look at! I am just starting… The following solution sure is a kludge but does the job.

```
###############################################################

# ldply() takes a list, applies a function and puts the results
# into a data frame (unfortunately by cbind(), thus the
# transposition). Here the function only selects the
# \$jack.values from each list element

library(plyr)

jack.values <- t(ldply(results, function(x) x\$jack.values))
dimnames(jack.values)[[2]]  <- names(results)
dimnames(jack.values)[[1]]  <- 1:50

###############################################################```

Now let’s do the same for the other \$jack.se and \$jack.bias.

```###############################################################

jack.par <- t(ldply(results, function(x) cbind(x\$jack.se, x\$jack.bias)))
dimnames(jack.par)[[2]]  <- names(results)
dimnames(jack.par)[[1]]  <- c("jack.se", "jack.bias")

###############################################################```

Although the function works perfectly its obvious disatvantage is its inefficiency. Normally only n models would be calculated leaving on data element out each time. Now the number of calculated models multiplies with number of estimated model coefficients, so its n times estimated regression parameters. This surely boosts calculation time! If anyone knows a solution here, please feel free to comment!

mh