R: Combining vectors or data frames of unequal length into one data frame

23Jan09

Today I will treat a problem I encounter every once in a while. Let’s suppose we have several dataframes or vectors of unequel length but with partly matching column names,  just like the following ones:

df1 <- data.frame(Intercept = .4, x1=.4, x2=.2, x3=.7)
df2 <- data.frame(Interceptlego = .5,        x2=.8       )

This for example may occur when fitting several multiple regression models each time using different combination of regressors. Now I would like to combine the results into one data frame.  The merge() as well as the rbind() function do not help here as they require equal lengths.

I posted this matter on r-help as my first solution was somewhat awkward and could not be generalized to any data frames or list of data frames. The first solution was posted by Charles C. Berry. myList is a list containing the data frames as elements

myList <- list(df1, df2)

What he does is to use a nested loop. The inner loop runs for each data frame over each column name. It basically takes each column name and the correponding element [i, j] from the data frame ( myList[[i]] ) and writes it into an empty data frame (dat). Thereby a new column that is named just like the column from the list element data frame is created. The cells that are left out are automatically set NA.

dat <- data.frame()
for(i in seq(along=myList)) for(j in names(myList[[i]]))
                                 dat[i,j] <- myList[[i]][j]
dat

Note that the order of the output columns depends on the input order. The list below renders a different order, though it contains the same elements but ordered differently.

myList <- list(df2, df1)

  Intercept  x2  x1  x3
1       0.5 0.8  NA  NA
2       0.4 0.2 0.4 0.7

Another solution was posted by Henrique Dallazuanna. This one has the advantage that it does not use loops.

l <- myList
do.call(rbind, lapply(lapply(l, unlist), "[",
        unique(unlist(c(sapply(l,names))))))

It looks a bit scary at first, so let’s examine it starting from the inside.

# a list of names from each list element
c(sapply(l,names))

# unlist them and find unique names
unique(unlist(c(sapply(l,names))))

# gives unlisted vectors with column names for each list element
lapply(l, unlist) 

As a next step for each vector with column names all columns are selected leaving those that are not present with NA values.

listOfVectors <- lapply(lapply(l, unlist), "[",
                        unique(unlist(c(sapply(l,names)))))

As a last step the vectors having the same columns are combined.

do.call(rbind, listOfVectors)
# or in full
DF <- do.call(rbind, lapply(lapply(l, unlist), "[",
              unique(unlist(c(sapply(l,names))))))

The only little flaw in this function is that the column names of the first vector are taken as column names of the developing data frame. Using the second list from above, gives the following.

l <- list(df2, df1) 
     Intercept  x2 <NA> <NA>
[1,]       0.5 0.8   NA   NA
[2,]       0.4 0.2  0.4  0.7

Thus, in a last step we need change the column names of the data frame.

DF <- as.data.frame(DF)
names(DF) <- unique(unlist(c(sapply(l,names))))
DF

Well this works but it would be much more convenient to get this done in one single function and well, since october 2008 there is one. It can be found in the plyr package written by Hadley Wickham. So the solution is as easy as:

library(plyr)
l <- myList
do.call(rbind.fill, l)

# another example

l <- list(data.frame(a=1, b=2), data.frame(a=2, c=3, d=5))
do.call(rbind.fill, l)

The results:

  Intercept  x1  x2  x3
1       0.4 0.4 0.2 0.7
2       0.5  NA 0.8  NA

Now, this is nice! It is really worthwhile having a look at Hadley Wickhams plyr package as it provides a lot of functions that make life a lot easier when it comes to splitting list or data frames, doing a calculation or not and merge them afterwards again. More on that another day.

Cheers, Mark

About these ads


10 Responses to “R: Combining vectors or data frames of unequal length into one data frame”

  1. 1 Pete D.

    This was very helpful!
    Thanks.

    Pete D.

  2. 2 Chris A.

    Big thanks, this was just what I have been looking for.

  3. 3 John A

    fantastic; just what I needed for combining tables contructed from unequal subsets of a large data file. regards jac

  4. For the newest version of my ‘qpcR’ package I wrote a function called
    “cbind.na” which is based on the methods:::cbind function but has added code that brings all vectors (or columns in a data.frame or matrix) to the same length (the maximum length encountered) by adding NA’s.

    The function is:
    cbind.na <- function (…, deparse.level = 1)
    {
    na <- nargs() – (!missing(deparse.level))
    deparse.level <- as.integer(deparse.level)
    stopifnot(0 <= deparse.level, deparse.level <= 2)
    argl <- list(…)

    ### determine max length
    tempLEN <- NULL
    for (i in 1:length(argl)) {
    DIM <- dim(argl[[i]])
    if (is.null(DIM)) tempLEN[i] <- length(argl[[i]])
    else tempLEN[i] <- max(apply(argl[[i]], 2, function(x) length(x)))
    }
    maxLEN <- max(tempLEN)
    ### added NA fill to max length
    for (i in 1:length(argl)) {
    DIM <- dim(argl[[i]])
    if (is.null(DIM)) argl[[i]] <- c(argl[[i]], rep(NA, maxLEN – length(argl[[i]])))
    else argl[[i]] 0 && is.null(argl[[na]])) {
    argl <- argl[-na]
    na <- na – 1
    }

    if (na == 0)
    return(NULL)
    if (na == 1) {
    if (isS4(..1))
    return(cbind2(..1))
    else return(.Internal(cbind(deparse.level, …)))
    }
    if (deparse.level) {
    symarg <- as.list(sys.call()[-1L])[1L:na]
    Nms <- function(i) {
    if (is.null(r <- names(symarg[i])) || r == "") {
    if (is.symbol(r <- symarg[[i]]) || deparse.level ==
    2)
    deparse(r)
    }
    else r
    }
    }
    if (na == 2) {
    ### changed to second argl item
    r <- argl[[2]]
    #r <- ..2
    fix.na <- FALSE
    }
    else {
    nrs <- unname(lapply(argl, nrow))
    iV <- sapply(nrs, is.null)
    fix.na <- identical(nrs[(na - 1):na], list(NULL, NULL))
    if (fix.na) {
    nr <- max(if (all(iV)) sapply(argl, length) else unlist(nrs[!iV]))
    argl[[na]] <- cbind(rep(argl[[na]], length.out = nr),
    deparse.level = 0)
    }
    if (deparse.level) {
    if (fix.na)
    fix.na <- !is.null(Nna <- Nms(na))
    if (!is.null(nmi <- names(argl)))
    iV <- iV & (nmi == "")
    ii <- if (fix.na)
    2:(na – 1)
    else 2:na
    if (any(iV[ii])) {
    for (i in ii[iV[ii]]) if (!is.null(nmi <- Nms(i)))
    names(argl)[i] <- nmi
    }
    }
    r <- do.call(cbind, c(argl[-1L], list(deparse.level = deparse.level)))
    }
    d2 <- dim(r)
    ### changed to first argl item
    r <- cbind2(argl[[1]], r)
    #r <- cbind2(..1, r)
    if (deparse.level == 0)
    return(r)
    ism1 <- !is.null(d1 <- dim(..1)) && length(d1) == 2L
    ism2 <- !is.null(d2) && length(d2) == 2L && !fix.na
    if (ism1 && ism2)
    return(r)
    Ncol <- function(x) {
    d 0L)
    }
    nn1 <- !is.null(N1 <- if ((l1 <- Ncol(..1)) && !ism1) Nms(1))
    nn2 <- !is.null(N2 <- if (na == 2 && Ncol(..2) && !ism2) Nms(2))
    if (nn1 || nn2 || fix.na) {
    if (is.null(colnames(r)))
    colnames(r) <- rep.int("", ncol(r))
    setN <- function(i, nams) colnames(r)[i] <<- if (is.null(nams))
    ""
    else nams
    if (nn1)
    setN(1, N1)
    if (nn2)
    setN(1 + l1, N2)
    if (fix.na)
    setN(ncol(r), Nna)
    }
    r
    }

    Now things work like:
    x <- 1:5
    y <- 1:10
    cbind(x, y)
    ## first vetor gets repeated
    cbind.na(x, y)
    ## first vector gets filled
    x <- matrix(rnorm(100), ncol = 2)
    y <- matrix(rnorm(100), ncol = 4)
    cbind(x, y)
    ## does not work because of unequal row numbers, BUT
    cbind.na(x, y)
    ## works…
    Does the same with data.frames and preserves all column names.

  5. Sorry, pasting went wrong! Again:

    cbind.na <- function (…, deparse.level = 1)
    {
    na <- nargs() – (!missing(deparse.level))
    deparse.level <- as.integer(deparse.level)
    stopifnot(0 <= deparse.level, deparse.level <= 2)
    argl <- list(…)

    ### determine max length
    tempLEN <- NULL
    for (i in 1:length(argl)) {
    DIM <- dim(argl[[i]])
    if (is.null(DIM)) tempLEN[i] <- length(argl[[i]])
    else tempLEN[i] <- max(apply(argl[[i]], 2, function(x) length(x)))
    }
    maxLEN <- max(tempLEN)
    ### added NA fill to max length
    for (i in 1:length(argl)) {
    DIM <- dim(argl[[i]])
    if (is.null(DIM)) argl[[i]] <- c(argl[[i]], rep(NA, maxLEN – length(argl[[i]])))
    else argl[[i]] 0 && is.null(argl[[na]])) {
    argl <- argl[-na]
    na <- na – 1
    }

    if (na == 0)
    return(NULL)
    if (na == 1) {
    if (isS4(..1))
    return(cbind2(..1))
    else return(.Internal(cbind(deparse.level, …)))
    }
    if (deparse.level) {
    symarg <- as.list(sys.call()[-1L])[1L:na]
    Nms <- function(i) {
    if (is.null(r <- names(symarg[i])) || r == "") {
    if (is.symbol(r <- symarg[[i]]) || deparse.level ==
    2)
    deparse(r)
    }
    else r
    }
    }
    if (na == 2) {
    ### changed to second argl item
    r <- argl[[2]]
    #r <- ..2
    fix.na <- FALSE
    }
    else {
    nrs <- unname(lapply(argl, nrow))
    iV <- sapply(nrs, is.null)
    fix.na <- identical(nrs[(na - 1):na], list(NULL, NULL))
    if (fix.na) {
    nr <- max(if (all(iV)) sapply(argl, length) else unlist(nrs[!iV]))
    argl[[na]] <- cbind(rep(argl[[na]], length.out = nr),
    deparse.level = 0)
    }
    if (deparse.level) {
    if (fix.na)
    fix.na <- !is.null(Nna <- Nms(na))
    if (!is.null(nmi <- names(argl)))
    iV <- iV & (nmi == "")
    ii <- if (fix.na)
    2:(na – 1)
    else 2:na
    if (any(iV[ii])) {
    for (i in ii[iV[ii]]) if (!is.null(nmi <- Nms(i)))
    names(argl)[i] <- nmi
    }
    }
    r <- do.call(cbind, c(argl[-1L], list(deparse.level = deparse.level)))
    }
    d2 <- dim(r)
    ### changed to first argl item
    r <- cbind2(argl[[1]], r)
    #r <- cbind2(..1, r)
    if (deparse.level == 0)
    return(r)
    ism1 <- !is.null(d1 <- dim(..1)) && length(d1) == 2L
    ism2 <- !is.null(d2) && length(d2) == 2L && !fix.na
    if (ism1 && ism2)
    return(r)
    Ncol <- function(x) {
    d 0L)
    }
    nn1 <- !is.null(N1 <- if ((l1 <- Ncol(..1)) && !ism1) Nms(1))
    nn2 <- !is.null(N2 <- if (na == 2 && Ncol(..2) && !ism2) Nms(2))
    if (nn1 || nn2 || fix.na) {
    if (is.null(colnames(r)))
    colnames(r) <- rep.int("", ncol(r))
    setN <- function(i, nams) colnames(r)[i] <<- if (is.null(nams))
    ""
    else nams
    if (nn1)
    setN(1, N1)
    if (nn2)
    setN(1 + l1, N2)
    if (fix.na)
    setN(ncol(r), Nna)
    }
    r
    }

  6. 6 ranjit

    Its this work

    merge(df1,df2,all=TRUE)

  7. 7 tim

    this is great! thanks so much, really helped me out

  8. 8 KatjaK

    I’m very thankful!!! it was a big help for me!

  9. 9 Frank Burbrink

    Is there a way to add a third and fourth column to this? That is, can I bind 3 or 4 lists that also do not have all of the same named elements?

  10. 10 markheckmann

    Do you mean something like below? If not please add a minimal example.
    library(plyr)
    df1 <- data.frame(x1=1, x2=2)
    df2 <- data.frame( x2=2, x3=3 )
    df3 <- data.frame( x3=3, x4=4)
    rbind.fill(list(df1, df2, df3))



Follow

Get every new post delivered to your Inbox.

Join 51 other followers

%d bloggers like this: