Lately, David Smith from REvolution Computing challenged the R community to reproduce a beautiful choropleth map (a thematic map of multiple regions) of US unemployment rates that he had seen on the Flowing Data blog. Here you can find the impressive results. Being a fan of beautiful visualizations, I tried to produce a similar map for Germany.

1. Getting the spatial country data

The first step was to get the data needed to draw a map of the German administrative districts. Unfortunately, maps of Germany do not come with the maps package, so I could not simply adopt the code from the challenge. The data come from GADM instead: the GADM database of Global Administrative Areas aims to provide data on administrative districts for the whole world at different levels (country, state and county level). The data can be downloaded as a shapefile, an ESRI geodatabase file, a Google Earth .kmz file and, very conveniently for R users, as an Rdata file.

2. Getting socio-demographic data (e.g. unemployment rates by administrative district)

A lot of data is available online at www.statistikportal.de. On this site you find links to several databases. To get the unemployment stats by county I clicked my way through: Regionaldatenbank Deutschland -> Arbeitsmarkt -> Arbeitsmarktstatistik der Bundesagentur für Arbeit -> Arbeitslose nach ausgewählten Personengruppen sowie Arbeitslosenquoten – Jahresdurchschnitt – (ab 2008) regionale Tiefe: Kreise und krfr. Städte -> Werteabruf -> save as CSV format. This table contains all the information I need, although for some reason no data is listed for a few districts. I also looked for another source. Regionalatlas offers a nice online visualization tool. In the menu I selected unemployment rate 2008 as the indicator. Besides the nice visualization, there is a menu button “tables” where you can retrieve an HTML table of the data. I simply copied and pasted it into a .txt file, which gives me a tab-separated format I can read into R. But still, some districts are not listed. Here is a pdf file containing the data.

Continue reading ‘Infomaps using R – Visualizing German unemployment rates by district on a map’


Most people using LaTeX feel that creating tables is no fun. Some days ago I stumbled across a neat function written by Paul Johnson that produces LaTeX code, including a variant that can be used within LyX. The output is meant for regression models and looks like the output of the Stata outreg command. His R function that produces the LaTeX code has the same name: outreg(). The outreg code can be found on his website or in the PDF copy of the code from his website.

I took the code, put it into a .rnw file and ran it through Sweave. It worked like a charm and produced beautiful results (see the picture on the left and the PDF). Below you can find the code for the noweb file (.rnw). LaTeX code is colored grey, R code is colored blue. Just have a look at all the results as a PDF file. Paul Johnson has also created a nice list of R tips that can be found on his website as well.

Continue reading ‘R: Function to create tables in LaTex or Lyx to display regression model results’


Just a little note for German-speaking R beginners: there is an introductory course on R (in German) available online on the website of the department of methodology and evaluation research at the University of Jena. Dr. Ivailo Partchev teaches a seven-session course on the topic (duration 11.5 hours).


Before you read this post, please have a look at Enrique’s comment below. He pointed out that the built-in R function modifyList() already does what I wanted to describe in this post. Well, I live to learn :)

I was wondering how I could write a function that uses default settings but accepts a list to overwrite the default settings via the dot-dot-dot / three-point argument. Here comes my solution.

# building a function with a list of default settings
# that can be modified by an optional list passed
# via the dot-dot-dot / three point argument
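
The idea can be sketched in a few lines with modifyList(), the built-in function Enrique pointed out. This is only a minimal illustration, not the code from the full post: the function name with_defaults() and the three settings are made up for the example.

```r
# Hypothetical helper: returns a list of settings in which the defaults
# are overwritten by whatever is passed via the dot-dot-dot argument.
with_defaults <- function(...) {
  defaults <- list(col = "black", lwd = 1, lty = "solid")  # assumed defaults
  overrides <- list(...)
  # Allow both with_defaults(col = "red") and with_defaults(list(col = "red"))
  if (length(overrides) == 1 && is.list(overrides[[1]])) {
    overrides <- overrides[[1]]
  }
  # modifyList() replaces matching elements and keeps the rest
  modifyList(defaults, overrides)
}

settings_a <- with_defaults(col = "red")
settings_b <- with_defaults(list(lwd = 3, lty = "dashed"))
str(settings_a)
str(settings_b)
```

Elements that are not mentioned in the call keep their default values, so callers only have to name the settings they want to change.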

Continue reading ‘R: Building functions – using default settings that can be modified via the dot-dot-dot / three point argument’


On the REvolutions Blog there is a nice posting on the often-raised question of how good or reliable R is. At my university R is hardly used, and lecturers have sometimes asked me whether the calculations done by R and its packages are accurate. The linked posting addresses this matter and tries to clarify the point.


Sometimes I find it useful to merge two data frames like the following ones

  X1 X2 X3 X4      Y1 Y2 Y3 Y4   
1  o  o  o  o       X  X  X  X
2  o  o  o  o       X  X  X  X
3  o  o  o  o       X  X  X  X

by interleaving them like a zip fastener, either along the columns

  X1 Y1 X2 Y2 X3 Y3 X4 Y4
1  o  X  o  X  o  X  o  X
2  o  X  o  X  o  X  o  X
3  o  X  o  X  o  X  o  X

or along the rows of the data frames.

  V1 V2 V3 V4
1  o  o  o  o
4  X  X  X  X
2  o  o  o  o
5  X  X  X  X
3  o  o  o  o
6  X  X  X  X
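
One way to get both results in base R is to bind the two data frames together first and then reorder by an alternating index. The following is a rough sketch under the assumption that both data frames have the same dimensions; the object names are invented for the example and this is not necessarily the function from the full post.

```r
# Two example data frames as shown above
df_o <- data.frame(matrix("o", nrow = 3, ncol = 4), stringsAsFactors = FALSE)
df_x <- data.frame(matrix("X", nrow = 3, ncol = 4), stringsAsFactors = FALSE)
names(df_o) <- paste0("X", 1:4)
names(df_x) <- paste0("Y", 1:4)

# Alternating index: rbind() stacks the two index vectors into a 2 x n
# matrix, as.vector() reads it column by column -> 1, n+1, 2, n+2, ...
zip_index <- function(n) as.vector(rbind(seq_len(n), n + seq_len(n)))

# Zip along the columns
zip_cols <- cbind(df_o, df_x)[, zip_index(ncol(df_o))]

# Zip along the rows (rbind() needs identical column names)
stacked  <- rbind(df_o, setNames(df_x, names(df_o)))
zip_rows <- stacked[zip_index(nrow(df_o)), ]

zip_cols
zip_rows
```

The row names of zip_rows keep the original positions in the stacked data frame, which is why they alternate between 1, 4, 2, 5, 3 and 6.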

Continue reading ‘R: Zip fastener for two data frames / combining rows or columns of two dataframes in an alternating manner’


Every once in a while I have to write a function that contains a loop doing thousands or millions of calculations. To make sure that the function does not get stuck in an endless loop, or just to fulfill the human need for control, it is useful to monitor the progress. So first I tried the following:



###############################################################

total <- 10
for(i in 1:total){
   print(i)
   Sys.sleep(0.1)
}

###############################################################

Unfortunately this does not work, as the console output of the basic R GUI is buffered: everything is printed to the console at once after the loop has finished. The R for Windows FAQ (7.1) explains two solutions: either change the buffering setting of the R GUI in the Misc menu, which can be toggled via <Ctrl-W>, or tell R explicitly to empty the buffer with flush.console(). So like this it works:

###############################################################

total <- 20
for(i in 1:total){
   Sys.sleep(0.1)
   print(i)
   # update GUI console
   flush.console()                          
}

###############################################################

Of course it would be even nicer to have a real progress bar. For different progress bars we can use the built-in utils package. First a text-based progress bar:
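
A text-based bar can be sketched with txtProgressBar() and setTxtProgressBar() from the utils package that ships with R. The loop body here is just a placeholder (Sys.sleep) standing in for the real calculation.

```r
total <- 20
# style = 3 draws a bar together with a percentage
pb <- txtProgressBar(min = 0, max = total, style = 3)
for (i in seq_len(total)) {
  Sys.sleep(0.05)            # placeholder for the actual work
  setTxtProgressBar(pb, i)   # move the bar to step i
}
close(pb)                    # finish the line on the console
```

Updating the bar on every iteration is fine for a few thousand steps; for millions of very cheap iterations it may be worth updating only every n-th step.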

Continue reading ‘R: Monitoring the function progress with a progress bar’


Some statistical programs offer the option to attach a footnote to the graphical output they create. This footnote may contain the name of the script or file that produced the graphic, the author’s name and the date of creation. In SAS, for example, there is a footnote command to achieve this. Once I realized that this makes life a lot easier, I wrote a simple three-line function in R which I use at the end of the construction of any graphic. I suppose that this is what my professors meant by “good practice”. The nice thing about implementing this in the grid graphics system is that you can produce multiple graphics [e.g. by par(mfrow=c(2, 2))] and the footnote will still be positioned correctly.
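
Such a function might look roughly like the following sketch. It is an illustration built on grid.text() and is not necessarily identical to the function in the full post; the name make_footnote(), its defaults and the script name in the usage example are assumptions.

```r
library(grid)

# Hypothetical helper: writes a small grey footnote into the bottom right
# corner of the current device, independent of the plot layout in use.
make_footnote <- function(text = format(Sys.time(), "%d %b %Y"),
                          size = 0.7, color = grey(0.5)) {
  pushViewport(viewport())                      # viewport spanning the whole page
  grid.text(label = text,
            x = unit(1, "npc") - unit(2, "mm"), # 2 mm from the right edge
            y = unit(2, "mm"),                  # 2 mm from the bottom edge
            just = c("right", "bottom"),
            gp = gpar(cex = size, col = color))
  popViewport()
}

# Usage: four base graphics panels with one footnote for the whole page
par(mfrow = c(2, 2))
for (k in 1:4) plot(1:10, main = paste("Panel", k))
make_footnote(paste("my_script.R", "A. Author", format(Sys.Date()), sep = " / "))
```

Because the text is placed relative to the page rather than to a single plot region, it ends up in the same corner no matter how many panels were drawn.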

Continue reading ‘R: Good practice – adding footnotes to graphics’


Although the graphic on the left may not seem 100% appropriate, it gives a hint of what I am about to do. I want to calculate all possible linear regression models with one dependent and several independent variables. In this posting I do not want to address bias and fitting issues, or the question of whether this makes sense from a statistical point of view. Here I want to emphasize the technical issues only.

To solve the task, several approaches are possible. The first one is a step-by-step approach using a lot of code. Another one would be to make use of a specialized package. The packages leaps and meifly would be appropriate for the task but have some slight drawbacks in terms of flexibility. I will not address solutions using these packages here, but I would like to point out that, in contrast to the approach below, only a few lines of code would do the job.

The step-by-step approach

Let’s suppose we have the following set of four possible regressors.

regressors <- c("y1", "y2", "y3", "y4")

Now we want to construct a formula that contains the first and third regressor.

vec <- c(TRUE, FALSE, TRUE, FALSE)
paste(regressors[vec])
> [1] "y1" "y3"

So the paste command works vectorwise, which helps a lot in this case. Now we add a plus sign between the regressors…
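
As a rough sketch of where this is heading: the collapse argument of paste() joins the selected regressors with plus signs, and expand.grid() enumerates every TRUE/FALSE pattern. The name of the dependent variable, y, is an assumption made for this example, and the full post may proceed differently.

```r
regressors <- c("y1", "y2", "y3", "y4")
vec <- c(TRUE, FALSE, TRUE, FALSE)

# Join the selected regressors with plus signs and build a formula
rhs <- paste(regressors[vec], collapse = " + ")
model_formula <- as.formula(paste("y ~", rhs))   # "y" is a hypothetical response

# All TRUE/FALSE patterns over the four regressors, minus the empty model
patterns <- expand.grid(rep(list(c(TRUE, FALSE)), length(regressors)))
patterns <- patterns[rowSums(patterns) > 0, ]

# One formula string per pattern
all_formulas <- apply(patterns, 1, function(keep) {
  paste("y ~", paste(regressors[keep], collapse = " + "))
})
all_formulas
```

Each string can then be turned into a formula with as.formula() and passed to lm() together with a data frame that contains the corresponding columns.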

Continue reading ‘R: Calculating all possible linear regression models for a given set of predictors’


Today I will treat a problem I encounter every once in a while. Let’s suppose we have several data frames or vectors of unequal length but with partly matching column names, just like the following ones:

df1 <- data.frame(Intercept = .4, x1=.4, x2=.2, x3=.7)
df2 <- data.frame(Intercept = .5,        x2=.8       )

This may occur, for example, when fitting several multiple regression models, each time using a different combination of regressors. Now I would like to combine the results into one data frame. Neither merge() nor rbind() helps here, as they expect the data frames to have the same columns.

I posted this matter on r-help, as my first solution was somewhat awkward and could not be generalized to arbitrary data frames or lists of data frames. The first solution was posted by Charles C. Berry. myList is a list containing the data frames as elements:

myList <- list(df1, df2)

What he does is use a nested loop. For each data frame, the inner loop runs over its column names. It takes each column name and the corresponding element [i, j] from the data frame ( myList[[i]] ) and writes it into an initially empty data frame (dat). This creates a new column named just like the column of the list element. The cells that are left out are automatically set to NA.

dat <- data.frame()
# outer loop: data frames in the list; inner loop: their column names
for(i in seq_along(myList)) for(j in names(myList[[i]]))
                                 dat[i,j] <- myList[[i]][j]
dat
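
For comparison, the same result can be reached without growing a data frame cell by cell. The sketch below is an alternative I am adding for illustration, not one of the solutions posted on r-help: it first pads every data frame with the columns it lacks and then stacks them with rbind().

```r
df1 <- data.frame(Intercept = .4, x1 = .4, x2 = .2, x3 = .7)
df2 <- data.frame(Intercept = .5, x2 = .8)
myList <- list(df1, df2)

# Union of all column names, in order of first appearance
all_cols <- unique(unlist(lapply(myList, names)))

# Add the missing columns as NA and bring the columns into a common order
padded <- lapply(myList, function(d) {
  for (m in setdiff(all_cols, names(d))) d[[m]] <- NA
  d[all_cols]
})

combined <- do.call(rbind, padded)
combined
```

Since every padded data frame has identical columns, rbind() no longer complains, and the cells a model did not estimate show up as NA.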

Continue reading ‘R: Combining vectors or data frames of unequal length into one data frame’