Infomaps using R – Visualizing German unemployment rates by district on a map

16Nov09

Lately, David Smith from REvolution Computing set out to challenge the R community with the reprocuction of a beautiful choropleth map (= multiple regions map/thematic map) on US unemployment rates he had seen on the Flowing Data blog. Here you can find the impressing results. Being a fan of beautiful visualizations I tried to produce a similar map for Germany.

1. Getting the spatial country data

The first step resulted in getting data to draw a map of the German administrative districts. Unfortunately, the maps for Germany do not come along in the map package, which would mean I could easily adopt the code results from the challenge. Getting data: The GADM database of Global Administrative Areas has the aim to provide data of administrative districts for the whole world on different levels (country, state and county level). The data can be downloaded as as a shapefile, an ESRI geodatabase file, a Google Earth .kmz file and very convenient for R users, as an Rdata file.

2. Getting socio-demographic data (e. g. unemployment rates by administrative district): A lot of data is available online at www.statistikportal.de. On this site you find links to several data bases. To get the unemployment stats by county I clicked my way through: Regionaldatenbank Deutschland -> Arbeitsmarkt -> Arbeitsmarktstatistik der Bundesagentur für Arbeit -> Arbeitslose nach ausgewählten Personengruppen sowie Arbeitslosenquoten – Jahresdurchschnitt – (ab 2008) regionale Tiefe: Kreise und krfr. Städte -> Werteabruf -> save as CSV format. This table contains all the information I need, although for some reson, for a few districts there is no data listed. I also looked for another source. On Regionalatlas a nice online visualization tool is offered. In the menu I selected unemployment rate 2008 as indicator. Besides the nice visualization you get, there is a menu button “tables” where you can retrieve a html table of the data. I simply copied and pasted it into a .txt file which gives me a tab seperated value format I can read in R. But still: some districts are not listed. Here is a pdf file containing the data.

3. Preparing the data

Now I have two datafiles: One (gadm) containaing the spatial information, the other one (unempl) containing the unemployment rates. It turns out that the same districts are not always named alike. Sometimes the name comes along with a supplement or in other cases the deviations are more severe so that simple parsing will not do it.

I decided to take the quick-and-dirty route and do a fuzzy matching, which surely is prone to errors, very slow and not at all elegant… Well, never underestimate the rawness of raw data.

4. Plotting the data

On Claudia Engel’s Anthrospace blog I found an R script already perfect to make use of the data provided. The Rdata files turn out to contain SpatialPolygonsDataFrame so we can print the data without any further preparation using the sp package.

###############################################################
library(sp)
library(RColorBrewer)

# get spatial data for Germany on county level
con <- url("http://gadm.org/data/rda/DEU_adm3.RData")
print(load(con))
close(con)
# plot Germany with random colors
col = rainbow(length(levels(gadm$NAME_3)))
spplot(gadm, "NAME_3", col.regions=col, main="German Regions",
       colorkey = FALSE, lwd=.4, col="white")
###############################################################

This looks nice. To produce a color vector to visualize the unemployment rate the two data sets have to be merged.

###############################################################
### DATA PREP ###
# loading the unemployment data
unempl <- read.delim2(file="./data/data_germany_unemployment_by_
                     county.txt", header = TRUE, sep = "\t",
                     dec=",", stringsAsFactors=F)

# due to Mac OS encoding, otherwise not needed
gadm_names <- iconv(gadm$NAME_3, "ISO_8859-2", "UTF-8")

# fuzzy matching of data: quick & dirty
# caution: this step takes some time ~ 2 min.

# parsing out "Städte"
gadm_names_n <- gsub("Städte", "", gadm_names) 

total <- length(gadm_names)
# create progress bar
pb <- txtProgressBar(min = 0, max = total, style = 3) 
order <- vector()
for (i in 1:total){  
   order[i] <- agrep(gadm_names_n[i], unempl$Landkreis, 
                     max.distance = 0.2)[1]
 setTxtProgressBar(pb, i)               # update progress bar
}

# choose color by unemployment rate
col_no <- as.factor(as.numeric(cut(unempl$Wert[order],
                    c(0,2.5,5,7.5,10,15,100))))
levels(col_no) <- c(">2,5%", "2,5-5%", "5-7,5%",
                    "7,5-10%", "10-15%", ">15%")
gadm$col_no <- col_no
myPalette<-brewer.pal(6,"Purples")

# plotting
spplot(gadm, "col_no", col=grey(.9), col.regions=myPalette,
main="Unemployment in Germany by district")

###############################################################

It seems that the districts for which no data was available mainly belong to the states Sachsen-Anhalt and Sachsen. Also you can see that east of Germany has got a much higher unemplyoment rate than the west. The same holds true for a north-south comparison.

Besides ths sp package there are many other ways to produce such a graphic. I will now take another approch using shapefile data which also is available on GDAM. The data is availabe as a .zip file which includes several dBase files for all levels (3=district, 1=state etc.).

###############################################################

library(sp)
library(maptools)

nc1 <- readShapePoly("./data/DEU_adm/DEU_adm1.dbf",
                    proj4string=CRS("+proj=longlat +datum=NAD27"))
nc3 <- readShapePoly("./data/DEU_adm/DEU_adm3.dbf",
                    proj4string=CRS("+proj=longlat +datum=NAD27"))

# col_no comes from the calculations above
par(mar=c(0,0,0,0))
plot(nc3, col=myPalette[col_no], border=grey(.9), lwd=.5)
plot(nc1, col=NA, border=grey(.5), lwd=1, add=TRUE)

###############################################################

What I like about all this, it that it is pretty simple to draw almost any country you like. Besides the very messy part of data preparation it is only a few lines of code and the results are nice.

Filed under: R / R-Code | 26 Comments
Tags: maps, R, visualization

26 Responses to “Infomaps using R – Visualizing German unemployment rates by district on a map”

Feed for this Entry

1 Gaël Laurans on November 18, 2009 said:

The data from Saxony is really missing so that the few districts that are colored really are mistakes. The data from Saxony-Anhalt is in your table but there was a Kreisreform in 2007 and the map you are using was apparently not updated. Because of that, the matching is very messy and several districts are obviously wrong (most of those in Saxony-Anhalt do not exist under their old names in the dataset, both Wittenberg and Ohrekreis should be over 10% instead of being light blue, Berlin’s unemployment rate is way too low…)
2 markheckmann on November 18, 2009 said:

Hi Gaël,

you are certainly right and thanks a lot for clarifying the reasons for the inconsistencies! This is really the difficult task of preparing the raw data as sometimes it comes along really raw… Above my focus was put onto the visualization only, thus I would recommend not to take this graph at face value. The data combination was necessary and I took a quick-and-dirty error-prone road to not have to dig too deep into the data prep step, thus yielding several obvious errors. If someone could come up with another simple matching strategy I would be glad.

Other questions arises in the data prep:
1) How to merge polygon objects to print them as one if districts have e.g. merged, changed.
2) Where to get accurate map data for this task.

Mark
3 Eoin Brazil on November 20, 2009 said:

Hi Mark,

Thanks for the post and useful pointers, I’ve created an Irish version (http://braz.blogspot.com/2009/11/following-recent-infomaps-with-r-trend.html) but the points you make about the data preparation are really valid. The Irish statistics agency (CSO) uses NUTS2/3 districts which are not typically used to represent regions or counties, further more the last Irish census was in 2006 so even the population figures are only estimates. I’d have to simply say that any unemployment visualisation is only a rough and perhaps somewhat accurate reflection of the data.

Danke,
Eoin
4 Gaël Laurans on November 24, 2009 said:

Hi Mark,

I had another look at the problem and I think “agrep” isn’t really useful for that. You need to allow for a big distance to find matches for some districts and then you have many matches for other districts. Worse, the results are not ordered by distance and taking the first one just doesn’t work. It could be possible to get around this problem by rolling out your own fuzzy matching function but it seems like a lot of trouble, especially if you want to make it reasonably fast.

In fact, a few calls to the regular “grep” can do the job as well, using gadm’s VARNAME and TYPE fields.

landkreise <- as.data.frame(cbind(unlist(gadm@data$NAME_3),
unlist(gadm@data$VARNAME_3)))
names(landkreise) <- c("OName","VName")
landkreise$OName[landkreise$OName == "NA"] <- NA
landkreise$OName <- gsub(" Städte","",landkreise$OName)
landkreise$VName[landkreise$VName == "NA"] <- NA
landkreise$FName <- paste(landkreise$OName, ", ", gsub("Städte", "Stadt", gsub("Landkreise", "Landkreis", unlist(gadm@data$TYPE_3))), sep="")

order1 <- as.vector(unlist(lapply(sapply(landkreise$FName, grep, unempl$Landkreis, value=F),function(x) x[1])))
order2 <- as.vector(unlist(lapply(sapply(landkreise$OName, grep, unempl$Landkreis, value=F),function(x) x[1])))
order3 <- as.vector(unlist(lapply(sapply(landkreise$VName, grep, unempl$Landkreis, value=F),function(x) x[1])))

# Order is important to deal properly with districts sharing the name of a city
order <- ifelse(is.na(order1),ifelse(is.na(order3),order2,order3),order1)

There are still about 45 unmatched districts but Saxony is empty as it should be, and there are fewer incorrect matches in Saxony-Anhalt and fewer obvious mistakes. The overall geographic pattern also looks cleaner.

Still, I think that the real problem is the quality of the gadm maps. The names are sometimes strange (try to look up "Speyer"), the maps are not up to date, their source is not clear, etc.

Cheers,
Gaël
5 allattention on December 10, 2009 said:

Another option to deal with names mess is to use the pmatch function (partial matching). Learned that the hard way. It can provide you with a list of the nomatch cases (either multiple matches or no matches) which you can then index mannually. Still if you actually care about exact indexing of every region you will have to input some better names, gadm is indeed strange with region names (but free!).
6 dangerouspenguin on February 17, 2010 said:

Wow..thank you so much for this helpful example. I am trying to learn how to plot maps in R instead of relying on a GIS, and this is fantastic.
7 Ananda on May 17, 2010 said:

Awesome example. It inspired me to explore doing the same for Tamil Nadu. Perhaps you have some tips for me: I’m trying to get rid of the bounding box around the map, and I’m trying to add labels to the districts. Any suggestion on how to do that?

Check out my map here: http://news.mrdwab.com/2010-05-16/choropleth-party-with-r/
8 markheckmann on May 17, 2010 said:

Hi Ananda,

thanks! Actually I did not succeed in removing the bounding box neither, so I switched to the plot() command instead of spplot() like in the bottom example. Maybe post it on geo-sig mailing list where the developers are…
For the coordinates you can use coordinates() from the sp packages. Here a short snippet from some project of mine (counties contains a list of spatial polygon objects).

plot(counties, col=2:4)
text(coordinates(counties), as.character(1:29), col=”white”)

Also, quite neat, you can use thinnedSpatialPoly() dp reduce number of points if your output file gets to big:
counties <- thinnedSpatialPoly(counties, tolerance=0.01, minarea=0.0005)

HTH Mark
9 Ananda on May 18, 2010 said:

Hi Mark,

Thanks for the suggestions. I’ve managed to get names on my map; now I need to figure out how to make the labels look a little bit nicer.
10 Gregor on March 14, 2011 said:

The german districts change from time to time. It’s not correct to map “Aachen, Kreisfreie Stadt” to “Aachen Städte” because this are different regions with different population, area etc. Unfortunately, the data from GADM database is outdated..
11 markheckmann on March 14, 2011 said:

Hi Gregor,

as addressed in the other comments there are a couple of problems with the map…
Take this code as a guide how to create a map, but do not rely on the data represented. I used an error-prone fuzzy matching approach to match region and data for the sake of time. If you really want to come up with a perfect map this will take a bit of time to clean up the data: “Never underestimate the rawness of raw data” ;)

Mark
12 Simon Kiss on May 12, 2011 said:

HI there:
So I’m trying to work with the CAnadian data from GADM and I’m following your and Caludia’s examples as best I can, but R is crashing repeatedly. Can you make any suggestions? It starts drawing the map, but then it just stalls.
Simon
13 grc on October 24, 2011 said:

hi,
how is it possible to load dbf files with the readShapePoly function?
14 markheckmann on October 24, 2011 said:

Hi,
I am not too familiar with files formats in this field. So, frankly, I do not know.
Best
Mark
15 Dimitri on January 27, 2012 said:

In reply to Simon Kiss: could it be you have too little memory? I have a relatively fast laptop and 8 GB of memory -and still the Canadian maps-building is very slow.
16 Dimitri on January 27, 2012 said:

Question for the author.
First of all – thanks a lot, it helped me personally a lot.
One question – for your example:
What is the best way to take the Germany map you built and then overimpose on it (e.g., in black) the borders between German States (Laender).
I am trying to do it for Canada. I am able to create a map like yours with counties, a separate map that only has borders between Provinces, but I don’t know how to join the two.
Thanks a lot!
17 markheckmann on January 27, 2012 said:

Hi Dimitri,

I am afraid I cannot help you with this one. Its quite a long time since I last worked with maps.
I would need to really delve into the subject to remember all these details. Sorry.

Mark
18 RC on January 31, 2012 said:

Counties could be combined as follows.
http://stackoverflow.com/questions/8567603/combine-counties-in-geographic-data
19 Torsten on March 11, 2012 said:

Thank you very much for the good example.
Greetings from Düsseldorf

Feeds

All postings
Authors
- markheckmann
- nattomi
Ecosia – eco-friendly web search
Archives
Archives
Blogroll
Links
apply building functions combine data frame dot-dot-dot fields footnote graphics histogram interactive plot intro jackknife LaTex Lyx maps Matlab merge News NGD normalized google distance permutation playwith plotrix plyr progress bar R R.basic regression rggobi strings sweave tables trellis visualization zip fastener
Categories

Categories
Mark Heckmann’s blog
R-help latest postings
- An error has occurred; the feed is probably down. Try again later.
Recent Comments
Mark on Twitter
Tweets by markheckmann
Terms of use

The contents of the "R" you ready? blog are offered under a creative commons license.
Meta

Infomaps using R – Visualizing German unemployment rates by district on a map

Share this:

Related

26 Responses to “Infomaps using R – Visualizing German unemployment rates by district on a map”

Feeds

Authors

Ecosia – eco-friendly web search

Archives

Blogroll

Links

Categories

Mark Heckmann’s blog

R-help latest postings

Recent Comments

Mark on Twitter

Terms of use

Meta