R: Retrieving information from google using the RCurl package
Lately I read the article Automatic Meaning Discovery Using Google by Cilibrasi and Vitányi, which introduces the normalized google distance (NGD) as a measure of the semantic relatedness of two search terms. As the basis for its calculation the NGD uses simple google search result counts.
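For reference, the NGD of two terms x and y is computed from f(x) and f(y), the result counts for each term alone, f(x, y), the count for both terms searched together, and N, the total number of pages Google indexes. A minimal sketch in R of the formula from the paper (the default value for N is my own placeholder; the paper leaves the choice of the indexed-page total open):

```r
# Normalized google distance, following Cilibrasi & Vitanyi:
#   NGD(x,y) = (max(log f(x), log f(y)) - log f(x,y)) /
#              (log N   - min(log f(x), log f(y)))
# fx, fy: result counts for each single term
# fxy:    result count for both terms searched together
# N:      total number of indexed pages (assumed placeholder here)
ngd <- function(fx, fy, fxy, N = 1e10) {
  (max(log(fx), log(fy)) - log(fxy)) /
    (log(N) - min(log(fx), log(fy)))
}

# two terms that always co-occur have distance 0
ngd(1000, 1000, 1000)
```

Terms that always appear together get distance 0; the rarer their co-occurrence relative to the single-term counts, the larger the value.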
Now I want to figure out how to implement this calculation in R. The first step is to retrieve the needed information from the google website; the second step is to do the calculations. For today only step one, the rest will follow.
I found a nice site by Duncan Temple Lang that explains how to extract HTML code from any internet site using the RCurl package. Via the package it is possible to specify the URL, user name, password and many other things. The package provides features to send requests to a site either via the browser's HTTP command line or directly to a site's forms. An example of how to submit a search request to google is also given. This is all we need to get going.
###############################################################
library(RCurl) # now lets extract the HTML code from my blog using getURL() # from the RCurl package getURL("http://www.markheckmann.de") # this looks pretty unstructured. But we may have an organized # view using htmlTreeParse() from the XML package # This is just to see what we are dealing with library(XML) htmlTreeParse(getURL("http://www.markheckmann.de") # Now let's do a google request using the browsers command # line. This can be achieved via the RCurl getForm() function, # which constructs and sends such a line. Here we can choose # hl=language, q= search terms and several other parameters. # Let's search for the term "r-project". site <- getForm("http://www.google.com/search", hl="en", lr="", q="r-project", btnG="Search") htmlTreeParse(site) # Now we have the Google result HTML code and have to # extract the relevant information from it. typeof(site) # As we see, site contains plain character HTML code, so # I can use use simple text manipulation functions here. # What part of the code do I have to extract now? Somewhere # in the HTML code there is a line like this: # <b> some numerics </b> # So the number is in bewteen the <b> </b> argument. How # can we get this? text <- "We are looking for something like <b>12.345</b> or similar" gregexpr("<b>12.345</b>", text, fixed = TRUE) # gregexpr will return the position of the text we are searching # for. Now we need to generalize this to all numbers. I am # still not too familiar with regular expressions. Chapter # seven in Spector, P. (2008). Data Manipulation with R (UseR) # contains a good explanation of these. gregexpr('<b>[0-9.,]{1,20}</b>', text) # This does the job! The problem now is that there are a # number of brackets like the one above containing numbers. # So we need a to find the exact parts which to extract. # In an English google search there is the words "of about" # followed by the search count. In German it is preceeded by # the word "ungefähr". 
I will use these as indicator words to # spot the position from where to extract. indicatorWord <- "of about" # start extraction after indicator word posExtractStart <- gregexpr(indicatorWord, siteHTML, fixed = TRUE)[[1]] # extract string of 30 chracters length which should be enough # to get the numbers stringExtract <- substring(siteHTML, first=posExtractStart, last = posExtractStart + 30) # search for <b>number</b> (see above) posResults <- gregexpr('<b>[0-9.,]{1,20}</b>', stringExtract) posFirst <- posResults[[1]][1] textLength <- attributes(posResults[[1]])$match.length stringExtract <- substring(stringExtract, first=posFirst, last = posFirst + textLength) # actually the last four lines are usually not necessary. Just # in case the search term itself is numeric we would run the # risk of unwillingly extracting some abundant numerics # distorting the count results. # erase everything but the numbers stringExtract <- gsub("[^0-9]", "", stringExtract) print(stringExtract) # now we can use this for the calculation of the normalized # google distance ###############################################################
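As an aside, the position-plus-match.length bookkeeping above can also be delegated to regmatches(), which returns the matched substrings directly. Note that regmatches() only entered base R with version 2.14, so it may not be available on older installations:

```r
# extract the <b>number</b> match directly with regmatches()
text <- "We are looking for something like <b>12.345</b> or similar"
matches <- regmatches(text, gregexpr("<b>[0-9.,]{1,20}</b>", text))[[1]]
matches                                  # the full "<b>...</b>" match
as.numeric(gsub("[^0-9]", "", matches))  # digits only, as a number
```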
The above implementation is surely not technically mature (e.g. the extraction code), especially as I suppose this could be done much more easily using the Google APIs. Comments are welcome!
As a last step let’s wrap the above way to extract the google search results count into a function.
###############################################################
# description: returns the google results count
# usage:       getGoogleCount(searchTerms, language, ...)
# arguments:
#   searchTerms   the terms searched for, in vector form,
#                 e.g. c("wikipedia") or c("wikipedia", "R")
#   language      in which language to search. Either "en"
#                 (english) or "de" (german)
getGoogleCount <- function(searchTerms = NULL, language = "de", ...) {
  # check arguments
  if (is.null(searchTerms))
    stop("Please enter search terms!")
  if (!any(language == c("de", "en")))
    stop("Please enter a correct language (de, en)!")

  # construct google-like search expression
  require(RCurl)
  # collapse search terms
  entry <- paste(searchTerms, collapse = "+")
  siteHTML <- getForm("http://www.google.com/search",
                      hl = language, lr = "", q = entry, btnG = "Search")

  # select language-specific indicator word
  if (language == "de")
    indicatorWord <- "ungefähr"
  else
    indicatorWord <- "of about"

  # start extraction at the indicator word position
  posExtractStart <- gregexpr(indicatorWord, siteHTML, fixed = TRUE)[[1]]
  # extract a string of 30 characters length
  stringExtract <- substring(siteHTML, first = posExtractStart,
                             last = posExtractStart + 30)

  # search for <b>number</b> (can be left out, see above)
  posResults <- gregexpr('<b>[0-9.,]{1,20}</b>', stringExtract)
  posFirst <- posResults[[1]][1]
  textLength <- attributes(posResults[[1]])$match.length
  stringExtract <- substring(stringExtract, first = posFirst,
                             last = posFirst + textLength)

  # erase everything but the numbers
  matchCount <- as.numeric(gsub("[^0-9]", "", stringExtract))
  return(matchCount)
}

## NOT RUN
getGoogleCount(c("r-project"), language = "en")
getGoogleCount(c("r-project", "europe"), language = "en")
###############################################################
Next time I will use this function to calculate the normalized google distance. Comments are welcome!
Happy New Year!
Mark
Filed under: R / R-Code
Tags: NGD, normalized google distance
Thanks Mark. I am a beginner and slowly working through your code and learning a lot.
I think there may be a missing close bracket near the top of your code?
Should it be?
library(XML)
htmlTreeParse(getURL("http://www.markheckmann.de"))
yes, thanks. wherever a bracket is opened, it must be closed again :)
– Mark
This no longer works? Not certain if the change is with google or with the XML package.