R: Retrieving information from google using the RCurl package

01Jan09

Lately I read the article "Automatic Meaning Discovery Using Google" by Cilibrasi and Vitanyi, which introduces the normalized Google distance (NGD) as a measure of the semantic relatedness of two search terms. As the basis for its calculation, the NGD uses simple Google search result counts.
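In the paper, the NGD of two search terms x and y is defined via these counts — f(x) and f(y) being the number of hits for each term alone, f(x, y) the number of hits for both terms searched together, and N the total number of pages indexed by Google:

```
NGD(x, y) = ( max(log f(x), log f(y)) - log f(x, y) ) /
            ( log N - min(log f(x), log f(y)) )
```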

Now I want to figure out how to implement this calculation in R. The first step is to retrieve the needed information from the Google website; the second step is to do the calculations. Today only step one, the rest will follow.

I found a nice site written by Duncan Temple Lang that explains the extraction of HTML code from any internet site using the RCurl package. Via the package it is possible to specify the URL, user name, password and many other things. The package provides features to send requests to a site either via the browser's HTTP command line or directly to a site's forms. An example is also given of how to submit a search request to Google. This is all we need to get going.

###############################################################
library(RCurl)

# now let's extract the HTML code from my blog using getURL()
# from the RCurl package

getURL("http://www.markheckmann.de")

# this looks pretty unstructured. But we can get an organized
# view using htmlTreeParse() from the XML package.
# This is just to see what we are dealing with

library(XML)
htmlTreeParse(getURL("http://www.markheckmann.de"))

# Now let's do a Google request using the browser's command
# line. This can be achieved via the RCurl getForm() function,
# which constructs and sends such a line. Here we can choose
# hl=language, q= search terms and several other parameters.
# Let's search for the term "r-project".

site <- getForm("http://www.google.com/search", hl="en",
                lr="", q="r-project", btnG="Search")
htmlTreeParse(site)

# Now we have the Google result HTML code and have to
# extract the relevant information from it.

typeof(site)

# As we see, site contains plain character HTML code, so
# we can use simple text manipulation functions here.

# What part of the code do we have to extract now? Somewhere
# in the HTML code there is a line like this:
#                  <b> some numerics </b>
# So the number is in between the <b> </b> tags. How
# can we get this?

text <- "We are looking for something like <b>12.345</b>
         or similar"
gregexpr("<b>12.345</b>", text, fixed = TRUE)

# gregexpr will return the position of the text we are searching
# for. Now we need to generalize this to all numbers. I am
# still not too familiar with regular expressions. Chapter
# seven in Spector, P. (2008). Data Manipulation with R (UseR)
# contains a good explanation of these.

gregexpr('<b>[0-9.,]{1,20}</b>', text)

# This does the job! The problem now is that there are a
# number of brackets like the one above containing numbers.
# So we need to find the exact part to extract.
# In an English Google search the words "of about" are
# followed by the search count. In German the count is
# preceded by the word "ungefähr". I will use these as
# indicator words to spot the position from where to extract.

indicatorWord <- "of about"

# start extraction after the indicator word
posExtractStart <- gregexpr(indicatorWord, site,
                            fixed = TRUE)[[1]]

# extract a string of 30 characters length, which should be
# enough to get the numbers
stringExtract <- substring(site, first=posExtractStart,
                           last = posExtractStart + 30)

# search for <b>number</b> (see above)
posResults <- gregexpr('<b>[0-9.,]{1,20}</b>', stringExtract)
posFirst <- posResults[[1]][1]
textLength  <- attributes(posResults[[1]])$match.length[1]
stringExtract <- substring(stringExtract, first=posFirst,
                           last = posFirst + textLength - 1)

# Actually the last four lines are usually not necessary. But
# in case the search term itself is numeric, we would run the
# risk of unintentionally extracting extraneous numerics,
# distorting the count results.

# erase everything but the numbers
stringExtract <- gsub("[^0-9]", "", stringExtract)

print(stringExtract)

# now we can use this for the calculation of the normalized
# google distance

###############################################################

The above implementation is surely not technically mature (e.g. the extraction code), especially as I suppose this could be done much more easily using the Google APIs. Comments are welcome!

As a last step, let's wrap the above extraction of the Google search results count into a function.

###############################################################
#
#  description:   returns the Google results count
#  usage:         getGoogleCount(searchTerms, language, ...)
#  arguments:
#                 searchTerms   The terms searched for in
#                               vector form, e.g. c("wikipedia")
#                               or c("wikipedia", "R")
#                 language      in which language to search.
#                               Either "en" (English) or
#                               "de" (German)

getGoogleCount <- function(searchTerms=NULL,
                           language="de",
                           ...){

    # check for arguments
    if(is.null(searchTerms)) stop("Please enter search terms!")
    if(!any(language==c("de","en")))
        stop("Please enter a correct language (de, en)!")

    # construct a Google-style query
    require(RCurl)
    # collapse search terms
    entry <- paste(searchTerms, collapse="+")
    siteHTML <- getForm("http://www.google.com/search",
                        hl=language, lr="", q=entry,
                        btnG="Search")
    
    # select the language-specific indicator word
    if(language=="de") indicatorWord <- "ungefähr" else
                       indicatorWord <- "of about"        

    # start extraction at indicator word position
    posExtractStart <- gregexpr(indicatorWord, siteHTML,
                                fixed = TRUE)[[1]]
    # extract a string of 30 characters length
    stringExtract <- substring(siteHTML, first=posExtractStart,
                               last = posExtractStart + 30)
    # search for <b>number</b> (can be left out, see above)
    posResults <- gregexpr('<b>[0-9.,]{1,20}</b>', stringExtract)
    posFirst <- posResults[[1]][1]
    textLength  <- attributes(posResults[[1]])$match.length[1]
    stringExtract <- substring(stringExtract, first=posFirst,
                               last = posFirst + textLength - 1)
    # erase everything but the numbers
    matchCount <- as.numeric(gsub("[^0-9]", "", stringExtract))

    return(matchCount)
}

# NOT RUN

getGoogleCount(c("r-project"), language="en")
getGoogleCount(c("r-project", "europe"), language="en")

###############################################################
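As a teaser for part two: once the counts are available, the NGD itself is only a few lines of arithmetic. The function below is just a rough sketch of what that calculation might look like, not the final implementation — in particular, N (the total number of pages indexed by Google) is a placeholder assumption here.

```r
# Sketch: normalized Google distance from raw result counts.
# fx, fy : result counts for each single term
# fxy    : result count for both terms searched together
# N      : total number of indexed pages (rough assumption!)
ngd <- function(fx, fy, fxy, N = 1e10) {
    (max(log(fx), log(fy)) - log(fxy)) /
        (log(N) - min(log(fx), log(fy)))
}
```

The counts would then come from getGoogleCount(), e.g. ngd(getGoogleCount("r-project", language="en"), getGoogleCount("europe", language="en"), getGoogleCount(c("r-project", "europe"), language="en")).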

Next time I will use this function to calculate the normalized google distance. Comments are welcome!

Happy New Year!

Mark
