R: Retrieving information from google using the RCurl package
Lately I read the article Automatic Meaning Discovery Using Google by Cilibrasi and Vitányi, which introduces the normalized google distance (NGD) as a measure of the semantic relatedness of two search terms. As the basis for its calculation the NGD uses simple google search result counts.
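For reference, the NGD of two terms x and y is computed from f(x) and f(y), the result counts for each term alone, f(x, y), the count for both terms searched together, and N, the total number of pages Google indexes. A minimal sketch in R of the formula from the paper (the default value for N is my own placeholder; the paper leaves the choice of the indexed-page total open):

```r
# Normalized google distance, following Cilibrasi & Vitanyi:
#   NGD(x,y) = (max(log f(x), log f(y)) - log f(x,y)) /
#              (log N   - min(log f(x), log f(y)))
# fx, fy: result counts for each single term
# fxy:    result count for both terms searched together
# N:      total number of indexed pages (assumed placeholder here)
ngd <- function(fx, fy, fxy, N = 1e10) {
  (max(log(fx), log(fy)) - log(fxy)) /
    (log(N) - min(log(fx), log(fy)))
}

# two terms that always co-occur have distance 0
ngd(1000, 1000, 1000)
```

Terms that always appear together get distance 0; the rarer their co-occurrence relative to the single-term counts, the larger the value.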
Now I want to figure out how to implement this calculation in R. The first step is to retrieve the needed information from the google website; the second step is to do the calculations. For today only step one, the rest will follow.
I found a nice site by Duncan Temple Lang that explains how to extract HTML code from any internet site using the RCurl package. Via the package it is possible to specify the URL, user name, password and many other things. The package provides features to send requests to a site either via the browser's HTTP command line or directly to a site's forms. An example of how to submit a search request to google is also given. This is all we need to get going.
###############################################################
library(RCurl) # now lets extract the HTML code from my blog using getURL() # from the RCurl package getURL("http://www.markheckmann.de") # this looks pretty unstructured. But we may have an organized # view using htmlTreeParse() from the XML package # This is just to see what we are dealing with library(XML) htmlTreeParse(getURL("http://www.markheckmann.de") # Now let's do a google request using the browsers command # line. This can be achieved via the RCurl getForm() function, # which constructs and sends such a line. Here we can choose # hl=language, q= search terms and several other parameters. # Let's search for the term "r-project". site <- getForm("http://www.google.com/search", hl="en", lr="", q="r-project", btnG="Search") htmlTreeParse(site) # Now we have the Google result HTML code and have to # extract the relevant information from it. typeof(site) # As we see, site contains plain character HTML code, so # I can use use simple text manipulation functions here. # What part of the code do I have to extract now? Somewhere # in the HTML code there is a line like this: # <b> some numerics </b> # So the number is in bewteen the <b> </b> argument. How # can we get this? text <- "We are looking for something like <b>12.345</b> or similar" gregexpr("<b>12.345</b>", text, fixed = TRUE) # gregexpr will return the position of the text we are searching # for. Now we need to generalize this to all numbers. I am # still not too familiar with regular expressions. Chapter # seven in Spector, P. (2008). Data Manipulation with R (UseR) # contains a good explanation of these. gregexpr('<b>[0-9.,]{1,20}</b>', text) # This does the job! The problem now is that there are a # number of brackets like the one above containing numbers. # So we need a to find the exact parts which to extract. # In an English google search there is the words "of about" # followed by the search count. In German it is preceeded by # the word "ungefähr". 
I will use these as indicator words to # spot the position from where to extract. indicatorWord <- "of about" # start extraction after indicator word posExtractStart <- gregexpr(indicatorWord, siteHTML, fixed = TRUE)[[1]] # extract string of 30 chracters length which should be enough # to get the numbers stringExtract <- substring(siteHTML, first=posExtractStart, last = posExtractStart + 30) # search for <b>number</b> (see above) posResults <- gregexpr('<b>[0-9.,]{1,20}</b>', stringExtract) posFirst <- posResults[[1]][1] textLength <- attributes(posResults[[1]])$match.length stringExtract <- substring(stringExtract, first=posFirst, last = posFirst + textLength) # actually the last four lines are usually not necessary. Just # in case the search term itself is numeric we would run the # risk of unwillingly extracting some abundant numerics # distorting the count results. # erase everything but the numbers stringExtract <- gsub("[^0-9]", "", stringExtract) print(stringExtract) # now we can use this for the calculation of the normalized # google distance ###############################################################
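As an aside, the position-plus-match.length bookkeeping above can also be delegated to regmatches(), which returns the matched substrings directly. Note that regmatches() only entered base R with version 2.14, so it may not be available on older installations:

```r
# extract the <b>number</b> match directly with regmatches()
text <- "We are looking for something like <b>12.345</b> or similar"
matches <- regmatches(text, gregexpr("<b>[0-9.,]{1,20}</b>", text))[[1]]
matches                                  # the full "<b>...</b>" match
as.numeric(gsub("[^0-9]", "", matches))  # digits only, as a number
```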
The above implementation is surely not technically mature (e.g. the extraction code), especially as I suppose this could be done much more easily using the Google APIs. Comments are welcome!
As a last step let’s wrap the above way to extract the google search results count into a function.
###############################################################
# description: returns the google results count
# usage:       getGoogleCount(searchTerms, language, ...)
# arguments:
#   searchTerms   the terms searched for, in vector form,
#                 e.g. c("wikipedia") or c("wikipedia", "R")
#   language      in which language to search. Either "en"
#                 (english) or "de" (german)
getGoogleCount <- function(searchTerms = NULL, language = "de", ...) {
  # check arguments
  if (is.null(searchTerms))
    stop("Please enter search terms!")
  if (!any(language == c("de", "en")))
    stop("Please enter a correct language (de, en)!")

  # construct google-like search expression
  require(RCurl)
  # collapse search terms
  entry <- paste(searchTerms, collapse = "+")
  siteHTML <- getForm("http://www.google.com/search",
                      hl = language, lr = "", q = entry, btnG = "Search")

  # select language-specific indicator word
  if (language == "de")
    indicatorWord <- "ungefähr"
  else
    indicatorWord <- "of about"

  # start extraction at the indicator word position
  posExtractStart <- gregexpr(indicatorWord, siteHTML, fixed = TRUE)[[1]]
  # extract a string of 30 characters length
  stringExtract <- substring(siteHTML, first = posExtractStart,
                             last = posExtractStart + 30)

  # search for <b>number</b> (can be left out, see above)
  posResults <- gregexpr('<b>[0-9.,]{1,20}</b>', stringExtract)
  posFirst <- posResults[[1]][1]
  textLength <- attributes(posResults[[1]])$match.length
  stringExtract <- substring(stringExtract, first = posFirst,
                             last = posFirst + textLength)

  # erase everything but the numbers
  matchCount <- as.numeric(gsub("[^0-9]", "", stringExtract))
  return(matchCount)
}

## NOT RUN
getGoogleCount(c("r-project"), language = "en")
getGoogleCount(c("r-project", "europe"), language = "en")
###############################################################
Next time I will use this function to calculate the normalized google distance. Comments are welcome!
Happy New Year!
Mark
Filed under: R / R-Code
Tags: NGD, normalized google distance
Thanks Mark. I am a beginner and slowly working through your code and learning a lot.
I think there may be a missing close bracket near the top of your code?
Should it be?
library(XML)
htmlTreeParse(getURL("http://www.markheckmann.de"))
yes, thanks. wherever a bracket is opened, it must be closed again :)
– Mark
This no longer works? Not certain if the change is with google or with the XML package.