R: Normalized Google distance (NGD) in R part II

12Jan09

After my last posting on how to extract the Google hit count, I was searching the web and found a nice website that allows you to calculate many semantic relatedness measures. On request it seems to be possible to get free access to their API. The API lets you post a request via the GET or POST method, which can be implemented in R using the RCurl package.
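
As a side note, such a request boils down to very little code with RCurl. The endpoint and parameter names below are pure placeholders (not the actual API of that website); they only show the general pattern:

library(RCurl)

# placeholder URL and parameters -- substitute the real endpoint
res.get  <- getForm("http://example.org/api", word1="rider", word2="horse")
res.post <- postForm("http://example.org/api", word1="rider", word2="horse")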

Anyway, I will post the code to do the normalized google distance (NGD) calculation using R only. Last time, the code for the Google count extraction in R was posted as a first step; here comes the second step, the calculation, using the function described last time.
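
The getGoogleCount() function itself is not repeated here, see part I for the real thing. For readers landing directly on this post, a rough stand-in could look like the sketch below. This is NOT the implementation from part I: it goes through the (meanwhile retired) Google AJAX Search API and its estimatedResultCount field, and the hl parameter is only my assumption of how to restrict the language:

library(RCurl)
library(rjson)

# stand-in sketch, not the original getGoogleCount() from part I
getGoogleCount <- function(words, language = "en") {
    query <- curlEscape(paste(words, collapse = " "))
    url   <- paste("http://ajax.googleapis.com/ajax/services/search/web",
                   "?v=1.0&hl=", language, "&q=", query, sep = "")
    json  <- fromJSON(getURL(url))
    as.numeric(json$responseData$cursor$estimatedResultCount)
}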

The calculation formula might look a bit scary at a first glance:

NGD(x, y) = [ max{log f(x), log f(y)} - log f(x, y) ] / [ log M - min{log f(x), log f(y)} ]

Looking at its step-by-step development in the article Automatic Meaning Discovery Using Google, it gets quite easy to understand the rationale behind it. What we need to know here is that M is the total number of web pages searched by Google, f(x) and f(y) are the counts for the search terms x and y, respectively, and f(x, y) is the number of web pages on which both x and y occur (also see Wikipedia). So the ingredients are clear. Here comes the function.
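
To see the formula at work, here is a tiny worked example with invented counts (the numbers are made up, not real Google counts; only M is the value used in the function below):

# invented counts, for illustration only
M       <- 8058044651   # total number of pages indexed by Google (2007 figure)
freq.x  <- 120000       # f(x), hits for term x (made up)
freq.y  <- 2500000      # f(y), hits for term y (made up)
freq.xy <- 45000        # f(x,y), hits for both terms together (made up)

(max(log(freq.x), log(freq.y)) - log(freq.xy)) /
    (log(M) - min(log(freq.x), log(freq.y)))
# gives roughly 0.36 with these invented numbers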

###############################################################
#
#  description:  returns the normalized google distance as
#                numeric value
#
#  usage:        NGD(words, language, print, list, ...)
#
#  arguments:
#                words      the TWO terms to measure, given as
#                           a vector, e.g. c("wiki","R")
#                language   in which language to search.
#                           Either "en" (English) or
#                           "de" (German)
#                print      print all results (NGD, counts)
#                           to the console (default: FALSE)
#                list       return a list of results (default:
#                           FALSE) containing the NGD and all counts
#                ...        currently only M, the total number
#                           of web pages indexed by Google


NGD <- function(words, language="de", print=FALSE,
                list=FALSE, ...){

    # check for arguments
    if(!hasArg(words)) stop('NGD needs TWO strings like c("word1","word2") as words argument!')
    if(length(words)!=2) stop('words argument has to be of length two, e.g. c("word1","word2")')
    
    # M: total number of web pages searched by google (2007)
    if(hasArg(M)) M <- list(...)$M else M <- 8058044651    

    x <- words[1]
    y <- words[2]

    # using the getGoogleCount() function from part I (see previous post)
    freq.x  <- getGoogleCount(x, language=language)
    freq.y  <- getGoogleCount(y, language=language)
    freq.xy <- getGoogleCount(c(x,y), language=language)

    # apply formula
    NGD <- (max(log(freq.x), log(freq.y)) - log(freq.xy)) /
           (log(M) - min(log(freq.x), log(freq.y)))

    # print results to console if requested
    if(print==TRUE){
        cat("\t", x,":", freq.x, "\n",
            "\t", y,":", freq.y, "\n",
            "\t", x,"+", y,":", freq.xy, "\n",
            "\t", "normalized google distance (NGD):",
                                          NGD, "\n", "\n")
    }

    
    # return a list of results if requested (list=TRUE),
    # containing the NGD and all counts. By default only
    # the NGD is returned as a numeric value.
    
    results <- list(NGD=NGD,
                    x=c(x, freq.x),
                    y=c(y, freq.y),
                    xy=c(paste(x,"+",y), freq.xy)) 

    if(list==TRUE) return(results) else  return(NGD)
}


# NOT RUN:

NGD(c("rider","horse"), print=T)
NGD(c("rider","horse"), list=TRUE)             # returns a list
# may be applied to dataframes
DF <- data.frame(A=c("rider","religion"), B=c("horse","god"))
apply(DF, 1, NGD, print=TRUE)    

###############################################################

The function returns the normalized google distance and can be applied to a data frame containing pairs of words (like DF, see above). I am not sure whether the calculation renders correct results. So if someone notices a flaw, please comment.
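
In case someone needs distances for more than a few pairs, a small wrapper that builds a symmetric distance matrix might be handy. This is just an untested sketch on top of the NGD() function above (the name ngd.matrix is my own choice, nothing official), and keep in mind that every NGD() call fires three Google queries:

# sketch: pairwise NGD matrix for a set of words
ngd.matrix <- function(words, ...){
    n <- length(words)
    m <- matrix(0, n, n, dimnames=list(words, words))
    for(i in seq_len(n - 1)){
        for(j in (i + 1):n){
            m[i, j] <- m[j, i] <- NGD(c(words[i], words[j]), ...)
        }
    }
    m
}

# NOT RUN:
# ngd.matrix(c("rider", "horse", "religion", "god"))
# a different estimate of M can be passed on via the ... argument:
# NGD(c("rider", "horse"), M=4.5e10)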

Mark



9 Responses to “R: Normalized Google distance (NGD) in R part II”

  1. Mike

    Hi Mark,
    thanks for your blog. I have been experimenting with the web capabilities R offers and I found your blog entry very useful.
    Thanks a lot.

  2. qingyuyue

    Hi Mark,
    Your blog is very helpful for me, thanks a lot. I want to know more about your R code and the google distance. Can you give me your email?

  3. qingyuyue

    My email : qingyuyue@gmail.com

  4. Can this technique be used to determine the NGD between two web pages, i.e. to tell whether they are dissimilar enough to be considered non-duplicates?

  5. markheckmann

    Hi,
    I am not familiar with how to determine semantic similarity between whole web pages. This script is designed to work with single search terms only, so I do not think it will be useful for your purpose.

    HTH

  6. JianYi

    Hi,
    Is the total number of web pages searched by Google accurate? I think the value of M changes in real time. So, is there a way to determine this “M”, maybe using the Google API?
    thank you

  7. markheckmann

    I think this value is obsolete by now. Unfortunately I was not able to find an accurate value published by Google. This M was taken from literature. I do not know how sensitive the results are to value changes in M. If you find a way to properly determine the value (maybe via the API) please let me know.
    Mark

  8. Matt

    M is roughly 4.5*10^9 according to worldwidewebsize.com, which looks like it has a good method for estimating it.

  9. Matt

    Sorry, that should read 4.5*10^10


