R: Normalized Google distance (NGD) in R part II
After my last posting on how to extract the Google result count, I searched the web and found a nice website that lets you calculate many semantic relatedness measures. On request, it seems to be possible to get free access to their API. The API accepts requests via the GET or POST method, which can be implemented in R using the RCurl package.
Anyway, I will post the code to do the normalized Google distance (NGD) calculation using R only. Last time I posted the first step, the code for extracting the Google count in R. Here comes the second step, the calculation, which uses the function described last time.
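As a rough idea of what such a GET request could look like from R, here is a minimal sketch. The endpoint URL and the parameter names `term1` and `term2` are placeholders I made up, not the real API of that website; `build_query_url` and `fetch_relatedness` are hypothetical helper names. Only `RCurl::getForm()` and `utils::URLencode()` are existing functions.

```r
# Build the query string of a GET request by hand (base R only).
# base_url and the parameter names are hypothetical placeholders.
build_query_url <- function(base_url, params) {
  encoded <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  paste0(base_url, "?", paste(names(params), encoded, sep = "=", collapse = "&"))
}

example_url <- build_query_url(
  "https://example.org/api/relatedness",
  list(term1 = "rider", term2 = "ice cream")
)

# Sending the request with RCurl; defined here but not called,
# because the endpoint above does not exist.
fetch_relatedness <- function(term1, term2,
                              base_url = "https://example.org/api/relatedness") {
  if (!requireNamespace("RCurl", quietly = TRUE)) {
    stop("package 'RCurl' is required")
  }
  # getForm() appends the named arguments as form parameters of a GET request
  RCurl::getForm(base_url, term1 = term1, term2 = term2)
}
```

For a POST request, `RCurl::postForm()` takes the parameters in the same way.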
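The calculation formula might look a bit scary at first glance:

$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}}$$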
Looking at its step-by-step development in the article Automatic Meaning Discovery Using Google, the rationale behind it becomes quite easy to understand. What we need to know here is that M is the total number of web pages searched by Google, f(x) and f(y) are the counts for search terms x and y, respectively, and f(x, y) is the number of web pages on which both x and y occur (also see Wikipedia). So the ingredients are clear. Here comes the function.
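Before any web queries are involved, the formula can be tried on fixed numbers. The following is only a sketch: `ngd_from_counts` is a hypothetical helper name, and the three counts are illustrative inputs, not live Google results. M is the 2007 value used in this post.

```r
# NGD computed directly from three hit counts and the index size M.
# fx, fy: counts for the single terms; fxy: count for both terms together.
ngd_from_counts <- function(fx, fy, fxy, M = 8058044651) {
  numerator   <- max(log(fx), log(fy)) - log(fxy)
  denominator <- log(M) - min(log(fx), log(fy))
  numerator / denominator
}

# Illustrative counts for "horse", "rider" and "horse rider"
horse_rider <- ngd_from_counts(fx = 46700000, fy = 12200000, fxy = 2630000)
round(horse_rider, 3)  # 0.443

# If both terms always occur together, the distance is zero
identical_terms <- ngd_from_counts(1000, 1000, 1000)
```

A smaller value means the two terms co-occur more often than their individual counts would suggest.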
###############################################################
#
# description: returns the normalized google distance as
#              numeric value
#
# usage: NGD(words, language, print, list, ...)
#
# arguments:
#   words     TWO terms to measure for in
#             vector form e.g. c("wiki","R")
#   language  in which language to search.
#             Either "en" (english) or
#             "de" (german)
#   print     print all results (NGD, counts)
#             to console (default: FALSE)
#   list      returns list of results (default: FALSE)
#             containing NGD and all counts.
#   ...       at the moment only M (total number of
#             web pages) is read from here, if supplied

NGD <- function(words, language="de", print=FALSE, list=FALSE, ...){

  # check for arguments
  if(!hasArg(words)) stop('NGD needs TWO strings like c("word","word2") as word argument!')
  if(length(words)!=2) stop('word arguments has to be of length two, e.g. c("word","word2")')

  # M: total number of web pages searched by google (2007)
  if(hasArg(M)) M <- list(...)$M else M <- 8058044651

  x <- words[1]
  y <- words[2]

  # using getGoogleCount() function (see the previous post)
  freq.x  <- getGoogleCount(x, language=language)
  freq.y  <- getGoogleCount(y, language=language)
  freq.xy <- getGoogleCount(c(x,y), language=language)

  # apply formula (stored as ngd to avoid shadowing the function name)
  ngd <- (max(log(freq.x), log(freq.y)) - log(freq.xy)) /
         (log(M) - min(log(freq.x), log(freq.y)))

  # print results to console if requested
  if(print==TRUE){
    cat("\t", x, ":", freq.x, "\n",
        "\t", y, ":", freq.y, "\n",
        "\t", x, "+", y, ":", freq.xy, "\n",
        "\t", "normalized google distance (NGD):", ngd, "\n", "\n")
  }

  # return list of results if requested (default: FALSE)
  # containing NGD and all counts. By default only
  # the NGD is returned as numeric value
  results <- list(NGD=ngd,
                  x=c(x, freq.x),
                  y=c(y, freq.y),
                  xy=c(paste(x,"+",y), freq.xy))

  if(list==TRUE) return(results) else return(ngd)
}

# NOT RUN:
NGD(c("rider","horse"), print=TRUE)
NGD(c("rider","horse"), list=TRUE)   # returns a list

# may be applied to dataframes
DF <- data.frame(A=c("rider","religion"), B=c("horse","god"))
apply(DF, 1, NGD, print=TRUE)
###############################################################
The function returns the normalized Google distance and can be applied to a data frame containing pairs of words (like DF, see above). I am not sure whether the calculation gives correct results, so if anyone notices a flaw, please comment.
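If the counts for many word pairs are already at hand, the same formula can also be evaluated for all rows at once instead of row by row. This is a sketch under assumptions: `ngd_vec` is a hypothetical name, it uses `pmax()` and `pmin()` in place of `max()` and `min()`, and the counts in the data frame are made-up placeholder numbers, not real query results.

```r
# Vectorized NGD: every argument may be a vector of counts.
ngd_vec <- function(fx, fy, fxy, M = 8058044651) {
  (pmax(log(fx), log(fy)) - log(fxy)) / (log(M) - pmin(log(fx), log(fy)))
}

# Word pairs with placeholder counts (fA, fB: single terms; fAB: both terms)
pairs <- data.frame(
  A   = c("rider", "religion"),
  B   = c("horse", "god"),
  fA  = c(12200000, 50000000),
  fB  = c(46700000, 90000000),
  fAB = c(2630000, 20000000),
  stringsAsFactors = FALSE
)

# One NGD value per row, without calling apply()
pairs$NGD <- ngd_vec(pairs$fA, pairs$fB, pairs$fAB)
print(pairs)
```

This also separates the web queries from the arithmetic, which makes the calculation itself easier to check.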
Mark

Hi Mark,
thanks for your blog. I have been experimenting with the web capabilities R offers and I found your blog entry very useful.
Thanks a lot.
Hi Mark,
Your blog is very helpful for me, thanks a lot. I want to know more about your R code and the Google distance. Can you give me your email?
My email : qingyuyue@gmail.com
Can this technique be used to determine the NGD between two web pages, i.e. to tell whether they are dissimilar enough to be considered non-duplicates?
Hi,
I am not familiar with how to determine semantic similarity between several web pages as a whole. This script is designed to work with some words only. So, I do not think it will be useful for your purpose.
HTH
Hi,
Is the total number of web pages searched by Google accurate? I think the value of M changes in real time. So, is there a way to determine this “M”, perhaps using the Google API?
thank you
I think this value is obsolete by now. Unfortunately I was not able to find an accurate value published by Google. This M was taken from the literature. I do not know how sensitive the results are to changes in M. If you find a way to properly determine the value (maybe via the API), please let me know.
Mark
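One rough way to look at the sensitivity question is to hold the three counts fixed and vary only M. The sketch below does that; `ngd_for_M` is a hypothetical helper, the counts are illustrative inputs rather than live results, and the multipliers for M are arbitrary.

```r
# NGD as a function of M only, with the three counts held fixed.
ngd_for_M <- function(M, fx = 46700000, fy = 12200000, fxy = 2630000) {
  (max(log(fx), log(fy)) - log(fxy)) / (log(M) - min(log(fx), log(fy)))
}

# Start from the 2007 value and scale it up by arbitrary factors
M_values   <- 8058044651 * c(1, 2, 5, 10, 100)
ngd_values <- sapply(M_values, ngd_for_M)

data.frame(M = M_values, NGD = round(ngd_values, 3))
```

Since M enters only through log(M) in the denominator, the NGD shrinks as M grows, but slowly: for these inputs, doubling M moves the value from roughly 0.44 to roughly 0.40.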