### R: Normalized Google distance (NGD) in R part II

After my last posting on how to extract the Google hit count, I was searching the web and found a nice website that lets you calculate many semantic relatedness measures. On request, it seems to be possible to get free access to their API. The API allows you to post a request via the GET or POST method, which can be implemented in R using the RCurl package.

Anyway, I will post the code to do the *normalized google distance* (NGD) calculation using R only. Last time, the code for the Google count extraction in R was posted as a first step; here comes the second step, the calculation, using the function described last time.

The calculation formula might look a bit scary at a first glance:

NGD(x, y) = ( max{log f(x), log f(y)} - log f(x, y) ) / ( log M - min{log f(x), log f(y)} )
Looking at its step-by-step development in the article *Automatic Meaning Discovery Using Google*, it gets quite easy to understand the rationale behind it. What we need to know here is that *M* is the total number of web pages searched by Google. *f*(*x*) and *f*(*y*) are the counts for search terms *x* and *y*, respectively. *f*(*x*, *y*) is the number of web pages found on which both *x* and *y* occur (also see Wikipedia). So the ingredients are clear. Here comes the function.
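As a quick sanity check before the full function, here is the formula applied to made-up counts (the numbers are illustrative, not real Google results; only M is the 2007 figure used below):

```r
# Illustrative counts (hypothetical, not real Google hit counts)
M       <- 8058044651   # total pages indexed by Google (2007 figure)
freq.x  <- 1500000      # f(x)
freq.y  <- 30000000     # f(y)
freq.xy <- 500000       # f(x, y)

NGD <- (max(log(freq.x), log(freq.y)) - log(freq.xy)) /
       (log(M) - min(log(freq.x), log(freq.y)))
round(NGD, 3)   # 0.477
```

Values near 0 indicate closely related terms; values around 1 (or above) indicate largely unrelated terms.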

###############################################################
#
# description: returns the normalized google distance as
#              numeric value
#
# usage:       NGD(words, language, print, list, ...)
#
# arguments:
#   words     TWO terms to measure, in vector form,
#             e.g. c("wiki", "R")
#   language  in which language to search. Either "en"
#             (English) or "de" (German)
#   print     print all results (NGD, counts) to console
#             (no default)
#   list      return list of results (no default)
#             containing NGD and all counts
#   ...       M (total page count) may be passed to
#             override the default

NGD <- function(words, language="de", print=FALSE, list=FALSE, ...){
    # check for arguments
    if(!hasArg(words)) stop('NGD needs TWO strings like c("word","word2") as words argument!')
    if(length(words)!=2) stop('words argument has to be of length two, e.g. c("word","word2")')
    # M: total number of web pages searched by google (2007)
    if(hasArg(M)) M <- list(...)$M else M <- 8058044651
    x <- words[1]
    y <- words[2]
    # using getGoogleCount() function (see last posting)
    freq.x  <- getGoogleCount(x, language=language)
    freq.y  <- getGoogleCount(y, language=language)
    freq.xy <- getGoogleCount(c(x,y), language=language)
    # apply formula
    NGD <- (max(log(freq.x), log(freq.y)) - log(freq.xy)) /
           (log(M) - min(log(freq.x), log(freq.y)))
    # print results to console if requested
    if(print==TRUE){
        cat("\t", x, ":", freq.x, "\n",
            "\t", y, ":", freq.y, "\n",
            "\t", x, "+", y, ":", freq.xy, "\n",
            "\t", "normalized google distance (NGD):", NGD, "\n", "\n")
    }
    # return list of results if requested (no default),
    # containing NGD and all counts.
    # By default only the NGD is returned as a numeric value.
    results <- list(NGD=NGD, x=c(x, freq.x), y=c(y, freq.y),
                    xy=c(paste(x, "+", y), freq.xy))
    if(list==TRUE) return(results) else return(NGD)
}

# NOT RUN:
NGD(c("rider","horse"), print=TRUE)
NGD(c("rider","horse"), list=TRUE)   # returns a list

# may be applied to data frames
DF <- data.frame(A=c("rider","religion"), B=c("horse","god"))
apply(DF, 1, NGD, print=TRUE)
###############################################################

The function returns the *normalized google distance* and can be applied to a data frame containing pairs of words (like DF, see above). I am not sure if the calculation renders correct results, so if someone notices a flaw, please comment.

Mark



Hi Mark,

thanks for your blog. I have been experimenting with the web capabilities R offers and I found your blog entry very useful.

Thanks a lot.

Hi Mark,

Your blog is very helpful for me, thanks a lot. I want to know more about your R and the google distance. Can you give me your email?

My email : qingyuyue@gmail.com

Can this technique be used to determine the NGD between two web pages, i.e. to tell if they are dissimilar enough to be considered non-duplicates?

Hi,

I am not familiar with how to determine semantic similarity between several web pages as a whole. This script is designed to work with some words only. So, I do not think it will be useful for your purpose.

HTH

Hi,

Is the total number of web pages searched by Google accurate? I think the value of M changes in real time. So, is there a way to determine this “M”, perhaps using the Google API?

thank you

I think this value is obsolete by now. Unfortunately, I was not able to find an accurate value published by Google. This M was taken from the literature. I do not know how sensitive the results are to changes in the value of M. If you find a way to properly determine the value (maybe via the API), please let me know.
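One quick way to get a feel for the sensitivity is to hold the counts fixed and recompute the NGD for several candidate values of M (the counts below are made up for illustration):

```r
# Hypothetical fixed counts
freq.x  <- 1500000
freq.y  <- 30000000
freq.xy <- 500000

# NGD as a function of M only, counts held fixed
ngd.for.M <- function(M) {
  (max(log(freq.x), log(freq.y)) - log(freq.xy)) /
  (log(M) - min(log(freq.x), log(freq.y)))
}

# 2007 figure vs. estimates roughly one order of magnitude larger
sapply(c(8.058e9, 4.5e10, 8e10), ngd.for.M)
```

Since M only enters through log(M) in the denominator, a tenfold change in M shifts log(M) by about 2.3, which noticeably shrinks the NGD (larger M means smaller distance) but does not change the ranking of word pairs computed with the same M.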

Mark

M is roughly 4.5*10^9 according to worldwidewebsize.com, which looks like it has a good method for estimating it.

Sorry, that should read 4.5*10^10