### R: Normalized Google distance (NGD) in R part II

12Jan09

After my last post on how to extract the Google hit count, I searched the web and found a nice website that lets you calculate many semantic relatedness measures. On request it seems to be possible to get free access to their API. The API accepts requests via the GET or POST method, which can be implemented in R using the RCurl package.

Anyway, I will post the code to do the normalized Google distance (NGD) calculation using R only. Last time I posted the R code for the Google count extraction as a first step; here comes the second step, the calculation, using the function described last time.

The calculation formula might look a bit scary at first glance:

NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)})

Looking at its step-by-step development in the article Automatic Meaning Discovery Using Google makes the rationale behind it quite easy to understand. What we need to know here is that M is the total number of web pages searched by Google, f(x) and f(y) are the counts for the search terms x and y, respectively, and f(x, y) is the number of web pages on which both x and y occur (also see Wikipedia). So the ingredients are clear. Here comes the function.
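Before the full function, a small worked example may help. The counts below are purely illustrative placeholders (not live Google results), plugged directly into the formula:

```r
# Worked NGD example with illustrative (made-up) counts
M       <- 8058044651   # total pages indexed by Google (2007 figure)
freq.x  <- 46700000     # f("horse")       -- illustrative count
freq.y  <- 12200000     # f("rider")       -- illustrative count
freq.xy <- 2630000      # f("horse rider") -- illustrative count

ngd <- (max(log(freq.x), log(freq.y)) - log(freq.xy)) /
       (log(M) - min(log(freq.x), log(freq.y)))
round(ngd, 3)   # approx 0.443
```

Note that the base of the logarithm cancels out of the ratio, so the natural log() works just as well as log10().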

```
###############################################################
#
#  description:  returns the normalized Google distance as a
#                numeric value
#
#  usage:        NGD(words, language, print, list, ...)
#
#  arguments:
#                words      TWO terms to measure, supplied as a
#                           vector, e.g. c("wiki", "R")
#                language   language to search in. Either "en"
#                           (English) or "de" (German)
#                print      print all results (NGD, counts)
#                           to the console (default FALSE)
#                list       return a list of results (default
#                           FALSE) containing the NGD and all
#                           counts
#                ...        further arguments; M may be passed
#                           to override the total page count
#
###############################################################

NGD <- function(words, language = "de", print = FALSE,
                list = FALSE, ...) {

  # check for arguments
  if (!hasArg(words))
    stop('NGD needs TWO strings like c("word", "word2") as words argument!')
  if (length(words) != 2)
    stop('words argument has to be of length two, e.g. c("word", "word2")')

  # M: total number of web pages searched by Google (2007)
  if (hasArg(M)) M <- list(...)$M else M <- 8058044651

  x <- words[1]
  y <- words[2]

  # get the counts via the getGoogleCount() function posted
  # last time (assuming it takes a search term and a language)
  freq.x  <- getGoogleCount(x, language)
  freq.y  <- getGoogleCount(y, language)
  freq.xy <- getGoogleCount(paste(x, y), language)

  # apply formula
  NGD <- (max(log(freq.x), log(freq.y)) - log(freq.xy)) /
         (log(M) - min(log(freq.x), log(freq.y)))

  # print results to console if requested
  if (print) {
    cat("\t", x, ":", freq.x, "\n",
        "\t", y, ":", freq.y, "\n",
        "\t", x, "+", y, ":", freq.xy, "\n",
        NGD, "\n", "\n")
  }

  # return a list of results if requested (not the default),
  # containing the NGD and all counts. By default only the
  # NGD is returned as a numeric value.
  results <- list(NGD = NGD,
                  x   = c(x, freq.x),
                  y   = c(y, freq.y),
                  xy  = c(paste(x, "+", y), freq.xy))

  if (list) return(results) else return(NGD)
}

# NOT RUN:

NGD(c("rider", "horse"), print = TRUE)
NGD(c("rider", "horse"), list = TRUE)          # returns a list
# may be applied to data frames
DF <- data.frame(A = c("rider", "religion"), B = c("horse", "god"))
apply(DF, 1, NGD, print = TRUE)

###############################################################
```

The function returns the normalized Google distance and can be applied to a data frame containing pairs of words (like DF, see above). I am not sure whether the calculation renders correct results, so if someone notices a flaw, please comment.

Mark

#### 9 Responses to “R: Normalized Google distance (NGD) in R part II”

1. Mike

Hi Mark,
thanks for your blog. I have been experimenting with the web capabilities R offers and I found your blog entry very useful.
Thanks a lot.

2. qingyuyue

Hi Mark,

3. qingyuyue

My email : qingyuyue@gmail.com

4. Can this technique be used to determine the NGD distance between two web pages, i.e. to tell if they are dissimilar enough to be considered non-duplicates?

5. markheckmann

Hi,
I am not familiar with how to determine semantic similarity between several web pages as a whole. This script is designed to work with some words only. So, I do not think it will be useful for your purpose.

HTH

6. JianYi

Hi,
Is the total number of web pages searched by Google accurate? I think the value of M changes in real time. So, is there a way to determine this “M”, perhaps using the Google API?
thank you

7. markheckmann

I think this value is obsolete by now. Unfortunately I was not able to find an accurate value published by Google. This M was taken from literature. I do not know how sensitive the results are to value changes in M. If you find a way to properly determine the value (maybe via the API) please let me know.
Mark

8. Matt

M is roughly 4.5*10^9 according to worldwidewebsize.com, which looks like it has a good method for estimating it.

9. Matt