Thursday, July 5, 2012

2012-07-05: R Package Recommendation

Much of my research is focused on data-mining the Collective Intelligence of the Internet to see whether any type of group intelligence emerges out of the vast amount of data present on the Web. Sifting through even a portion of the available data is a daunting task, and I frequently rely on Python to handle most of the heavy lifting. Over the past year I have been trying to increase my use of R and of the many interesting packages available on CRAN.

In my quest to become more proficient in R and to use it in more of my research, I am continually experimenting with new and interesting code and data examples. Of particular interest to me are data mashups, where real-world data is collected, some form of intelligence is extracted via data-mining or machine learning, and an informative graphic is then produced to share what was learned from the data.

A recently published book that grabbed my attention is Machine Learning for Hackers. This book is a down-to-earth guide on how to use R to manipulate real-world examples of data. There are some typographical errors in the book and a few bugs in the code, but otherwise it is a great reference on using R. One item that caught my attention was Kaggle. Kaggle is a platform for predictive modelling and analytics competitions. Anyone can create an account and participate in the competitions. Many of them are similar to the famous Netflix Prize.

One of the competitions on Kaggle was an R Package Recommendation Engine whose aim was to suggest R packages that a user may find useful. The training data was the list of installed packages for a group of professional R users. The book addressed this competition and gave a simple solution, but I felt it needed something more. So I wrote some R code that reads your currently installed packages, compares them to the training data, and suggests a list of packages that might be useful based upon the packages you already have installed.
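As a rough illustration of the first step, reading the installed packages and matching them against the training data, here is a minimal sketch. It is not the code from the repository: the candidates vector is a made-up stand-in for the package names found in the training data, and only the base R function installed.packages() is assumed.

```r
# Names of every package installed in the current R library paths
installed <- rownames(installed.packages())

# Hypothetical stand-in for the package names in the training data
candidates <- c("abind", "stats", "arules", "utils")

# 1 if the package is installed locally, 0 otherwise
have <- as.integer(candidates %in% installed)
names(have) <- candidates

print(have)
```

A 0/1 vector like this has the same shape as one row of the training data, which is what makes the comparison against other users possible.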

The main method used to create the suggested packages is the k-nearest neighbor algorithm, which attempts to classify an object based upon the objects that are closest to it. The code is available on Bitbucket - R Package Recommendation. Instructions are located in the header.

The easiest way to obtain the code is to clone the repository with Mercurial or to download it from Bitbucket. The code will download the training data from
http://www.cs.odu.edu/~gszalkow/data/r_pkg_rec/mod_installations.csv if the file is not already present in the same directory. Your currently installed packages are listed and compared to the training data using the Pearson correlation. This value is then converted into a distance measurement and the k-nearest neighbor algorithm is run. Each package is then weighted with a probability of being installed, based upon what is already installed, and the top ten packages are displayed as suggestions.
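The steps just described (correlation, distance, nearest neighbors, weighting) can be sketched on toy data as follows. This is an illustration under stated assumptions, not the code from the repository: the installs matrix is invented, the helper names k_nearest and score are hypothetical, and 1 - correlation is just one simple way to turn a correlation into a distance, so the actual script may use a different conversion.

```r
# Toy installation matrix: rows are users, columns are packages (1 = installed).
# These values are invented for illustration only.
installs <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1,
    1, 0, 0, 1,
    0, 1, 1, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("pkgA", "pkgB", "pkgC", "pkgD"))
)

# Pearson correlation between every pair of package columns
similarities <- cor(installs)

# Assumption: convert correlation to distance with 1 - r (0 = identical)
distances <- 1 - similarities

# Indices of the k packages closest to package i, excluding i itself
k_nearest <- function(i, distances, k) {
  others <- setdiff(seq_len(ncol(distances)), i)
  others[order(distances[i, others])][seq_len(k)]
}

# Score each package by the fraction of its k nearest neighbors
# that the user already has installed
score <- function(have, distances, k) {
  sapply(seq_along(have), function(i) mean(have[k_nearest(i, distances, k)]))
}

# A hypothetical user who has pkgA and pkgB installed
have <- c(pkgA = 1, pkgB = 1, pkgC = 0, pkgD = 0)
scores <- score(have, distances, k = 2)
names(scores) <- names(have)

# Suggest the packages not yet installed, highest score first
suggestions <- names(sort(scores[have == 0], decreasing = TRUE))
print(scores)
print(suggestions)
```

With the real training data the same idea applies to thousands of package columns, and only the ten highest-scoring packages are shown.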


Experimentally, a k value of 25 seemed to work best for me, and that is the default value in the code. Your installed packages will of course be different, so your results may differ. Please feel free to experiment and let us know how it works out.

If you source("r_pkg_rec.R") you can inspect some of the variables.

The variable similarities holds all of the Pearson correlation coefficients; it is a big 2489 x 2489 matrix. You can view the first 10 rows and columns like this:
similarities[1:10,1:10]
                              X         User       abind AcceptanceSampling
X                   1.000000000  0.997943655  0.09855207         0.01765247
User                0.997943655  1.000000000  0.09160935         0.02874688
abind               0.098552066  0.091609354  1.00000000        -0.05815526
AcceptanceSampling  0.017652470  0.028746882 -0.05815526         1.00000000
ACCLMA             -0.007927525  0.008845982 -0.20313123         0.35240989
accuracy            0.315251995  0.324641246 -0.08966437        -0.09487490
acepack            -0.386083121 -0.377992168  0.11554204        -0.16173938
aCGH.Spline         0.317747123  0.316254323  0.10010416        -0.09221389
actuar              0.113627857  0.105917964  0.11984743        -0.06081410
ada                 0.131430415  0.122268427  0.05051568         0.17611828
                         ACCLMA    accuracy      acepack  aCGH.Spline
X                  -0.007927525  0.31525199 -0.386083121  0.317747123
User                0.008845982  0.32464125 -0.377992168  0.316254323
abind              -0.203131229 -0.08966437  0.115542040  0.100104163
AcceptanceSampling  0.352409893 -0.09487490 -0.161739375 -0.092213889
ACCLMA              1.000000000 -0.12475847  0.111165896  0.104680881
accuracy           -0.124758469  1.00000000 -0.145665943  0.061481645
acepack             0.111165896 -0.14566594  1.000000000 -0.007184676
aCGH.Spline         0.104680881  0.06148164 -0.007184676  1.000000000
actuar              0.047385621 -0.02144286  0.111165896 -0.066029479
ada                 0.196401523 -0.19592961 -0.139737105 -0.008343770
                        actuar         ada
X                   0.11362786  0.13143042
User                0.10591796  0.12226843
abind               0.11984743  0.05051568
AcceptanceSampling -0.06081410  0.17611828
ACCLMA              0.04738562  0.19640152
accuracy           -0.02144286 -0.19592961
acepack             0.11116590 -0.13973710
aCGH.Spline        -0.06602948 -0.00834377
actuar              1.00000000 -0.07280401
ada                -0.07280401  1.00000000
You can see that the diagonal values have a correlation coefficient of 1.0, so each package is perfectly similar to itself.
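Beyond the diagonal, a matrix like this can also be queried for the packages most correlated with a given one. The sketch below rebuilds a 4 x 4 corner of the matrix from the values printed above so that it runs on its own; the helper most_similar is a hypothetical name and is not part of the script.

```r
# A small corner of the similarity matrix, copied from the output above
pkgs <- c("abind", "AcceptanceSampling", "ACCLMA", "accuracy")
sims <- matrix(
  c( 1.00000000, -0.05815526, -0.20313123, -0.08966437,
    -0.05815526,  1.00000000,  0.35240989, -0.09487490,
    -0.20313123,  0.35240989,  1.00000000, -0.12475847,
    -0.08966437, -0.09487490, -0.12475847,  1.00000000),
  nrow = 4, byrow = TRUE, dimnames = list(pkgs, pkgs)
)

# Return the n packages most correlated with pkg, excluding pkg itself
most_similar <- function(sims, pkg, n = 2) {
  others <- sims[pkg, colnames(sims) != pkg]
  head(sort(others, decreasing = TRUE), n)
}

print(most_similar(sims, "ACCLMA", n = 1))
```

After sourcing the script, the same function could be called on the full similarities matrix instead of this excerpt.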

Each of my computers is a little different and has a slightly different set of packages installed. As a result, the suggested packages are similar from machine to machine, but there are differences. The machine that I am on now, with k=25, suggested the following packages.
 listingA[1:10]
 [1] "anacor"        "CoxBoost"      "klin"          "surv2sample"  
 [5] "arules"        "BayHaz"        "blockmodeling" "cclust"       
 [9] "cmrutils"      "DiagnosisMed" 


The packages klin, surv2sample, arules, cclust, cmrutils, and DiagnosisMed have been common to all of my machines. The result seems logical as I have a base set of packages that I install on all of my machines.

I would like to thank Justin Brunelle for helping me test the code.



-- Greg Szalkowski
