In my quest to become more proficient in R and to use it in more of my research, I am continually experimenting with new and interesting code and data examples. Of particular interest to me are data mashups where an example of real-world data is collected, some form of intelligence is extracted via data-mining or machine learning and then an informative graphic is produced that shares the information obtained from the data.
A recently published book that grabbed my attention is Machine Learning for Hackers. This book is a down-to-earth guide on how to use R to manipulate real-world examples of data. There are some typographical errors in the book and a few bugs in the code, but otherwise it is a great reference on using R. One item that caught my attention was Kaggle. Kaggle is a platform for predictive modelling and analytics competitions. Anyone can create an account and participate in the competitions. Many of them are similar to the famous Netflix Prize.
One of the competitions on Kaggle was an R Package Recommendation Engine, whose aim was to suggest R packages that a user may find useful. The training data was the list of installed packages for a group of professional R users. The book addressed this competition and gave a simple solution, but I felt it needed something more. So I wrote some R code that reads your currently installed packages, compares them to the training data, and suggests a list of packages that might be useful based on what you already have installed.
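As a rough illustration of the "read your currently installed packages" step, here is a minimal sketch. The vector training_packages is a hypothetical stand-in for the package names in the training data, and the actual script may do this differently.

```r
# Names of the packages installed in the current R library paths.
installed <- rownames(installed.packages())

# Hypothetical stand-in for the package names found in the training data.
training_packages <- c("abind", "actuar", "ada", "stats", "utils")

# 1 if the package is installed locally, 0 otherwise.
have <- as.integer(training_packages %in% installed)
names(have) <- training_packages
print(have)
```

The result is a 0/1 vector in the same package order as the training data, which is the shape you need to compare one machine against many users.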
The main method used to create the suggested packages is the k-nearest neighbor algorithm. The k-nearest neighbor algorithm attempts to classify an object based upon the objects that are closest to it. The code is available on Bitbucket - R Package Recommendation. Instructions are located in the header.
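To make the idea concrete, here is a small, generic k-nearest neighbor sketch in base R. It is not taken from the recommendation script; the toy data and the function name knn_classify are made up for illustration.

```r
# Toy training set: two numeric features and a class label per point.
train <- data.frame(
  x = c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8),
  y = c(1.0, 0.9, 1.1, 5.0, 5.1, 4.9),
  label = c("a", "a", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# Classify one new point by majority vote among its k nearest training points
# (hypothetical helper, Euclidean distance).
knn_classify <- function(train, point, k = 3) {
  d <- sqrt((train$x - point[1])^2 + (train$y - point[2])^2)
  nearest <- order(d)[seq_len(k)]
  votes <- table(train$label[nearest])
  names(votes)[which.max(votes)]
}

near_a <- knn_classify(train, c(1.1, 1.0))
near_b <- knn_classify(train, c(4.9, 5.0))
print(c(near_a, near_b))
```

The recommendation code applies the same "look at the closest items" idea, with packages as the objects and a correlation-based distance in place of the Euclidean one.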
The easiest way to obtain the code is to clone the repository with Mercurial or download it from Bitbucket. The code will download the training data from
http://www.cs.odu.edu/~gszalkow/data/r_pkg_rec/mod_installations.csv if the file is not already present in the same directory. Your currently installed packages are listed and compared to the training data using the Pearson correlation. This value is then converted into a distance measurement and the k-nearest neighbor algorithm is run. Each of the packages is then weighted with a probability of being installed based upon what is already installed, and the top ten packages are displayed as suggestions.
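The steps above can be sketched on a toy data set. This is only an illustration under stated assumptions: the install matrix is made up, the function name recommend is hypothetical, and the transform -log(r / 2 + 0.5) is one way to turn a correlation into a distance (the one I recall from the book); the script's exact details may differ.

```r
# Toy user-by-package matrix: rows are users, columns are packages,
# 1 means installed. The real data would come from the training CSV.
installs <- matrix(
  c(1, 1, 0, 0, 1,
    1, 1, 0, 0, 0,
    0, 0, 1, 1, 0,
    0, 1, 1, 1, 0,
    1, 1, 0, 1, 1,
    1, 0, 1, 1, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("user", 1:6),
                  c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE"))
)

# Pearson correlation between every pair of package columns.
similarities <- cor(installs)

# Assumed transform: correlation 1 maps to distance 0, -1 maps to Inf.
distances <- -log(similarities / 2 + 0.5)

# Hypothetical helper: score each package you lack by the share of its
# k nearest neighbor packages that you already have.
recommend <- function(have, installs, distances, k = 2, n = 3) {
  pkgs <- colnames(installs)
  candidates <- setdiff(pkgs, have)
  if (length(candidates) == 0) return(numeric(0))
  scores <- sapply(candidates, function(p) {
    d <- distances[p, setdiff(pkgs, p)]          # drop the package itself
    neighbors <- names(sort(d))[seq_len(min(k, length(d)))]
    mean(neighbors %in% have)
  })
  head(sort(scores, decreasing = TRUE), n)
}

top <- recommend(c("pkgA", "pkgB"), installs, distances, k = 2)
print(top)
```

With only five toy packages, k = 2 is used here; on the full training data a larger k is more sensible.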
Experimentally, a k value of 25 seemed to work best for me, and that is the default value in the code. Your installed packages will of course be different, and your results may differ. Please feel free to experiment and let us know how it works out.
If you source("r_pkg_rec.R") you can inspect some of the variables.
The variable similarities holds all of the Pearson correlation coefficients; it is a big 2489 x 2489 matrix. You can view the first 10 rows and columns like this:
similarities[1:10,1:10]
                              X         User       abind AcceptanceSampling
X                   1.000000000  0.997943655  0.09855207         0.01765247
User                0.997943655  1.000000000  0.09160935         0.02874688
abind               0.098552066  0.091609354  1.00000000        -0.05815526
AcceptanceSampling  0.017652470  0.028746882 -0.05815526         1.00000000
ACCLMA             -0.007927525  0.008845982 -0.20313123         0.35240989
accuracy            0.315251995  0.324641246 -0.08966437        -0.09487490
acepack            -0.386083121 -0.377992168  0.11554204        -0.16173938
aCGH.Spline         0.317747123  0.316254323  0.10010416        -0.09221389
actuar              0.113627857  0.105917964  0.11984743        -0.06081410
ada                 0.131430415  0.122268427  0.05051568         0.17611828
                         ACCLMA    accuracy      acepack  aCGH.Spline
X                  -0.007927525  0.31525199 -0.386083121  0.317747123
User                0.008845982  0.32464125 -0.377992168  0.316254323
abind              -0.203131229 -0.08966437  0.115542040  0.100104163
AcceptanceSampling  0.352409893 -0.09487490 -0.161739375 -0.092213889
ACCLMA              1.000000000 -0.12475847  0.111165896  0.104680881
accuracy           -0.124758469  1.00000000 -0.145665943  0.061481645
acepack             0.111165896 -0.14566594  1.000000000 -0.007184676
aCGH.Spline         0.104680881  0.06148164 -0.007184676  1.000000000
actuar              0.047385621 -0.02144286  0.111165896 -0.066029479
ada                 0.196401523 -0.19592961 -0.139737105 -0.008343770
                        actuar         ada
X                   0.11362786  0.13143042
User                0.10591796  0.12226843
abind               0.11984743  0.05051568
AcceptanceSampling -0.06081410  0.17611828
ACCLMA              0.04738562  0.19640152
accuracy           -0.02144286 -0.19592961
acepack             0.11116590 -0.13973710
aCGH.Spline        -0.06602948 -0.00834377
actuar              1.00000000 -0.07280401
ada                -0.07280401  1.00000000
You can see that the diagonal values have a correlation coefficient of 1.0, so each package is perfectly similar to itself.
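If you want to convince yourself of the diagonal of ones, a quick check on a made-up matrix works just as well as the full data. The matrix m below is hypothetical; the same calls should apply to similarities after sourcing the script.

```r
# Hypothetical toy install matrix (users in rows, packages in columns).
m <- cbind(
  pkgA = c(1, 1, 0, 0, 1),
  pkgB = c(1, 0, 0, 1, 1),
  pkgC = c(0, 0, 1, 1, 0)
)
s <- cor(m)

# Every column correlates perfectly with itself.
diag_ok <- isTRUE(all.equal(unname(diag(s)), rep(1, ncol(m))))

# The matrix is symmetric: cor(a, b) equals cor(b, a).
sym_ok <- isSymmetric(s)

# Other packages ranked by similarity to pkgA, most similar first.
ranked <- sort(s["pkgA", colnames(s) != "pkgA"], decreasing = TRUE)
print(list(diag_ok = diag_ok, sym_ok = sym_ok, ranked = ranked))
```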
Each of my computers is a little different and has a slightly different set of packages installed. That being said, the suggested packages are similar across machines, but there are differences. The machine that I am on now, with k=25, suggested the following packages.
listingA[1:10]
"anacor" "CoxBoost" "klin" "surv2sample"
"arules" "BayHaz" "blockmodeling" "cclust"
"cmrutils" "DiagnosisMed"
The packages klin, surv2sample, arules, cclust, cmrutils, and DiagnosisMed have been common to all of my machines. The result seems logical as I have a base set of packages that I install on all of my machines.
I would like to thank Justin Brunelle for helping me test the code.
-- Greg Szalkowski