
Friday, October 3, 2014

Introduction to plyr package


plyr: split-apply-combine strategy

The plyr package implements the split-apply-combine strategy: split a larger dataset into subsets, apply a function to each subset, and combine the results. It provides a simpler, more consistent alternative to the base "apply" family of functions, and makes such code easier and faster to write.
When to use plyr:
1. Calculating summaries (e.g., group-wise means) of subsets of data
When not to use plyr:
1. Dynamic simulations
2. Overlapping subsets of data

Basic Format of plyr:

A plyr command has the format **ply. The first * refers to the input data format; the second * refers to the output data format.
Data format notation:
d - data frame
l - list
a - array
_ - discard
Example
ddply – input is data frame, output is data frame
dlply – input is data frame, output is list
Usage:
ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
The key arguments are:
  • .data - dataset to be processed
  • .variables - variables to split data frame by
  • .fun - function to apply to each piece
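Before moving to the real data, here is a minimal toy example of these arguments in action (the small data frame below is made up purely for illustration):

```r
library(plyr)

# toy data frame: two groups, one measurement column
df <- data.frame(group = c("a", "a", "b", "b"),
                 value = c(1, 2, 3, 5))

# split df by group, apply summarise to each piece,
# and combine the results back into a data frame
ddply(df, .(group), summarise, mean_value = mean(value))
##   group mean_value
## 1     a        1.5
## 2     b        4.0
```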
The plyr package comes with the baseball dataset: the yearly batting records of all major league baseball players. I use this dataset to demonstrate the functionality of the plyr package.
library(plyr)
## 
## Attaching package: 'plyr'
## 
data(baseball)

## finding the year-wise mean of number of games,
## times at bat, runs, and hits

yearWiseMean <- ddply(baseball, .(year), function(df) colMeans(df[,6:9]))

head(yearWiseMean)
##   year     g    ab     r     h
## 1 1871 28.00 135.9 33.57 42.14
## 2 1872 29.46 140.8 32.15 42.92
## 3 1873 46.31 217.6 48.46 68.54
## 4 1874 49.00 226.7 44.00 64.87
## 5 1875 57.82 256.9 47.18 73.29
## 6 1876 58.67 258.4 43.27 72.40

Using transform, summarise, and other helper functions like mutate in plyr

Suppose we are looking at the record of one player:
## Looking at the record of one player
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth, cyear = year - min(year) + 1)
plyr makes it easier to do the same for all the players.
baseball_with_cyear <- ddply(baseball, .(id), transform, cyear=year-min(year) +1)

## summarizing home runs by team for all players
homerun_summary_byTeam <- ddply(baseball, c("year", "team"), summarize,
           homeruns = sum(hr))
head(homerun_summary_byTeam)
##   year team homeruns
## 1 1871  CL1        4
## 2 1871  FW1        0
## 3 1871  NY2        1
## 4 1871  RC1        0
## 5 1871  TRO        2
## 6 1871  WS3        0
The plyr helper function mutate allows transformations to be applied iteratively, with each new variable available to the ones defined after it.
## define a function to calculate career year and career percentage
calculate_cyear <- function(df) {
  mutate(df,
         cyear = year - min(year),
         cpercent = cyear / (max(year) - min(year))
  )
}

baseball <- ddply(baseball, .(id), calculate_cyear)
Using transform here would throw an error (object 'cyear' not found), because cpercent refers to cyear, which is created in the same call. mutate, another helper function in the plyr package, allows a newly created variable to be used in subsequent transformations within the same call.
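The difference can be seen on a small made-up example (not part of the original post): transform evaluates all its arguments against the original data frame, while mutate evaluates them sequentially:

```r
library(plyr)

df <- data.frame(year = c(2000, 2001, 2002))

# mutate: 'cyear' is already available when 'cpercent' is computed
mutate(df,
       cyear = year - min(year),
       cpercent = cyear / (max(year) - min(year)))

# transform: fails, because 'cyear' does not exist in df yet
# transform(df,
#           cyear = year - min(year),
#           cpercent = cyear / (max(year) - min(year)))
## Error: object 'cyear' not found
```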

Fitting a regression model for each player

### Fitting a regression model
model <- function(df) {
  lm(rbi / ab ~ cyear, data=df)
}

baseball <- subset(baseball, ab >= 25)

## Break up by player, fit a linear regression model to each piece,
## and return a list of models
bmodels <- dlply(baseball, .(id), model)

## Getting slope and intercept of the model
bcoefs <- ldply(bmodels, coef)
names(bcoefs)[2:3] <- c("intercept", "slope")

head(bcoefs)
##          id intercept      slope
## 1 aaronha01   0.18344  0.0001478
## 2 abernte02   0.00000         NA
## 3 adairje01   0.08599 -0.0007119
## 4 adamsba01   0.06025  0.0012002
## 5 adamsbo03   0.08675 -0.0019239
## 6 adcocjo01   0.14839  0.0027383
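As a possible extension (not in the original post), the same dlply/ldply pattern can also pull a goodness-of-fit measure such as R-squared out of each player's model:

```r
library(plyr)

# extract R-squared from each fitted model; 'bmodels' is the list
# of per-player lm fits created above with dlply
rsq <- function(m) c(r2 = summary(m)$r.squared)

bcoefs_r2 <- ldply(bmodels, rsq)
head(bcoefs_r2)
```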

Thursday, September 4, 2014

Movie Recommender System using recommenderlab

Recommendation systems are used in day-to-day life: book search, online shopping, movie search, and social networking, to name a few. A recommender system applies statistical and knowledge-discovery techniques to recommend new items to a user based on previously recorded data. These recommendations can be used to increase customer retention, promote cross-selling, and add value to the buyer-seller relationship.

Broadly, recommender systems are classified into two categories:

  • Content-based: recommending items that share common attributes with items the user has preferred
  • Collaborative filtering: recommending items liked by users who share common preferences

Commonly used metrics to quantify the performance of recommender systems are Root Mean Squared Error (RMSE), precision, and recall.
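For reference, RMSE over predicted and actual ratings can be computed with a short generic function (a sketch, not tied to any particular package):

```r
# root mean squared error between actual and predicted ratings,
# ignoring missing entries
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2, na.rm = TRUE))
}

rmse(c(4, 3, 5), c(3.5, 3, 4))
## [1] 0.6454972
```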

R has a nice package, recommenderlab, that provides infrastructure to develop and test recommender algorithms. recommenderlab focuses on algorithms based on collaborative filtering.

I used recommenderlab to get insight into collaborative filtering algorithms and to evaluate the performance of the different algorithms available in the framework on the MovieLens 100k dataset. The dataset is downloaded from here.

###### Recommender System algorithm implementation on MovieLens 100k data ###

## load libraries ####
library(recommenderlab)
library(reshape2)


# Load Movie Lens data
dataList<- readData()
# data cleansing and preprocessing
ratingDF<- preProcess(dataList$ratingDF, dataList$movieDF)
# create movie rating matrix
movieRatingMat<- createRatingMatrix(ratingDF)
# evaluate models
evalList <- evaluateModels(movieRatingMat)
## RANDOM run 
##   1  [0.01sec/0.47sec] 
## POPULAR run 
##   1  [0.04sec/0.09sec] 
## UBCF run 
##   1  [0.02sec/20.99sec]
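The helper functions above (readData, preProcess, createRatingMatrix, evaluateModels) come from the full script linked at the end of the post. A function like evaluateModels might look roughly like this sketch built on recommenderlab's evaluationScheme and evaluate (my reconstruction, not the author's actual code):

```r
library(recommenderlab)

# sketch: split the rating matrix, then evaluate several algorithms
evaluateModelsSketch <- function(ratingMat) {
  # 90/10 train/test split; 10 items per test user are revealed
  # ('given'), and ratings >= 4 count as "good"
  scheme <- evaluationScheme(ratingMat, method = "split",
                             train = 0.9, given = 10, goodRating = 4)
  algorithms <- list(
    "RANDOM"  = list(name = "RANDOM",  param = NULL),
    "POPULAR" = list(name = "POPULAR", param = NULL),
    "UBCF"    = list(name = "UBCF",    param = NULL)
  )
  # evaluate top-N recommendations for several list lengths
  evaluate(scheme, algorithms, n = c(1, 3, 5, 10, 15, 20))
}
```

The list lengths n = c(1, 3, 5, 10, 15, 20) match the rows of the confusion matrix shown later in the post.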

The plot for comparing “Random”, “Popular”, “UBCF” based recommender algorithm is shown:

# plot evaluation result
visualise(evalList)

[Plot: comparison of the "Random", "Popular", and "UBCF" recommender algorithms]

The visualisation shows that the "UBCF" algorithm has the highest precision, so I picked "UBCF" to predict the top 5 recommendations for the user with userID = 1.

## on visualization, looks like UBCF has highest precision.
# get Confusion matrix for "UBCF"
getConfusionMatrix(evalList[["UBCF"]])[[1]][,1:4]
##        TP      FP    FN   TN
## 1  0.4316  0.5579 50.80 1602
## 3  1.3684  1.6000 49.86 1601
## 5  2.0000  2.9474 49.23 1600
## 10 3.6632  6.2316 47.57 1597
## 15 4.9368  9.9053 46.29 1593
## 20 6.0947 13.6947 45.14 1589
## run "UBCF" recommender
rec_model <- createModel(movieRatingMat, "UBCF")
userID <- 1
topN <- 5
recommendations(movieRatingMat, rec_model, userID, topN)
## [[1]]
## [1] "Glory (1989)"             "Schindler's List (1993)" 
## [3] "Close Shave, A (1995)"    "Casablanca (1942)"       
## [5] "Leaving Las Vegas (1995)"
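The createModel and recommendations helpers wrap recommenderlab's standard interface; a minimal equivalent using Recommender and predict directly might look like this (a sketch under my assumptions, not the author's code):

```r
library(recommenderlab)

# build a UBCF recommender from the full rating matrix
# ('movieRatingMat' is assumed to be a realRatingMatrix, as above)
rec <- Recommender(movieRatingMat, method = "UBCF")

# top-5 recommendations for the first user
pred <- predict(rec, movieRatingMat[1, ], n = 5, type = "topNList")
as(pred, "list")
```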

The complete R code can be found here.

Thursday, August 21, 2014

Analyzing Median Income and School Rating in Kansas City Metro


I have been exploring the schools around the Kansas City metro area. I thought of using R to analyse school ratings in the area and see if there is any link between school rating and the median income of the population by zip code. I used the US Census ACS data API for median income data and the Education.com data API to retrieve school rating data. I found this a good resource to help me get to my objective.

In the process, I gained good insight into working with data APIs in R, choropleth maps, and overlaying different kinds of information in a single map.

Data Plot

[Plot: choropleth map of median income by zip code, with school ratings overlaid]

Conclusion

The Education.com API does not have test rating information for schools in many zip codes. From the overlay of income and rating data (where available), it looks like zip codes with lower median income tend to have lower school ratings, though there are outliers. At this point, however, I cannot conclude that there is a definite pattern between median income and school rating; there could be other factors that affect the school ratings in an area.

This report was generated using the knitr package in R.