Machine Learning and Data Mining in R: Introduction to plyr package

plyr: split-apply-combine strategy

The plyr package is a framework built around the split-apply-combine strategy: it splits a larger dataset into subsets, applies a function to each subset, and combines the results. It provides a simplified alternative to the base R “apply” family of functions, and it makes this kind of code easier and faster to write.
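As a rough illustration of the three steps, the sketch below computes a group mean first by hand in base R and then with a single plyr call. The scores data frame and its values are made up for this example and are not part of the plyr package.

```r
library(plyr)

# Toy data (hypothetical values, used only for illustration)
scores <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  value = c(1, 3, 10, 20, 5)
)

# Split: one data frame per group
pieces <- split(scores, scores$group)

# Apply: compute the mean of each piece
piece_means <- lapply(pieces, function(piece) {
  data.frame(group = piece$group[1], mean_value = mean(piece$value))
})

# Combine: stack the per-group results into one data frame
base_result <- do.call(rbind, piece_means)

# The same computation as one plyr call
plyr_result <- ddply(scores, .(group), summarise, mean_value = mean(value))
```

Both results hold one row per group with the same means; the plyr version simply hides the split and combine steps.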

When to use plyr:
1. Calculating a summary statistic, such as the mean, for each subset

When not to use plyr:
1. Dynamic simulation
2. Overlapping data

Basic Format of plyr:

plyr commands are named in the format **ply. The first * denotes the input data format, and the second * denotes the output data format.

Data format notation:
d - data frame
l - list
a - array
_ - discard (output only; the function is run for its side effects)

Example
ddply – input is data frame, output is data frame
dlply – input is data frame, output is list
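The naming scheme can be tried out on a small example. This is a minimal sketch; the toy data frame and the anonymous functions are assumptions made for illustration.

```r
library(plyr)

# Toy data frame (hypothetical values)
toy <- data.frame(g = c("x", "x", "y"), v = c(1, 2, 3))

# d -> d: data frame in, data frame out
as_df <- ddply(toy, .(g), summarise, total = sum(v))

# d -> l: data frame in, list out (one element per group)
as_list <- dlply(toy, .(g), function(piece) sum(piece$v))

# l -> d: list in, data frame out
back_to_df <- ldply(list(a = 1:3, b = 4:6),
                    function(x) data.frame(total = sum(x)))

# d -> _: data frame in, output discarded; only side effects remain
d_ply(toy, .(g), function(piece) message("rows in group: ", nrow(piece)))
```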

Usage:

ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)

The key arguments are:

.data - dataset to be processed

.variables - variables to split data frame by

.fun - function to apply to each piece
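The sketch below shows how these arguments fit together on a toy data frame. The data and the mean_runs helper are hypothetical; the splitting variable is given in the three forms that ddply accepts (quoted with .(), as a character vector, or as a formula), and extra arguments after .fun are passed on to it.

```r
library(plyr)

# Toy data (hypothetical values)
toy <- data.frame(
  year = c(2000, 2000, 2001, 2001),
  team = c("A", "B", "A", "B"),
  runs = c(10, NA, 30, 40)
)

# Three ways to name the splitting variable
by_quoted  <- ddply(toy, .(year), summarise, n = length(runs))
by_string  <- ddply(toy, "year", summarise, n = length(runs))
by_formula <- ddply(toy, ~ year, summarise, n = length(runs))

# mean_runs is a hypothetical helper; na.rm reaches it through ...
mean_runs <- function(piece, na.rm = FALSE) {
  data.frame(mean_runs = mean(piece$runs, na.rm = na.rm))
}
with_na_removed <- ddply(toy, .(year), mean_runs, na.rm = TRUE)
```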

The plyr package comes with the baseball dataset, the yearly batting records of major league baseball players. I am using the baseball dataset to demonstrate the functionality of the plyr package.

library(plyr)

data(baseball)

## Find the year-wise mean of the number of games, times at bat, runs, and hits

yearWiseMean <- ddply(baseball, .(year), function(df) colMeans(df[,6:9]))

head(yearWiseMean)

##   year     g    ab     r     h
## 1 1871 28.00 135.9 33.57 42.14
## 2 1872 29.46 140.8 32.15 42.92
## 3 1873 46.31 217.6 48.46 68.54
## 4 1874 49.00 226.7 44.00 64.87
## 5 1875 57.82 256.9 47.18 73.29
## 6 1876 58.67 258.4 43.27 72.40

Using transform, summarise, and other helper functions like mutate in plyr

Suppose we are looking at the record of one player and want to add the year of his career (cyear) to each season.

## Looking at the record of one player
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth, cyear = year - min(year) + 1)

plyr makes it easier to do the same for all the players.

baseball_with_cyear <- ddply(baseball, .(id), transform, cyear = year - min(year) + 1)

## Summarizing home runs by year and team
homerun_summary_byTeam <- ddply(baseball, c("year", "team"), summarize,
           homeruns = sum(hr))
head(homerun_summary_byTeam)

##   year team homeruns
## 1 1871  CL1        4
## 2 1871  FW1        0
## 3 1871  NY2        1
## 4 1871  RC1        0
## 5 1871  TRO        2
## 6 1871  WS3        0

The plyr helper function mutate applies transformations iteratively, so a later column can be computed from a column created earlier in the same call.

## Define a function to calculate the career year played and the fraction of the career completed
calculate_cyear <- function(df) {
  mutate(df,
         cyear = year - min(year),
         cpercent = cyear / (max(year) - min(year))
  )
}

baseball <- ddply(baseball, .(id), calculate_cyear)

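Using transform in place of mutate here would throw an error (cyear not found), because transform evaluates all of its expressions against the original columns. mutate allows a newly created variable to be used in subsequent transformations within the same call. mutate is a helper function in the plyr package.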
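The difference can be checked on a small data frame. This is a minimal sketch with made-up years; it assumes that no object named cyear already exists in the workspace, since transform would otherwise pick that one up instead of failing.

```r
library(plyr)

# Toy data (hypothetical seasons for one player)
toy <- data.frame(year = c(1990, 1991, 1993))

# mutate evaluates its arguments in order, so cpercent can use cyear
ok <- mutate(toy,
             cyear = year - min(year),
             cpercent = cyear / (max(year) - min(year)))

# transform evaluates every expression against the original columns,
# so cyear does not exist yet when cpercent is computed
failed <- tryCatch(
  transform(toy,
            cyear = year - min(year),
            cpercent = cyear / (max(year) - min(year))),
  error = function(e) conditionMessage(e)
)
```

Here ok gains both new columns, while failed holds the error message raised by transform.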

Fitting a regression model for each player

### Fitting a regression model
model <- function(df) {
  lm(rbi / ab ~ cyear, data=df)
}

baseball <- subset(baseball, ab >= 25)

## Break up by player, fit a linear regression model to each piece,
## and return a list of models
bmodels <- dlply(baseball, .(id), model)

## Getting slope and intercept of the model
bcoefs <- ldply(bmodels, coef)
names(bcoefs)[2:3] <- c("intercept", "slope")

head(bcoefs)

##          id intercept      slope
## 1 aaronha01   0.18344  0.0001478
## 2 abernte02   0.00000         NA
## 3 adairje01   0.08599 -0.0007119
## 4 adamsba01   0.06025  0.0012002
## 5 adamsbo03   0.08675 -0.0019239
## 6 adcocjo01   0.14839  0.0027383
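An NA slope, like the one in the output above, is what lm reports when a slope cannot be estimated, for example when a group has only one row. Whether that is the cause for that particular player is an assumption; the sketch below only reproduces the effect on made-up data with the same dlply and ldply pattern.

```r
library(plyr)

# Toy data (hypothetical): p1 has three seasons, p2 has only one
toy <- data.frame(
  id    = c("p1", "p1", "p1", "p2"),
  cyear = c(0, 1, 2, 0),
  rate  = c(0.10, 0.12, 0.14, 0.05)
)

# Fit one model per player and collect the models in a list
models <- dlply(toy, .(id), function(piece) lm(rate ~ cyear, data = piece))

# Extract the coefficients into a data frame, one row per player
coefs <- ldply(models, coef)
names(coefs)[2:3] <- c("intercept", "slope")

# Players whose slope could not be estimated
no_slope <- coefs$id[is.na(coefs$slope)]
```

Filtering such players out before or after fitting is a design choice that depends on the analysis.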

Friday, October 3, 2014
