plyr: split-apply-combine strategy
The plyr package is a framework for the split-apply-combine strategy: split a larger dataset into subsets, apply a function to each subset, and combine the results. It provides a simpler alternative to the base R "apply" family of functions, which makes this kind of code easier and faster to write.
When to use plyr:
1. Calculate the mean of each subset
When not to use plyr:
1. Dynamic simulation
2. Overlapping data
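One common reading of "overlapping data" is a running (moving) average, where each observation belongs to several windows, so the data cannot be split into disjoint pieces. The following is a minimal base R sketch under that assumption; the vector x and the window width of 3 are made up for illustration.

```r
# Toy data (hypothetical values, not from the baseball dataset)
x <- c(1, 2, 3, 4, 5, 6)

# embed() builds one row per window of width 3; consecutive rows share
# two values, which is the overlap that a disjoint split cannot express
windows <- embed(x, 3)

# Mean of each overlapping window
running_mean <- rowMeans(windows)
print(running_mean)
```

Because every value except the first and last appears in more than one window, there is no grouping variable that ddply could split on here.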
Basic Format of plyr:
plyr function names follow the pattern **ply, where each * stands for one letter. The first letter gives the input data format and the second gives the output data format.
Data format notation:
d - data frame
l - list
a - array
_ - discard (output position only)
Examples:
ddply – input is data frame, output is data frame
dlply – input is data frame, output is list
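To make the naming pattern concrete, here is a small sketch that runs the same per-group function through ddply, dlply, and d_ply. The data frame scores and the helper piece_mean are hypothetical names invented for this illustration.

```r
library(plyr)

# Toy data frame (hypothetical values for illustration)
scores <- data.frame(group = c("a", "a", "b", "b"),
                     value = c(1, 3, 10, 20))

# Function applied to each piece: returns a one-row data frame
piece_mean <- function(df) data.frame(mean_value = mean(df$value))

# dd: data frame in, data frame out
as_df <- ddply(scores, .(group), piece_mean)

# dl: data frame in, list out (one element per group)
as_list <- dlply(scores, .(group), piece_mean)

# d_: data frame in, output discarded; useful only for side effects
d_ply(scores, .(group), function(df) {
  cat(as.character(df$group[1]), nrow(df), "\n")
})
```

The split and the applied function are the same in all three calls; only the second letter, and therefore the shape of the combined result, changes.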
Usage:
ddply(.data, .variables, .fun = NULL, ..., .progress = "none", .inform = FALSE, .drop = TRUE, .parallel = FALSE, .paropts = NULL)
The key arguments are:
- .data - dataset to be processed
- .variables - variables to split data frame by
- .fun - function to apply to each piece
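As a rough illustration of these arguments, the sketch below splits a toy data frame by one variable and passes an extra argument to .fun through `...`. The data frame sales and the function total are hypothetical, and the three spellings of .variables are shown only to illustrate that it accepts quoted variables, a character vector, or a formula.

```r
library(plyr)

# Toy data frame (hypothetical values for illustration)
sales <- data.frame(region = c("east", "east", "west", "west"),
                    amount = c(10, 30, 5, NA))

# .fun receives one piece (a data frame) at a time; extra arguments
# given after .fun are passed on to it through `...`
total <- function(df, na.rm = FALSE) {
  data.frame(total = sum(df$amount, na.rm = na.rm))
}

# Three ways to name the splitting variable
by_quoted  <- ddply(sales, .(region), total, na.rm = TRUE)
by_string  <- ddply(sales, "region", total, na.rm = TRUE)
by_formula <- ddply(sales, ~ region, total, na.rm = TRUE)

print(by_quoted)
```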
The plyr package comes with the baseball dataset, which holds yearly batting records of major league baseball players. I am using the baseball dataset to demonstrate the functionality of the plyr package.
library(plyr)
##
## Attaching package: 'plyr'
##
data(baseball)
## Finding the year-wise mean of the number of games, times at bat,
## runs, and hits (columns 6 to 9: g, ab, r, h)
yearWiseMean <- ddply(baseball, .(year), function(df) colMeans(df[, 6:9]))
head(yearWiseMean)
## year g ab r h
## 1 1871 28.00 135.9 33.57 42.14
## 2 1872 29.46 140.8 32.15 42.92
## 3 1873 46.31 217.6 48.46 68.54
## 4 1874 49.00 226.7 44.00 64.87
## 5 1875 57.82 256.9 47.18 73.29
## 6 1876 58.67 258.4 43.27 72.40
Using transform, summarise, and other helper functions such as mutate in plyr
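Suppose we are looking at the record of one player and want to add cyear, the number of years since the start of that player's career (starting at 1).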
## Looking at the record of one player
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth, cyear = year - min(year) + 1)
plyr makes it easier to do the same for all the players.
baseball_with_cyear <- ddply(baseball, .(id), transform, cyear = year - min(year) + 1)
## Summarizing the home runs by year and team across all players
homerun_summary_byTeam <- ddply(baseball, c("year", "team"), summarize,
                                homeruns = sum(hr))
head(homerun_summary_byTeam)
## year team homeruns
## 1 1871 CL1 4
## 2 1871 FW1 0
## 3 1871 NY2 1
## 4 1871 RC1 0
## 5 1871 TRO 2
## 6 1871 WS3 0
The plyr helper function mutate applies transformations iteratively, so a later expression can use a column created by an earlier one.
## Define a function to calculate the career year and the share of the career elapsed
calculate_cyear <- function(df) {
  mutate(df,
    cyear = year - min(year),
    cpercent = cyear / (max(year) - min(year))
  )
}
baseball <- ddply(baseball, .(id), calculate_cyear)
Using transform here instead would throw an error (object 'cyear' not found), because transform evaluates all of its expressions against the original columns. mutate evaluates them in order, so a newly created variable can be used in the transformations that follow it.
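The difference can be checked on a tiny example. This is a minimal sketch on a made-up data frame (seasons is a hypothetical name); the transform call is wrapped in tryCatch so that the error message is captured instead of stopping the script.

```r
library(plyr)

# Toy data frame (hypothetical values for illustration)
seasons <- data.frame(year = c(2001, 2002, 2003))

# mutate evaluates its arguments in order, so cpercent can use cyear
ok <- mutate(seasons,
             cyear = year - min(year),
             cpercent = cyear / (max(year) - min(year)))

# transform evaluates every expression against the original columns,
# so cyear is not yet available; the error is caught and stored
msg <- tryCatch({
  transform(seasons,
            cyear = year - min(year),
            cpercent = cyear / (max(year) - min(year)))
  "no error"
}, error = function(e) conditionMessage(e))

print(ok)
print(msg)
```

This assumes no object named cyear exists in the calling environment; if one did, transform would silently pick it up instead of failing.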
Fitting a regression model for each player
### Fitting a regression model
model <- function(df) {
  lm(rbi / ab ~ cyear, data = df)
}
baseball <- subset(baseball, ab >= 25)
## Break up by player, fit a linear regression model to each piece,
## and return a list of models
bmodels <- dlply(baseball, .(id), model)
## Getting slope and intercept of the model
bcoefs <- ldply(bmodels, coef)
names(bcoefs)[2:3] <- c("intercept", "slope")
head(bcoefs)
## id intercept slope
## 1 aaronha01 0.18344 0.0001478
## 2 abernte02 0.00000 NA
## 3 adairje01 0.08599 -0.0007119
## 4 adamsba01 0.06025 0.0012002
## 5 adamsbo03 0.08675 -0.0019239
## 6 adcocjo01 0.14839 0.0027383
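The same dlply and ldply pipeline can be tried on toy data where the answer is known in advance. In this sketch the data frame toy is invented: group "a" follows y = 1 + 2x exactly, and group "b" has a single row. An NA slope like the one for abernte02 above typically appears when a coefficient cannot be estimated, for example when all of a player's remaining rows share one cyear value; group "b" reproduces that situation.

```r
library(plyr)

# Toy data (hypothetical): group "a" lies exactly on y = 1 + 2x,
# group "b" has a single row, so its slope cannot be estimated
toy <- data.frame(id = c("a", "a", "a", "b"),
                  x  = c(0, 1, 2, 0),
                  y  = c(1, 3, 5, 4))

# Data frame in, list of fitted models out
fits <- dlply(toy, .(id), function(df) lm(y ~ x, data = df))

# List in, data frame of coefficients out
coefs <- ldply(fits, coef)
names(coefs)[2:3] <- c("intercept", "slope")
print(coefs)
```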