plyr: split-apply-combine strategy
1. Calculate the mean of a subset
2. Dynamic simulation
3. Overlapping data
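Basic format of plyr function names (first letter = input type, second letter = output type):
d - data frame
l - list
a - array
_ - discard (output only; the result is thrown away)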
ddply – input is data frame, output is data frame
dlply – input is data frame, output is list
Main arguments of ddply and dlply:
- .data - data frame to be processed
- .variables - variables to split the data frame by
- .fun - function to apply to each piece
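A minimal sketch of the naming convention on toy data. The `scores` data frame and the `group_mean` helper are made up for illustration and are not part of the baseball example.

```r
library(plyr)

# Toy data (hypothetical, not from the baseball dataset)
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Hypothetical helper applied to each piece: one-row data frame with the mean
group_mean <- function(df) data.frame(mean_value = mean(df$value))

# d -> d: data frame in, data frame out (one row per group)
as_df <- ddply(scores, .(group), group_mean)

# d -> l: data frame in, list out (one element per group)
as_list <- dlply(scores, .(group), group_mean)

# d -> _: data frame in, output discarded (useful for side effects only)
d_ply(scores, .(group), function(df) {
  cat(as.character(df$group[1]), nrow(df), "\n")
})
```

The same `.data`, `.variables`, `.fun` call shape is used in all three cases; only the two-letter prefix changes what comes back.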
library(plyr)
data(baseball)
## Finding the year-wise mean of games played, at-bats, runs, and hits
## (columns 6 to 9 of baseball: g, ab, r, h)
yearWiseMean <- ddply(baseball, .(year), function(df) colMeans(df[,6:9]))
head(yearWiseMean)
##   year     g    ab     r     h
## 1 1871 28.00 135.9 33.57 42.14
## 2 1872 29.46 140.8 32.15 42.92
## 3 1873 46.31 217.6 48.46 68.54
## 4 1874 49.00 226.7 44.00 64.87
## 5 1875 57.82 256.9 47.18 73.29
## 6 1876 58.67 258.4 43.27 72.40
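To make the split-apply-combine steps behind the `ddply` call explicit, here is a rough base R equivalent on toy data. The `games` data frame and its values are invented for illustration; this is a sketch of the idea, not how plyr is implemented internally.

```r
# Toy data (hypothetical values)
games <- data.frame(
  year = c(2001, 2001, 2002, 2002),
  g    = c(10, 20, 30, 50),
  ab   = c(40, 60, 100, 140)
)

# Split: one data frame per year
pieces <- split(games, games$year)

# Apply: column means of the selected columns in each piece
means <- lapply(pieces, function(df) colMeans(df[, c("g", "ab")]))

# Combine: stack the per-year vectors into one data frame
combined <- data.frame(
  year = as.numeric(names(means)),
  do.call(rbind, means),
  row.names = NULL
)
```

`ddply` does the three steps in one call and keeps the split variable as a column automatically.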
Using transform, summarise, and other helper functions such as mutate with plyr
## Looking at the record of one player
baberuth <- subset(baseball, id == "ruthba01")
baberuth <- transform(baberuth, cyear = year - min(year) + 1)
## Adding the career year for every player
baseball_with_cyear <- ddply(baseball, .(id), transform,
                             cyear = year - min(year) + 1)
## Summarizing home runs by year and team across all players
homerun_summary_byTeam <- ddply(baseball, c("year", "team"), summarize,
                                homeruns = sum(hr))
head(homerun_summary_byTeam)
##   year team homeruns
## 1 1871  CL1        4
## 2 1871  FW1        0
## 3 1871  NY2        1
## 4 1871  RC1        0
## 5 1871  TRO        2
## 6 1871  WS3        0
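The difference between the two helpers is easiest to see side by side. A small sketch on made-up data (`hits` is hypothetical): `transform` keeps every input row and adds a column, while `summarise` collapses each group to one row.

```r
library(plyr)

# Toy data (hypothetical)
hits <- data.frame(
  team = c("x", "x", "y"),
  hr   = c(2, 5, 1)
)

# transform: every row is kept, the group total is repeated on each row
with_total <- ddply(hits, .(team), transform, team_hr = sum(hr))

# summarise: one row per group, only the grouping and summary columns remain
totals <- ddply(hits, .(team), summarise, team_hr = sum(hr))
```

Here `with_total` has three rows and `totals` has two.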
## Define a function to calculate the career year played and
## the fraction of the career completed
calculate_cyear <- function(df) {
  mutate(df,
    cyear = year - min(year),
    cpercent = cyear / (max(year) - min(year))
  )
}
baseball <- ddply(baseball, .(id), calculate_cyear)
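`calculate_cyear` uses `mutate` rather than `transform` because `cpercent` refers to `cyear`, a column created in the same call. A short sketch on a made-up `career` data frame, assuming no variable named `cyear` exists in the calling environment:

```r
library(plyr)

# Toy data (hypothetical): one player's seasons
career <- data.frame(year = c(1990, 1991, 1994))

# mutate evaluates its arguments in order, so cpercent can use cyear
ok <- mutate(career,
  cyear    = year - min(year),
  cpercent = cyear / (max(year) - min(year))
)

# transform evaluates all arguments against the original columns,
# so the same call fails because cyear does not exist yet
failed <- inherits(
  try(transform(career,
    cyear    = year - min(year),
    cpercent = cyear / (max(year) - min(year))
  ), silent = TRUE),
  "try-error"
)
```

Note that a player with a single season gives `0 / 0`, so `cpercent` is `NaN` for that row.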
Fitting a regression model for each player
## Model: runs batted in per at-bat as a linear function of career year
model <- function(df) {
  lm(rbi / ab ~ cyear, data = df)
}
## Keep only player-seasons with at least 25 at-bats
baseball <- subset(baseball, ab >= 25)
## Break up by player, fit a linear regression model to each piece,
## and return a list of models
bmodels <- dlply(baseball, .(id), model)
## Getting slope and intercept of the model
bcoefs <- ldply(bmodels, coef)
names(bcoefs)[2:3] <- c("intercept", "slope")
head(bcoefs)
##          id intercept      slope
## 1 aaronha01   0.18344  0.0001478
## 2 abernte02   0.00000         NA
## 3 adairje01   0.08599 -0.0007119
## 4 adamsba01   0.06025  0.0012002
## 5 adamsbo03   0.08675 -0.0019239
## 6 adcocjo01   0.14839  0.0027383
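An `NA` slope like the one in the output above typically appears when a group has only one row (or a constant predictor), so `lm` cannot estimate a slope. A sketch on made-up data showing this, plus one possible way to record how many observations sit behind each fit. The `seasons` data, the `fit` helper, and the `n` column are hypothetical.

```r
library(plyr)

# Toy data (hypothetical): player "p2" has a single season
seasons <- data.frame(
  id    = c("p1", "p1", "p1", "p2"),
  cyear = c(0, 1, 2, 0),
  rate  = c(0.10, 0.12, 0.17, 0.20)
)

# Hypothetical per-player model, same shape as the model above
fit <- function(df) lm(rate ~ cyear, data = df)

models <- dlply(seasons, .(id), fit)
coefs  <- ldply(models, coef)
names(coefs)[2:3] <- c("intercept", "slope")

# Number of observations behind each fit, to flag unreliable slopes
coefs$n <- as.numeric(laply(models, nobs))
```

Filtering on `n` (or on `!is.na(slope)`) before comparing players avoids reading anything into fits that rest on a single season.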