Machine Learning and Data Mining in R: July 2014

Thursday, July 31, 2014

k-means clustering algorithm

k-means clustering is a popular method for partitioning a set of observations into k clusters such that each observation belongs to the cluster with the nearest mean. It is an unsupervised method, meaning it does not use class labels.
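The same idea can be written as an objective. This is a sketch of the usual formulation; the symbols S_i (the set of observations in cluster i) and \mu_i (the mean of cluster i) are notation assumed here, not taken from the post.

```latex
\operatorname*{arg\,min}_{S_1,\dots,S_k} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```

The double sum is the within-cluster sum of squares: the total squared distance from each observation to the mean of its own cluster.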

Algorithm:

The most common k-means algorithm starts from k initial centroids and repeats these two steps until convergence, that is, until the assignments no longer change:
Assignment step: Assign each observation to the cluster whose centroid (mean) is nearest in squared Euclidean distance. For the current centroids, this is the assignment that gives the least within-cluster sum of squares.

Update step: Recompute each centroid as the mean of the observations currently assigned to its cluster.
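To make the two steps concrete, here is a minimal from-scratch sketch in R. It is not how the built-in kmeans() function is implemented; the function name simple_kmeans and its arguments are hypothetical, and it assumes numeric features and random observations as starting centroids.

```r
# Minimal k-means sketch (Lloyd-style iteration); simple_kmeans is a hypothetical name.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Assumption: start from k observations chosen at random
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Converged when no assignment changes
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Update step: each centroid becomes the mean of its assigned points
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Usage on R's built-in copy of the iris measurements (columns 1 to 4)
set.seed(42)
fit <- simple_kmeans(datasets::iris[, 1:4], 3)
table(fit$cluster)
```

A cluster that ends up empty keeps its previous centroid in this sketch; a fuller implementation would usually re-seed it instead.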

I used the iris data set from the UCI Machine Learning Repository to try out the k-means algorithm in R. Here is the R code.

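# Load the iris data; assumes a CSV with a header row (species, petal_l, petal_w, ...)
iris <- read.csv("iris.csv")
# Open the data in a viewer (interactive sessions only)
View(iris)
iris.features <- iris
# Remove the class label so that clustering uses only the measurements
iris.features$species <- NULL
# Fix the seed so the random starting centroids are reproducible
set.seed(1)
km <- kmeans(iris.features, centers = 3)
print(km)
# Plot the clusters on all pairs of features
plot(iris.features, col = km$cluster)
# Plot the clusters on petal length and petal width
plot(iris[c("petal_l", "petal_w")], col = km$cluster)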
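Since the species labels were removed before clustering, one way to check the result is to cross-tabulate the clusters against the labels. The sketch below uses R's built-in copy of iris (column name Species) instead of the CSV so that it runs on its own; the nstart = 25 setting is my assumption, chosen to make the result less dependent on the random start.

```r
# Cluster the four measurements from the built-in iris data
set.seed(42)
features <- datasets::iris[, 1:4]
km_check <- kmeans(features, centers = 3, nstart = 25)

# Rows are the true species, columns are the cluster numbers
confusion <- table(species = datasets::iris$Species, cluster = km_check$cluster)
print(confusion)

# Total within-cluster sum of squares, the quantity k-means tries to minimize
print(km_check$tot.withinss)
```

Cluster numbers are arbitrary, so a good result shows each species concentrated in one column rather than on the diagonal.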

The cluster plot is shown below:

Monday, July 21, 2014

Why R?

I started working with R recently. What I like most about R is that it is open source, and it has many contributed packages for various domains. I am still exploring the capabilities and limitations of R for machine learning and data mining. R comes with many statistical and machine learning tools. I ran the k-means clustering algorithm on a sample dataset, and the script was short and easy to write. I still have to see how the k-means algorithm in R scales to larger datasets.
