Cluster Analysis with Iris Dataset

Data Science Day 19:

In supervised learning, we specify the possible categorical values and train models to recognize patterns. However, *what if we don’t have existing classified data to learn from?*

When we model data to discover how it clusters based on certain attributes, without any labels to learn from, that is unsupervised learning.

Cluster analysis is one of the unsupervised techniques: rather than learning by example, it learns by observation.

In general, there are three types of clustering methods: partitioning, hierarchical, and density-based.

1. Partitioning: n objects are grouped into k ≤ n disjoint clusters.
   Partitioning methods are based on a distance measure: they apply iterative relocation until some distance-based error metric is minimized.

2. Hierarchical: clusters are either combined (agglomerative) or split (divisive) in a stepwise fashion, based on some measure (distance, density, or continuity).

   Agglomerative clustering starts with each point in its own cluster and combines them in steps; divisive clustering starts with all the data in one cluster and divides it up.

3. Density-based: clusters are formed from regions of high point density, so density serves as the measure of cluster “goodness”.
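To make the "iterative relocation" idea behind partitioning concrete, here is a minimal sketch of k-means in plain NumPy. The function names `kmeans_sketch` and `assign`, the random initialization, and the fixed seed are my own illustrative choices, not something from the notes above; scikit-learn's `KMeans` is more careful about initialization and convergence.

```python
import numpy as np
from sklearn import datasets


def assign(points, centers):
    """Return, for each point, the index of its nearest center."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Minimal k-means: assign points, relocate centers, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an assumption)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        # Relocation step: move each center to the mean of its points;
        # an empty cluster keeps its previous center
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = assign(points, centers)
    # Distance-based error metric: sum of squared distances to assigned centers
    sse = float(((points - centers[labels]) ** 2).sum())
    return labels, centers, sse


x = datasets.load_iris().data
labels, centers, sse = kmeans_sketch(x, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
print("sum of squared errors:", round(sse, 2))
```

The loop stops as soon as the centers no longer move, which is the point where further relocation cannot reduce the error any more for that starting position.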

Example with Iris Dataset

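1. Partitioning: K-Means with k=3

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

# Iris dataset
iris = datasets.load_iris()
x = iris.data
y = iris.target

# Fit K-Means with 3 clusters and take the cluster label of each sample
# (n_init and random_state are illustrative choices for reproducibility)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
labels = kmeans.labels_

# Plotting: three of the four features in a 3D scatter, colored by cluster
fig = plt.figure(1, figsize=(7, 7))
ax = fig.add_subplot(projection="3d")
ax.view_init(elev=48, azim=134)
ax.scatter(x[:, 3], x[:, 0], x[:, 2],
           c=labels.astype(float), edgecolor="k", s=50)
ax.set_xlabel("Petal width")
ax.set_ylabel("Sepal length")
ax.set_zlabel("Petal length")
plt.title("Iris Clustering K-Means (k=3)", fontsize=14)
plt.show()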
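Since the Iris dataset also ships with the true species in `iris.target`, one quick way to see how the clusters line up with them is a small contingency table. This is only a sketch: `n_init` and `random_state` are my own illustrative settings, and the cluster ids K-Means assigns are arbitrary, so only the pattern of counts matters, not which column a species lands in.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()
x, y = iris.data, iris.target

# n_init and random_state are illustrative choices, not values from the post
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(x)

# Rows: true species, columns: K-Means cluster ids (arbitrary numbering)
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (y, labels), 1)

for name, row in zip(iris.target_names, table):
    print(f"{name:<12}{row}")
```

A row with a single non-zero entry means that species fell entirely into one cluster; rows spread over two columns show where the clustering mixes species.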

2. Hierarchical

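import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hierarchical (agglomerative) clustering with Ward linkage on the Iris features x
hier = linkage(x, "ward")
max_d = 7.08  # cut-off distance, drawn as a horizontal line below
plt.figure(figsize=(25, 10))
plt.title('Iris Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
    hier,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=50,                   # number of leaves kept in the truncated plot
    leaf_rotation=90.,      # rotate the leaf labels
    leaf_font_size=8.,      # font size of the leaf labels
)
plt.axhline(y=max_d, c='k')
plt.show()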
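The horizontal line at `max_d` is only drawn on the plot; to actually get cluster labels from that cut, SciPy's `fcluster` can be used. A minimal sketch, assuming the same Ward linkage and the same cut-off height as above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn import datasets

x = datasets.load_iris().data
hier = linkage(x, "ward")
max_d = 7.08  # same cut-off height as the horizontal line in the dendrogram

# Merges above max_d are ignored; what remains are the flat clusters
flat = fcluster(hier, t=max_d, criterion="distance")  # labels start at 1

print("number of clusters:", flat.max())
print("cluster sizes:", np.bincount(flat)[1:])
```

Moving `max_d` up or down changes how many branches the line crosses, and therefore how many flat clusters come out.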

3. Density-based: DBSCAN

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# DBSCAN with the scikit-learn defaults (eps=0.5, min_samples=5)
dbscan = DBSCAN()
dbscan.fit(x)

# Project the four features onto two principal components, for plotting only
pca = PCA(n_components=2).fit(x)
pca_2d = pca.transform(x)

# One scatter call per label; DBSCAN marks noise points with the label -1
styles = {0: ('r', '+', 'Cluster 1'),
          1: ('g', 'o', 'Cluster 2'),
          -1: ('b', '*', 'Noise')}
for label, (color, marker, name) in styles.items():
    mask = dbscan.labels_ == label
    plt.scatter(pca_2d[mask, 0], pca_2d[mask, 1],
                c=color, marker=marker, label=name)

plt.legend()
plt.title('DBSCAN finds 2 clusters and Noise')
plt.show()
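The plot title can also be checked directly from the labels, without plotting. A small sketch, assuming the same default `DBSCAN()` settings as above:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

x = datasets.load_iris().data
labels = DBSCAN().fit(x).labels_  # defaults: eps=0.5, min_samples=5

# Cluster ids are 0, 1, ...; the label -1 is reserved for noise
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print("clusters:", n_clusters, "noise points:", n_noise)
```

Unlike K-Means, the number of clusters is not passed in; it falls out of `eps` and `min_samples`, so changing either one can change both counts.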

Many thanks to Dr. Rumbaugh for the clustering analysis notes!

Happy studying! 😊
