Displaying a clustering with CLUSPLOT
Introduction
There are two main types of clustering methods. Hierarchical methods construct a dendrogram, a tree whose leaves are the data objects and whose branches can be seen as clusters. A partitioning method, on the other hand, divides the data into k nonoverlapping clusters, such that objects of the same cluster are close to each other and objects of different clusters are dissimilar.
The output of a partitioning method is simply a list of clusters and their objects, which may be difficult to interpret. It would therefore be useful to have a graphical display which describes the objects with their interrelations, and at the same time shows the clusters. This would allow us to picture the size and shape of the clusters, as well as their relative position. Following a suggestion on p. 318 of (Kaufman and Rousseeuw, 1990, henceforth KR) we will construct such a display, called CLUSPLOT, and an algorithm for its implementation.
For instance, let us consider the bivariate data set of Ruspini (1970), which contains n=75 objects, and partition it into k=4 clusters. For this we have used the Partitioning Around Medoids (PAM) method of [KR], but other clustering methods can of course be applied as well. CLUSPLOT then uses the resulting partition, together with the original data, to produce Fig. 1. The ellipses are based on the average and the covariance matrix of each cluster, and their size is such that they contain all the points of their cluster. This explains why there is always an object on the boundary of each ellipse. It is also possible to draw the spanning ellipse of each cluster, i.e. the smallest ellipse that covers all its objects. The spanning ellipse can be computed with the algorithm of Titterington (1976).
To get an idea of the distances between the clusters, we can draw segments of the lines between the cluster centers. In the plot, the shading intensity is proportional to the density of the cluster, i.e. its number of objects divided by the area of the ellipse.
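This construction can be illustrated with a small Python sketch (our own illustration and naming, not the authors' S-PLUS code): the classical covariance ellipse of a cluster is inflated until the farthest member lies on its boundary, which is why one object always touches the ellipse, and the shading density is the number of objects divided by the ellipse area.

```python
import math

def cover_ellipse(points):
    """Mean, covariance and scale of the smallest classical ellipse
    {x : (x - m)' S^{-1} (x - m) <= c} containing all cluster points:
    c is the largest squared Mahalanobis distance within the cluster,
    so one object always lies on the boundary."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # 2x2 sample covariance matrix of the cluster
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det  # inverse covariance
    c = max(ixx * (x - mx) ** 2 + 2 * ixy * (x - mx) * (y - my)
            + iyy * (y - my) ** 2 for x, y in points)
    return (mx, my), (sxx, sxy, syy), c

def density(points, cov, c):
    """Shading intensity: number of objects divided by the ellipse area,
    which is pi * c * sqrt(det S) for the ellipse above."""
    sxx, sxy, syy = cov
    return len(points) / (math.pi * c * math.sqrt(sxx * syy - sxy * sxy))
```

For the toy cluster {(0,0), (2,0), (1,1), (1,-1)} this gives c = 1.5 and a density of 4/π.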
For higher-dimensional data sets we apply a dimension reduction technique before constructing the plot, as described in Section 2. Section 3 concentrates on dissimilarity data, where we will represent the objects as bivariate points by means of multidimensional scaling. Section 4 describes the implementation of CLUSPLOT in S-PLUS, with the available options. Section 5 formulates some conclusions and several proposals for further extensions.
Higher-dimensional data
Let us take a data set of n objects measured on p numerical variables. We can reduce the dimension of the data by principal component analysis (PCA), which yields a first component with maximal variance, then a second component with maximal variance among all components perpendicular to the first, and so on. The principal components lie in the directions of the eigenvectors of a scatter matrix, which can be the classical covariance matrix or the corresponding correlation matrix. Another possibility
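The PCA step can be sketched in Python (a plain illustration under our own naming, not the princomp routine used in the actual implementation): power iteration with deflation extracts the two leading eigenvectors of the sample covariance matrix, and the centered data are projected onto them.

```python
import math

def pca_2d(X, iters=500):
    """Project the rows of X (n objects, p variables) onto the first two
    principal components, obtained by power iteration with deflation on
    the sample covariance matrix."""
    n, p = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - mean[j] for j in range(p)] for row in X]
    # p x p sample covariance matrix
    S = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
         for a in range(p)]

    def top_eigvec(M):
        # all-ones start works unless it is orthogonal to the leading eigenvector
        v = [1.0] * p
        for _ in range(iters):
            w = [sum(M[a][b] * v[b] for b in range(p)) for a in range(p)]
            nrm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / nrm for x in w]
        lam = sum(v[a] * sum(M[a][b] * v[b] for b in range(p)) for a in range(p))
        return v, lam

    v1, l1 = top_eigvec(S)
    # deflate: subtract the first eigencomponent, then find the second
    S2 = [[S[a][b] - l1 * v1[a] * v1[b] for b in range(p)] for a in range(p)]
    v2, _ = top_eigvec(S2)
    return [(sum(r[j] * v1[j] for j in range(p)),
             sum(r[j] * v2[j] for j in range(p))) for r in Xc]
```

The bivariate points returned by this function are what CLUSPLOT would then display, with the cluster ellipses drawn in the reduced space.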
Dissimilarity data
When the data set consists of inter-object dissimilarities (together forming an n by n matrix D), another method will be used to obtain a two-dimensional plot of the n objects.
Dissimilarities are nonnegative numbers d(i,j) that are small when i and j are ‘near’ to each other and that become large when i and j are very different. They have two properties:
- d(i,i) = 0,
- d(i,j) = d(j,i).
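The multidimensional scaling step for such a matrix D can be sketched as follows (a Python illustration of the standard classical/Torgerson procedure, with our own naming, not the cmdscale routine of the actual implementation): the squared dissimilarities are double-centered and the two leading eigenpairs serve as planar coordinates.

```python
import math

def cmdscale_2d(D, iters=500):
    """Classical (Torgerson) multidimensional scaling: embed n objects with
    dissimilarity matrix D as points in the plane whose Euclidean distances
    approximate the d(i,j)."""
    n = len(D)
    # double centering: B = -1/2 * J * D^2 * J  with  J = I - 11'/n
    D2 = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    rm = [sum(row) / n for row in D2]   # row means
    gm = sum(rm) / n                    # grand mean
    B = [[-0.5 * (D2[i][j] - rm[i] - rm[j] + gm) for j in range(n)]
         for i in range(n)]

    def top_eig(M):
        # non-constant start: constant vectors lie in the null space of B
        v = [float(i + 1) for i in range(n)]
        for _ in range(iters):
            w = [sum(M[a][b] * v[b] for b in range(n)) for a in range(n)]
            nrm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / nrm for x in w]
        lam = sum(v[a] * sum(M[a][b] * v[b] for b in range(n)) for a in range(n))
        return v, lam

    v1, l1 = top_eig(B)
    B2 = [[B[a][b] - l1 * v1[a] * v1[b] for b in range(n)] for a in range(n)]
    v2, l2 = top_eig(B2)
    s1, s2 = math.sqrt(max(l1, 0.0)), math.sqrt(max(l2, 0.0))
    return [(s1 * v1[i], s2 * v2[i]) for i in range(n)]
```

For three objects on a line with d(1,2)=1, d(2,3)=2 and d(1,3)=3, the recovered planar points reproduce these distances exactly, with the second coordinate vanishing.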
Implementation and availability
We have implemented CLUSPLOT as an S-PLUS function, taking full advantage of the powerful statistical, numerical and graphical tools available in the S-PLUS environment. In particular, we could use the intrinsic functions princomp for PCA and cmdscale for MDS. Moreover, S-PLUS has incorporated several clustering algorithms, including (since version 3.4) the functions daisy, pam, fanny and clara implemented by (Struyf et al., 1997; henceforth SHR).
The S-PLUS call to clusplot is of the form
Conclusions and outlook
We have developed a new S-PLUS function clusplot providing a bivariate graphical representation of a clustering partition. Earlier graphical tools include the distance plot (Chen et al., 1974) and the silhouette plot (Rousseeuw, 1987). An important advantage of the clusplot is that it shows both the objects and the clusters. The clusplot can also be seen as a generalized and automated version of the taxometric map (Carmichael and Sneath, 1969), which showed the clusters but not the objects.
References (16)
- Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math.
- Ruspini, E.H., 1970. Numerical methods for fuzzy clustering. Inform. Sci.
- et al., 1978. Development and construct validation of a set of student rating-of-instruction items. Educational and Psychological Measurement.
- Carmichael and Sneath, 1969. Taxometric maps. Systematic Zoology.
- Chen et al., 1974. Statistical methods for grouping corporations. Sankhyā Ser. B.
- A new convex hull algorithm for planar sets. ACM Trans. Math. Software (1977).
- Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, Part II, ...
- Harman, H.H., 1967. Modern Factor Analysis. University of Chicago Press, ...