Displaying a clustering with CLUSPLOT

https://doi.org/10.1016/S0167-9473(98)00102-9

Abstract

In a bivariate data set it is easy to represent clusters, e.g. by manually circling them or separating them by lines. But many data sets have more than two variables, or they come in the form of inter-object dissimilarities. There exist methods to partition such a data set into clusters, but the resulting partition is not visual by itself. In this paper we construct a new graphical display called CLUSPLOT, in which the objects are represented as points in a bivariate plot and the clusters as ellipses of various sizes and shapes. The algorithm is implemented as an S-PLUS function. Several options are available, e.g. labelling of objects and clusters, drawing lines connecting clusters, and the use of color. We illustrate this new tool with several examples.

Introduction

There are two main types of clustering methods. The hierarchical methods construct a dendrogram, a tree whose leaves are the data objects and whose branches can be seen as clusters. On the other hand, a partitioning method divides the data into k nonoverlapping clusters, so that objects of the same cluster are close to each other and objects of different clusters are dissimilar.

The output of a partitioning method is simply a list of clusters and their objects, which may be difficult to interpret. It would therefore be useful to have a graphical display which describes the objects with their interrelations, and at the same time shows the clusters. This would allow us to picture the size and shape of the clusters, as well as their relative position. Following a suggestion on p. 318 of (Kaufman and Rousseeuw, 1990, henceforth KR) we will construct such a display, called CLUSPLOT, and an algorithm for its implementation.

For instance, let us consider the bivariate data set of Ruspini (1970), which contains n=75 objects, and partition it into k=4 clusters. For this we have used the Partitioning Around Medoids (PAM) method of [KR], but other clustering methods can of course be applied as well. CLUSPLOT then uses the resulting partition, together with the original data, to produce Fig. 1. The ellipses are based on the average and the covariance matrix of each cluster, and their size is such that they contain all the points of their cluster. This explains why there is always an object on the boundary of each ellipse. It is also possible to draw the spanning ellipse of each cluster, i.e. the smallest ellipse that covers all its objects. The spanning ellipse can be computed with the algorithm of Titterington (1976).

To get an idea of the distances between the clusters, we can draw line segments between the cluster centers. In the plot, the shading intensity is proportional to the density of the cluster, i.e. its number of objects divided by the area of the ellipse.
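The construction of such a covering ellipse can be sketched as follows (a minimal NumPy illustration, not the paper's S-PLUS code): the ellipse {x : (x−m)ᵀS⁻¹(x−m) ≤ d²} around the cluster mean m with covariance S is inflated until d² equals the largest squared Mahalanobis distance within the cluster, so the farthest object lies exactly on the boundary; the area πd²√det(S) then yields the density used for shading.

```python
import numpy as np

def covering_ellipse(points):
    """Mean/covariance ellipse scaled to contain every point of a cluster.

    Returns the center m, covariance S, the squared Mahalanobis radius d2
    of the farthest point (which therefore lies on the boundary), and the
    cluster density = n / ellipse area.
    """
    pts = np.asarray(points, dtype=float)
    m = pts.mean(axis=0)
    S = np.cov(pts, rowvar=False)
    diff = pts - m
    # squared Mahalanobis distance of each point from the center
    d2_all = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    d2 = d2_all.max()
    area = np.pi * d2 * np.sqrt(np.linalg.det(S))  # area of a 2-D ellipse
    density = len(pts) / area
    return m, S, d2, density
```

The same quantities drive the plot: d2 fixes the ellipse size, and density governs the shading intensity described above.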

For higher-dimensional data sets we apply a dimension reduction technique before constructing the plot, as described in Section 2. Section 3 concentrates on dissimilarity data, where we will represent the objects as bivariate points by means of multidimensional scaling. Section 4 describes the implementation of CLUSPLOT in S-PLUS, with the available options. Section 5 formulates some conclusions and several proposals for further extensions.

Section snippets

Higher-dimensional data

Let us take a p-dimensional data set X={(xi1,xi2,…,xip);i=1,…,n}. We can reduce the dimension of the data by principal component analysis (PCA), which yields a first component with maximal variance, then a second component with maximal variance among all components perpendicular to the first, and so on. The principal components lie in the directions of the eigenvectors of a scatter matrix, which can be the classical covariance matrix or the corresponding correlation matrix. Another possibility
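This reduction can be sketched in a few lines (a NumPy stand-in for the S-PLUS princomp function used in the paper): center the data, take the eigenvectors of the classical covariance matrix sorted by decreasing eigenvalue, and project each object onto the first two.

```python
import numpy as np

def first_two_components(X):
    """Project an n x p data matrix onto its first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    S = np.cov(Xc, rowvar=False)             # classical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    return Xc @ eigvecs[:, order[:2]]        # n x 2 scores for the plot
```

Using the correlation matrix instead of the covariance matrix amounts to standardizing each variable before this projection.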

Dissimilarity data

When the data set consists of inter-object dissimilarities (together forming an n by n matrix D), another method will be used to obtain a two-dimensional plot of the n objects.

Dissimilarities are nonnegative numbers d(i,j) that are small when i and j are ‘near’ to each other and that become large when i and j are very different. They have two properties:

  • d(i,i)=0,

  • d(i,j)=d(j,i).

Dissimilarities between two objects can be computed in different ways, e.g. when the data contain nominal or ordinal
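The multidimensional scaling step used for such dissimilarity data can be sketched as follows (a NumPy stand-in for S-PLUS's cmdscale, assuming the standard double-centering of squared dissimilarities): the two leading eigenvectors of the centered matrix give bivariate coordinates whose Euclidean distances approximate the given d(i,j).

```python
import numpy as np

def classical_mds(D, ndim=2):
    """Classical (metric) MDS: embed an n x n dissimilarity matrix in ndim dimensions."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:ndim]   # keep the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale           # n x ndim coordinates
```

When the dissimilarities are exact Euclidean distances of some point configuration, this embedding recovers those distances; otherwise it gives the best two-dimensional approximation in the classical-scaling sense.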

Implementation and availability

We have implemented CLUSPLOT as an S-PLUS function, taking full advantage of the powerful statistical, numerical and graphical tools available in the S-PLUS environment. In particular, we could use the intrinsic functions princomp for PCA and cmdscale for MDS. Moreover, S-PLUS has incorporated several clustering algorithms, including (since version 3.4) the functions daisy, pam, fanny and clara implemented by (Struyf et al., 1997; henceforth SHR).

The S-PLUS call to clusplot is of the form

Conclusions and outlook

We have developed a new S-PLUS function clusplot providing a bivariate graphical representation of a clustering partition. Earlier graphical tools include the distance plot (Chen et al., 1974) and the silhouette plot (Rousseeuw, 1987). An important advantage of the clusplot is that it shows both the objects and the clusters. The clusplot can also be seen as a generalized and automated version of the taxometric map (Carmichael and Sneath, 1969), which showed the clusters but not the objects.

References (16)

  • P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. (1987)
  • E.H. Ruspini, Numerical methods for fuzzy clustering, Inform. Sci. (1970)
  • R.D. Abbott et al., Development and construct validation of a set of student rating-of-instruction items, Educational and Psychological Measurement (1978)
  • J.W. Carmichael et al., Taxometric maps, Systematic Zoology (1969)
  • H. Chen et al., Statistical methods for grouping corporations, Sankhyā Ser. B (1974)
  • W.F. Eddy, A new convex hull algorithm for planar sets, ACM Trans. Math. Software (1977)
  • Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, Part II,...
  • Harman, H.H., 1967. Modern Factor Analysis. University of Chicago Press,...
There are more references available in the full text version of this article.

Cited by (86)
