Displaying a clustering with CLUSPLOT

https://doi.org/10.1016/S0167-9473(98)00102-9

Abstract

In a bivariate data set it is easy to represent clusters, e.g. by manually circling them or separating them by lines. But many data sets have more than two variables, or they come in the form of inter-object dissimilarities. There exist methods to partition such a data set into clusters, but the resulting partition is not visual by itself. In this paper we construct a new graphical display called CLUSPLOT, in which the objects are represented as points in a bivariate plot and the clusters as ellipses of various sizes and shapes. The algorithm is implemented as an S-PLUS function. Several options are available, e.g. labelling of objects and clusters, drawing lines connecting clusters, and the use of color. We illustrate this new tool with several examples.

Introduction

There are two main types of clustering methods. The hierarchical methods construct a dendrogram, a tree whose leaves are the data objects and whose branches can be seen as clusters. On the other hand, a partitioning method divides the data into k nonoverlapping clusters, so that objects of the same cluster are close to each other and objects of different clusters are dissimilar.

The output of a partitioning method is simply a list of clusters and their objects, which may be difficult to interpret. It would therefore be useful to have a graphical display which describes the objects with their interrelations, and at the same time shows the clusters. This would allow us to picture the size and shape of the clusters, as well as their relative position. Following a suggestion on p. 318 of (Kaufman and Rousseeuw, 1990, henceforth KR) we will construct such a display, called CLUSPLOT, and an algorithm for its implementation.

For instance, let us consider the bivariate data set of Ruspini (1970), which contains n=75 objects, and partition it into k=4 clusters. For this we have used the Partitioning Around Medoids (PAM) method of [KR], but other clustering methods can of course be applied as well. CLUSPLOT then uses the resulting partition, together with the original data, to produce Fig. 1. The ellipses are based on the average and the covariance matrix of each cluster, and their size is such that they contain all the points of their cluster. This explains why there is always an object on the boundary of each ellipse. It is also possible to draw the spanning ellipse of each cluster, i.e. the smallest ellipse that covers all its objects. The spanning ellipse can be computed with the algorithm of Titterington (1976).

To get an idea of the distances between the clusters, we can draw line segments between the cluster centers. In the plot, the shading intensity is proportional to the density of the cluster, i.e. its number of objects divided by the area of the ellipse.
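The construction of such a covering ellipse can be sketched as follows (a minimal NumPy illustration, not the paper's S-PLUS code): the ellipse {x : (x−m)ᵀS⁻¹(x−m) ≤ d²} around the cluster mean m with covariance S is inflated until d² equals the largest squared Mahalanobis distance within the cluster, so the farthest object lies exactly on the boundary; the area πd²√det(S) then yields the density used for shading.

```python
import numpy as np

def covering_ellipse(points):
    """Mean/covariance ellipse scaled to contain every point of a cluster.

    Returns the center m, covariance S, the squared Mahalanobis radius d2
    of the farthest point (which therefore lies on the boundary), and the
    cluster density = n / ellipse area.
    """
    pts = np.asarray(points, dtype=float)
    m = pts.mean(axis=0)
    S = np.cov(pts, rowvar=False)
    diff = pts - m
    # squared Mahalanobis distance of each point from the center
    d2_all = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff)
    d2 = d2_all.max()
    area = np.pi * d2 * np.sqrt(np.linalg.det(S))  # area of a 2-D ellipse
    density = len(pts) / area
    return m, S, d2, density
```

The same quantities drive the plot: d2 fixes the ellipse size, and density governs the shading intensity described above.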

For higher-dimensional data sets we apply a dimension reduction technique before constructing the plot, as described in Section 2. Section 3 concentrates on dissimilarity data, where we will represent the objects as bivariate points by means of multidimensional scaling. Section 4 describes the implementation of CLUSPLOT in S-PLUS, with the available options. Section 5 formulates some conclusions and several proposals for further extensions.

Section snippets

Higher-dimensional data

Let us take a p-dimensional data set X={(xi1,xi2,…,xip);i=1,…,n}. We can reduce the dimension of the data by principal component analysis (PCA), which yields a first component with maximal variance, then a second component with maximal variance among all components perpendicular to the first, and so on. The principal components lie in the directions of the eigenvectors of a scatter matrix, which can be the classical covariance matrix or the corresponding correlation matrix. Another possibility
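This reduction can be sketched in a few lines (a NumPy stand-in for the S-PLUS princomp function used in the paper): center the data, take the eigenvectors of the classical covariance matrix sorted by decreasing eigenvalue, and project each object onto the first two.

```python
import numpy as np

def first_two_components(X):
    """Project an n x p data matrix onto its first two principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    S = np.cov(Xc, rowvar=False)             # classical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    return Xc @ eigvecs[:, order[:2]]        # n x 2 scores for the plot
```

Using the correlation matrix instead of the covariance matrix amounts to standardizing each variable before this projection.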

Dissimilarity data

When the data set consists of inter-object dissimilarities (together forming an n by n matrix D), another method will be used to obtain a two-dimensional plot of the n objects.

Dissimilarities are nonnegative numbers d(i,j) that are small when i and j are ‘near’ to each other and that become large when i and j are very different. They have two properties:

  • d(i,i)=0,

  • d(i,j)=d(j,i).

Dissimilarities between two objects can be computed in different ways, e.g. when the data contain nominal or ordinal
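The multidimensional scaling step used for such dissimilarity data can be sketched as follows (a NumPy stand-in for S-PLUS's cmdscale, assuming the standard double-centering of squared dissimilarities): the two leading eigenvectors of the centered matrix give bivariate coordinates whose Euclidean distances approximate the given d(i,j).

```python
import numpy as np

def classical_mds(D, ndim=2):
    """Classical (metric) MDS: embed an n x n dissimilarity matrix in ndim dimensions."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:ndim]   # keep the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[order], 0, None))
    return eigvecs[:, order] * scale           # n x ndim coordinates
```

When the dissimilarities are exact Euclidean distances of some point configuration, this embedding recovers those distances; otherwise it gives the best two-dimensional approximation in the classical-scaling sense.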

Implementation and availability

We have implemented CLUSPLOT as an S-PLUS function, taking full advantage of the powerful statistical, numerical and graphical tools available in the S-PLUS environment. In particular, we could use the intrinsic functions princomp for PCA and cmdscale for MDS. Moreover, S-PLUS has incorporated several clustering algorithms, including (since version 3.4) the functions daisy, pam, fanny and clara implemented by (Struyf et al., 1997; henceforth SHR).

The S-PLUS call to clusplot is of the form

Conclusions and outlook

We have developed a new S-PLUS function clusplot providing a bivariate graphical representation of a clustering partition. Earlier graphical tools include the distance plot (Chen et al., 1974) and the silhouette plot (Rousseeuw, 1987). An important advantage of the clusplot is that it shows both the objects and the clusters. The clusplot can also be seen as a generalized and automated version of the taxometric map (Carmichael and Sneath, 1969), which showed the clusters but not the objects.

References (16)

  • P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. (1987)
  • E.H. Ruspini, Numerical methods for fuzzy clustering, Inform. Sci. (1970)
  • R.D. Abbott et al., Development and construct validation of a set of student rating-of-instruction items, Educational and Psychological Measurement (1978)
  • J.W. Carmichael et al., Taxometric maps, Systematic Zoology (1969)
  • H. Chen et al., Statistical methods for grouping corporations, Sankhyā Ser. B (1974)
  • W.F. Eddy, A new convex hull algorithm for planar sets, ACM Trans. Math. Software (1977)
  • Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenics 7, Part II,...
  • Harman, H.H., 1967. Modern Factor Analysis. University of Chicago Press,...
There are more references available in the full text version of this article.

Cited by (86)
