Title: | Selection of Number of Clusters via Normalized Clustering Instability |
---|---|
Description: | Selection of the number of clusters in cluster analysis using stability methods. |
Authors: | Jonas M. B. Haslbeck [aut, cre], Dirk U. Wulff [aut] |
Maintainer: | Jonas M. B. Haslbeck <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2-4 |
Built: | 2024-11-07 03:25:35 UTC |
Source: | https://github.com/jmbh/cstab |
Selection of number of clusters via gap statistic, jump statistic, and slope statistic
cDistance( data, kseq, method = "kmeans", linkage = "complete", kmIter = 10, gapIter = 10 )
data |
a n x p data matrix of type numeric. |
kseq |
a vector of candidate numbers of clusters, each k > 1. |
method |
character string indicating the clustering algorithm. 'kmeans' for the k-means algorithm, 'hierarchical' for hierarchical clustering. |
linkage |
character specifying the linkage criterion, in case method = 'hierarchical'. |
kmIter |
integer specifying the number of restarts of the k-means algorithm in order to avoid local minima. |
gapIter |
integer specifying the number of simulated datasets to compute the gap statistic (see Tibshirani et al., 2001). |
a list with the optimal number of clusters determined by the gap statistic
(Tibshirani et al., 2001), the jump statistic (Sugar & James, 2011) and the
slope statistic (Fujita et al., 2014). In addition, the function returns the gap,
jump and slope statistics for each k in kseq.
Dirk U. Wulff <[email protected]> Jonas M. B. Haslbeck <[email protected]>
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
Sugar, C. A., & James, G. M. (2011). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.
Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.
## Not run:
# Generate Data from Gaussian Mixture
s <- .1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
plot(data)
# Selection of Number of Clusters using Distance-based Measures
cDistance(data, kseq=2:10)
## End(Not run)
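The gap statistic compares the within-cluster dispersion of the data with that of reference datasets that have no cluster structure. The sketch below illustrates that idea in base R only. It is not the package's implementation: the helper names (within_ss, reference_sample, gap_sketch) are hypothetical, and drawing reference data uniformly over the range of each column is assumed here for simplicity (it is one of the options described by Tibshirani et al., 2001).

```r
set.seed(1)

# Same Gaussian mixture as in the example above: four tight clusters
s <- 0.1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))

# Pooled within-cluster sum of squares of a k-means solution (hypothetical helper)
within_ss <- function(x, k, restarts = 10) {
  kmeans(x, centers = k, nstart = restarts)$tot.withinss
}

# Reference data: uniform over the observed range of each column (assumption)
reference_sample <- function(x) {
  apply(x, 2, function(col) runif(length(col), min(col), max(col)))
}

# Gap(k) = mean over reference sets of log(W*_k) minus log(W_k)
gap_sketch <- function(x, kseq, n_ref = 10) {
  log_w <- sapply(kseq, function(k) log(within_ss(x, k)))
  log_w_ref <- matrix(replicate(n_ref, {
    ref <- reference_sample(x)
    sapply(kseq, function(k) log(within_ss(ref, k)))
  }), nrow = length(kseq))               # rows: k, columns: reference sets
  gap <- rowMeans(log_w_ref) - log_w
  # Simulation error: population sd of log(W*_k) times sqrt(1 + 1 / n_ref)
  sd_k <- apply(log_w_ref, 1, function(v) sqrt(mean((v - mean(v))^2)))
  list(gap = gap, se = sd_k * sqrt(1 + 1 / n_ref))
}

kseq <- 2:6
res <- gap_sketch(data, kseq)
print(round(res$gap, 3))
```

Under the rule of Tibshirani et al. (2001), the selected k is the smallest one for which gap(k) >= gap(k + 1) - se(k + 1).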
An example two-dimensional dataset containing 100 points from each of five bivariate normal distributions arranged equidistantly along the outline of a circle.
cluster_example
An object of class matrix (inherits from array) with 500 rows and 2 columns.
To inspect the data, run plot(cluster_example).
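A dataset with the same layout can be simulated in base R. The following is only a sketch: the circle radius, the standard deviation, and the seed are assumptions chosen for illustration, not the values used to create cluster_example.

```r
set.seed(42)

n_per  <- 100   # points per component (as described above)
k      <- 5     # number of bivariate normal components (as described above)
radius <- 1     # assumed circle radius
spread <- 0.1   # assumed standard deviation in each dimension

# Component centers placed equidistantly on the circle
angles  <- 2 * pi * (seq_len(k) - 1) / k
centers <- cbind(radius * cos(angles), radius * sin(angles))

# Draw n_per points around each center and stack them into one matrix
sim <- do.call(rbind, lapply(seq_len(k), function(i) {
  cbind(rnorm(n_per, centers[i, 1], spread),
        rnorm(n_per, centers[i, 2], spread))
}))

dim(sim)
plot(sim)
```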
Selection of number of clusters via model-based or model-free, normalized or unnormalized clustering instability.
cStability( data, kseq = 2:20, nB = 10, norm = TRUE, predict = TRUE, method = "kmeans", linkage = "complete", kmIter = 5, pbar = TRUE )
data |
a n x p data matrix of type numeric. |
kseq |
a vector of candidate numbers of clusters, each k > 1. |
nB |
an integer specifying the number of bootstrap comparisons. |
norm |
logical specifying whether the instability path should be normalized. If TRUE, the instability path is normalized, accounting for the trivial decrease in instability due to an increasing k (see Haslbeck & Wulff, 2016). |
predict |
boolean specifying whether the model-based or the model-free variant should be used (see Haslbeck & Wulff, 2016). |
method |
character string specifying the clustering algorithm. 'kmeans' for the k-means algorithm, 'hierarchical' for hierarchical clustering. |
linkage |
character specifying the linkage criterion, in case method = 'hierarchical'. |
kmIter |
integer specifying the number of restarts of the k-means algorithm in order to avoid local minima. |
pbar |
logical |
a list that contains the optimal k selected by the unnormalized and normalized instability method. It also includes a vector containing the averaged instability path (over bootstrap samples), k-wise confidence intervals around these paths and a matrix containing the instability path of each bootstrap sample for both the normalized and the unnormalized method.
Dirk U. Wulff <[email protected]> Jonas M. B. Haslbeck <[email protected]>
Ben-Hur, A., Elisseeff, A., & Guyon, I. (2001). A stability based method for discovering structure in clustered data. Pacific symposium on biocomputing, 7, 6-17.
Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528.
## Not run:
# Generate Data from Gaussian Mixture
s <- .1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
plot(data)
# Selection of Number of Clusters using Instability-based Measures
stab_obj <- cStability(data, kseq=2:10)
print(stab_obj)
## End(Not run)
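The idea behind clustering instability can be illustrated without the package. The sketch below computes an unnormalized, model-free instability for each k: two bootstrap samples are clustered separately, and the clusterings are compared on the observations that occur in both samples by counting the pairs that are grouped together in one solution but not in the other. This is an illustration under stated assumptions, not the code used by cStability; the function names pair_disagreement and instability_sketch are hypothetical, and the normalization step is omitted.

```r
set.seed(1)

# Same Gaussian mixture as in the example above
s <- 0.1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))

# Proportion of object pairs on which two label vectors disagree about
# "same cluster" versus "different cluster" (hypothetical helper)
pair_disagreement <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  up <- upper.tri(same_a)
  mean(same_a[up] != same_b[up])
}

# Average disagreement over n_boot pairs of bootstrap samples
instability_sketch <- function(x, k, n_boot = 10, restarts = 5) {
  n_obs <- nrow(x)
  mean(replicate(n_boot, {
    i1 <- sample.int(n_obs, replace = TRUE)
    i2 <- sample.int(n_obs, replace = TRUE)
    l1 <- kmeans(x[i1, , drop = FALSE], centers = k, nstart = restarts)$cluster
    l2 <- kmeans(x[i2, , drop = FALSE], centers = k, nstart = restarts)$cluster
    shared <- intersect(i1, i2)          # observations present in both samples
    pair_disagreement(l1[match(shared, i1)], l2[match(shared, i2)])
  }))
}

kseq <- 2:6
instab <- sapply(kseq, function(k) instability_sketch(data, k))
names(instab) <- kseq
print(round(instab, 3))
```

The k with the lowest instability would be the candidate choice in this sketch. The arguments n_boot and restarts play roles similar to nB and kmIter above.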
!Deprecated! Selection of number of clusters via model-based or model-free, normalized or unnormalized clustering instability.
cStability_mEst( data, kseq = 2:20, nB = 10, norm = TRUE, predict = TRUE, method = "kmeans", linkage = "complete", kmIter = 5, pbar = TRUE )
data |
a n x p data matrix of type numeric. |
kseq |
a vector of candidate numbers of clusters, each k > 1. |
nB |
an integer specifying the number of bootstrap comparisons. |
norm |
logical specifying whether the instability path should be normalized. If TRUE, the instability path is normalized, accounting for the trivial decrease in instability due to an increasing k (see Haslbeck & Wulff, 2016). |
predict |
boolean specifying whether the model-based or the model-free variant should be used (see Haslbeck & Wulff, 2016). |
method |
character string specifying the clustering algorithm. 'kmeans' for the k-means algorithm, 'hierarchical' for hierarchical clustering. |
linkage |
character specifying the linkage criterion, in case method = 'hierarchical'. |
kmIter |
integer specifying the number of restarts of the k-means algorithm in order to avoid local minima. |
pbar |
logical |
a list that contains the optimal k selected by the unnormalized and normalized instability method. It also includes a vector containing the averaged instability path (over bootstrap samples) and a matrix containing the instability path of each bootstrap sample for both the normalized and the unnormalized method.
Dirk U. Wulff <[email protected]> Jonas M. B. Haslbeck <[email protected]>
Ben-Hur, A., Elisseeff, A., & Guyon, I. (2001). A stability based method for discovering structure in clustered data. Pacific symposium on biocomputing, 7, 6-17.
Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528.
## Not run:
# Generate Data from Gaussian Mixture
s <- .1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
plot(data)
# Selection of Number of Clusters using Instability-based Measures
stab_obj <- cStability(data, kseq=2:10)
print(stab_obj)
## End(Not run)
!Deprecated! Selection of number of clusters via model-based or model-free, normalized or unnormalized clustering instability.
cStability_orig( data, kseq = 2:20, nB = 10, norm = TRUE, predict = TRUE, method = "kmeans", linkage = "complete", kmIter = 5, pbar = TRUE )
data |
a n x p data matrix of type numeric. |
kseq |
a vector of candidate numbers of clusters, each k > 1. |
nB |
an integer specifying the number of bootstrap comparisons. |
norm |
logical specifying whether the instability path should be normalized. If TRUE, the instability path is normalized, accounting for the trivial decrease in instability due to an increasing k (see Haslbeck & Wulff, 2016). |
predict |
boolean specifying whether the model-based or the model-free variant should be used (see Haslbeck & Wulff, 2016). |
method |
character string specifying the clustering algorithm. 'kmeans' for the k-means algorithm, 'hierarchical' for hierarchical clustering. |
linkage |
character specifying the linkage criterion, in case method = 'hierarchical'. |
kmIter |
integer specifying the number of restarts of the k-means algorithm in order to avoid local minima. |
pbar |
logical |
a list that contains the optimal k selected by the unnormalized and normalized instability method. It also includes a vector containing the averaged instability path (over bootstrap samples) and a matrix containing the instability path of each bootstrap sample for both the normalized and the unnormalized method.
Dirk U. Wulff <[email protected]> Jonas M. B. Haslbeck <[email protected]>
Ben-Hur, A., Elisseeff, A., & Guyon, I. (2001). A stability based method for discovering structure in clustered data. Pacific symposium on biocomputing, 7, 6-17.
Tibshirani, R., & Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3), 511-528.
## Not run:
# Generate Data from Gaussian Mixture
s <- .1
n <- 50
data <- rbind(cbind(rnorm(n, 0, s), rnorm(n, 0, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 0, s), rnorm(n, 1, s)),
              cbind(rnorm(n, 1, s), rnorm(n, 0, s)))
plot(data)
# Selection of Number of Clusters using Instability-based Measures
stab_obj <- cStability(data, kseq=2:10)
print(stab_obj)
## End(Not run)
Create lookup table for factorials
lookup(n = 10000L, root = 200)
n |
integer specifying the number of |
root |
numeric specifying the root used to avoid machine limit. |
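Why a root helps can be seen from the range of double-precision numbers: 170! is still representable, but 171! overflows to Inf. Storing the root-th root of each factorial keeps the table finite. The sketch below is one way to build such a table in base R, assuming that the table holds (i!)^(1/root) for i = 1, ..., n; how lookup computes and stores its values internally is not documented here, so treat the variable names and the use of lgamma as assumptions.

```r
n    <- 10000L
root <- 200

# Plain factorials hit the machine limit quickly
is.finite(factorial(170))   # 170! fits into a double
is.finite(factorial(171))   # 171! exceeds the double range

# (i!)^(1/root) computed on the log scale: exp(log(i!) / root),
# where log(i!) = lgamma(i + 1)
tab <- exp(lgamma(seq_len(n) + 1) / root)

all(is.finite(tab))         # every entry is representable
tab[5]^root                 # recovers 5! = 120 up to rounding error
```

Ratios of factorials can then be formed from the rooted values and raised to the power root afterwards, which avoids creating an intermediate Inf.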
plot.cstab plots the instability path.
## S3 method for class 'cstab'
plot(x, ...)
x |
a cstab object (output of functions |
... |
additional arguments passed to print. |
Jonas M. B. Haslbeck <[email protected]> Dirk U. Wulff <[email protected]>
print.cstab prints key variables of cstab objects.
## S3 method for class 'cstab'
print(x, ...)
x |
a cstab object (output of functions |
... |
additional arguments passed to print. |
Jonas M. B. Haslbeck <[email protected]> Dirk U. Wulff <[email protected]>