| Title: | An Extension of the Gap Statistic for Ordinal/Categorical Data |
|---|---|
| Description: | The gap statistic approach is extended to estimate the number of clusters for categorical response format data. This approach and the accompanying software are designed to be used with the output of any clustering algorithm and with distances specifically designed for categorical (i.e., multiple-choice) or ordinal survey response data. |
| Authors: | Jeffrey Miecznikowski [aut], Eduardo Cortes [aut, cre] (ORCID: <https://orcid.org/0000-0002-0966-6488>) |
| Maintainer: | Eduardo Cortes <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.2 |
| Built: | 2026-06-05 06:49:14 UTC |
| Source: | https://github.com/ecortesgomez/discretegapstatistic |
Computes pairwise Bhattacharyya distance between rows.
| Argument | Description |
|---|---|
| x | n x p character matrix. |
| offset | Small offset for log(0*0) cases. |
Bhattacharyya Distance
Distance matrix between rows.
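The manual does not show how the distance is built from a character matrix, so the following base R sketch is only one plausible reading. It assumes each row is summarized by its category proportions and that the offset is added to the Bhattacharyya coefficient before taking the logarithm. The helper name `bhattacharyya_sketch` is hypothetical and not part of the package.

```r
# Hypothetical helper (not the package's implementation).
# Assumption: each row is reduced to its vector of category proportions.
bhattacharyya_sketch <- function(x, offset = 1e-7) {
  lev <- sort(unique(as.vector(x)))
  n <- nrow(x)
  # P[i, ] holds the proportion of each category within row i
  P <- matrix(0, n, length(lev))
  for (i in seq_len(n)) {
    P[i, ] <- tabulate(match(x[i, ], lev), nbins = length(lev)) / ncol(x)
  }
  D <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      bc <- sum(sqrt(P[i, ] * P[j, ]))      # Bhattacharyya coefficient
      D[i, j] <- -log(min(1, bc + offset))  # offset keeps log() finite when bc == 0
    }
  }
  as.dist(D)
}

x <- rbind(c("a", "a", "b", "b"),
           c("a", "a", "b", "b"),
           c("c", "c", "c", "c"))
bhattacharyya_sketch(x)
```

Rows with identical category profiles get distance 0, while rows with no shared categories get the finite value `-log(offset)` instead of `Inf`.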
Computes pairwise Chi-square distance between rows.
| Argument | Description |
|---|---|
| x | n x p character matrix. |
Chi-square Distance
Distance matrix between rows.
Based on the implementation of the function found in the cluster R package.
clusGapDiscr(x, clusterFUN, K.max, B = nrow(x), value.range = "DS", verbose = interactive(), distName = "hamming", useLog = TRUE, dataClass = "nom", offset = 1e-07, ...)
| Argument | Description |
|---|---|
| x | A matrix object specifying category attributes in the columns and observations in the rows. |
| clusterFUN | Character string with one of the available clustering implementations. Available options are: 'pam' (default) from |
| K.max | Integer. Maximum number of clusters. |
| B | Number of bootstrap samples. By default |
| value.range | A length-1 character string, a character vector, or a list of character vectors whose length matches the number of columns (nQ) of the array. By default, value.range = 'DS' (Data Support null model). Alternatively, a vector with all categories (character for nominal data, integer for ordinal data) to consider when bootstrapping the null distribution sample (KS: Known Support option). If a list of category vectors is provided, its length must equal the number of columns of the input array, and the order of the list elements corresponds to the array's columns. |
| verbose | Integer or logical. Determines whether progress output should be printed while running. By default, one bit is printed per bootstrap sample. |
| distName | String. Name of categorical distance to apply. Available distances: 'bhattacharyya', 'chisquare', 'cramerV', 'hamming' and 'hellinger'. |
| useLog | Logical. Use log function after estimating |
| dataClass | Character. Either 'nom' for nominal or 'ord' for ordinal. |
| offset | Numeric. A small constant value added to W.k to avoid NAs when running |
| ... | Optional further arguments for |
A matrix with K.max rows and 4 columns, named "logW", "E.logW", "gap", and "SE.sim",
where gap = E.logW - logW, and SE.sim corresponds to the standard error of gap.
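To make the four returned columns concrete, here is a minimal base R sketch that assembles such a matrix from an observed log(W_k) vector and a matrix of bootstrap values. The helper `gap_table_sketch` and the simulated inputs are hypothetical. The `sqrt(1 + 1/B)` inflation of the standard error follows `cluster::clusGap`, and it is an assumption that `clusGapDiscr` uses the same factor.

```r
# Hypothetical helper: builds the logW / E.logW / gap / SE.sim table.
# logW:      length-K vector of observed log(W_k)
# logW_boot: B x K matrix of log(W_k) from B null (bootstrap) samples
gap_table_sketch <- function(logW, logW_boot) {
  B <- nrow(logW_boot)
  E.logW <- colMeans(logW_boot)
  # Assumption: same SE formula as cluster::clusGap
  SE.sim <- sqrt((1 + 1 / B) * apply(logW_boot, 2, var))
  cbind(logW = logW, E.logW = E.logW, gap = E.logW - logW, SE.sim = SE.sim)
}

set.seed(42)
logW <- c(5.0, 4.2, 4.0, 3.9)  # made-up observed values for K = 1..4
logW_boot <- matrix(
  rnorm(50 * 4, mean = rep(c(5.1, 4.8, 4.6, 4.5), each = 50), sd = 0.05),
  nrow = 50
)
tab <- gap_table_sketch(logW, logW_boot)
tab
```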
Based on the implementation of the function found in the cluster R package.
This function assumes that all attributes have identical categories.
clusGapDiscr0(x, FUNcluster, K.max, B = nrow(x), value.range = "DS", verbose = interactive(), distName = "hamming", useLog = TRUE, Input2Alg = "distMatr", dataClass = "nom", offset = 0, ...)
| Argument | Description |
|---|---|
| x | A matrix object specifying category attributes in the columns and observations in the rows. |
| FUNcluster | A function that accepts as first argument a matrix like |
| K.max | Integer. Maximum number of clusters. |
| B | Number of bootstrap samples. By default |
| value.range | A length-1 character string, a character vector, or a list of character vectors whose length matches the number of columns (nQ) of the array. By default, value.range = 'DS' (Data Support null model). Alternatively, a vector with all categories (character for nominal data, integer for ordinal data) to consider when bootstrapping the null distribution sample (KS: Known Support option). If a list of category vectors is provided, its length must equal the number of columns of the input array, and the order of the list elements corresponds to the array's columns. |
| verbose | Integer or logical. Determines whether progress output should be printed while running. By default, one bit is printed per bootstrap sample. |
| distName | String. Name of categorical distance to apply. Available distances: 'bhattacharyya', 'chisquare', 'cramerV', 'hamming' and 'hellinger'. |
| useLog | Logical. Use log function after estimating |
| Input2Alg | Specifies the kind of input provided to the algorithm function in |
| dataClass | Character. Either 'nom' for nominal or 'ord' for ordinal. |
| offset | Numeric. A small constant value added to W.k to avoid NAs when running |
| ... | Optional further arguments for |
A matrix with K.max rows and 4 columns, named "logW", "E.logW", "gap", and "SE.sim",
where gap = E.logW - logW, and SE.sim corresponds to the standard error of gap.
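The difference between the 'DS' (Data Support) and 'KS' (Known Support) settings of value.range can be illustrated with a small base R sketch of drawing one null sample. The helper `null_sample_sketch` is hypothetical, and drawing categories uniformly from the support is an assumption, since the manual does not state the sampling distribution.

```r
# Hypothetical helper: draws one null (reference) sample with the shape of x.
# Assumption: categories are drawn uniformly from each column's support.
null_sample_sketch <- function(x, value.range = "DS") {
  n <- nrow(x)
  out <- x
  for (j in seq_len(ncol(x))) {
    support <- if (identical(value.range, "DS")) {
      unique(x[, j])        # DS: only the categories observed in column j
    } else if (is.list(value.range)) {
      value.range[[j]]      # KS: one category vector per column
    } else {
      value.range           # KS: a single category vector shared by all columns
    }
    # Index-based sampling avoids sample()'s special case for length-1 numerics
    out[, j] <- support[sample.int(length(support), n, replace = TRUE)]
  }
  out
}

set.seed(1)
x <- matrix(sample(c("a", "b", "c"), 40, replace = TRUE), nrow = 10)
z_ds <- null_sample_sketch(x)                                 # data support
z_ks <- null_sample_sketch(x, value.range = c("a", "b", "c", "d"))  # known support
```

Under the known-support option the null sample may contain categories (here "d") that never occur in the observed data.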
Generates formatted clustering functions that can be plugged into the
clusGapDiscr function, allowing a wide variety of clustering algorithms to be run.
clusterFunSel(clustFun)
| Argument | Description |
|---|---|
| clustFun | A character string with the following possible options: 'pam' (default) from |
An object of class kmodes as found in the klaR package.
An additional component specifies the categorical distance function found in distFun.
A data frame with 109 observations and 21 questions. Severity rating recorded as categorical responses from c1 (none) to c7 (severe).
concussion
data.frame
Headache
Nausea
Balance problems
Dizziness
Fatigue
Sleeping more than usual
Drowsiness
Sensitivity to light
Sensitivity to noise
Irritability
Sadness
Nervousness/Anxiousness
Feeling more emotional
Feeling slowed down
Feeling mentally foggy
Difficulty concentrating
Difficulty remembering
Visual problems
Confusion
Feeling clumsy
Answering more slowly
High-level wrapper that (i) saves a distance heatmap, (ii) runs
clusGapDiscr, (iii) saves a Gap Statistic plot (with the chosen
K marked), (iv) saves the resulting categorical heatmap, and (v) attempts
to save an MDS plot colored by the chosen clusters.
DGSrun(x, catVals, dataClass, clusterFUN = "pam", B = 100, K.max = 7, value.range = "DS", distName = "hamming", useLog = TRUE, title = NULL, outDir = "./")
| Argument | Description |
|---|---|
| x | A matrix of categorical observations. Must be a matrix (enforced). |
| catVals | Vector of possible category values for the variables in |
| dataClass | Data class indicator passed to |
| clusterFUN | Character name of a clustering algorithm (default |
| B | Integer number of bootstrap samples for the gap statistic. |
| K.max | Maximum number of clusters considered. |
| value.range | Character string passed to |
| distName | Character name of the distance metric (default |
| useLog | Logical; passed to |
| title | Optional title prefix used in plot titles and output filenames. |
| outDir | Output directory where PNG files will be written. |
This function includes a device "safety rail" that closes any graphics devices
opened during its execution (useful if downstream plotting code forgets to call
dev.off()).
The object returned by clusGapDiscr.
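The device "safety rail" described above can be sketched in base R with `dev.list()` and `on.exit()`. This is an illustration of the general technique under the assumption that devices are compared before and after the call. The wrapper name `with_device_guard` is hypothetical and is not how DGSrun is necessarily written.

```r
# Hypothetical wrapper: closes any graphics device opened while expr runs.
with_device_guard <- function(expr) {
  before <- grDevices::dev.list()  # devices already open (NULL if none)
  on.exit({
    opened <- setdiff(grDevices::dev.list(), before)
    for (d in opened) grDevices::dev.off(d)  # close only devices opened inside
  }, add = TRUE)
  force(expr)  # expr is evaluated lazily here, after 'before' was recorded
}

devices_before <- grDevices::dev.list()
with_device_guard({
  grDevices::pdf(NULL)  # open a device and "forget" to call dev.off()
  plot(1:3)
})
devices_after <- grDevices::dev.list()
```

Devices that were open before the call are left untouched, so the guard does not interfere with a user's existing plotting session.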
Sample-to-sample heatmap that clusters samples according to a given categorical distance.
Exploratory tool that helps to visualize/cluster blocks of observations across
columns ordered according to given categorical distance. The final output is
a clustered distance matrix.
This plot is intended to help the DiscreteClusGap user judge which
type of categorical distance best suits the input data.
sample2sampleHeat is based on the pheatmap function from the pheatmap
R package. Thus, any parameter found in pheatmap can be specified to sample2sampleHeat.
distanceHeat(x, distName, clustering_method = "complete", border_color = NA, ...)
| Argument | Description |
|---|---|
| x | matrix object or data.frame |
| distName | Name of categorical distance to apply. Available distances: 'bhattacharyya', 'chisquare', 'cramerV', 'hamming' and 'hellinger'. |
| clustering_method | string; clustering method used by pheatmap |
| border_color | string; color of cell borders. By default, border_color = NA, and no borders are shown. |
| ... | other valid arguments of the pheatmap function |
clustered heatmap
Dispatcher for discrete distance functions (nominal + ordinal).
distancematrix(X, d)
| Argument | Description |
|---|---|
| X | Matrix where rows are observations and columns are discrete features. |
| d | Character scalar naming the distance. |
An object of class dist.
    X <- rbind(matrix(paste0("a", rpois(7*5, 1)), nrow=5),
               matrix(paste0("a", rpois(7*5, 3)), nrow=5))
    distancematrix(X = X, d = "hellinger")
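For intuition about what a row-wise categorical distance returned as a dist object looks like, here is a minimal base R sketch of the Hamming distance, counted as the number of mismatching columns between two rows. The helper `hamming_sketch` is hypothetical, and the package's own 'hamming' option may differ, for example by normalizing by the number of columns.

```r
# Hypothetical helper: Hamming distance between rows of a categorical matrix.
# Assumption: the distance is the raw count of mismatching columns.
hamming_sketch <- function(X) {
  n <- nrow(X)
  D <- matrix(0, n, n, dimnames = list(rownames(X), rownames(X)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      D[i, j] <- sum(X[i, ] != X[j, ])  # number of columns where rows differ
    }
  }
  as.dist(D)  # same class as the documented return value
}

X_small <- rbind(c("a", "b", "c"),
                 c("a", "b", "d"),
                 c("x", "y", "z"))
d_small <- hamming_sketch(X_small)
d_small
```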
Similar to the maxSE function found in the cluster package.
findK(cG_obj, meth = "Tibs2001SEmax", SE.fact = 1)
| Argument | Description |
|---|---|
| cG_obj | Output object obtained from |
| meth | Method to use to determine optimal k number of clusters. |
| SE.fact | Standard Error Factor generalizing the 1-SE rule. |
A numerical value from 1 to K.max, contained in the input cG_obj object.
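The default rule name, "Tibs2001SEmax", matches a method of `cluster::maxSE`: pick the smallest k whose gap is at least the next gap minus SE.fact times the next standard error. The base R sketch below implements that rule on plain vectors. The helper `find_k_sketch` and the example numbers are hypothetical, and it is an assumption that findK applies the rule in exactly this form.

```r
# Hypothetical helper: smallest k with gap(k) >= gap(k+1) - SE.fact * SE.sim(k+1).
# Falls back to the largest k when no k satisfies the rule.
find_k_sketch <- function(gap, SE.sim, SE.fact = 1) {
  K <- length(gap)
  ok <- gap[-K] >= gap[-1] - SE.fact * SE.sim[-1]
  if (any(ok)) which.max(ok) else K
}

gap <- c(0.10, 0.50, 0.45, 0.60)  # made-up gap values for k = 1..4
se  <- rep(0.05, 4)
k_hat <- find_k_sketch(gap, se)

# A steadily increasing gap curve never satisfies the rule, so K.max is returned
k_increasing <- find_k_sketch(c(0.1, 0.3, 0.6), rep(0.05, 3))
```

Increasing SE.fact makes the rule more conservative, so it tends to select a smaller number of clusters.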
K-modes function that accepts any categorical distance, based on
the function klaR::kmodes.
kmodesD(data, modes, distFun, iter.max = 10)
| Argument | Description |
|---|---|
| data | A matrix or data frame of categorical data. Objects have to be in rows, variables in columns. |
| modes | The number of modes |
| distFun | Pairwise categorical distance function. A function accepting two categorical vectors. |
| iter.max | The maximum number of iterations allowed. |
An object of class kmodes as found in the klaR package.
An additional component specifies the categorical distance function found in distFun.
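To show how a pairwise distFun fits into k-modes, the sketch below runs a single assignment-and-update iteration in base R. All names (`hamming_pair`, `col_mode`, `kmodes_step_sketch`) are hypothetical, and the sketch takes a matrix of initial modes for simplicity. It illustrates the general k-modes idea and is not the code of kmodesD or klaR::kmodes.

```r
# Hypothetical pairwise distance: a function of two categorical vectors
hamming_pair <- function(a, b) sum(a != b)

# Most frequent category of a vector (ties resolved by table order)
col_mode <- function(v) names(which.max(table(v)))

# One k-modes iteration: assign rows to the nearest mode, then update modes
kmodes_step_sketch <- function(data, modes, distFun) {
  cl <- vapply(seq_len(nrow(data)), function(i) {
    d <- vapply(seq_len(nrow(modes)),
                function(k) distFun(data[i, ], modes[k, ]), numeric(1))
    which.min(d)
  }, integer(1))
  for (k in seq_len(nrow(modes))) {
    members <- data[cl == k, , drop = FALSE]
    # Empty clusters keep their previous mode
    if (nrow(members) > 0) modes[k, ] <- apply(members, 2, col_mode)
  }
  list(cluster = cl, modes = modes)
}

dat <- rbind(c("a", "a", "a"),
             c("a", "a", "b"),
             c("c", "c", "c"),
             c("c", "d", "c"))
res <- kmodes_step_sketch(dat, modes = dat[c(1, 3), ], distFun = hamming_pair)
res$cluster
```

A full algorithm would repeat this step until the assignments stop changing or an iteration cap such as iter.max is reached.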
Heatmap representation summarizing categorical/likert data.
Modified version of likert.heat.plot from likert package.
Does not allow different categorical ranges across questions.
The function outputs a ggplot object where additional layers can be added for customization purposes.
The output plot preserves the question order given by columns of x.
likert.heat.plot2(x, allLevels, low.color = "white", high.color = "blue", text.color = "black", text.size = 4, textLen = 50)
| Argument | Description |
|---|---|
| x | matrix object or data.frame with categorical data. Columns are questions and rows are observations. |
| allLevels | vector with all categorical (ordered) levels. |
| low.color | string; name of color assigned to the first level found in |
| high.color | string; name of color assigned to the last level found in |
| text.color | string; text color of numbers within cells. |
| text.size | numeric; text size for numbers within cells. |
| textLen | numeric; maximum text length for question labels (column names). |
ggplot object.
Data extracted from the likert R package.
Results from an administration of the Math Anxiety Scale Survey.
The first column records student gender, either Female or Male.
All statement answers have 5 possible ordinal categorical items:
Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.
mass
data.frame
Gender
Math interesting
Uptight with math tests
Use math in the future
Mind goes blank in math tests
Math relates to own life
Worry about ability math problem solving
Sinking feeling doing math problems
Math is challenging
Nervousness with math
Take more math classes
Uneasy feeling with math
Favorite subject is math
Enjoy learning math
Confused with math
https://rdrr.io/cran/likert/man/mass.html
Function to visualize the distribution and spread of categorical data in two dimensions starting from the distance matrix. A cluster assignment vector needs to be provided to show color coded points.
plotMDS2(x, cl, type = "MDS", cols = NULL, dotSize = 3, LabTitle = type, outDir = NULL, filename = NULL, addRowNames = FALSE, labSize = 3, out = "plot")
| Argument | Description |
|---|---|
| x | dist object |
| cl | Character vector with clustering assignments. Important: the vector should match, in number and order (and if possible names), the observations of the data matrix used to generate the distance object. |
| type | Character string specifying the class of MDS: either classic |
| cols | Character vector of colors to use with clustering labels. Cluster names should match the ones provided in |
| dotSize | Numeric value specifying the point size. |
| LabTitle | Character string for the plot's title. By default |
| outDir | Character string with the directory path to save the output file. |
| filename | Character string with the name of the output file. |
| addRowNames | Logical. Single value indicating whether to place observation names next to points. The labels used are the names found in |
| labSize | Numeric. Single value indicating the size of labels. |
| out | Character string specifying the output to obtain: |
A PNG file and, depending on the out option, a data.frame with coordinates and labels,
a list with related MDS results, or a metaMDS object.
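As a rough picture of the classical MDS case, the base R sketch below embeds a categorical distance in two dimensions with `stats::cmdscale` and attaches cluster labels. The simulated data, the Hamming distance, and the hierarchical clustering used for the labels are all assumptions made for illustration, and plotMDS2 itself may compute and draw things differently.

```r
set.seed(1)
# Simulated categorical matrix: 12 observations, 5 questions (illustrative only)
X <- matrix(sample(letters[1:3], 60, replace = TRUE), nrow = 12)
rownames(X) <- paste0("s", seq_len(nrow(X)))

# Hamming distance between rows, stored as a dist object
n <- nrow(X)
D <- matrix(0, n, n, dimnames = list(rownames(X), rownames(X)))
for (i in seq_len(n)) for (j in seq_len(n)) D[i, j] <- sum(X[i, ] != X[j, ])
D <- as.dist(D)

# Cluster labels in the same order as the rows of X (here from hclust)
cl <- paste0("C", cutree(hclust(D), k = 2))

# Classical (metric) MDS in two dimensions
xy <- cmdscale(D, k = 2)
mds_df <- data.frame(Dim1 = xy[, 1], Dim2 = xy[, 2], cluster = cl,
                     row.names = rownames(X))

# Draw the color-coded scatter only in interactive sessions
if (interactive()) {
  plot(mds_df$Dim1, mds_df$Dim2, col = factor(mds_df$cluster), pch = 19,
       xlab = "Dim 1", ylab = "Dim 2", main = "MDS")
}
```

The label vector must follow the row order of the data used to build the distance, otherwise the colors are assigned to the wrong points.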
Heatmap assuming a given distance function and a known number of clusters.
Function to display a categorical data matrix given a user defined number of
clusters nCl, a categorical distance distName and a predefined clustering
method FUNcluster.
The output displays a heatmap separating and color-labelling resulting
clusters vertically in the rows and allowing unsupervised clustering on
questions in the columns. Each cell is colored according to the categorical
values provided or found in the data.
The clustergram is based on the pheatmap function from the pheatmap R package.
Thus, any parameter found in pheatmap can be specified to clusGapDiscrHeat.
This function can be used to examine the number of clusters before running
clusGapDiscrHeat, as well as after the number of clusters has been determined.
ResHeatmap(x, nCl, distName, catVals, clusterFUN, out = "heatmap", seed = NULL, clusterNames = NULL, prefObs = NULL, rowNames = rownames(x), filename = NULL, outDir = NULL, height = 10, width = 6)
| Argument | Description |
|---|---|
| x | matrix object |
| nCl | number of clusters to plot; if |
| distName | Name of categorical distance to apply. Available distances: Check available list. |
| catVals | vector with (ordered) categorical values. |
| clusterFUN | Character string with one of the available clustering implementations. Available options are: 'pam' (default) from |
| out | Specifies the desired output between "heatmap" (default; produce a heatmap), "clusters" (return a |
| seed | Seed number. |
| clusterNames | Either |
| prefObs | character vector of length 1 with a prefix for the observations, in case they come unlabelled or the user wants to anonymize sample IDs. |
| rowNames | character vector with names of rows according to |
| filename | character string with the name of the output file |
| outDir | character string with the directory path to save the output file |
| height | numeric height of output plot in inches |
| width | numeric width of output plot in inches |
png file or ComplexHeatmap object
Convenience wrapper around ResHeatmap that extracts cluster labels
(optionally re-ordered to match the input data) and returns either a
data.frame or a named character vector.
RetrClustAssign(data, catVals, clusterFUN = "pam", distName, nCl, outFormat = c("data.frame", "vector"), clusterNames = NULL, ordering = c("data", "heatmap"))
| Argument | Description |
|---|---|
| data | A matrix or data.frame of categorical observations (rows are observations, columns are variables). Row names should be present if |
| catVals | Vector of possible category values used by |
| clusterFUN | Character name of a clustering algorithm supported by the workflow (e.g., |
| distName | Character name of the distance to use. |
| nCl | Integer number of clusters. |
| outFormat | Output format: |
| clusterNames | Optional cluster names passed to |
| ordering | If |
A data.frame (default) or a named character vector of cluster labels.
    # x <- matrix(sample(letters[1:3], 120, TRUE), nrow = 30)
    # rownames(x) <- paste0("s", seq_len(nrow(x)))
    # RetrClustAssign(x, catVals = letters[1:3], distName = "hamming", nCl = 3)
    # RetrClustAssign(x, catVals = letters[1:3], distName = "hamming", nCl = 3,
    #                 outFormat = "vector")
A function to simulate data based on a multinomial parameter vector or a list of parameter vectors.
SimData(N, nQ, pi, dataClass = "nominal", seed = NULL)
| Argument | Description |
|---|---|
| N | Integer. Number of observations. |
| nQ | Integer. Number of questions. |
| pi | Numeric vector. Vector of probabilities adding up to 1. Alternatively, pi can be a list of vectors as previously described with length equal to |
| dataClass | Character. Either 'nominal' or 'ordinal'. |
| seed | Integer. Numerical seed for the RNG. |
N x nQ matrix with simulated categories distributed according to vector pi
    Pix <- setNames(c(0.1, 0.2, 0.3, 0.4, 0), paste0('a', 1:5))
    X <- SimData(N=10, nQ=5, Pix, dataClass = 'nominal')
    head(X)
    Piy <- setNames(c(0.3, 0.2, 0.5), paste0('b', 1:3))
    Y <- SimData(N=10, nQ=3, Piy, dataClass = 'nominal')
    head(Y)
    PiZ <- list(x1 = Pix, y1 = Piy, y2 = Piy)
    Z <- SimData(N = 10, nQ = length(PiZ), PiZ)
    Piw <- setNames(Piy, 1:3)
    W <- SimData(N=10, nQ=3, Piw, dataClass = 'ord')
    head(W)
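For readers who want to see what drawing categories from a named probability vector amounts to, here is a base R sketch using `sample()` with the `prob` argument. The helper `sim_sketch` is hypothetical and only covers the single-vector nominal case. It assumes every cell is drawn independently, which the manual does not state explicitly.

```r
# Hypothetical helper: N x nQ matrix of categories drawn from a named
# probability vector. Assumption: all cells are independent draws.
sim_sketch <- function(N, nQ, pi, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  draws <- sample(names(pi), size = N * nQ, replace = TRUE, prob = pi)
  matrix(draws, nrow = N, ncol = nQ)
}

Pi_demo <- setNames(c(0.1, 0.2, 0.3, 0.4, 0), paste0("a", 1:5))
X_demo <- sim_sketch(N = 10, nQ = 5, pi = Pi_demo, seed = 123)
head(X_demo)
```

Because "a5" has probability 0 in this example, it never appears in the simulated matrix even though it belongs to the declared category set.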