Package 'DiscreteGapStatistic'

Title: An Extension of the Gap Statistic for Ordinal/Categorical Data
Description: The gap statistic approach is extended to estimate the number of clusters for categorical response format data. This approach and the accompanying software are designed to be used with the output of any clustering algorithm and with distances specifically designed for categorical (i.e., multiple-choice) or ordinal survey response data.
Authors: Jeffrey Miecznikowski [aut], Eduardo Cortes [aut, cre] (ORCID: <https://orcid.org/0000-0002-0966-6488>)
Maintainer: Eduardo Cortes <[email protected]>
License: MIT + file LICENSE
Version: 1.1.2
Built: 2026-06-05 06:49:14 UTC
Source: https://github.com/ecortesgomez/discretegapstatistic

Help Index


Bhattacharyya Distance via Rcpp

Description

Computes pairwise Bhattacharyya distance between rows.

Arguments

x

n X p character matrix.

offset

small offset for log(0*0) cases.

Details

Bhattacharyya Distance

Value

Distance matrix between rows.
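
As a rough illustration of the quantity involved, the sketch below computes a Bhattacharyya distance in base R. It assumes each row is summarized by its relative category frequencies and that the offset is added inside the logarithm; both are assumptions made for this example, and the helper name bhattacharyyaRows is hypothetical. It is not the package's Rcpp routine.

```r
# Illustrative base-R sketch; NOT the package's Rcpp implementation.
# Assumption: each row is represented by its relative category frequencies.
bhattacharyyaRows <- function(x, offset = 1e-7) {
  lev <- sort(unique(as.vector(x)))
  # n x L matrix: share of each category within every row
  P <- vapply(lev, function(l) rowMeans(x == l), numeric(nrow(x)))
  # Bhattacharyya coefficient for every pair of rows
  BC <- sqrt(P) %*% t(sqrt(P))
  # The offset keeps log() finite when two rows share no category;
  # pmax() clips the tiny negative values the offset would otherwise create
  D <- pmax(0, -log(BC + offset))
  as.dist(D)
}

x <- rbind(c("a", "a", "b", "b"),
           c("a", "a", "b", "b"),
           c("c", "c", "c", "c"))
d <- bhattacharyyaRows(x)
d
```

Rows 1 and 2 have identical profiles and get distance 0, while row 3 shares no category with them, so its distance is capped by the offset at -log(offset).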


Chi-square Distance via Rcpp

Description

Computes pairwise Chi-square distance between rows.

Arguments

x

n X p character matrix.

Details

Chi-square Distance

Value

Distance matrix between rows.
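
Several definitions of a chi-square distance are in use, and this page does not state which one the package applies. The sketch below shows one common symmetric form on row-wise category frequency profiles, sum((p - q)^2 / (p + q)), purely as an illustration. The helper name chisqRows and the choice of formula are assumptions.

```r
# Illustrative base-R sketch of ONE common chi-square distance;
# the package's exact formula may differ.
chisqRows <- function(x) {
  lev <- sort(unique(as.vector(x)))
  # n x L matrix: share of each category within every row
  P <- vapply(lev, function(l) rowMeans(x == l), numeric(nrow(x)))
  n <- nrow(x)
  D <- matrix(0, n, n)
  for (i in seq_len(n - 1)) for (j in (i + 1):n) {
    s <- P[i, ] + P[j, ]
    keep <- s > 0                      # skip categories absent from both rows
    D[i, j] <- D[j, i] <- sum((P[i, keep] - P[j, keep])^2 / s[keep])
  }
  as.dist(D)
}

x <- rbind(c("a", "a", "b", "b"),
           c("a", "a", "b", "b"),
           c("c", "c", "c", "c"))
d <- chisqRows(x)
d
```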


Discrete application of clusGap

Description

Based on the implementation of the function found in the cluster R package.

Usage

clusGapDiscr(
  x,
  clusterFUN,
  K.max,
  B = nrow(x),
  value.range = "DS",
  verbose = interactive(),
  distName = "hamming",
  useLog = TRUE,
  dataClass = "nom",
  offset = 1e-07,
  ...
)

Arguments

x

A matrix object specifying category attributes in the columns and observations in the rows.

clusterFUN

Character string with one of the available clustering implementations. Available options are: 'pam' (default) from cluster::pam, 'diana' from cluster::diana, 'fanny' from cluster::fanny, 'agnes-{average, single, complete, ward, weighted}' from cluster::agnes, 'hclust-{ward.D, ward.D2, single, complete, average, mcquitty, median, centroid}' from stats::hclust, 'kmodes' from klaR::kmodes (iter.max = 10, weighted = FALSE and fast = TRUE). 'kmodes-N' runs the kmodes algorithm with a given number N of iterations, i.e. iter.max = N.

K.max

Integer. Maximum number of clusters k to consider

B

Number of bootstrap samples. By default B = nrow(x).

value.range

A length-1 character string, a character vector, or a list of character vectors whose length matches the number of columns (nQ) of the array. By default value.range = 'DS' (Data Support null model). Alternatively, supply a vector with all categories (character for nominal data, integer for ordinal data) to consider when bootstrapping the null distribution sample (KS: Known Support option). If a list of category vectors is provided, it must have as many elements as the input array has columns, and the order of the list elements corresponds to the array's columns.

verbose

Integer or logical. Determines whether progress output should be printed while running. By default one bit is printed per bootstrap sample.

distName

String. Name of categorical distance to apply. Available distances: 'bhattacharyya', 'chisquare', 'cramerV', 'hamming' and 'hellinger'.

useLog

Logical. Use log function after estimating W.k. Following the original formulation useLog=TRUE by default.

dataClass

character. Either 'nom' for nominal or 'ord' for ordinal.

offset

numerical. A small constant value added to W.k to avoid NAs when running useLog=TRUE for clusters with extremely low variability. offset = 1e-07 is set by default.

...

Optional further arguments for FUNcluster().

Value

A matrix with K.max rows and 4 columns, named "logW", "E.logW", "gap", and "SE.sim", where gap = E.logW - logW, and SE.sim corresponds to the standard error of gap.
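
To make the four returned columns concrete, here is a minimal base-R sketch of a gap statistic on toy nominal data. It is not the package's implementation: it assumes Hamming distance, average-linkage hclust as the clustering step, and a null reference that resamples each column uniformly over its observed categories (one reading of the "Data Support" model, not confirmed by this page). The helper names hamming and Wk are hypothetical.

```r
set.seed(1)
# Toy nominal data: two groups of 20 rows answering 6 questions
x <- rbind(matrix(sample(c("a", "b", "c"), 120, TRUE, prob = c(.8, .1, .1)), 20),
           matrix(sample(c("a", "b", "c"), 120, TRUE, prob = c(.1, .1, .8)), 20))

# Hamming distance: number of columns in which two rows differ
hamming <- function(m) {
  n <- nrow(m)
  D <- matrix(0, n, n)
  for (i in seq_len(n)) D[i, ] <- rowSums(m != m[rep(i, n), , drop = FALSE])
  as.dist(D)
}

# Pooled within-cluster dispersion W_k: sum over clusters of D_r / (2 * n_r)
Wk <- function(m, k) {
  D  <- as.matrix(hamming(m))
  cl <- if (k == 1) rep(1L, nrow(m)) else
    cutree(hclust(as.dist(D), method = "average"), k)
  sum(vapply(split(seq_len(nrow(m)), cl),
             function(idx) sum(D[idx, idx]) / (2 * length(idx)), numeric(1)))
}

K.max <- 4
B <- 20
logW <- vapply(seq_len(K.max), function(k) log(Wk(x, k)), numeric(1))

# Null reference (assumed): each column resampled uniformly over its categories
logWnull <- t(replicate(B, {
  z <- apply(x, 2, function(col) sample(unique(col), length(col), replace = TRUE))
  vapply(seq_len(K.max), function(k) log(Wk(z, k)), numeric(1))
}))

E.logW <- colMeans(logWnull)
SE.sim <- sqrt(1 + 1 / B) * apply(logWnull, 2, sd)
res <- cbind(logW, E.logW, gap = E.logW - logW, SE.sim)
res
```

Each row of res corresponds to one candidate k, so the matrix has the same shape as the value described above.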


Discrete application of clusGap - core function.

Description

Based on the implementation of the function found in the cluster R package. This function assumes that all attributes have identical categories.

Usage

clusGapDiscr0(
  x,
  FUNcluster,
  K.max,
  B = nrow(x),
  value.range = "DS",
  verbose = interactive(),
  distName = "hamming",
  useLog = TRUE,
  Input2Alg = "distMatr",
  dataClass = "nom",
  offset = 0,
  ...
)

Arguments

x

A matrix object specifying category attributes in the columns and observations in the rows.

FUNcluster

A function that accepts as its first argument a matrix like x; the second argument specifies the number of clusters k (k >= 2). The function should return a list with a component named cluster, a vector of length n = nrow(x) of integers from 1:k indicating each observation's cluster assignment. Make sure FUNcluster and Input2Alg agree.

K.max

Integer. Maximum number of clusters k to consider

B

Number of bootstrap samples. By default B = nrow(x).

value.range

A length-1 character string, a character vector, or a list of character vectors whose length matches the number of columns (nQ) of the array. By default value.range = 'DS' (Data Support null model). Alternatively, supply a vector with all categories (character for nominal data, integer for ordinal data) to consider when bootstrapping the null distribution sample (KS: Known Support option). If a list of category vectors is provided, it must have as many elements as the input array has columns, and the order of the list elements corresponds to the array's columns.

verbose

Integer or logical. Determines whether progress output should be printed while running. By default one bit is printed per bootstrap sample.

distName

String. Name of categorical distance to apply. Available distances: 'bhattacharyya', 'chisquare', 'cramerV', 'hamming' and 'hellinger'.

useLog

Logical. Use log function after estimating W.k. Following the original formulation useLog=TRUE by default.

Input2Alg

Specifies the kind of input provided to the algorithm function in FUNcluster. For algorithms that only accept a distance matrix, use the 'distMatr' option (default). For algorithms that require the dataset and a prespecified distance function (e.g. stats::dist), use the 'distFun' option. In this case the distance function is defined internally and determined by the parameter distName.

dataClass

character. Either 'nom' for nominal or 'ord' for ordinal.

offset

numerical. A small constant value added to W.k to avoid NAs when running useLog=TRUE for clusters with extremely low variability. offset = 0 is set by default.

...

Optional further arguments for FUNcluster().

Value

A matrix with K.max rows and 4 columns, named "logW", "E.logW", "gap", and "SE.sim", where gap = E.logW - logW, and SE.sim corresponds to the standard error of gap.
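
The FUNcluster contract described above (first argument the input, second argument k, return value a list with a cluster component) can be sketched as follows for the 'distMatr' case. The wrapper name hclustAvg is hypothetical, and whether the package hands over a dist object or a plain matrix is not stated here, so the sketch accepts both through as.dist.

```r
# Hypothetical FUNcluster-style wrapper: takes a distance matrix (or dist
# object) and k, returns a list with an integer 'cluster' vector.
hclustAvg <- function(d, k, ...) {
  tree <- stats::hclust(stats::as.dist(d), method = "average")
  list(cluster = stats::cutree(tree, k = k))
}

# Quick check on a toy Hamming distance matrix
x <- rbind(c("a", "a", "a"),
           c("a", "a", "b"),
           c("c", "c", "c"),
           c("c", "c", "b"))
D <- outer(seq_len(nrow(x)), seq_len(nrow(x)),
           Vectorize(function(i, j) sum(x[i, ] != x[j, ])))
res <- hclustAvg(D, k = 2)
res$cluster
```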


Clustering generating function

Description

A function that generates formatted clustering functions that can be plugged into clusGapDiscr, enabling it to run a wide variety of clustering algorithms.

Usage

clusterFunSel(clustFun)

Arguments

clustFun

A character string with the following possible options: 'pam' (default) from cluster::pam, 'diana' from cluster::diana, 'fanny' from cluster::fanny, 'agnes-{average, single, complete, ward, weighted}' from cluster::agnes, 'hclust-{ward.D, ward.D2, single, complete, average, mcquitty, median, centroid}' from stats::hclust, 'kmodes' from klaR::kmodes (iter.max = 10, weighted = FALSE and fast = TRUE). 'kmodes-N' runs the kmodes algorithm with a given number N of iterations, i.e. iter.max = N.

Value

An object of class kmodes as found in the klaR package. An additional component specifies the categorical distance function found in distFun.


Concussion Data

Description

A data frame with 109 observations and 21 questions. Severity rating recorded as categorical responses from c1 (none) to c7 (severe).

Usage

concussion

Format

data.frame

Q1: Headache

Headache

Q2: Nausea

Nausea

Q3: Balance problems

Balance problems

Q4: Dizziness

Dizziness

Q5: Fatigue

Fatigue

Q6: Sleep more

Sleeping more than usual

Q7: Drowsiness

Drowsiness

Q8: Sensitivity to light

Sensitivity to light

Q9: Sensitivity to noise

Sensitivity to noise

Q10: Irritability

Irritability

Q11: Sadness

Sadness

Q12: Nervousness

Nervousness/Anxiousness

Q13: More emotional

Feeling more emotional

Q14: Feeling slowed down

Feeling slowed down

Q15: Feeling mentally foggy

Feeling mentally foggy

Q16: Difficulty concentrating

Difficulty concentrating

Q17: Difficulty remembering

Difficulty remembering

Q18: Visual problem

Visual problems

Q19: Confusion

Confusion

Q20: Feeling clumsy

Feeling clumsy

Q21: Answering more slowly

Answering more slowly


Run the Discrete Gap Statistic workflow and save plots

Description

High-level wrapper that (i) saves a distance heatmap, (ii) runs clusGapDiscr, (iii) saves a Gap Statistic plot (with the chosen K marked), (iv) saves the resulting categorical heatmap, and (v) attempts to save an MDS plot colored by the chosen clusters.

Usage

DGSrun(
  x,
  catVals,
  dataClass,
  clusterFUN = "pam",
  B = 100,
  K.max = 7,
  value.range = "DS",
  distName = "hamming",
  useLog = TRUE,
  title = NULL,
  outDir = "./"
)

Arguments

x

A matrix of categorical observations. Must be a matrix (enforced).

catVals

Vector of possible category values for the variables in x.

dataClass

Data class indicator passed to clusGapDiscr.

clusterFUN

Character name of a clustering algorithm (default "pam").

B

Integer number of bootstrap samples for the gap statistic.

K.max

Maximum number of clusters considered.

value.range

Character string passed to clusGapDiscr.

distName

Character name of the distance metric (default "hamming").

useLog

Logical; passed to clusGapDiscr.

title

Optional title prefix used in plot titles and output filenames.

outDir

Output directory where PNG files will be written.

Details

This function includes a device "safety rail" that closes any graphics devices opened during its execution (useful if downstream plotting code forgets to call dev.off()).
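
One way such a safety rail can be built in base R is to record the open devices on entry and close any new ones on exit. The sketch below illustrates that idea only; the helper name withDeviceGuard is hypothetical and this is not taken from the package source.

```r
# Hypothetical helper: evaluates 'expr' and closes any graphics device
# that was opened during the evaluation and left open.
withDeviceGuard <- function(expr) {
  before <- grDevices::dev.list()           # devices open on entry (may be NULL)
  on.exit({
    opened <- setdiff(grDevices::dev.list(), before)
    for (d in opened) grDevices::dev.off(d) # close only the newly opened ones
  }, add = TRUE)
  force(expr)                               # lazy argument is evaluated here
}

before <- grDevices::dev.list()
withDeviceGuard({
  grDevices::pdf(NULL)   # open a device that writes no file ...
  plot(1:3)              # ... draw, and "forget" to call dev.off()
})
after <- grDevices::dev.list()
```

Because the cleanup sits in on.exit, it also runs when the plotting code stops with an error.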

Value

The object returned by clusGapDiscr.


Sample-to-sample heatmap

Description

Sample-to-sample heatmap that clusters samples according to a given categorical distance. Exploratory tool that helps to visualize/cluster blocks of observations across columns ordered according to the given categorical distance. The final output is a clustered distance matrix. This plot is intended to give the user an idea of which type of categorical distance best suits the input data. distanceHeat is based on the pheatmap function from the pheatmap R package. Thus, any parameter found in pheatmap can be passed to distanceHeat.

Usage

distanceHeat(
  x,
  distName,
  clustering_method = "complete",
  border_color = NA,
  ...
)

Arguments

x

matrix object or data.frame

distName

Name of categorical distance to apply. Available distances: 'bhattacharyya', 'chisquare', 'cramerV', 'hamming' and 'hellinger'.

clustering_method

string; clustering method used by pheatmap.

border_color

string; color of cell borders. By default, border_color = NA, where no border colors are shown.

...

Other valid arguments of the pheatmap function.

Value

clustered heatmap


Calculate categorical distance matrix for discrete data

Description

Dispatcher for discrete distance functions (nominal + ordinal).

Usage

distancematrix(X, d)

Arguments

X

Matrix where rows are observations and columns are discrete features.

d

Character scalar naming the distance.

Value

An object of class dist.

Examples

X <- rbind(matrix(paste0("a", rpois(7*5, 1)), nrow=5),
           matrix(paste0("a", rpois(7*5, 3)), nrow=5))
distancematrix(X = X, d = "hellinger")

Criteria to determine number of clusters k

Description

Similar to the maxSE function found in the cluster package.

Usage

findK(cG_obj, meth = "Tibs2001SEmax", SE.fact = 1)

Arguments

cG_obj

Output object obtained from clusGapDiscr

meth

Method to use to determine optimal k number of clusters.

SE.fact

Standard Error Factor generalizing the 1-SE rule.

Value

A numerical value from 1 to K.max, contained in the input cG_obj object.
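
For orientation, the rule named "Tibs2001SEmax" in cluster::maxSE picks the smallest k whose gap is at least gap(k + 1) minus SE.fact times SE(k + 1). The base-R sketch below applies that rule to made-up numbers; it assumes findK follows the same definition, which this page only suggests through the reference to maxSE. The helper name tibs2001 and the input values are illustrative.

```r
# Illustrative re-statement of the Tibs2001SEmax rule on made-up values
tibs2001 <- function(gap, SE, SE.fact = 1) {
  K <- length(gap)
  lower <- gap - SE.fact * SE          # gap(k) - SE.fact * SE(k)
  ok <- gap[-K] >= lower[-1]           # gap(k) >= gap(k+1) - SE.fact * SE(k+1)
  if (any(ok)) which.max(ok) else K    # first k satisfying the rule, else K.max
}

gap <- c(0.10, 0.30, 0.28, 0.35)
SE  <- c(0.02, 0.03, 0.03, 0.04)
k <- tibs2001(gap, SE)
k
```

With these numbers the rule first holds at k = 2, since 0.30 >= 0.28 - 0.03.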


Adapted k-modes algorithm

Description

K-modes function adapted to accept any categorical distance, based on the function found in klaR::kmodes.

Usage

kmodesD(data, modes, distFun, iter.max = 10)

Arguments

data

A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.

modes

The number of modes

distFun

Pairwise categorical distance function. A function accepting two categorical vectors.

iter.max

The maximum number of iterations allowed.

Value

An object of class kmodes as found in the klaR package. An additional component specifies the categorical distance function found in distFun.


Summary Heatmap for categorical data

Description

Heatmap representation summarizing categorical/likert data. Modified version of likert.heat.plot from likert package. Does not allow different categorical ranges across questions. The function outputs a ggplot object where additional layers can be added for customization purposes. The output plot preserves the question order given by columns of x.

Usage

likert.heat.plot2(
  x,
  allLevels,
  low.color = "white",
  high.color = "blue",
  text.color = "black",
  text.size = 4,
  textLen = 50
)

Arguments

x

matrix object or data.frame with categorical data. Columns are questions and rows are observations.

allLevels

vector with all categorical (ordered) levels.

low.color

string; name of color assigned to the first level found in allLevels.

high.color

string; name of color assigned to the last level found in allLevels.

text.color

string; text color of numbers within cells.

text.size

numeric; text size for numbers within cells.

textLen

numeric; maximum number of characters shown for question labels (column names).

Value

ggplot object.
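
As an illustration of the kind of per-question summary a Likert heatmap displays, the base-R sketch below tabulates the percentage of responses at each level for every column. It assumes the plotted cell values are percentages, as in the likert package's heat plot; the toy data and object names are made up, and no plotting is done.

```r
# Toy data: two questions answered on a three-level ordered scale
x <- data.frame(Q1 = c("Low", "Low", "High", "Mid"),
                Q2 = c("High", "High", "Mid", "High"))
allLevels <- c("Low", "Mid", "High")

# Percentage of responses per level (rows) and question (columns);
# factor(levels = allLevels) keeps levels with zero counts and their order
pct <- sapply(x, function(col)
  100 * prop.table(table(factor(col, levels = allLevels))))
pct
```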


mass data

Description

Data extracted from the likert R package. Results from an administration of the Math Anxiety Scale Survey. The first column records student gender, either Female or Male. All statement answers have 5 possible ordinal categorical items: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.

Usage

mass

Format

data.frame

Gender

Gender

I find math interesting.

Math interesting

I get uptight during math tests.

Uptight with math tests

I think that I will use math in the future.

Use math in the future

Mind goes blank and I am unable to think clearly when doing my math test.

Mind goes blank in math tests

Math relates to my life.

Math relates to own life

I worry about my ability to solve math problems.

Worry about ability math problem solving

I get a sinking feeling when I try to do math problems.

Sinking feeling doing math problems

I find math challenging.

Math is challenging

Mathematics makes me feel nervous.

Nervousness with math

I would like to take more math classes.

Take more math classes

Mathematics makes me feel uneasy.

Uneasy feeling with math

Math is one of my favorite subjects.

Favorite subject is math

I enjoy learning with mathematics.

Enjoy learning math

Mathematics makes me feel confused.

Confused with math

Source

https://rdrr.io/cran/likert/man/mass.html


MDS Plots for Categorical Data

Description

Function to visualize the distribution and spread of categorical data in two dimensions starting from the distance matrix. A cluster assignment vector needs to be provided to show color coded points.

Usage

plotMDS2(
  x,
  cl,
  type = "MDS",
  cols = NULL,
  dotSize = 3,
  LabTitle = type,
  outDir = NULL,
  filename = NULL,
  addRowNames = FALSE,
  labSize = 3,
  out = "plot"
)

Arguments

x

dist object

cl

character. Vector with clustering assignments. Important: the length and order of the vector (and, if possible, its names) should match the rows of the observed data matrix used to generate the distance object x.

type

character. String specifying the class of MDS: either classic 'MDS' (default) or Non-metric Multidimensional Scaling 'NMDS'. The core function used is stats::cmdscale(x, k = 2) for MDS and vegan::metaMDS(comm = x, distance = "none", k = 2, trymax = 100, autotransform = FALSE) for NMDS, which uses vegan::monoMDS by default.

cols

character. Vector of colors to use with clustering labels. Cluster names should match the ones provided in cl. By default (cols = NULL) the function produces highly contrasting colors.

dotSize

numeric. Single value specifying the point size. dotSize = 3 by default.

LabTitle

character String for the plot's title. By default LabTitle = type.

outDir

character. String with the directory path where the output file is saved. outDir = NULL is the default, in which case no file is written.

filename

character. String with the name of the output file. filename = NULL by default, in which case no file is written. This parameter should have a valid ggplot2 output graphical format extension like 'FileName.{png,pdf,ps,...}'.

addRowNames

logical. Single value indicating whether to place observation names next to points. The labels used are the names found in cl. If names(cl) is NULL, the samples are labelled in order of appearance. The ggrepel package is used to position the labels.

labSize

numeric Single value indicating the size of labels. labSize = 3 by default. Non-functional if addRowNames = FALSE.

out

character. String specifying the output to obtain: 'plot' (default) returns the resulting ggplot object, 'data.frame' returns a data.frame with coordinates and corresponding clustering assignments, and 'object' returns the raw object produced by the chosen type above.

Value

A png file and, depending on the out option, a data.frame with coordinates and labels, a list with related MDS results, or a metaMDS object.
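
The classic 'MDS' path described above rests on stats::cmdscale(x, k = 2). The base-R sketch below shows that step on toy data and assembles a coordinates-plus-cluster data.frame similar in spirit to the 'data.frame' output. The toy data, the hclust-based cluster labels, and the column names Dim1, Dim2 and cluster are assumptions for this example; no plot is drawn.

```r
set.seed(42)
# Toy categorical data: 12 observations, 5 questions
x <- matrix(sample(c("a", "b", "c"), 60, TRUE), nrow = 12)
rownames(x) <- paste0("s", seq_len(nrow(x)))

# Hamming distance as a dist object (labels taken from the row names)
H <- outer(seq_len(nrow(x)), seq_len(nrow(x)),
           Vectorize(function(i, j) sum(x[i, ] != x[j, ])))
dimnames(H) <- list(rownames(x), rownames(x))
D <- as.dist(H)

# Cluster labels for coloring (any assignment vector of length n would do)
cl <- cutree(hclust(D, method = "average"), k = 3)

# Classical MDS in two dimensions
xy <- cmdscale(D, k = 2)
mds <- data.frame(Dim1 = xy[, 1], Dim2 = xy[, 2], cluster = factor(cl))
head(mds)
```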


Discrete Data Heatmap

Description

Heatmap assuming a given distance function and a known number of clusters. Function to display a categorical data matrix given a user-defined number of clusters nCl, a categorical distance distName and a predefined clustering method clusterFUN. The output displays a heatmap separating and color-labelling the resulting clusters vertically in the rows and allowing unsupervised clustering of questions in the columns. Each cell is colored according to the categorical values provided or found in the data. The clustergram is based on the pheatmap function from the pheatmap R package. This function can be used to examine the number of clusters before running clusGapDiscr but also after the number of clusters is determined.

Usage

ResHeatmap(
  x,
  nCl,
  distName,
  catVals,
  clusterFUN,
  out = "heatmap",
  seed = NULL,
  clusterNames = NULL,
  prefObs = NULL,
  rowNames = rownames(x),
  filename = NULL,
  outDir = NULL,
  height = 10,
  width = 6
)

Arguments

x

matrix object

nCl

Number of clusters to plot. If nCl is a permutation vector of the first N integers, the clusters are rearranged according to the given ordering.

distName

Name of categorical distance to apply. Available distances: 'bhattacharyya', 'chisquare', 'cramerV', 'hamming' and 'hellinger'.

catVals

vector with (ordered) categorical values.

clusterFUN

Character string with one of the available clustering implementations. Available options are: 'pam' from cluster::pam, 'diana' from cluster::diana, 'fanny' from cluster::fanny, 'agnes-{average, single, complete, ward, weighted}' from cluster::agnes, 'hclust-{ward.D, ward.D2, single, complete, average, mcquitty, median, centroid}' from stats::hclust, 'kmodes' from klaR::kmodes (weighted = FALSE and fast = TRUE).

out

Specifies the desired output between "heatmap" (default; produce a heatmap), "clusters" (return a data.frame with clustering assignments) or "clustersReord" (return a data.frame with reorganized clusters)

seed

Seed number.

clusterNames

Either null or 'renumber'. When nCl is a numerical vector, the cluster ordering is rearranged. NULL leaves cluster names as their original cluster assignment. 'renumber' respects the rearrangements but relabels the cluster numbers from top to bottom in ascending order.

prefObs

character string vector of length 1 with a prefix for the observations, in case they come unlabelled or the user wants to anonymize sample IDs.

rowNames

character vector with names of rows according to x. By default, rownames(x) will be printed in the plot. rowNames = NULL prevents names from being shown. The prefObs option takes precedence if it is not NULL.

filename

character string with name of file output

outDir

character string with the directory path to save output file

height

numeric height of output plot in inches

width

numeric width of output plot in inches

Value

png file or ComplexHeatmap object


Retrieve cluster assignments from a DiscreteGapStatistic heatmap run

Description

Convenience wrapper around ResHeatmap that extracts cluster labels (optionally re-ordered to match the input data) and returns either a data.frame or a named character vector.

Usage

RetrClustAssign(
  data,
  catVals,
  clusterFUN = "pam",
  distName,
  nCl,
  outFormat = c("data.frame", "vector"),
  clusterNames = NULL,
  ordering = c("data", "heatmap")
)

Arguments

data

A matrix or data.frame of categorical observations (rows are observations, columns are variables). Row names should be present if ordering = "data".

catVals

Vector of possible category values used by ResHeatmap.

clusterFUN

Character name of a clustering algorithm supported by the workflow (e.g., "pam").

distName

Character name of the distance to use.

nCl

Integer number of clusters.

outFormat

Output format: "data.frame" (default) or "vector".

clusterNames

Optional cluster names passed to ResHeatmap.

ordering

If "data", re-order to match rownames(data); if "heatmap", keep ResHeatmap ordering.

Value

A data.frame (default) or a named character vector of cluster labels.

Examples

# x <- matrix(sample(letters[1:3], 120, TRUE), nrow = 30)
# rownames(x) <- paste0("s", seq_len(nrow(x)))
# RetrClustAssign(x, catVals = letters[1:3], distName = "hamming", nCl = 3)
# RetrClustAssign(x, catVals = letters[1:3], distName = "hamming", nCl = 3,
#                 outFormat = "vector")

Simulate Data

Description

A function to simulate data based on a multinomial parameter vector or a list of parameter vectors.

Usage

SimData(N, nQ, pi, dataClass = "nominal", seed = NULL)

Arguments

N

Integer. Number of observations.

nQ

Integer. Number of questions.

pi

Numeric vector. Vector of probabilities adding up to 1. Alternatively, pi can be a list of such vectors with length equal to nQ. In this case, note that the vectors within the list can differ. The order of the pi vectors in the list will be reflected in the resulting column names. If dataClass = 'ordinal', the names of pi must be integers; decimals or other numeric values should be avoided. If dataClass = 'nominal' and the names of pi are numerical, these will remain characters.

dataClass

Character. Either 'nominal' or 'ordinal'.

seed

Integer. Numerical seed for the RNG.

Value

N x nQ matrix with simulated categories distributed according to vector pi

Examples

Pix <- setNames(c(0.1, 0.2, 0.3, 0.4, 0), paste0('a', 1:5))
X <- SimData(N=10, nQ=5, Pix, dataClass = 'nominal')
head(X)

Piy <- setNames(c(0.3, 0.2, 0.5), paste0('b', 1:3))
Y <- SimData(N=10, nQ=3, Piy, dataClass = 'nominal')
head(Y)

PiZ <- list(x1 = Pix, y1 = Piy, y2 = Piy)
Z <- SimData(N = 10, nQ = length(PiZ), PiZ)

Piw <- setNames(Piy, 1:3)
W <- SimData(N=10, nQ=3, Piw, dataClass = 'ord')
head(W)
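
For readers who want to see the sampling idea without the package, the base-R sketch below draws every cell independently from the named probability vector. It covers only the single-vector case and is an assumption about the general approach, not SimData's actual code. The function name simCat and the column names Q1, Q2, ... are hypothetical.

```r
# Hypothetical stand-in for the single-vector case of SimData
simCat <- function(N, nQ, pi, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  stopifnot(!is.null(names(pi)), abs(sum(pi) - 1) < 1e-8)
  # Each cell is an independent draw from the categories named in 'pi'
  draws <- sample(names(pi), size = N * nQ, replace = TRUE, prob = pi)
  matrix(draws, nrow = N, ncol = nQ,
         dimnames = list(NULL, paste0("Q", seq_len(nQ))))
}

Pix <- setNames(c(0.1, 0.2, 0.3, 0.4, 0), paste0("a", 1:5))
X <- simCat(N = 10, nQ = 5, pi = Pix, seed = 1)
head(X)
```

Category a5 has probability 0 in Pix, so it never appears in X.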