Lumps the levels of a nominal variable so as to preserve as much mutual
information as possible between the lumped variable and a supplied outcome.
The outcome may be discrete (a factor) or continuous (numeric); see
vignette("supervised") for the distinction.
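To illustrate the quantity being preserved, the following is a minimal base-R sketch of empirical mutual information for a discrete outcome. The helper `mutual_information()` and the hand-picked lumpings are hypothetical, written only for this example; they are not the package's implementation and do not cover the continuous-outcome case.

```r
# Hypothetical helper: mutual information (in nats) between two discrete
# vectors, estimated from their empirical joint distribution. Base R only.
mutual_information <- function(x, y) {
  joint <- table(x, y) / length(x)   # empirical joint probabilities
  px <- rowSums(joint)               # marginal distribution of x
  py <- colSums(joint)               # marginal distribution of y
  expected <- outer(px, py)          # joint distribution under independence
  nz <- joint > 0                    # skip empty cells (0 * log 0 is taken as 0)
  sum(joint[nz] * log(joint[nz] / expected[nz]))
}

# Toy data: levels "a" and "b" always go with "yes", "c" and "d" with "no".
x <- c("a", "a", "b", "b", "c", "c", "d", "d")
y <- c("yes", "yes", "yes", "yes", "no", "no", "no", "no")

# Two lumpings chosen by hand for illustration.
good_lump <- ifelse(x %in% c("a", "b"), "a+b", "c+d")  # merges similar levels
bad_lump  <- ifelse(x %in% c("a", "c"), "a+c", "b+d")  # merges dissimilar levels

mi_original <- mutual_information(x, y)
mi_good     <- mutual_information(good_lump, y)
mi_bad      <- mutual_information(bad_lump, y)
```

In this toy case the first lumping keeps all of the information about the outcome (`mi_good` equals `mi_original`), while the second discards it entirely (`mi_bad` is zero). Both lumpings produce levels of the same size, so a size threshold alone would not distinguish them.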
Usage
lump_nominal_supervised(
  data,
  outcome,
  threshold,
  adj_matrix = NULL,
  verbose = FALSE,
  level_namer = default_level_namer,
  outcome_mode = c("auto", "discrete", "continuous")
)
Arguments
- data
Factor or character vector of the categorical data.
- outcome
Factor or character vector (discrete outcome), or numeric vector (continuous outcome). Variable to be used as a source of information about data.
- threshold
The minimum number of samples each lumped level should contain.
- adj_matrix
Adjacency matrix of the preference graph. Default: a complete graph, allowing all lumpings.
- verbose
Logical value dictating if values should be printed. Default: FALSE.
- level_namer
Function that takes a character vector of the original levels in a lump and returns the name of the new lumped level. Default: concatenating the original levels with a "+" in between.
- outcome_mode
Whether to treat the outcome as discrete or continuous. Default: "auto", which infers the mode from the type of outcome.
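As a rough illustration of the level_namer and adj_matrix arguments described above, the sketch below builds a custom namer and a restricted preference graph in base R. The names `plus_namer`, `counting_namer`, `lv` and `adj` are hypothetical. The matrix layout (symmetric, 0/1 entries, levels as row and column names) is an assumption, since this page does not specify the expected format.

```r
# Hypothetical stand-in mirroring the documented default behaviour:
# join the original levels with "+".
plus_namer <- function(levels) paste(levels, collapse = "+")

# Hypothetical custom namer: keep the first level and count the rest.
counting_namer <- function(levels) {
  if (length(levels) == 1L) {
    levels
  } else {
    sprintf("%s (+%d more)", levels[1], length(levels) - 1L)
  }
}

name_default <- plus_namer(c("red", "green", "blue"))
name_custom  <- counting_namer(c("red", "green", "blue"))

# Assumed adjacency-matrix layout: symmetric 0/1 matrix indexed by level name.
# Only neighbouring levels are connected, so "low" and "high" could not be
# lumped together directly.
lv <- c("low", "medium", "high")
adj <- matrix(0, nrow = 3, ncol = 3, dimnames = list(lv, lv))
adj["low", "medium"]  <- adj["medium", "low"]  <- 1
adj["medium", "high"] <- adj["high", "medium"] <- 1
```

Following the Usage section, objects like these would be supplied as `level_namer = counting_namer` and `adj_matrix = adj`. Whether the function accepts this exact matrix form should be checked against the package itself.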
See also
maximum_mutual_information_nominal_supervised() for the underlying algorithm that this function wraps.
lump_hierarchical_supervised() for a version of this function that can take advantage of hierarchical structure in the data to speed up the execution time.
