
Perform supervised lumping on a hierarchical nominal variable
Source: R/lump.R (lump_hierarchical_supervised.Rd)
Lumps a hierarchical nominal variable (combining only levels within the same
cluster) so as to preserve as much mutual information as possible with a
supplied outcome.
Usage
lump_hierarchical_supervised(
  data,
  outcome,
  threshold,
  clusters,
  verbose = FALSE,
  level_namer = default_level_namer
)
Arguments
- data
Factor or character vector of the categorical data.
- outcome
Factor or character vector. Variable to be used as a source of information about
data.
- threshold
The minimum number of samples each lumped level should contain.
- clusters
List of character vectors representing the levels that are allowed to be lumped together.
- verbose
Logical value indicating whether values should be printed. Default: FALSE.
- level_namer
Function that takes a character vector of the original levels in a lump and returns the name of the new lumped level. Default: concatenating the original levels with a "+" in between.
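As an illustration of the level_namer argument, the sketch below defines a custom namer with the contract described above: it takes a character vector of original levels and returns a single name. The name slash_level_namer is hypothetical and not part of the package; only base R is used.

```r
# Hypothetical custom namer (an assumption, not part of the package):
# joins the original levels with " / " instead of the default "+".
slash_level_namer <- function(levels) {
  paste(levels, collapse = " / ")
}

slash_level_namer(c("Utrecht", "Friesland"))
#> [1] "Utrecht / Friesland"

# With the package loaded, it would be passed like this:
# lump_hierarchical_supervised(data, outcome, threshold = 3,
#                              clusters = clusters,
#                              level_namer = slash_level_namer)
```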
Details
Note that, unlike lump_nominal_supervised() and lump_ordinal_supervised(), this function only supports
discrete (factor) outcomes, not continuous ones.
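For a discrete outcome, the mutual information that the lumping tries to preserve can be estimated from a contingency table. The following is a minimal base-R sketch of that quantity, not the package's implementation; mutual_information is a hypothetical helper, and the lumped vector is written out by hand to match the result in the Examples section.

```r
# Hypothetical helper (not part of the package): plug-in estimate of the
# mutual information, in bits, between two discrete vectors.
mutual_information <- function(x, y) {
  joint <- table(x, y) / length(x)                    # joint relative frequencies
  expected <- outer(rowSums(joint), colSums(joint))   # product of the marginals
  nonzero <- which(joint > 0)                         # skip empty cells
  sum(joint[nonzero] * log2(joint[nonzero] / expected[nonzero]))
}

data <- c("Utrecht", "Utrecht", "Friesland",
          "Friesland", "Friesland", "Bayern",
          "Bayern", "Bayern", "Sachsen")
outcome <- c(1, 0, 1, 1, 0, 0, 0, 1, 1)

# Hand-written lumping that merges all levels within each cluster.
lumped <- ifelse(data %in% c("Utrecht", "Friesland"),
                 "Utrecht+Friesland", "Bayern+Sachsen")

mi_original <- mutual_information(data, outcome)
mi_lumped <- mutual_information(lumped, outcome)
```

Because lumping is a deterministic function of the original levels, mi_lumped can never exceed mi_original under this estimate; the difference is the information lost by merging.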
See also
maximum_mutual_information_hierarchical_supervised() for the underlying algorithm that this function wraps.
lump_nominal_supervised() for a more general version of this function that does not need the hierarchical structure in the data, but may be slower.
Examples
data <- c("Utrecht", "Utrecht", "Friesland",
          "Friesland", "Friesland", "Bayern",
          "Bayern", "Bayern", "Sachsen")
outcome <- c(1, 0, 1, 1, 0, 0, 0, 1, 1)
clusters <- list(
  c("Utrecht", "Friesland"),
  c("Bayern", "Sachsen")
)
lump_hierarchical_supervised(data, outcome, threshold = 3, clusters = clusters)
#> [1] Utrecht+Friesland Utrecht+Friesland Utrecht+Friesland Utrecht+Friesland
#> [5] Utrecht+Friesland Bayern+Sachsen Bayern+Sachsen Bayern+Sachsen
#> [9] Bayern+Sachsen
#> Levels: Bayern+Sachsen Utrecht+Friesland