
Perform lumping on a hierarchical nominal variable
Usage
lump_hierarchical(
  data,
  threshold,
  clusters,
  verbose = FALSE,
  alternative_metric = c("mutual information", "bin count", "surplus"),
  level_namer = default_level_namer
)
Arguments
- data
Factor or character vector of the categorical data.
- threshold
The minimum number of samples each lumped level should contain.
- clusters
List of character vectors representing the levels that are allowed to be lumped together.
- verbose
Logical value indicating whether values should be printed. Default: FALSE.
- alternative_metric
The metric to optimise, if different from the default (mutual information). For an explanation of the metrics, see vignette("metrics").
- level_namer
Function that takes a character vector of the original levels in a lump and returns the name of the new lumped level. Default: concatenating the original levels with a "+" in between.
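As an illustration of the level_namer contract described above (a character vector of original levels in, a single name out), here is a minimal sketch in base R. Both function names are hypothetical: plus_namer only mimics the documented default behaviour and is not the package's default_level_namer, and short_namer is an invented alternative.

```r
# Mimics the documented default: join the original levels with "+".
# (Illustrative only; not the package's own default_level_namer.)
plus_namer <- function(levels) paste(levels, collapse = "+")

# Hypothetical alternative: keep the first level and count the rest.
short_namer <- function(levels) {
  if (length(levels) == 1) {
    return(levels)  # a single level keeps its own name
  }
  paste0(levels[1], " and ", length(levels) - 1, " other(s)")
}

plus_namer(c("Vietnam", "Japan"))
#> [1] "Vietnam+Japan"
short_namer(c("Germany", "Netherlands", "France"))
#> [1] "Germany and 2 other(s)"
short_namer("China")
#> [1] "China"
```

A function like short_namer could then be supplied as level_namer = short_namer, assuming the argument is used as described above.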
See also
maximum_mutual_information_hierarchical() for the underlying algorithm that this function wraps.
lump_nominal() for a more general version of this function that does not need the hierarchical structure in the data, but may be slower.
Examples
country <- c("Germany", "Netherlands", "France", "France", "China", "China",
             "China", "China", "Vietnam", "Vietnam", "Japan", "Japan")
lump_hierarchical(
  country,
  4,
  list(c("Germany", "Netherlands", "France"), c("China", "Vietnam", "Japan"))
)
#> [1] Germany+Netherlands+France Germany+Netherlands+France
#> [3] Germany+Netherlands+France Germany+Netherlands+France
#> [5] China China
#> [7] China China
#> [9] Vietnam+Japan Vietnam+Japan
#> [11] Vietnam+Japan Vietnam+Japan
#> Levels: Germany+Netherlands+France Vietnam+Japan China
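To see why the output above looks the way it does, it can help to tabulate the input first. The following sketch uses only base R and does not call the package; the variable names threshold, counts, below and cluster_sizes are illustrative.

```r
country <- c("Germany", "Netherlands", "France", "France", "China", "China",
             "China", "China", "Vietnam", "Vietnam", "Japan", "Japan")
clusters <- list(c("Germany", "Netherlands", "France"),
                 c("China", "Vietnam", "Japan"))
threshold <- 4

# Number of samples per original level.
counts <- table(country)

# Levels that fall below the threshold on their own.
below <- names(counts)[as.vector(counts < threshold)]

# Total number of samples available within each cluster.
cluster_sizes <- sapply(clusters, function(cl) sum(country %in% cl))
cluster_sizes
#> [1] 4 8
```

Only China reaches the threshold of 4 by itself. The first cluster holds exactly 4 samples in total, so its three levels end up in a single lump, while in the second cluster Vietnam and Japan together reach 4 and China can remain separate.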