Lumps the levels of a hierarchical nominal variable, combining only levels that belong to the same cluster, so that the lumped variable preserves as much mutual information with the supplied outcome as possible.

Usage

lump_hierarchical_supervised(
  data,
  outcome,
  threshold,
  clusters,
  verbose = FALSE,
  level_namer = default_level_namer
)

Arguments

data

Factor or character vector of the categorical data.

outcome

Factor or character vector holding the outcome, which is used as the source of information about data. Only discrete outcomes are supported (see Details).

threshold

The minimum number of samples each lumped level should contain.

clusters

List of character vectors. Each vector holds the levels of one cluster; only levels within the same cluster are allowed to be lumped together.

verbose

Logical value indicating whether values should be printed. Default: FALSE.

level_namer

Function that takes a character vector of the original levels in a lump and returns the name of the new lumped level. Default: default_level_namer, which concatenates the original levels, separated by "+".
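
As an illustration of the level_namer contract, the following base R sketch defines two namer functions. plus_namer only mimics the documented default behaviour and is not the package's default_level_namer; compact_namer is a hypothetical alternative. Both names are assumptions made for this example.

```r
# Hypothetical re-implementation of the documented default behaviour:
# join the original levels with "+". This is not the package's own code.
plus_namer <- function(original_levels) {
  paste(original_levels, collapse = "+")
}

# Hypothetical alternative: keep the first level and count the others.
compact_namer <- function(original_levels) {
  n_other <- length(original_levels) - 1L
  if (n_other == 0L) {
    # A single level keeps its own name.
    return(original_levels[1])
  }
  sprintf("%s (+%d more)", original_levels[1], n_other)
}

plus_namer(c("Utrecht", "Friesland"))
#> [1] "Utrecht+Friesland"
compact_namer(c("Bayern", "Sachsen"))
#> [1] "Bayern (+1 more)"
```

A function of this shape could then be supplied as level_namer = compact_namer in the call to lump_hierarchical_supervised().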

Value

A factor vector with the lumped levels.

Details

Note that, unlike lump_nominal_supervised() and lump_ordinal_supervised(), this function only supports discrete (factor) outcomes, not continuous ones.
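
For a discrete outcome, the mutual information that the lumping tries to preserve can be estimated from a contingency table. The sketch below is a minimal plug-in estimate in base R (in nats); the helper name mutual_information is an assumption for this example, and the package's own estimator may differ. It compares the original levels with a lumping that merges each cluster completely, using the data from the Examples section.

```r
# Plug-in estimate of the mutual information between two discrete
# vectors, in nats. Hypothetical helper, not part of the package.
mutual_information <- function(x, y) {
  joint <- table(x, y) / length(x)          # joint probabilities p(x, y)
  expected <- outer(rowSums(joint), colSums(joint))  # p(x) * p(y)
  nonzero <- joint > 0                      # skip empty cells (0 * log 0 = 0)
  sum(joint[nonzero] * log(joint[nonzero] / expected[nonzero]))
}

data <- c("Utrecht", "Utrecht", "Friesland",
          "Friesland", "Friesland", "Bayern",
          "Bayern", "Bayern", "Sachsen")
outcome <- c(1, 0, 1, 1, 0, 0, 0, 1, 1)

# Lumped version built by hand: each cluster merged into one level.
lumped <- ifelse(data %in% c("Utrecht", "Friesland"),
                 "Utrecht+Friesland", "Bayern+Sachsen")

mi_original <- mutual_information(data, outcome)
mi_lumped <- mutual_information(lumped, outcome)

# Merging levels can never increase this estimate.
mi_lumped <= mi_original
#> [1] TRUE
```

Because merging levels can only discard information, any lumping trades some mutual information for larger levels; the function looks for the lumping that loses as little as possible.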

See also

maximum_mutual_information_hierarchical_supervised() for the underlying algorithm that this function wraps.

lump_nominal_supervised() for a more general version of this function that does not need the hierarchical structure in the data, but may be slower.

Author

Daan Koning

Examples

data <- c("Utrecht", "Utrecht", "Friesland",
          "Friesland", "Friesland", "Bayern",
          "Bayern", "Bayern", "Sachsen")
outcome <- c(1, 0, 1, 1, 0, 0, 0, 1, 1)
clusters <- list(
  c("Utrecht", "Friesland"),
  c("Bayern", "Sachsen")
)
lump_hierarchical_supervised(data, outcome, threshold = 3, clusters = clusters)
#> [1] Utrecht+Friesland Utrecht+Friesland Utrecht+Friesland Utrecht+Friesland
#> [5] Utrecht+Friesland Bayern+Sachsen    Bayern+Sachsen    Bayern+Sachsen   
#> [9] Bayern+Sachsen   
#> Levels: Bayern+Sachsen Utrecht+Friesland
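
To see how threshold relates to the result above, the level sizes can be counted before and after lumping. This is a small base R check on the same example data; the lumped vector is rebuilt by hand from the printed output rather than by calling the package.

```r
data <- c("Utrecht", "Utrecht", "Friesland",
          "Friesland", "Friesland", "Bayern",
          "Bayern", "Bayern", "Sachsen")
threshold <- 3

# Sample counts per original level.
counts <- table(data)

# Levels with fewer samples than the threshold.
names(counts)[counts < threshold]
#> [1] "Sachsen" "Utrecht"

# Lumped levels as printed in the example output above.
lumped <- ifelse(data %in% c("Utrecht", "Friesland"),
                 "Utrecht+Friesland", "Bayern+Sachsen")

# Every lumped level now contains at least `threshold` samples.
all(table(lumped) >= threshold)
#> [1] TRUE
```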