
Lumps the levels of an ordered categorical variable by merging only adjacent levels, so that the ordering is respected. Every resulting level contains at least threshold samples, while as much mutual information as possible is preserved. Unlike the nominal case, this problem can be solved in polynomial time.
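As a rough illustration of what "combining only adjacent levels" means, the sketch below enumerates every way of cutting the ordered levels into contiguous groups, discards groupings with a group below the threshold, and keeps the one with the highest entropy. It assumes that "mutual information" refers to the information shared between the original and the lumped variable, which for a deterministic lumping equals the entropy of the lumped variable. The function best_adjacent_lumping() is hypothetical and uses exponential brute force; it is not the package's polynomial-time algorithm.

```r
# Brute-force sketch, NOT the package's algorithm: try all 2^(k-1) ways of
# placing cuts between neighbouring levels and keep the best valid one.
# `counts` is a named vector of sample counts in ordinal order (k >= 2).
best_adjacent_lumping <- function(counts, threshold) {
  k <- length(counts)
  n <- sum(counts)
  best <- list(entropy = -Inf, groups = NULL)
  for (mask in 0:(2^(k - 1) - 1)) {
    # Bit i of `mask` decides whether a cut falls after level i.
    cuts <- (mask %/% 2^(0:(k - 2))) %% 2 == 1
    group_id <- cumsum(c(TRUE, cuts))
    sizes <- tapply(counts, group_id, sum)
    if (any(sizes < threshold)) next
    p <- sizes / n
    entropy <- -sum(p * log(p))
    if (entropy > best$entropy) {
      best <- list(entropy = entropy, groups = group_id)
    }
  }
  best
}

# Counts of the five risk levels used in the Examples section.
counts <- c("very low" = 2, "low" = 3, "medium" = 4, "high" = 2, "very high" = 1)
res <- best_adjacent_lumping(counts, threshold = 3)

# Name each lump by joining its original levels with "+".
lump_names <- as.vector(tapply(names(counts), res$groups, paste, collapse = "+"))
print(lump_names)
```

For these counts only four contiguous groupings satisfy a threshold of 3, and the three-group one has the highest entropy, which agrees with the lumping shown in the Examples section.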

Usage

lump_ordinal(
  data,
  threshold,
  levels = NULL,
  verbose = FALSE,
  alternative_metric = c("mutual information", "bin count", "surplus"),
  level_namer = default_level_namer
)

Arguments

data

Factor or character vector of the categorical data.

threshold

The minimum number of samples each lumped level should contain.

levels

Character vector specifying the strict ordinal hierarchy of the levels (from lowest to highest). Required if data is not already an ordered factor.

verbose

Logical value indicating whether values should be printed. Default: FALSE.

alternative_metric

The metric to optimise, if different from the default (mutual information). One of "mutual information", "bin count" or "surplus". For an explanation of the metrics, see vignette("metrics").

level_namer

Function that takes a character vector of the original levels in a lump and returns the name of the new lumped level. Default: concatenates the original levels, separated by "+".
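A custom level_namer only has to map a character vector of original levels to a single name. The sketch below shows one possibility in base R; range_namer() and plus_namer() are hypothetical helpers written for illustration, and plus_namer() is only a guess at what the documented default behaviour looks like, not the package's default_level_namer itself.

```r
# Hypothetical namer: label a lump by its lowest and highest original level.
range_namer <- function(levels) {
  if (length(levels) == 1) {
    return(levels)  # a level that was not merged keeps its own name
  }
  paste(levels[1], "to", levels[length(levels)])
}

# Assumed equivalent of the documented default: join levels with "+".
plus_namer <- function(levels) paste(levels, collapse = "+")

print(range_namer(c("high", "very high")))
print(range_namer("medium"))
print(plus_namer(c("high", "very high")))
```

Assuming the interface described above, such a function would be supplied as level_namer = range_namer in the call to lump_ordinal().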

Value

An ordered factor vector with the lumped levels.

See also

maximum_mutual_information_ordinal() for the underlying algorithm that this function wraps.

lump_ordinal_supervised() for a supervised version of this function.

Author

Daan Koning

Examples

risk_group <- c("low", "medium", "very low", "high", "medium", "low",
                 "high", "medium", "low", "very high", "very low", "medium")

# Provide the order of the levels:
strict_order <- c("very low", "low", "medium", "high", "very high")
lump_ordinal(risk_group, 3, levels = strict_order)
#>  [1] very low+low   medium         very low+low   high+very high medium        
#>  [6] very low+low   high+very high medium         very low+low   high+very high
#> [11] very low+low   medium        
#> Levels: very low+low < medium < high+very high

# Alternatively, pass a pre-ordered factor:
risk_ordered <- ordered(risk_group, levels = strict_order)
lump_ordinal(risk_ordered, 3)
#>  [1] very low+low   medium         very low+low   high+very high medium        
#>  [6] very low+low   high+very high medium         very low+low   high+very high
#> [11] very low+low   medium        
#> Levels: very low+low < medium < high+very high
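The lumps in the output above can be checked by hand with base R alone. The sketch below tabulates the raw data in ordinal order, lists the levels that fall below a threshold of 3 on their own, and sums the counts of the three reported lumps; the lumps list is copied from the printed result rather than computed by the package.

```r
risk_group <- c("low", "medium", "very low", "high", "medium", "low",
                "high", "medium", "low", "very high", "very low", "medium")
strict_order <- c("very low", "low", "medium", "high", "very high")

# Samples per original level, in ordinal order.
counts <- table(factor(risk_group, levels = strict_order))

# Levels that are too small on their own for a threshold of 3.
too_small <- names(counts)[counts < 3]

# Sizes of the three lumps reported in the output above.
lumps <- list(c("very low", "low"), "medium", c("high", "very high"))
lump_sizes <- vapply(lumps, function(l) sum(counts[l]), numeric(1))

print(counts)
print(too_small)
print(lump_sizes)
```

Each lump holds at least three samples, and every level that was too small on its own has been merged with a neighbour.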