
Introduction to optLump
Categorical variables with many rare levels can often cause problems such as overfitting. A potential solution is to lump these rare categories together, so that every level meets a minimum sample size constraint.
Unlike naive approaches that simply lump the smallest levels into a
generic Other category, optLump uses
information theory to find the combinations that preserve the maximum
amount of mutual information from the original data distribution. For an
(abridged) explanation of mutual information, see
vignette("metrics").
Handling Different Types of Variables
Suppose we are given some data containing a response variable
outcome and several categorical predictors:
summary(data)
#>     outcome        education_level          diagnosis   country_of_origin
#>  Min.   :0.00   <High School:14   Diabetes     :22      China  :17
#>  1st Qu.:0.00   High School :41   Heart Disease:10      Japan  :15
#>  Median :1.00   Bachelor's  :28   Hypertension :19      Mexico :12
#>  Mean   :1.07   Master's    :14   None         :49      Vietnam:12
#>  3rd Qu.:2.00   PhD         : 3                         US     :11
#>  Max.   :4.00                                           Germany:10
#>                                                         (Other):23
This means that we would be trying to model outcome as a function of the predictors. The goal is to lump the categories of the predictors together in order to reduce the dimensionality of the problem. The appropriate treatment differs by variable type, and optLump provides a method for each.
Ordinal Variables
The simplest case is that of ordinal variables, which are variables whose categories have a natural ordering. In our example, the variable education_level is ordinal. We might decide that the smallest levels, <High School and PhD, are too small to be useful, and that we want to lump them together with other levels. We do this by specifying a threshold value, in this instance 15, which is the minimum number of observations that we want in each level after lumping.
lumped_levels <- lump_ordinal(data$education_level, 15)
summary(lumped_levels)
#> <High School+High School               Bachelor's             Master's+PhD 
#>                       55                       28                       17 
We see that the algorithm has lumped the level <High School with High School, and the level PhD with Master's. Of course, the interpretation of the levels has changed (the new levels should perhaps be interpreted as “High school or less” and “Postgraduate”), but all levels now meet the threshold of 15 observations.
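Why does the ordinal case stay cheap? The sketch below is illustrative only, assuming (as the adjacent merges above suggest) that lump_ordinal() restricts itself to contiguous groupings of the ordered levels; it is not the package's implementation:
# Illustrative only: the candidate groupings of an ordered factor are its
# contiguous partitions, one per subset of the m - 1 gaps between levels.
contiguous_partitions <- function(lvls) {
  m <- length(lvls)
  lapply(0:(2^(m - 1) - 1), function(mask) {
    cuts <- which(bitwAnd(mask, 2^(0:(m - 2))) > 0)  # chosen gap positions
    split(lvls, cumsum(seq_len(m) %in% (cuts + 1)))  # group id per level
  })
}
length(contiguous_partitions(levels(data$education_level)))  # 2^4 = 16
Even a brute-force enumeration touches only $2^{m-1}$ candidates, far fewer than in the nominal case discussed below; lump_ordinal() itself runs in polynomial time, as noted at the end of this vignette.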
Nominal Variables
For nominal variables, we lose the ordering, so we can no longer rely on it to tell us which levels may be lumped together. The simplest and most common approach is to allow every level to be lumped with every other level. This is the default behaviour:
lumped_levels <- lump_nominal(data$diagnosis, 15)
summary(lumped_levels)
#>                       None Heart Disease+Hypertension 
#>                         49                         29 
#>                   Diabetes 
#>                         22 
A less common situation is when we have some prior knowledge about which levels may be lumped together. Suppose we know, based on domain knowledge, that Heart Disease should not be lumped with Hypertension, since they are too different. This information can also be encoded in the preference graph, by disallowing the corresponding pairing. Notice that the preference graph is passed in as its adjacency matrix.
preference_graph <- adjacency_from_edge_list(
  levels(data$diagnosis),
  disallow = list(c("Heart Disease", "Hypertension"))
)
lumped_levels <- lump_nominal(data$diagnosis, 15, preference_graph)
summary(lumped_levels)
#>                   None           Hypertension Diabetes+Heart Disease 
#>                     49                     19                     32 
Of course, if the preference graph is sparser, corresponding to a situation in which only a small number of pairings are admissible, it makes more sense to build the graph the other way around: start with an empty graph and add edges for the pairings that are allowed.
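As a sketch of that allow-list construction, assuming only that the preference graph is a symmetric 0/1 adjacency matrix over the factor levels (the allowed pairs below are hypothetical, purely for illustration):
# Illustrative only: start from an empty graph and add just the admissible
# pairings. The pairs chosen here are hypothetical.
lvls <- levels(data$diagnosis)
preference_graph <- matrix(
  0, nrow = length(lvls), ncol = length(lvls),
  dimnames = list(lvls, lvls)
)
allowed <- list(c("Diabetes", "Heart Disease"), c("Diabetes", "Hypertension"))
for (pair in allowed) {
  preference_graph[pair[1], pair[2]] <- 1  # lumping is unordered,
  preference_graph[pair[2], pair[1]] <- 1  # so keep the matrix symmetric
}
lumped_levels <- lump_nominal(data$diagnosis, 15, preference_graph)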
Hierarchical Variables
An important subclass of nominal variables is that of hierarchical variables, where the levels inherit some structure from a hierarchy. A common example, and the one demonstrated here, is that of geographical variables. We know, a priori, that countries belong to continents; this natural clustering means it makes more sense to lump together countries within the same continent. It is simple to pass this knowledge in:
clusters <- list(
  c("US", "Canada", "Mexico"),
  c("France", "Germany"),
  c("China", "Japan", "South Korea", "Vietnam")
)
lumped_levels <- lump_hierarchical(data$country_of_origin, 15, clusters)
summary(lumped_levels)
#>    US+Canada+Mexico      France+Germany South Korea+Vietnam               Japan 
#>                  30                  17                  21                  15 
#>               China 
#>                  17 
The reason this function exists separately from lump_nominal(), which can achieve exactly the same behaviour when passed an appropriate preference graph, is that, in addition to the ergonomic gains, it allows certain speedups that are not possible in the general case.
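To make the equivalence concrete, here is a hedged sketch of the preference graph implied by the clusters: a block-diagonal adjacency matrix with an edge between two countries exactly when they share a cluster (again assuming the plain 0/1 matrix representation used above):
# Illustrative only: the clusters induce a block-diagonal preference graph.
lvls <- levels(data$country_of_origin)
preference_graph <- matrix(
  0, nrow = length(lvls), ncol = length(lvls),
  dimnames = list(lvls, lvls)
)
for (cl in clusters) {
  preference_graph[cl, cl] <- 1  # fully connect each cluster's members
}
diag(preference_graph) <- 0  # no self-loops
# Should match lump_hierarchical(data$country_of_origin, 15, clusters),
# only more slowly.
lumped_levels <- lump_nominal(data$country_of_origin, 15, preference_graph)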
Dplyr Integration
The package works very naturally with dplyr pipelines. For instance,
we can easily apply the lump_ordinal() function to the
education_level variable in our dataset:
data |>
mutate(education_level = lump_ordinal(education_level, 15)) |>
slice_head(n = 10)
#>    outcome          education_level    diagnosis country_of_origin
#> 1        1             Master's+PhD         None           Vietnam
#> 2        3 <High School+High School         None            France
#> 3        0 <High School+High School         None             Japan
#> 4        0 <High School+High School         None                US
#> 5        1 <High School+High School         None            Canada
#> 6        0               Bachelor's         None             Japan
#> 7        2             Master's+PhD         None            Canada
#> 8        0               Bachelor's Hypertension            Mexico
#> 9        1 <High School+High School     Diabetes                US
#> 10       1 <High School+High School         None           Germany
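One natural extension, sketched here by reusing the calls and the clusters object from the earlier examples, is to lump every predictor in a single mutate() step:
# Sketch: lump all predictors in one pipeline step, reusing the thresholds
# and clusters from the examples above.
data |>
  mutate(
    education_level   = lump_ordinal(education_level, 15),
    diagnosis         = lump_nominal(diagnosis, 15),
    country_of_origin = lump_hierarchical(country_of_origin, 15, clusters)
  )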
Computation and Large Datasets
Due to inherent computational constraints, lump_nominal() might be very slow. In fact, the algorithm has $O\left(2^{2^m}\right)$ runtime, where $m$ is the number of levels. To partially remedy this, lump_hierarchical() is substantially faster when it can be used, having runtime $O\left(\frac{m}{c} 2^{2^c}\right)$, where $c$ is the number of levels in the largest cluster. In our example, we would have $c = 4$, since we have 4 Asian countries. Since $c \leq m$ and the cost is doubly exponential in that quantity, this is a substantial improvement.
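For a rough feel for the gap, take the country example above, with $m = 9$ levels and a largest cluster of size $c = 4$ (big-O constants ignored):
m <- 9  # country levels in our example
c <- 4  # size of the largest cluster (the four Asian countries)

2^(2^m)            # lump_nominal():      ~1.3e154 -- hopeless
(m / c) * 2^(2^c)  # lump_hierarchical(): 147456   -- trivial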
When even this is too slow, we can use lump_nominal_heuristic(), which finds an approximate solution in polynomial time but does not guarantee optimality.
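Assuming, hypothetically, that it shares lump_nominal()'s signature (this is an assumption; consult the function's reference page), usage would look like:
# Assumed signature, mirroring lump_nominal(); check the reference page.
lumped_levels <- lump_nominal_heuristic(data$diagnosis, 15, preference_graph)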
Note that lump_ordinal() does not suffer from the same
issue and runs in polynomial time.