In the introductory vignette (vignette("optLump")) we
lumped a categorical variable by looking only at its own distribution.
This is the right thing to do when the variable is all we have. More
often, however, we are lumping a predictor because we eventually want to
model some outcome based on it. In that case we do not really care about
the predictor on its own; we care about what it tells us about the
outcome.
Supervised lumping takes this into account. Instead of preserving the mutual information between the variable before and after lumping, it preserves the mutual information between the lumped variable and an outcome that we supply. In practice this means that two small levels are lumped together when they say similar things about the outcome, rather than simply when doing so keeps the distribution balanced.
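To make the objective concrete, here is a minimal base R sketch of the empirical mutual information between two discrete vectors. The helper `mutual_info()` and the toy data are illustrative assumptions, not part of optLump, and the package's own estimator may differ. The example shows that lumping two levels with the same outcome rate costs no information, whereas lumping two levels with different rates does.

```r
# Empirical mutual information (in nats) between two discrete vectors.
# Hypothetical helper for illustration; not an optLump function.
mutual_info <- function(x, y) {
  joint <- table(x, y) / length(x)                # joint distribution p(x, y)
  indep <- outer(rowSums(joint), colSums(joint))  # product of marginals p(x) p(y)
  nz <- joint > 0                                 # skip empty cells (0 * log 0 = 0)
  sum(joint[nz] * log(joint[nz] / indep[nz]))
}

# Toy data (made up): levels "a" and "b" share a 20% outcome rate, "c" has 80%.
x <- factor(rep(c("a", "b", "c"), times = c(10, 20, 20)))
y <- factor(c(rep(c(1, 0), times = c(2, 8)),
              rep(c(1, 0), times = c(4, 16)),
              rep(c(1, 0), times = c(16, 4))))

# Lump a+b (similar outcome) versus b+c (different outcome).
x_ab <- x
levels(x_ab) <- list("a+b" = c("a", "b"), "c" = "c")
x_bc <- x
levels(x_bc) <- list("a" = "a", "b+c" = c("b", "c"))

mi_full <- mutual_info(x, y)
mi_ab   <- mutual_info(x_ab, y)  # equals mi_full up to rounding error
mi_bc   <- mutual_info(x_bc, y)  # strictly smaller than mi_full
```

A supervised lumping prefers merges like `a+b`, where the outcome information is retained, over merges like `b+c`.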
Each of the supervised functions corresponds to one of the
unsupervised functions from vignette("optLump"). They take
the same arguments, with one addition: the outcome that
supervises the lumping.
Lumping with an Outcome
Suppose we are given some data on hospital admissions. For each
patient we know a triage severity, the
admission route they came in through, and the
hospital that treated them. What we would like to model is
whether the patient was readmitted within a month, as well
as their length_of_stay in days.
summary(patients)
#>   severity         admission          hospital      readmitted   length_of_stay
#>  Minimal :22   GP            :51   St Thomas' :66   0:108        Min.   : 0.600
#>  Mild    :60   Specialist    :58   Guy's      :59   1: 92        1st Qu.: 2.800
#>  Moderate:14   Emergency     :56   Whittington:28                Median : 4.200
#>  Severe  :48   Self-referral :15   Royal Free :24                Mean   : 5.937
#>  Critical:56   Walk-in clinic:20   Homerton   :23                3rd Qu.: 8.325
#>                                                                  Max.   :30.100
As before, we want to lump the categories of the predictors together so that every level meets some minimum sample size. The difference is that we now want to do this without throwing away what the predictors tell us about the outcome.
Ordinal Variables
Just like in the unsupervised case, severity is ordinal,
so we are only allowed to lump levels that are next to each other in the
ordering. Suppose we want every level to contain at least 30 patients.
Lumping without the outcome looks only at the counts:
summary(lump_ordinal(severity, 30))
#>    Minimal+Mild Moderate+Severe        Critical
#>              82              62              56
To meet the threshold, the algorithm has lumped the small
Moderate level in with Severe. This is a
sensible choice if all we know are the counts, but it is an unfortunate
one here: Moderate patients are rarely readmitted, whereas
Severe patients usually are.
round(tapply(readmitted == "1", severity, mean), 2)
#>  Minimal     Mild Moderate   Severe Critical
#>     0.27     0.17     0.21     0.75     0.66
If we supervise the lumping with the outcome, the algorithm can take this into account. We simply pass the outcome in as the second argument:
summary(lump_ordinal_supervised(severity, readmitted, 30))
#> Minimal+Mild+Moderate                Severe              Critical
#>                    96                    48                    56
This time Moderate is lumped with the lower-severity
levels that it actually resembles, and Severe is left on
its own. The resulting levels make a lot more sense if we are about to
use them to predict readmission.
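The choice can be checked by hand with a brute-force search over every contiguous lumping. This is only a sketch: the readmission counts below are reconstructed from the level sizes and the rounded rates shown above (an assumption, although they do sum to the 92 readmissions in the summary), the helper `mi_counts()` is hypothetical, and exhaustive enumeration is not how `lump_ordinal_supervised()` is implemented.

```r
# Level sizes from summary(patients); readmission counts reconstructed
# from the rounded rates (an assumption, not data taken from the package).
n     <- c(Minimal = 22, Mild = 60, Moderate = 14, Severe = 48, Critical = 56)
readm <- c(Minimal = 6,  Mild = 10, Moderate = 3,  Severe = 36, Critical = 37)
counts <- cbind(no = n - readm, yes = readm)

# Mutual information (nats) of a levels-by-outcome table of counts.
# Hypothetical helper for illustration.
mi_counts <- function(tab) {
  p <- tab / sum(tab)
  indep <- outer(rowSums(p), colSums(p))
  nz <- p > 0
  sum(p[nz] * log(p[nz] / indep[nz]))
}

# Each of the 2^4 cut patterns defines one contiguous lumping of the 5 levels.
cuts <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), length(n) - 1)))
best <- NULL
for (i in seq_len(nrow(cuts))) {
  group  <- cumsum(c(TRUE, cuts[i, ]))     # group id for each level, in order
  lumped <- rowsum(counts, group)          # pooled counts per lumped level
  if (any(rowSums(lumped) < 30)) next      # enforce the minimum size of 30
  mi <- mi_counts(lumped)
  if (is.null(best) || mi > best$mi) best <- list(group = group, mi = mi)
}

# Names of the lumped levels in the best feasible lumping.
best_names <- as.vector(tapply(names(n), best$group, paste, collapse = "+"))
best_names
```

With these reconstructed counts, the search selects Minimal+Mild+Moderate, Severe and Critical, which agrees with the supervised output above.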
Continuous Outcomes
The outcome does not have to be discrete. If we give it a numeric
outcome such as length_of_stay, the same function instead
preserves the mutual information between the lumped variable and that
continuous outcome, estimated using a nearest-neighbour approach (Ross 2014). By default, optLump infers the type
of the outcome from its class—a factor is treated as discrete, a numeric
vector as continuous—so there is nothing extra to specify:
summary(lump_ordinal_supervised(severity, length_of_stay, 30))
#> Minimal+Mild+Moderate                Severe              Critical
#>                    96                    48                    56
Since longer stays go hand in hand with higher severity, we end up
with the same grouping as before. Of course, if the automatic choice is
ever wrong – say the outcome is stored as integers but is really
categorical – we can set it by hand through the
outcome_mode argument.
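The class-based rule is easy to trip over with integer-coded outcomes. The sketch below mirrors the rule as described in the text; `guess_outcome_mode()` and its return labels are hypothetical and are not optLump's code or the accepted values of outcome_mode. It also shows that converting the outcome to a factor beforehand sidesteps the problem using base R alone.

```r
# Hypothetical helper mirroring the rule described in the text:
# factor -> discrete, numeric -> continuous. Not optLump's actual code.
guess_outcome_mode <- function(y) {
  if (is.factor(y)) {
    "discrete"
  } else if (is.numeric(y)) {
    "continuous"
  } else {
    NA_character_
  }
}

# Made-up example vectors.
readmitted_int <- c(0L, 1L, 1L, 0L, 1L)        # categorical, stored as integers
length_of_stay <- c(2.8, 4.2, 8.3, 0.6, 30.1)  # genuinely continuous

guess_outcome_mode(readmitted_int)          # "continuous": the wrong guess
guess_outcome_mode(factor(readmitted_int))  # "discrete" once converted
guess_outcome_mode(length_of_stay)          # "continuous"
```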
Nominal Variables
For nominal variables we lose the ordering, so in principle any two
levels can be lumped together. admission is such a
variable. Asking this time for at least 25 patients per level, the
unsupervised method lumps together the two smallest levels:
summary(lump_nominal(admission, 25))
#>                           GP                   Specialist
#>                           51                           58
#>                    Emergency Self-referral+Walk-in clinic
#>                           56                           35
Self-referral and Walk-in clinic are indeed
the two rarest routes. The trouble is that they disagree about the
outcome: self-referred patients are readmitted about two-thirds of the
time, whereas walk-in patients are readmitted only about a third of the time.
round(tapply(readmitted == "1", admission, mean), 2)
#>             GP     Specialist      Emergency  Self-referral Walk-in clinic
#>           0.47           0.50           0.39           0.67           0.35
The supervised version keeps them apart, and instead lumps each rare level in with a larger one that behaves like it:
summary(lump_nominal_supervised(admission, readmitted, 25))
#>                       GP Specialist+Self-referral Emergency+Walk-in clinic
#>                       51                       73                       76
Just as in the unsupervised case, we can supply a preference graph
through the adj_matrix argument to forbid certain pairings,
and a continuous outcome is handled exactly as it was for ordinal
variables. See vignette("optLump") for how to build a
preference graph.
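To see how much the two nominal lumpings differ in what they retain about readmission, we can compare them directly. This is a rough sketch under stated assumptions: the readmission counts are reconstructed from the level sizes and the rounded rates shown earlier, and `mi_counts()`, `lump()` and `rate()` are hypothetical helpers rather than optLump functions.

```r
# Level sizes from summary(patients); readmission counts reconstructed
# from the rounded rates (an assumption, not data taken from the package).
n     <- c(GP = 51, Specialist = 58, Emergency = 56,
           "Self-referral" = 15, "Walk-in clinic" = 20)
readm <- c(GP = 24, Specialist = 29, Emergency = 22,
           "Self-referral" = 10, "Walk-in clinic" = 7)
counts <- cbind(no = n - readm, yes = readm)

# Mutual information (nats) of a levels-by-outcome table of counts.
mi_counts <- function(tab) {
  p <- tab / sum(tab)
  indep <- outer(rowSums(p), colSums(p))
  nz <- p > 0
  sum(p[nz] * log(p[nz] / indep[nz]))
}

# Lumped level assigned to each original level, in the order of names(n).
unsup <- c("GP", "Specialist", "Emergency",
           "Self-referral+Walk-in clinic", "Self-referral+Walk-in clinic")
sup   <- c("GP", "Specialist+Self-referral", "Emergency+Walk-in clinic",
           "Specialist+Self-referral", "Emergency+Walk-in clinic")

lump <- function(group) rowsum(counts, group, reorder = FALSE)  # pool counts
rate <- function(tab) round(tab[, "yes"] / rowSums(tab), 2)     # readmission rate

rate(lump(unsup))  # pooling the two rare routes averages their rates away
rate(lump(sup))    # each rare route joins a level with a closer rate

mi_unsup <- mi_counts(lump(unsup))
mi_sup   <- mi_counts(lump(sup))
mi_sup > mi_unsup
```

Under these reconstructed counts the supervised lumping retains more mutual information with readmitted, even though it ends up with one level fewer.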
Hierarchical Variables
When the levels come with a known grouping,
lump_hierarchical_supervised() can take advantage of it,
just like lump_hierarchical() does in the unsupervised
case. Suppose our five hospitals belong to two networks, and that we
only want to lump patients together within the same network:
networks <- list(
  c("St Thomas'", "Guy's"),
  c("Whittington", "Royal Free", "Homerton")
)
summary(lump_hierarchical_supervised(hospital, readmitted, 25, clusters = networks))
#>          St Thomas'               Guy's         Whittington Royal Free+Homerton
#>                  66                  59                  28                  47
Within each network, the rarest hospitals are lumped together in the way that holds on to the most information about readmission. Unlike the ordinal and nominal versions, the hierarchical function only supports discrete outcomes.
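Before passing a cluster list, it can help to confirm that it covers every level exactly once, and afterwards that no lumped level straddles two clusters. The following base R sketch does both. The helper `respects_hierarchy()` is hypothetical, and it assumes that lumped level names are joined with "+" as in the printed output and that no original level name contains a "+".

```r
networks <- list(
  c("St Thomas'", "Guy's"),
  c("Whittington", "Royal Free", "Homerton")
)
hospital_levels <- c("St Thomas'", "Guy's", "Whittington",
                     "Royal Free", "Homerton")

# 1. Every level should appear in exactly one cluster.
all_members <- unlist(networks)
stopifnot(!anyDuplicated(all_members),
          setequal(all_members, hospital_levels))

# 2. Map each level to the index of its network.
network_of <- rep(seq_along(networks), lengths(networks))
names(network_of) <- all_members

# 3. A lumping respects the hierarchy if the members of every lumped
#    level share a single network. Hypothetical helper for illustration.
respects_hierarchy <- function(lumped_levels) {
  all(vapply(
    strsplit(lumped_levels, "+", fixed = TRUE),
    function(members) length(unique(network_of[members])) == 1,
    logical(1)
  ))
}

respects_hierarchy(c("St Thomas'", "Guy's", "Whittington",
                     "Royal Free+Homerton"))   # TRUE
respects_hierarchy(c("St Thomas'", "Guy's+Homerton", "Whittington",
                     "Royal Free"))            # FALSE
```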
Computation and Large Datasets
The supervised functions behave computationally much like their
unsupervised counterparts. lump_ordinal_supervised() runs
in time cubic in the number of levels.
lump_nominal_supervised(), on the other hand, solves an
NP-hard problem, so it is only practical for a modest number of levels.
When the data has a hierarchy,
lump_hierarchical_supervised() is substantially faster than the nominal function,
since it only ever works on one cluster at a time. For more on these
trade-offs, and on the heuristics available when the nominal problem
becomes too slow, see vignette("optLump").
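The gap between the ordinal and nominal problems shows up in the number of candidate lumpings alone. The sketch below counts them for k levels, ignoring the minimum-size constraint: contiguous lumpings for the ordinal case and arbitrary set partitions (the Bell numbers) for the nominal case. The function names are hypothetical, and these figures describe the size of the search space, not the running time of any optLump function.

```r
# Contiguous lumpings of k ordered levels: each of the k - 1 gaps
# between neighbouring levels is either cut or not.
n_ordinal <- function(k) 2^(k - 1)

# Unrestricted lumpings of k unordered levels: the Bell number B(k),
# computed row by row with the Bell triangle.
n_nominal <- function(k) {
  row <- 1
  for (i in seq_len(k - 1)) {
    nxt <- numeric(i + 1)
    nxt[1] <- row[i]                        # previous row's last entry
    for (j in seq_len(i)) nxt[j + 1] <- nxt[j] + row[j]
    row <- nxt
  }
  row[length(row)]
}

k <- c(5, 10, 15, 20)
data.frame(levels  = k,
           ordinal = sapply(k, n_ordinal),
           nominal = sapply(k, n_nominal))
# For k = 5 this gives 16 contiguous lumpings and 52 set partitions.
```

Even the ordinal count grows exponentially, so a cubic-time method avoids enumerating the candidates; for the nominal problem the count grows far faster still, which is why a modest number of levels is the practical limit.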
