
Approximate maximum information preservable by nominal lumping
maximum_mutual_information_nominal_heuristic.Rd
Since the proper optimisation function, maximum_mutual_information_nominal(), has superpolynomial time complexity,
this function provides a heuristic to find a good lumping in polynomial time.
Usage
maximum_mutual_information_nominal_heuristic(
counts,
threshold,
adj_matrix = NULL,
verbose = FALSE,
heuristic = c("smart", "largest", "other")
)
Arguments
- counts
Named numeric vector containing the number of times each level is observed.
- threshold
Minimum number of samples each level must contain after lumping.
- adj_matrix
Adjacency matrix of the preference graph. Default: a complete graph, allowing all lumpings.
- verbose
Whether to print diagnostic messages or not. Default: FALSE.
- heuristic
Character string specifying the algorithm to use. See vignette("metrics") for their behaviour. Default: "smart".
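As a rough illustration of the inputs, the sketch below builds a counts vector and an adj_matrix in base R. It assumes, which the argument descriptions above do not state, that adj_matrix is a symmetric 0/1 matrix whose row and column names match names(counts), with a 1 wherever two levels may be lumped together. The level names, the allowed pairs, and the threshold are made up for the example.

```r
# Hypothetical level counts for a nominal variable.
counts <- c(red = 50, blue = 30, green = 15, teal = 5)

# Assumed encoding: symmetric 0/1 matrix with names matching names(counts).
lv <- names(counts)
adj_matrix <- matrix(0, nrow = length(lv), ncol = length(lv),
                     dimnames = list(lv, lv))

# Allow lumping only between the pairs listed here (illustrative choice).
allowed <- rbind(c("blue", "green"),
                 c("blue", "teal"),
                 c("green", "teal"))
adj_matrix[allowed] <- 1
adj_matrix[allowed[, 2:1]] <- 1  # mirror the entries to keep the matrix symmetric

# The call itself would then look like this (not run here):
# maximum_mutual_information_nominal_heuristic(
#   counts, threshold = 20, adj_matrix = adj_matrix
# )
```

Leaving adj_matrix at its default of NULL corresponds to the complete graph, so this construction is only needed when some lumpings should be forbidden.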
Value
A list containing information about the lumping found:
- mutual_information
Double representing the mutual information between the lumped and unlumped variable.
- loss
Double representing the amount of entropy lost in the lumping process.
- lumping
A list of character vectors, where each vector contains the names of the original levels that have been lumped together.
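To make the relationship between mutual_information and loss concrete: the lumped variable is a deterministic function of the unlumped one, so their mutual information equals the entropy of the lumped variable, and the loss is the entropy of the original variable minus that. The base R sketch below works through this for a hypothetical lumping. The helper entropy_bits() is not part of the package, and the use of bits (log base 2) is an assumption; the package may report values in a different unit.

```r
# Shannon entropy (in bits) of a count vector; zero counts contribute nothing.
entropy_bits <- function(n) {
  p <- n[n > 0] / sum(n)
  -sum(p * log2(p))
}

# Hypothetical counts and a hypothetical lumping in the documented list format.
counts <- c(red = 50, blue = 30, green = 15, teal = 5)
lumping <- list("red", "blue", c("green", "teal"))

# Total count of each lumped level.
lumped_counts <- vapply(lumping, function(g) sum(counts[g]), numeric(1))

# I(X; f(X)) = H(f(X)) because f(X) is fully determined by X.
mutual_information <- entropy_bits(lumped_counts)

# Entropy lost by lumping: H(X) - H(f(X)).
loss <- entropy_bits(counts) - mutual_information
```

Under these definitions mutual_information and loss always sum to the entropy of the unlumped variable, and loss is zero only when no levels with non-zero counts are merged.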
Details
The lumping returned is guaranteed to satisfy the constraints, but the mutual information conserved is not guaranteed to be maximal. Additionally, since the clique cover problem is itself NP-complete, the heuristic is not guaranteed to find a lumping at all, even when one exists.
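One way to read "satisfy the constraints" is sketched below as a standalone check in base R. It assumes that a valid lumping partitions the levels, that every group reaches the threshold, and that every group forms a clique in the preference graph (every pair of its members is adjacent). The function is_valid_lumping() is hypothetical and not part of the package, and the data are made up.

```r
# Hypothetical validity check for a lumping, under the assumptions stated above.
is_valid_lumping <- function(lumping, counts, threshold, adj_matrix) {
  members <- unlist(lumping)
  # Every level appears in exactly one group.
  covers_all <- setequal(members, names(counts)) && !anyDuplicated(members)
  # Every group contains at least `threshold` samples.
  big_enough <- all(vapply(lumping, function(g) sum(counts[g]) >= threshold,
                           logical(1)))
  # Every group is a clique: all pairs of its members are adjacent.
  is_clique <- all(vapply(lumping, function(g) {
    sub <- adj_matrix[g, g, drop = FALSE]
    all(sub[upper.tri(sub)] != 0)
  }, logical(1)))
  covers_all && big_enough && is_clique
}

counts <- c(red = 50, blue = 30, green = 15, teal = 5)
lv <- names(counts)
adj <- matrix(0, 4, 4, dimnames = list(lv, lv))
adj[c("blue", "green", "teal"), c("blue", "green", "teal")] <- 1
diag(adj) <- 0  # "red" may not be lumped with anything

ok  <- is_valid_lumping(list("red", "blue", c("green", "teal")), counts, 20, adj)
bad <- is_valid_lumping(list(c("red", "teal"), "blue", "green"), counts, 20, adj)
```

Here ok is TRUE, while bad is FALSE because "green" alone falls below the threshold and "red" and "teal" are not adjacent.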
See also
maximum_mutual_information_nominal() for the non-approximate version of this function.
lump_nominal_heuristic() for a more user-friendly wrapper around this function that actually carries out the lumping.