Imbalanced-Learn is a Scikit-Learn extension package for re-sampling datasets.
Re-sampling derives a new dataset with specific properties from the original dataset. It is commonly used in classification workflows to optimize the distribution of class labels.
Consider, for example, a binary classification problem where the ratio of "event" vs. "no-event" labels is heavily skewed and fluctuates across datasets. Re-sampling can be used to enrich the dataset (by either over-sampling the "event" label, or under-sampling the "no-event" label) at a stable, desired level, which is crucial for repeatable and reproducible data science experiments.
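The arithmetic behind "enriching to a desired level" can be sketched in plain Python. The helper names and the 100/900 dataset below are hypothetical, chosen only to illustrate the two options (over-sampling "event" rows or under-sampling "no-event" rows).

```python
def events_needed(n_no_event: int, event_ratio: float) -> int:
    """Over-sampling: "event" rows needed next to n_no_event "no-event" rows."""
    # event_ratio = e / (e + n)  =>  e = event_ratio * n / (1 - event_ratio)
    return round(event_ratio * n_no_event / (1 - event_ratio))


def no_events_to_keep(n_event: int, event_ratio: float) -> int:
    """Under-sampling: "no-event" rows to keep next to n_event "event" rows."""
    # event_ratio = e / (e + n)  =>  n = e * (1 - event_ratio) / event_ratio
    return round(n_event * (1 - event_ratio) / event_ratio)


# Hypothetical dataset: 100 "event" rows and 900 "no-event" rows (10% events),
# to be enriched to a 25% event ratio
print(events_needed(900, 0.25))      # over-sample events up to this count
print(no_events_to_keep(100, 0.25))  # or under-sample no-events down to this count
```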
The imblearn package provides samplers and sampling-aware classifiers.
Imbalanced-Learn samplers are similar to Scikit-Learn selectors, except they operate on data matrix rows rather than columns. A sampler may lower the height of a data matrix by removing undesired rows, or increase it by inserting desired rows (either by duplicating existing rows or generating new rows from scratch).
There are popular ensemble classification algorithms that perform extra re-sampling as part of their "business logic". For example, the random forest algorithm draws a unique subsample for training each member decision tree as a means to improve the predictive accuracy and control over-fitting.
Imbalanced-Learn classifiers such as imblearn.ensemble.BalancedRandomForestClassifier extend Scikit-Learn classifiers with basic re-sampling functionality.
This blog post demonstrates how to incorporate Imbalanced-Learn samplers into PMML pipelines.
The "audit" dataset contains 1899 data records; 447 of them are labeled as "event" and 1452 as "no-event".
In this exercise, the dataset shall be enriched from the initial ~1/4 event ratio to 1/3 event ratio by randomly sampling 1000 "event" data records and 2000 "no-event" data records using the
The sampler step is typically placed between feature engineering and classifier steps:
Note that a sampler step creates new internal data matrices during fitting, which reside in computer memory side-by-side with the incoming data matrices. This is not a problem with the "audit" dataset, but may become an issue when working with Big Data-scale datasets.
However, any attempt to insert a sampler step directly into a Scikit-Learn pipeline fails with the following type error:
Imbalanced-Learn samplers are completely separate from Scikit-Learn transformers.
They inherit from the imblearn.base.SamplerMixin base class, and their API is centered around the fit_resample(X, y) method that operates on both feature and label data.
The imblearn package provides the imblearn.pipeline.Pipeline class, which extends the sklearn.pipeline.Pipeline class with support for sampler steps.
Switching pipeline implementations:
In principle, it is only the sampler step and the subsequent steps that must be "escaped" by wrapping them into the Imbalanced-Learn pipeline. All steps preceding the sampler step may be left out of it.
Combining pipeline implementations:
The sklearn2pmml package provides an sklearn2pmml.sklearn2pmml(pipeline: Pipeline, pmml_output_path: str) utility function for converting Scikit-Learn pipelines to the Predictive Model Markup Language (PMML) representation.
This utility function refuses to accept Imbalanced-Learn pipeline objects as the first argument.
The associated type error suggests using the sklearn2pmml.make_pmml_pipeline(obj) utility function for transforming custom objects to a PMML pipeline object.
However, it is better to ignore this advice, and construct and fit a sklearn2pmml.pipeline.PMMLPipeline object explicitly:
Re-sampling is solely a training-time phenomenon. Imbalanced-Learn samplers act as identity transformers during prediction. This means that they pass through testing and validation datasets unchanged.
Consequently, samplers are functionally void in the PMML representation.
The only trace left of them is the difference in data record counts reported by different pipeline steps.
For example, the initial domain decorator steps (e.g. CategoricalDomain classes) report a record count of 1899, whereas the final estimator step (i.e. the DecisionTreeClassifier class) reports 3000.