The latest Scikit-Learn releases have made significant advances in the area of ensemble methods. Scikit-Learn version 0.21 introduced HistGradientBoostingClassifier and HistGradientBoostingRegressor models, which implement histogram-based decision tree ensembles. They are based on a completely new TreePredictor decision tree representation. The claimed benefits over the traditional Tree decision tree representation include support for missing values and the ability to process bigger datasets faster.
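For instance, missing values can be passed through without a separate imputation step. A minimal sketch, assuming a toy dataset with a NaN value:
import numpy

from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor

X = numpy.array([[1.0], [2.0], [numpy.nan], [4.0]])
y = numpy.array([1.0, 2.0, 3.0, 4.0])

# Missing values are binned natively; no imputer transformer is needed
regressor = HistGradientBoostingRegressor()
regressor.fit(X, y)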
Scikit-Learn version 0.22 introduced StackingClassifier and StackingRegressor models, which aggregate multiple child estimators into an integral whole using a parent (aka final) estimator. Stacking is closely related to voting. The main difference is in how the weights for individual child estimators are obtained. Namely, stacking estimators are “active”, as they learn optimal weights autonomously during training, whereas voting estimators are “passive”, as they expect optimal weights to be supplied.
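The contrast is visible in the constructor signatures alone. A minimal sketch, assuming two interchangeable child regressors:
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

estimators = [("lr", LinearRegression()), ("dt", DecisionTreeRegressor())]

# "Passive": the relative weights of child estimators are supplied up front
voter = VotingRegressor(estimators, weights = [0.25, 0.75])

# "Active": the final estimator learns the weights during training
stacker = StackingRegressor(estimators, final_estimator = LinearRegression())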
Scikit-Learn implements two stacking modes. In the default non-passthrough mode, the parent estimator sees only the predictions of child estimators (predict_proba for classifiers and predict for regressors). In the passthrough mode, the parent estimator also sees the input dataset.
Stacking homogeneous estimators
The pipeline is simple and straightforward when dealing with homogeneous estimators.
from sklearn_pandas import DataFrameMapper
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

import numpy
import pandas

df = pandas.read_csv("auto.csv")

cat_columns = ["cylinders", "model_year", "origin"]
cont_columns = ["acceleration", "displacement", "horsepower", "weight"]

df_X = df[cat_columns + cont_columns]
df_y = df["mpg"]

# One shared pre-processing path satisfies both child estimators
mapper = DataFrameMapper(
    [([cat_column], [CategoricalDomain(), OneHotEncoder()]) for cat_column in cat_columns] +
    [([cont_column], [ContinuousDomain(), StandardScaler()]) for cont_column in cont_columns]
)

estimator = StackingRegressor([
    ("dt", DecisionTreeRegressor(max_depth = 7, random_state = 13)),
    ("mlp", MLPRegressor(hidden_layer_sizes = (7, ), solver = "lbfgs", max_iter = 2000, random_state = 13))
], final_estimator = LinearRegression(), passthrough = True)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", estimator)
])
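The assembled pipeline is fitted like any other Scikit-Learn estimator:
pipeline.fit(df_X, df_y)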
The qualifier “homogeneous” means that all child estimators have the same data pre-processing requirements. The opposite of “homogeneous” is “heterogeneous”, which means that different child estimators have different data pre-processing requirements.
Consider, for example, the preparation of continuous features. Linear models assume that the magnitude of continuous values is roughly the same. Decision tree models do not make such an assumption, because they can identify an optimal split threshold value for a continuous feature irrespective of its transformation status (original scale vs. transformed scale). Owing to this discrepancy, linear models and decision tree models (and ensembles thereof) are incompatible with each other by default.
It is often possible to simplify a “heterogeneous” collection of estimators into a “homogeneous” one by performing data pre-processing according to the strictest requirements. Linear models and decision tree models become compatible with each other after all continuous features have been scaled (ie. a requirement of linear models, which does not make any difference for decision tree models).
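The invariance of decision tree models to feature scaling can be verified directly. A minimal sketch, assuming a toy single-feature dataset:
import numpy

from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X = numpy.arange(10.0).reshape(-1, 1)
y = numpy.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0])

X_scaled = StandardScaler().fit_transform(X)

dt = DecisionTreeRegressor(random_state = 13)

# Split thresholds move to the transformed scale, but the induced partition
# (and hence the predictions) stays exactly the same
assert numpy.array_equal(dt.fit(X, y).predict(X), dt.fit(X_scaled, y).predict(X_scaled))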
Stacking heterogeneous estimators
The pipeline needs considerable redesign when dealing with heterogeneous estimators.
Stacking LightGBM and XGBoost estimators is challenging due to their different categorical data pre-processing requirements. LightGBM performs histogram-based binning of categorical values internally, and therefore expects categorical features to be kept as-is, or at most encoded into categorical integer features using the LabelEncoder transformer. XGBoost does not have such capabilities, and therefore expects categorical features to be binarized using either the LabelBinarizer or OneHotEncoder transformer.
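The two encoding styles can be contrasted on a toy categorical feature. A minimal sketch:
import numpy

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

x = numpy.array(["High", "Low", "Medium", "Low"])

# LightGBM-style: a single categorical integer feature
print(LabelEncoder().fit_transform(x))
# [0 1 2 1]

# XGBoost-style: one binary indicator feature per category level
print(OneHotEncoder(sparse = False).fit_transform(x.reshape(-1, 1)))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]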
The “homogenisation” of LightGBM and XGBoost estimators is possible by enforcing the binarization of categorical features. However, this reduces the predictive performance of LightGBM. For more information, please refer to the blog post about converting Scikit-Learn LightGBM pipelines to PMML.
The solution is to perform data pre-processing for each child estimator (and in the passthrough mode, also for the parent estimator) separately:
from sklearn_pandas import DataFrameMapper
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn2pmml.pipeline import PMMLPipeline

mapper = DataFrameMapper(..)

estimator = StackingClassifier([
    ("first", Pipeline(..)),
    ("second", Pipeline(..)),
    ("third", Pipeline(..))
], final_estimator = Pipeline(..))

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", estimator)
])
If all child pipelines perform common data pre-processing work, then it should be extracted into the first step of the pipeline. In this exercise, it is limited to capturing the domain of features using CategoricalDomain and ContinuousDomain decorators.
The initial column transformer changes the representation of the dataset from a pandas.DataFrame to a 2-D Numpy array, which lacks adequate column-level metadata (eg. names, data types) for setting up subsequent column transformers. A suitable array descriptor is created manually, by copying the value of the DataFrame.dtypes attribute and changing its index from column names to column positions:
from sklearn.compose import ColumnTransformer
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

import pandas

df = pandas.read_csv("audit-NA.csv", na_values = ["N/A", "NA"])

cat_columns = ["Education", "Employment", "Marital", "Occupation"]
cont_columns = ["Age", "Hours", "Income"]

df_X = df[cat_columns + cont_columns]
df_y = df["Adjusted"]

dtypes = df_X.dtypes

mapper = ColumnTransformer(
    [(cat_column, CategoricalDomain(), [cat_column]) for cat_column in cat_columns] +
    [(cont_column, ContinuousDomain(), [cont_column]) for cont_column in cont_columns]
)

# Re-key the array descriptor from column names to column positions,
# because the initial column transformer emits a positionally-indexed 2-D Numpy array
dtypes = pandas.Series(dtypes.values, index = range(len(dtypes)))
Column transformers for LightGBM and XGBoost child pipelines can be constructed using the sklearn2pmml.preprocessing.lightgbm.make_lightgbm_column_transformer and sklearn2pmml.preprocessing.xgboost.make_xgboost_column_transformer utility functions, respectively.
LightGBM estimators are able to detect categorical features based on their data type. However, when dealing with more complex datasets, it is advisable to overrule this functionality by supplying the indices of categorical features manually. This is typically done by specifying a categorical_feature fit parameter (prefixed with one or more levels of step identifiers) in the (PMML)Pipeline.fit(X, y, **fit_params) method call. Unfortunately, this route is currently blocked, because the fit methods of the StackingClassifier and StackingRegressor classes do not support the propagation of fit parameters. The workaround is to pass the categorical_feature parameter directly to the constructor.
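For reference, the blocked invocation would look roughly as follows. The step identifiers "ensemble", "lightgbm" and "classifier" are assumptions based on the pipelines constructed below:
# Hypothetical; fails, because StackingClassifier.fit does not propagate fit parameters to child pipelines
pipeline.fit(df_X, df_y, ensemble__lightgbm__classifier__categorical_feature = [0, 1, 2, 3])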
Constructing the LightGBM child pipeline:
from lightgbm import LGBMClassifier
from sklearn2pmml.preprocessing.lightgbm import make_lightgbm_column_transformer

lightgbm_mapper, lightgbm_categorical_feature = make_lightgbm_column_transformer(dtypes, missing_value_aware = True)

lightgbm_pipeline = Pipeline([
    ("mapper", lightgbm_mapper),
    ("classifier", LGBMClassifier(n_estimators = 31, max_depth = 3, random_state = 13, categorical_feature = lightgbm_categorical_feature))
])
Constructing the XGBoost child pipeline:
from sklearn2pmml.preprocessing.xgboost import make_xgboost_column_transformer
from xgboost import XGBClassifier

xgboost_mapper = make_xgboost_column_transformer(dtypes, missing_value_aware = True)

xgboost_pipeline = Pipeline([
    ("mapper", xgboost_mapper),
    ("classifier", XGBClassifier(n_estimators = 31, max_depth = 3, random_state = 13))
])
The Scikit-Learn child pipeline has exactly the same data pre-processing requirements as the XGBoost one (ie. continuous features should be kept as-is, whereas categorical features should be binarized). Currently, the corresponding column transformer needs to be set up manually. In future Scikit-Learn releases, when the fit methods of the HistGradientBoostingClassifier and HistGradientBoostingRegressor classes add support for sparse datasets, it should become possible to reuse the make_xgboost_column_transformer utility function here.
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn2pmml.preprocessing import PMMLLabelBinarizer

# Categorical columns come first in the mapped dataset, continuous columns last
sklearn_mapper = ColumnTransformer(
    [(str(cat_index), PMMLLabelBinarizer(sparse_output = False), [cat_index]) for cat_index in range(0, len(cat_columns))] +
    [(str(cont_index), "passthrough", [cont_index]) for cont_index in range(len(cat_columns), len(cat_columns + cont_columns))]
, remainder = "drop")

sklearn_pipeline = Pipeline([
    ("mapper", sklearn_mapper),
    ("classifier", HistGradientBoostingClassifier(max_iter = 31, max_depth = 3, random_state = 13))
])
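Before stacking, each child pipeline can be fitted and scored in isolation as a quick sanity check. A minimal sketch, assuming the mapper and the three child pipelines defined above:
X = mapper.fit_transform(df_X)

for name, child_pipeline in [("lightgbm", lightgbm_pipeline), ("xgboost", xgboost_pipeline), ("sklearn", sklearn_pipeline)]:
    child_pipeline.fit(X, df_y)
    # Training accuracy; use a held-out dataset for honest estimates
    print(name, child_pipeline.score(X, df_y))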
Stacking offers an interesting opportunity to rank LightGBM, XGBoost and Scikit-Learn estimators based on their predictive performance. The idea is to grow all child decision tree ensemble models under similar structural constraints, and to use a linear model as the parent estimator (LogisticRegression for classifiers and LinearRegression for regressors). The importance of a child estimator is then proportional to its estimated coefficient value. To further boost signal over noise, the stacking is performed in non-passthrough mode, and its cross-validation functionality is disabled by supplying a no-op cross-validation generator:
from sklearn.linear_model import LogisticRegression

import numpy

# A no-op cross-validation generator, which designates the complete dataset
# both as the training set and as the validation set.
# See https://stackoverflow.com/a/55326439
class DisabledCV:

    def __init__(self):
        self.n_splits = 1

    def split(self, X, y, groups = None):
        yield (numpy.arange(len(X)), numpy.arange(len(y)))

    def get_n_splits(self, X, y, groups = None):
        return self.n_splits

final_estimator = LogisticRegression(multi_class = "ovr", random_state = 13)

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("ensemble", StackingClassifier([
        ("lightgbm", lightgbm_pipeline),
        ("xgboost", xgboost_pipeline),
        ("sklearn", sklearn_pipeline)
    ], final_estimator = final_estimator, cv = DisabledCV(), passthrough = False))
])
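After fitting, the learned weights can be read off the parent estimator. A minimal sketch; the "ensemble" step name and the fitted final_estimator_ attribute follow from the pipeline above, and the coefficients are ordered like the child estimators:
pipeline.fit(df_X, df_y)

stacking_classifier = pipeline.named_steps["ensemble"]

# One coefficient per child estimator (binary classification task)
print(dict(zip(["lightgbm", "xgboost", "sklearn"], stacking_classifier.final_estimator_.coef_[0])))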
A fitted PMMLPipeline object can be converted to a PMML XML file using the sklearn2pmml.sklearn2pmml utility function. However, it is highly advisable to first enhance it with verification data (for automated quality-assurance purposes) by calling the PMMLPipeline.verify(X) method with a representative sample of the input dataset:
from sklearn2pmml import sklearn2pmml
pipeline = PMMLPipeline(..)
pipeline.fit(df_X, df_y)
pipeline.verify(df_X.sample(n = 10))
sklearn2pmml(pipeline, "StackingEnsemble.pmml")
Resources
- Datasets: audit-NA.csv and auto.csv
- Python scripts: train-classification.py and train-regression.py
Related blog posts
- 2023-02-06: Upgrading Scikit-Learn XGBoost pipelines