Recent Scikit-Learn releases have made significant advances in the area of ensemble methods.
Scikit-Learn version 0.21 introduced HistGradientBoostingClassifier and HistGradientBoostingRegressor classes, which implement histogram-based decision tree ensembles.
They are based on a completely new TreePredictor decision tree representation.
The claimed benefits over the traditional Tree decision tree representation include support for missing values and the ability to process bigger datasets faster.
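The missing value support can be illustrated with a minimal sketch. The synthetic dataset, the 10% missingness rate and the hyperparameter values are assumptions made for illustration; the sketch also assumes a Scikit-Learn release where the estimator accepts NaN inputs directly.

```python
import numpy as np

try:
    from sklearn.ensemble import HistGradientBoostingRegressor
except ImportError:
    # Older releases keep the estimator behind an experimental flag
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(13)

# Synthetic regression problem (illustrative only)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=500)

# Blank out roughly 10% of the cells; no imputation step is added
X_missing = X.copy()
X_missing[rng.uniform(size=X.shape) < 0.1] = np.nan

regressor = HistGradientBoostingRegressor(max_iter=50, random_state=13)
regressor.fit(X_missing, y)

predictions = regressor.predict(X_missing[:5])
```

The estimator is fitted on the array that still contains NaN cells, which is the behaviour the paragraph above refers to.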
Scikit-Learn version 0.22 introduced StackingClassifier and StackingRegressor classes, which aggregate multiple child estimators into an integral whole using a parent (aka final) estimator.
Stacking is closely related to voting.
The main difference lies in how the weights for individual child estimators are obtained.
Namely, stacking estimators are "active" as they learn optimal weights autonomously during training, whereas voting estimators are "passive" as they expect optimal weights to be supplied.
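The contrast between "passive" and "active" weighting can be sketched as follows. The child estimators, the synthetic dataset and the voting weights 0.7 and 0.3 are arbitrary choices made for illustration, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=13)

children = [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=13)),
]

# Voting: the weights are supplied by the caller and are not changed by fitting
voter = VotingRegressor(estimators=children, weights=[0.7, 0.3])
voter.fit(X, y)

# Stacking: the parent estimator learns one coefficient per child during fitting
stacker = StackingRegressor(estimators=children, final_estimator=LinearRegression())
stacker.fit(X, y)

learned_weights = stacker.final_estimator_.coef_
print(dict(zip(["linear", "tree"], learned_weights)))
```

With a linear parent estimator, the learned coefficients play the role that the `weights` argument plays for the voting estimator.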
Scikit-Learn implements two stacking modes.
In the default non-passthrough mode, the parent estimator is limited to seeing only the predictions of child estimators (predict_proba for classifiers and predict for regressors).
In the passthrough mode, the parent estimator also sees the input dataset.
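The two modes can be told apart by counting the inputs of the fitted parent estimator. The sketch below assumes a synthetic dataset with four features and two child regressors; the helper function `fit_stack` is hypothetical and exists only to avoid repetition.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

n_features = 4
X, y = make_regression(n_samples=300, n_features=n_features, noise=5.0, random_state=13)

children = [
    ("ridge", Ridge()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=13)),
]

def fit_stack(passthrough):
    # Hypothetical helper: same children and parent, only the mode differs
    stack = StackingRegressor(
        estimators=children,
        final_estimator=LinearRegression(),
        passthrough=passthrough,
    )
    return stack.fit(X, y)

default_stack = fit_stack(False)
passthrough_stack = fit_stack(True)

# Non-passthrough: one parent input per child prediction
default_width = default_stack.final_estimator_.coef_.shape[0]
# Passthrough: child predictions plus the original input columns
passthrough_width = passthrough_stack.final_estimator_.coef_.shape[0]
```

In this setup the parent estimator sees two inputs in the default mode and six inputs (two predictions plus four features) in the passthrough mode.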
Stacking homogeneous estimators
The pipeline is straightforward when dealing with homogeneous estimators.
The qualifier "homogeneous" means that all child estimators have the same data pre-processing requirements. The opposite of "homogeneous" is "heterogeneous", which means that different child estimators have different data pre-processing requirements.
Consider, for example, the preparation of continuous features. Linear models assume that the magnitude of continuous values is roughly the same. Decision tree models do not make such an assumption, because they can identify an optimal split threshold value for a continuous feature irrespective of its transformation status (original scale vs. transformed scale). Owing to this discrepancy, linear models and decision tree models (and ensembles thereof) are incompatible with each other by default.
It is often possible to simplify a "heterogeneous" collection of estimators to a "homogeneous" one by performing data pre-processing following the strictest requirements. Linear models and decision tree models become compatible with each other after all continuous features have been scaled (ie. a requirement of linear models, which does not make any difference for decision tree models).
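A homogeneous setup of this kind might look as follows: a single shared scaling step, followed by a stacking estimator whose children are a linear model and a decision tree model. The choice of `StandardScaler`, `Ridge` and `DecisionTreeRegressor`, as well as the synthetic dataset, are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=13)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),  # needs scaled inputs
        ("tree", DecisionTreeRegressor(max_depth=4, random_state=13)),  # indifferent to scaling
    ],
    final_estimator=LinearRegression(),
)

# One pre-processing step serves all children, following the strictest requirement
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("stack", stack),
])
pipeline.fit(X, y)

predictions = pipeline.predict(X[:3])
```

Because the scaling step sits in front of the stacking estimator, none of the children needs a pre-processing step of its own.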
Stacking heterogeneous estimators
The pipeline needs considerable redesign when dealing with heterogeneous estimators.
Stacking LightGBM and XGBoost estimators is challenging due to their different categorical data pre-processing requirements.
LightGBM performs the histogram-based binning of categorical values internally, and therefore expects categorical features to be kept as-is, or at most be encoded into categorical integer features using the
XGBoost does not have such capabilities, and therefore expects categorical features to be binarized using either
The "homogenisation" of LightGBM and XGBoost estimators is possible by enforcing the binarization of categorical features. However, this reduces the predictive performance of LightGBM. For more information, please refer to the blog post about converting Scikit-Learn based LightGBM pipelines to PMML documents.
The solution is to perform feature engineering for each child estimator (and in the passthrough mode, also for the parent estimator) separately:
If all child pipelines perform common feature engineering work, then it should be extracted into the first step of the pipeline.
In this exercise, it is limited to capturing domain of features using
The initial column transformer changes the representation of the dataset from pandas.DataFrame to a 2-D NumPy array, which lacks adequate column-level metadata (eg. names, data types) for setting up subsequent column transformers.
A suitable array descriptor is created manually, by copying the value of the DataFrame.dtypes attribute, and changing its index from column names to column positions:
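A minimal sketch of such a descriptor is given below. The three-column data frame is a made-up stand-in for the real dataset, and the variable names are illustrative.

```python
import pandas as pd

# Made-up stand-in for the input dataset
df = pd.DataFrame({
    "cylinders": pd.Series([4, 6, 8], dtype="category"),
    "displacement": [97.0, 199.0, 304.0],
    "origin": pd.Series([1, 2, 3], dtype="category"),
})

# Data types indexed by column name
dtypes = df.dtypes

# Same data types, re-indexed by column position (0, 1, 2, ...)
positional_dtypes = pd.Series(dtypes.values, index=range(len(dtypes)))

# Positions of categorical columns, usable with a 2-D array
cat_indices = [i for i, dtype in positional_dtypes.items() if str(dtype) == "category"]
cont_indices = [i for i in positional_dtypes.index if i not in cat_indices]
```

The resulting position lists can then be used as column selectors against the array representation of the dataset.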
Column transformers for LightGBM and XGBoost child pipelines can be constructed using sklearn2pmml.preprocessing.lightgbm.make_lightgbm_column_transformer and sklearn2pmml.preprocessing.xgboost.make_xgboost_column_transformer utility functions, respectively.
LightGBM estimators are able to detect categorical features based on their data type.
However, when dealing with more complex datasets, it is advisable to override this functionality by supplying the indices of categorical features manually.
This is typically done by passing a categorical_feature parameter (prefixed with one or more levels of step identifiers) to the (PMML)Pipeline.fit(X, y, **fit_params) method.
Unfortunately, this route is currently blocked, because the fit methods of StackingClassifier and StackingRegressor classes do not support the propagation of fit parameters.
The workaround is to pass the categorical_feature parameter directly to the constructor.
Constructing the LightGBM child pipeline:
Constructing the XGBoost child pipeline:
The Scikit-Learn child pipeline has exactly the same data pre-processing requirements as the XGBoost one (ie. continuous features should be kept as-is, whereas categorical features should be binarized).
Currently, the corresponding column transformer needs to be set up manually.
In future Scikit-Learn releases, once the fit methods of HistGradientBoostingClassifier and HistGradientBoostingRegressor classes add support for sparse datasets, it should be possible to reuse the make_xgboost_column_transformer utility function here.
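A manually assembled column transformer along these lines is sketched below. It is a simplification: it selects columns of a made-up data frame by name rather than by position, and it relies on `sparse_threshold=0.0` to force dense output, since the paragraph above notes that sparse datasets are not supported. The column names and hyperparameter values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

try:
    from sklearn.ensemble import HistGradientBoostingRegressor
except ImportError:
    # Older releases keep the estimator behind an experimental flag
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
    from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.RandomState(13)
n = 200

# Made-up dataset: two continuous features and one categorical feature
df = pd.DataFrame({
    "displacement": rng.uniform(70, 450, size=n),
    "weight": rng.uniform(1600, 5100, size=n),
    "origin": rng.choice(["1", "2", "3"], size=n),
})
y = 40.0 - 0.005 * df["weight"] + rng.normal(size=n)

transformer = ColumnTransformer(
    [
        # Continuous features are kept as-is
        ("continuous", "passthrough", ["displacement", "weight"]),
        # Categorical features are binarized
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["origin"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

pipeline = Pipeline([
    ("transformer", transformer),
    ("regressor", HistGradientBoostingRegressor(max_iter=31, max_depth=5, random_state=13)),
])
pipeline.fit(df, y)

transformed = pipeline.named_steps["transformer"].transform(df)
```

The transformed array has two pass-through columns plus one binary column per category level.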
Stacking provides an interesting opportunity to rank LightGBM, XGBoost and Scikit-Learn estimators based on their predictive performance.
The idea is to grow all child decision tree ensemble models under similar structural constraints, and use a linear model as the parent estimator (LogisticRegression for classifiers and LinearRegression for regressors).
The importance of child estimators is then proportional to their estimated coefficient values.
To further boost signal over noise, the stacking is performed in non-passthrough mode and its cross-validation functionality is disabled by supplying a no-op cross-validation generator:
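The ranking idea can be sketched with Scikit-Learn classes only. The `NoopCV` class below is a hypothetical no-op generator that yields a single split in which all rows serve as both training and test data. The three child estimators are stand-ins chosen so that the sketch has no third-party dependencies; they are not the LightGBM and XGBoost estimators discussed in this post.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import LinearRegression


class NoopCV:
    """Hypothetical no-op generator: one split, all rows on both sides."""

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        yield indices, indices


X, y = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=13)

# Similar structural constraints for all children (values are illustrative)
constraints = dict(n_estimators=31, max_depth=5, random_state=13)
children = [
    ("gbm", GradientBoostingRegressor(**constraints)),
    ("forest", RandomForestRegressor(**constraints)),
    ("extra", ExtraTreesRegressor(**constraints)),
]

stack = StackingRegressor(
    estimators=children,
    final_estimator=LinearRegression(),
    cv=NoopCV(),
    passthrough=False,
)
stack.fit(X, y)

# Rank children by the coefficient that the linear parent assigned to them
coefficients = stack.final_estimator_.coef_
ranking = sorted(zip([name for name, _ in children], coefficients),
                 key=lambda pair: pair[1], reverse=True)
print(ranking)
```

Since the parent estimator is fitted on in-sample child predictions, the coefficients describe how well each child fits the training data rather than how well it generalizes.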
The PMMLPipeline object can be converted to a PMML XML file using the sklearn2pmml.sklearn2pmml utility function.
However, it is highly advisable to first enhance it with verification data (for automated quality-assurance purposes) by calling the PMMLPipeline.verify(X) method with a representative sample of the input dataset:
- "Auto" dataset:
- "Audit-NA" dataset:
- Python scripts: