Extending Scikit-Learn with GBDT+LR ensemble models

Logistic regression (LR) is often the go-to choice for binary classification. Owing to their extreme simplicity, LR models are fast to train and easy to deploy, and readily lend themselves to human interpretation.

The predictive performance of LR models depends on the quality and sophistication of data pre-processing, specifically feature engineering. There are two major work areas. The first is delineating and generating the intended feature space. LR algorithms operate on the feature space they are given. They are not designed to independently discover non-linearities along individual dimensions, or interactions between multiple dimensions. The second is filtering down the feature space. Specialized LR algorithms can prioritize and eliminate dimensions using regularization. However, the most common ones estimate coefficients for all dimensions of the feature space.

Facebook Research has demonstrated how feature engineering can be automated using a gradient boosted decision tree (GBDT) model: Practical Lessons from Predicting Clicks on Ads at Facebook

The idea is to train a GBDT model on a raw feature space, and then collect and examine the “decision paths” of its member decision tree models. A decision path which operates on a single feature can be regarded as a non-linear transformation of it (e.g. binning a continuous feature to a pseudo-categorical feature). A decision path which operates on multiple features can be regarded as an interaction between them.
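The idea can be illustrated with a minimal sketch. The two hand-written functions below are hypothetical stand-ins for member decision trees of a fitted GBDT model, and the "age" and "income" feature names are made up for this illustration. Each function returns the index of the leaf node that a sample ends up in; one-hot encoding those indices yields the derived feature space.

```python
# Two toy trees standing in for members of a fitted GBDT model (hypothetical).
# Each function returns the index of the leaf that a sample ends up in.

def tree_age(sample):
    # Single-feature tree: effectively bins the continuous "age" feature
    if sample["age"] < 30:
        return 0
    elif sample["age"] < 50:
        return 1
    return 2

def tree_age_income(sample):
    # Two-feature tree: each leaf encodes an age-by-income interaction
    if sample["age"] < 40:
        return 0 if sample["income"] < 50000 else 1
    return 2 if sample["income"] < 50000 else 3

# Pairs of (tree, number of leaf nodes)
trees = [(tree_age, 3), (tree_age_income, 4)]

def derive_features(sample):
    """One-hot encode the leaf reached in each tree, and concatenate the results."""
    features = []
    for tree, n_leaves in trees:
        leaf = tree(sample)
        features.extend(1 if i == leaf else 0 for i in range(n_leaves))
    return features

print(derive_features({"age": 35, "income": 72000}))
# [0, 1, 0, 0, 1, 0, 0]
```

The first three positions describe the binned "age" feature, and the last four positions describe the "age" and "income" interaction. A linear model fitted on such vectors gets one coefficient per leaf node.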

GBDT algorithms typically grow shallow decision trees. Shallow trees contain short decision paths, which generally lead to easily interpretable derived features.

A boosting algorithm can be swapped for a bagging algorithm. Random forest (RF) algorithms typically grow much deeper decision trees. Longer decision paths lead to more complex derived features (e.g. interactions between multiple non-linearly transformed features), which lose in interpretability but gain in information content. For example, it becomes possible to discover cliffs and other anomalies in the decision space by observing which derived features become associated with extreme node scores.

Scikit-Learn perspective

Scikit-Learn documentation dedicates a separate page to GBDT plus LR (GBDT+LR) ensemble models: Feature transformations with ensembles of trees

While the concept and its implementation are discussed in great detail, there is no reusable GBDT+LR estimator available within Scikit-Learn. Interested parties are either expected to copy-paste the example code, or rely on third-party libraries.

The sklearn2pmml package version 0.47.0 introduced sklearn2pmml.ensemble.GBDTLRClassifier and sklearn2pmml.ensemble.GBDTLMRegressor ensemble models to address this deficiency.

PMML perspective

The Predictive Model Markup Language (PMML) provides standardized data structures for representing all common data pre- and post-processing operations and model types, including the GBDT model type and the LR model type.

If all the parts of a GBDT+LR model are PMML compatible, does it follow that the GBDT+LR model itself is PMML compatible too? The answer is a definite yes. Better yet, the PMML representation of a GBDT+LR model is reducible to an ordinary GBDT model, which leads to significant conversion- and run-time savings.

The reduction is based on the realization that GBDT+LR is a mechanism for replacing the original GBDT leaf node scores with LR coefficients (and the GBDT base score with the LR intercept). Scikit-Learn does not provide an API for modifying fitted decision trees. The workaround is to make individual leaf nodes addressable using one-hot encoding, and then to assign a new score to each address. On the addressing side, the OneHotEncoder.categories_ attribute is a list of arrays; the size of the list equals the number of decision trees in the GBDT, and the size of each array equals the number of leaf nodes in the corresponding decision tree. On the scoring side, the LogisticRegression.coef_ attribute is an array whose size equals the flat-mapped size of the OneHotEncoder.categories_ attribute.
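The relationship between these attributes can be checked with plain Scikit-Learn classes. The following is a minimal sketch on synthetic data, not the actual GBDTLRClassifier implementation; the dataset, the hyperparameter values and the variable names are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic data, for illustration only
X, y = make_classification(n_samples=500, n_features=8, random_state=13)

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=13)
gbdt.fit(X, y)

# For binary classification, apply() returns an array of shape
# (n_samples, n_estimators, 1), which holds the index of the leaf node
# that each sample reaches in each decision tree
leaf_ids = gbdt.apply(X)[:, :, 0]

# Make every leaf node addressable as a separate binary column
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_indicators = encoder.fit_transform(leaf_ids)

# Assign a new score (an LR coefficient) to every leaf node address
lr = LogisticRegression(max_iter=1000)
lr.fit(leaf_indicators, y)

n_leaves_per_tree = [len(categories) for categories in encoder.categories_]
print("Trees:", len(encoder.categories_))
print("Leaf nodes in total:", sum(n_leaves_per_tree))
print("LR coefficients:", lr.coef_.shape[1])
```

The list held by the fitted `categories_` attribute has one array per decision tree, and the number of LR coefficients matches the total number of leaf nodes.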

The PMML representation does not need such a layer of indirection, because it is possible to replace leaf node scores in place.
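Why the in-place replacement gives the same predictions can be shown with a small, library-free sketch. The toy trees, coefficients and intercept below are made-up values for this illustration; they do not come from any fitted model.

```python
import math

# Toy "fitted" trees: each maps a sample to a leaf index (hypothetical stand-ins)
def tree_a(x):
    return 0 if x[0] < 0.5 else 1

def tree_b(x):
    if x[1] < 0.3:
        return 0
    return 1 if x[0] < 0.7 else 2

trees = [tree_a, tree_b]
n_leaves = [2, 3]

# Made-up LR parameters: one coefficient per one-hot encoded leaf, plus an intercept
coef = [0.8, -0.4, 0.1, -1.2, 0.6]
intercept = -0.25

def predict_indirect(x):
    """GBDT, then one-hot encoding, then LR: the layered approach."""
    one_hot = []
    for tree, size in zip(trees, n_leaves):
        leaf = tree(x)
        one_hot.extend(1.0 if i == leaf else 0.0 for i in range(size))
    return intercept + sum(c * v for c, v in zip(coef, one_hot))

# Reduction: write the coefficients into the trees as their new leaf scores
offsets = [sum(n_leaves[:i]) for i in range(len(n_leaves))]
leaf_scores = [coef[o:o + size] for o, size in zip(offsets, n_leaves)]

def predict_reduced(x):
    """A plain boosting model: base score plus one leaf score per tree."""
    return intercept + sum(scores[tree(x)] for tree, scores in zip(trees, leaf_scores))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for x in [(0.2, 0.1), (0.6, 0.9), (0.9, 0.5)]:
    # Both routes produce the same decision function value
    assert math.isclose(predict_indirect(x), predict_reduced(x))
    print(x, round(sigmoid(predict_reduced(x)), 4))
```

Since exactly one leaf is active per tree, the dot product over the one-hot vector collapses to a sum of one coefficient per tree, which is what a regression-type boosting model computes.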

The JPMML-Model library provides a Visitor API for traversing, updating and transforming PMML class model objects. In the current case, the Visitor API is used to transform the GBDT side of the GBDT+LR model to a regression-type boosting model. All leaf nodes are assigned new score values as extracted from the LR side.

Example workflow

The GBDT+LR workflow is much simpler than traditional workflows. Specifically, there is no need to perform dedicated feature engineering work, because the GBDT+LR estimator will do it automatically and in a very thorough manner.

Boilerplate for assembling and fitting a GBDT+LR pipeline using user-specified gbdt and lr components:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.ensemble import GBDTLRClassifier
from sklearn2pmml.pipeline import PMMLPipeline

import pandas

df = pandas.read_csv(..)

# The names of categorical and continuous feature columns
cat_columns = [...]
cont_columns = [...]

# The name of the label column
label_column = ..

def make_fit_gbdtlr(gbdt, lr):
  mapper = DataFrameMapper(
    [([cat_column], [CategoricalDomain(), LabelBinarizer()]) for cat_column in cat_columns] +
    [(cont_columns, ContinuousDomain())]
  )
  classifier = GBDTLRClassifier(gbdt, lr)
  pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", classifier)
  ])
  pipeline.fit(df[cat_columns + cont_columns], df[label_column])
  return pipeline

The most common configuration is to use GradientBoostingClassifier as the gbdt component. The “boosting” behaviour can be promoted by growing a larger number of shallower decision trees.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml

pipeline = make_fit_gbdtlr(GradientBoostingClassifier(n_estimators = 499, max_depth = 2), LogisticRegression())
sklearn2pmml(pipeline, "GBDT+LR.pmml")

Conversely, the “bagging” behaviour can be promoted by growing a smaller number of deeper decision trees. The GBDTLRClassifier ensemble model accepts any PMML-compatible classifier as the gbdt component. For example, switching from GradientBoostingClassifier to an alternative classifier class such as ExtraTreesClassifier or RandomForestClassifier may reduce the risk of overfitting:

from sklearn.ensemble import RandomForestClassifier

pipeline = make_fit_gbdtlr(RandomForestClassifier(n_estimators = 31, max_depth = 6), LogisticRegression())
sklearn2pmml(pipeline, "RF+LR.pmml")

The XGBoost plugin library provides the xgboost.XGBClassifier model, which can be used as a drop-in replacement for Scikit-Learn classifier classes:

from xgboost import XGBClassifier

pipeline = make_fit_gbdtlr(XGBClassifier(n_estimators = 299, max_depth = 3), LogisticRegression())
sklearn2pmml(pipeline, "XGB+LR.pmml")

The LightGBM plugin library provides the lightgbm.LGBMClassifier model. One of its major selling points is proper support for categorical features. If the training dataset contains a significant number of categorical features, especially high-cardinality ones, then the make_fit_gbdtlr utility function above should be tailored to preserve this information.

As discussed in a recent blog post, the fit method of LightGBM estimators takes an optional categorical_feature fit parameter. The next challenge is passing this parameter to an LGBMClassifier object, which is contained in the GBDTLRClassifier object, which is in turn contained in the (PMML)Pipeline object. The solution follows Scikit-Learn conventions. Namely, the fit method of the GBDTLRClassifier class also takes fit parameters, which are passed on to the correct component based on the prefix.
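The prefix convention can be pictured as two rounds of prefix stripping. The `split_fit_params` helper below is hypothetical; it is a sketch of the general idea, not code from Scikit-Learn or sklearn2pmml.

```python
def split_fit_params(fit_params, component_names):
    """Group `<component>__<param>` keys by component, stripping the prefix."""
    routed = {name: {} for name in component_names}
    for key, value in fit_params.items():
        name, sep, param = key.partition("__")
        if not sep or name not in routed:
            raise ValueError("Unsupported fit parameter: " + key)
        routed[name][param] = value
    return routed

# Round 1: the pipeline level strips the step name ("classifier")
pipeline_level = split_fit_params(
    {"classifier__gbdt__categorical_feature": [0, 1, 2]},
    ["mapper", "classifier"],
)

# Round 2: the ensemble level strips the component name ("gbdt" or "lr")
ensemble_level = split_fit_params(pipeline_level["classifier"], ["gbdt", "lr"])

print(ensemble_level)
# {'gbdt': {'categorical_feature': [0, 1, 2]}, 'lr': {}}
```

After both rounds, the gbdt component receives a plain categorical_feature parameter, and the lr component receives nothing.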

Boilerplate for assembling and fitting a LightGBM+LR pipeline:

from sklearn.preprocessing import LabelEncoder

def make_fit_lgbmlr(gbdt, lr):
  mapper = DataFrameMapper(
    [([cat_column], [CategoricalDomain(), LabelEncoder()]) for cat_column in cat_columns] +
    [(cont_columns, ContinuousDomain())]
  )
  classifier = GBDTLRClassifier(gbdt, lr)
  pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", classifier)
  ])
  # The 'gbdt' component can be addressed using the `classifier__gbdt` prefix
  # The 'lr' component can be addressed using the `classifier__lr` prefix
  pipeline.fit(df[cat_columns + cont_columns], df[label_column], classifier__gbdt__categorical_feature = range(0, len(cat_columns)))
  return pipeline

Sample usage:

from lightgbm import LGBMClassifier

pipeline = make_fit_lgbmlr(LGBMClassifier(n_estimators = 71, max_depth = 5), LogisticRegression())
sklearn2pmml(pipeline, "LGBM+LR.pmml")

Both XGBoost and LightGBM classifiers support missing values. When working with sparse datasets, the make_fit_gbdtlr and make_fit_lgbmlr utility functions can be made missing value-aware by replacing the default LabelBinarizer and LabelEncoder transformers with sklearn2pmml.preprocessing.PMMLLabelBinarizer and sklearn2pmml.preprocessing.PMMLLabelEncoder transformers, respectively.

Resources

Dataset: audit.csv
Python script: train.py
