Logistic regression (LR) is often the go-to choice for binary classification. Owing to their simplicity, LR models are fast to train, easy to deploy, and readily lend themselves to human interpretation.
The predictive performance of LR models depends on the quality and sophistication of feature engineering. There are two major work areas. The first is delineating and generating the intended feature space. LR algorithms operate on the feature space they are given. They are not designed to independently discover non-linearities along individual dimensions, or interactions between multiple dimensions. The second is filtering down the feature space. Specialized LR algorithms can prioritize and eliminate dimensions using regularization. However, the most common ones estimate coefficients for all dimensions of the feature space.
Facebook Research has demonstrated how feature engineering can be automated using a gradient boosted decision tree (GBDT) model: Practical Lessons from Predicting Clicks on Ads at Facebook
The idea is to train a GBDT model on a raw feature space, and then collect and examine the "decision paths" of its member decision tree models. A decision path that operates on a single feature can be regarded as a non-linear transformation of it (e.g. binning a continuous feature to a pseudo-categorical feature). A decision path that operates on multiple features can be regarded as an interaction between them.
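The two kinds of decision paths can be illustrated with a pair of hand-written toy trees. This is only a sketch: the trees, the `age` and `income` feature names, the thresholds and the helper functions are all hypothetical and stand in for the member trees of a fitted GBDT model.

```python
# Hypothetical toy trees. A node is either a leaf id (a string)
# or a tuple of (feature, threshold, left_child, right_child).
TREE_AGE = ("age", 30.0, "age<=30", ("age", 50.0, "30<age<=50", "age>50"))
TREE_MIX = ("age", 40.0, ("income", 50000.0, "young&low", "young&high"), "older")


def decision_path(tree, row):
    """Follow the splits for one row and return the id of the leaf it reaches."""
    node = tree
    while not isinstance(node, str):
        feature, threshold, left, right = node
        node = left if row[feature] <= threshold else right
    return node


def leaves(tree):
    """List the leaf ids of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    _, _, left, right = tree
    return leaves(left) + leaves(right)


def derived_features(trees, row):
    """One-hot encode the reached leaf of every tree into one flat 0/1 vector."""
    vector = []
    for tree in trees:
        reached = decision_path(tree, row)
        vector.extend(1 if leaf == reached else 0 for leaf in leaves(tree))
    return vector


row = {"age": 35.0, "income": 72000.0}
print(decision_path(TREE_AGE, row))               # 30<age<=50
print(decision_path(TREE_MIX, row))               # young&high
print(derived_features([TREE_AGE, TREE_MIX], row))  # [0, 1, 0, 0, 1, 0]
```

The first tree splits on `age` only, so its leaves behave like bins of a continuous feature. The second tree splits on both `age` and `income`, so each of its leaves on the left branch encodes an interaction between the two.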
GBDT algorithms typically grow shallow decision trees. Shallow trees contain short decision paths, which generally lead to easily interpretable derived features.
A boosting algorithm can be swapped for a bagging algorithm. Random forest (RF) algorithms typically grow much deeper decision trees. Longer decision paths lead to more complex derived features (e.g. interactions between multiple non-linearly transformed features), which lose in interpretability but gain in information content. For example, they make it possible to discover cliffs and other anomalies in the decision space by observing which derived features become associated with extreme node scores.
Scikit-Learn documentation dedicates a separate page to GBDT plus LR ensemble models: Feature transformations with ensembles of trees
While the concept and its implementation are discussed in great detail, there is no reusable GBDT+LR estimator available within Scikit-Learn. Interested parties are either expected to copy-paste the example code, or rely on third-party libraries.
The `sklearn2pmml` package version 0.47.0 introduced the `sklearn2pmml.ensemble.GBDTLRClassifier` and `sklearn2pmml.ensemble.GBDTLMRegressor` ensemble models to address this deficiency.
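For comparison, the copy-paste approach from the Scikit-Learn documentation can be sketched roughly as follows. This is a minimal illustration, not the implementation used by `sklearn2pmml`: the synthetic dataset, the hyperparameter values and the variable names are assumptions made for this example, and both models are fitted on the same rows for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary classification data (assumption: stands in for a real dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=13)

# Step 1: fit the GBDT on the raw feature space
gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=2, random_state=13)
gbdt.fit(X, y)

# Step 2: map every row to the leaf it reaches in every tree.
# For binary classification, apply() returns shape (n_samples, n_estimators, 1).
leaf_ids = gbdt.apply(X)[:, :, 0]

# Step 3: one-hot encode the leaf ids, one group of columns per tree
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_onehot = encoder.fit_transform(leaf_ids)

# Step 4: fit the LR on the derived feature space
lr = LogisticRegression(max_iter=1000)
lr.fit(leaf_onehot, y)

# One category array per tree; one LR coefficient per encoded leaf
n_leaves = sum(len(categories) for categories in encoder.categories_)
print(len(encoder.categories_), n_leaves, lr.coef_.shape)
```

Every step has to be repeated in the same order at prediction time, which is the bookkeeping that a reusable estimator class takes over.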
The Predictive Model Markup Language (PMML) provides standardized data structures for representing all common data pre- and post-processing operations and model types, including the GBDT model type and the LR model type.
If all the parts of a GBDT+LR model are PMML compatible, then it should follow that the GBDT+LR model itself is PMML compatible too? The answer is a definite yes. Better yet, the PMML representation of a GBDT+LR model is reducible to an ordinary GBDT model, which leads to significant conversion- and run-time savings.
The reduction is based on the realization that GBDT+LR is a mechanism for replacing original GBDT leaf node scores with LR coefficients (and the GBDT base score with the LR intercept).
Scikit-Learn does not provide an API for modifying fitted decision trees.
The workaround is to make individual leaf nodes addressable using the one-hot-encoding approach (the `OneHotEncoder.categories_` attribute is a list of arrays; the size of the list equals the number of decision trees in the GBDT; the size of each array equals the number of leaf nodes in the corresponding decision tree), and then to assign a new score to each address (the `LogisticRegression.coef_` attribute is an array whose size equals the flat-mapped size of the `OneHotEncoder.categories_` attribute).
The PMML representation does not need such a layer of indirection, because it is possible to replace leaf node scores in place.
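The equivalence behind this reduction can be checked on a toy model. The sketch below is plain Python with made-up numbers; the leaf ids, the coefficient values and the function names are hypothetical, and it only demonstrates the arithmetic, not the actual PMML transformation.

```python
# Hypothetical toy ensemble: two trees, each described by its list of leaf ids
tree_leaves = [["t0_a", "t0_b"], ["t1_a", "t1_b", "t1_c"]]

# LR side: one coefficient per one-hot column (flat-mapped leaves), plus an intercept
lr_coef = [0.4, -0.7, 1.1, 0.0, -0.3]
lr_intercept = -0.2


def one_hot(reached):
    """Encode the leaf reached in each tree as one flat 0/1 vector."""
    return [
        1 if leaf == hit else 0
        for leaves, hit in zip(tree_leaves, reached)
        for leaf in leaves
    ]


def score_via_lr(reached):
    """Indirect route: one-hot encode the leaves, then apply the LR formula."""
    encoded = one_hot(reached)
    return lr_intercept + sum(c * x for c, x in zip(lr_coef, encoded))


# Reduction: write every LR coefficient back into its leaf as the new leaf score
flat_leaves = [leaf for leaves in tree_leaves for leaf in leaves]
new_leaf_score = dict(zip(flat_leaves, lr_coef))


def score_via_reduced_gbdt(reached):
    """Direct route: sum the replaced leaf scores on top of the base score."""
    return lr_intercept + sum(new_leaf_score[leaf] for leaf in reached)


reached = ["t0_b", "t1_a"]  # the leaf reached in each tree for some row
print(score_via_lr(reached), score_via_reduced_gbdt(reached))
```

Both routes yield the same log-odds value, because exactly one one-hot column per tree is active. The reduced form needs neither the encoder nor the LR model at scoring time.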
The JPMML-Model library provides a Visitor API for traversing, updating and transforming PMML class model objects. In the current case, the Visitor API is used to transform the GBDT side of the GBDT+LR model to a regression-type boosting model. All leaf nodes are assigned new score values as extracted from the LR side.
The GBDT+LR workflow is much simpler than traditional workflows. Specifically, there is no need to perform dedicated feature engineering work, because the GBDT+LR estimator will do it automatically and in a very thorough manner.
Boilerplate for assembling and fitting a GBDT+LR pipeline using user-specified `gbdt` and `lr` components:
```python
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelBinarizer
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.ensemble import GBDTLRClassifier
from sklearn2pmml.pipeline import PMMLPipeline

import pandas

df = pandas.read_csv(..)

# The names of categorical and continuous feature columns
cat_columns = [..]
cont_columns = [..]

# The name of the label column
label_column = ..

def make_fit_gbdtlr(gbdt, lr):
    mapper = DataFrameMapper(
        [([cat_column], [CategoricalDomain(), LabelBinarizer()]) for cat_column in cat_columns] +
        [(cont_columns, ContinuousDomain())]
    )
    classifier = GBDTLRClassifier(gbdt, lr)
    pipeline = PMMLPipeline([
        ("mapper", mapper),
        ("classifier", classifier)
    ])
    pipeline.fit(df[cat_columns + cont_columns], df[label_column])
    return pipeline
```
The most common configuration is to use `GradientBoostingClassifier` as the `gbdt` component and `LogisticRegression` as the `lr` component.
The "boosting" behaviour can be promoted by growing a larger number of shallower decision trees.
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn2pmml import sklearn2pmml

pipeline = make_fit_gbdtlr(GradientBoostingClassifier(n_estimators = 499, max_depth = 2), LogisticRegression())

sklearn2pmml(pipeline, "GBDT+LR.pmml")
```
Conversely, the "bagging" behaviour can be promoted by growing a smaller number of deeper decision trees.
The `GBDTLRClassifier` ensemble model accepts any PMML compatible classifier as the `gbdt` component.
For example, switching from `GradientBoostingClassifier` to an alternative classifier class such as `RandomForestClassifier` can reduce the risk of overfitting:
```python
from sklearn.ensemble import RandomForestClassifier

pipeline = make_fit_gbdtlr(RandomForestClassifier(n_estimators = 31, max_depth = 6), LogisticRegression())

sklearn2pmml(pipeline, "RF+LR.pmml")
```
The XGBoost plugin library provides an `xgboost.XGBClassifier` model, which can be used as a drop-in replacement for Scikit-Learn classifier classes:
```python
from xgboost import XGBClassifier

pipeline = make_fit_gbdtlr(XGBClassifier(n_estimators = 299, max_depth = 3), LogisticRegression())

sklearn2pmml(pipeline, "XGB+LR.pmml")
```
The LightGBM plugin library provides a `lightgbm.LGBMClassifier` model.
One of its major selling points is proper support for categorical features.
If the training dataset contains a significant number of (high-cardinality) categorical features, then the above `make_fit_gbdtlr` utility function should be tailored to maintain this information.
As discussed in a recent blog post, the fit method of LightGBM estimators takes an optional `categorical_feature` fit parameter.
The problem is passing this parameter to an `LGBMClassifier` object, which is contained in the `GBDTLRClassifier` object, which is in turn contained in the `PMMLPipeline` object.
The solution follows Scikit-Learn conventions.
Namely, the fit method of the `GBDTLRClassifier` class also takes fit parameters, which are passed on to the correct component based on the prefix.
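The prefix convention can be illustrated with a small stand-alone helper. This is a sketch of the general double-underscore routing idea only; `split_fit_params` is a hypothetical function written for this example and is not part of Scikit-Learn or `sklearn2pmml`.

```python
def split_fit_params(fit_params, components):
    """Group 'component__name' keyword arguments by their component prefix."""
    routed = {component: {} for component in components}
    for key, value in fit_params.items():
        # Only the first '__' separates the prefix; the rest is kept intact
        prefix, separator, name = key.partition("__")
        if not separator or prefix not in routed:
            raise ValueError(f"Cannot route fit parameter {key!r}")
        routed[prefix][name] = value
    return routed


# Level 1: the pipeline strips the step name
pipeline_params = {"classifier__gbdt__categorical_feature": [0, 1, 2]}
step_params = split_fit_params(pipeline_params, ["mapper", "classifier"])
print(step_params)
# {'mapper': {}, 'classifier': {'gbdt__categorical_feature': [0, 1, 2]}}

# Level 2: the ensemble step strips the component name
component_params = split_fit_params(step_params["classifier"], ["gbdt", "lr"])
print(component_params)
# {'gbdt': {'categorical_feature': [0, 1, 2]}, 'lr': {}}
```

After two rounds of prefix stripping, the bare `categorical_feature` argument is what would reach the fit method of the innermost estimator.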
Boilerplate for assembling and fitting a LightGBM+LR pipeline:
```python
from sklearn.preprocessing import LabelEncoder

def make_fit_lgbmlr(gbdt, lr):
    mapper = DataFrameMapper(
        [([cat_column], [CategoricalDomain(), LabelEncoder()]) for cat_column in cat_columns] +
        [(cont_columns, ContinuousDomain())]
    )
    classifier = GBDTLRClassifier(gbdt, lr)
    pipeline = PMMLPipeline([
        ("mapper", mapper),
        ("classifier", classifier)
    ])
    # The 'gbdt' component can be addressed using the `classifier__gbdt` prefix
    # The 'lr' component can be addressed using the `classifier__lr` prefix
    pipeline.fit(df[cat_columns + cont_columns], df[label_column], classifier__gbdt__categorical_feature = range(0, len(cat_columns)))
    return pipeline
```
```python
from lightgbm import LGBMClassifier

pipeline = make_fit_lgbmlr(LGBMClassifier(n_estimators = 71, max_depth = 5), LogisticRegression())

sklearn2pmml(pipeline, "LGBM+LR.pmml")
```
Both XGBoost and LightGBM classifiers support missing values.
When working with sparse datasets, it is possible to make the `make_fit_gbdtlr` and `make_fit_lgbmlr` utility functions missing value-aware by replacing the default `LabelBinarizer` and `LabelEncoder` transformers with `sklearn2pmml.preprocessing.PMMLLabelBinarizer` and `sklearn2pmml.preprocessing.PMMLLabelEncoder` transformers, respectively.