Upgrading Scikit-Learn XGBoost pipelines

XGBoost is one of the top algorithms for solving tabular ML problems.

The xgboost package provides XGBoost functionality in two flavours. First, the low-level Python Learning API aims at API parity with the underlying C++ library. Second, the high-level Scikit-Learn API aims at making the most popular parts accessible via Scikit-Learn style wrappers.

According to the XGBoost PyPI release history, the xgboost package has been publicly available since mid-2015. Initial releases (0.4, 0.6) carry “pre-release” markers. The first production-ready release is XGBoost version 0.7 (aka 0.70), which is dated 1st of January, 2018.

This blog post walks through all major XGBoost releases from 0.7 to 1.7, with the intent of pinpointing any Scikit-Learn API changes, and making observations and comparisons from the end user perspective. The focus is on handling categorical data. Any measurable changes in predictive or computational performance are ignored. The general expectation is that each new major XGBoost release version is smarter and faster than all the previous ones.

XGBoost version 0.7 (v0.72.1)

The “audit” dataset deals with a binary classification problem.

The label column is originally a two-valued integer, but it is translated into a two-valued string in order to make the experiment more challenging.

There are nine feature columns, three continuous and six categorical.

import pandas

df = pandas.read_csv("audit.csv")
df["Adjusted"] = df["Adjusted"].apply(lambda x: ("yes" if x else "no"))

cat_cols = ["Deductions", "Education", "Employment", "Gender", "Marital", "Occupation"]
cont_cols = ["Age", "Income", "Hours"]

X = df[cat_cols + cont_cols]
y = df["Adjusted"]

A good Scikit-Learn pipeline should start with an “initializer” step that ensures that the incoming data matrix is correct (eg. right columns, in the right order).

Both sklearn_pandas.DataFrameMapper and sklearn.compose.ColumnTransformer (meta-)transformers offer combined data matrix canonicalization and transformation functionality. This experiment goes with the DataFrameMapper class, because it has a simpler and more intuitive API, is purpose-built for dealing with pandas.DataFrame inputs and outputs, and works with any Scikit-Learn version.

Constructing an OHE-style initializer:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

import numpy

def make_mapper():
  return DataFrameMapper(
    [([cont_col], [ContinuousDomain()]) for cont_col in cont_cols] +
    [([cat_col], [CategoricalDomain(), OneHotEncoder(sparse_output = False, dtype = numpy.int8)]) for cat_col in cat_cols]
  , input_df = True, df_out = True)

Numeric columns can be mapped as-is. However, string and boolean columns must be transformed into numeric columns.

A data scientist is free to employ any third-party library and any string-to-numeric encoding algorithm here. For example, the category_encoders package provides a plethora of Scikit-Learn compatible transformers, which are often based on state-of-the-art ML research publications.

The scikit-learn package is rather helpless in comparison. In fact, the only option is the sklearn.preprocessing.OneHotEncoder transformer, which performs a deep, two-level encoding of categorical columns (ie. first from string to ordinal integer, and then from ordinal integer to a list of binary indicators).

The other candidate, the sklearn.preprocessing.OrdinalEncoder transformer, fails the vetting process, because it performs a shallow, one-level encoding (ie. from string to ordinal integer). XGBoost is not programmed to handle such “partially” encoded columns. If ordinal integers slip through, then they will be treated analogously to continuous integers, which qualifies as a serious mistake.
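
A minimal contrast between the two encoding depths (the toy category levels below are made up for illustration):

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

import numpy

X = numpy.asarray([["HS"], ["College"], ["Master"]])

# Shallow, one-level encoding: each category level becomes an ordinal integer,
# which XGBoost would mistake for a continuous feature
print(OrdinalEncoder().fit_transform(X))

# Deep, two-level encoding: each category level becomes its own binary indicator column
print(OneHotEncoder(sparse_output = False, dtype = numpy.int8).fit_transform(X))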

Important: A pipeline where Scikit-Learn’s OneHotEncoder transformer is combined with an XGBoost estimator can only be applied to dense datasets, and never to sparse datasets. In brief, the OneHotEncoder transformer is functionally restricted to two-state output. This is sufficient for encoding dense datasets (ie. “on” and “off” states), but not sparse datasets (ie. “on”, “off” and “unknown” states).

Important: A Scikit-Learn’s OneHotEncoder transformer and an XGBoost estimator are not compatible with one another in their default configurations. In brief, the OneHotEncoder transformer produces sparse data matrices by default, where the 1 value represents the “on” state and the omitted value represents the “off” state. However, an XGBoost estimator (mis-)interprets omitted values as the “unknown” state.

The above notes have been discussed in detail in an earlier blog post about one-hot encoding categorical features in Scikit-Learn XGBoost pipelines.

There are two pipeline configurations that yield provably correct results: a dense one, which forces the OneHotEncoder transformer to emit a dense data matrix, and a sparse one, which keeps the default sparse output but requires careful coordination with the XGBoost estimator. Both are covered in the blog post referenced above.

The first configuration is more robust than the second (easier to understand, less likely to break after a scheduled Scikit-Learn or XGBoost version update). However, it involves a matrix densification operation, which can raise memory requirements by several orders of magnitude. When going this route, the impact can be softened somewhat by changing the data type of output data matrix elements to a very small and cheap one such as numpy.(u)int8 or bool.

Constructing an OHE-style XGBoost estimator:

from xgboost.sklearn import XGBClassifier

def make_classifier():
  return XGBClassifier(objective = "binary:logistic", n_estimators = 131, max_depth = 6)

The values of the XGBModel.n_estimators and XGBModel.max_depth attributes are set to values that should let XGBoost run to exhaustion with this dataset.

Constructing and fitting a pipeline:

from sklearn.pipeline import Pipeline

mapper = make_mapper()
classifier = make_classifier()

pipeline = Pipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(X, y)

The XGBoost classifier performs label encoding and decoding during XGBClassifier.fit(X, y) and XGBClassifier.predict(X) method calls similarly to Scikit-Learn classifiers. The learned encoding is stored in the XGBClassifier._le attribute as a sklearn.preprocessing.LabelEncoder object.
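
For illustration, the learned encoding can be inspected and replayed manually (a quick sketch, relying on the private _le attribute described above):

from sklearn.preprocessing import LabelEncoder

# The private _le attribute holds the fitted label encoder
le = classifier._le
assert isinstance(le, LabelEncoder)

# Class labels in encoding order; here ['no', 'yes']
print(le.classes_)

# Decoding integer class indices back into string labels
print(le.inverse_transform([0, 1]))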

Exporting the booster object:

classifier._Booster.save_model("Booster.bin")

The XGBModel._Booster attribute (not to be confused with the XGBModel.booster attribute!) holds a reference to the underlying xgboost.Booster object.
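
In newer XGBoost versions, the same object is also reachable via the public XGBModel.get_booster() accessor method (exact availability depends on the XGBoost version at hand):

# Equivalent to accessing the private XGBModel._Booster attribute
booster = classifier.get_booster()
booster.save_model("Booster.bin")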

Saving the booster object into a file is the suggested way for moving XGBoost models between ML frameworks and applications. Unfortunately, the only supported data format is a binary proprietary one, for which there is no public specification or documentation available.

When making pipeline configuration changes, it is possible to monitor both their intended and unintended consequences by diffing booster files. For example, when making a change that is supposed to be behaviour-preserving, such as updating equivalent parameterizations or replacing Scikit-Learn's OneHotEncoder transformer with some third-party implementation, the byte content of the booster file must remain exactly the same.
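
A byte-level comparison of two booster files (the file names are hypothetical, standing for the files saved before and after a refactoring) could be scripted as follows:

import filecmp

# Behaviour-preserving refactorings must leave the booster file byte-identical
assert filecmp.cmp("Booster-before.bin", "Booster-after.bin", shallow = False)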

XGBoost versions 0.8 and 0.9 (v0.82, v0.90)

Added the XGBModel.save_model(fname) method.

Exporting the booster object:

classifier.save_model("Booster.bin")

XGBoost version 1.0 (v1.0.2)

Added the XGBoostLabelEncoder class, and changed the type of the XGBClassifier._le attribute to it.

Exporting the booster object is now also possible in JSON data format:

# Note the ".json" filename extension!
classifier.save_model("Booster.json")

XGBoost generates a “minified” JSON document, which does not contain any newlines, indentation or even intermittent whitespace. Such a multi-megabyte, non-breaking line of text poses a serious challenge to most text editors.

Pretty-printing a booster JSON file to make it more human friendly:

$ python -m json.tool < Booster.json > Booster-pretty.json

XGBoost versions 1.1 and 1.2 (v1.1.1, v1.2.1)

No changes.

XGBoost version 1.3 (v1.3.3)

Added the XGBClassifier.use_label_encoder attribute.

The XGBClassifier.fit(X, y) method emits a user warning that label encoding should be done manually:

/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py:888: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].

This request goes against the Scikit-Learn convention that label management is the responsibility of the ML framework.

Extracting the label encoder:

from pandas import Series
from xgboost.compat import XGBoostLabelEncoder

def make_classifier():
  return XGBClassifier(
    objective = "binary:logistic", n_estimators = 131, max_depth = 6,
    use_label_encoder = False
  )

classifier = make_classifier()

xgb_le = XGBoostLabelEncoder()
y = Series(xgb_le.fit_transform(y), name = "Adjusted")

classifier._le = xgb_le

The extracted label encoder is stored in the XGBClassifier._le attribute just like before.

It is clear from the XGBoost library source code that this attribute has retained its special status. For example, the XGBClassifier.predict(X) method checks for its presence and, if found, automatically post-processes the raw prediction using its inverse_transform(y) method.

According to XGBoost version 1.3 release notes, the library now supports direct categorical splits. This functionality is considered highly experimental. It is partially integrated into the low-level Python Learning API, but not into the high-level Scikit-Learn API.

A booster object that contains categorical splits cannot be saved in binary proprietary data format.

XGBoost version 1.4 (v1.4.2)

Starting from this version, the booster JSON file contains an embedded feature map (lists of feature names and feature types).

When making predictions, the XGBoost library uses this information to ensure that new validation or testing datasets are structurally similar to the training dataset. Scikit-Learn pipelines should pass such checks cleanly thanks to the “initializer” step.
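
A quick way to inspect the embedded feature map (a sketch; the key names below are as observed in XGBoost 1.4 JSON output, and may differ between versions):

import json

with open("Booster.json") as f:
  booster_json = json.load(f)

# Feature names and feature types are stored under the top-level "learner" object
learner = booster_json["learner"]
print(learner["feature_names"])
print(learner["feature_types"])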

XGBoost version 1.5 (v1.5.2)

Added the XGBModel.enable_categorical attribute.

Starting from here, the support for direct categorical splits is available at all API levels.

Constructing a categorical-style initializer:

from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

def make_mapper():
  return DataFrameMapper(
    [([cont_col], [ContinuousDomain()]) for cont_col in cont_cols] +
    [([cat_col], [CategoricalDomain(dtype = "category")]) for cat_col in cat_cols]
  , input_df = True, df_out = True)

A categorical column can be marked as such by simply casting it to the pandas.CategoricalDtype data type.

Important: A quick cast using the category data type alias is dataset-dependent. If the same pipeline object is used for making predictions on different datasets, then the dtype argument must be a full-blown CategoricalDtype object, which enumerates all valid category levels ahead of time.
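
For example, pinning the category levels of the “Gender” column ahead of time (the listed levels are illustrative; enumerate whatever levels are valid for the dataset at hand):

from pandas import CategoricalDtype
from sklearn2pmml.decoration import CategoricalDomain

# A full-blown CategoricalDtype that enumerates all valid category levels
gender_dtype = CategoricalDtype(categories = ["Female", "Male"])

gender_domain = CategoricalDomain(dtype = gender_dtype)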

Removing Scikit-Learn’s OneHotEncoder transformer lifts all restrictions that were previously imposed by it. Specifically, the same pipeline can now be applied to both dense and sparse datasets.

Constructing a categorical-style XGBoost estimator:

from xgboost.sklearn import XGBClassifier

def make_classifier():
  return XGBClassifier(
    objective = "binary:logistic", n_estimators = 131, max_depth = 6,
    tree_method = "gpu_hist", enable_categorical = True,
    use_label_encoder = False
  )

The enable_categorical argument acts as a double confirmation mechanism. If it is not explicitly set to True, then the fit method will fail with the following value error:

Traceback (most recent call last):
  File "train-categorical.py", line 45, in <module>
    pipeline.fit(X, y)
  File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 406, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  ...
  File "/usr/local/lib/python3.9/site-packages/xgboost/sklearn.py", line 1245, in <lambda>
    create_dmatrix=lambda **kwargs: DMatrix(nthread=self.n_jobs, **kwargs),
  ...
  File "/usr/local/lib/python3.9/site-packages/xgboost/data.py", line 772, in dispatch_data_backend
    return _from_pandas_df(data, enable_categorical, missing, threads,
  File "/usr/local/lib/python3.9/site-packages/xgboost/data.py", line 312, in _from_pandas_df
    data, feature_names, feature_types = _transform_pandas_df(
  File "/usr/local/lib/python3.9/site-packages/xgboost/data.py", line 256, in _transform_pandas_df
    _invalid_dataframe_dtype(data)
  File "/usr/local/lib/python3.9/site-packages/xgboost/data.py", line 236, in _invalid_dataframe_dtype
    raise ValueError(msg)
ValueError: DataFrame.dtypes for data must be int, float, bool or category.  When
categorical type is supplied, DMatrix parameter `enable_categorical` must
be set to `True`. Invalid columns:Deductions, Education, Employment, Gender, Marital, Occupation

This pipeline refactoring from “OHE style” to “categorical style” is provably correct, because these two configurations produce identical decision tree ensembles when fitted on the same platform using the same tree_method tree param.

The above statement is trivial to verify by diffing PMML files:

$ diff XGBoostAudit-OHE.pmml XGBoostAudit-categorical.pmml

The diff output shows changes to two lines of text. First, the Header element contains a different document creation timestamp. Second, the data type of the “Deductions” field has changed from boolean to string.

The technical cause for the unexpected data type change remains unexplained. Perhaps there is a slight incompatibility along the scikit-learn, sklearn_pandas and pandas library chain, which sorts itself out after a couple of package version updates.

In contrast, the above statement (about two decision tree ensembles being identical) is impossible to verify by diffing booster JSON files.

The booster JSON file references feature map entries by index. However, this pipeline refactoring caused the embedded feature map to collapse in size (from 50 entries to 9 entries), thereby introducing large systematic shifts into index values. For example, the “Education” field was mapped to a feature map entry range [5, 20] (ie. a list of 16 binary indicator features), but is now mapped to a sole feature map entry [4] (ie. a single categorical feature).

The diff output shows that the booster JSON file has been rewritten. While technically correct, this observation is completely unhelpful when trying to understand the bigger picture.

XGBoost version 1.6 (v1.6.2)

Deprecated the use of the XGBClassifier.use_label_encoder attribute, and added the XGBModel.max_cat_to_onehot attribute.

Evidently, XGBoost is quickly getting better at categorical splits. The OHE-based partitioning is considered outdated. The new default is set-based aka optimal partitioning.

The max_cat_to_onehot attribute allows the data scientist to choose one partitioning algorithm over the other.

The default value of the max_cat_to_onehot tree param is 4. Typical values range from 2 to 20. Setting the value to an arbitrarily large number (eg. over 10’000) will effectively disable the set-based aka optimal partitioning.
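
A sketch of overriding the default (the threshold value 16 is arbitrary, chosen for illustration only):

from xgboost.sklearn import XGBClassifier

def make_classifier():
  return XGBClassifier(
    objective = "binary:logistic", n_estimators = 131, max_depth = 6,
    tree_method = "gpu_hist", enable_categorical = True,
    # Categorical features with fewer than 16 category levels are partitioned using
    # OHE-based splits; the rest use set-based aka optimal splits
    max_cat_to_onehot = 16
  )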

Granted, one-hot encoding does not work particularly well with medium- and high-cardinality categorical features, especially if there are second-order effects involved (eg. sets of closely related category levels).

One-hot encoding faces additional headwinds with decision tree models, because it identifies and isolates significant category levels one by one, leading to long and thin branch growth. For example, a member decision tree whose maximum depth is limited to 6 (the default value for the max_depth tree param) will be fully developed after coming in contact with a categorical feature that has six or more significant category levels.

Exporting the booster object is now also possible in Universal Binary JSON (UBJSON) data format:

# Note the ".ubj" filename extension!
classifier.save_model("Booster.ubj")

UBJSON is a binary data serialization format for JSON-compatible data structures. It aims to improve data reading and writing speeds, and reduce data size. These capabilities are highly relevant, as the size of booster JSON files can reach tens to hundreds of megabytes.

XGBoost version 1.7 (v1.7.3)

Added the XGBModel.max_cat_threshold attribute.

This attribute limits the maximum set size for the set-based aka optimal partitioning.

The default value of the max_cat_threshold tree param is 64.

XGBoost will consider a direct categorical split only if the number of category levels for a categorical feature has dropped below this value. For example, the USPS ZIP Code has over 40’000 codes. When used as a categorical feature, then it would be useless to attempt a categorical split at a federal level (eg. sending 10’000 codes to the left and 30’000 codes to the right) or even at a state level. However, it would be very useful to attempt the same at county or city levels (eg. sending 10 codes to the left and 30 to the right).
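
A sketch of tightening the limit (the threshold value 32 is arbitrary, chosen for illustration only):

from xgboost.sklearn import XGBClassifier

def make_classifier():
  return XGBClassifier(
    objective = "binary:logistic", n_estimators = 131, max_depth = 6,
    tree_method = "gpu_hist", enable_categorical = True,
    # Consider at most 32 category levels when searching for a set-based split
    max_cat_threshold = 32
  )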

The structure of member decision trees reflects how the original heterogeneous dataset is being divided into smaller and more homogenous subsets. Continuous and low-cardinality categorical features (see the max_cat_to_onehot tree param!) can appear at every branching level. In contrast, medium- and high-cardinality categorical features initially stay on the sidelines. They appear gradually at deeper branching levels, where the details matter.

PMML

The JPMML-XGBoost library converts booster objects to the Predictive Model Markup Language (PMML) representation. This library supports all XGBoost versions and data formats (eg. Binary, JSON, UBJSON), and can be used either as a standalone command-line application or as an ML framework plugin.

Getting started with PMML is easy. Simply replace the default Scikit-Learn pipeline class (sklearn.pipeline.Pipeline) with the sklearn2pmml.pipeline.PMMLPipeline class, and then pass the fitted pipeline object to the sklearn2pmml.sklearn2pmml utility function:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(X, y)

# Disable PMML optimizations
pipeline.configure(compact = False)

sklearn2pmml(pipeline, "XGBoostAudit.pmml")

The quality of PMML documents depends on the quality and completeness of the XGBoost model schema information. For example, XGBoost versions 1.3 through 1.7 do not store category levels in the embedded feature map. It is impossible to generate meaningful categorical splits in such a situation, unless the embedded feature map is overridden with a more sophisticated external feature map.

The integration is much deeper within the JPMML ecosystem, as various JPMML platform libraries (eg. JPMML-SkLearn, JPMML-R, JPMML-SparkML) present the model schema as a live org.jpmml.converter.Schema object.

The JPMML-XGBoost library can expose one XGBoost model in different ways. A PMML document can be optimized for explainability by expanding all decision tree leaves and maintaining complex feature values until the end. For example, generating predicate elements that operate with temporal feature values (eg. <SimplePredicate field="lastSeenDate" operator="lessOrEqual" value="2022-12-31"/>). Alternatively, a PMML document can be optimized for resource efficiency by compacting and simplifying it.

The conversion process can be guided using conversion options.
