Training Python-based XGBoost accelerated failure time models

Survival analysis is a subtype of regression analysis, which models time durations (eg. time to an event of interest).

XGBoost supports both Cox proportional hazards (Cox PH) and accelerated failure time (AFT) algorithms. The training is currently possible via the low-level Python Learning API, but not via the high-level Scikit-Learn API. This limitation can be tracked as XGBoost-7292.

The solution is to train XGBoost survival models using the Python Learning API, and then migrate them to the Scikit-Learn API for deployment.

Data canonicalization

Loading the “lung” dataset:

from lifelines.datasets import load_lung

df = load_lung()

print(df.dtypes)
print(df.isna().sum())

The loaded data matrix contains ten numeric columns:

Name	Role	Loaded `dtype`	Refined `dtype`	Missing
`inst`	feature	`float64`	`pandas.Int64Dtype`	1
`time`	proto-label	`int64`	`int64`	0
`status`	proto-label	`int64`	`int64`	0
`age`	feature	`int64`	`int64`	0
`sex`	feature	`int64`	`int64`	0
`ph.ecog`	feature	`float64`	`pandas.Int64Dtype`	1
`ph.karno`	feature	`float64`	`pandas.Int64Dtype`	1
`pat.karno`	feature	`float64`	`pandas.Int64Dtype`	3
`meal.cal`	feature	`float64`	`pandas.Int64Dtype`	47
`wt.loss`	feature	`float64`	`pandas.Int64Dtype`	14

The time and status columns encode time durations, leaving the other eight columns as features.

All feature data are inherently integer-like. However, they are stored in float64 data type columns in order to accommodate missing value placeholders in the form of float("NaN").

Refining column data types:

from pandas import Int64Dtype

cols = df.columns.tolist()

for col in cols:
  has_missing = pandas.isnull(df[col]).any()
  df[col] = df[col].astype(Int64Dtype() if has_missing else int)

Data pre-processing

XGBoost versions 1.5 and newer can work with the canonicalized “lung” dataset as-is. There is no longer any technical reason for imputing missing values or encoding categorical values.

In fact, this is a welcome development, as Scikit-Learn is rather inept at transforming sparse categorical features into legacy XGBoost-compatible representation, as discussed in detail in an earlier blog post about one-hot encoding categorical features in Scikit-Learn XGBoost pipelines.

The only data pre-processing action that is needed is casting the data type of inst and sex columns from integer to pandas.CategoricalDtype:

df["inst"] = df["inst"].astype("category")
df["sex"] = df["sex"].astype("category")

However, the above code is not suitable for deployment!

A quick cast using the category data type alias creates a new CategoricalDtype object on each and every call. Its “business state” is a mapping from category levels to category indices. If a testing dataset does not have exactly the same set of unique category levels as the training dataset, then the mapping will be different, and along with it all the predictions.

Constructing a casting transformer that is robust towards dataset variations:

from pandas import CategoricalDtype
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

cat_cols = ["inst", "sex"]
cont_cols = ["age", "ph.ecog", "ph.karno", "pat.karno", "meal.cal", "wt.loss"]

def make_cat_dtype(x):
  categories = pandas.unique(x)
  # Drop null-like category levels
  categories = numpy.delete(categories, pandas.isnull(categories))
  return CategoricalDtype(categories = categories, ordered = False)

inst_dtype = make_cat_dtype(df["inst"])
sex_dtype = make_cat_dtype(df["sex"])

transformer = DataFrameMapper(
  [(["inst"], CategoricalDomain(dtype = inst_dtype))] +
  [(["sex"], CategoricalDomain(dtype = sex_dtype))] +
  [([cont_col], ContinuousDomain(dtype = numpy.float32)) for cont_col in cont_cols]
, input_df = True, df_out = True)

The cast to CategoricalDtype data type is performed using the sklearn2pmml.decoration.CategoricalDomain decorator. Alternatively, it could be performed using the sklearn2pmml.preprocessing.CastTransformer transformer. Using decorators is preferable, because they capture valuable metadata about the training dataset (valid value space, univariate statistics, etc.) which serves as model documentation.

This cast is possible with Pandas’ data containers (eg. pandas.DataFrame and pandas.Series), but not with Numpy arrays and matrices.

The propagation of Pandas’ data containers through Scikit-Learn pipeline has been challenging. Scikit-Learn version 1.2.0 introduces the set_output API, which allows to configure the output data container type for all transformer subclasses using the TransformerMixin.set_output(transform) method.

It will take some time until the set_output API gets propagated through the ecosystem. Until then, the tried and tested sklearn_pandas.DataFrameMapper meta-transformer class is the way to go.

Unlike categorical features, continuous features do not require any casting or transformation here. Nevertheless, they are also filtered through the sklearn2pmml.decoration.ContinuousDomain decorator in order to capture metadata.

In the above example, all columns are declared one by one, even if the associated transformation supports multi-column input. This is needed for preserving the original column names.

Re-coding the label as pairs of scalar values:

def make_aft_label(time, status):
  time_lower = time

  time_upper = time.copy()
  time_upper[status == 0] = float("+Inf")

  return (time_lower, time_upper)

According to XGBoost conventions, uncensored labels are encoded as (<value>, <value>), whereas right-censored labels are encoded as (<value>, +Inf).

Training via Python Learning API

Constructing an xgboost.DMatrix object:

from xgboost import DMatrix

Xt = transformer.fit_transform(df)

time_lower, time_upper = make_aft_label(df["time"], df["status"])

dmat = DMatrix(
  # Features
  data = Xt,
  # Label
  label_lower_bound = time_lower, label_upper_bound = time_upper,
  missing = float("NaN"), 
  enable_categorical = True
)

Training an xgboost.Booster object, and saving it in JSON data format:

import xgboost

params = {
  "objective" : "survival:aft",
  "eval_metric" : "aft-nloglik",
  "max_depth" : 3,
  "tree_method" : "hist"
}

booster = xgboost.train(params = params, dtrain = dmat, num_boost_round = 31)
booster.save_model("booster.json")

Important: The computation of categorical splits must be activated by setting the tree_method tree param to some approximated training algorithm such as approx or hist (gpu_hist on GPU implementations). If this tree param is set to exact (default for small datasets such as the “lung” dataset), then the computation yields continuous splits, irrespective of the underlying data type. XGBoost does not issue any warnings about the fallback.

When in doubt, it is possible to open the booster JSON file in a text editor, and skim over categories_nodes, categories_segments and categories_sizes properties. With categorical features around, one would expect to see a healthy proportion of non-empty list values. It would be a clear cause for concern if they were all empty.

Making predictions via Scikit-Learn API

Scikit-Learn pipeline is a composite of transformers and estimators. It provides atomic pickling, and unified fit_transform(X) and fit_predict(X, y) API methods for dealing with arbitrary complexity workflows.

Constructing a pipeline from available components:

from sklearn.pipeline import Pipeline
from xgboost import Booster
from xgboost.sklearn import XGBRegressor

regressor = XGBRegressor()
regressor._Booster = Booster(model_file = "booster.json")

pipeline = Pipeline([
  ("transformer", transformer),
  ("regressor", regressor)
])

As noted earlier, survival analysis is a subtype of regression analysis. This makes it possible to wrap an AFT booster object into an xgboost.XGBRegressor object, and have it directly respond to XGBRegressor.predict(X) method calls.

Pipeline verification

Typically, a Pipeline object is constructed using unfitted components, and fitted right thereafter. This is convenient and makes a very strong guarantee that all pipeline steps work together harmoniously. However, there are still plenty of situations where one or more pipeline steps defy the unified API, or make use of extra information or functionality that is not available within the Pipeline.fit(X, y) method scope.

When a Pipeline object is constructed from pre-fitted components, then there are no guarantees in place. For example, the Pipeline constructor does not even check if the output dimensions of one step match the input dimensions of the next step.

The correctness of unknown origin and unknown status Pipeline objects can only be assessed empirically, by calling their predict methods (eg. predict(X), predict_proba(X)) with adequate verification datasets, and comparing actual results against expected results.

Asserting equivalence between Python Learning API and Scikit-Learn API predictions:

def check_predict(expected, actual, rtol, atol):
  isclose = numpy.isclose(expected, actual, rtol = rtol, atol = atol, equal_nan = False)
  num_conflicts = numpy.sum(isclose == False)
  if num_conflicts:
    for idx, status in enumerate(isclose):
      if not status:
        print("{} != {}".format(expected[idx], actual[idx]))
    raise ValueError("Found {} conflicting prediction(s)".format(num_conflicts))
  print("All correct")

# Reference values
booster_time = booster.predict(dmat)

# To-be-checked values
pipeline_time = pipeline.predict(df)

check_predict(booster_time, pipeline_time, 1e-6, 1e-3)

The AFT booster object makes predictions using the same DMatrix object that it was fitted on, whereas the pipeline object makes predictions starting from the canonicalized dataset. If there were any discrepancies between their predictions, then it would be natural to put more trust in the former (the “expected” side), and less in the latter (the “actual” side).

Arrays of floating-point numeric values can be compared element-wise for equivalence using the numpy.isclose utility function.

The algorithm is controlled by two parameters. First, relative tolerance (the rtol argument) is a workflow property (the precision of the default numeric data type, plus the order of “mathematical complexity” of transformations). Second, absolute tolerance (the atol argument) is a dataset property.

XGBoost is based on float32 data type, which limits the relative tolerance to about 1E-6 .. 1E-7 range.

The label of the “lung” dataset is survival time in days, with a typical range from 100 to 2000. The absolute tolerance value of 1E-3 is estimated by multiplying the selected relative tolerance 1E-6 with the selected characteristic label value of 1000.

In other words, two survival times are considered to be equivalent if they agree with each other in the order of “full minutes or better”.

PMML

Wrapping the pipeline object into a PMMLPipeline object:

from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

pmml_pipeline = make_pmml_pipeline(pipeline, active_fields = (cat_cols + cont_cols), target_fields = ["time"])

Xt_imp = booster.get_score(importance_type = "weight")
print(Xt_imp)

# Transform dict to list
Xt_imp = [Xt_imp[col] for col in pmml_pipeline.active_fields]

regressor.pmml_feature_importances_ = numpy.asarray(Xt_imp)

df_verif = df[pmml_pipeline.active_fields].sample(10)

pmml_pipeline.verify(df_verif, precision = 1e-6, zeroThreshold = 1e-3)

sklearn2pmml(pmml_pipeline, "XGBoostAFTLung.pmml")

It is advisable to use the sklearn2pmml.make_pmml_pipeline(obj) utility function for constructing a PMMLPipeline object from pre-fitted components, because it is programmed to perform exactly the same extra object initialization work as the PMMLPipeline.fit(X, y) method is doing. For example, setting the PMMLPipeline.active_fields and PMMLPipeline.target_fields attributes.

The PMMLPipeline object is enhanced with verification data in order to auto-discover any regressions that might arise from migration from Python platform to other language platforms. PMML implements model verification similarly to the numpy.isclose utility function. The main difference is terminological, as relative tolerance and absolute tolerance are called “precision” and “zero threshold”, respectively.

Resources

Python script: train.py