Extending Scikit-Learn with prediction post-processing

The centerpiece of ML pipelines is the model. Steps that precede the model are called “data pre-processing” aka “feature engineering” steps. Steps that follow the model are called “prediction post-processing” aka “decision engineering” steps.

Overall, the feature engineering part is more appreciated and valued than the decision engineering part. Feature transformations allow the data to be (re)presented in more nuanced and relevant ways, thereby leading to better models. However, even the best model is functionally compromised if its predictions are confusing or difficult to integrate with the host application.

Scikit-Learn pipelines provide feature engineering support, but completely lack decision engineering support. Any attempt to insert a transformer step after the final estimator step is rejected with an error.
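The rejection is easy to reproduce. The following is a minimal sketch with toy data and illustrative step names. It assumes that step validation happens no later than the fit(X, y) call, so both the construction and the fitting are placed inside the try block:

```python
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, for illustration only
X = numpy.array([[1.0], [2.0], [3.0], [4.0]])
y = numpy.array([2.0, 4.0, 6.0, 8.0])

error = None
try:
  # The regressor is no longer the last step, so it is treated as an
  # intermediate step, which is required to implement transform(X)
  pipeline = Pipeline([
    ("regressor", LinearRegression()),
    ("postprocessor", StandardScaler())
  ])
  pipeline.fit(X, y)
except TypeError as e:
  error = e

print(type(error).__name__, error)
```

The error originates from the regressor step, not from the trailing transformer step: everything before the last step must be a transformer.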

Potential workarounds include wrapping an estimator into a post-processing meta-estimator (that overrides the predict(X) method), or performing post-processing computations outside of the Scikit-Learn pipeline using free-form Python code.
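The meta-estimator workaround might be sketched as follows. The LabelDecodingClassifier class and its mapping parameter are hypothetical names invented for this illustration; they are not part of Scikit-Learn or any other library:

```python
import numpy
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression

class LabelDecodingClassifier(ClassifierMixin, BaseEstimator):
  """Hypothetical meta-estimator that decodes predicted class labels."""

  def __init__(self, classifier, mapping):
    self.classifier = classifier
    self.mapping = mapping

  def fit(self, X, y):
    # Fit a clone, so that the wrapped estimator stays unmodified
    self.classifier_ = clone(self.classifier).fit(X, y)
    return self

  def predict(self, X):
    y_pred = self.classifier_.predict(X)
    # Post-processing happens inside the overridden predict(X) method
    return numpy.array([self.mapping[label] for label in y_pred])

# Toy data, for illustration only
X = numpy.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y = numpy.array([0, 0, 0, 0, 1, 1, 1, 1])

decoder = LabelDecodingClassifier(LogisticRegression(), {0: "no", 1: "yes"})
decoder.fit(X, y)

decisions = decoder.predict(numpy.array([[1.0], [12.0]]))
print(decisions)
```

The post-processing logic is now locked into a custom Python class, which must be importable wherever the fitted estimator is deployed.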

The selection and functionality of meta-estimators are rather limited. The two notable examples are TransformedTargetRegressor for transforming the target during regression, and CalibratedClassifierCV for transforming the decision function during classification.
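As an illustration of the first of them, the sketch below fits a linear model on the logarithm of the target, and lets TransformedTargetRegressor apply the inverse function to predictions. The toy data is constructed so that the log-transformed target is exactly linear:

```python
import numpy
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Toy data, where log(y) is a linear function of X
X = numpy.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = numpy.exp(X.ravel())

regressor = TransformedTargetRegressor(
  regressor = LinearRegression(),
  func = numpy.log,
  inverse_func = numpy.exp
)
regressor.fit(X, y)

# The prediction is reported on the original scale of the target
y_pred = regressor.predict(numpy.array([[6.0]]))
print(y_pred)
```

The post-processing is limited to inverting the target transformation; there is no room for any other kind of business logic.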

Performing computations using free-form Python code is the nuclear option. It allows reaching any goal, but sacrifices the main value proposition of ML pipelines, which is atomicity and ease of deployment across time and space.

PMMLPipeline “transformed prediction” API

The sklearn2pmml package provides a sklearn2pmml.pipeline.PMMLPipeline class, which extends the sklearn.pipeline.Pipeline class with prediction post-processing.

The idea is to attach a number of child transformers to the parent pipeline, one for each predict method:

| Attribute | Predict method | Transformed predict method |
|---|---|---|
| predict_transformer | predict(X) | predict_transform(X) |
| predict_proba_transformer | predict_proba(X) | predict_proba_transform(X) |
| apply_transformer | apply(X) | apply_transform(X) |

A transformed predict method extends the pipeline towards a particular objective. Its output is a 2-D NumPy array, where the leftmost column(s) correspond to the primary result, and all the subsequent columns to secondary results:

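import numpy

# Simplified sketch of a PMMLPipeline method; the actual implementation may differ
def predict_transform(self, X):
  yt = self.predict(X)
  yt_postproc = self.predict_transformer.transform(yt)
  # Stack the primary result (left) and the post-processed results (right) column-wise
  if yt.ndim == 1:
    yt = yt.reshape(-1, 1)
  return numpy.hstack((yt, yt_postproc))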
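
The resulting column layout can be tried out with plain Scikit-Learn components. In the sketch below, the free-standing predict_transform helper function is a hypothetical stand-in for the pipeline method, and the rounding transformer is an arbitrary choice of post-processing:

```python
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import FunctionTransformer

def predict_transform(estimator, transformer, X):
  # Hypothetical helper that mimics the transformed predict method
  yt = estimator.predict(X).reshape(-1, 1)
  yt_postproc = transformer.transform(yt)
  return numpy.hstack((yt, yt_postproc))

# Toy data, for illustration only
X = numpy.array([[1.0], [2.0], [3.0]])
y = numpy.array([10.0, 20.0, 30.0])

estimator = LinearRegression().fit(X, y)
# A stateless transformer that rounds predictions to the nearest integer
transformer = FunctionTransformer(numpy.round)

yt = predict_transform(estimator, transformer, numpy.array([[1.26], [2.5]]))
print(yt)
```

The first column holds the raw predictions, and the second column holds their rounded counterparts.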

Child transformers cannot see the incoming X dataset. A data matrix may expand or contract during data pre-processing in unforeseen ways, so it would be very difficult to match a specific feature column or condition during prediction post-processing. If a business decision is a function of both model input and output, then it still needs to be coded manually.

There is no limit to a child transformer’s complexity, except that it cannot encapsulate a full-blown model.

Additionally, the sklearn2pmml package provides a sklearn2pmml.postprocessing.BusinessDecisionTransformer transformer, which generates rich OutputField elements following the “decision” result feature conventions.

Examples

The class label of the “audit” dataset is encoded as a binary integer, where the “0” value and the “1” value indicate non-productive and productive audits, respectively. Such internal encodings should be unwound before reaching higher application levels.

Post-processing class labels:

from sklearn2pmml.decoration import Alias
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.postprocessing import BusinessDecisionTransformer
from sklearn2pmml.preprocessing import ExpressionTransformer

binary_decisions = [
	("yes", "Auditing is needed"),
	("no", "Auditing is not needed")
]

# Maps the predicted class label to a "yes" or "no" business decision
predict_transformer = Alias(BusinessDecisionTransformer(ExpressionTransformer("'yes' if X[0] == 1 else 'no'"), "Is auditing necessary?", binary_decisions, prefit = True), "binary decision", prefit = True)

pipeline = PMMLPipeline([...]
, predict_transformer = predict_transformer)

yt = pipeline.predict_transform(X)

Regression results can be transformed numerically using FunctionTransformer or ExpressionTransformer transformers, whereas classification results can be re-mapped using the LookupTransformer transformer.

The BusinessDecisionTransformer transformer is applicable to categorical results (classification and clustering results, bucketized regression results). It articulates the business problem, and enumerates the full range of business decisions that this output field can make.

Post-processing probability distributions:

from sklearn.pipeline import Pipeline
from sklearn2pmml.decoration import Alias
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.postprocessing import BusinessDecisionTransformer
from sklearn2pmml.preprocessing import CutTransformer, ExpressionTransformer

graded_decisions = [
	("no", "Auditing is not needed"),
	("no over yes", "Audit in last order"),
	("yes over no", "Audit in first order"),
	("yes", "Auditing is needed"),
]

event_proba_quantiles = [0.0, 0.1363, 0.5238, 0.7826, 1.0]

predict_proba_transformer = Pipeline([
	("selector", ExpressionTransformer("X[1]")),
	("decider", Alias(BusinessDecisionTransformer(CutTransformer(bins = event_proba_quantiles, labels = [key for key, value in graded_decisions]), "Is auditing necessary?", graded_decisions, prefit = True), "graded decision", prefit = True))
])

pipeline = PMMLPipeline([...]
, predict_proba_transformer = predict_proba_transformer)

yt = pipeline.predict_proba_transform(X)

The input to the predict_proba_transformer is a multi-column array. Therefore, the transformation is typically implemented as a pipeline, where the first step performs column selection.
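
The two steps can be traced by hand outside of any pipeline. The sketch below uses a hard-coded array as a stand-in for the predict_proba(X) result, and assumes that the binning step behaves like the pandas.cut function with the lowest bin edge included:

```python
import numpy
import pandas

graded_labels = ["no", "no over yes", "yes over no", "yes"]
event_proba_quantiles = [0.0, 0.1363, 0.5238, 0.7826, 1.0]

# Stand-in for the predict_proba(X) result: one row per sample, one column per class
proba = numpy.array([
  [0.95, 0.05],
  [0.70, 0.30],
  [0.40, 0.60],
  [0.10, 0.90]
])

# Step 1: select the probability of the event class (the second column)
event_proba = proba[:, 1]

# Step 2: bin the probabilities into graded decisions
decisions = pandas.cut(event_proba, bins = event_proba_quantiles, labels = graded_labels, include_lowest = True)
print(list(decisions))
```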

For elementary operations, the transformer can either be kept as a standalone pipeline step or be embedded into the BusinessDecisionTransformer transformer. The former approach gives rise to an extra OutputField element, which may be seen as unnecessary clutter in the model schema.

The two examples above are about fully decoupled child transformers. They are composed of prefitted components, and may be defined and assigned in the PMMLPipeline constructor.

However, there are several application areas where the child transformer needs to reference the internal state of the preceding estimator, or even be fitted relative to it (e.g. probability calibration). This does not pose any problems, because all the relevant PMMLPipeline attributes may be assigned and re-assigned at any later time.

Post-processing leaf indices:

from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import Alias
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing import LookupTransformer

classifier = DecisionTreeClassifier()

pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(X, y)

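# Maps the index of each leaf node to the number of training samples that reached it
def leaf_sizes(tree):
  sizes = dict()
  for i in range(tree.node_count):
    # Leaf nodes have no left or right child
    if (tree.children_left[i] == -1) and (tree.children_right[i] == -1):
      sizes[i] = int(tree.n_node_samples[i])
  return sizes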

pipeline.apply_transformer = Alias(LookupTransformer(leaf_sizes(classifier.tree_), default_value = -1), "leaf size", prefit = True)

yt = pipeline.apply_transform(X)

pipeline.configure(compact = False, flat = False, numeric = False, winner_id = True)
sklearn2pmml(pipeline, "DecisionTreeAudit.pmml")

The input to the apply_transformer is a column vector for decision tree models, and a 2-D NumPy array for decision tree ensemble models.
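
The difference between the two model types can be inspected with plain Scikit-Learn estimators. The sketch below uses the "iris" dataset and arbitrary hyperparameters. Note that the raw apply(X) method of a single decision tree returns a 1-D array; the column vector shape is presumably the result of reshaping by the pipeline, which is an assumption that this sketch does not verify:

```python
import numpy
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y = True)

tree = DecisionTreeClassifier(max_depth = 3, random_state = 13).fit(X, y)
forest = RandomForestClassifier(n_estimators = 5, max_depth = 3, random_state = 13).fit(X, y)

tree_leaves = tree.apply(X)
forest_leaves = forest.apply(X)

# One leaf index per sample
print(tree_leaves.shape)
# One leaf index per sample and per member decision tree
print(forest_leaves.shape)

# Look up the number of training samples behind each winning leaf
leaf_sizes = tree.tree_.n_node_samples[tree_leaves]
print(leaf_sizes[:5])
```

For ensemble models, each column must be looked up against the tree structure of the corresponding member decision tree.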

Scikit-Learn identifies decision tree nodes by integer index (the position of the node in the tree_ arrays, where the root node is at index 0), which can be encoded into PMML documents using the generic entity identifiers mechanism.

By default, the sklearn2pmml package does not collect and encode node identifiers, because that would prevent it from compacting and flattening the tree data structure. The default behaviour is suppressed by deactivating the compact and flat conversion options, and activating the winner_id conversion option. The numeric conversion option controls the encoding of categorical splits, and can be toggled freely.

Resources