The centerpiece of ML pipelines is the model. Steps that precede the model are called “data pre-processing” aka “feature engineering” steps. Steps that follow the model are called “prediction post-processing” aka “decision engineering” steps.
Overall, data pre-processing is more appreciated and valued than prediction post-processing. Feature transformations allow the data to be (re)presented in more nuanced and relevant ways, thereby leading to better models. However, even the best model is functionally compromised if its predictions are confusing or difficult to integrate with the host application.
Scikit-Learn pipelines provide data pre-processing support, but completely lack prediction post-processing support. Any attempt to insert a transformer step after the final estimator step will fail with an error.
Potential workarounds include wrapping an estimator into a post-processing meta-estimator (that overrides the predict(X) method), or performing post-processing computations outside of the Scikit-Learn pipeline using free-form Python code.
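For illustration, the meta-estimator approach might look like the following minimal sketch. The PredictTransformingEstimator class name and its attributes are hypothetical, not part of Scikit-Learn or any library:

```python
from sklearn.base import BaseEstimator


class PredictTransformingEstimator(BaseEstimator):
    """Hypothetical meta-estimator that post-processes predictions.

    Wraps an estimator, and overrides the predict(X) method so that
    every prediction passes through a post-processing function.
    """

    def __init__(self, estimator, postprocessor):
        self.estimator = estimator
        self.postprocessor = postprocessor

    def fit(self, X, y = None, **fit_params):
        self.estimator.fit(X, y, **fit_params)
        return self

    def predict(self, X):
        # The override point: apply the post-processing function
        # to the wrapped estimator's raw predictions
        return self.postprocessor(self.estimator.predict(X))
```

The wrapper composes with other Scikit-Learn tooling because it still exposes the standard fit/predict contract, but each new predict-style method must be overridden by hand.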
The selection and functionality of meta-estimators is rather limited. The two notable examples are TransformedTargetRegressor for transforming the target during regression, and CalibratedClassifierCV for transforming the decision function during classification.
Performing computations using free-form Python code is the nuclear option. It allows reaching any goal, but sacrifices the main value proposition of ML pipelines, which is atomicity and ease of deployment across time and space.
PMMLPipeline “transformed prediction” API
The sklearn2pmml package provides the sklearn2pmml.pipeline.PMMLPipeline class, which extends the sklearn.pipeline.Pipeline class with prediction post-processing.
The idea is to attach a number of child transformers to the parent pipeline, one for each predict method:
| Attribute | Predict method | Transformed predict method |
|---|---|---|
| predict_transformer | predict(X) | predict_transform(X) |
| predict_proba_transformer | predict_proba(X) | predict_proba_transform(X) |
| apply_transformer | apply(X) | apply_transform(X) |
A transformed predict method extends the pipeline towards a particular objective. Its output is a 2-D Numpy array, where the leftmost column(s) correspond to the primary result, and all the subsequent columns to secondary results:
```python
import numpy

def predict_transform(X):
    yt = self.predict(X)
    yt_postproc = self.predict_transformer.transform(yt)
    return numpy.hstack((yt, yt_postproc))
```
Child transformers cannot see the incoming data matrix. A data matrix may expand or contract during data pre-processing in unforeseen ways, so it would be very difficult to match a specific feature column or condition during prediction post-processing.
If a business decision is a function of both model input and output, then it still needs to be coded manually.
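Such a manual decision could look like the following free-form sketch. The column meaning, the probability cut-off and the income threshold are all hypothetical, chosen only for illustration:

```python
import numpy

def manual_decision(X, event_proba, income_threshold = 50000):
    """Free-form business decision combining model input and output.

    Flags a row for auditing when the predicted event probability
    exceeds 0.5 AND the income column (assumed to be the first input
    column) falls below a threshold. Hypothetical logic, outside the
    Scikit-Learn pipeline.
    """
    return numpy.where((event_proba > 0.5) & (X[:, 0] < income_threshold), "audit", "skip")
```

Code like this lives outside the pipeline object, and therefore needs to be re-implemented in every deployment environment.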
There is no limit to a child transformer’s complexity, except that it cannot encapsulate a full-blown model.
The sklearn2pmml package provides the sklearn2pmml.postprocessing.BusinessDecisionTransformer transformer, which generates rich OutputField elements following the “decision” result feature conventions.
The class label of the “audit” dataset is encoded as a binary integer, where the “0” value and the “1” value indicate non-productive and productive audits, respectively. Such internal encodings should be unwound before reaching higher application levels.
Post-processing class labels:
```python
from sklearn2pmml.decoration import Alias
from sklearn2pmml.postprocessing import BusinessDecisionTransformer
from sklearn2pmml.preprocessing import ExpressionTransformer

binary_decisions = [
    ("yes", "Auditing is needed"),
    ("no", "Auditing is not needed")
]

pipeline = PMMLPipeline([...],
    predict_transformer = Alias(BusinessDecisionTransformer(ExpressionTransformer("'yes' if X == 1 else 'no'"), "Is auditing necessary?", binary_decisions, prefit = True), "binary decision", prefit = True)
)

yt = pipeline.predict_transform(X)
```
Regression results can be transformed numerically using ExpressionTransformer transformers, whereas classification results can be re-mapped using the BusinessDecisionTransformer transformer, which is applicable to categorical results (classification and clustering results, bucketized regression results).
It articulates the business problem, and enumerates the full range of business decisions that this output field can make.
Post-processing probability distributions:
```python
from sklearn2pmml.decoration import Alias
from sklearn2pmml.postprocessing import BusinessDecisionTransformer
from sklearn2pmml.preprocessing import CutTransformer, ExpressionTransformer

graded_decisions = [
    ("no", "Auditing is not needed"),
    ("no over yes", "Audit in last order"),
    ("yes over no", "Audit in first order"),
    ("yes", "Auditing is needed"),
]

event_proba_quantiles = [0.0, 0.1363, 0.5238, 0.7826, 1.0]

predict_proba_transformer = Pipeline([
    # Select the event probability column
    ("selector", ExpressionTransformer("X[1]")),
    ("decider", Alias(BusinessDecisionTransformer(CutTransformer(bins = event_proba_quantiles, labels = [key for key, value in graded_decisions]), "Is auditing necessary?", graded_decisions, prefit = True), "graded decision", prefit = True))
])

pipeline = PMMLPipeline([...],
    predict_proba_transformer = predict_proba_transformer
)

yt = pipeline.predict_proba_transform(X)
```
The input to the predict_proba_transformer is a multi-column array. Therefore, the transformation is typically implemented as a pipeline, where the first step performs column selection.
In case of elementary operations, it is possible to keep the transformer as a standalone pipeline step, or embed it into the BusinessDecisionTransformer transformer. The former approach gives rise to an extra OutputField element, which may be seen as unnecessary clutter in the model schema.
The two above examples are about fully decoupled child transformers. They are composed of prefitted components, and may be defined and assigned in the PMMLPipeline constructor.
However, there are several application areas where the child transformer needs to reference the internal state of the preceding estimator, or even be fitted relative to it (eg. probability calibration).
This does not pose any problems, because all the relevant PMMLPipeline attributes may be assigned and re-assigned at any later time.
Post-processing leaf indices:
```python
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import Alias
from sklearn2pmml.preprocessing import LookupTransformer

classifier = DecisionTreeClassifier()

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", classifier)
])
pipeline.fit(X, y)

def leaf_sizes(tree):
    leaf_sizes = dict()
    for i in range(tree.node_count):
        if (tree.children_left[i] == -1) and (tree.children_right[i] == -1):
            leaf_sizes[i] = int(tree.n_node_samples[i])
    return leaf_sizes

pipeline.apply_transformer = Alias(LookupTransformer(leaf_sizes(classifier.tree_), default_value = -1), "leaf size", prefit = True)

yt = pipeline.apply_transform(X)

pipeline.configure(compact = False, flat = False, numeric = False, winner_id = True)

sklearn2pmml(pipeline, "DecisionTreeAudit.pmml")
```
The input to the apply_transformer is a column vector for decision tree models, and a 2-D Numpy array for decision tree ensemble models.
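A quick way to see the raw node identifiers that the apply(X) method produces is to fit a tiny decision tree on synthetic data (the dataset and hyperparameters below are assumptions for illustration only):

```python
import numpy
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic dataset: one feature, two well-separated classes
X = numpy.array([[0.0], [1.0], [2.0], [3.0]])
y = numpy.array([0, 0, 1, 1])

# A depth-1 tree has three nodes: the root (index 0) and two leaves
clf = DecisionTreeClassifier(max_depth = 1, random_state = 13)
clf.fit(X, y)

# apply(X) maps each row to the index of the leaf node it falls into
leaf_ids = clf.apply(X)
```

These are exactly the values that an apply_transformer (such as the LookupTransformer in the example above) receives as its input.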
Scikit-Learn identifies decision tree nodes by 0-based integer index, which can be encoded using the PMML entity identifiers mechanism.
By default, the sklearn2pmml package does not collect and encode node identifiers, because that would prevent it from compacting and flattening the tree data structure. The default behaviour is suppressed by deactivating the compact and flat conversion options, and activating the winner_id conversion option. The numeric conversion option controls the encoding of categorical splits, and can be toggled freely.
- “Audit” dataset:
- Python scripts: