There are numerous application scenarios that require the ability to "look into" a model to understand how a particular prediction was computed. They range from low-stakes scenarios such as tracing and debugging misbehaving models, to high-stakes ones such as generating reports for models that make life-changing decisions.
Most ML frameworks completely overlook this need. For example, Scikit-Learn logistic regression models expose
predict_proba(X) methods, which return plain numeric predictions. The only way to understand how a particular number was computed (e.g. active terms and their coefficients, the family and parameterization of the link function) is to open the source code of the logistic regression model class in a text editor, and read through the body of the predict method line by line. Moreover, if the model operates on a transformed feature space, and the ML framework itself uses low-level abstractions for feature representation (e.g. string features are transformed to binary vectors), then it is virtually impossible for a casual observer to make sense of the computation.
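To make concrete what such an explanation involves, the following minimal sketch recomputes a logistic regression probability by hand and keeps the intermediate terms. The feature names, coefficients and intercept are made-up illustration values, not taken from any real model, and the logit link is assumed.

```python
import math

# Hypothetical fitted parameters (illustration only, not from a real model)
intercept = -1.5
coefficients = {"age": 0.03, "hours": 0.02}

def explain_probability(row):
    """Return the event probability together with the terms that produced it."""
    # One term per active feature: coefficient * feature value
    terms = {name: coef * row[name] for name, coef in coefficients.items()}
    linear_predictor = intercept + sum(terms.values())
    # Logit link: the probability is the inverse logit (sigmoid) of the linear predictor
    probability = 1.0 / (1.0 + math.exp(-linear_predictor))
    return probability, linear_predictor, terms

probability, linear_predictor, terms = explain_probability({"age": 40, "hours": 45})
print(terms)
print(linear_predictor, probability)
```

A plain predict_proba(X) call returns only the last of these values; the terms and the linear predictor are the parts that a report needs to preserve.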
This problem has an easy two-step solution. First, the model or pipeline should be converted from the low-level ML framework representation to the high-level Predictive Model Markup Language (PMML) representation, which makes it human-readable and -interpretable in the original feature space. Second, all the tracing and reporting work should be automated using a PMML engine.
Reporting Java API
The JPMML-Evaluator library is probably the most capable and versatile PMML engine for the Java/JVM platform. It provides different API levels and hooks for interacting with deployed models, including a special-purpose Value API for capturing all operations that are made when computing a prediction.
The appropriate Value API can be activated using the
org.jpmml.evaluator.ModelEvaluatorBuilder#setValueFactoryFactory(org.jpmml.evaluator.ValueFactoryFactory) method. For example, creating two
Evaluator objects based on the same in-memory
The reporting Value API captures the computation in the Mathematical Markup Language (MathML) representation. MathML is an XML dialect that can be rendered as an image, or translated to other data formats and representations such as LaTeX, or R and Python language expressions.
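As a rough illustration of the translation step, the sketch below converts a small MathML fragment into a Python expression string. It assumes content MathML built from apply, plus, times and cn elements; the sample is hand-written, and the exact markup emitted by the reporting Value API may differ.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

# Operators covered by this sketch; a real translator would need many more
OPERATORS = {"plus": " + ", "times": " * "}

def to_expression(element):
    """Translate a content MathML element into a Python expression string."""
    tag = element.tag.replace(MATHML_NS, "")
    if tag == "math":
        return to_expression(element[0])
    if tag == "cn":
        return element.text.strip()
    if tag == "apply":
        # The first child names the operator, the remaining children are operands
        operator = element[0].tag.replace(MATHML_NS, "")
        operands = [to_expression(child) for child in element[1:]]
        return "(" + OPERATORS[operator].join(operands) + ")"
    raise ValueError("Unsupported MathML element: " + tag)

# Hand-written sample, not actual engine output
sample = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply><plus/><cn>0.5</cn><apply><times/><cn>2</cn><cn>3</cn></apply></apply>
</math>"""

expression = to_expression(ET.fromstring(sample))
print(expression)
# eval() is acceptable here only because the input is a trusted, hand-written sample
print(eval(expression))
```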
When the reporting Value API is activated, target field values are complex objects that implement the
org.jpmml.evaluator.HasReport marker interface. This interface declares a sole
HasReport#getReport() method, which gives access to the live
org.jpmml.evaluator.Report object. The
Report class is polymorphic, and has several specialized implementation classes available. The simplest way to obtain the final MathML string is to invoke the
org.jpmml.evaluator.ReportUtil#format(Report) utility method:
Reporting PMML vendor extension
After successfully designing and implementing the reporting Value API, the authors suggested to the Data Mining Group (DMG.org) that the PMML standard should incorporate similar functionality in the form of a
report result feature. Unfortunately, DMG.org decided against doing so, which leaves this functionality with the status of a vendor extension.
The reporting OutputField element has the following attributes:
name – The name of the output field. A good convention is to wrap the name of the base output field as report(<output field name>).
optype – Fixed as
feature – Fixed as x-report. The "x-" prefix to the attribute value indicates that this is a vendor extension.
x-reportField – The name of the base output field. Again, the "x-" prefix to the attribute name indicates that this is a vendor extension.
For example, enhancing a binary classification model to extract probability calculation reports for the "event" and "no-event" target categories:
The ordering of
OutputField elements is not significant, except for the common sense requirement that the declaration of the base output field must precede the declaration of the reporting output field that references it.
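The attribute list and the ordering requirement can be sketched as a small helper that adds a reporting OutputField element right after its base field. This is an illustration only: the helper name is hypothetical, the base field "probability(1)" is an assumed example, the optype value is left to the caller, and XML namespaces are ignored for brevity. The name follows the report(...) pattern used for "report(xgbValue)" later in this text.

```python
import xml.etree.ElementTree as ET

def add_reporting_field(output, base_name, optype):
    """Insert a reporting OutputField right after the base OutputField it references."""
    names = [field.get("name") for field in output]
    if base_name not in names:
        raise ValueError("No such base output field: " + base_name)
    reporting = ET.Element("OutputField", {
        "name": "report(" + base_name + ")",
        "optype": optype,
        "feature": "x-report",        # vendor extension value
        "x-reportField": base_name,   # vendor extension attribute
    })
    # Insert after the base field, so that its declaration precedes the reference
    output.insert(names.index(base_name) + 1, reporting)
    return reporting

# Assumed example of an Output element with a single probability field
output = ET.fromstring(
    '<Output>'
    '<OutputField name="probability(1)" optype="continuous" dataType="double"'
    ' feature="probability" value="1"/>'
    '</Output>'
)

# "categorical" is an assumed placeholder, not a value confirmed by this text
add_reporting_field(output, "probability(1)", optype="categorical")
print(ET.tostring(output, encoding="unicode"))
```

Real PMML documents declare a default namespace, so a production version would need to create and look up elements with namespace-qualified tag names.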
Training a minimalistic XGBoost model for the "audit" dataset:
The sklearn2pmml package encodes this XGBoost model in the form of a two-segment model chain. The first segment is the "booster", which sums the predictions of 17 member decision tree models. The second segment is the "sigmoid function", which transforms the boosted value to a pair of probability values.
The generation of reporting
OutputField elements could be controlled using a special-purpose conversion option. However, as long as such an option is not available, or when working with existing or third-party PMML documents, these elements need to be added manually.
The newly generated PMML document
XGBoostAudit.pmml is copied into
XGBoostAudit-reporting.pmml, and modified in a text editor in the following way:
According to the PMML specification, the results provided from the model chain are the results of the last active segment. The results from earlier active segments must be explicitly propagated.
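This scoping rule can be imitated with a toy chain evaluator. The sketch below is a simplification, not the JPMML-Evaluator implementation: segments are plain functions, the member values are made up, and the "probability(1)" and "probability(0)" field names are assumptions.

```python
import math

def booster(scope):
    # Hypothetical member tree predictions; a real booster would evaluate its trees
    member_values = [-0.2, -0.1, -0.05]
    report = " + ".join(str(value) for value in member_values)
    return {"xgbValue": sum(member_values), "report(xgbValue)": report}

def sigmoid(scope):
    # Reads the boosted value produced by the earlier segment
    p1 = 1.0 / (1.0 + math.exp(-scope["xgbValue"]))
    return {"probability(1)": p1, "probability(0)": 1.0 - p1}

def sigmoid_with_reexport(scope):
    results = sigmoid(scope)
    # Explicitly re-export a value that originates from the earlier segment
    results["ref(report(xgbValue))"] = scope["report(xgbValue)"]
    return results

def evaluate_chain(segments, arguments):
    scope = dict(arguments)
    results = {}
    for segment in segments:
        results = segment(scope)  # replaces the results of the previous segment
        scope.update(results)     # earlier results stay visible to later segments
    return results                # only the last segment's results are returned

plain = evaluate_chain([booster, sigmoid], {})
propagated = evaluate_chain([booster, sigmoid_with_reexport], {})
print(sorted(plain))
print(sorted(propagated))
```

In the plain chain the booster report is computed but never reaches the caller; it shows up only when the last segment re-exports it.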
For example, the value of the "report(xgbValue)" output field stays in "booster" scope by default. It needs to be imported from "booster" scope into "sigmoid function" scope using a
MiningField element, and then re-exported as "ref(report(xgbValue))" using an
The jpmml_evaluator package provides a Python wrapper for the JPMML-Evaluator library. It enables quick PMML validation and evaluation work, without writing a single line of Java application code.
Creating a verified PMML engine, and evaluating the first row of the "audit" dataset:
The result is a
dict object with six items:
The above report shows that the boosted value -2.1023297 is obtained by summing 17 member values that range from -0.17968129 to -0.09484299. As is typically the case with gradient boosting methods, the magnitude of member values decreases with each iteration. The probability of the positive scenario 0.1088706 is obtained by applying the inverse of the logit function to -2.1023297. The probability of the negative scenario 0.8911294 is obtained by subtracting the probability of the positive scenario from 1.
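The last two steps of the report can be checked independently with a few lines of Python. This sketch uses only the boosted value quoted above and works in double precision, so agreement with the reported figures is expected to within rounding rather than digit for digit.

```python
import math

boosted_value = -2.1023297  # boosted value quoted in the report

# Inverse logit (sigmoid) of the boosted value gives the positive probability
p_positive = 1.0 / (1.0 + math.exp(-boosted_value))
# The negative probability is the complement
p_negative = 1.0 - p_positive

print(p_positive, p_negative)
```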
- "Audit" dataset:
- Python scripts:
- Reporting PMML document: