Extending Scikit-Learn with feature specifications

Predictive analytics applications must pay attention to "model-data fit", which means that a model can only be used if it is known to be relevant and applicable.

To illustrate, given a model object, one should be able to confidently answer questions like:

Much of this (meta-)information is readily available during model training. JPMML family conversion tools and libraries aim to capture, systematize and store it automatically, with minimal intrusion to existing workflows.

This blog post demonstrates how Scikit-Learn users should approach the "model-data fit" problem.


The Predictive Model Markup Language (PMML) defines data structures for representing most common model types.

Every model element holds the description of its data interface:

Value preparation is a two-stage process.

In the first stage, the user value is converted to a PMML value according to the DataField element.

The user value is cast or parsed into the correct data type, restricted to the correct operational type (one of continuous, categorical or ordinal), and assigned to the value space (one of valid, invalid or missing). The resulting PMML value can be regarded as a value in a three-dimensional space <data type>-<operational type>-<value space type>.

Consider the following DataField element:

<DataField name="status" dataType="integer" optype="categorical">
  <Value value="1"/>
  <Value value="2"/>
  <Value value="3"/>
  <Value value="-999" property="missing"/>
</DataField>

Value conversions:

| Java value | PMML value | Explanation |
|---|---|---|
| java.lang.String("1") | integer-categorical-valid | Parseable, listed as valid |
| java.lang.Integer(2) | integer-categorical-valid | As-is, listed as valid |
| java.lang.Double(3.0) | integer-categorical-valid | Castable without loss of precision, listed as valid |
| java.lang.String("one") | integer-categorical-invalid | Not parseable |
| java.lang.Integer(0) | integer-categorical-invalid | As-is, not listed as valid |
| java.lang.Double(3.14) | integer-categorical-invalid | Not castable without loss of precision |
| null | integer-categorical-missing | Missing value |
| java.math.BigDecimal("-999.000") | integer-categorical-missing | Castable without loss of precision, listed as missing |
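The first stage can be approximated in a few lines of plain Python. This is only a sketch of the idea for the "status" field above, not how any PMML engine is actually implemented; the helper names `to_integer` and `value_space` are made up for illustration, and Python values stand in for the Java values of the table.

```python
from decimal import Decimal

# Value spaces declared by the "status" DataField element
VALID_VALUES = {1, 2, 3}
MISSING_VALUES = {-999}

def to_integer(value):
    """Cast or parse a user value to int; return None if that is not possible without loss."""
    try:
        if isinstance(value, str):
            return int(value)
        if isinstance(value, (int, float, Decimal)) and not isinstance(value, bool):
            number = int(value)
            # Reject casts that would drop a fractional part
            return number if number == value else None
    except (ValueError, OverflowError):
        pass
    return None

def value_space(value):
    """Assign a user value to the "valid", "invalid" or "missing" value space."""
    if value is None:
        return "missing"
    number = to_integer(value)
    if number is None:
        return "invalid"
    if number in MISSING_VALUES:
        return "missing"
    return "valid" if number in VALID_VALUES else "invalid"
```

Calling `value_space` on the Python equivalents of the table rows (for example `"1"`, `3.14` or `Decimal("-999.000")`) yields the same value space as listed in the table.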

In the second stage, the PMML value undergoes one or more value space-dependent treatments according to the MiningField element.

Valid values pass by default. The domain of continuous values can be restricted by changing the value of the outliers attribute from asIs to asMissingValues or asExtremeValues, plus adding lowValue and highValue attributes.
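The effect of the three outlier treatments can be sketched as follows. The function `treat_outlier` is a hypothetical helper, not part of any library; it assumes that both bounds are given whenever the treatment is not asIs, and it uses `None` to stand for a missing value.

```python
def treat_outlier(value, outliers="asIs", low_value=None, high_value=None):
    """Apply a PMML-style outlier treatment to a valid continuous value.

    Returns the (possibly clipped) value, or None to signal a missing value.
    """
    if outliers == "asIs":
        return value
    if low_value is None or high_value is None:
        # Assumption of this sketch: both bounds must be declared
        raise ValueError("low_value and high_value are required")
    if low_value <= value <= high_value:
        return value
    if outliers == "asMissingValues":
        return None
    if outliers == "asExtremeValues":
        # Clip to the nearest bound
        return low_value if value < low_value else high_value
    raise ValueError("Unknown outlier treatment: " + outliers)
```

For example, `treat_outlier(500000, "asExtremeValues", 2000, 400000)` clips the value to the upper bound, whereas `treat_outlier(200, "asMissingValues", 0, 168)` turns it into a missing value.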

Invalid values do not pass by default, because the default value of the invalidValueTreatment attribute is returnInvalid.

The behaviour where the model actively refuses to compute a prediction can be surprising to Scikit-Learn users. However, this should be seen as a feature, not a bug, because the objective is to inform upstream agents about data correctness and/or consistency issues (e.g. feature drift) and prevent downstream agents from taking action on dubious results.

The model can be forced to accept invalid values by changing the value of the invalidValueTreatment attribute to asIs. However, as every invalid value is "broken" in its own way, the computation may succeed or fail arbitrarily.

The recommended approach is to make the computation more controllable.

Invalid values may be replaced with a predefined valid value by changing the value of the invalidValueTreatment attribute to asIs, plus adding the (x-)invalidValueReplacement attribute (the x- prefix is required in PMML schema versions earlier than 4.4). Alternatively, they may be replaced with a missing value by changing the value of the invalidValueTreatment attribute to asMissing.

Missing values pass by default. By analogy with invalid value treatment, missing values can be rejected by changing the value of the missingValueTreatment attribute to (x-)returnInvalid, or replaced with a predefined valid value by adding the missingValueReplacement attribute.
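The invalid and missing value treatments described above can be put together in a small sketch. It follows the rules as stated in this post rather than the internals of any particular PMML engine; `treat_value` and `InvalidValueError` are hypothetical names, and `None` stands for a missing value.

```python
class InvalidValueError(ValueError):
    """Raised when the model refuses to compute a prediction."""

def treat_value(value, space,
                invalid_value_treatment="returnInvalid", invalid_value_replacement=None,
                missing_value_treatment=None, missing_value_replacement=None):
    """Apply MiningField-style treatments to a value in the given value space."""
    if space == "invalid":
        if invalid_value_treatment == "returnInvalid":
            raise InvalidValueError("Invalid value: %r" % (value,))
        elif invalid_value_treatment == "asIs":
            # Pass as-is, or swap in the declared replacement value
            return value if invalid_value_replacement is None else invalid_value_replacement
        elif invalid_value_treatment == "asMissing":
            # Fall through to the missing value treatment below
            value, space = None, "missing"
        else:
            raise ValueError("Unknown treatment: " + invalid_value_treatment)
    if space == "missing":
        if missing_value_treatment == "returnInvalid":
            raise InvalidValueError("Missing value")
        # Without a replacement the value stays missing (None)
        return missing_value_replacement
    return value
```

With default arguments, `treat_value(0, "invalid")` raises an error, while `treat_value(None, "missing")` lets the missing value through.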

Important: The IEEE 754 constant NaN ("Not a Number") is assigned to the invalid value space (not to the missing value space).

SkLearn2PMML domain decorator classes

The sklearn2pmml package provides several domain decorator classes for customizing the content of DataField and MiningField elements:

The PMML data type is derived from the Python data type, but it can be overridden using the dtype parameter. The operational type is derived from the location of the subclass in the class hierarchy.

If the training dataset contains masked missing values, then the value of the mask should be declared using the missing_values parameter.

For example, creating the stub of the above DataField element:

from sklearn2pmml.decoration import CategoricalDomain

domain = CategoricalDomain(dtype = int, missing_values = -999)

The valid value space cannot be set or overridden manually. It is collected and stored automatically whenever the Domain.fit(X, y) method is called.
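To make the "collected during fitting" idea concrete, here is a toy stand-in written in plain Python. The class `ToyCategoricalDomain` and its `valid_values_` attribute are invented for this sketch and do not mirror the actual sklearn2pmml implementation; the only point is that the valid value space is whatever remains of the training column after the masked missing values have been set aside.

```python
class ToyCategoricalDomain:
    """A toy illustration of a categorical domain decorator (not the sklearn2pmml class)."""

    def __init__(self, missing_values=None):
        # The mask that represents a missing value in the training dataset
        self.missing_values = missing_values

    def fit(self, X, y=None):
        # Collect distinct training values, skipping None and the missing value mask
        self.valid_values_ = sorted(
            {x for x in X if x is not None and x != self.missing_values}
        )
        return self

domain = ToyCategoricalDomain(missing_values=-999)
domain.fit([1, 2, -999, 3, 2, None])
print(domain.valid_values_)
```

The collected values correspond to the three Value elements without a property attribute in the DataField element shown earlier.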

The outlier treatment, invalid value treatment and missing value treatment follow PMML defaults, but they can be overridden using the corresponding parameters. Parameter names and values are derived from PMML attribute names and values by changing the format from lower camel case ("someValue") to lower underscore case ("some_value").
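The naming rule is mechanical enough to be expressed as a one-line regular expression. The helper `to_underscore_case` below is hypothetical and only illustrates the convention; it is not a function of the sklearn2pmml package.

```python
import re

def to_underscore_case(name):
    """Convert a lower camel case PMML name to lower underscore case."""
    # Put an underscore in front of every capital letter that follows
    # a lowercase letter or a digit, then lowercase that capital letter
    return re.sub(r"(?<=[a-z0-9])([A-Z])", lambda m: "_" + m.group(1).lower(), name)

for pmml_name in ["invalidValueTreatment", "missingValueReplacement", "asMissingValues", "asIs"]:
    print(pmml_name, "->", to_underscore_case(pmml_name))
```

Note that the rule has at least one exception: the PMML attribute is called outliers, but the corresponding parameter is called outlier_treatment.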

For example, making the default configuration explicit:

from sklearn2pmml.decoration import ContinuousDomain

domain = ContinuousDomain(
  outlier_treatment = "as_is", low_value = None, high_value = None,
  invalid_value_treatment = "return_invalid", invalid_value_replacement = None,
  missing_value_treatment = "as_is", missing_value_replacement = None
)

The Domain.transform(X) method uses all this information to prepare the dataset exactly the same way as any standards-compliant PMML engine would do.

Scikit-Learn examples

Domain decorator classes bring most value when working with heterogeneous datasets.

The simplest way to go about such workflows is to assemble a two-step pipeline, where the first step is either a sklearn_pandas.DataFrameMapper or sklearn.compose.ColumnTransformer meta-transformer for performing column-oriented feature engineering work, and the second step is an estimator:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

import pandas

df = pandas.read_csv("audit.csv")

mapper = DataFrameMapper([
  (["Income"], None),
  (["Employment"], OneHotEncoder())
])

classifier = DecisionTreeClassifier()

pipeline = PMMLPipeline([
  ("mapper", mapper),
  ("classifier", classifier)
])
pipeline.fit(df, df["Adjusted"])
pipeline.verify(df.sample(n = 10))

sklearn2pmml(pipeline, "pipeline.pmml")

Some guiding principles to follow when introducing domain decorators:

from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

mapper = DataFrameMapper([
  (["Income"], ContinuousDomain()),
  (["Employment"], [CategoricalDomain(), OneHotEncoder()])
])

Avoiding duplicate decorations:

from sklearn2pmml.decoration import Alias, MultiDomain
from sklearn2pmml.preprocessing import ExpressionTransformer, LookupTransformer

import numpy

employment_sector = {
  "Consultant" : "Private",
  "PSFederal" : "Public",
  "PSLocal" : "Public",
  "PSState" : "Public",
  "Private" : "Private",
  "SelfEmp" : "Private"
}

mapper = DataFrameMapper([
  (["Income"], ContinuousDomain()),
  (["Income", "Hours"], [MultiDomain([None, ContinuousDomain()]), Alias(ExpressionTransformer("X[0] / (X[1] * 52)", dtype = float), "Hourly_Income", prefit = True)]),
  (["Employment"], [CategoricalDomain(), OneHotEncoder()]),
  (["Employment"], [Alias(LookupTransformer(employment_sector, default_value = "Other"), "Employment_Sector", prefit = True), OneHotEncoder()])
])

In the above Python code, transformations have been grouped by input columns, and simple transformations ("Income", "Employment") have been moved in front of complex transformations ("Hourly_Income", "Employment_Sector"). The "Hours" column does not make a standalone appearance. It is decorated using the MultiDomain meta-decorator class when the data enters the "Hourly_Income" transformers list.

Restricting the range of valid values:

mapper = DataFrameMapper([
  (["Income"], ContinuousDomain(outlier_treatment = "as_extreme_values", low_value = 2000, high_value = 400000)),
  (["Income", "Hours"], [MultiDomain([None, ContinuousDomain(outlier_treatment = "as_missing_values", low_value = 0, high_value = 168, missing_value_treatment = "return_invalid", dtype = float)]), ..])
])

The "Income" column is restricted to [2000, 400000]. The "Hours" column is restricted to [0, 168], which represents the bounds of physical reality (number of hours in a week). Any value outside that range is replaced with a missing value in order to trigger its rejection using the returnInvalid missing value treatment.

Customizing the treatment of invalid and missing values:

mapper = DataFrameMapper([
  (["Income"], ContinuousDomain(invalid_value_treatment = "as_is")),
  (["Employment"], [CategoricalDomain(invalid_value_treatment = "as_missing", missing_value_replacement = "Private"), OneHotEncoder()])
])

Decision trees are quite robust towards input values that were not present in the training dataset. For example, continuous splits send the data record to the left or to the right by comparing the input value against the split threshold value. These decisions do not carry any weight (e.g. "weak left" vs. "strong right") that would depend on the distance between them.
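A continuous split boils down to a single comparison, which is why an unseen "Income" value is harmless here. The sketch below uses a hypothetical threshold value; the `route` helper is made up for illustration, and the "less than or equal goes left" convention is the one used by Scikit-Learn decision trees.

```python
def route(value, threshold):
    """Return the branch taken at a continuous split."""
    # Only the side of the threshold matters, not the distance to it
    return "left" if value <= threshold else "right"

threshold = 50000.0  # hypothetical split threshold on the "Income" column

# A value just above the threshold and an absurdly large one take the same branch
print(route(50000.01, threshold), route(10 ** 9, threshold))
```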

Invalid and missing value spaces are often merged for convenience reasons. No matter if the "Employment" column contains an invalid value or a missing value, it will be replaced with "Private" (the most frequent value in the training dataset).
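The merged treatment can be sketched in plain Python as follows. The training sample is a made-up toy list (not the actual "audit" dataset), and the `prepare` helper is hypothetical; it only illustrates that invalid and missing inputs end up with the same replacement value.

```python
from collections import Counter

# Toy training column (invented for illustration)
training = ["Private", "Consultant", "Private", "PSLocal", "Private", "SelfEmp"]

valid_values = set(training)
# The most frequent training value serves as the replacement
most_frequent = Counter(training).most_common(1)[0][0]

def prepare(value):
    """Replace missing (None) and invalid (unseen) values with the most frequent value."""
    if value is None or value not in valid_values:
        return most_frequent
    return value

print(prepare("PSLocal"), prepare("Volunteer"), prepare(None))
```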