Converting Scikit-Learn H2O pipelines to PMML

The h2o package provides Python language wrappers for H2O estimators. One and the same class can be used in standalone mode (i.e. the train-predict API) as well as in Scikit-Learn pipeline mode (i.e. the fit-predict API).

A Scikit-Learn H2O pipeline must address two extra challenges, both of which relate to bridging the gap between the “local” Scikit-Learn/Python environment and the “remote” H2O environment:

  1. Uploading training and testing datasets from local to remote.
  2. Downloading models from remote to local.


A Scikit-Learn H2O pipeline template:

from h2o import H2OFrame
from h2o.estimators.random_forest import H2ORandomForestEstimator
from sklearn.compose import ColumnTransformer
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing.h2o import H2OFrameConstructor

import h2o

# Connect to a running H2O cluster, or start a local one
h2o.init()

# X, y, cat_cols and cont_cols are assumed to be defined beforehand
pipeline = PMMLPipeline([
  ("initializer", ColumnTransformer(
    [(cat_col, CategoricalDomain(), [cat_col]) for cat_col in cat_cols] +
    [(cont_col, ContinuousDomain(), [cont_col]) for cont_col in cont_cols]
  )),
  ("uploader", H2OFrameConstructor()),
  ("classifier", H2ORandomForestEstimator())
])
pipeline.fit(X, H2OFrame(y.to_frame(), column_types = ["categorical"]))


The initializer step is a column mapper (meta-)transformer that captures the detailed description of the training dataset using SkLearn2PMML decorators.

The pipeline does not perform any data pre-processing transformations. Unlike Scikit-Learn modeling algorithms, the majority of H2O modeling algorithms can accept non-numeric columns as-is.

In fact, uncalled-for helper transformations may harm the predictive performance of a pipeline. For example, H2O decision tree algorithms can generate set-style categorical splits (“column in {values}”) for string columns. However, they fall back to binary indicator-style categorical splits (“column is value”) when the string column has been one-hot encoded into multiple integer columns.
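
The difference between the two split styles can be illustrated with a toy sketch in plain Python. This is not how H2O implements its trees; the function names and the color categories are made up for illustration. It only shows that a single set membership test does the work of a chain of per-column indicator tests.

```python
def set_style_split(value, left_categories):
    """Set-style split: a single membership test decides the branch."""
    return value in left_categories


def indicator_style_split(one_hot_row, left_categories):
    """Indicator-style splits: one "column is 1" test per one-hot column.

    Returns the branch decision together with the number of tests performed.
    """
    tests = 0
    for category in sorted(left_categories):
        tests += 1
        if one_hot_row[category] == 1:
            return True, tests
    return False, tests


def one_hot(value, categories):
    """Encode a string value as a dict of 0/1 indicator columns."""
    return {category: int(category == value) for category in categories}


# Hypothetical categorical column with four levels
categories = ["blue", "green", "red", "yellow"]
left_categories = {"blue", "red"}

for value in categories:
    row = one_hot(value, categories)
    goes_left, tests = indicator_style_split(row, left_categories)
    # Both styles agree on the branch; only the number of tests differs
    assert goes_left == set_style_split(value, left_categories)
    print(value, goes_left, tests)
```

In a real tree each extra test costs one level of depth, so the one-hot encoded variant needs a deeper tree to express the same partition.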

The sklearn2pmml package provides the sklearn2pmml.preprocessing.h2o.H2OFrameConstructor (meta-)transformer for uploading datasets from within Scikit-Learn pipelines.

Important: The data uploader step can be inserted at only one location in the pipeline: right between the last Scikit-Learn transformer step and the first H2O transformer (e.g. PCA, TF-IDF, Word2Vec) or model step. This is because the data upload changes the type of the X dataset from a Pandas data frame or Numpy array to an H2O data frame (i.e. the h2o.H2OFrame type), thereby making it unacceptable to Scikit-Learn estimators (other than passthrough transformers).

The PMMLPipeline.fit(X, y) method call runs the data pre-processing part on the local computer and the model training part on the remote H2O cluster, all as a single transaction.

The H2O model is a Java object that is tightly coupled to its parent H2O environment. It can be downloaded to the local computer for backup/archival purposes, either in Java serialization format (short-term storage) or in MOJO format (long-term storage, custom Java applications).
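
The placement rule can be turned into a quick sanity check over the list of pipeline steps. The sketch below is an assumption-laden illustration, not part of any library: the helper names are hypothetical, and a step is treated as "remote" simply because its class lives in a module under the h2o package. Stub objects stand in for the real steps so that the example runs without a cluster.

```python
def side_of(step):
    """Classify a pipeline step as 'passthrough', 'uploader', 'h2o' or 'local'."""
    if step is None or (isinstance(step, str) and step == "passthrough"):
        return "passthrough"
    cls = type(step)
    if cls.__name__ == "H2OFrameConstructor":
        return "uploader"
    # Heuristic (assumption): remote-side steps are defined under the h2o package
    if cls.__module__.split(".")[0] == "h2o":
        return "h2o"
    return "local"


def check_uploader_position(steps):
    """Return the index of the uploader step, or raise if the ordering is wrong."""
    sides = [side_of(step) for _, step in steps]
    if sides.count("uploader") != 1:
        raise ValueError("Expected exactly one uploader step")
    index = sides.index("uploader")
    if "h2o" in sides[:index]:
        raise ValueError("H2O steps must come after the uploader step")
    if "local" in sides[index + 1:]:
        raise ValueError("Local steps must come before the uploader step")
    return index


def make_stub(class_name, module_name):
    """Create an instance of a throwaway class that mimics a step's name and module."""
    return type(class_name, (), {"__module__": module_name})()


steps = [
    ("initializer", make_stub("ColumnTransformer", "sklearn.compose")),
    ("uploader", make_stub("H2OFrameConstructor", "sklearn2pmml.preprocessing.h2o")),
    ("classifier", make_stub("H2ORandomForestEstimator", "h2o.estimators.random_forest")),
]

print(check_uploader_position(steps))
```

Against a real pipeline the same function could be called as check_uploader_position(pipeline.steps) before fitting.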

Downloading the model in MOJO data format:

classifier = pipeline._final_estimator

mojo_path = "/path/to/"

classifier.download_mojo(path = mojo_path)


Any attempt to pickle a fitted H2OEstimator object fails with the following pickling error:

Traceback (most recent call last):
  File "", line 50, in <module>
    joblib.dump(pipeline, pkl_file)
_pickle.PicklingError: Can't pickle <class 'h2o.estimators.random_forest.H2ORandomForestEstimator'>: it's not the same object as h2o.estimators.random_forest.H2ORandomForestEstimator

The technical explanation is that the Python class definition of an H2O estimator gets modified during the fit(X, y) method call, by making it a subclass of various model extension classes. For example, the H2ORandomForestEstimator class gets added to the h2o.model.extensions.VariableImportance, h2o.model.extensions.Contributions and h2o.model.extensions.Fairness class hierarchies:

from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.model.extensions import VariableImportance

classifier = H2ORandomForestEstimator()

# False before fit
assert not isinstance(classifier, VariableImportance)

# X and y are assumed to be defined beforehand
classifier.fit(X, y)

# True after fit
assert isinstance(classifier, VariableImportance)
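
Why such a class change upsets pickling can be reproduced without H2O. The sketch below uses stand-in classes (ToyEstimator and a local VariableImportance, both hypothetical) and one plausible way of swapping the class at fit time; it is not H2O's actual code. Pickle stores classes by name, and refuses to proceed when the name no longer resolves to the very class object that the instance belongs to.

```python
import pickle


class VariableImportance:
    """Stand-in for a model extension class (not the H2O one)."""

    def varimp(self):
        return []


class ToyEstimator:
    """Stand-in estimator whose class gets swapped during fit()."""

    def fit(self, X, y):
        cls = type(self)
        # Build a new class with the same name that also inherits from the extension
        mixed = type(cls.__name__, (cls, VariableImportance), {})
        mixed.__module__ = cls.__module__
        mixed.__qualname__ = cls.__qualname__
        # Re-point the instance to the new class
        self.__class__ = mixed
        return self


def try_pickle(obj):
    """Return None on success, or the pickling error that was raised."""
    try:
        pickle.dumps(obj)
        return None
    except pickle.PicklingError as error:
        return error


estimator = ToyEstimator()
assert not isinstance(estimator, VariableImportance)

estimator.fit(X = None, y = None)
assert isinstance(estimator, VariableImportance)

# The instance's class is no longer the object that the name ToyEstimator resolves to
assert type(estimator) is not ToyEstimator

print(try_pickle(estimator))
```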

The workaround is to switch from pure Pickle to dill-flavoured Pickle data format:

import dill

with open("H2ORandomForestAudit.pkl", "wb") as pkl_file:
  dill.dump(pipeline, pkl_file)

The longevity and pervasive nature of the above pickling error suggest that this might be a deliberate restriction rather than a bug.

For reference, the H2O documentation does not place a direct veto on pickling. It advises that the only supported way of persisting fitted H2OEstimator objects is via the pair of h2o.download_model and h2o.upload_model utility functions:

import h2o

classifier = pipeline._final_estimator

h2o_backup_file = h2o.download_model(classifier, path = "/path/to/h2o_backup_dir")

classifier_clone = h2o.upload_model(h2o_backup_file)

Unfortunately, this advice falls short in the current case, as the H2OEstimator object is not a standalone entity, but is embedded into a much bigger object that belongs to a different language/application environment.


The JPMML-SkLearn library integrates seamlessly with other JPMML-family conversion libraries such as JPMML-H2O, JPMML-LightGBM, JPMML-StatsModels and JPMML-XGBoost.

The conversion of Scikit-Learn H2O pipelines is a bit more complicated than that of other cross-ML framework pipelines because of H2O’s inherent “local” vs. “remote” dichotomy.

The JPMML-SkLearn converter assumes that the input Pickle file contains a pipeline object in its most complete state. This assumption is violated in the case of H2O estimators, because their fitted state holds a reference to a model in a remote H2O cluster, rather than a fully-functional model itself.

The fix is to enhance the H2O estimator with MOJO information.

The JPMML-SkLearn library supports two estimator enhancement styles. First, smaller MOJO files can be read into an in-memory byte array, and assigned to the _mojo_bytes extension attribute:

classifier = pipeline._final_estimator

with open("/path/to/", "rb") as mojo_file:
  classifier._mojo_bytes = mojo_file.read()

Second, bigger MOJO files should be left where they are. The path to a backing MOJO file can be assigned to the _mojo_path extension attribute:

classifier = pipeline._final_estimator

classifier._mojo_path = "/path/to/"
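
The choice between the two styles can be wrapped into a small helper. The sketch below is a hypothetical convenience function, not part of SkLearn2PMML or JPMML-SkLearn; in particular, the 16 MB cut-off between "smaller" and "bigger" files is an arbitrary assumption. A plain namespace object and a dummy file stand in for the fitted estimator and the MOJO file.

```python
import os
import tempfile
from types import SimpleNamespace


def attach_mojo(estimator, mojo_path, max_inline_bytes = 16 * 1024 * 1024):
    """Attach MOJO information to an estimator, choosing the style by file size."""
    if os.path.getsize(mojo_path) <= max_inline_bytes:
        # Small file: embed the content as an in-memory byte array
        with open(mojo_path, "rb") as mojo_file:
            estimator._mojo_bytes = mojo_file.read()
    else:
        # Big file: leave it in place and record only its location
        estimator._mojo_path = mojo_path
    return estimator


# Demonstration with a stand-in object and a dummy file (not a real MOJO)
with tempfile.TemporaryDirectory() as tmp_dir:
    dummy_path = os.path.join(tmp_dir, "dummy.bin")
    with open(dummy_path, "wb") as dummy_file:
        dummy_file.write(b"\x00" * 1024)

    small = attach_mojo(SimpleNamespace(), dummy_path)
    big = attach_mojo(SimpleNamespace(), dummy_path, max_inline_bytes = 512)

print(hasattr(small, "_mojo_bytes"), hasattr(big, "_mojo_path"))
```

With the path-based style, the backing file must still exist at the recorded location when the converter runs.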

The good news is that starting from SkLearn2PMML version 0.95.0, the sklearn2pmml.sklearn2pmml utility function takes full care of all the above estimator enhancement and dill-flavoured pickling details.

After a package update, the workflow simplifies back to the canonical one:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

import h2o


pipeline = PMMLPipeline(...)
pipeline.fit(X, y)

# The call must happen while connected to a remote cluster
sklearn2pmml(pipeline, "H2ORandomForestAudit.pmml")