The h2o package provides Python language wrappers for H2O.ai estimators.
One and the same class can be used in standalone mode (i.e. the train-predict API) as well as in Scikit-Learn pipeline mode (i.e. the fit-predict API).
A Scikit-Learn H2O.ai pipeline must address two extra challenges, which relate to bridging the gap between the “local” Scikit-Learn/Python environment and the “remote” H2O.ai/Java environment:
- Uploading training and testing datasets from local to remote.
- Downloading models from remote to local.
Training
A Scikit-Learn H2O.ai pipeline template:
from h2o import H2OFrame
from h2o.estimators.random_forest import H2ORandomForestEstimator
from sklearn.compose import ColumnTransformer
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml.preprocessing.h2o import H2OFrameConstructor
import h2o
h2o.init()
pipeline = PMMLPipeline([
    ("initializer", ColumnTransformer(
        [(cat_col, CategoricalDomain(), [cat_col]) for cat_col in cat_cols] +
        [(cont_col, ContinuousDomain(), [cont_col]) for cont_col in cont_cols]
    )),
    ("uploader", H2OFrameConstructor()),
    ("classifier", H2ORandomForestEstimator())
])
pipeline.fit(X, H2OFrame(y.to_frame(), column_types = ["categorical"]))
h2o.shutdown()
The initializer step is a column mapper (meta-)transformer that captures the detailed description of the training dataset using SkLearn2PMML decorators.
The pipeline does not perform any data pre-processing transformations. Unlike Scikit-Learn modeling algorithms, the majority of H2O.ai modeling algorithms can accept non-numeric columns as-is.
In fact, uncalled-for helper transformations may harm the predictive performance of a pipeline.
For example, H2O.ai decision tree algorithms can generate set-style categorical splits.
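The advantage of set-style categorical splits can be illustrated in plain Python, without any H2O.ai code. The column and category names below are made up for illustration; the point is that a single set-membership test replaces a chain of one-hot indicator tests:

```python
# Toy illustration (pure Python, no H2O.ai); the "employment" categories are made up.
# A set-style categorical split routes several categories down one branch
# with a single membership test:
def set_split(employment):
    return "left" if employment in {"Private", "Consultant", "SelfEmp"} else "right"

# After one-hot encoding, the same decision costs one binary split per category:
def one_hot_splits(is_private, is_consultant, is_selfemp):
    if is_private:
        return "left"
    if is_consultant:
        return "left"
    if is_selfemp:
        return "left"
    return "right"

assert set_split("Consultant") == one_hot_splits(0, 1, 0) == "left"
assert set_split("Academic") == one_hot_splits(0, 0, 0) == "right"
```

A tree that must emulate the set-style split over one-hot encoded columns grows deeper and fragments the training data across more nodes, which is one way uncalled-for transformations can hurt predictive performance.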
The sklearn2pmml package provides the sklearn2pmml.preprocessing.h2o.H2OFrameConstructor (meta-)transformer for uploading datasets from within Scikit-Learn pipelines.
Important: The data uploader step can be inserted only into one specific location in the pipeline - right between the last Scikit-Learn transformer step and the first H2O.ai transformer (e.g. PCA, TF-IDF, Word2Vec) or model step.
This stems from the fact that the data upload changes the type of the X dataset from a Pandas data frame or a Numpy array to an H2O.ai data frame (i.e. the h2o.H2OFrame type), thereby making it unacceptable to Scikit-Learn estimators (other than passthrough transformers).
The (PMML)Pipeline.fit(X, y) method call runs the data pre-processing part on the local computer and the model training part on the remote H2O.ai cluster - all as a single transaction.
The model is a Java object that is tightly coupled to its parent H2O.ai/Java environment. It can be downloaded to the local computer for backup/archival purposes in Java serialization (short-term storage) or MOJO (long-term storage, custom Java applications) data formats.
Downloading the model in MOJO data format:
classifier = pipeline._final_estimator
mojo_path = "/path/to/rf.mojo.zip"
classifier.download_mojo(path = mojo_path)
Persistence
Any attempt to pickle a fitted H2OEstimator object will fail with the following pickling error:
Traceback (most recent call last):
File "train.py", line 50, in <module>
joblib.dump(pipeline, pkl_file)
...
_pickle.PicklingError: Can't pickle <class 'h2o.estimators.random_forest.H2ORandomForestEstimator'>: it's not the same object as h2o.estimators.random_forest.H2ORandomForestEstimator
The technical explanation is that the Python class definition of an H2O.ai estimator gets modified during the H2OEstimator.fit(X, y) method call, by making it a subclass of various model extension classes. For example, the H2ORandomForestEstimator class gets added to the h2o.model.extensions.VariableImportance, h2o.model.extensions.Contributions and h2o.model.extensions.Fairness class hierarchies:
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.model.extensions import VariableImportance
classifier = H2ORandomForestEstimator()
# False before fit
assert not isinstance(classifier, VariableImportance)
classifier.fit(X, y)
# True after fit
assert isinstance(classifier, VariableImportance)
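The failure mode can be reproduced in pure Python, without an H2O.ai cluster. The sketch below uses made-up module and class names; it swaps an instance's class for a dynamically created subclass, which breaks pickling-by-reference in the same way:

```python
import pickle
import sys
import types

# Register a throwaway module that plays the role of h2o.estimators.random_forest
mod = types.ModuleType("fake_h2o")
sys.modules["fake_h2o"] = mod

class Estimator:
    pass

Estimator.__module__ = "fake_h2o"
Estimator.__qualname__ = "Estimator"
mod.Estimator = Estimator

class VariableImportance:
    pass

estimator = Estimator()
# Simulate the fit-time class swap: the instance now belongs to a dynamically
# created subclass that shares the original class' name, but is a different object
estimator.__class__ = type("Estimator", (Estimator, VariableImportance),
    {"__module__": "fake_h2o"})

try:
    pickle.dumps(estimator)
except pickle.PicklingError as e:
    print(e)  # complains that the class is "not the same object" as fake_h2o.Estimator
```

Pickle serializes classes by reference - it looks the class up by module and name, and refuses to proceed when the looked-up class is not the very same object as the instance's class.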
The workaround is to switch from pure Pickle to dill-flavoured Pickle data format:
import dill
with open("H2ORandomForestAudit.pkl", "wb") as pkl_file:
    dill.dump(pipeline, pkl_file)
The longevity and pervasive nature of the above pickling error suggest that this might be a deliberate restriction rather than a bug.
For reference, the H2O.ai documentation does not place a direct veto on pickling. It advises that the only supported way of persisting fitted H2OEstimator objects is via the pair of h2o.download_model and h2o.upload_model utility functions:
import h2o
classifier = pipeline._final_estimator
h2o_backup_file = h2o.download_model(classifier, path = "/path/to/h2o_backup_dir")
classifier_clone = h2o.upload_model(h2o_backup_file)
Unfortunately, this advice falls short in the current case, as the H2OEstimator object is not a standalone entity, but is embedded into a much bigger object belonging to a different language/application environment.
Conversion
The JPMML-SkLearn library integrates seamlessly with other JPMML-family conversion libraries such as JPMML-H2O, JPMML-LightGBM, JPMML-StatsModels and JPMML-XGBoost.
The conversion of Scikit-Learn H2O.ai pipelines is a bit more complicated than other cross-ML framework pipelines because of H2O.ai’s inherent “local” vs. “remote” dichotomy.
The JPMML-SkLearn converter assumes that the input Pickle file contains a pipeline object in its most complete state. This assumption is violated in the case of H2O.ai estimators, because their fitted state holds a reference to a model in a remote H2O.ai cluster, rather than a fully-functional model itself.
The fix is to enhance the H2O.ai estimator with MOJO information.
The JPMML-SkLearn library supports two H2O.ai estimator enhancement styles.
First, smaller MOJO files can be read into an in-memory byte array, and assigned to the _mojo_bytes extension attribute:
classifier = pipeline._final_estimator
with open("/path/to/rf.mojo.zip", "rb") as mojo_file:
    classifier._mojo_bytes = mojo_file.read()
Second, bigger MOJO files should be left where they are.
The path to a backing MOJO file can be assigned to the _mojo_path extension attribute:
classifier = pipeline._final_estimator
classifier._mojo_path = "/path/to/rf.mojo.zip"
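The two styles can be combined into a small helper that chooses the enhancement style based on the MOJO file size. This helper is not part of any package, and the 16 MB cut-off is an arbitrary assumption:

```python
import os

# Hypothetical helper, not part of SkLearn2PMML: embed small MOJO files
# into the estimator, reference big ones by path.
# The 16 MB inline_limit default is an arbitrary assumption
def attach_mojo(estimator, mojo_path, inline_limit = 16 * 1024 * 1024):
    if os.path.getsize(mojo_path) <= inline_limit:
        # Small MOJO file - read its bytes into the estimator object
        with open(mojo_path, "rb") as mojo_file:
            estimator._mojo_bytes = mojo_file.read()
    else:
        # Big MOJO file - leave it in place, keep only a path reference
        estimator._mojo_path = mojo_path
```

A call such as `attach_mojo(pipeline._final_estimator, "/path/to/rf.mojo.zip")` would then replace the manual attribute assignments shown above.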
The good news is that, starting from SkLearn2PMML version 0.95.0, the sklearn2pmml.sklearn2pmml utility function takes full care of all the above H2O.ai estimator enhancement and flavoured pickling details.
After a package update, the workflow simplifies back to the canonical one:
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
import h2o
h2o.init()
pipeline = PMMLPipeline(...)
pipeline.fit(X, y)
# The call must happen while connected to a remote H2O.ai cluster
sklearn2pmml(pipeline, "H2ORandomForestAudit.pmml")
h2o.shutdown()
Resources
Related blog posts
- 2022-11-11: Training Scikit-Learn H2O.ai pipelines