Training Scikit-Learn H2O.ai pipelines

H2O.ai is an alternative ML framework, which is built with maximum in-memory scaling properties in mind. Scikit-Learn users may find H2O.ai interesting when working with datasets that exceed the logical and physical limits of a desktop computer.

Sure, it is often possible to ignore the problem for extended periods of time by renting a bigger and faster computer. But the fact remains that Scikit-Learn is not a “Big Data”-oriented ML framework by design, and some new tools and new ways of doing things are likely to yield much better results.

Estimator upgrade from Scikit-Learn to H2O.ai

H2O.ai is written in Java, and typically runs in a managed server (whether on-premise or “cloud”). Python users can interact with an H2O.ai server using the h2o package.

Every H2O.ai algorithm is encapsulated into an h2o.estmators.H2OEstimator subclass:

Scikit-Learn class	H2O.ai class
`sklearn.cluster.KMeans`	`h2o.estimators.kmeans.H2OKMeansEstimator`
`sklearn.ensemble.IsolationForest`	`h2o.estimators.isolation_forest.H2OIsolationForestEstimator`
`sklearn.ensemble.GradientBoostingClassifier`, `GradientBoostingRegressor`	`h2o.estimators.gbm.H2OGradientBoostingEstimator`
`sklearn.ensemble.RandomForestClassifier`, `RandomForestRegressor`	`h2o.estimators.random_forest.H2ORandomForestEstimator`
`sklearn.isotonic.IsotonicRegression`	`h2o.estimators.isotonicregression.H2OIsotonicRegressionEstimator`
`sklearn.linear_model.LinearRegression`, `LogisticRegression`	`h2o.estimators.glm.H2OGeneralizedLinearEstimator`
`sklearn.naive_bayes.GaussianNB`	`h2o.estimators.naive_bayes.H2ONaiveBayesEstimator`
`sklearn.svm.SVC`, `SVR`	`h2o.estimators.psvm.H2OSupportVectorMachineEstimator`

According to H2O.ai modeling documentation, all these classes have Scikit-Learn style fit(X, y) and predict(X) methods readily available.

This suggests that it should be possible to insert H2O.ai models into standard Scikit-Learn pipelines:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from sklearn.pipeline import Pipeline

import pandas

df = pandas.read_csv("audit.csv")

X = df[[column for column in df.columns if column != "Adjusted"]]
y = df["Adjusted"]

pipeline = Pipeline([
  ("classifier", H2OGeneralizedLinearEstimator())
])
pipeline.fit(X, y)

The Pipeline constructor succeeds, but the subsequent Pipeline.fit(X, y) method call raises a rather obscure attribute error:

Traceback (most recent call last):
  File "train.py", line 23, in <module>
    pipeline.fit(X, y)
  File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.9/site-packages/h2o/estimators/estimator_base.py", line 481, in fit
    training_frame = X.cbind(y) if y is not None else X
  File "/usr/local/lib/python3.9/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'cbind'

Manual data upload to H2O.ai server

Closer inspection of the H2OEstimator.fit(X, y) method signature reveals that it expects both X and y arguments to be of h2o.H2OFrame type.

In spite of many API similarities, pandas.DataFrame and h2o.H2OFrame classes are not related in any way, and cannot be used interchangeably in Python code. The former represents local data, which is stored in local computer memory as “live” NumPy arrays. The latter represents a reference to remote data, which is stored in H2O.ai server memory.

An H2OFrame object can be constructed from, and be deconstruced back to a DataFrame object. However, these operations are rather slow, because they involve transmitting the complete dataset between local and remote machines:

from h2o import H2OFrame

import h2o
import numpy
import pandas

h2o.init()

pandas_df = pandas.read_csv(...)

# Uploads to H2O.ai server
h2o_df = H2OFrame(pandas_df)

# Downloads from H2O.ai server
pandas_h2o_df = h2o_df.as_data_frame()

print("All values equal after round-trip: {}".format(numpy.all(pandas_df == pandas_h2o_df)))

Important: All operations that interact with an H2OFrame object (whether directly or indirectly) require an active connection to an H2O.ai server.

Getting the standard Scikit-Learn pipeline to fit:

pipeline = Pipeline([
  ("classifier", H2OGeneralizedLinearEstimator())
])

h2o_X = H2OFrame(X)
h2o_y = H2OFrame(y.to_frame(), column_types = ["categorical"])

pipeline.fit(h2o_X, h2o_y)

Semi-automated data upload to H2O.ai server

If the pipeline contains any transformer steps, then the manual data upload approach will not work, because Scikit-Learn transformer classes do not support data container types other than numpy.ndarray and pandas.DataFrame.

In practice, most transformers interact with the X argument of the TransformerMixin.fit_transform(X, y) method, but not with the y argument. It means that the X dataset cannot be uploaded until the last transformer step has completed, whereas the y dataset can be uploaded right away.

The h2o package does not provide a meta-transformer class for data uploading. This gap is filled by the sklearn2pmml package in the form of the sklearn2pmml.preprocessing.h2o.H2OFrameConstructor class.

The best place for inserting a data upload step is right before the final H2O.ai model step:

from sklearn2pmml.preprocessing.h2o import H2OFrameConstructor

pipeline = Pipeline([
  ("transformer", ...),
  ("uploader", H2OFrameConstructor()),
  ("classifier", H2OGeneralizedLinearEstimator())
])
pipeline.fit(X, H2OFrame(y.to_frame(), column_types = ["categorical"]))

H2O.ai pipeline persistence

Scikit-Learn developers recommend using Python’s built-in pickle data format for short-term persistence needs:

import pickle

with open("pipeline.pkl", "wb") as f:
  pickle.dump(pipeline, f)

Unfitted H2OEstimator objects can be pickled and unpickled freely.

However, any attempt to pickle a fitted H2OEstimator object shall fail with the following pickling error in latest H2O.ai versions (at the time of writing this (November 2022), eg. 3.34.0.8, 3.36.1.5, 3.38.0.2):

Traceback (most recent call last):
  File "main.py", line 4, in <module>
    pickle.dump(pipeline, f)
_pickle.PicklingError: Can't pickle <class 'h2o.estimators.glm.H2OGeneralizedLinearEstimator'>: it's not the same object as h2o.estimators.glm.H2OGeneralizedLinearEstimator

The pickling will work if the H2O.ai version is downgraded to 3.32.1.7 or older.

Resources

Python script: train.py