Training Scikit-Learn H2O pipelines

H2O is an alternative ML framework, which is built with maximum in-memory scalability in mind. Scikit-Learn users may find it interesting when working with datasets that exceed the logical and physical limits of a desktop computer.

Sure, it is often possible to ignore the problem for extended periods of time by renting a bigger and faster computer. But the fact remains that Scikit-Learn is not a “Big Data”-oriented ML framework by design, and some new tools and new ways of doing things are likely to yield much better results.

Estimator upgrade from Scikit-Learn to H2O

H2O is written in Java, and typically runs in a managed server (whether on-premise or “cloud”). Python users can interact with an H2O server using the h2o package.

Every H2O algorithm is encapsulated into an h2o.estimators.H2OEstimator subclass:

Scikit-Learn class(es) → H2O class:

sklearn.cluster.KMeans → h2o.estimators.kmeans.H2OKMeansEstimator
sklearn.ensemble.IsolationForest → h2o.estimators.isolation_forest.H2OIsolationForestEstimator
sklearn.ensemble.GradientBoostingClassifier, GradientBoostingRegressor → h2o.estimators.gbm.H2OGradientBoostingEstimator
sklearn.ensemble.RandomForestClassifier, RandomForestRegressor → h2o.estimators.random_forest.H2ORandomForestEstimator
sklearn.isotonic.IsotonicRegression → h2o.estimators.isotonicregression.H2OIsotonicRegressionEstimator
sklearn.linear_model.LinearRegression, LogisticRegression → h2o.estimators.glm.H2OGeneralizedLinearEstimator
sklearn.naive_bayes.GaussianNB → h2o.estimators.naive_bayes.H2ONaiveBayesEstimator
sklearn.svm.SVC, SVR → h2o.estimators.psvm.H2OSupportVectorMachineEstimator

According to H2O modeling documentation, all these classes have Scikit-Learn style fit(X, y) and predict(X) methods readily available.

This suggests that it should be possible to insert H2O models into standard Scikit-Learn pipelines:

from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from sklearn.pipeline import Pipeline

import pandas

df = pandas.read_csv("audit.csv")

X = df[[column for column in df.columns if column != "Adjusted"]]
y = df["Adjusted"]

pipeline = Pipeline([
  ("classifier", H2OGeneralizedLinearEstimator())
])
pipeline.fit(X, y)

The Pipeline constructor succeeds, but the subsequent pipeline.fit(X, y) method call raises a rather obscure attribute error:

Traceback (most recent call last):
  File "", line 23, in <module>
    pipeline.fit(X, y)
  File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 382, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.9/site-packages/h2o/estimators/estimator_base.py", line 481, in fit
    training_frame = X.cbind(y) if y is not None else X
  File "/usr/local/lib/python3.9/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'cbind'
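The error is a symptom of a type mismatch rather than a bug: the fit method assumes that X is an h2o.H2OFrame, which has a cbind method for column-binding, whereas a pandas.DataFrame has no such attribute. A quick check (pandas only, no H2O server needed):

```python
import pandas

df = pandas.DataFrame({"x": [1, 2, 3]})

# H2OFrame offers a cbind method for column-binding; pandas.DataFrame
# does not, so the `X.cbind(y)` call falls through to __getattr__ and fails
print(hasattr(df, "cbind"))
```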

Manual data upload to H2O server

Closer inspection of the H2OEstimator.fit(X, y) method signature reveals that it expects both X and y arguments to be of h2o.H2OFrame type.

In spite of many API similarities, pandas.DataFrame and h2o.H2OFrame classes are not related in any way, and cannot be used interchangeably in Python code. The former represents local data, which is stored in local computer memory as “live” NumPy arrays. The latter represents a reference to remote data, which is stored in server memory.

An H2OFrame object can be constructed from, and be deconstructed back to a DataFrame object. However, these operations are rather slow, because they involve transmitting the complete dataset between local and remote machines:

from h2o import H2OFrame

import h2o
import numpy
import pandas

# Connect to a running H2O server (launches a local one if none is found)
h2o.init()

pandas_df = pandas.read_csv(...)

# Uploads to server
h2o_df = H2OFrame(pandas_df)

# Downloads from server
pandas_h2o_df = h2o_df.as_data_frame()

print("All values equal after round-trip: {}".format(numpy.all(pandas_df == pandas_h2o_df)))

Important: All operations that interact with an H2OFrame object (whether directly or indirectly) require an active connection to an H2O server.

Getting the standard Scikit-Learn pipeline to fit:

pipeline = Pipeline([
  ("classifier", H2OGeneralizedLinearEstimator())
])

h2o_X = H2OFrame(X)
h2o_y = H2OFrame(y.to_frame(), column_types = ["categorical"])

pipeline.fit(h2o_X, h2o_y)

Semi-automated data upload to H2O server

If the pipeline contains any transformer steps, then the manual data upload approach will not work, because Scikit-Learn transformer classes do not support data container types other than numpy.ndarray and pandas.DataFrame.
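To see this in action, one can feed an arbitrary non-local data container into a stock transformer. RemoteFrame below is a made-up stand-in for a remote container such as h2o.H2OFrame; Scikit-Learn's input validation rejects it during coercion:

```python
from sklearn.preprocessing import StandardScaler

class RemoteFrame:
    """Made-up stand-in for a remote data container such as h2o.H2OFrame."""
    pass

try:
    # Scikit-Learn's input validation coerces X via numpy.asarray,
    # which cannot make sense of an arbitrary object
    StandardScaler().fit(RemoteFrame())
except Exception as e:
    print(type(e).__name__)
```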

In practice, most transformers interact with the X argument of the TransformerMixin.fit_transform(X, y) method, but not with the y argument. This means that the X dataset cannot be uploaded until the last transformer step has completed, whereas the y dataset can be uploaded right away.

The h2o package does not provide a meta-transformer class for data uploading. This gap is filled by the sklearn2pmml package in the form of the sklearn2pmml.preprocessing.h2o.H2OFrameConstructor class.
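Conceptually, such a meta-transformer boils down to a stateless step that converts whatever it receives at its position in the pipeline. A minimal sketch of the idea (FrameUploader and its convert_func parameter are made up for illustration, and a stand-in conversion is used so the example runs without an H2O server; the real H2OFrameConstructor should be preferred in practice):

```python
import pandas

from sklearn.base import BaseEstimator, TransformerMixin

class FrameUploader(BaseEstimator, TransformerMixin):
    """Stateless meta-transformer that converts the X dataset,
    eg. pandas.DataFrame -> h2o.H2OFrame."""

    def __init__(self, convert_func):
        self.convert_func = convert_func

    def fit(self, X, y = None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        return self.convert_func(X)

# Stand-in conversion: tag the frame instead of uploading it;
# in real use, convert_func would be h2o.H2OFrame
uploader = FrameUploader(lambda X: X.assign(uploaded = True))

df = pandas.DataFrame({"x": [1, 2, 3]})
print(list(uploader.fit_transform(df).columns))
```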

The best place for inserting a data upload step is right before the final model step:

from sklearn2pmml.preprocessing.h2o import H2OFrameConstructor

pipeline = Pipeline([
  ("transformer", ...),
  ("uploader", H2OFrameConstructor()),
  ("classifier", H2OGeneralizedLinearEstimator())
])
pipeline.fit(X, H2OFrame(y.to_frame(), column_types = ["categorical"]))

Pipeline persistence

Scikit-Learn developers recommend using Python’s built-in pickle data format for short-term persistence needs:

import pickle

with open("pipeline.pkl", "wb") as f:
  pickle.dump(pipeline, f)

Unfitted H2OEstimator objects can be pickled and unpickled freely.

However, any attempt to pickle a fitted H2OEstimator object fails with the following pickling error in the latest h2o package versions (at the time of writing, November 2022):

Traceback (most recent call last):
  File "", line 4, in <module>
    pickle.dump(pipeline, f)
_pickle.PicklingError: Can't pickle <class 'h2o.estimators.glm.H2OGeneralizedLinearEstimator'>: it's not the same object as h2o.estimators.glm.H2OGeneralizedLinearEstimator

The pickling will work if the h2o package is downgraded to an older version.