The sklearn2pmml package provides Scikit-Learn style wrappers for StatsModels classification, ordinal classification and regression models:
- sklearn2pmml.statsmodels.StatsModelsClassifier
- sklearn2pmml.statsmodels.StatsModelsOrdinalClassifier
- sklearn2pmml.statsmodels.StatsModelsRegressor
Wrapper classes are maximally generic by design. For example, the StatsModelsRegressor wrapper can accommodate any StatsModels regression model, from any StatsModels version.
This genericity is achieved using the Python arbitrary keyword arguments (aka **kwargs) mechanism. Wrapper class methods accept any keyword arguments. They get packed into a single dict-type helper param, which is then dispatched to the right StatsModels model method at the right time.
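For illustration, a minimal sketch of this mechanism in action (assuming an existing X and y dataset; the fit_method, alpha and L1_wt keyword arguments are the same ones that appear later in this post):
from sklearn2pmml.statsmodels import StatsModelsRegressor
from statsmodels.api import OLS

regressor = StatsModelsRegressor(OLS, fit_intercept = True)
# The extra keyword arguments are not declared anywhere; they are packed into a dict-type
# helper param and forwarded to the selected StatsModels fit method (here OLS.fit_regularized)
regressor.fit(X, y, fit_method = "fit_regularized", alpha = 0.1, L1_wt = 1)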
Such a “param aggregation” approach works fine with most common Scikit-Learn workflows. However, it falls short with advanced workflows, where individual Scikit-Learn estimators need to be queried and/or updated on a param-by-param basis.
The most prominent example is hyperparameter tuning.
Training
A hyperparameter tuning workflow can be summarized as follows (see the code sketch after the list):
- The end user provides a template estimator object and declares all its tunable params.
- The tuner makes a clone of the template estimator object and updates the initial values of one or more tunable params with new values.
- The tuner scores the updated estimator object. The updated estimator object is kept if it out-scores the previous best estimator object, and is discarded otherwise.
- The tuner repeats stages #2 and #3 until the stop criterion is met.
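In rough Python terms, the loop could be sketched like this (candidate_params, template, scorer, X and y are assumed to exist):
from sklearn.base import clone

best_score, best_estimator = None, None
for params in candidate_params:
    # Stage 2: fabricate an unfitted copy of the template, and update its tunable params
    candidate = clone(template).set_params(**params)
    # Stage 3: fit and score the candidate, keep it if it out-scores the current best
    candidate.fit(X, y)
    score = scorer(candidate, X, y)
    if (best_score is None) or (score > best_score):
        best_score, best_estimator = score, candidate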
In the second workflow stage, candidate Scikit-Learn estimators are fabricated using the sklearn.base.clone utility function. The fabrication algorithm constructs a new unfitted estimator object, irrespective of whether the “template” was a fitted or unfitted estimator object (ie. “selective clone” rather than “full clone” semantics).
The tunable param set is determined via the get_params method. The default implementation returns the constructor params of the estimator class.
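Both behaviours can be verified with a couple of lines of code (a sketch; the printed params are the wrapper's two constructor params, which are discussed next):
from sklearn.base import clone
from sklearn2pmml.statsmodels import StatsModelsRegressor
from statsmodels.api import OLS

template = StatsModelsRegressor(OLS, fit_intercept = True)
# The clone is always a new, unfitted estimator object
candidate = clone(template)
# The tunable param set equals the constructor params of the estimator class
print(candidate.get_params())
# Prints {'fit_intercept': True, 'model_class': <class 'statsmodels.regression.linear_model.OLS'>}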
Wrapper classes declare two named params, model_class and fit_intercept. They are typically set by a human expert, depending on the nature of the modeling task.
However, nothing technically prevents varying them via grid search in order to conduct a quick AutoML-like StatsModels model selection experiment:
from sklearn.model_selection import GridSearchCV
from sklearn2pmml.statsmodels import StatsModelsRegressor
from statsmodels.api import GLM, OLS, WLS

# make_statsmodels_pipeline is the helper from the "Training Scikit-Learn StatsModels pipelines"
# blog post; it yields a pipeline whose final step is named "regressor"
pipeline = make_statsmodels_pipeline(StatsModelsRegressor(OLS))
ctor_params_grid = {
    "regressor__model_class" : [GLM, OLS, WLS],
    "regressor__fit_intercept" : [True, False]
}
tuner = GridSearchCV(pipeline, param_grid = ctor_params_grid, verbose = 3)
tuner.fit(X, y)
print(tuner.best_estimator_)
Any attempt to call the GridSearchCV.fit(X, y) method with an expanded param grid fails with a ValueError, because the wrapper class does not declare the extra params.
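For example, extending the above param grid with the alpha param of the OLS.fit_regularized method (a sketch of the failure mode; the alpha values are hypothetical):
expanded_params_grid = {
    "regressor__model_class" : [GLM, OLS, WLS],
    "regressor__fit_intercept" : [True, False],
    # Not declared as a constructor param of the StatsModelsRegressor class
    "regressor__alpha" : [0.01, 0.1, 1.0]
}
tuner = GridSearchCV(pipeline, param_grid = expanded_params_grid, verbose = 3)
# Fails with a ValueError ("Invalid parameter ...")
tuner.fit(X, y)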
Task-specific subclassing
The canonical approach to enabling tunable params is to declare them one by one as constructor params, and to assign them to instance attributes (with exactly the same name) in the constructor body.
For example, defining a wrapper subclass for tuning the alpha and L1_wt params of the OLS.fit_regularized method:
class TunableStatsModelsRegressor(StatsModelsRegressor):

    def __init__(self, model_class, fit_intercept = True, alpha = 0.01, L1_wt = 1, **init_params):
        super(TunableStatsModelsRegressor, self).__init__(model_class = model_class, fit_intercept = fit_intercept, **init_params)
        self.alpha = alpha
        self.L1_wt = L1_wt

    def fit(self, X, y, **fit_params):
        super(TunableStatsModelsRegressor, self).fit(X, y, alpha = self.alpha, L1_wt = self.L1_wt, **fit_params)
        return self
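The two extra params are now self-reported via the get_params method, which makes them visible to the tuner (a quick check; the printed output is approximate):
regressor = TunableStatsModelsRegressor(OLS)
print(regressor.get_params())
# Prints {'L1_wt': 1, 'alpha': 0.01, 'fit_intercept': True, 'model_class': <class 'statsmodels.regression.linear_model.OLS'>}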
Task-agnostic subclassing
Code reusability can be improved by aggregating all tunable params into a single dict-type helper param.
However, this approach qualifies as a hack, because it takes advantage of the fact that the tuner is willing to perform update operations also on virtual params. Such a loophole exists in Scikit-Learn versions 1.0.X through 1.2.X. It may be closed in newer versions.
class TunableStatsModelsRegressor(StatsModelsRegressor):

    def __init__(self, model_class, fit_intercept = True, tune_params = {}, **init_params):
        super(TunableStatsModelsRegressor, self).__init__(model_class = model_class, fit_intercept = fit_intercept, **init_params)
        self.tune_params = tune_params

    def set_params(self, **params):
        # Dispatch self-reported params to the superclass, and collect all remaining params into the tune_params dict
        super_params = dict([(k, params.pop(k)) for k, v in dict(**params).items() if k in ["model_class", "fit_intercept", "tune_params"]])
        super(TunableStatsModelsRegressor, self).set_params(**super_params)
        setattr(self, "tune_params", dict(**params))
        return self

    def fit(self, X, y, **fit_params):
        super(TunableStatsModelsRegressor, self).fit(X, y, **self.tune_params, **fit_params)
        return self
The above class self-reports the model_class, fit_intercept and tune_params params, but accepts update operations on any param.
This “discrepancy” is powered by a custom set_params
method.
Update operations that target self-reported params are dispatched to the superclass’ set_params
method.
All other update operations are treated as Python item assignment operations against the tune_params
param.
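A quick demonstration of this dispatch logic (a sketch):
regressor = TunableStatsModelsRegressor(OLS)
# The fit_intercept param is self-reported, whereas alpha and L1_wt are virtual params
regressor.set_params(fit_intercept = False, alpha = 0.1, L1_wt = 0.5)
print(regressor.fit_intercept)
# Prints False
print(regressor.tune_params)
# Prints {'alpha': 0.1, 'L1_wt': 0.5}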
The combined StatsModels model selection and model hyperparameter tuning experiment succeeds with either flavour of TunableStatsModelsRegressor
:
from scipy.stats import loguniform, uniform

pipeline = make_statsmodels_pipeline(TunableStatsModelsRegressor(OLS))
ctor_params_grid = {
    "regressor__fit_intercept" : [True, False]
}
regfit_params_grid = {
    "regressor__alpha" : loguniform(1e-2, 1).rvs(5),
    "regressor__L1_wt" : uniform(0, 1).rvs(5)
}
tuner = GridSearchCV(pipeline, param_grid = {**ctor_params_grid, **regfit_params_grid}, verbose = 3)
tuner.fit(X, y, regressor__fit_method = "fit_regularized")
print(tuner.best_estimator_)
Deployment
Hyperparameter tuning is a 100% training-time phenomenon, which should not cause any complications in later model lifecycle phases. Any kind of subclassing violates this principle, because it creates a need to package and distribute model class definition(s) alongside the model object.
When model deployment happens in a Python environment, the solution is to transform the TunableStatsModelsRegressor object into a new object for which a stable and reusable class definition is available.
The TunableStatsModelsRegressor class does not hold any fitted state beyond that of its StatsModelsRegressor parent class. Therefore, as another hack, the model object can be made easily recognizable to its future users by simply re-assigning its __class__ attribute:
import joblib
best_pipeline = tuner.best_estimator_
best_regressor = best_pipeline._final_estimator
best_regressor.__class__ = StatsModelsRegressor
joblib.dump(best_pipeline, "GridSearchAuto.pkl")
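As a sanity check, the dumped pipeline should now unpickle without the TunableStatsModelsRegressor class definition being available (a sketch):
import joblib

# The unpickled final estimator identifies itself as a plain StatsModelsRegressor
pipeline = joblib.load("GridSearchAuto.pkl")
print(pipeline._final_estimator.__class__)
# Prints <class 'sklearn2pmml.statsmodels.StatsModelsRegressor'>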
The proper way of model class conversion is to re-fit a new pipeline object using the best param set. If the latter is a mix of constructor and fit method params, then it must be partitioned into two disjoint subsets:
import joblib
best_params = dict(tuner.best_params_)
best_pipeline = make_statsmodels_pipeline(StatsModelsRegressor(OLS, fit_intercept = best_params.pop("regressor__fit_intercept")))
best_pipeline.fit(X, y, **best_params, regressor__fit_method = "fit_regularized")
joblib.dump(best_pipeline, "GridSearchAuto.pkl")
Model deployment in non-Python environments might seem impossible due to extensive StatsModels, Scikit-Learn and Python API dependencies.
No worries, because the Java PMML API software project provides a full stack of Java tools and libraries for untangling and converting arbitrarily complex Python ML artifacts to the Predictive Model Markup Language (PMML) representation.
Dealing with the best_pipeline object in a fully automated fashion does not pose any substantial challenge. There is one minor configuration issue related to the fact that the JPMML-SkLearn library recognizes and supports wrapper classes, but not their ad hoc subclasses.
It is possible to avoid the tedious model class conversion operation by declaring model class equivalence (ie. “treat TunableStatsModelsRegressor objects the same as StatsModelsRegressor objects”) using a custom class mapping:
from sklearn2pmml import load_class_mapping, make_class_mapping_jar, sklearn2pmml
from sklearn2pmml.util import fqn
default_mapping = load_class_mapping()
# Map the ad hoc subclass to the same JPMML-SkLearn converter class as the parent class
statsmodels_mapping = {
    fqn(TunableStatsModelsRegressor) : default_mapping[fqn(StatsModelsRegressor)]
}
extension_jar = "TunableStatsModelsRegressor.jar"
make_class_mapping_jar(statsmodels_mapping, extension_jar)
sklearn2pmml(tuner.best_estimator_, "GridSearchAuto.pmml", user_classpath = [extension_jar])
Resources
Related blog posts
- 2023-03-28: Training Scikit-Learn StatsModels pipelines
- 2023-05-03: Converting customized Scikit-Learn estimators to PMML
- 2019-12-25: Converting Scikit-Learn GridSearchCV pipelines to PMML