StatsModels is a Python library that specializes in statistical and time-series analyses.

StatsModels and Scikit-Learn communities have co-existed since early 2010s. Yet, there is virtually no co-operation between the two, even though the potential for synergy is high.

In essence, StatsModels provides low-level tools for formulating and testing statistical hypotheses, whereas Scikit-Learn provides a high-level framework for organizing those tools into coherent and transparent workflows.

StatsModels’ linear models and Scikit-Learn’s linear models are compatible in their fit and predict behaviour, but address different application scenarios:

StatsModels | Scikit-Learn | |
---|---|---|

Audience | expert | non-expert |

Methodology | analytical, statistics | brute-force ML |

Objective | configurability | scalability and speed |

Inspectability | “white box” | “black box” |

Feedback | yes | no |

The biggest differentiator is model introspection capabilities.

StatsModels’ linear models are metadata rich and lend themselves to scrutinization. In contrast, Scikit-Learn’s linear models can only be judged after their live predictive performance. For example, all model selection and evaluation takes place under the assumption that the quality of model configurations is directly proportional to the quality of its predictions. There is no consideration of other aspects such as conformity or complexity.

## StatsModels workflow

A sample Scikit-Learn linear regression workflow:

```
from sklearn.linear_model import LinearRegression
lr_params = {...}
estimator = LinearRegression(**lr_params)
estimator.fit(X = X, y = y)
# Basic details
print(estimator.coef_)
print(estimator.intercept_)
yt = estimator.predict(X = X)
```

The same, after porting from Scikit-Learn to StatsModels:

```
from statsmodels.api import OLS
lr_params = {...}
model = OLS(endog = y, exog = X)
results = model.fit(**lr_params)
# Basic details
print(results.params)
# Advanced details
print(results.summary())
results.remove_data()
yt = results.predict(exog = X)
```

A modeling task can be split into **task specification** and **task outcome** parts.

Scikit-Learn keeps these two parts together in a singular estimator object.
Calling the `fit(X, y)`

method with the training dataset updates the object from the initial state to the fitted state, and enables the use of `predict(X)`

and `score(X, y_expected)`

methods.
The fitted state is very lean, and does not include any trace of the training dataset.

In contrast, StatsModels keeps these two parts separate between `statsmodels.base.model.Model`

and `statsmodels.base.model.Results`

objects, respectively.

The `Model`

object is instantiated directly. It holds the base configuration and the training dataset.
According to StatsModels conventions, the label data (aka dependent variable aka response) is passed as the `endog`

argument, and the feature data (aka independent variables aka observations) is passed as the `exog`

argument.

A model class can declare one or more fit methods:

`fit(method, ...)`

- Full fit.`fit_regularized(method, ...)`

- Regularized fit.`fit_constrained(constraints, ...)`

- Constrained fit.

Each fit method is backed by one or more modeling algorithms. When calling a fit method, then it will be necessary to pass a rather long and detailed supplementary configuration there (on top of the base configuration) in order to achieve the best performance.

The fit produces a `Results`

object, which holds a (back-)reference to the parent `Model`

object, and a bunch of estimated parameters.

The `Model`

object loses its relevance from this point forward.
All the analysis, testing and prediction functionality is commandable using the `Results`

object’s methods.

## Scikit-Learn StatsModels wrappers

The working principle of a StatsModels wrapper class:

```
from sklearn.base import BaseEstimator
class StatsModelsEstimator(BaseEstimator):
def __init__(self, model_class, **init_params):
self.model_class = model_class
self.init_params = init_params
def fit(self, X, y, **fit_params):
self.model_ = self.model_class(endog = y, exog = X, **self.init_params)
fit_method = fit_params.pop("fit_method", "fit")
self.results_ = getattr(self.model_, fit_method)(**fit_params)
return self
def predict(self, X, **predict_params):
return self.results_.predict(exog = X, **predict_params)
```

The `StatsModelsEstimator`

constructor requests to know the model’s class and its base configuration.
This information is stored as-is.

The `model_class`

argument must be some `Model`

class reference.
It does not matter if this reference is based on the fully qualified name (eg. `statsmodels.regression.linear_model.OLS`

) or some alias (eg. `statsmodels.api.OLS`

).

The `init_params`

argument is a list of keyword arguments to be passed to the `Model`

constructor at a later time.
This list must not include the `endog`

or `exog`

arguments, or any formula interface-related arguments.

The `fit(X, y)`

method performs all the heavy-lifting.
First, it constructs the `Model`

object by complementing the base configuration with the newly available training dataset.
After that, it determines the fit method by popping a special-purpose `fit_method`

fit param, and then calls it using the remainder of fit params.

The fitted state consists of `model_`

and `results_`

attributes.

Sample workflow:

```
from statsmodels.api import OLS
estimator = StatsModelsEstimator(OLS)
estimator.fit(X, y, fit_method = "fit_regularized")
print(estimator.results_.summary())
```

It takes around ten lines of Python code to achieve basic interoperability. However, it will take many times that to reduce the entry barrier from the expert level to the beginner-to-intermediate level, and achieve advanced interoperability so that third-party libraries can accept StatsModels’ linear models as drop-in replacements for Scikit-Learn’s linear models.

The `sklearn2pmml`

package provides `sklearn2pmml.statsmodels.StatsModelsClassifier`

and `sklearn2pmml.statsmodels.StatsModelsRegressor`

models that strive towards maximal Scikit-Learn API compliance.

Selected highlights:

- The
`StatsModelsEstimator.fit_intercept`

attribute. `StatsModelsEstimator.coef_`

and`StatsModelsEstimator.intercept_`

attributes.- The
`StatsModelsEstimator.remove_data()`

method. Reduces the size of the model by evicting the training dataset. - The
`StatsModelsClassifier.classes_`

attribute. `StatsModelsClassifier.predict(X)`

and`StatsModelsClassifier.predict_proba(X)`

methods.

Both `StatsModelsClassifier`

and `StatsModelsRegressor`

classes are registered with the JPMML-SkLearn library.
However, the conversion to the PMML representation is only possible if the fitted state operates with `Model`

and `Results`

subclasses that are supported by the underlying JPMML-StatsModels library.

## Scikit-Learn pipeline templates

### Model formula interface

StatsModels is heavily influenced by the R language. One clear telltale is the habit of specifying modeling tasks in compact symbolic form.

StatsModels relies on the `patsy`

package for its R-compatible re-implementation of model formulae and model design matrices.

The parsing and interpretation of model formulae is tailored for linear models’ needs.
For example, the model formula string is assumed to include an implicit intercept term, and the `C()`

categorical encoding operator drops the first category level.

A sample Patsy linear regression workflow:

```
from sklearn2pmml.statsmodels import StatsModelsRegressor
from statsmodels.api import OLS
import pandas
import patsy
df = pandas.read_csv(...)
# Model formula sides
lhs = ...
rhs = ...
y, X = patsy.dmatrices(lhs + " ~ " + rhs, df, return_type = "dataframe")
regressor = StatsModelsRegressor(OLS, fit_intercept = False)
regressor.fit(X, y)
print(regressor.results_.summary())
```

The same, in Scikit-Learn pipeline form:

```
from sklearn.pipeline import Pipeline
from sklearn2pmml.statsmodels import StatsModelsRegressor
from sklego.preprocessing import PatsyTransformer
from statsmodels.api import OLS
import pandas
df = pandas.read_csv(...)
rhs = ...
pipeline = Pipeline([
("patsy_transformer", PatsyTransformer(rhs, return_type = "dataframe")),
("regressor", StatsModelsRegressor(OLS, fit_intercept = False))
])
pipeline.fit(df, df[...])
regressor = pipeline._final_estimator
print(regressor.results_.summary())
```

The `sklego`

package provides the `sklego.preprocessing.PatsyTransformer`

transformer, which learns the model design matrix configuration (`patsy.DesignInfo`

class) during fitting, and then applies it during subsequent transforms.
The transformation behaviour is perfectly repeatable across time and space, including in situations where the testing dataset is a subset of the training dataset (eg. a single data sample).

The `PatsyTransformer`

transformer only deals with the right-hand side of the formula.
Its output is explicitly converted from `patsy.DesignMatrix`

to `pandas.DataFrame`

in order to ensure smooth passage between Patsy and Scikit-Learn realms.

The first column of the pre-processed dataset is labelled “Intercept” and is filled with constant `1.0`

values.
The `StatsModelsRegressor`

model is expressly prohibited from introducing an extra constant column by settings its `fit_intercept`

attribute to `False`

.

### Data matrix interface

The data matrix interface is about retrofitting existing Scikit-Learn pipelines.

Model substitution is straightforward. The only pitfall relates to regularized fit methods, where StatsModels’ linear models expect direct commanding via fit params:

```
#from sklearn.linear_model import ElasticNet
from sklearn2pmml.statsmodels import StatsModelsRegressor
from statsmodels.api import OLS
pipeline = Pipeline([
("mapper", ...),
#("regressor", ElasticNet()),
("regressor", StatsModelsRegressor(OLS, fit_intercept = True))
])
fit_params = {
"regressor__fit_method" : "fit_regularized",
"regressor__method" : "elastic_net"
}
pipeline.fit(X, y, **fit_params)
```

StatsModels summaries are readily available.
However, the default print-out is quite meaningless, because it displays a `const`

term followed by a long list of `x1`

, `x2`

, etc. terms.

The situation can be fixed by proper feature identification.

Scikit-Learn transformers offer rudimentary feature naming support using the `get_feature_names_out()`

method.
Calling this method on the last transformer step of a fitted pipeline yields a list of strings that should align perfectly against the above list of anonymized terms.

This process can be fully automated in Scikit-Learn version 1.2.0 and newer using the set_output API:

```
from sklearn2pmml.statsmodels import StatsModelsRegressor
from statsmodels.api import OLS
pipeline = Pipeline([
("mapper", ...),
("regressor", StatsModelsRegressor(OLS, fit_intercept = True))
])
pipeline.set_output(transform = "pandas")
fit_params = {
"regressor__fit_method" : "fit_regularized",
"regressor__method" : "elastic_net"
}
pipeline.fit(X, y, **fit_params)
regressor = pipeline._final_estimator
print(regressor.results_.summary())
```

The `Pipeline.set_output()`

method propagates the same configuration to all steps.
It should be called right after construction.

Scikit-Learn generates less intuitive feature names than Patsy.

The results of some transformer step can be highlighted by prepending a unique feature name prefix to them.
This can be implemented by wrapping the transformer into a `sklearn.compose.ColumnTransformer`

object:

```
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures
def make_prefixer(transformer, name):
return ColumnTransformer([
(name, transformer, lambda X: X.columns.values)
])
interactor = make_prefixer(PolynomialFeatures(degree = 2, interaction_only = True), "ix1")
```

The interaction features of the `interactor`

object will now be easily identifiable in StatsModels summaries by their `"ix1__"`

prefix.