The behaviour of Scikit-Learn estimators is controlled using hyperparameters. Feature transformers and selectors perform deterministic computations that take a small number of transparent hyperparameters. In contrast, models perform non-deterministic computations (numerical optimization) that take a much larger number of rather obscure hyperparameters. Some of them control the complexity of the learned model object, whereas others control the quality and speed of the learning process itself.
Scikit-Learn estimators assign reasonable default values to hyperparameters in their constructors. This facilitates prototyping work, where the goal is to establish the structure of a pipeline by quickly adding or modifying steps. However, the default configuration is hardly ever the optimal one.
There is no analytic procedure for determining the best configuration from scratch, or even for comparing the goodness of two configurations. In practice, the most common way of finding a good configuration is to generate many candidate configurations, and rank them on the basis of their predictive performance on a testing dataset.
The Model Selection module (sklearn.model_selection) provides meta-estimators and utility functions for developing robust solutions in this area.
In brief, a data scientist defines the template pipeline and the associated hyperparameter space.
The latter is a mapping between parameter names and parameter value ranges (a list of preselected values, or a distribution function).
If the dimensionality of the hyperparameter space is low, and the gradation of all individual dimensions is directly enumerable, then it is possible to perform exhaustive search using the GridSearchCV meta-estimator.
In all other cases, it is possible to perform random sampling using the RandomizedSearchCV meta-estimator.
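For illustration, here is a minimal sketch of both search styles; the estimator, parameter names and value ranges are arbitrary placeholders rather than recommendations:

```python
from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Exhaustive search over a small, directly enumerable grid of preselected values
grid_search = GridSearchCV(LogisticRegression(solver = "liblinear"), param_grid = {
  "C" : [0.1, 1.0, 10.0],
  "penalty" : ["l1", "l2"]
})

# Random sampling from a distribution function
random_search = RandomizedSearchCV(LogisticRegression(solver = "liblinear"), param_distributions = {
  "C" : uniform(loc = 0.1, scale = 10.0)
}, n_iter = 10)
```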
Single estimator (aka local) tuning
If the pipeline contains just one tuneable estimator, then the tuning work should be performed locally, by wrapping this estimator in its current place into a search meta-estimator.
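A minimal sketch of this arrangement, assuming a two-step pipeline of StandardScaler and LogisticRegression; the PMMLPipeline class comes from the sklearn2pmml package, and the dataset and hyperparameter values are placeholders chosen purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_breast_cancer(return_X_y = True)

pipeline = PMMLPipeline([
  ("scaler", StandardScaler()),
  # The search meta-estimator takes the place of the tuneable estimator
  ("classifier", RandomizedSearchCV(LogisticRegression(solver = "liblinear"), param_distributions = {
    "C" : [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
  }, cv = 5, n_iter = 5))
])
pipeline.fit(X, y)
```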
The GridSearchCV and RandomizedSearchCV meta-estimators split the original dataset into training and validation subsets.
As a result, the fit method of the tuneable estimator is exposed to fewer data records than the fit methods of all the other estimators in the pipeline.
For example, in the above Python code, the LogisticRegression.fit(X, y) method is called with roughly 80% of data records (the training subset of the original dataset), whereas the StandardScaler.fit(X) method is called with 100% of data records (the full original dataset).
Data scientists may want to compensate for this effect, especially when working with smaller and more heterogeneous datasets.
Pipeline (aka global) tuning
If the pipeline contains multiple tuneable estimators, then the tuning work should be performed globally, by wrapping the complete pipeline into a search meta-estimator.
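A minimal sketch of this arrangement, using the same two-step pipeline as above; the grid of hyperparameter values is again an arbitrary placeholder:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_breast_cancer(return_X_y = True)

pipeline = PMMLPipeline([
  ("scaler", StandardScaler()),
  ("classifier", LogisticRegression(solver = "liblinear"))
])

# Parameter names are prefixed with the step identifier, separated by a double underscore
param_grid = {
  "scaler__with_mean" : [True, False],
  "classifier__C" : [0.1, 1.0, 10.0]
}

search = GridSearchCV(pipeline, param_grid = param_grid, cv = 5)
search.fit(X, y)

# The fitted, hyperparameter-tuned clone of the template pipeline
best_pipeline = search.best_estimator_
```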
The GridSearchCV meta-estimator can be regarded as a workflow execution engine.
It takes a template pipeline, performs the search, and returns a fitted, hyperparameter-tuned clone of this template pipeline as its best_estimator_ attribute.
All hyperparameter spaces are collected into a single map. They are kept logically separate from one another by prefixing parameter names with the corresponding step identifier (the two are joined by a double underscore, eg. classifier__C).
The search meta-estimator still splits the original dataset into two subsets.
However, the split happens before the workflow execution enters the (PMML)Pipeline.fit(X, y) method, so all estimators in the pipeline are exposed to the same number of data records.
If the span of a validation subset exceeds that of the training subset, then the corresponding cross-validation fold fails with a value error.
It is possible to suppress this sanity check by changing the value of the Domain.invalid_value_treatment attribute from its default value "return_invalid" to "as_is".
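For example, assuming that column value ranges are recorded by the ContinuousDomain decorator from the sklearn2pmml package, the check could be relaxed along these lines (a sketch; the exact set of supported treatment values should be verified against the package documentation):

```python
from sklearn2pmml.decoration import ContinuousDomain

# "return_invalid" (the default) raises an error when a value falls outside
# the range seen during fitting; "as_is" passes such values through unchanged
domain = ContinuousDomain(invalid_value_treatment = "as_is")
```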