PyCaret is another AutoML tool, which specializes in tabular and time-series data analyses.
The recent PyCaret upgrade from 2(.3) to 3(.0) is exciting for two reasons. First, the experiment API has been heavily refactored (see PyCaret-2271). The OOP-style experiment setup and command interface is much more relatable than the earlier functional programming-style interface. Second, the majority of custom transformer classes have been replaced with their Scikit-Learn equivalents, which improves interoperability with existing Scikit-Learn oriented tooling.
PyCaret 3 fundamentals
Typical supervised learning workflow:
from pycaret.classification import ClassificationExperiment, RegressionExperiment
exp = ClassificationExperiment()
exp.setup(data = df, target = "y", ...)
model = exp.create_model(...)
pycaret_pipeline = exp.finalize_model(model)
# The training dataset, excluding the target column
X = df[df.columns.drop("y")]
yt = pycaret_pipeline.predict(X)
yt_proba = pycaret_pipeline.predict_proba(X)
Main stages:
- The exp.setup(...) method constructs a data pre-processing pipeline that meets the user-supplied requirements, and fits it with the training dataset.
- The subsequent exp.create_model(...) method fits a model based on the pre-processed training dataset. The model can be further refined using the exp.tune_model(...), exp.calibrate_model(...), etc. methods.
- The exp.finalize_model(...) method distills all the relevant steps into a unitary, deployment-ready PyCaret pipeline.
The PyCaret pipeline class inherits from the Scikit-Learn pipeline class. It overrides all fit, transform and predict methods to enable caching, which is critical when performing the same computation many times.
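This relationship is easy to verify with a quick check, using the same internal pipeline class that is imported in the inspection helper below:
from sklearn.pipeline import Pipeline
from pycaret.internal.pipeline import Pipeline as PyCaretPipeline
# The PyCaret pipeline class is a Scikit-Learn pipeline subclass
assert issubclass(PyCaretPipeline, Pipeline)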
PyCaret 3 relies on its own data management and data flow logic, which assumes pandas.DataFrame as the data matrix type, and where individual columns are identified by name, not by position.
The implementation is built around the pycaret.internal.preprocess.transformers.TransformerWrapper meta-transformer class.
A TransformerWrapper object selects input column(s) from a data matrix, feeds them to the wrapped transformer object, and inserts the result column(s) back into the data matrix.
As the module name indicates, the TransformerWrapper meta-transformer class belongs to the PyCaret internal API.
It is nice to be aware of its existence and main behavioural characteristics, but there is no reason to import it into everyday Python scripts.
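Purely for illustration, a minimal sketch of the wrap-and-reinsert behaviour (the transformer and include constructor arguments are assumptions about the internal API, and may change between PyCaret versions):
import numpy
import pandas
from pycaret.internal.preprocess.transformers import TransformerWrapper
from sklearn.impute import SimpleImputer
df = pandas.DataFrame({"a" : [1.0, numpy.nan, 3.0], "b" : ["x", "y", "z"]})
# Wrap a SimpleImputer so that it only sees column "a"
wrapper = TransformerWrapper(transformer = SimpleImputer(), include = ["a"])
wrapper.fit(df)
# Column "a" is imputed; column "b" passes through untouched
print(wrapper.transform(df))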
When iterating over the steps of a PyCaret pipeline, one will see a flat sequence of TransformerWrapper steps, followed by the final estimator step.
There are no other transformer classes visible at the top level.
from pycaret.internal.preprocess.transformers import TransformerWrapper
from pycaret.internal.pipeline import Pipeline as PyCaretPipeline
from sklearn2pmml.util import fqn
def print_pycaret_pipeline(pipeline):
    if not isinstance(pipeline, PyCaretPipeline):
        raise TypeError()
    steps = pipeline.steps
    # All steps except the last one are TransformerWrapper steps
    transformer_steps = steps[:-1]
    final_estimator_step = steps[-1]
    for transformer_step in transformer_steps:
        name = transformer_step[0]
        transformer = transformer_step[1]
        if not isinstance(transformer, TransformerWrapper):
            raise TypeError()
        # Print the step name, the wrapped transformer class and the number of input columns
        print("{} -> {} // {} inputs".format(name, fqn(transformer.transformer), len(transformer._include)))
    name = final_estimator_step[0]
    final_estimator = final_estimator_step[1]
    print("{} -> {}".format(name, fqn(final_estimator)))
PyCaret 3 performs data pre-processing in the following stages:
Stage | Selection | Setup options |
---|---|---|
Imputation | Subset | imputation_type (categorical_imputation, numeric_imputation) |
Categorical-to-categorical transformations | Subset | rare_to_value (rare_value) |
Categorical-to-numeric transformations | Subset | encoding_method, max_encoding_ohe |
Feature generation | Full set | polynomial_features (polynomial_degree) |
Generalization, redundant feature elimination | Full set | low_variance_threshold, remove_multicollinearity (multicollinearity_threshold) |
Numeric-to-numeric transformations | Full set | transformation (transformation_method), normalize (normalize_method), pca (pca_method, pca_components) |
Feature selection | Full set | feature_selection (feature_selection_method, feature_selection_estimator, n_features_to_select) |
The PyCaret pipeline keeps track of column operational types (i.e. ordinal, categorical, continuous aka numeric). This information is used in the opening stages, where different column groups are subjected to different transformations. For example, columns are split between categorical imputation and numeric imputation. However, this kind of specificity disappears as soon as all data matrix columns have become numeric.
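Column operational types are derived from pandas dtypes by default, but they can also be declared explicitly during experiment setup. A minimal sketch (the column names are hypothetical):
exp.setup(
    data = df, target = "y",
    # Override automatic type detection for selected columns
    categorical_features = ["color"],
    numeric_features = ["age"],
    ordinal_features = {"size" : ["S", "M", "L"]}
)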
PyCaret 3 checks all the boxes of a good AutoML tool checklist:
- Can handle diverse data science tasks.
- Knows about the major bits of functionality available in the underlying ML framework, can parameterize and order them correctly.
- Can perform large-scale optimizations and statistical tests.
- Does not do anything stupid.
The key to success is the third point - if you can work really-really hard, you do not need to work that smart after all.
It is currently rather difficult to incorporate expert or domain knowledge into data pre-processing.
The manual construction and insertion of extra TransformerWrapper steps into PyCaret pipelines seems technically possible, but there is no official guidance or endorsement for doing so.
A seasoned data scientist may therefore prefer to keep the most creative part of the workflow for herself, and task PyCaret with more mundane parts.
Substituting a custom data pre-processing pipeline into the workflow:
from pycaret.classification import ClassificationExperiment, RegressionExperiment
from sklearn.pipeline import Pipeline
exp = ClassificationExperiment()
exp.setup(
data = df, target = "y",
preprocess = False, custom_pipeline = Pipeline([...])
)
model = exp.create_model(...)
pycaret_pipeline = exp.finalize_model(model)
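For example, a hand-crafted data pre-processing pipeline built from standard Scikit-Learn components might look like the following sketch (the column names and transformer choices are hypothetical):
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
custom_pipeline = Pipeline([
    ("transformer", ColumnTransformer([
        # Hypothetical categorical columns
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy = "most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown = "ignore"))
        ]), ["color", "size"]),
        # Hypothetical continuous columns
        ("cont", Pipeline([
            ("imputer", SimpleImputer()),
            ("scaler", StandardScaler())
        ]), ["age", "income"])
    ]))
])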
Classification experiment
The “audit-NA” dataset contains three numeric columns and five categorical string columns. Roughly 25% of values are missing.
The goal of the experiment is to train a binary logistic regression classifier. A good data pre-processing pipeline would therefore need to perform imputation, scaling and some sort of redundant feature elimination.
from pycaret.classification import ClassificationExperiment
import pandas
df = pandas.read_csv("audit-NA.csv")
df = df.drop(columns = ["Deductions"])
print(df.dtypes)
exp = ClassificationExperiment()
exp.setup(
data = df, target = "Adjusted",
imputation_type = "simple",
rare_to_value = 0.02, rare_value = "(Other)",
encoding_method = None, max_encoding_ohe = 7,
fix_imbalance = True,
normalize = True, normalize_method = "zscore",
remove_multicollinearity = True, multicollinearity_threshold = 0.75
)
model = exp.create_model(estimator = "lr")
pycaret_pipeline = exp.finalize_model(model)
print_pycaret_pipeline(pycaret_pipeline)
Summary of the pycaret_pipeline object:
Name | Class | Columns |
---|---|---|
numerical_imputer | sklearn.impute.SimpleImputer | 3 |
categorical_imputer | sklearn.impute.SimpleImputer | 5 |
rare_category_grouping | pycaret.internal.preprocess.transformers.RareCategoryGrouping | 5 |
ordinal_encoding | category_encoders.ordinal.OrdinalEncoder | 1 |
onehot_encoding | category_encoders.one_hot.OneHotEncoder | 2 |
rest_encoding | category_encoders.leave_one_out.LeaveOneOutEncoder | 2 |
remove_multicollinearity | pycaret.internal.preprocess.transformers.RemoveMulticollinearity | 19 |
balance | pycaret.internal.preprocess.transformers.FixImbalancer | 19 |
normalize | sklearn.preprocessing.StandardScaler | 19 |
actual_estimator | sklearn.linear_model.LogisticRegression | (all) |
The results reveal that the current configuration failed to achieve redundant feature elimination, because the RemoveMulticollinearity transformer draws 19 columns as input, and returns the same 19 columns as result.
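One way to double-check this is to look at the width of the transformed training dataset. A quick sketch, assuming that the X_train_transformed variable is exposed via the experiment's get_config accessor:
# The transformed training dataset would have fewer columns if elimination had taken place
X_train_transformed = exp.get_config("X_train_transformed")
print(X_train_transformed.shape)
Lowering the multicollinearity_threshold value would make the filter more aggressive.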
Regression experiment
The “auto” dataset contains four numeric columns and three numeric-like columns. The latter are manually cast to categorical string columns.
The goal of the experiment is to explore model space via AutoML means. Data pre-processing is performed conservatively, in order to ensure a level playing field for all major algorithm families (linear models, decision trees, etc.).
Three regressor types are excluded from the AutoML search.
The first two (catboost and gpc) are excluded for technical reasons (they are not supported by the chosen PMML conversion software). The third one (knn) is excluded just to keep tension high, as k-nearest neighbors is known to perform extremely well with small and homogeneous datasets.
from pycaret.regression import RegressionExperiment
import pandas
df = pandas.read_csv("auto.csv")
cat_cols = ["cylinders", "model_year", "origin"]
for cat_col in cat_cols:
df[cat_col] = df[cat_col].astype(str)
print(df.dtypes)
exp = RegressionExperiment()
exp.setup(
data = df, target = "mpg",
# Model composition changes when this attribute is omitted
categorical_features = cat_cols,
imputation_type = None,
encoding_method = None, max_encoding_ohe = 3,
normalize = True, normalize_method = "robust",
remove_multicollinearity = True, multicollinearity_threshold = 0.9
)
# Generate models
top3_models = exp.compare_models(exclude = ["catboost", "gpc", "knn"], n_select = 3)
# Select the best model from generated models
automl_model = exp.automl(optimize = "MAE")
pycaret_pipeline = exp.finalize_model(automl_model)
print_pycaret_pipeline(pycaret_pipeline)
Summary of the pycaret_pipeline object:
Name | Class | Columns |
---|---|---|
onehot_encoding | category_encoders.one_hot.OneHotEncoder | 1 |
rest_encoding | category_encoders.leave_one_out.LeaveOneOutEncoder | 2 |
remove_multicollinearity | pycaret.internal.preprocess.transformers.RemoveMulticollinearity | 9 |
normalize | sklearn.preprocessing.RobustScaler | 8 |
actual_estimator | sklearn.linear_model.HuberRegressor | (all) |
Surprisingly enough, this competition is won by a linear model, well ahead of various decision tree ensemble models.
PMML
The sklearn2pmml package provides the sklearn2pmml.sklearn2pmml utility function for converting Scikit-Learn pipelines to the Predictive Model Markup Language (PMML) representation.
However, the sklearn2pmml utility function refuses to accept Python classes other than the sklearn2pmml.pipeline.PMMLPipeline class.
The solution is to wrap the PyCaret pipeline object into a PMMLPipeline object using the sklearn2pmml.pycaret.make_pmml_pipeline utility function.
This utility function differs from the generic sklearn2pmml.make_pmml_pipeline utility function in that it knows about the TransformerWrapper meta-transformer class, and can perform proper escaping of its contents.
The escaping is needed to ensure that the “business state” of all transformers and estimators is complete when dumped in pickle data format.
The list of known troublemakers contains mostly Scikit-Learn selector classes (caused by the dynamic implementation of the _get_support_mask() method).
If the escaping is not done, then the conversion succeeds with simpler pipelines, but may fail with more complex ones.
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pycaret import make_pmml_pipeline as pycaret_make_pmml_pipeline
pmml_pipeline = pycaret_make_pmml_pipeline(pycaret_pipeline, target_fields = ["y"])
sklearn2pmml(pmml_pipeline, "PyCaretPipeline.pmml")
The make_pmml_pipeline utility function takes active_fields and target_fields arguments, which capture feature names and label name(s), respectively.
They can be left to their default None values if the column names of the training dataset are fine.
PMML documents are concise yet informative, and fully self-contained. In the two experiments above, if the PyCaret pipeline object is saved both in PMML data format (uncompressed text) and in pickle data format (compressed binary), then the PMML file is actually the smaller of the two in both cases!
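One possible way to reproduce this comparison (the pickle file name is hypothetical; joblib applies compression when the compress argument is set):
import joblib
import os
# Save the fitted PyCaret pipeline in compressed pickle data format
joblib.dump(pycaret_pipeline, "PyCaretPipeline.pkl.z", compress = 3)
# The PMML file was generated by the sklearn2pmml call above
print("PMML size: {} bytes".format(os.path.getsize("PyCaretPipeline.pmml")))
print("Pickle size: {} bytes".format(os.path.getsize("PyCaretPipeline.pkl.z")))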
Resources
- Datasets: audit-NA.csv and auto.csv
- Python scripts: train-classification.py and train-regression.py