You have been doing it wrong
Consider the simplest TF(-IDF) plus XGBoost pipeline:
Is this pipeline correct or not?
The question is not about spotting a typo, or optimizing the default parameterization.
The question is "are you allowed to pass the document-term matrix of CountVectorizer (or any of its subclasses such as TfidfVectorizer) directly to XGBClassifier, or not?".
This pipeline can be fitted without any errors or warnings, and appears to make sensible predictions. Therefore, the answer must be "yes", right?
Not so fast! Executability proves technical compatibility, but it does not prove logical compatibility.
Despite adhering to the standard Scikit-Learn API, these two pipeline steps both exhibit slightly non-standard behaviour.
The transform(X) method of Scikit-Learn TF(-IDF) transformers produces sparse, not dense, data matrices. For example, the "sentiment" dataset is expanded into a compressed sparse row (CSR) scipy.sparse.csr.csr_matrix data matrix of shape (1000, 1847), which has ~0.005 density (i.e. only 0.5% of cells hold non-zero values).
The fit(X, y) and predict(X) methods of XGBoost estimators accept most common SciPy, NumPy and Pandas data structures. However, behind the scenes, they are all converted to a library-specific xgboost.DMatrix data matrix.
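The sparsity of the document-term matrix can be checked directly. The sketch below uses a four-document made-up corpus (an assumption, not the "sentiment" dataset) and computes density as the number of stored non-zero cells divided by the total number of cells.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["good movie", "great plot", "bad acting", "awful movie"]

Xt = CountVectorizer().fit_transform(docs)

# Fraction of cells that hold an explicitly stored (non-zero) value
density = Xt.nnz / (Xt.shape[0] * Xt.shape[1])

print(type(Xt), Xt.shape, density)
```

Here each document contributes two of the seven vocabulary terms, so most cells are never set. Real corpora have far larger vocabularies and correspondingly lower densities.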
It is possible to reduce complexity in the contact area by explicitly converting the document-term matrix from sparse to dense representation.
The CountVectorizer transformer does not provide any controls (e.g. a "sparse" constructor parameter) for that.
A good workaround is to use the mlxtend.preprocessing.DenseTransformer pseudo-transformer from the mlxtend package.
This pipeline (dense) should be functionally identical to the first one (sparse), but somehow it is making different predictions!
For example, the predicted probabilities for the first data record of the "sentiment" dataset are [0.9592051, 0.04079489] and [0.976111, 0.02388901], respectively.
Clearly, one of the two pipelines must be incorrect.
Untangling the mess
The situation cannot be definitively cleared up by making more predictions, or exploring the documentation and Python source code of relevant classes and methods.
Converting the pipeline to the Predictive Model Markup Language (PMML) data format and making predictions using a PMML engine provides an objective (i.e. first-principles) second opinion.
Converting using the
In the current case, it does not matter which pipeline of the two is converted.
The resulting PMML documents will be identical (except for the conversion timestamp in the header), because the DenseTransformer pseudo-transformation is a no-op.
Making predictions using the
PMML predictions are in perfect agreement with the predictions of the second pipeline (dense).
It follows that the first pipeline (sparse) is indeed incorrect.
The source of the error is the algorithm that the XGBoost library uses for converting scipy.sparse.csr.csr_matrix to xgboost.DMatrix.
The document-term matrix keeps count of how many times each document (row) contains each term (column).
The cell value is set only if the count is greater than zero.
The DMatrix converter appears to interpret unset cell values as missing values (NaN) rather than zero count values (0).
In plain English, these interpretations read like "I do not know if the document contains the specified term" and "I know that the document contains zero occurrences of the specified term", respectively.
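The two readings can be made concrete with NumPy and SciPy alone. This is only an illustration of the difference between them, not a reproduction of the XGBoost converter code; the one-row matrix is made up.

```python
import numpy as np
from scipy import sparse

# One document-term row: the middle term occurs twice, the others are unset
row = sparse.csr_matrix(np.array([[0, 2, 0]]))

# Reading A: unset cells are zero counts
as_zero = row.toarray().astype(float)

# Reading B: unset cells are missing values
as_missing = np.full(row.shape, np.nan)
coo = row.tocoo()
as_missing[coo.row, coo.col] = coo.data

print(as_zero)     # zeros where nothing was stored
print(as_missing)  # NaN where nothing was stored
```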
Scikit-Learn estimators typically error out when they encounter NaN values.
In contrast, XGBoost estimators treat NaN values as special-purpose missing value indicators, and grow missing value-aware decision trees.
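The Scikit-Learn side of this contrast is easy to observe. The sketch below uses LogisticRegression as a representative estimator (an arbitrary choice; a few Scikit-Learn estimators do accept NaN values) and a made-up data matrix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data matrix with one missing cell
X = np.array([[1.0, np.nan], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
y = [1, 0, 1, 0]

try:
    LogisticRegression().fit(X, y)
    rejected = False
except ValueError as error:
    # Input validation refuses NaN values before any training happens
    rejected = True
    print(error)
```

An estimator that fails loudly like this cannot be silently trained on misinterpreted data, which is exactly the safeguard that is absent when NaN is a legal input.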
When the XGBoost estimators of the first and the second pipeline are compared, they turn out to be structurally different (overall vocabulary, the time and location of invoking individual terms, etc.). The former incorrectly believes that it was dealing with massive amounts of missing values during training, and all its internals are thus systematically off.
A data scientist may evaluate such a biased TF(-IDF) plus XGBoost pipeline with a validation dataset, and decide that its raw numeric performance is still good enough for productionization. That would be acceptable. The Pipeline API provides adequate guarantees that all biases survive and are consistently applied throughout the pipeline life-cycle.
Doing it right
As of XGBoost 1.3(.3), the missing constructor parameter has no effect:
This user warning should be taken seriously, and the fitted pipeline abandoned, because it is incorrect again.
Converting the document-term matrix from sparse to dense representation is good for quick troubleshooting purposes. However, it is prohibitively expensive in real-life situations where the dimensions of data matrices easily reach millions of rows (documents) and/or tens of thousands of columns (terms).
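A back-of-the-envelope estimate shows the scale of the problem. The helper functions, the 4-byte cell and index sizes, and the matrix dimensions below are assumptions; the 0.005 density is borrowed from the "sentiment" dataset and may not hold for other corpora.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_cell=4):
    # A dense matrix stores every cell, set or not
    return n_rows * n_cols * bytes_per_cell / 1e9

def csr_size_gb(n_rows, n_cols, density, bytes_per_value=4, bytes_per_index=4):
    nnz = n_rows * n_cols * density
    # CSR stores a value and a column index per non-zero cell,
    # plus one row pointer per row (and one extra)
    n_bytes = nnz * (bytes_per_value + bytes_per_index) + (n_rows + 1) * bytes_per_index
    return n_bytes / 1e9

n_rows, n_cols = 1_000_000, 50_000

print(dense_size_gb(n_rows, n_cols))               # 200 GB
print(csr_size_gb(n_rows, n_cols, density=0.005))  # roughly 2 GB
```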
The solution is to perform the conversion from scipy.sparse.csr.csr_matrix to xgboost.DMatrix over a temporary sparse pandas.DataFrame.
The above TF(-IDF) plus XGBoost sequence is correct in the sense that unset cell values are interpreted as zero count values.
The only problem is that this sequence cannot be "formatted" as a Pipeline object, because there is no reusable (pseudo-)transformer that would implement the intermediate DataFrame.sparse.from_spmatrix(data) method invocation.
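The intermediate step can be sketched in isolation. The corpus and variable names are assumptions, and the snippet stops short of training: it only shows that the sparse DataFrame records zero as the value of unset cells, which is the information the text says gets lost when the SciPy matrix is passed directly.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["good movie", "great plot", "bad acting", "awful movie"]

countvectorizer = CountVectorizer()
Xt = countvectorizer.fit_transform(docs)

# Wrap the SciPy matrix into a DataFrame of sparse columns without densifying it
Xt_df_sparse = pd.DataFrame.sparse.from_spmatrix(Xt)

# Each column carries an explicit fill value for its unset cells
fill_value = Xt_df_sparse.dtypes.iloc[0].fill_value

print(Xt_df_sparse.dtypes.iloc[0], Xt_df_sparse.sparse.density)
```

Per the text, fitting XGBClassifier on `Xt_df_sparse` instead of `Xt` is what yields the zero-count interpretation as of XGBoost 1.3.3. How other XGBoost versions consume sparse DataFrames should be verified rather than assumed.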
However, fitted pipeline steps can be combined into a temporary pipeline for PMML conversion purposes: