The outline of a TF(-IDF) workflow:
- Text tokenization.
- Token normalization (case conversion, stemming, lemmatization).
- Token filtering (removing stop words and low-importance words).
- Token aggregation into terms (n-gram generation).
- Term score estimation.
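The first four steps can be sketched in plain Python to make the data flow concrete. This is a minimal illustration rather than what any particular library does internally; the `STOP_WORDS` set, the `\w+` token pattern and all function names are assumptions made for this example. Term score estimation (the last step) is left to the model.

```python
import re
from collections import Counter

# Hypothetical stop word list, for illustration only
STOP_WORDS = {"the", "a", "is", "on"}

def tokenize(text):
    # Step 1: extract runs of word characters
    return re.findall(r"\w+", text)

def normalize(tokens):
    # Step 2: case conversion only (no stemming or lemmatization here)
    return [token.lower() for token in tokens]

def filter_tokens(tokens):
    # Step 3: drop stop words
    return [token for token in tokens if token not in STOP_WORDS]

def to_terms(tokens, max_n=2):
    # Step 4: aggregate tokens into n-grams of length 1..max_n
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms

def term_frequencies(text):
    tokens = filter_tokens(normalize(tokenize(text)))
    return Counter(to_terms(tokens))

tf = term_frequencies("The cat sat on the mat")
print(tf)
```

The resulting counter holds three unigrams and two bigrams, each with a frequency of one.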
Scikit-Learn packs the first four TF(-IDF) workflow operations into a single transformer –
CountVectorizer for TF, and
TfidfVectorizer for TF-IDF:
- Text tokenization is controlled using one of
- Token normalization is controlled using
- Token filtering is controlled using
- Token aggregation is controlled using the
Term scores are estimated using the final estimator step.
Linear models estimate a score for every term. The sentence score is the sum of its constituent term scores. For better interpretability, it is advisable to keep sentences short and uniform (i.e. sentences should parse into structurally similar token sets), and to constrain the number of features.
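As a rough sketch of the additive scoring described above, assuming made-up coefficients (`TERM_SCORES` and `linear_sentence_score` are hypothetical names, and a fitted model would typically also add an intercept):

```python
# Hypothetical per-term coefficients, as a linear model might learn them
TERM_SCORES = {"good": 1.5, "bad": -2.0, "not": -0.5}

def linear_sentence_score(term_counts):
    # Each term contributes its score times its count;
    # terms without a coefficient contribute nothing
    return sum(TERM_SCORES.get(term, 0.0) * count
               for term, count in term_counts.items())

score = linear_sentence_score({"good": 2, "not": 1, "movie": 1})
print(score)
```

Because every term contributes independently, the explanation of a sentence score is simply the list of its non-zero contributions.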
Decision tree models estimate a score for combinations of terms. The sentence score is the value associated with a decision path like "sentence contains term A, and does not contain terms B and C". Decision trees can be ensembled either via bagging (random forest) or boosting (XGBoost, LightGBM), which gives them scoring properties that are more similar to linear models.
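A single decision path can be pictured as nested presence checks. The sketch below hard-codes one hypothetical tree; the term names and leaf values are invented for illustration, and a trained tree would split on count thresholds rather than set membership.

```python
def tree_sentence_score(terms):
    # `terms` is the set of terms present in the sentence
    if "term_a" in terms:
        if "term_b" not in terms and "term_c" not in terms:
            # Path: contains A, and does not contain B and C
            return 0.9
        return 0.4
    return 0.1

print(tree_sentence_score({"term_a", "term_d"}))
print(tree_sentence_score({"term_a", "term_b"}))
```

Unlike the linear case, the contribution of `term_a` here depends on which other terms are present.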
The Predictive Model Markup Language (PMML) provides the
TextIndex element for representing TF(-IDF) operations.
In brief, this transformation takes a string input value, normalizes it, and then counts the occurrences of the specified term.
Term matching can be strict or fuzzy.
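In Python terms, the strict-matching case behaves roughly like the function below. This is a simplified sketch, not the PMML evaluation algorithm: `count_term` is a hypothetical helper, it only handles single-word terms, and it tokenizes on plain whitespace.

```python
def count_term(text, term, case_sensitive=False):
    # Normalize: fold case unless case-sensitive matching is requested
    if not case_sensitive:
        text, term = text.lower(), term.lower()
    # Count the tokens that are exactly equal to the term
    return sum(1 for token in text.split() if token == term)

print(count_term("Spam spam eggs SPAM", "spam"))
print(count_term("Spam spam eggs SPAM", "spam", case_sensitive=True))
```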
Text tokenization rules must be expressed in the form of regular expressions (REs).
The default behaviour for PMML (and Apache Spark ML) is text splitting using a word separator RE:
Splitting yields "dirty" tokens. They are automatically cleansed by trimming all the leading and trailing punctuation characters.
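The split-then-trim behaviour can be approximated in a few lines of Python. The sketch assumes `\s+` as the word separator RE and uses ASCII `string.punctuation` as the set of characters to trim; `split_tokenize` is a hypothetical name, not a library function.

```python
import re
import string

# Assumed word separator RE: one or more whitespace characters
WORD_SEPARATOR_RE = r"\s+"

def split_tokenize(text):
    # Splitting yields "dirty" tokens that may carry punctuation
    raw_tokens = re.split(WORD_SEPARATOR_RE, text)
    # Trim leading and trailing punctuation; inner punctuation is kept
    tokens = [token.strip(string.punctuation) for token in raw_tokens]
    # Drop tokens that became empty after trimming
    return [token for token in tokens if token]

tokens = split_tokenize('Hello, "world"! (It\'s me.)')
print(tokens)
```

Note that the apostrophe inside `It's` survives, because only the edges of each token are trimmed.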
A splitting tokenizer is available as the
sklearn2pmml.feature_extraction.text.Splitter callable type:
The default behaviour for Scikit-Learn is token matching (aka token extraction) using a word RE.
Unfortunately, this behaviour cannot be supported by the standard
wordSeparatorCharacterRE attribute, because there is no straightforward way of translating between word and word separator REs.
The JPMML ecosystem extends the
TextIndex element with the
x-wordRE attribute as proposed in http://mantis.dmg.org/view.php?id=271:
Matching is assumed to yield "clean" tokens. A data scientist shall be free to craft a word RE that extracts and retains significant punctuation or whitespace characters.
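Token matching maps directly onto `re.findall`. In the sketch below, `DEFAULT_WORD_RE` is believed to equal Scikit-Learn's default `token_pattern`, while `CUSTOM_WORD_RE` is a made-up pattern that keeps inner apostrophes and hyphens; `match_tokenize` is a hypothetical helper.

```python
import re

# Believed to match Scikit-Learn's default token_pattern:
# tokens of two or more word characters
DEFAULT_WORD_RE = r"(?u)\b\w\w+\b"

# Hypothetical custom word RE that retains inner apostrophes and hyphens
CUSTOM_WORD_RE = r"\w+(?:['-]\w+)*"

def match_tokenize(text, word_re=DEFAULT_WORD_RE):
    # Matching extracts "clean" tokens directly; no trimming step follows
    return re.findall(word_re, text)

text = "It's a so-so day, isn't it?"
default_tokens = match_tokenize(text)
custom_tokens = match_tokenize(text, CUSTOM_WORD_RE)
print(default_tokens)
print(custom_tokens)
```

The two patterns disagree on contractions and hyphenated words, which is exactly the kind of decision that a word RE leaves in the hands of the data scientist.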
A matching tokenizer is available as the
sklearn2pmml.feature_extraction.text.Matcher callable type:
Another difference between the two TF(-IDF) workflows is that PMML performs text normalization (which precedes tokenization), whereas Scikit-Learn performs token normalization (which follows tokenization).
Text normalization is activated by adding one or more
TextIndexNormalization child elements to the
Again, the rules must be expressed in the form of regular expressions.
The JPMML-SkLearn library currently uses a single
TextIndexNormalization element for encoding the removal of stop words.
Future versions may use more to encode stemming, lemmatization and other string manipulations.
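To illustrate how stop word removal can be phrased as an RE-based text normalization, here is a Python sketch. It is not the PMML markup that JPMML-SkLearn generates; the stop word list and the helper name `normalize_text` are assumptions for this example.

```python
import re

# Hypothetical stop word list
STOP_WORDS = ["the", "a", "of"]

# One alternation that matches any stop word as a whole word
STOP_WORDS_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STOP_WORDS) + r")\b",
    re.IGNORECASE,
)

def normalize_text(text):
    # Text normalization: runs on the whole string, before tokenization
    without_stop_words = STOP_WORDS_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", without_stop_words).strip()

normalized = normalize_text("The history of a nation")
print(normalized)

# Token normalization: the same filtering applied after tokenization
filtered = [t for t in "The history of a nation".lower().split()
            if t not in STOP_WORDS]
print(filtered)
```

Both routes end with the same two tokens; the difference lies only in whether the rule is applied to the string or to the token list.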
Alternatively, stemming and lemmatization can be emulated by manually specifying the
Levenshtein distance is a metric that reflects the distance between two character sequences in terms of the minimum number of one-character edits (additions, replacements or removals).
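For reference, the metric can be computed with the classic dynamic programming approach. The function below is a textbook sketch and does not claim to mirror how any PMML engine implements it.

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # removal
                current[j - 1] + 1,      # addition
                previous[j - 1] + cost,  # replacement (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("term", "terms"))
```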
For example, in the English language, the Levenshtein distance between the singular and plural forms of a regular noun is 1 (i.e. the "s" suffix). Knowing this, it is trivial to make one
TextIndex element match both forms:
Token filtering by importance and token aggregation do not require any PMML integration, because they are solely training-time phenomena.