Scikit-Learn algorithms operate on numerical data.
If the dataset contains complex features, then they need to be explicitly encoded and/or transformed from their native high-level representation to a suitable low-level representation.
For example, a string column must be expanded into a list of binary indicator columns using the
Scikit-Learn can be extended with custom features by building extension layers on top of the numeric base layer.
Custom features allow data scientists to represent and manipulate data using more realistic concepts, thereby improving their productivity (reducing cognitive load, eliminating whole categories of systematic errors). For example, compare working with temporal data in the form of Unix timestamps (number of seconds since the Unix Epoch) versus ISO 8601 strings.
This blog post demonstrates how the
sklearn2pmml package extends Scikit-Learn with PMML-compatible date and datetime features.
A datetime is a data structure that represents an instant (point in time) according to some calendar and time zone.
The calendar component takes care of mapping larger periods of time such as years, months and days. Most computer systems use the Gregorian calendar, which provides a rather simple algorithm for spacing 365.2422-day solar years uniformly.
The time zone component takes care of mapping periods within a day. While based on solar time, time zones include socially and economically motivated adjustments. From the software development perspective, time zones should be regarded as ever-growing lookup tables that need to be updated regularly (typically handled behind the scenes by the operating system). The lookup function returns a time zone offset relative to Coordinated Universal Time (UTC) for the specified point in time.
Datetimes should be formatted as strings following the ISO 8601 standard:
<date part> – Local date
<date part>T<time part> – Local datetime
<date part>T<time part>±<time zone offset part> – Local datetime with explicit time zone offset
<date part>T<time part>Z – UTC datetime
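The four forms can be reproduced with Python's standard `datetime` module. This is only an illustrative sketch: the instant and the +03:00 offset are arbitrary values picked for the example.

```python
from datetime import date, datetime, timedelta, timezone

# Illustrative values only; any date and offset would do
local_date = date(1968, 12, 21)
local_datetime = datetime(1968, 12, 21, 15, 51, 0)
offset_datetime = local_datetime.replace(tzinfo=timezone(timedelta(hours=3)))
utc_datetime = offset_datetime.astimezone(timezone.utc)

date_string = local_date.isoformat()
datetime_string = local_datetime.isoformat()
offset_string = offset_datetime.isoformat()
# isoformat() renders UTC as "+00:00", so the "Z" suffix is swapped in explicitly
utc_string = utc_datetime.isoformat().replace("+00:00", "Z")

print(date_string)      # 1968-12-21
print(datetime_string)  # 1968-12-21T15:51:00
print(offset_string)    # 1968-12-21T15:51:00+03:00
print(utc_string)       # 1968-12-21T12:51:00Z
```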
PMML defines four temporal data types for representing instants (points in time):
date – Local date
datetime – Local datetime
time – Local time (24-hour clock)
timeSeconds – Local time (unrestricted clock)
They are all "local" in the sense that they do not maintain an explicit time zone offset (a close analogy with
java.time.Time Java classes).
If a predictive analytics application is dealing with temporal values associated with different time zones, then it should unify them to a common time zone (UTC or local time zone) before passing them on to the PMML engine.
Instants should be regarded as discrete. The natural operational type is ordinal (ordered categorical), because it is possible to compare instants for equality and to determine the ordering between them (e.g. answering the question "is instant A earlier or later than instant B?").
PMML further defines two sets of temporal data types for representing durations (distances between two points in time):
dateDaysSince[<year>] – Distance from the epoch in days
dateTimeSecondsSince[<year>] – Distance from the epoch in seconds
The epoch can take values
The JPMML ecosystem extends this range with values
2020 as proposed in http://mantis.dmg.org/view.php?id=234.
Durations should be regarded as continuous integers.
PMML defines three built-in functions for converting instants to durations:
dateSecondsSinceYear built-in functions support arbitrary epochs.
The suggestion is to use an epoch that would minimize the range of computed durations, and make them easier to analyze and explain for humans.
A good choice is the minimum year of the training dataset (restricts values to
[0, (maximum - minimum)]).
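To make the instant-to-duration idea concrete, here is a plain-Python sketch of the three kinds of conversion (days since an epoch year, seconds since an epoch year, seconds since midnight). The function names are hypothetical helpers written for this illustration; they are not the PMML functions themselves nor part of any library, and the epoch is assumed to start at midnight on 1 January of the given year.

```python
from datetime import date, datetime

def days_since_year(value: date, year: int) -> int:
    """Whole days between 1 January of `year` and `value` (hypothetical helper)."""
    return (value - date(year, 1, 1)).days

def seconds_since_year(value: datetime, year: int) -> int:
    """Whole seconds between midnight of 1 January of `year` and `value`."""
    return int((value - datetime(year, 1, 1)).total_seconds())

def seconds_since_midnight(value: datetime) -> int:
    """Whole seconds elapsed since the start of the day of `value`."""
    return value.hour * 3600 + value.minute * 60 + value.second

# Illustrative offset-naive local datetime
instant = datetime(1968, 12, 21, 15, 51, 0)

print(days_since_year(instant.date(), 1968))  # 355
print(seconds_since_year(instant, 1968))      # 30729060
print(seconds_since_midnight(instant))        # 57060
```

Using the minimum year of the dataset (1968 here) as the epoch keeps the day counts small and easy to read, which is the point of the suggestion above.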
In principle, subtracting one instant from another should yield a duration, and adding a duration to an instant should yield another instant. However, the PMML specification does not clarify the behaviour of temporal values in arithmetic operations. While technically feasible, such arithmetic should be avoided for the time being.
The sample dataset is a list of crewed lunar missions under the Apollo program. In the years 1968 through 1972 there were nine flights. The first two were lunar orbiting missions, and the remaining seven were lunar landing missions.
In Python terminology, a datetime with time zone information is "(time zone-) offset-aware", whereas a datetime without time zone information is "(time zone-) offset-naive". Python datetime operations typically raise an error when offset-aware and offset-naive datetimes are mixed:
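A minimal standard-library sketch of this failure mode follows. The two datetimes are illustrative values that differ only in whether `tzinfo` is set.

```python
from datetime import datetime, timezone

aware = datetime(1968, 12, 21, 12, 51, 0, tzinfo=timezone.utc)
naive = datetime(1968, 12, 21, 12, 51, 0)

subtraction_failed = False
comparison_failed = False

try:
    aware - naive
except TypeError as error:
    # Subtraction across the aware/naive boundary is rejected
    subtraction_failed = True
    print(error)

try:
    aware < naive
except TypeError as error:
    # Ordering comparisons are rejected as well
    comparison_failed = True
    print(error)

# Equality does not raise; an aware and a naive datetime simply compare unequal
print(aware == naive)
```

Note that the equality check is the quiet case: it returns `False` instead of raising, so a mixed-up column can go unnoticed if it is only ever tested with `==`.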
The aim of data pre-processing is to convert offset-aware UTC datetimes to offset-naive PMML-compatible local datetimes:
For example, the first cell of the dataset is converted from
The hour of day has been incremented by three hours (the offset between UTC and the Europe/Tallinn time zone on 21 December 1968), and the
Z suffix has been truncated.
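The conversion can be sketched with the standard library alone. This is not necessarily how the original script does it; the helper name `to_local_naive`, the IANA identifier `Europe/Tallinn` and the sample timestamp are assumptions made for this illustration.

```python
from datetime import datetime, timedelta, timezone

try:
    from zoneinfo import ZoneInfo  # Python 3.9+
    TALLINN = ZoneInfo("Europe/Tallinn")
except (ImportError, KeyError):
    # Fallback for systems without time zone data.
    # A fixed +03:00 offset is only valid for the 1968 sample date below.
    TALLINN = timezone(timedelta(hours=3))

def to_local_naive(iso_string: str, tz) -> str:
    """Convert a UTC ISO 8601 string to an offset-naive local ISO 8601 string."""
    # fromisoformat() accepts the "Z" suffix only from Python 3.11 onwards
    aware_utc = datetime.fromisoformat(iso_string.replace("Z", "+00:00"))
    local = aware_utc.astimezone(tz)
    # Dropping tzinfo keeps the local wall-clock fields and removes the offset
    return local.replace(tzinfo=None).isoformat()

converted = to_local_naive("1968-12-21T12:51:00Z", TALLINN)
print(converted)  # 1968-12-21T15:51:00
```

The same function unifies values from any source time zone, as long as the input string carries an explicit offset or the `Z` suffix.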
Feature specification and engineering
The sklearn2pmml package provides domain decorators and transformers for working with pre-processed temporal values.
Domain decorators are meant for declaring the type and behaviour of individual features. They were discussed in full detail in an earlier blog post about extending Scikit-Learn with feature specifications.
The sklearn2pmml.decoration module provides three domain decorators:
DateDomain – Default date
DateTimeDomain – Default datetime
OrdinalDomain – Custom date or datetime
Domain decorators take care of parsing or casting input values to appropriate temporal values.
TemporalDomain decorators can be applied to multiple columns at once.
Their core configuration is hard-coded to prevent the collection and storage of valid value space information (i.e.
Domain(with_data = False, with_statistics = False)).
The main assumption is that temporal features are likely to take previously unseen values.
In contrast, the
OrdinalDomain decorator can only be applied to one column at a time, but is fully configurable.
It may come in handy when a temporal feature must be restricted in a certain way.
A datetime is a complex data structure which needs to be "flattened" to a scalar before it can be fed to Scikit-Learn algorithms.
The sklearn2pmml.preprocessing module provides three transformers, which correspond to the previously discussed PMML built-in functions:
Further transformations are possible using the good old
For example, calculating the duration of a mission in seconds:
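The arithmetic behind this can be sketched in plain Python, independently of any transformer class: both instants are first flattened to seconds since a common epoch, and the duration is the difference of the two numbers. The helper name and the launch and return timestamps are illustrative assumptions, not values taken from the dataset.

```python
from datetime import datetime

def seconds_since_year(value: datetime, year: int) -> int:
    """Whole seconds between midnight of 1 January of `year` and `value` (hypothetical helper)."""
    return int((value - datetime(year, 1, 1)).total_seconds())

# Illustrative offset-naive local datetimes for one mission
launch = datetime(1968, 12, 21, 15, 51, 0)
mission_end = datetime(1968, 12, 27, 18, 51, 42)

launch_seconds = seconds_since_year(launch, 1968)
end_seconds = seconds_since_year(mission_end, 1968)

# Subtracting two durations (plain integers), not two temporal values
duration_seconds = end_seconds - launch_seconds
print(duration_seconds)  # 529242
```

Working on the flattened integers sidesteps the open question of how temporal values themselves behave under arithmetic.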
The lack of more fine-grained calendaring functions can be overcome by performing the required arithmetic operations manually. If some functionality is needed often, then it should be extracted into a separate utility function.
For example, calculating the hour of day (24-hour clock) by dividing the number of seconds since midnight by 3600 seconds/hour:
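As a plain-Python sketch of that calculation (the function name is a hypothetical utility, and floor division is assumed so that the result is a whole hour in the range 0 to 23):

```python
def hour_of_day(seconds_since_midnight: int) -> int:
    """Hour on a 24-hour clock for a count of seconds since midnight."""
    if not 0 <= seconds_since_midnight < 24 * 3600:
        raise ValueError("expected a value in [0, 86400)")
    # Floor division discards the minutes and seconds within the hour
    return seconds_since_midnight // 3600

print(hour_of_day(0))      # 0
print(hour_of_day(57060))  # 15
print(hour_of_day(86399))  # 23
```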
- Python script: