Export Scikit-learn Pipeline Using SkLearn2PMML and Deploy as Apache Hive UDF


In this tutorial we're going to use the Iris dataset to create and train a logistic regression pipeline model with Python and the Scikit-Learn package, convert it to the standard PMML representation using SkLearn2PMML (a tool from the Openscoring JPMML family), package it as an Apache Hive User Defined Function (UDF), and then evaluate some data using the newly created UDF. Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. By defining a UDF, you can deploy the model 'close to the data': instead of sending huge amounts of data over the network to a scoring service, you bring the scoring service to the platform where your data resides.

There seems to be a large gap between developing machine learning models and deploying them to production. Openscoring believes in and invests heavily in the interoperability that PMML provides between training and production environments, where the software stacks are usually very different - model development happens in R and Python, while production systems run on Java. We've developed a toolset of converters and evaluators, with PMML providing the interoperability in between.

This tutorial takes the following path through the JPMML ecosystem:


The Setup

All the necessary software and its dependencies come preinstalled on an Amazon Machine Image: Python, pandas, numpy, scipy, scikit-learn, sklearn2pmml, Apache Hive/Hadoop, and JPMML-Evaluator for Hive. Some of the setup steps are non-trivial, such as adding memory to the EC2 instance so that scipy can be installed, and the initial configuration of Hive.

Here are the AMIs (and, as usual, let me know if you can't access any of these regions and I'll make the image available in your region too):

  • EU (Frankfurt) - ami-011c1a70ec49fe054
  • US West (N. California) - ami-073a84c441b7f905c
  • US East (N. Virginia) - ami-0528869f6be13dbf6

In the prepared image:

  • Python is on path
  • Apache Hive is located in /home/ec2-user/apache-hive-2.3.2-bin/bin and also added to path
  • JPMML-Evaluator for Hive is located in /home/ec2-user/jpmml-evaluator-hive/ and the precompiled JARs are already present in target/ directory
  • All the necessary data is located in /home/ec2-user/sampledata directory


Training and Converting the Model

Launch Python by simply typing 'python' at the command prompt and then use the following script to train and convert the model:

import pandas

iris_df = pandas.read_csv("/home/ec2-user/sampledata/Iris.csv")

from sklearn_pandas import DataFrameMapper
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

iris_pipeline = PMMLPipeline([
        ("mapper", DataFrameMapper([
                (["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"], [ContinuousDomain(), Imputer()])
        ])),
        ("pca", PCA(n_components = 3)),
        ("selector", SelectKBest(k = 2)),
        ("classifier", LogisticRegression())
])

iris_pipeline.fit(iris_df, iris_df["Species"])

from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_pipeline, "/home/ec2-user/sampledata/LogisticRegressionIris.pmml", with_repr = True)

The same script is also located in the sampledata/ directory and can be launched using:

python Iris.py

It will train the model based on the dataset provided in sampledata/Iris.csv and create the specified file LogisticRegressionIris.pmml in the sampledata/ directory.
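Before moving on, it can be useful to peek inside the generated document - PMML is plain XML, so you can list the output field names that the future UDF will expose with a few lines of Python. Below is a minimal sketch; the `list_output_fields` helper is illustrative (not part of sklearn2pmml), and it is demonstrated on a tiny inline fragment rather than the real file:

```python
import xml.etree.ElementTree as ET

def list_output_fields(pmml_text):
    # OutputField elements hold the result column names that the
    # generated Hive UDF will expose; collect their 'name' attributes.
    # Tags carry the PMML XML namespace, hence the endswith() check.
    root = ET.fromstring(pmml_text)
    return [el.attrib["name"] for el in root.iter()
            if el.tag.endswith("}OutputField") or el.tag == "OutputField"]

# A tiny stand-in fragment; with the real file you would instead do:
#   pmml_text = open("/home/ec2-user/sampledata/LogisticRegressionIris.pmml").read()
pmml_text = """<PMML xmlns="http://www.dmg.org/PMML-4_3">
  <Output>
    <OutputField name="Species" optype="categorical" dataType="string"/>
    <OutputField name="probability(setosa)" optype="continuous" dataType="double"/>
  </Output>
</PMML>"""

print(list_output_fields(pmml_text))
```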


Creating Apache Hive UDF

Usually when you want to create a Hive UDF, you can't get away without writing some code. But we actually offer you a much better solution - we've done the code-writing part and come up with a META-UDF that writes the UDF code for you :-)

So here it goes - launch Hive by typing 'hive' at the command prompt and execute the following steps (fair warning: the script will fail, but read on, there's a solution):

ADD JAR /home/ec2-user/jpmml-evaluator-hive/target/jpmml-evaluator-hive-runtime-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION BuildArchive AS 'org.jpmml.evaluator.hive.ArchiveBuilderUDF';
SELECT BuildArchive('io.openscoring.LogisticRegressionIris', '/home/ec2-user/sampledata/LogisticRegressionIris.pmml', '/home/ec2-user/sampledata/LogisticRegressionIris.jar');
ADD JAR /home/ec2-user/sampledata/LogisticRegressionIris.jar;
CREATE TEMPORARY FUNCTION LogisticRegressionIris AS 'io.openscoring.LogisticRegressionIris';
DESCRIBE FUNCTION LogisticRegressionIris;
-- (statement reconstructed: an external table over the Iris CSV columns)
CREATE EXTERNAL TABLE Iris (Sepal_Length DOUBLE, Sepal_Width DOUBLE, Petal_Length DOUBLE, Petal_Width DOUBLE, Species STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/home/ec2-user/Iris';
SELECT LogisticRegressionIris(named_struct('Sepal_Length', Sepal_Length, 'Sepal_Width', Sepal_Width, 'Petal_Length', Petal_Length, 'Petal_Width', Petal_Width)) FROM Iris;

What this script does:

  • It'll add JPMML-Evaluator for Hive runtime to Hive
  • It'll create user defined function BuildArchive that is used to create the UDFs based on PMML
  • You'll need to give the following input parameters to the function:
    • The class name (io.openscoring.LogisticRegressionIris in the example; this class doesn't have to exist anywhere as .java file, it'll be generated)
    • The PMML file (LogisticRegressionIris.pmml in the example)
    • The path and name for JAR file where to package the resulting code
  • Then we'll add the newly generated model JAR file to Hive
  • And create a new function based on it (LogisticRegressionIris in the example)
  • We're then going to point Hive at some test data previously added to the Hadoop Filesystem
  • And evaluate the records

This will fail with the following exception, because parentheses - ( and ) - cannot be used in struct field names in Hive:

hive> SELECT LogisticRegressionIris(named_struct('Sepal_Length', Sepal_Length, 'Sepal_Width', Sepal_Width, 'Petal_Length', Petal_Length, 'Petal_Width', Petal_Width)) FROM Iris;
FAILED: SemanticException Error: : expected at the position 33 of 'struct<species:string,probability(setosa):double,probability(versicolor):double,probability(virginica):double>' but '(' is found.

And because PMML is actually a text file, we can easily fix it by replacing this fragment:

   <OutputField name="probability(setosa)" optype="continuous" dataType="double" feature="probability" value="setosa"/>
   <OutputField name="probability(versicolor)" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
   <OutputField name="probability(virginica)" optype="continuous" dataType="double" feature="probability" value="virginica"/> 

with this fragment (note the change from probability(x) to probability.x):

   <OutputField name="probability.setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
   <OutputField name="probability.versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
   <OutputField name="probability.virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>

For your convenience, there's already a file LogisticRegressionIris-edit.pmml in the sampledata/ directory that contains the fix. You need to go back to the "SELECT BuildArchive..." step and generate the model JAR again, this time pointing it at the edited PMML file.

This fix is so simple thanks to the fact that PMML is a human-readable text file - if we had used some binary notation here, we would have needed to go back to the training environment.
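If you'd rather script the fix than edit the file by hand, the rename is mechanical enough for a regex. The sketch below is illustrative: the `sanitize_output_names` helper is my own naming, it rewrites only the name attribute (leaving feature="probability" and the value attributes untouched), and it is demonstrated on a single line rather than the real file:

```python
import re

def sanitize_output_names(pmml_text):
    # Rewrite name="probability(x)" to name="probability.x" so that the
    # output field names become legal inside a Hive struct type.
    return re.sub(r'name="probability\(([^)"]+)\)"',
                  r'name="probability.\1"',
                  pmml_text)

# With the real files from the tutorial you would do:
#   src = "/home/ec2-user/sampledata/LogisticRegressionIris.pmml"
#   dst = "/home/ec2-user/sampledata/LogisticRegressionIris-edit.pmml"
#   open(dst, "w").write(sanitize_output_names(open(src).read()))
line = '<OutputField name="probability(setosa)" optype="continuous" dataType="double" feature="probability" value="setosa"/>'
print(sanitize_output_names(line))
```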

So when you run the script with the edited PMML, you'll get your Irises classified:

hive> SELECT LogisticRegressionIris(named_struct('Sepal_Length', Sepal_Length, 'Sepal_Width', Sepal_Width, 'Petal_Length', Petal_Length, 'Petal_Width', Petal_Width)) FROM Iris;