Why Use PMML for Your Machine Learning Models?

PMML (Predictive Model Markup Language) is an XML-based format created by the Data Mining Group (DMG) to store predictive analytical models in a common format.

Although support for PMML among the tools on the market ranges from correct implementations to mediocre to non-existent, it is still pretty much the only open standard that provides a common denominator between different machine learning tools and packages. Yes, there's also PFA, but a Google search for "PFA evaluator" suggests searching for "PFA calculator" instead...

So unless you can train your models and make predictions with the same tool and in the same environment, you'll be looking for ways to be interoperable - for example, to deploy R models in Java applications. Sooner or later you'll find yourself researching options, and one of those is probably PMML.

As we've been creating tools that convert R/Scikit-learn/Spark/TensorFlow models to PMML and also providing the PMML evaluator, we know a thing or two about it. The arguments below concentrate more on model deployment and management processes and less on bleeding-edge research into new predictive model types.

 

#1 PMML is an open standard

Every time there's a question whether to prefer an open standard or a proprietary format, we'd bet on the open standard.

We don't even want to elaborate or bring examples of this, so as not to draw attention away from this single most important argument. It really beats us why someone would need bundles of JSON/Protobuf files without a clearly documented structure, when all of this can be represented as a single and clear PMML file.

 

#2 PMML model is a text file

PMML is a text-based representation of your machine learning model - it's no longer a black box where data goes in and scores come out. Many of the arguments below stem from the fact that we have a textual representation of the model in the first place.

 

#3 PMML model is human-readable and editable

You can just open the file and see what type of model it is and what the inputs/outputs and data transformations are.

Although you seldom want to go that route, you still have the option of a quick fix whenever it's really needed - for example, in the post about deploying PMML as a Hive UDF we quickly remove parentheses from the output field names, as these are reserved characters in Hive.

 

#4 PMML model is programmatically editable

In addition to being human-readable and editable, it's also editable by a program - you just need a proper XML parsing tool. JPMML converters use this feature extensively, from storing additional metadata (for debugging purposes) in the Extension block to converting binary splits in decision trees to equivalent multiway splits, which saves time and memory.
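As an illustration, here's a minimal sketch of such an edit using nothing but Python's standard library (the file name and the PMML 4.3 namespace are assumptions - use whatever your converter actually emitted):

import xml.etree.ElementTree as ET

# Assumption: the document uses the PMML 4.3 namespace
NS = "http://www.dmg.org/PMML-4_3"
ET.register_namespace("", NS)

tree = ET.parse("model.pmml")
root = tree.getroot()

# Inspect the declared input fields
for field in root.iter("{%s}DataField" % NS):
    print(field.get("name"), field.get("optype"), field.get("dataType"))

# Stash a piece of custom metadata in an Extension element
# (the PMML schema allows trailing Extension elements on the PMML root)
ext = ET.SubElement(root, "{%s}Extension" % NS)
ext.set("name", "trained-by")
ext.set("value", "nightly-job")  # hypothetical metadata value

tree.write("model-edited.pmml", xml_declaration = True, encoding = "UTF-8")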

 

#5 PMML model can be versioned

PMML models can be pretty-printed and checked into version control systems like Git or Subversion. It's hard to version binary artifacts, but there are plenty of tools to version, diff (software engineers' lingo for 'finding the difference') and merge text-based formats. This way, you can keep the history of your model's evolution, collaborate on it, or even maintain multiple branches of it.
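For example, here's a quick way to normalize a PMML file before committing it, so that diffs stay line-oriented and readable (a sketch using Python's standard library; "model.pmml" is a placeholder name):

import xml.dom.minidom

# Pretty-print the PMML before checking it into version control
dom = xml.dom.minidom.parse("model.pmml")
with open("model-pretty.pmml", "w") as f:
    f.write(dom.toprettyxml(indent = "  "))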

 

#6 PMML model can be stored for a long time

Say you have a (regulatory?) requirement for repeatable predictions: whenever you have the original data points and a timestamp, you can find the PMML model that was in production at the time and get exactly the same result that you got originally. JPMML even has MathML support for retrieving the whole prediction as one MathML expression.

You don't have to worry whether you still have any programs to read the model and make predictions based on it.

 

#7 PMML models are backward compatible by design

New PMML schema versions let you express the same logic more concisely and elegantly. Markup conforming to older PMML schema versions, even where it requires more elements to achieve the same objective, still continues to be valid.

Quoting our CTO Villu Ruusmann: "Give us any PMML file that has been produced in the past 10 years, and we can make a prediction with it. This statement will hold true 10 years into the future as well."

 

#8 PMML representation provides interoperability between tools

PMML models are independent of tools and their versions. If a vendor changes their proprietary data format, your old models might no longer be supported or even readable - when developing converters, we regularly stumble upon small but sometimes breaking changes like this. Not the case with PMML.

 

#9 PMML provides interoperability between teams and environments

PMML representation provides interoperability between training environments and deployment environments. You can train your model once and run it anywhere.

You don't have to force specific tools on data scientists just because your production environment can only use models created in, say, Python. You can have multiple sources of models, provided that they all can be exported to PMML. And you can execute the models in any environment that is capable of producing predictions based on PMML.
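As a sketch of what this looks like in practice: a model trained in R and exported to PMML can be scored from a Python application via the Openscoring REST service (this uses the Openscoring Python client library that also appears later in this blog; the model name, file name and input values are hypothetical):

from openscoring import Openscoring

# Connect to a running Openscoring service
os = Openscoring("http://localhost:8080/openscoring")

# Deploy a PMML file that was exported from R
os.deployFile("AutoMPG", "auto_glm.pmml")

# Score a single record in real time
arguments = {"cylinders": 4, "displacement": 97, "horsepower": 75, "weight": 2155, "acceleration": 16.4, "model_year": 76, "origin": "1"}
print(os.evaluate("AutoMPG", arguments))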


You might wonder how Openscoring/JPMML fits into this picture. We certainly make our bet on an open standard and have created tools to enable PMML export from many open platforms like R, Scikit-learn and Apache Spark.

Our tools produce PMML models and let you evaluate them in a multitude of environments: plain Java applications, for which we provide the JPMML-Evaluator library; Apache Hive and Pig, for which we provide integration via UDFs; Apache Spark, which we integrate with directly; and a client-agnostic REST web service for PMML model evaluation.

We're also working on machine learning model management software, where you can convert, store, version and verify your models.

Export Scikit-learn Pipeline Using SkLearn2PMML and Deploy as Apache Hive UDF

Introduction

In this tutorial we're going to take the Iris dataset, create and train a logistic regression model (wrapped in a small preprocessing pipeline) using Python and the Scikit-learn package, convert it to the standard PMML representation using SkLearn2PMML (a tool from the Openscoring JPMML family), package it into an Apache Hive User Defined Function (UDF) and then evaluate some data using the newly created UDF. Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. By defining a UDF, you can deploy the model 'close to the data', meaning you don't send huge amounts of data over the network to your scoring service, but bring the scoring service to the platform where your data resides.

There seems to be a large gap between developing machine learning models and deploying them to production. Openscoring believes in and invests heavily in PMML-provided interoperability between training and production environments, where the software stacks are usually very different - model development happens in R and Python, but production systems run on Java. We've developed a toolset of converters and evaluators, with PMML providing the interoperability in between.

This tutorial takes the following path through the JPMML ecosystem:

 

The Setup

All the necessary software and its dependencies are already preinstalled on an Amazon Machine Image, including Python, pandas, numpy, scipy, scikit-learn, sklearn2pmml, Apache Hive/Hadoop, and JPMML-Evaluator for Hive. Some of the dependencies are non-trivial - for example, installing scipy required adding memory to the EC2 instance, and Hive needed some initial configuration.

Here're the AMIs (and as usual - let me know if you can't access any of these regions and I'll make the image available in your region too):

  • EU (Frankfurt) - ami-011c1a70ec49fe054
  • US West (N. California) - ami-073a84c441b7f905c
  • US East (N. Virginia) - ami-0528869f6be13dbf6

In the prepared image:

  • Python is on path
  • Apache Hive is located in /home/ec2-user/apache-hive-2.3.2-bin/bin and also added to path
  • JPMML-Evaluator for Hive is located in /home/ec2-user/jpmml-evaluator-hive/ and the precompiled JARs are already present in target/ directory
  • All the necessary data is located in /home/ec2-user/sampledata directory

 

Training and Converting the Model

Launch Python by simply typing 'python' at the command prompt and then use the following script to train and convert the model:

import pandas

# Load the Iris dataset
iris_df = pandas.read_csv("/home/ec2-user/sampledata/Iris.csv")

from sklearn_pandas import DataFrameMapper
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

# Pipeline: decorate and impute the four inputs, reduce dimensionality
# with PCA, select the two best features, classify with logistic regression
iris_pipeline = PMMLPipeline([
        ("mapper", DataFrameMapper([
                (["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"], [ContinuousDomain(), Imputer()])
        ])),
        ("pca", PCA(n_components = 3)),
        ("selector", SelectKBest(k = 2)),
        ("classifier", LogisticRegression())
])

# Fit the pipeline against the target column
iris_pipeline.fit(iris_df, iris_df["Species"])

from sklearn2pmml import sklearn2pmml

# Export the fitted pipeline to PMML
sklearn2pmml(iris_pipeline, "/home/ec2-user/sampledata/LogisticRegressionIris.pmml", with_repr = True)

The same script is also located in the sampledata/ directory and can be launched using:

python Iris.py

It will train the model on the dataset provided in sampledata/Iris.csv and create the specified file LogisticRegressionIris.pmml in the sampledata/ directory.
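Since PMML is just text, you can peek at the result right away - for example, print the first few lines of the generated file from Python:

# Optional sanity check: the exported model is a plain XML text file
with open("/home/ec2-user/sampledata/LogisticRegressionIris.pmml") as pmml_file:
    for line in pmml_file.readlines()[:5]:
        print(line.rstrip())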

 

Creating Apache Hive UDF

Usually when you want to create a Hive UDF, you can't get away without writing some code. But we actually offer you a much better solution - we've done the code-writing part and come up with a META-UDF that writes the UDF code for you :-)

So here it goes - launch Hive by typing 'hive' at the command prompt and execute the following steps (I have to warn you that the script will fail, but read on, there's a solution):

ADD JAR /home/ec2-user/jpmml-evaluator-hive/target/jpmml-evaluator-hive-runtime-1.0-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION BuildArchive AS 'org.jpmml.evaluator.hive.ArchiveBuilderUDF';
DESCRIBE FUNCTION BuildArchive;
DESCRIBE FUNCTION EXTENDED BuildArchive;
SELECT BuildArchive('io.openscoring.LogisticRegressionIris', '/home/ec2-user/sampledata/LogisticRegressionIris.pmml', '/home/ec2-user/sampledata/LogisticRegressionIris.jar');
ADD JAR /home/ec2-user/sampledata/LogisticRegressionIris.jar;
CREATE TEMPORARY FUNCTION LogisticRegressionIris AS 'io.openscoring.LogisticRegressionIris';
DESCRIBE FUNCTION LogisticRegressionIris;
DESCRIBE FUNCTION EXTENDED LogisticRegressionIris;
CREATE EXTERNAL TABLE IF NOT EXISTS Iris (Sepal_Length DOUBLE, Sepal_Width DOUBLE, Petal_Length DOUBLE, Petal_Width DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/home/ec2-user/Iris';
SELECT LogisticRegressionIris(named_struct('Sepal_Length', Sepal_Length, 'Sepal_Width', Sepal_Width, 'Petal_Length', Petal_Length, 'Petal_Width', Petal_Width)) FROM Iris;

What this script does:

  • It'll add JPMML-Evaluator for Hive runtime to Hive
  • It'll create user defined function BuildArchive that is used to create the UDFs based on PMML
  • You'll need to give the following input parameters to the function:
    • The class name (io.openscoring.LogisticRegressionIris in the example; this class doesn't have to exist anywhere as .java file, it'll be generated)
    • The PMML file (LogisticRegressionIris.pmml in the example)
    • The path and name for JAR file where to package the resulting code
  • Then we'll add the newly generated model JAR file to Hive
  • And create a new function based on it (LogisticRegressionIris in the example)
  • We're then going to import some test data previously added to Hadoop Filesystem
  • And evaluate the records

This will fail with the following exception, due to the fact that parentheses - ( and ) - cannot be used in field names in Hive.

hive> SELECT LogisticRegressionIris(named_struct('Sepal_Length', Sepal_Length, 'Sepal_Width', Sepal_Width, 'Petal_Length', Petal_Length, 'Petal_Width', Petal_Width)) FROM Iris;
FAILED: SemanticException Error: : expected at the position 33 of 'struct<species:string,probability(setosa):double,probability(versicolor):double,probability(virginica):double>' but '(' is found.

And because PMML is actually a text file, we can easily fix it by replacing this fragment:

<Output>
   <OutputField name="probability(setosa)" optype="continuous" dataType="double" feature="probability" value="setosa"/>
   <OutputField name="probability(versicolor)" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
   <OutputField name="probability(virginica)" optype="continuous" dataType="double" feature="probability" value="virginica"/> 
</Output>

with this fragment (note the change from probability(x) to probability.x):

<Output>
   <OutputField name="probability.setosa" optype="continuous" dataType="double" feature="probability" value="setosa"/>
   <OutputField name="probability.versicolor" optype="continuous" dataType="double" feature="probability" value="versicolor"/>
   <OutputField name="probability.virginica" optype="continuous" dataType="double" feature="probability" value="virginica"/>
</Output>

For your convenience, there's already a file LogisticRegressionIris-edit.pmml in the sampledata/ directory that contains the fix. You need to go back to the "SELECT BuildArchive..." step and generate the model JAR again.

This fix is so simple thanks to the fact that PMML is a human-readable text file - if we had used some binary notation here, we would have needed to go back to the training environment.
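If you'd rather script the fix than edit the file by hand, a few lines of Python will do (a sketch; the PMML 4.3 namespace is an assumption - check the xmlns attribute of your file):

import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_3"
ET.register_namespace("", NS)

tree = ET.parse("/home/ec2-user/sampledata/LogisticRegressionIris.pmml")

# Rename probability(x) to probability.x in all OutputField elements
for field in tree.getroot().iter("{%s}OutputField" % NS):
    name = field.get("name")
    if name is not None and name.startswith("probability(") and name.endswith(")"):
        field.set("name", "probability." + name[len("probability("):-1])

tree.write("/home/ec2-user/sampledata/LogisticRegressionIris-edit.pmml", xml_declaration = True, encoding = "UTF-8")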

So when you run the script with the edited PMML, you'll get your irises classified:

hive> SELECT LogisticRegressionIris(named_struct('Sepal_Length', Sepal_Length, 'Sepal_Width', Sepal_Width, 'Petal_Length', Petal_Length, 'Petal_Width', Petal_Width)) FROM Iris;
OK
{"species":"versicolor","probability.setosa":0.2777858958596321,"probability.versicolor":0.6347515721817479,"probability.virginica":0.0874625319586199}
{"species":"setosa","probability.setosa":0.8897147674727336,"probability.versicolor":0.11025030924052269,"probability.virginica":3.4923286743653504E-5}
{"species":"setosa","probability.setosa":0.807031956342117,"probability.versicolor":0.1929198025927802,"probability.virginica":4.82410651027145E-5}
{"species":"setosa","probability.setosa":0.8190557599026391,"probability.versicolor":0.18091383415588733,"probability.virginica":3.0405941473391493E-5}

Export R Model with R2PMML and Deploy as Apache Pig UDF

Introduction

There seems to be a large gap between developing machine learning models and deploying them to production. This seems to be especially problematic when it comes to models created in R. In this tutorial, we go through the process of training a simple regression model on the Auto-MPG dataset in R, exporting it to PMML using R2PMML (a tool from the Openscoring JPMML family) and then deploying it on Apache Pig as a UDF (User Defined Function). UDFs are a great way to deploy predictive models close to your big data, avoiding sending the data over the network for scoring.

We here at Openscoring believe in and invest heavily in PMML-provided interoperability between training and production environments, where the software stacks are usually very different - model development happens in R and Python, but production systems run on Java. We've developed a toolset of converters and evaluators, with PMML providing the interoperability in between.

This tutorial takes the following path through the JPMML ecosystem:

The Setup

As usual, we try to provide the full setup on an Amazon Machine Image.

The image contains preinstalled R (including all the necessary packages like plyr, devtools and r2pmml), Apache Pig and JPMML-Evaluator for Pig (both of which in turn depend on many other software packages, like Java, Maven, JPMML-Evaluator and JPMML-Model). In case you want to install everything by yourself, expect some trial and error while installing all the dependencies - some of which are non-trivial on EC2.

We've created the following images for different regions (let me know if you can't access any of these regions and I'll make the image available in your region too):

  • EU (Frankfurt) - ami-011c1a70ec49fe054
  • US West (N. California) - ami-073a84c441b7f905c
  • US East (N. Virginia) - ami-0528869f6be13dbf6

When launching the AMI, you'll need to create a keypair to access it (there're very good tutorials provided by AWS for this).

In the prepared image:

  • R is on path
  • Apache Pig is located in /home/ec2-user/pig-0.17.0/bin/ and also added to path
  • JPMML-Evaluator for Pig is located in /home/ec2-user/jpmml-evaluator-pig/ and the precompiled JARs are already present in target/ directory
  • All the necessary data is located in /home/ec2-user/sampledata directory

 

    Training and Converting the Model

    You can launch R by simply typing R.

    The following script imports the necessary libraries and trains the model:

    # Installation preconditions
    library("plyr")
    library("devtools")
    library("r2pmml")
    
    # If any of the commands fail, you need to do:
    # install.packages("plyr")
    # install.packages("devtools")
    # install_git("git://github.com/jpmml/r2pmml.git")
    
    # Devtools package has several non-trivial dependencies on EC2:
    # sudo yum install openssl-devel 
    # sudo yum install libcurl-devel
    
    # Load and prepare the Auto-MPG dataset
    auto = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", quote = "\"", header = FALSE, na.strings = "?", row.names = NULL, col.names = c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model_year", "origin", "car_name"))
    auto$origin = as.factor(auto$origin)
    auto$car_name = NULL
    auto = na.omit(auto)
    
    # Train a model
    auto.glm = glm(mpg ~ (. - horsepower - weight - origin) ^ 2 + I(displacement / cylinders) + cut(horsepower, breaks = c(0, 50, 100, 150, 200, 250)) + I(log(weight)) + revalue(origin, replace = c("1" = "US", "2" = "Europe", "3" = "Japan")), data = auto)
    
    # Export the model to PMML
    r2pmml(auto.glm, "/home/ec2-user/sampledata/auto_glm.pmml")

      As usual in these tutorials we're not going to discuss the quality of the model. This tutorial is about converting and deploying the model, not data analysis.

       

      Creating a Simple PIG UDF and Evaluating Some Data

      First of all, let's launch Pig and register the necessary JAR file:

      ~/pig-0.17.0/bin/pig -x local

      And on Pig prompt:

      REGISTER /home/ec2-user/jpmml-evaluator-pig/target/jpmml-evaluator-pig-runtime-1.0-SNAPSHOT.jar;

      Now let's create a simple UDF based on the PMML file we got from R (feel free to overwrite the sampledata/auto_glm.pmml file with what you got from R):

      DEFINE AutoMPG org.jpmml.evaluator.pig.EvaluatorFunc('/home/ec2-user/sampledata/auto_glm.pmml');

      Now we'll load some data and do the scoring:

      AutoMPG_test = LOAD '/home/ec2-user/sampledata/mpg-test.csv' USING PigStorage(';') AS (cylinders:double, displacement:double, horsepower:double, weight:double, acceleration:double, model_year:double, origin:int, name:chararray);
      
      DESCRIBE AutoMPG_test;
      DUMP AutoMPG_test;
      
      AutoMPG_calc = FOREACH AutoMPG_test GENERATE AutoMPG(*);
      
      DESCRIBE AutoMPG_calc;
      DUMP AutoMPG_calc;

      Please be warned that both DUMPs will output a few hundred lines of data (so if you haven't configured your scrollback buffer to be long enough, you might lose the beginning of the log).

       

      Creating PMML Functions Using the ArchiveBuilder UDF

      Now we're going to take it to the META level and use our UDF to create UDFs :-) On Pig, you don't really need to go this route (as opposed to Apache Hive), but the solution is so elegant, we couldn't keep ourselves from implementing it!

      First, let's create a file Udf.csv with the following content (it actually already exists in sampledata/ directory):

      io.openscoring.AutoMPG,/home/ec2-user/sampledata/auto_glm.pmml,/home/ec2-user/sampledata/AutoMPG.jar

      And from Pig prompt we need to execute the following commands (this assumes that the jpmml-evaluator-pig-runtime is registered):

      Udf = LOAD '/home/ec2-user/sampledata/Udf.csv' USING PigStorage(',') AS (Class_Name:chararray, PMML_File:chararray, Model_Jar_File:chararray);
      
      Udf_model_jar_file = FOREACH Udf GENERATE org.jpmml.evaluator.pig.ArchiveBuilderFunc(*);
      
      DUMP Udf_model_jar_file;

      What happens here:

      • We load the class name from Udf.csv (there's no need for the actual .java file to exist with this class definition, we generate its code on the fly)
      • We use the PMML file referenced in Udf.csv (make sure it exists at the path given)
      • We output the generated and packaged code in the file Model_Jar_File at the specified path
      • DUMP actually creates the JAR file and returns the full path to it

      As the next step, register the JAR file and define UDF based on it:

      REGISTER /home/ec2-user/sampledata/AutoMPG.jar;
      DEFINE AutoMPG io.openscoring.AutoMPG;

      And now we can do the same scoring part as shown for the simple function.

      Converting R Models using R2PMML and Deploying Them on Openscoring REST Evaluator

      Introduction

      There seems to be a large gap between developing machine learning models and deploying them to production. This seems to be especially problematic when it comes to models created in R. In this tutorial, we go through the process of training a simple logistic regression model on the Titanic dataset in R, exporting it to PMML using the R2PMML converter (a tool from the Openscoring JPMML family) and then deploying it on the Openscoring REST-based evaluation service.

      We here at Openscoring believe in and invest heavily in PMML-provided interoperability between training and production environments, where the software stacks are usually very different - model development happens in R and Python, but production systems run on Java. We've developed a toolset of converters and evaluators, with PMML providing the interoperability in between.

      This tutorial looks at training a model in R and deploying it to a REST-based evaluator.

       

      The Setup

      Usually we've tried to provide the full setup on an Amazon Machine Image, but this time we're taking a slightly longer route, as an AMI doesn't provide graphical user interfaces like RStudio. So this is what you need:

       

      Training the Model

      The following script imports the necessary libraries and trains the model (make sure you get the Titanic dataset path right):

      # Installation preconditions
      install.packages("xgboost")
      library("xgboost")
      library("devtools")
      install_git("git://github.com/jpmml/r2pmml.git")
      library("r2pmml")
      
      # Reading data and training the model - you need to change the path!
      training.data.raw <- read.csv('C:/Users/karelk/Documents/jpmml/Titanic/Titanic.csv',header=T,na.strings=c(""))
      # Select the relevant columns first, then impute missing ages with the mean
      data <- subset(training.data.raw,select=c(2,3,5,6,7,8,10))
      data$Age[is.na(data$Age)] <- mean(data$Age,na.rm=T)
      data$Survived = as.factor(data$Survived)
      model <- glm(Survived ~.,family=binomial(link='logit'),data=data)
      r2pmml(model, "Titanic-R.pmml")

        We're not going to discuss the quality of the model - it's kept simple for example purposes. My intention here is to show the conversion and deployment part, not to maximize prediction accuracy. There's a very thorough tutorial exploring the various relationships in the Titanic dataset in this YouTube playlist.

         

        Model Deployment

        There are two ways to deploy the model, and both of them are much simpler than learning even basic R:

        The easiest way: Testing PMML Execution with Openscoring REST Service on Amazon EC2 - in this tutorial we provide you with an Amazon Machine Image where the Openscoring REST evaluation service is already listening on port 8080 once you start the image.

        Still easy, but with a few more dependencies: Deploying PMML Model to Openscoring REST Evaluation Service - in this tutorial we describe how to set up the Openscoring REST evaluation service on your own Windows computer and deploy the model.

        In the below screenshot I'm using Postman to deploy the model to the Openscoring service running on localhost:

        And in the next picture I'm already calculating the scores:

        [Screenshot: calculating scores in Postman]

        JPMML-Evaluator Decision Tree/Random Forest Performance Test on Amazon EC2 t2.micro

        One day I had a discussion with our CTO Villu Ruusmann about what scoring numbers we can actually use in our sales pitch or promote on the web page. Villu was telling me about single-digit microseconds, which - I can't say I didn't believe it, but I was quite a bit skeptical. I wanted to see it for myself, so Villu gave me the following command line to experiment with:

        C:\github\jpmml-evaluator>java -jar pmml-evaluator-example\target\example-1.4-SNAPSHOT.jar --model c:\github\openscoring\openscoring-service\src\etc\DecisionTreeIris.pmml --input c:\github\openscoring\openscoring-service\src\etc\input.csv --output test.txt --intern --optimize --loop 1000000

        It gave me some interesting numbers which seemed to confirm the single-digit microseconds, but I won't paste them here right away, because this was run on my personal laptop, the CPU usage never exceeded 40%, and the CSV file to be scored included just 3 records. Not good for comparison.

        I wanted something repeatable, something everyone can actually try, check out the setup themselves, count the lines in the input CSV files, verify the number of trees in a random forest. So once again, I turned to Amazon EC2.

         

        The Setup

        I used the following setup for the tests:

        • All tests were done on Amazon EC2 t2.micro (available on free tier). I didn't accumulate any costs during the tests.
        • JPMML-Evaluator 1.4.0 was used in the test
        • CSV files were used as an input to the tests (it's by far the easiest way to generate a large number of records)
        • Each test included scoring a million data points:
          • 1-record CSV scored 1000000 times
          • 10-record CSV scored 100000 times
          • 100-record CSV scored 10000 times
          • 1000-record CSV scored 1000 times
          • 10000-record CSV scored 100 times
          • 100000-record CSV scored 10 times
          • (with 1 000 000 records once I ran into memory issues on t2.micro)
        • The same approach was used to score the following models: an Iris decision tree and a Titanic random forest (see the results below)
        • The following command line parameters were passed to the evaluator:
          • --model specifies the model to be used (model is loaded only once and not 1 million times when scoring single record)
          • --input points to the CSV file containing the records to be scored
          • --output specifies the output file
            • The output file is overwritten on every loop - scoring 1 record a million times still produces 1 line in the output file, and means that the file is also opened, written and closed a million times
            • Scoring 1 hundred thousand records ten times produces output file with 100 000 records
            • Using /dev/null as output didn't result in any performance gains
          • --loop which specifies the number of times input file is run through the scoring process
          • --intern replaces recurring PMML elements with a single element; this improves memory usage, but not performance
          • --optimize tries to convert java.lang.String elements to java.lang.Integer or java.lang.Double once in the beginning, and not on every model execution
        • I ran every command 5 times, chose the best 3 of those 5 and averaged the results (this was done because, while the highs were relatively consistent, there were 2-3 lows during the overall testing process that I couldn't relate to our app in any way).

         

        The Results: Iris Decision Tree

        [Chart: Iris decision tree scoring performance]

        With a simple decision tree, a single data point is scored in 5 microseconds (that's five millionths of a second, aka 0.005 milliseconds). The peak performance comes with the combination of a 100-line CSV scored 10 000 times, where nearly 250 000 records are scored in 1 second, bringing a single score down to 4 microseconds. My theory is that the peak is produced by an optimum between handling file writes (remember, when scoring a single record, the file is opened, written and closed 1 million times) and managing the CSV in memory (addressing an array with 1000 items starts to take its toll).

         

        The Results: Titanic Random Forest

        [Chart: Titanic random forest scoring performance]

        With the 20-tree random forest, a single data point is scored in 19 microseconds, which is considerably slower than with a single decision tree - but it's nowhere near 20 times slower, as one might estimate; it's merely 3.6 times slower. The performance optimum at 100 × 10 000 is not as pronounced, as proportionally more time is spent on calculating the scores than on other activities (like writing files).

         

        Comparison

        [Chart: Iris vs Titanic scoring performance, side by side]

        I've included a graph showing the Iris (decision tree) and Titanic (random forest) model performance side by side - again, although the random forest contains 20 tree models whose output is averaged, it's not nearly 20 times slower than a single tree (which we consider really good!).

         

        Conclusion

        We can conclude that the base performance of the JPMML library really is very good, with single data point scorings measurable in microseconds. And this is all in a single thread, on the least powerful virtual machine AWS provides! In a less rigorous test on my Lenovo T470s laptop, I got around 700 000+ scores per second with the Iris decision tree at about 80% CPU utilization. Imagine what we could do in a multithreaded, optimized production environment running on a powerful server :-)

         

        Next Steps and How You Can Help

        Iris and Titanic are both tree-based models; we're interested in doing the same with other model types too, possibly trained in different environments and converted to PMML with different options. The same goes for our REST-based scoring engine.

        You can help us by providing actual models solving real-world problems, along with a sample dataset for scoring (say, 10 records), and we will run them under the same conditions for comparison. Of course, you can obfuscate the labels of inputs and outputs. Small enough models can be sent by email and bigger ones shared using WeTransfer - my email is karel@openscoring.io. I'd appreciate a few words about what these models do, which I won't disclose without permission. Or alternatively - you can run your own tests on the AMIs provided below.

         

        Amazon Machine Images

        The AMIs are as follows (let me know if you can't access any of these regions and I'll make the image available in your region too):

        • EU (Frankfurt) - ami-0a22a739ce1f9a9a7
        • US West (N. California) - ami-02a035e53d718a40f
        • US East (N. Virginia) - ami-0d90521380e13b956

        When launching the AMI, you'll need to create a keypair to access it (there're very good tutorials provided by AWS for this).

        JPMML-Evaluator is located in /home/ec2-user/jpmml-evaluator, with the PMML and CSV files in the IrisTestData/ and TitanicTestData/ directories respectively.

        The sample command line goes like this (assuming the working directory is ~/jpmml-evaluator):

        java -jar pmml-evaluator-example/target/example-1.4-SNAPSHOT.jar --model TitanicTestData/Titanic.pmml --input TitanicTestData/Titanic-10.csv --output test.txt --intern --optimize --loop 100000

        Here's the sample output:

        [ec2-user@ip-172-31-41-51 jpmml-evaluator]$ pwd
        /home/ec2-user/jpmml-evaluator
        [ec2-user@ip-172-31-41-51 jpmml-evaluator]$ java -jar pmml-evaluator-example/target/example-1.4-SNAPSHOT.jar --model TitanicTestData/Titanic.pmml --input TitanicTestData/Titanic-10.csv --output test.txt --intern --optimize --loop 100000 
        3/18/18 9:59:54 AM
        -- Timers --------
        main
                     count = 100000
                 mean rate = 5353.47 calls/second
             1-minute rate = 4464.99 calls/second
             5-minute rate = 4279.49 calls/second
            15-minute rate = 4246.15 calls/second
                       min = 0.14 milliseconds
                       max = 188.50 milliseconds
                      mean = 0.18 milliseconds
                    stddev = 0.65 milliseconds
                    median = 0.16 milliseconds
                      75% <= 0.17 milliseconds
                      95% <= 0.18 milliseconds
                      98% <= 0.66 milliseconds
                      99% <= 0.75 milliseconds
                    99.9% <= 4.47 milliseconds
        

        Please be aware that the mean rate in calls per second is per one CSV file that you feed to JPMML-Evaluator (if you have a 1-record CSV, the mean rate is per single data point; if you have a million-line CSV, the mean rate is per million records, and to get the per-data-point rate, you have to multiply it by a million).
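        To translate the sample output above into per-record numbers: Titanic-10.csv contains 10 records, so the mean rate of 5353.47 calls/second corresponds to roughly 5353.47 × 10 ≈ 53,535 records/second, i.e. about 1 / 53,535 ≈ 19 microseconds per record - consistent with the Titanic results discussed above.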

        Just a quick note from Villu regarding the difference between the max and mean times - the maximum time happens on the very first evaluation, when the optimizations are performed.

        Training Random Forest Model in Apache Spark and Deploying It to Openscoring REST Web Service

        Introduction

        In this tutorial we're building the following setup:

        1. We take the Titanic dataset 
        2. We are going to use a PySpark script based on this example (using SQL Dataframe, ML lib and Pipelines) and modify it a little bit to suit our needs
        3. We train a random forest model using Apache Spark Pipelines
        4. We convert the model to standard PMML using JPMML-SparkML package
        5. We deploy the random forest model to Openscoring web service using Openscoring Python client library
        6. We get real-time scoring of test data directly from the Spark environment

        Here's a visual representation of which JPMML/Openscoring components will be used in the following tutorial, as well as all the other possible paths from model training to model evaluation that we provide.

        Training model on Apache Spark and deploying to Openscoring REST evaluator (click for bigger version)

         

        Setting up the Environment

        We have some good news here - we've done all the heavy lifting for you on the Amazon Machine Images that you can use as a template to launch your own EC2 instance and try the same steps there. Please be aware that this AMI is more of a quick-and-dirty proof-of-concept setup and less of a production one. Of course, you can do the whole setup from scratch by yourself; it's no harder than just tracing the dependencies and installing them one by one (you're going to need JDK 8, Spark 2.2, Maven, Git, Python, NumPy, pandas, Openscoring and the JPMML-SparkML package - just to name a few).

        The AMIs are as follows (let me know if you can't access any of these regions and I'll make the image available in your region too):

        • EU (Frankfurt) - ami-0856fb979732e0016
        • EU (Ireland) - ami-0aa76e5c2adbaf682
        • US West (N. California) - ami-0c6ac011f918947e1
        • US East (N. Virginia) - ami-07f1b7187e58af6b4
        • Asia Pacific (Singapore) - ami-0ddae89443be18b52

        When launching the AMI, you'll need to create a keypair to access it (there're very good tutorials provided by AWS for this).

         

        Launching Spark

        Once you've logged in to the EC2 instance you just created based on our template, you can use the following commands to launch Apache Spark:

        $ cd spark-2.2.1-bin-hadoop2.7/bin
        $ ./pyspark --jars /home/ec2-user/jpmml-sparkml-package/target/jpmml-sparkml-package-1.3-SNAPSHOT.jar

        This will open the PySpark prompt and also include the necessary JPMML-SparkML package JAR file, which will take care of the pipeline conversion to PMML.

         

        Training the Random Forest Model with Titanic Data

        Here comes the Python code for training the model, deploying it to the Openscoring web service running on the same EC2 instance, and getting the test scores. The Openscoring-provided Python client library is used to deploy the model to the REST evaluator.
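        The script below is a condensed sketch of those steps (run from the PySpark prompt, where spark and sc already exist; the file paths are assumptions, and the exact toPMMLBytes and deployFile signatures should be double-checked against the jpmml_sparkml and openscoring package versions on the AMI):

        from pyspark.ml import Pipeline
        from pyspark.ml.classification import RandomForestClassifier
        from pyspark.ml.feature import RFormula
        
        # Load the Titanic training data (the path is an assumption)
        df = spark.read.csv("/home/ec2-user/sampledata/Titanic.csv", header = True, inferSchema = True)
        df = df.dropna()  # keep the sketch simple: drop incomplete records
        
        # RFormula picks the target and active columns
        formula = RFormula(formula = "Survived ~ Fare + Sex + Age + SibSp + Pclass + Embarked")
        classifier = RandomForestClassifier(numTrees = 20)
        pipelineModel = Pipeline(stages = [formula, classifier]).fit(df)
        
        # Convert the fitted pipeline to PMML using JPMML-SparkML
        from jpmml_sparkml import toPMMLBytes
        with open("Titanic.pmml", "wb") as pmml_file:
            pmml_file.write(toPMMLBytes(sc, df, pipelineModel))
        
        # Deploy to the Openscoring service running on the same machine
        from openscoring import Openscoring
        os = Openscoring("http://localhost:8080/openscoring")
        os.deployFile("Titanic", "Titanic.pmml")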

        Based on the prediction, the young female passenger had a pretty good chance of surviving the Titanic catastrophe, while the middle-aged man sadly didn't:

        >>> arguments = {"Fare":7.25,"Sex":"female","Age":20,"SibSp":1,"Pclass":1,"Embarked":"Q"}
        >>> print(os.evaluate("Titanic", arguments))
        {u'probability(0)': 0.1742083085440686, u'probability(1)': 0.8257916914559313, u'pmml(prediction)': u'1', u'Survived': u'1', u'prediction': 1.0}
        
        >>> arguments = {"Fare":7.25,"Sex":"male","Age":45,"SibSp":1,"Pclass":3,"Embarked":"S"}
        >>> print(os.evaluate("Titanic", arguments))
        {u'probability(0)': 0.8934427101921765, u'probability(1)': 0.10655728980782375, u'pmml(prediction)': u'0', u'Survived': u'0', u'prediction': 0.0}
        

         

        Accessing the Model Using Openscoring REST API

        You can access the Openscoring web service REST API (scroll down for the API spec) not only with the Python client library from Spark, but with any tool that can send custom HTTP requests. We prefer to use Postman for this. The previous post in our blog discussed using the Openscoring REST service in more detail.

        In order to access Openscoring running on your EC2 instance from outside the machine itself, you need to find the security group associated with your EC2 instance and add one rule:

        You need to choose Custom TCP, port 8080, Anywhere - this will allow you to access the Openscoring web service running on port 8080 from anywhere in the world (you can always choose 'My IP' to open it just to yourself). Now go back to the instances list, grab the public DNS of your instance and create the following request in Postman:

        • Use POST
        • Set URL to: http://<public DNS>:8080/openscoring/model/Titanic/csv
        • Set Content-Type header to text/plain

        You can use the following data as the request body (or compile your own based on the data dictionary):

        Fare,Sex,Age,SibSp,Pclass,Embarked
        7.8292,male,34.5,0,3,Q
        7,female,17,1,3,C
        9.6875,male,62,0,2,Q
        8.6625,male,27,0,3,S
        12.2875,female,22,1,3,S
        9.225,male,14,0,3,S
        7.6292,female,30,0,3,Q
        29,male,26,1,2,S

        Submitting the request in Postman should result in something like this:

        As per the model predictions, the chances of surviving the Titanic catastrophe were pretty bleak :-(
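        Alternatively, you can skip Postman altogether and send the same request from Python (a sketch using the requests library; substitute your instance's public DNS):

        import requests
        
        csv_body = """Fare,Sex,Age,SibSp,Pclass,Embarked
        7.8292,male,34.5,0,3,Q
        7,female,17,1,3,C
        """
        
        # POST the CSV body to the model's /csv endpoint
        response = requests.post(
            "http://<public DNS>:8080/openscoring/model/Titanic/csv",
            headers = {"Content-Type": "text/plain"},
            data = csv_body)
        print(response.text)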

        If you'd rather not install Postman, you can access these two URLs even from your browser:

        http://<public DNS>:8080/openscoring/model/Titanic
        http://<public DNS>:8080/openscoring/model/Titanic/pmml

        The first one will give you a brief overview of the model inputs and outputs in JSON format, while the second one will give you the PMML representation of the whole random forest.

        Feel free to let me know of any issues related to this guide in the comments section!

        Testing PMML Execution with Openscoring REST Service on Amazon EC2

        Last time I wrote about how to install the Openscoring web service on your own computer. That has several major dependencies, like installing Maven and GitHub Desktop, and possibly messing with Java versions and environment variables. While not too complex, it still requires some error-prone steps, depends on operating system versions, and so on.

        Today we want to lower the entry barrier even further - we've prepared an Amazon AMI (Amazon Machine Image) for you with the Openscoring service preinstalled. This post will walk you through the steps required to get things up and running.

        Preconditions

        This time, you're going to need just an AWS account and Postman.

        Amazon EC2 Setup

        The part with the steps related to AWS is a bit long, but that's also due to the number of images included. Bear with me, it's all web browsing and no command line at all :-)

        Assuming you have an AWS account or have signed up during the previous step, this is what you need to do:

        • On the very first screen that pops up, you need to choose EC2 (or "Launch a virtual machine with EC2").
        • This will lead you to EC2 management screen. Take a look at the upper right corner of the screen where the AWS region is displayed:
        [Screenshot: AWS region selector]

        I've created the images for the following regions:

        1. EU (Frankfurt) - ami-0856fb979732e0016
        2. EU (Ireland) - ami-0aa76e5c2adbaf682
        3. US West (N. California) - ami-0c6ac011f918947e1
        4. US East (N. Virginia) - ami-07f1b7187e58af6b4
        5. Asia Pacific (Singapore) - ami-0ddae89443be18b52

        If you cannot use any of these regions (I really don't know if Texas is US West or US East; at least from the EU, I couldn't access the EC2 instance I created in the US East region), please let me know and I'll make the image available in your region too.

        • From the left menu, choose Images->AMIs, choose Public Images and search according to your region for either the respective AMI ID directly or for "openscoring":
        • From the right-click menu choose Launch.
        • Next up is the Launch process. Choose Free Tier (if you haven't been using AWS for more than 12 months) and then Next: Configure Instance Details
        • Click through the steps always choosing Next:... (the default conf here is just fine), until you reach the security part:
        • What you need to do here is Add Rule and choose Custom TCP, port 8080, Anywhere (this will allow you to access Openscoring web service running on port 8080 from anywhere in the world).
        • It will briefly complain about leaving the instance open to the world, but we don't care as this is for testing anyway, so go ahead and Review and Launch the instance.
        • You can launch the instance without configuring any keys (of course if you want you can do it, but for our purposes no command line access will be needed; you'll also be able to launch more instances based on the same AMI later with keys if you want to dig around there).
        • Now navigate to the Instances screen and wait until the status is running:
        • From here, copy the Public DNS.

        Executing the Model

        The Openscoring service is already running on your EC2 instance on port 8080. You copied the public DNS name in the previous step; now you can open Postman, paste the name into the URL field, add port 8080, and do everything our REST API allows (scroll down to the API spec).

        For more details on how to use Postman, check out the last chapters of our post Deploying PMML Model to Openscoring REST Evaluation Service.

        The EC2 instance comes with the DecisionTreeIris PMML model predeployed, so you can easily assemble the following request for a first test:

        http://<public DNS>:8080/openscoring/model/DecisionTreeIris

        Set the request body to the following value:

        {
            "id" : "record-001",
            "arguments" : {
                "Sepal_Length" : 5.1,
                "Sepal_Width" : 3.5,
                "Petal_Length" : 1.4,
                "Petal_Width" : 0.2
            }
        }

        Set the Content-Type header to application/json, use POST as the HTTP method, and you're good to Send the request.

        And there you go again - based on the DecisionTreeIris model, your iris is of 'setosa' type with the probability score of 1.

        If any of the steps don't work out for you, feel free to let me know either in the comments or by email: karel@openscoring.io and I'll try to help.

        Deploying PMML Model to Openscoring REST Evaluation Service

        This article will guide you through the steps of deploying a PMML predictive model on our REST API based scoring service. It follows the guidelines provided in the project README file and tries to simplify some aspects for non-Java non-programmers :-) This is seriously entry level, so feel free to skip the parts you've gone through or done before (like GitHub or Java environment setup).

         

        Why Care?

        During many conversations with our licensees, we've found out that it's unnecessarily complex to deploy and execute a PMML model. Providing a REST API is a step towards simplifying the usage of PMML models. We also provide a range of converters which let you first convert your Scikit-learn, R or Apache Spark proprietary model formats to standard PMML.

        We strongly believe that a standardized PMML model format is the future of machine learning models. Having a text-based representation gives you the following advantages:

        • The models are human readable (and although not suggested - can be modified if necessary)
        • The models can be properly versioned - you can use version control tools (even GitHub) to manage your models, see the differences between versions, etc
        • You're not married to some proprietary software that costs six-figure sums :-)
        • And so on - probably we'll do a whole article about them in the future.

         

        Necessary Software

        This guide is written for Windows 10. GitHub Desktop seems to be the only piece not available on Linux, but there sure are differences when it comes to setting environment variables and so on.

        You will need the following software present to conveniently go through the below steps:

        1. GitHub Desktop
        2. Java JDK version 8.x (we're moving to JDK 9 soon)
        3. Maven
        4. Postman

         

        Pulling the Sources

        Run GitHub Desktop - you should get to the following screen:

        From there you need to choose Clone a repository, locate the URL tab and enter 'https://github.com/openscoring/openscoring':

        [Screenshot: cloning the repository in GitHub Desktop]

        Now you've got the source code on your machine and next step is building the application.

         

        Building and Running the Application

        For building you need to have both JDK and Maven installed.

        Make sure you have both maven\bin and java\bin on your path. You can execute the following commands from your command prompt, just make sure you change the actual path according to your installation:

        setx path "%path%;C:\apache-maven-3.5.2\bin"
        setx path "%path%;C:\Program Files\Java\jdk1.8.0_161\bin"

        Read this guide on how to set JAVA_HOME (it's a system variable and cannot be set from the command line unless you're an admin).

        Now open a command prompt, navigate to where your Openscoring sources are and start the build:

        cd c:\github\openscoring
        c:\github\openscoring> mvn clean install
        

        This will take a few minutes as all the dependencies are downloaded, the code is compiled and the tests are run. It should end with the following messages:

        Now navigate to the openscoring-server directory from the command line and run the following command:

        C:\github\openscoring>cd openscoring-server
        C:\github\openscoring\openscoring-server>java -jar target/server-executable-1.4-SNAPSHOT.jar

        This should result in the Openscoring server running:

        Now the Openscoring service is listening at http://localhost:8080/openscoring/. You can navigate there using your browser, but as there are no models deployed yet, it will just return an error message for now.

         

        Deploying the Model Using Postman

        Postman is a very simple and intuitive tool for sending HTTP requests. It requires registration, but this is worth the two minutes.

        Now we're finally ready to deploy the model. Example PMML files and requests are located in the openscoring\openscoring-service\src\etc directory. You need to pay attention to a few things:

        • Method must be PUT
        • URL must be http://localhost:8080/openscoring/model/<model name> (the model name is something you should choose)
        • Content-Type header must be set to application/xml
        • In Body tab, you should choose binary and also locate the model PMML file (again, the examples are in openscoring\openscoring-service\src\etc directory, choose DecisionTreeIris.pmml from there)
        [Screenshot: Postman request body settings]

        And then you can click SEND. This is what should appear in a moment:

        In the response body, you see the deployed model in JSON format. This means that the model is deployed and accessible at the endpoint you specified, in my example http://localhost:8080/openscoring/model/DecisionTreeIris.

         

        Executing the Deployed PMML Model

        When making the scoring request, you need to keep an eye on the following items:

        • HTTP method must be POST
        • URL will remain the same
        • On the Body tab, choose raw and set the content type to application/json (this will also change the Content-Type header on the Headers tab)
        • The following request can be pasted to Body window:
        {
            "id" : "record-001",
            "arguments" : {
                "Sepal_Length" : 5.1,
                "Sepal_Width" : 3.5,
                "Petal_Length" : 1.4,
                "Petal_Width" : 0.2
            }
        }

        And there you go - based on the DecisionTreeIris model, your iris is of 'setosa' type with the probability score of 1.
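        The same deploy-and-evaluate cycle can also be scripted - here's a sketch in Python with the requests library, mirroring the Postman steps above (adjust the PMML path to your own checkout):

        import requests
        
        url = "http://localhost:8080/openscoring/model/DecisionTreeIris"
        
        # PUT the PMML file to deploy the model
        with open(r"c:\github\openscoring\openscoring-service\src\etc\DecisionTreeIris.pmml", "rb") as pmml_file:
            requests.put(url, headers = {"Content-Type": "application/xml"}, data = pmml_file)
        
        # POST a JSON record to evaluate the deployed model
        record = {
            "id": "record-001",
            "arguments": {
                "Sepal_Length": 5.1,
                "Sepal_Width": 3.5,
                "Petal_Length": 1.4,
                "Petal_Width": 0.2
            }
        }
        print(requests.post(url, json = record).json())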

        You can check out the other methods in Openscoring REST API from the documentation.

        Using Apache Spark ML pipeline models for real-time prediction: the Openscoring REST web service approach

        Originally written by Villu Ruusmann

        EDIT: There's also an updated version of the post, where we even provide Amazon Machine Image for testing.

        Apache Spark follows the batch data processing paradigm, which has its strengths and weaknesses. On the one hand, batch processing is suitable for working with true Big Data datasets. Apache Spark splits the task into manageable-size batches and distributes the workload across a cluster of machines. Apache Spark competitors such as R or Python cannot match that, because they typically require the task to fit into the RAM of a single machine.

        On the other hand, batch processing is characterized by high "inertia". Apache Spark falls short in application areas where it is necessary to work with small datasets (e.g. single data records) in real time. Essentially, there is a lower bound (instead of an upper bound) to the effective size of a task.

        This blog post is about demonstrating a workflow where Spark ML pipeline models are exported in Predictive Model Markup Language (PMML) data format, and then imported into Openscoring REST web service for easy interfacing with third-party applications.

        Step 1: Exporting Spark ML pipeline models to PMML

        The support for PMML was introduced in Apache Spark MLlib version 1.4.0 in the form of the org.apache.spark.mllib.pmml.PMMLExportable trait. The invocation of the PMMLExportable#toPMML() method (or one of its overloaded variants) produces a PMML document which contains the symbolic description of the fitted model object.

        Unfortunately, this solution is not very relevant to Apache Spark ML. First, Spark ML is organized around the pipeline concept. A Spark ML pipeline can be regarded as a directed graph of data transformations and models. When exporting a model, it is necessary to include all the preceding stages in the dump. Second, Spark ML comes with rich metadata. The DataFrame representation of a dataset is associated with a static schema, which can be queried for column names, data types and more. Finally, Spark ML has replaced and/or abstracted away a great deal of Spark MLlib APIs. Newer versions of Spark ML have almost completely ceased to rely on the Spark MLlib classes that implement the PMMLExportable trait.

        The JPMML-SparkML library is an independent effort to provide a fully-featured PMML exporter for Spark ML pipelines.

        The main interaction point is the org.jpmml.sparkml.ConverterUtil#toPMML(StructType, PipelineModel) utility method. The conversion engine initializes a PMML document based on the StructType argument, and fills it with relevant content by iterating over all the stages of the PipelineModel argument.

        The conversion engine requires a valid class mapping from org.apache.spark.ml.Transformer to org.jpmml.sparkml.TransformerConverter for every stage class. The class mappings registry is automatically populated for most common Spark ML transformer and model types. Application developers can implement and register their own TransformerConverter classes when looking to move beyond that.

        Typical usage:

        DataFrame dataFrame = ...;
        StructType schema = dataFrame.schema();
        
        Pipeline pipeline = ...;
        PipelineModel pipelineModel = pipeline.fit(dataFrame);
        
        PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);
        
        JAXBUtil.marshalPMML(pmml, new StreamResult(System.out));

        The JPMML-SparkML library depends on a newer version of the JPMML-Model library than Spark MLlib, which introduces severe compile-time and run-time classpath conflicts. The solution is to employ Maven Shade Plugin and relocate the affected org.dmg.pmml and org.jpmml.(agent|model|schema) packages.

        The JPMML-SparkML-Bootstrap project aims to provide a complete example of developing and packaging a JPMML-SparkML powered application.

        The org.jpmml.sparkml.bootstrap.Main application class demonstrates a two-stage Spark ML pipeline. The first stage is a RFormula feature selector that selects columns from a CSV input file. The second stage is either a DecisionTreeRegressor or DecisionTreeClassifier estimator that finds the best approximation between the target column and active columns. The result is written to a PMML output file.

        The exercise starts with training a classification-type decision tree model for the "wine quality" dataset:

        spark-submit \
          --class org.jpmml.sparkml.bootstrap.Main \
          /path/to/jpmml-sparkml-bootstrap/target/bootstrap-1.0-SNAPSHOT.jar \
          --formula "color ~ . -quality" \
          --csv-input /path/to/jpmml-sparkml-bootstrap/src/test/resources/wine.csv \
          --function CLASSIFICATION \
          --pmml-output wine-color.pmml

        The resulting wine-color.pmml file can be opened for inspection in a text editor.

        Step 2: The essentials of PMML representation

        A PMML document specifies a workflow for transforming an input data record to an output data record. The end user interacts with the entry and exit interfaces of the workflow, and can completely disregard its internals.

        The design and implementation of these two interfaces is PMML engine specific. The JPMML-Evaluator library is geared towards maximum automation. The entry interface exposes a complete description of the active fields. Similarly, the exit interface exposes a complete description of the primary target field and the secondary output fields. A capable end user agent can use this information to format input data records and parse output data records without any external help.

        Input

        The decision tree model is represented as the PMML/TreeModel element. Its schema is defined by the combination of MiningSchema and Output child elements.

        A MiningField element serves as a collection of "import" and "export" statements. It refers to some field, and stipulates its role and requirements in the context of the current model element. The fields themselves are declared as PMML/DataDictionary/DataField and PMML/TransformationDictionary/DerivedField elements.

        The wine color model defines eight input fields ("fixed_acidity", "volatile_acidity", .., "sulphates"). The values of input fields are prepared by performing type conversion from user-supplied representation to PMML representation, which is followed by categorization into valid, invalid or missing subspaces, and application of subspace-specific treatments.

        The default definition of the "fixed_acidity" input field:

        <PMML>
          <DataDictionary>
            <DataField name="fixed_acidity" optype="continuous" dataType="double"/>
          </DataDictionary>
          <TreeModel>
            <MiningSchema>
              <MiningField name="fixed_acidity"/>
            </MiningSchema>
          </TreeModel>
        </PMML>

        The same, after manual enhancement:

        <PMML>
          <DataDictionary>
            <DataField name="fixed_acidity" optype="continuous" dataType="double">
              <Value value="?" property="missing"/>
              <Interval closure="closedClosed" leftMargin="3.8" rightMargin="15.9"/>
            </DataField>
          </DataDictionary>
          <TreeModel>
            <MiningSchema>
              <MiningField name="fixed_acidity" invalidValueTreatment="returnInvalid" missingValueReplacement="7.215307" missingValueTreatment="asMean"/>
            </MiningSchema>
          </TreeModel>
        </PMML>

        The enhanced definition reads:

        1. If the user didn't supply a value for the "fixed_acidity" input field, or its string representation is equal to string constant "?", then replace it with string constant "7.215307".
        2. Convert the value to double data type and continuous operational type.
        3. If the value is in range [3.8, 15.9], then pass it on to the model element. Otherwise, throw an "invalid value" exception.
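
        Continuing the above JPMML-Evaluator sketch, these treatments can be observed directly at the entry interface (a hedged illustration; the behaviors follow from the MiningField attributes, not from any hard-coded logic):

        InputField fixedAcidity = ...; // the InputField for "fixed_acidity"
        
        fixedAcidity.prepare(null);   // Missing value; replaced with 7.215307
        fixedAcidity.prepare("?");    // Declared missing value; replaced with 7.215307
        fixedAcidity.prepare("20.0"); // Out of range; raises an "invalid value" exception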

        Output

        The primary target field may be accompanied by a set of secondary output fields, which expose additional details about the prediction. For example, classification models typically return the label of the winning class as the primary result, and the breakdown of the class probability distribution as the secondary result.

        Secondary output fields are declared as Output/OutputField elements.

        Spark ML models indicate the availability of additional details by implementing marker interfaces. The conversion engine keeps an eye out for the org.apache.spark.ml.param.shared.HasProbabilityCol marker interface. Its presence is taken as proof that the classification model is capable of estimating the class probability distribution, which is a prerequisite for encoding an Output element that contains probability-type OutputField child elements.

        The wine color model defines a primary target field ("color"), and two secondary output fields ("probability_white" and "probability_red"):

        <PMML>
          <DataDictionary>
            <DataField name="color" optype="categorical" dataType="string">
              <Value value="white"/>
              <Value value="red"/>
            </DataField>
          </DataDictionary>
          <TreeModel>
            <MiningSchema>
              <MiningField name="color" usageType="target"/>
            </MiningSchema>
            <Output>
              <OutputField name="probability_white" feature="probability" value="white"/>
              <OutputField name="probability_red" feature="probability" value="red"/>
            </Output>
          </TreeModel>
        </PMML>
        

        In the case of decision tree models, it is often desirable to obtain information about the decision path. The identifier of the winning decision tree leaf can be queried by declaring an extra entityId-type OutputField element:

        <PMML>
          <TreeModel>
            <Output>
              <OutputField name="winnerId" feature="entityId"/>
            </Output>
          </TreeModel>
        </PMML>
        

        Spark ML does not assign explicit identifiers to decision tree nodes. Therefore, a PMML engine returns implicit identifiers in the form of a 1-based index, which are perfectly adequate for distinguishing between winning decision tree leaves.

        The JPMML-Evaluator and JPMML-Model libraries provide rich APIs that can resolve node identifiers to org.dmg.pmml.Node class model objects, and backtrack from these to the root of the decision tree.
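
        For example, a minimal sketch (assuming the tree model evaluator class implements the org.jpmml.evaluator.HasEntityRegistry interface, which maps entity identifiers to class model objects):

        TreeModelEvaluator treeModelEvaluator = ...;
        
        // The entity registry maps identifiers to org.dmg.pmml.Node objects
        BiMap<String, Node> entityRegistry = treeModelEvaluator.getEntityRegistry();
        
        Map<FieldName, ?> results = treeModelEvaluator.evaluate(arguments);
        String winnerId = (String)results.get(FieldName.create("winnerId"));
        
        // Resolve the identifier; backtracking to the root is a simple tree walk
        Node winner = entityRegistry.get(winnerId);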

        Transformations

        From the PMML perspective, Spark ML data transformations can be classified as "real" or "pseudo". A "real" transformation performs a computation on a feature or a feature vector. It is encoded as one or more PMML/TransformationDictionary/DerivedField elements.

        Examples of "real" transformer classes:

        • Binarizer
        • Bucketizer
        • MinMaxScaler
        • PCA
        • QuantileDiscretizer
        • StandardScaler

        A Binarizer transformer for "discretizing" wine samples based on their sweetness:

        Binarizer sweetnessBinarizer = new Binarizer()
          .setThreshold(6.0)
          .setInputCol("residual_sugar")
          .setOutputCol("sweet_indicator");
        

        The above, after conversion to PMML:

        <PMML>
          <TransformationDictionary>
            <DerivedField name="sweet_indicator" dataType="double" optype="continuous">
              <Apply function="if">
                <Apply function="lessOrEqual">
                  <FieldRef field="residual_sugar"/>
                  <Constant>6.0</Constant>
                </Apply>
                <Constant>0.0</Constant>
                <Constant>1.0</Constant>
              </Apply>
            </DerivedField>
          </TransformationDictionary>
        </PMML>
        

        A "pseudo" transformation performs Spark ML-specific housekeeping work such as assembling, disassembling or subsetting feature vectors.

        Examples of "pseudo" transformer classes:

        • ChiSqSelector
        • IndexToString
        • OneHotEncoder
        • RFormula
        • StringIndexer
        • VectorAssembler
        • VectorSlicer

        The conversion engine is capable of performing smart analyses and optimizations in order to produce a maximally compact and expressive PMML document. A case in point is the identification and pruning of unused field declarations, which improves the robustness and performance of production workflows.

        For example, the wine.csv CSV data file contains 11 feature columns, but the wine color model reveals that three of them ("residual_sugar", "free_sulfur_dioxide" and "alcohol") do not contribute to the discrimination between white and red wines in any way. The conversion engine takes notice of that and omits all the related data transformations from the workflow, thereby eliminating three-elevenths of the complexity.

        Step 3: Importing PMML to Openscoring REST web service

        Openscoring provides a way to expose a predictive model as a REST web service. The primary design consideration is to make predictive models easily discoverable and usable (a variation of the HATEOAS theme) for human and machine agents alike. The PMML representation is a perfect fit thanks to the availability of rich descriptive metadata. Other representations can be plugged into the framework with the help of wrappers that satisfy the requested metadata query needs.

        Openscoring is a minimalistic Java web application that conforms to the Servlet and JAX-RS specifications.

        It can be built from the source checkout using Apache Maven:

        git clone https://github.com/jpmml/openscoring.git
        cd openscoring
        mvn clean package
        

        Openscoring exists in two variants. First, the standalone command-line application variant openscoring-server/target/server-executable-${version}.jar is based on the Jetty web server. Easy configuration and almost instant startup and shutdown times make it suitable for local development and testing use cases. Second, the web application (WAR) variant openscoring-webapp/target/openscoring-webapp-${version}.war is more suitable for production use cases. It can be deployed on any standards-compliant Java web or application container, and secured and scaled according to the organization's preferences.

        Alternatively, release versions of the Openscoring WAR file can be downloaded from the org/openscoring/openscoring-webapp section of the Maven Central repository.

        A demo instance of Openscoring can be launched by dropping its WAR file into the auto-deployment directory of a running Apache Tomcat web container:

        1. Download the latest openscoring-webapp-${version}.war file from the Maven Central repository to a temporary directory. At the time of writing this, it is openscoring-webapp-1.2.15.war.
        2. Rename the downloaded file to openscoring.war. Apache Tomcat generates the context path for a web application from the filename part of the WAR file. So, the context path for openscoring.war will be "/openscoring/" (whereas for the original openscoring-webapp-${version}.war it would have been "/openscoring-webapp-${version}/").
        3. Move the openscoring.war file from the temporary directory to the $CATALINA_HOME/webapps auto-deployment directory. Allow the directory watchdog thread a couple of seconds to unpack and deploy the web application.
        4. Verify the deployment by accessing http://localhost:8080/openscoring/model. Upon success, the response body should be an empty JSON object { }.

        Openscoring maps every PMML document to a /model/${id} endpoint, which provides model-oriented information and services according to the REST API specification.

        Model deployment, download and undeployment are privileged actions that are only accessible to users with the "admin" role. All the unprivileged actions are accessible to all users. This basic access and authorization control can be overridden at the Java web container level. For example, one can configure Servlet filters that restrict the visibility of endpoints by some prefix/suffix, limit the number of data records that can be evaluated in a time period, and so forth.
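
        For illustration, a hypothetical Servlet filter that hides the privileged actions altogether by rejecting all HTTP PUT and DELETE requests (a sketch against the standard javax.servlet API; the class name and its web.xml mapping are up to the deployer):

        public class ReadOnlyFilter implements Filter {
        
          @Override
          public void init(FilterConfig filterConfig){
          }
        
          @Override
          public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
            String method = ((HttpServletRequest)request).getMethod();
            // Reject model deployment and undeployment attempts
            if(("PUT").equals(method) || ("DELETE").equals(method)){
              ((HttpServletResponse)response).sendError(HttpServletResponse.SC_FORBIDDEN);
              return;
            }
            chain.doFilter(request, response);
          }
        
          @Override
          public void destroy(){
          }
        }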

        Deployment

        Adding the wine color model:

        curl -X PUT --data-binary @/path/to/wine-color.pmml -H "Content-type: text/xml" http://localhost:8080/openscoring/model/wine-color
        

        The response body is an org.openscoring.common.ModelResponse object:

        {
          "id" : "wine-color",
          "miningFunction" : "classification",
          "summary" : "Tree model",
          "properties" : {
            "created.timestamp" : "2016-06-19T21:35:58.592+0000",
            "accessed.timestamp" : null,
            "file.size" : 13537,
            "file.md5sum" : "1a4eb6324dc14c00188aeac2dfd6bb03"
          },
          "schema" : {
            "activeFields" : [ {
              "id" : "fixed_acidity",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "volatile_acidity",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "citric_acid",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "chlorides",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "total_sulfur_dioxide",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "density",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "pH",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "sulphates",
              "dataType" : "double",
              "opType" : "continuous"
            } ],
            "targetFields" : [ {
              "id" : "color",
              "dataType" : "string",
              "opType" : "categorical",
              "values" : [ "white", "red" ]
            } ],
            "outputFields" : [ {
              "id" : "probability_white",
              "dataType" : "double",
              "opType" : "continuous"
            }, {
              "id" : "probability_red",
              "dataType" : "double",
              "opType" : "continuous"
            } ]
          }
        }
        

        The pattern is to move all model-related logic to the server side, so that Openscoring client applications can be developed and used on a wide variety of platforms by people with varying degrees of experience.

        All agents should be able to "parse" the above object at the basic model identification and schema level. For example, understanding that the REST endpoint /model/wine-color holds a classification-type decision tree model, which consumes an eight-element input data record, and produces a three-element output data record.
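
        For example, a basic agent could extract this information with an off-the-shelf JSON library (a sketch using the Jackson library; the property names follow the ModelResponse object shown above):

        ObjectMapper objectMapper = new ObjectMapper();
        
        String responseBody = ...; // the body of the HTTP response
        JsonNode modelResponse = objectMapper.readTree(responseBody);
        
        String miningFunction = modelResponse.get("miningFunction").asText(); // "classification"
        int inputCount = modelResponse.get("schema").get("activeFields").size(); // 8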

        More sophisticated agents could rise to the elevated model verification and field schema levels. For example, checking that the reported file size and MD5 checksum are correct, and establishing field mappings between the model and the data store.

        Evaluation

        Evaluating the wine color model in single prediction mode:

        curl -X POST --data-binary @/path/to/data_record.json -H "Content-type: application/json" http://localhost:8080/openscoring/model/wine-color
        

        The request body is an org.openscoring.common.EvaluationRequest object:

        {
          "id" : "sample-1",
          "arguments" : {
            "fixed_acidity" : 7.4,
            "volatile_acidity" : 0.7,
            "citric_acid" : 0,
            "chlorides" : 0.076,
            "total_sulfur_dioxide" : 34,
            "density" : 0.9978,
            "pH" : 3.51,
            "sulphates" : 0.56
          }
        }
        

        The response body is an org.openscoring.common.EvaluationResponse object:

        {
          "id" : "sample-1",
          "result" : {
            "color" : "red",
            "probability_white" : 8.264462809917355E-4,
            "probability_red" : 0.9991735537190083
          }
        }
        

        Evaluating the wine color model in CSV mode:

        curl -X POST --data-binary @/path/to/wine.csv -H "Content-type: text/plain; charset=UTF-8" http://localhost:8080/openscoring/model/wine-color/csv > /path/to/wine-color.csv
        

        Undeployment

        Removing the wine color model:

        curl -X DELETE http://localhost:8080/openscoring/model/wine-color
        

        Openscoring client libraries

        The Openscoring REST API is fairly mature and stable. The majority of changes happen in the "REST over HTTP(S)" transport layer. For example, adding support for new data formats and encodings, new user authentication mechanisms, etc.

        Openscoring client libraries provide easy and effective means for keeping up with such changes. Application developers get to focus on high-level routines such as the "deploy", "evaluate" and "undeploy" commands, whose syntax and semantics should remain stable for extended periods of time.

        The Java client library is part of the Openscoring project. Other client libraries (Python, R, PHP) are isolated into their own projects.

        For example, the following Python script uses the Openscoring-Python library to replicate the example workflow:

        import openscoring
        
        os = openscoring.Openscoring("http://localhost:8080/openscoring")
        
        # Deployment
        os.deploy("wine-color", "/path/to/wine-color.pmml")
        
        # Evaluation in single prediction mode
        arguments = {
          "fixed_acidity" : 7.4,
          "volatile_acidity" : 0.7,
          "citric_acid" : 0,
          "chlorides" : 0.076,
          "total_sulfur_dioxide" : 34,
          "density" : 0.9978,
          "pH" : 3.51,
          "sulphates" : 0.56
        }
        result = os.evaluate("wine-color", arguments)
        print(result)
        
        # Evaluation in CSV mode
        os.evaluateCsv("wine-color", "/path/to/wine.csv", "/path/to/wine-color.csv")
        
        # Undeployment
        os.undeploy("wine-color")

        Converting R's random forest (RF) models to PMML documents

        Originally written by Villu Ruusmann

        The power and versatility of the R environment stems from its modular architecture. The functionality of the base platform can be quickly and easily expanded by downloading extension packages from the CRAN repository. For example, random forest models can be trained using the following functions:

        1. randomForest (randomForest package). Generic regression and classification. This is the reference implementation.
        2. cforest (party package). Generic regression and classification.
        3. randomUniformForest (randomUniformForest package). Generic regression and classification.
        4. bigrfc (bigrf package). Generic classification.
        5. logforest (LogicForest package). Binary classification.
        6. obliqueRF (obliqueRF package). Binary classification.
        7. quantregForest (quantregForest package). Quantile regression.

        Every function implements a variation of the "bagging of decision trees" idea. The result is returned as a random forest object, whose description is typically formalized using a package-specific S3 or S4 class definition.

        All such model objects are dummy data structures. They can only be executed using a corresponding function predict.<model_type>. For example, a random forest object that was trained using the function randomForest can only be executed by the function predict.randomForest (and not with some other function such as predict.cforest, predict.randomUniformForest etc.).

        This one-to-one correspondence between models and model execution functions makes the deployment of R models on Java and Python platforms very complicated. Basically, it will be necessary to implement a separate Java and Python executor for every model type.


        Predictive Model Markup Language (PMML) is an XML-based industry standard for the representation of predictive solutions. PMML provides a MiningModel element that can encode a wide variety of bagging and boosting models (plus more complex model workflows). A model that has been converted to the PMML data format can be executed by any compliant PMML engine. A list of PMML producer and consumer software can be found at the Data Mining Group (DMG) website under the PMML Powered section.

        PMML leads to simpler and more robust model deployment workflows. Basically, models are first converted from their function-specific R representation to the PMML representation, and then executed on a shared platform-specific PMML engine. For the Java platform this could be the JPMML-Evaluator library. For the Python platform this could be the Augustus library.


        The conversion of model objects from R to PMML is straightforward, because these two languages share many of the core concepts. For example, they both regard data records as collections of key-value pairs (e.g. individual fields are identified by name, not by position), and decorate their data exchange interfaces (e.g. model input and output data records) with data schema information.

        Conversion

        The first version of the pmml package was released in early 2007. This package has provided great service for the community over the years. However, it has largely failed to respond to new trends and developments, such as the emergence and widespread adoption of ensemble methods.

        This blog post is about introducing the r2pmml package. Today, it simply addresses the major shortcomings of the pmml package. Going forward, it aims to bring a completely new set of tools to the table. The long-term goal is to make R models together with associated data pre- and post-processing workflows easily exportable to other platforms.

        The exercise starts with training a classification-type random forest model for the "audit" dataset. All the data preparation work has been isolated to a separate R script "audit.R".

        source("audit.R")
        
        measure = function(fun){
          begin.time = proc.time()
          result = fun()
          end.time = proc.time()
        
          diff = (end.time - begin.time)
          print(paste("Operation completed in", round(diff[3] * 1000), "ms."))
        
          return (result)
        }
        
        audit = loadAuditData()
        audit = na.omit(audit)
        
        library("randomForest")
        
        set.seed(42)
        audit.rf = randomForest(Adjusted ~ ., data = audit, ntree = 100)
        format(object.size(audit.rf), unit = "kB")
        
        library("pmml")
        
        audit.pmml = measure(function(){ pmml(audit.rf) })
        format(object.size(audit.pmml), unit = "kB")
        measure(function(){ saveXML(audit.pmml, "/tmp/audit-pmml.pmml") })
        
        library("r2pmml")
        
        measure(function(){ r2pmml(audit.rf, "/tmp/audit-r2pmml.pmml") })
        measure(function(){ r2pmml(audit.rf, "/tmp/audit-r2pmml.pmml") })
        

        The summary of the training run:

        1. Model training:
          • The size of the audit.rf object is 2'031 kB.
        2. Model export using the pmml package:
          • The pmml function call is completed in 61'280 ms.
          • The size of the audit.pmml object is 280'058 kB.
          • The saveXML function call is completed in 33'926 ms.
          • The size of the XML-tidied audit-pmml.pmml file is 6'853 kB.
        3. Model export using the r2pmml package:
          • The first r2pmml function call is completed in 4'077 ms.
          • The second r2pmml function call is completed in 1'466 ms.
          • The size of the XML-tidied audit-r2pmml.pmml file is 6'106 kB.

        pmml package

        Typical usage:

        library("pmml")
        
        audit.pmml = pmml(audit.rf)
        saveXML(audit.pmml, "/tmp/audit-pmml.pmml")
        

        This package defines a conversion function pmml.<model_type> for every supported model type. However, in most cases, it is recommended to invoke the S3 generic function pmml instead. This function determines the type of the argument model object, and automatically selects the most appropriate conversion function.

        When the S3 generic function pmml is invoked with an unsupported model object, the following error message is printed:

        Error in UseMethod("pmml") :
          no applicable method for 'pmml' applied to an object of class "RandomForest"
        

        The conversion produces an XMLNode object, which is a Document Object Model (DOM) representation of the PMML document. This object can be saved to a file using the function saveXML.

        This package has a hard time handling large model objects (e.g. bagging and boosting models) for two reasons. First, all the processing takes place in R memory space. In this example, the memory usage of user objects grows more than a hundred times, because the ~2 MB random forest object audit.rf gives rise to a ~280 MB DOM object audit.pmml. Moreover, all this memory is allocated incrementally in small fragments (i.e. every new DOM node becomes a separate object), not in one large contiguous block. On a more positive note, it is possible that the desktop-oriented GNU R implementation is outperformed in memory management aspects by alternative server-oriented R implementations.

        Second, DOM is a low-level API, which is unsuitable for working with specific XML dialects such as PMML. Any proper medium- to high-level API should deliver much more compact representation of objects, plus take care of technical trivialities such as XML serialization and deserialization.

        r2pmml package

        Typical usage:

        library("r2pmml")
        
        r2pmml(audit.rf, "/tmp/audit-r2pmml.pmml")
        

        The package defines a sole conversion function r2pmml, which is a thin wrapper around the Java converter application class org.jpmml.converter.Main. Behind the scenes, this function performs the following operations:

        1. Serializing the argument model object in ProtoBuf data format to a temporary file.
        2. Initializing the JPMML-Converter instance:
          • Setting the ProtoBuf input file to the temporary ProtoBuf file
          • Setting the PMML output file to the argument file
        3. Executing the JPMML-Converter instance.
        4. Cleaning up the temporary ProtoBuf file.

        The capabilities of the function r2pmml (e.g. the selection of supported model types) are completely defined by the capabilities of the JPMML-Converter library.

        This package addresses the technical limitations of the pmml package completely. First, all the processing (except for the serialization of the model object to a temporary file in the ProtoBuf data format) has been moved from the R memory space to a dedicated Java Virtual Machine (JVM) memory space. Second, model converter classes employ the JPMML-Model library, which delivers high efficiency without compromising on functionality. In this example, the ~2 MB random forest object audit.rf gives rise to a ~5.3 MB Java PMML class model object. That is 280 MB / 5.3 MB = ~50 times smaller than the DOM representation!
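
        For reference, a minimal sketch of loading the converted document into the JPMML-Model class model for inspection or further programmatic editing (the PMML and JAXBUtil classes live in the org.dmg.pmml and org.jpmml.model packages, respectively):

        try(InputStream is = new FileInputStream("/tmp/audit-r2pmml.pmml")){
          PMML pmml = JAXBUtil.unmarshalPMML(new StreamSource(is));
        
          // A random forest is encoded as a single ensemble model element
          List<Model> models = pmml.getModels();
          System.out.println(models.size()); // 1
        }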

        The detailed timing information about the conversion is very interesting (the readings correspond to the first and second r2pmml function calls):

        1. The R side of operations:
          • Serializing the model in ProtoBuf data format to the temporary file: 1'262 and 1'007 ms.
        2. The Java side of operations:
          • Deserializing the model from the temporary file: 166 and 14 ms.
          • Converting the model from R representation to PMML representation: 648 and 310 ms.
          • Serializing the model in PMML data format to the output file: 2'001 and 135 ms.

        The newly introduced r2pmml package fulfills all expectations by being 100 to 200 times faster than the pmml package (e.g. 310 vs 61'280 ms. for model conversion, 135 vs 33'926 ms. for model serialization). The gains are even higher when working with real-life random forest models that are order(s) of magnitude larger. Some gains are attributable to JVM warmup, because the conversion of ensemble models involves performing many repetitive tasks. The other gains are attributable to the smart caching of PMML content by the JPMML-Converter library, which lets the memory usage scale sublinearly (with respect to the size and complexity of the model).

        Also, the newly introduced r2pmml package is able to encode the same amount of information using fewer bytes than the pmml package. In this example, if the resulting files audit-r2pmml.pmml and audit-pmml.pmml are XML-tidied following the same procedure, then it becomes apparent that the former is approximately 10% smaller than the latter (6'106 vs 6'853 kB).

        Appendix

        The r2pmml package depends on the RProtoBuf package for ProtoBuf serialization and the rJava package for Java invocation functionality. Both packages can be downloaded and installed from the CRAN repository using the R built-in function install.packages.

        Here, the installation and configuration is played out on a blank GNU/Linux system (Fedora). All system-level dependencies are handled using the Yum software package manager.

        RProtoBuf package

        This package depends on the curl and protobuf system libraries. It is worth mentioning that if the package is built from its source form (the default behavior on *NIX systems), then all the required system libraries must be present both in their standard flavor (no suffix) and their development flavor (identified by the "-dev" or "-devel" suffix).

        $ yum install curl curl-devel
        $ yum install protobuf protobuf-devel
        

        After that, the RProtoBuf package can be installed as usual:

        install.packages("RProtoBuf")
        

        If the system is missing the curl development library curl-devel, then the installation fails with the following error message:

        checking for curl-config... no
        Cannot find curl-config
        ERROR: configuration failed for package ‘RCurl’
        ERROR: dependency ‘RCurl’ is not available for package ‘RProtoBuf’
        

        If the system is missing the protobuf development library protobuf-devel, then the installation fails with the following error message:

        configure: error: ERROR: ProtoBuf headers required; use '-Iincludedir' in CXXFLAGS for unusual locations.
        ERROR: configuration failed for package ‘RProtoBuf’
        

        The format of ProtoBuf messages is defined by the proto file inst/proto/rexp.proto. Currently, the JPMML-Converter library uses the proto file that came with the RProtoBuf package version 0.4.2. As a word of caution, it is useless to force the r2pmml package to depend on any RProtoBuf package version older than that, because this proto file underwent incompatible changes between versions 0.4.1 and 0.4.2. The Java converter application throws an instance of com.google.protobuf.InvalidProtocolBufferException when the contents of the ProtoBuf input file do not match the expected ProtoBuf message format.

        The version of a package can be verified using the function packageVersion:

        packageVersion("RProtoBuf")
        

        rJava package

        This package depends on Java version 1.7.0 or newer.

        $ yum install java-1.7.0-openjdk
        

        The Java executable java must be available on the system and/or user PATH. Everything should be good to go if the Java version can be verified by launching the Java executable with the -version option:

        $ java -version
        

        After that, the rJava package can be installed as usual:

        install.packages("rJava")