JPMML-Evaluator Decision Tree/Random Forest Performance Test on Amazon EC2 t2.micro

One day I had a discussion with our CTO Villu Ruusmann about what scoring numbers we can actually use in our sales pitch or promote on the web page. Villu was talking about single-digit microseconds, which I can't say I didn't believe, but I was quite skeptical about it. I wanted to see it for myself, so Villu gave me the following command line to experiment with:

C:\github\jpmml-evaluator>java -jar pmml-evaluator-example\target\example-1.4-SNAPSHOT.jar --model c:\github\openscoring\openscoring-service\src\etc\DecisionTreeIris.pmml --input c:\github\openscoring\openscoring-service\src\etc\input.csv --output test.txt --intern --optimize --loop 1000000

It gave me some interesting numbers which seemed to confirm the single-digit microseconds, but I won't paste them here, because this was run on my personal laptop, the CPU usage never exceeded 40%, and the CSV file to be scored contained only 3 records. Not good for comparison.

I wanted something repeatable, something everyone can actually try, check out the setup themselves, count the lines in the input CSV files, verify the number of trees in a random forest. So once again, I turned to Amazon EC2.


The Setup

I used the following setup for the tests:

  • All tests were done on Amazon EC2 t2.micro (available on free tier). I didn't accumulate any costs during the tests.
  • JPMML-Evaluator 1.4.0 was used in the test
  • CSV files were used as input to the tests (it's by far the easiest way to generate a large number of records)
  • Each test included scoring a million data points:
    • 1-record CSV scored 1000000 times
    • 10-record CSV scored 100000 times
    • 100-record CSV scored 10000 times
    • 1000-record CSV scored 1000 times
    • 10000-record CSV scored 100 times
    • 100000-record CSV scored 10 times
    • (a 1 000 000-record CSV scored once ran into memory issues on the t2.micro, so it was left out)
  • The same approach was used to score two models: the Iris decision tree (DecisionTreeIris.pmml) and the 20-tree Titanic random forest (Titanic.pmml)
  • The following command line parameters were passed to the evaluator:
    • --model specifies the model to be used (the model is loaded only once, not 1 million times, when scoring a single record)
    • --input points to the CSV file containing the records to be scored
    • --output specifies the output file
      • The output file is overwritten on every loop - scoring 1 record a million times still produces 1 line in the output file, but it also means the file is opened, written and closed a million times
      • Scoring one hundred thousand records ten times produces an output file with 100 000 records
      • Using /dev/null as the output didn't result in any performance gains
    • --loop specifies the number of times the input file is run through the scoring process
    • --intern replaces recurring elements in the PMML with a single element; this improves memory usage, but not performance
    • --optimize tries to convert java.lang.String values to java.lang.Integer or java.lang.Double once in the beginning, not on every model execution
  • I ran every command 5 times, chose the best 3 from those 5 and averaged the results (this was done because, while the highs were relatively consistent, there were 2-3 lows during the overall testing process which I couldn't relate to our application in any way).


The Results: Iris Decision Tree


With a simple decision tree, a single data point is scored in 5 microseconds (that's five millionths of a second, or 0.005 milliseconds). The peak performance comes from the combination of a 100-line CSV scored 10 000 times, where nearly 250 000 records are scored in 1 second, bringing a single score down to 4 microseconds. My theory is that the peak is produced by an optimum between handling file writes (remember, when scoring a single record, the file is opened, written and closed 1 million times) and managing the CSV in memory (addressing an array with 1000 items starts to take its toll).


The Results: Titanic Random Forest


With the 20-tree random forest, a single data point is scored in 19 microseconds, which is considerably slower than with a single decision tree, but it's nowhere near the 20 times slower one would estimate - it's merely 3.6 times slower. The optimum at 100 * 10 000 is not as pronounced, as proportionally more time is spent on calculating the scores than on other activities (like writing files).




I've included a graph showing Iris (tree model) and Titanic (random forest) model performance side by side - again, although the random forest contains 20 tree models whose output is averaged, it's not nearly 20 times slower than a single tree (which we consider really good!).



We can conclude that the base performance of the JPMML library really is very good, with single data point scorings measurable in microseconds. And this is all in a single thread, on the least powerful virtual machine AWS provides! In a less rigorous test on my Lenovo T470s laptop I could get around 700 000+ scores per second with the Iris decision tree at about 80% CPU utilization. Imagine what we could do in a multithreaded, optimized production environment running on a powerful server :-)


Next Steps and How You Can Help

Iris and Titanic are both tree-based models; we're interested in doing the same with other model types as well, possibly trained in different environments and converted to PMML with different options. The same goes for our REST-based scoring engine.

You can help us by providing some actual models solving real-world problems, along with a sample dataset for scoring (say, 10 records), and we will run them under the same conditions for comparison. Of course you can obfuscate the labels of inputs and outputs. Small enough models can be sent by email and bigger ones shared using WeTransfer - my email is . I'd appreciate a few words about what these models do, which I won't disclose without permission. Alternatively, you can do your own tests on the AMIs provided below.


Amazon Machine Images

The AMIs are as follows (let me know if you can't access any of these regions and I'll make the image available in your region too):

  • EU (Frankfurt) - ami-0a22a739ce1f9a9a7
  • US West (N. California) - ami-02a035e53d718a40f
  • US East (N. Virginia) - ami-0d90521380e13b956

When launching the AMI, you'll need to create a keypair to access it (there are very good tutorials provided by AWS for this).

JPMML-Evaluator is located in /home/ec2-user/jpmml-evaluator, with the Iris and Titanic PMML and CSV files in IrisTestData/ and TitanicTestData/ respectively.

The sample command line goes like this (assuming working directory ~/jpmml-evaluator):

java -jar pmml-evaluator-example/target/example-1.4-SNAPSHOT.jar --model TitanicTestData/Titanic.pmml --input TitanicTestData/Titanic-10.csv --output test.txt --intern --optimize --loop 100000

Here's the sample output:

[ec2-user@ip-172-31-41-51 jpmml-evaluator]$ pwd
[ec2-user@ip-172-31-41-51 jpmml-evaluator]$ java -jar pmml-evaluator-example/target/example-1.4-SNAPSHOT.jar --model TitanicTestData/Titanic.pmml --input TitanicTestData/Titanic-10.csv --output test.txt --intern --optimize --loop 100000 
3/18/18 9:59:54 AM
-- Timers --------
             count = 100000
         mean rate = 5353.47 calls/second
     1-minute rate = 4464.99 calls/second
     5-minute rate = 4279.49 calls/second
    15-minute rate = 4246.15 calls/second
               min = 0.14 milliseconds
               max = 188.50 milliseconds
              mean = 0.18 milliseconds
            stddev = 0.65 milliseconds
            median = 0.16 milliseconds
              75% <= 0.17 milliseconds
              95% <= 0.18 milliseconds
              98% <= 0.66 milliseconds
              99% <= 0.75 milliseconds
            99.9% <= 4.47 milliseconds

Please be aware that the calls-per-second mean rate is per one pass over the CSV file that you feed to JPMML-Evaluator: if you have a 1-record CSV, the mean rate is per single data point; if you have a million-line CSV, the mean rate is per million records, and to get the single data point rate, you have to multiply it by a million.
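For example, the sample output above was produced with a 10-record CSV file, so the mean rate of 5353.47 calls/second translates to roughly 53 500 records per second, or about 19 microseconds per single data point - in line with the Titanic random forest result above.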

Just a quick note from Villu regarding the difference between max and mean time - the maximum time happens on the very first evaluation when the optimizations are done.

Training Random Forest Model in Apache Spark and Deploying It to Openscoring REST Web Service


In this tutorial we're building the following setup:

  1. We take the Titanic dataset 
  2. We are going to use a PySpark script based on this example (using SQL DataFrame, MLlib and Pipelines) and modify it a little bit to suit our needs
  3. We train a random forest model using Apache Spark Pipelines
  4. We convert the model to standard PMML using JPMML-SparkML package
  5. We deploy the random forest model to Openscoring web service using Openscoring Python client library
  6. We get real-time scoring of test data directly from the Spark environment


Setting up the environment

We have some good news here - we've done all the heavy lifting for you on the Amazon Machine Images that you can use as a template to launch your own EC2 instance and try the same steps there. Please be aware that this AMI is more like a quick and dirty proof-of-concept setup than a production one. Of course you can do the whole setup from scratch yourself; it's no harder than tracing the dependencies and installing them one by one (you're going to need JDK 8, Spark 2.2, Maven, Git, Python, NumPy, pandas, Openscoring, the JPMML-SparkML package - just to name a few).

The AMIs are as follows (let me know if you can't access any of these regions and I'll make the image available in your region too):

  • EU (Frankfurt) - ami-0856fb979732e0016
  • EU (Ireland) - ami-0aa76e5c2adbaf682
  • US West (N. California) - ami-0c6ac011f918947e1
  • US East (N. Virginia) - ami-07f1b7187e58af6b4
  • Asia Pacific (Singapore) - ami-0ddae89443be18b52

When launching the AMI, you'll need to create a keypair to access it (there are very good tutorials provided by AWS for this).


Launching Spark

Once you've logged in to the EC2 instance you just created based on our template, you can use the following commands to launch Apache Spark:

$ cd spark-2.2.1-bin-hadoop2.7/bin
$ ./pyspark --jars /home/ec2-user/jpmml-sparkml-package/target/jpmml-sparkml-package-1.3-SNAPSHOT.jar

This will open the PySpark prompt and will also include the necessary JPMML-SparkML package jar file, which will take care of the pipeline conversion to PMML.


Training the Random Forest Model with Titanic data

Here comes the Python code for training the model, deploying it to the Openscoring web service running on the same EC2 instance, and getting the test scores:
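The full script isn't reproduced here, so below is a minimal sketch of the flow, assuming the pyspark shell started above (where spark and sc are predefined), a Titanic training CSV, and the jpmml_sparkml helper module that ships with the JPMML-SparkML package. The file paths, column names and the exact conversion call are assumptions you may need to adapt to your setup:

# Sketch only: train a 20-tree random forest on the Titanic data, convert it to PMML
# with the JPMML-SparkML package, deploy it to Openscoring and score a test record.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import RFormula

import openscoring

# Load the training data (path and column names are assumptions)
df = spark.read.csv("TitanicTestData/train.csv", header=True, inferSchema=True)

# RFormula selects and encodes the feature columns, the classifier builds the forest
formula = RFormula(formula="Survived ~ Pclass + Sex + Age + SibSp + Fare + Embarked")
rf = RandomForestClassifier(numTrees=20)
pipelineModel = Pipeline(stages=[formula, rf]).fit(df)

# Convert the fitted pipeline to PMML; toPMMLBytes is assumed to be provided by the
# JPMML-SparkML package jar loaded via --jars above
from jpmml_sparkml import toPMMLBytes
with open("TitanicTestData/Titanic.pmml", "wb") as pmml_file:
    pmml_file.write(toPMMLBytes(sc, df, pipelineModel))

# Deploy to the Openscoring service running on the same machine and score a record
os = openscoring.Openscoring("http://localhost:8080/openscoring")
os.deploy("Titanic", "TitanicTestData/Titanic.pmml")

arguments = {"Fare": 7.25, "Sex": "female", "Age": 20, "SibSp": 1, "Pclass": 1, "Embarked": "Q"}
print(os.evaluate("Titanic", arguments))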

Based on the predictions, the young female passenger had a pretty good chance to survive the Titanic catastrophe, while the middle-aged man sadly didn't:

>>> arguments = {"Fare":7.25,"Sex":"female","Age":20,"SibSp":1,"Pclass":1,"Embarked":"Q"}
>>> print(os.evaluate("Titanic", arguments))
{u'probability(0)': 0.1742083085440686, u'probability(1)': 0.8257916914559313, u'pmml(prediction)': u'1', u'Survived': u'1', u'prediction': 1.0}

>>> arguments = {"Fare":7.25,"Sex":"male","Age":45,"SibSp":1,"Pclass":3,"Embarked":"S"}
>>> print(os.evaluate("Titanic", arguments))
{u'probability(0)': 0.8934427101921765, u'probability(1)': 0.10655728980782375, u'pmml(prediction)': u'0', u'Survived': u'0', u'prediction': 0.0}


Accessing the Model using Openscoring REST API

You can access Openscoring web service REST API (scroll down for the API spec) not only with the Python client library from Spark, but with any tool that can send custom HTTP requests. We prefer to use Postman for this. The previous post in our blog discussed using Openscoring REST service in more detail.

In order to access Openscoring running on your EC2 instance from outside the machine itself, you need to find the security group associated with your EC2 instance and add one rule:

You need to choose Custom TCP, port 8080, Anywhere - this will allow you to access Openscoring web service running on port 8080 from anywhere in the world (you can always choose 'My IP' to just open it to yourself). Now go back to the instances list, grab the public DNS of your instance and create the following request in Postman:

  • Use POST
  • Set URL to: http://<public DNS>:8080/openscoring/model/Titanic/csv
  • Set Content-Type header to text/plain

You can use the following data as the request body (or compile your own based on the data dictionary):
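The original sample data isn't included here, but an illustrative body, assuming the same input fields as in the Python example above (your own model's data dictionary may differ), could look like this - the first line is the CSV header:

Pclass,Sex,Age,SibSp,Fare,Embarked
1,female,20,1,7.25,Q
3,male,45,1,7.25,S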


Submitting the request in Postman should result in something like this:

As per the model predictions, the chances of surviving the Titanic catastrophe were pretty bleak :-(

If you'd rather not install Postman, you can access these two URLs even from your browser:

http://<public DNS>:8080/openscoring/model/Titanic
http://<public DNS>:8080/openscoring/model/Titanic/pmml

The first one will give you a brief overview of the model inputs and outputs in JSON format, while the second one will give you the PMML representation of the whole random forest.

Feel free to let me know of any issues related to this guide in the comments section!

Testing PMML Execution with Openscoring REST Service on Amazon EC2

Last time I wrote about how to install the Openscoring web service on your own computer. This has several major dependencies, like installing Maven and GitHub Desktop, and possibly messing with Java versions and environment variables. While not too complex, it still requires some error-prone steps, depends on operating system versions and so on.

Today we want to lower the entry barrier even further - we've prepared an Amazon AMI (Amazon Machine Image) for you with a preinstalled Openscoring service. This post will walk you through the steps required to get things up and running.


This time, you're going to need:

  • An AWS account (the free tier is enough)
  • Postman for sending the test requests

Amazon EC2 Setup

The part with steps related to AWS is a bit long, but that's largely due to the number of images included. Bear with me, it's all web browsing and no command line at all :-)

Assuming you have an AWS account or have signed up during the previous step, this is what you need to do:

  • On the very first screen that pops up, you need to choose EC2 (or "Launch a virtual machine with EC2").
  • This will lead you to EC2 management screen. Take a look at the upper right corner of the screen where the AWS region is displayed:

I've created the images for following regions:

  1. EU (Frankfurt) - ami-0856fb979732e0016
  2. EU (Ireland) - ami-0aa76e5c2adbaf682
  3. US West (N. California) - ami-0c6ac011f918947e1
  4. US East (N. Virginia) - ami-07f1b7187e58af6b4
  5. Asia Pacific (Singapore) - ami-0ddae89443be18b52

If you cannot use any of these regions (I really don't know if Texas is US West or US East; at least from the EU, I couldn't access the EC2 instance I created in the US East region), please let me know and I'll make the image available in your region too.

  • From the left menu, choose Images->AMIs, choose Public Images and search, according to your region, for either the respective AMI ID directly or for "openscoring":
  • From the right-click menu choose Launch.
  • Next up is the launch process. Choose the free tier eligible instance type (if you haven't been using AWS for more than 12 months) and then Next: Configure Instance Details
  • Click through the steps always choosing Next:... (the default configuration here is just fine), until you reach the security part:
  • What you need to do here is Add Rule and choose Custom TCP, port 8080, Anywhere (this will allow you to access Openscoring web service running on port 8080 from anywhere in the world).
  • It will briefly complain about leaving the instance open to the world, but we don't care as this is for testing anyway, so go ahead and Review and Launch the instance.
  • You can launch the instance without configuring any keys (of course if you want you can do it, but for our purposes no command line access will be needed; you'll also be able to launch more instances based on the same AMI later with keys if you want to dig around there).
  • Now navigate to the Instances screen and wait until the status is running:
  • From here, copy the Public DNS.

Executing the Model

The Openscoring service is already running on your EC2 instance on port 8080. You copied the public DNS name in the previous step; now you can open Postman, paste the name into the URL field, add port 8080, and you can do everything our REST API allows (scroll down for the API spec).

For more details on how to use Postman, check out the last chapters of our post Deploying PMML Model to Openscoring REST Evaluation Service.

The EC2 instance comes with the DecisionTreeIris PMML predeployed, so you can easily assemble the following request for a first test:

http://<public DNS>:8080/openscoring/model/DecisionTreeIris

Set the request body to the following value:

    "id" : "record-001",
    "arguments" : {
        "Sepal_Length" : 5.1,
        "Sepal_Width" : 3.5,
        "Petal_Length" : 1.4,
        "Petal_Width" : 0.2

Set the Content-Type header to application/json and you're good to send the request.

And there you go again - based on the DecisionTreeIris model, your iris is of 'setosa' type with the probability score of 1.

If any of the steps don't work out for you, feel free to let me know either in the comments or by email: and I'll try to help.

Deploying PMML Model to Openscoring REST Evaluation Service

This article will guide you through the steps of deploying a PMML predictive model on our REST API based scoring service. It follows the guidelines provided in the project README file and tries to simplify some aspects for non-Java non-programmers :-) This is seriously entry level, so feel free to skip the parts you've gone through or done before (like GitHub or Java environment setup).


Why Care?

During many conversations with our licensees, we've found out that it's unnecessarily complex to deploy and execute a PMML model. Providing a REST API is a step towards simplifying the usage of PMML models. We also provide a range of converters which let you first convert your Scikit-learn, R or Apache Spark proprietary model formats to standard PMML.

We strongly believe that using standardized PMML model format is the future of machine learning models. Having a text-based representation gives you the following advantages:

  • The models are human readable (and, although not recommended, can be modified if necessary)
  • The models can be properly versioned - you can use version control tools (even GitHub) to manage your models, see the differences between versions, etc.
  • You're not married to some proprietary software that costs six-figure numbers :-)
  • And so on - probably we'll do a whole article about them in the future.


Necessary software

This guide is written for Windows 10. GitHub Desktop seems to be the only piece not available on Linux, but there are certainly differences when it comes to setting environment variables and so on.

You will need the following software present to conveniently go through the below steps:

  1. GitHub Desktop
  2. Java JDK version 8.x (we're moving to JDK 9 soon)
  3. Maven
  4. Postman


Pulling the Sources

Run GitHub Desktop - you should get to the following screen:

From there you need to choose Clone a repository, locate the URL tab and enter '':


Now you've got the source code on your machine and next step is building the application.


Building and Running the Application

For building you need to have both JDK and Maven installed.

Make sure you have both maven\bin and java\bin on your path. You can execute the following commands from your command prompt, just make sure you change the actual path according to your installation:

setx path "%path%;C:\apache-maven-3.5.2\bin"
setx path "%path%;C:\Program Files\Java\jdk1.8.0_161\bin"

Read this guide on how to set JAVA_HOME (it's a system variable and cannot be set from the command line unless you're an admin).

Now open a command prompt, navigate to where your Openscoring sources are and start the build:

cd c:\github\openscoring
c:\github\openscoring> mvn clean install

This will take a few minutes as all the dependencies are downloaded, code is compiled and tests are run. It should end with the following messages:

Now navigate to the openscoring-server directory and run the following command:

C:\github\openscoring>cd openscoring-server
C:\github\openscoring\openscoring-server>java -jar target/server-executable-1.4-SNAPSHOT.jar

This should result in the Openscoring server running:

Now the Openscoring service is listening at http://localhost:8080/openscoring/. You can navigate there using your browser, but as there are no models deployed yet, it will return an error message for now.


Deploying the Model Using Postman

Postman is a very simple and intuitive tool for sending HTTP requests. It requires registration, but this is worth the two minutes.

Now we're finally ready to deploy the model. The example PMML file and requests are located in the openscoring\openscoring-service\src\etc directory. You need to pay attention to a few things:

  • Method must be PUT
  • URL must be http://localhost:8080/openscoring/model/<model name> (the model name is something you choose)
  • Content-Type header must be set to application/xml
  • In Body tab, you should choose binary and also locate the model PMML file (again, the examples are in openscoring\openscoring-service\src\etc directory, choose DecisionTreeIris.pmml from there)

And then you can click SEND. This is what should appear in a moment:

In the response body, you see the deployed model in JSON format. This means that the model is deployed and accessible at the endpoint you specified, in my example http://localhost:8080/openscoring/model/DecisionTreeIris.
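If you'd rather script the deployment than use Postman, the same PUT request can be made with, for example, the Python requests library. This is only a sketch; the model name and the PMML file path are the ones used above, so adjust them to your setup:

import requests

# PUT the raw PMML document to the model endpoint
with open(r"c:\github\openscoring\openscoring-service\src\etc\DecisionTreeIris.pmml", "rb") as pmml_file:
    response = requests.put(
        "http://localhost:8080/openscoring/model/DecisionTreeIris",
        data=pmml_file,
        headers={"Content-Type": "application/xml"})

print(response.status_code)   # success is indicated by a 2xx status code
print(response.json())        # the deployed model description in JSON format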


Executing the Deployed PMML Model

When making the scoring request, you need to keep an eye on the following items:

  • HTTP method must be POST
  • URL will remain the same
  • On the Body tab, choose raw and set the content type to application/json (this will also change the Content-Type header on the Headers tab)
  • The following request can be pasted into the Body window:
    {
        "id" : "record-001",
        "arguments" : {
            "Sepal_Length" : 5.1,
            "Sepal_Width" : 3.5,
            "Petal_Length" : 1.4,
            "Petal_Width" : 0.2
        }
    }

And there you go - based on the DecisionTreeIris model, your iris is of 'setosa' type with the probability score of 1.
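The equivalent scoring request, again sketched with the Python requests library rather than Postman:

import requests

body = {
    "id": "record-001",
    "arguments": {
        "Sepal_Length": 5.1,
        "Sepal_Width": 3.5,
        "Petal_Length": 1.4,
        "Petal_Width": 0.2
    }
}

# POST the JSON body to the same model endpoint; requests sets the
# Content-Type header to application/json automatically
response = requests.post("http://localhost:8080/openscoring/model/DecisionTreeIris", json=body)
print(response.json())   # contains the predicted species and the class probabilities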

You can check out the other methods of the Openscoring REST API in the documentation.

Using Apache Spark ML pipeline models for real-time prediction: the Openscoring REST web service approach

Originally written by Villu Ruusmann

EDIT: There's also an updated version of the post, where we even provide Amazon Machine Image for testing.

Apache Spark follows the batch data processing paradigm, which has its strengths and weaknesses. On one hand, batch processing is suitable for working with true Big Data datasets. Apache Spark splits the task into manageable-size batches and distributes the workload across a cluster of machines. Apache Spark competitors such as R or Python cannot match that, because they typically require the task to fit into the RAM of a single machine.

On the other hand, the batch processing is characterized by high "inertia". Apache Spark falls short in application areas where it is necessary to work with small datasets (eg. single data records) in real time. Essentially, there is a lower bound (instead of an upper bound) to the effective size of a task.

This blog post is about demonstrating a workflow where Spark ML pipeline models are exported in Predictive Model Markup Language (PMML) data format, and then imported into Openscoring REST web service for easy interfacing with third-party applications.

Step 1: Exporting Spark ML pipeline models to PMML

The support for PMML was introduced in Apache Spark MLlib version 1.4.0 in the form of an org.apache.spark.mllib.pmml.PMMLExportable trait. The invocation of the PMMLExportable#toPMML() method (or one of its overloaded variants) produces a PMML document which contains the symbolic description of the fitted model object.

Unfortunately, this solution is not very relevant to Apache Spark ML. First, Spark ML is organized around the pipeline concept. A Spark ML pipeline can be regarded as a directed graph of data transformations and models. When exporting a model, it is necessary to include all the preceding stages in the dump. Second, Spark ML comes with rich metadata. The DataFrame representation of a dataset is associated with a static schema, which can be queried for column names, data types and more. Finally, Spark ML has replaced and/or abstracted away a great deal of Spark MLlib APIs. Newer versions of Spark ML have almost completely ceased to rely on Spark MLlib classes that implement the PMMLExportable trait.

The JPMML-SparkML library is an independent effort to provide a fully-featured PMML exporter for Spark ML pipelines.

The main interaction point is the org.jpmml.sparkml.ConverterUtil#toPMML(StructType, PipelineModel) utility method. The conversion engine initializes a PMML document based on the StructType argument, and fills it with relevant content by iterating over all the stages of the PipelineModel argument.

The conversion engine requires a valid class mapping to an org.jpmml.sparkml.TransformerConverter implementation for every stage class. The class mappings registry is automatically populated for most common Spark ML transformer and model types. Application developers can implement and register their own TransformerConverter classes when looking to move beyond that.

Typical usage:

DataFrame dataFrame = ...;
StructType schema = dataFrame.schema();

Pipeline pipeline = ...;
PipelineModel pipelineModel = pipeline.fit(dataFrame);

PMML pmml = ConverterUtil.toPMML(schema, pipelineModel);

JAXBUtil.marshalPMML(pmml, new StreamResult(System.out));

The JPMML-SparkML library depends on a newer version of the JPMML-Model library than Spark MLlib, which introduces severe compile-time and run-time classpath conflicts. The solution is to employ Maven Shade Plugin and relocate the affected org.dmg.pmml and org.jpmml.(agent|model|schema) packages.

The JPMML-SparkML-Bootstrap project aims to provide a complete example of developing and packaging a JPMML-SparkML powered application.

The org.jpmml.sparkml.bootstrap.Main application class demonstrates a two-stage Spark ML pipeline. The first stage is a RFormula feature selector that selects columns from a CSV input file. The second stage is either a DecisionTreeRegressor or DecisionTreeClassifier estimator that finds the best approximation between the target column and active columns. The result is written to a PMML output file.

The exercise starts with training a classification-type decision tree model for the "wine quality" dataset:

spark-submit \
  --class org.jpmml.sparkml.bootstrap.Main \
  /path/to/jpmml-sparkml-bootstrap/target/bootstrap-1.0-SNAPSHOT.jar \
  --formula "color ~ . -quality" \
  --csv-input /path/to/jpmml-sparkml-bootstrap/src/test/resources/wine.csv \
  --function CLASSIFICATION \
  --pmml-output wine-color.pmml

The resulting wine-color.pmml file can be opened for inspection in a text editor.

Step 2: The essentials of PMML representation

A PMML document specifies a workflow for transforming an input data record to an output data record. The end user interacts with the entry and exit interfaces of the workflow, and can completely disregard its internals.

The design and implementation of these two interfaces is PMML engine specific. The JPMML-Evaluator library is geared towards maximum automation. The entry interface exposes a complete description of the active fields. Similarly, the exit interface exposes a complete description of the primary target field and secondary output fields. A capable end user agent can use this information to format input data records and parse output data records without any external help.


The decision tree model is represented as the PMML/TreeModel element. Its schema is defined by the combination of MiningSchema and Output child elements.

A MiningField element serves as a collection of "import" and "export" statements. It refers to some field, and stipulates its role and requirements in the context of the current model element. The fields themselves are declared as PMML/DataDictionary/DataField and PMML/TransformationDictionary/DerivedField elements.

The wine color model defines eight input fields ("fixed_acidity", "volatile_acidity", .., "sulphates"). The values of input fields are prepared by performing type conversion from user-supplied representation to PMML representation, which is followed by categorization into valid, invalid or missing subspaces, and application of subspace-specific treatments.

The default definition of the "fixed_acidity" input field:

    <DataField name="fixed_acidity" optype="continuous" dataType="double"/>
      <MiningField name="fixed_acidity"/>

The same, after manual enhancement:

    <DataField name="fixed_acidity" optype="continuous" dataType="double">
      <Value value="?" property="missing"/>
      <Interval closure="closedClosed" leftMargin="3.8" rightMargin="15.9"/>
    </DataField>
      <MiningField name="fixed_acidity" invalidValueTreatment="returnInvalid" missingValueReplacement="7.215307" missingValueTreatment="asMean"/>

The enhanced definition reads:

  1. If the user didn't supply a value for the "fixed_acidity" input field, or its string representation is equal to string constant "?", then replace it with string constant "7.215307".
  2. Convert the value to double data type and continuous operational type.
  3. If the value is in range [3.8, 15.9], then pass it on to the model element. Otherwise, throw an "invalid value" exception.


The primary target field may be accompanied by a set of secondary output fields, which expose additional details about the prediction. For example, classification models typically return the label of the winning class as the primary result, and the breakdown of the class probability distribution as the secondary result.

Secondary output fields are declared as Output/OutputField elements.

Spark ML models indicate the availability of additional details by implementing marker interfaces. The conversion engine keeps an eye out for the interface. It is considered a proof that the classification model is capable of estimating class probability distribution, which is a prerequisite for encoding an Output element that contains probability-type OutputField child elements.

The wine color model defines a primary target field ("color"), and two secondary output fields ("probability_white" and "probability_red"):

    <DataField name="color" optype="categorical" dataType="string">
      <Value value="white"/>
      <Value value="red"/>
      <MiningField name="color" usageType="target"/>
      <OutputField name="probability_white" feature="probability" value="white"/>
      <OutputField name="probability_red" feature="probability" value="red"/>

In case of decision tree models, it is often desirable to obtain information about the decision path. The identifier of the winning decision tree leaf can be queried by declaring an extra entityId-type OutputField element:

      <OutputField name="winnerId" feature="entityId"/>

Spark ML does not assign explicit identifiers to decision tree nodes. Therefore, a PMML engine would be returning implicit identifiers in the form of a 1-based index, which are perfectly adequate for distinguishing between winning decision tree leaves.

The JPMML-Evaluator and JPMML-Model libraries provide rich APIs that can resolve node identifiers to org.dmg.pmml.Node class model objects, and backtrack these to the root of the decision tree.


From the PMML perspective, Spark ML data transformations can be classified as "real" or "pseudo". A "real" transformation performs a computation on a feature or a feature vector. It is encoded as one or more PMML/TransformationDictionary/DerivedField elements.

Examples of "real" transformer classes:

  • Binarizer
  • Bucketizer
  • MinMaxScaler
  • PCA
  • QuantileDiscretizer
  • StandardScaler

A Binarizer transformer for "discretizing" wine samples based on their sweetness:

Binarizer sweetnessBinarizer = new Binarizer()
  .setInputCol("residual_sugar").setOutputCol("sweet_indicator").setThreshold(4.5); // the threshold value here is illustrative

The above, after conversion to PMML:

    <DerivedField name="sweet_indicator" dataType="double" optype="continuous">
      <Apply function="if">
        <Apply function="lessOrEqual">
          <FieldRef field="residual_sugar"/>

A "pseudo" transformation performs Spark ML-specific housekeeping work such as assembling, disassembling or subsetting feature vectors.

Examples of "pseudo" transformer classes:

  • ChiSqSelector
  • IndexToString
  • OneHotEncoder
  • RFormula
  • StringIndexer
  • VectorAssembler
  • VectorSlicer

The conversion engine is capable of performing smart analyses and optimizations in order to produce a maximally compact and expressive PMML document. A case in point is the identification and pruning of unused field declarations, which improves the robustness and performance of production workflows.

For example, the wine.csv CSV data file contains 11 feature columns, but the wine color model reveals that three of them ("residual_sugar", "free_sulfur_dioxide" and "alcohol") do not contribute to the discrimination between white and red wines in any way. The conversion engine takes notice of that and omits all the related data transformations from the workflow, thereby eliminating three-elevenths of the complexity.

Step 3: Importing PMML to Openscoring REST web service

Openscoring provides a way to expose a predictive model as a REST web service. The primary design consideration is to make predictive models easily discoverable and usable (a variation of the HATEOAS theme) for human and machine agents alike. The PMML representation is a perfect fit thanks to the availability of rich descriptive metadata. Other representations can be plugged into the framework with the help of wrappers that satisfy the requested metadata query needs.

Openscoring is a minimalistic Java web application that conforms to the Servlet and JAX-RS specifications.

It can be built from the source checkout using Apache Maven:

git clone
cd openscoring
mvn clean package

Openscoring exists in two variants. First, the standalone command-line application variant openscoring-server/target/server-executable-${version}.jar is based on the Jetty web server. Easy configuration and almost instant startup and shutdown times make it suitable for local development and testing use cases. The web application (WAR) variant openscoring-webapp/target/openscoring-webapp-${version}.war is more suitable for production use cases. It can be deployed on any standards-compliant Java web or application container, and secured and scaled according to the organization's preferences.

Alternatively, release versions of the Openscoring WAR file can be downloaded from the org/openscoring/openscoring-webapp section of the Maven Central repository.

A demo instance of Openscoring can be launched by dropping its WAR file into the auto-deployment directory of a running Apache Tomcat web container:

  1. Download the latest openscoring-webapp-${version}.war file from the Maven Central repository to a temporary directory. At the time of writing this, it is openscoring-webapp-1.2.15.war.
  2. Rename the downloaded file to openscoring.war. Apache Tomcat generates the context path for a web application from the filename part of the WAR file. So, the context path for openscoring.war will be "/openscoring/" (whereas for the original openscoring-webapp-${version}.war it would have been "/openscoring-webapp-${version}/").
  3. Move the openscoring.war file from the temporary directory to the $CATALINA_HOME/webapps auto-deployment directory. Allow the directory watchdog thread a couple of seconds to unpack and deploy the web application.
  4. Verify the deployment by accessing http://localhost:8080/openscoring/model. Upon success, the response body should be an empty JSON object { }.

Openscoring maps every PMML document to a /model/${id} endpoint, which provides model-oriented information and services according to the REST API specification.

Model deployment, download and undeployment are privileged actions that are only accessible to users with the "admin" role. All the unprivileged actions are accessible to all users. This basic access and authorization control can be overridden at the Java web container level. For example, configuring Servlet filters that restrict the visibility of endpoints by some prefix/suffix, restrict the number of data records that can be evaluated in a time period, etc.


Adding the wine color model:

curl -X PUT --data-binary @/path/to/wine-color.pmml -H "Content-type: text/xml" http://localhost:8080/openscoring/model/wine-color

The response body is an org.openscoring.common.ModelResponse object:

  "id" : "wine-color",
  "miningFunction" : "classification",
  "summary" : "Tree model",
  "properties" : {
    "created.timestamp" : "2016-06-19T21:35:58.592+0000",
    "accessed.timestamp" : null,
    "file.size" : 13537,
    "file.md5sum" : "1a4eb6324dc14c00188aeac2dfd6bb03"
  "schema" : {
    "activeFields" : [ {
      "id" : "fixed_acidity",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "volatile_acidity",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "citric_acid",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "chlorides",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "total_sulfur_dioxide",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "density",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "pH",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "sulphates",
      "dataType" : "double",
      "opType" : "continuous"
    } ],
    "targetFields" : [ {
      "id" : "color",
      "dataType" : "string",
      "opType" : "categorical",
      "values" : [ "white", "red" ]
    } ],
    "outputFields" : [ {
      "id" : "probability_white",
      "dataType" : "double",
      "opType" : "continuous"
    }, {
      "id" : "probability_red",
      "dataType" : "double",
      "opType" : "continuous"
    } ]
  }
}

The pattern is to move all model-related logic to the server side, so that Openscoring client applications could be developed and used on a wide variety of platforms by people with varying degrees of experience.

All agents should be able to "parse" the above object at the basic model identification and schema level. For example, understanding that the REST endpoint /model/wine-color holds a classification-type decision tree model, which consumes an eight-element input data record, and produces a three-element output data record.

More sophisticated agents could rise to elevated model verification and field schema levels. For example, checking that the reported file size and MD5 checksum are correct, and establishing field mappings between the model and the data store.
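As a sketch of such a check, using the Python requests library against the ModelResponse shown above (assuming the local copy of wine-color.pmml is byte-identical to the one that was deployed):

import hashlib
import requests

# Fetch the model description from its REST endpoint
model = requests.get("http://localhost:8080/openscoring/model/wine-color").json()

# Verify the reported file size and MD5 checksum against the local PMML file
with open("/path/to/wine-color.pmml", "rb") as pmml_file:
    pmml_bytes = pmml_file.read()

assert model["properties"]["file.size"] == len(pmml_bytes)
assert model["properties"]["file.md5sum"] == hashlib.md5(pmml_bytes).hexdigest()

# Establish the field mapping: list the active (input) field names the model expects
print([field["id"] for field in model["schema"]["activeFields"]])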


Evaluating the wine color model in single prediction mode:

curl -X POST --data-binary @/path/to/data_record.json -H "Content-type: application/json" http://localhost:8080/openscoring/model/wine-color

The request body is an org.openscoring.common.EvaluationRequest object:

  "id" : "sample-1",
  "arguments" : {
    "fixed_acidity" : 7.4,
    "volatile_acidity" : 0.7,
    "citric_acid" : 0,
    "chlorides" : 0.076,
    "total_sulfur_dioxide" : 34,
    "density" : 0.9978,
    "pH" : 3.51,
    "sulphates" : 0.56

The response body is an org.openscoring.common.EvaluationResponse object:

  "id" : "sample-1",
  "result" : {
    "color" : "red",
    "probability_white" : 8.264462809917355E-4,
    "probability_red" : 0.9991735537190083

Evaluating the wine color model in CSV mode:

curl -X POST --data-binary @/path/to/wine.csv -H "Content-type: text/plain; charset=UTF-8" http://localhost:8080/openscoring/model/wine-color/csv > /path/to/wine-color.csv


Removing the wine color model:

curl -X DELETE http://localhost:8080/openscoring/model/wine-color

Openscoring client libraries

The Openscoring REST API is fairly mature and stable. The majority of changes happen in the "REST over HTTP(S)" transport layer. For example, adding support for new data formats and encodings, new user authentication mechanisms, etc.

Openscoring client libraries provide easy and effective means for keeping up with changes. Application developers get to focus on high-level routines such as the "deploy", "evaluate" and "undeploy" commands, whose syntax and semantics should remain stable for an extended period of time.

The Java client library is part of the Openscoring project. Other client libraries (Python, R, PHP) are isolated into their own projects.

For example, the following Python script uses the Openscoring-Python library to replicate the example workflow.

import openscoring

os = openscoring.Openscoring("http://localhost:8080/openscoring")

# Deployment
os.deploy("wine-color", "/path/to/wine-color.pmml")

# Evaluation in single prediction mode
arguments = {
  "fixed_acidity" : 7.4,
  "volatile_acidity" : 0.7,
  "citric_acid" : 0,
  "chlorides" : 0.076,
  "total_sulfur_dioxide" : 34,
  "density" : 0.9978,
  "pH" : 3.51,
  "sulphates" : 0.56
result = os.evaluate("wine-color", arguments)

# Evaluation in CSV mode
os.evaluateCsv("wine-color", "/path/to/wine.csv", "/path/to/wine-color.csv")

# Undeployment

Converting R's random forest (RF) models to PMML documents

Originally written by Villu Ruusmann

The power and versatility of the R environment stems from its modular architecture. The functionality of the base platform can be quickly and easily expanded by downloading extension packages from the CRAN repository. For example, random forest models can be trained using the following functions:

  1. randomForest (randomForest package). Generic regression and classification. This is the reference implementation.
  2. cforest (party package). Generic regression and classification.
  3. randomUniformForest (randomUniformForest package). Generic regression and classification.
  4. bigrfc (bigrf package). Generic classification.
  5. logforest (LogicForest package). Binary classification.
  6. obliqueRF (obliqueRF package). Binary classification.
  7. quantregForest (quantregForest package). Quantile regression.

Every function implements a variation of the "bagging of decision trees" idea. The result is returned as a random forest object, whose description is typically formalized using a package-specific S3 or S4 class definition.

All such model objects are dummy data structures. They can only be executed using a corresponding function predict.<model_type>. For example, a random forest object that was trained using the function randomForest can only be executed by the function predict.randomForest (and not with some other function such as predict.cforest, predict.randomUniformForest etc.).

This one-to-one correspondence between models and model execution functions makes the deployment of R models on Java and Python platforms very complicated. Basically, it will be necessary to implement a separate Java and Python executor for every model type.


Predictive Model Markup Language (PMML) is an XML-based industry standard for the representation of predictive solutions. PMML provides a MiningModel element that can encode a wide variety of bagging and boosting models (plus more complex model workflows). A model that has been converted to the PMML data format can be executed by any compliant PMML engine. A list of PMML producer and consumer software can be found at the Data Mining Group (DMG) website under the PMML Powered section.

PMML leads to simpler and more robust model deployment workflows. Basically, models are first converted from their function-specific R representation to the PMML representation, and then executed on a shared platform-specific PMML engine. For the Java platform this could be the JPMML-Evaluator library. For the Python platform this could be the Augustus library.


The conversion of model objects from R to PMML is straightforward, because these two languages share many of the core concepts. For example, they both regard data records as collections of key-value pairs (eg. individual fields are identified by name not by position), and decorate their data exchange interfaces (eg. model input and output data records) with data schema information.


The first version of the pmml package was released in early 2007. This package has provided great service for the community over the years. However, it has largely failed to respond to new trends and developments, such as the emergence and widespread adoption of ensemble methods.

This blog post is about introducing the r2pmml package. Today, it simply addresses the major shortcomings of the pmml package. Going forward, it aims to bring a completely new set of tools to the table. The long-term goal is to make R models together with associated data pre- and post-processing workflows easily exportable to other platforms.

The exercise starts with training a classification-type random forest model for the "audit" dataset. All the data preparation work has been isolated to a separate R script "audit.R".


measure = function(fun){
  begin.time = proc.time()
  result = fun()
  end.time = proc.time();

  diff = (end.time - begin.time)
  print(paste("Operation completed in", round(diff[3] * 1000), "ms."))

  return (result)
}

audit = loadAuditData()
audit = na.omit(audit)


library("randomForest")
audit.rf = randomForest(Adjusted ~ ., data = audit, ntree = 100)
format(object.size(audit.rf), unit = "kB")


library("pmml")
audit.pmml = measure(function(){ pmml(audit.rf) })
format(object.size(audit.pmml), unit = "kB")
measure(function(){ saveXML(audit.pmml, "/tmp/audit-pmml.pmml") })


library("r2pmml")
measure(function(){ r2pmml(audit.rf, "/tmp/audit-r2pmml.pmml") })
measure(function(){ r2pmml(audit.rf, "/tmp/audit-r2pmml.pmml") })

The summary of the training run:

  1. Model training:
    • The size of the audit.rf object is 2'031 kB.
  2. Model export using the pmml package:
    • The pmml function call is completed in 61'280 ms.
    • The size of the audit.pmml object is 280'058 kB.
    • The saveXML function call is completed in 33'926 ms.
    • The size of the XML-tidied audit-pmml.pmml file is 6'853 kB.
  3. Model export using the r2pmml package:
    • The first r2pmml function call is completed in 4'077 ms.
    • The second r2pmml function call is completed in 1'466 ms.
    • The size of the XML-tidied audit-r2pmml.pmml file is 6'106 kB.

pmml package

Typical usage:


library("pmml")
audit.pmml = pmml(audit.rf)
saveXML(audit.pmml, "/tmp/audit-pmml.pmml")

This package defines a conversion function pmml.<model_type> for every supported model type. However, in most cases, it is recommended to invoke the S3 generic function pmml instead. This function determines the type of the argument model object, and automatically selects the most appropriate conversion function.

When the S3 generic function pmml is invoked using an unsupported model object, then the following error message is printed:

Error in UseMethod("pmml") :
  no applicable method for 'pmml' applied to an object of class "RandomForest"

The conversion produces an XMLNode object, which is a Document Object Model (DOM) representation of the PMML document. This object can be saved to a file using the function saveXML.

This package has a hard time handling large model objects (eg. bagging and boosting models) for two reasons. First, all the processing takes place in R memory space. In this example, the memory usage of user objects grows more than a hundred times, because the ~2 MB random forest object audit.rf gives rise to a ~280 MB DOM object audit.pmml. Moreover, all this memory is allocated incrementally in small fragments (ie. every new DOM node becomes a separate object), not in a large contiguous block. On a more positive note, it is possible that the (desktop-) GNU R implementation is outperformed in memory management aspects by alternative (server side-) R implementations.

Second, DOM is a low-level API, which is unsuitable for working with specific XML dialects such as PMML. Any proper medium- to high-level API should deliver much more compact representation of objects, plus take care of technical trivialities such as XML serialization and deserialization.

r2pmml package

Typical usage:


library("r2pmml")
r2pmml(audit.rf, "/tmp/audit-r2pmml.pmml")

The package defines a sole conversion function r2pmml, which is a thin wrapper around the Java converter application class org.jpmml.converter.Main. Behind the scenes, this function performs the following operations:

  1. Serializing the argument model object in ProtoBuf data format to a temporary file.
  2. Initializing the JPMML-Converter instance:
    • Setting the ProtoBuf input file to the temporary ProtoBuf file
    • Setting the PMML output file to the argument file
  3. Executing the JPMML-Converter instance.
  4. Cleaning up the temporary ProtoBuf file.

The capabilities of the function r2pmml (eg. the selection of supported model types) are completely defined by the capabilities of the JPMML-Converter library.

This package addresses the technical limitations of the pmml package completely. First, all the processing (except for the serialization of the model object to a temporary file in the ProtoBuf data format) has been moved from the R memory space to a dedicated Java Virtual Machine (JVM) memory space. Second, model converter classes employ the JPMML-Model library, which delivers high efficiency without compromising on functionality. In this example, the ~2 MB random forest object audit.rf gives rise to a ~5.3 MB Java PMML class model object. That is 280 MB / 5.3 MB = ~50 times smaller than the DOM representation!

The detailed timing information about the conversion is very interesting (the readings correspond to the first and second r2pmml function call):

  1. The R side of operations:
    • Serializing the model in ProtoBuf data format to the temporary file: 1'262 and 1'007 ms.
  2. The Java side of operations:
    • Deserializing the model from the temporary file: 166 and 14 ms.
    • Converting the model from R representation to PMML representation: 648 and 310 ms.
    • Serializing the model in PMML data format to the output file: 2'001 and 135 ms.

The newly introduced r2pmml package fulfills all expectations by being 100 to 200 times faster than the pmml package (eg. 310 vs 61'280 ms. for model conversion, 135 vs 33'926 ms. for model serialization). The gains are even higher when working with real-life random forest models that are order(s) of magnitude larger. Some gains are attributable to JVM warmup, because the conversion of ensemble models involves performing many repetitive tasks. The other gains are attributable to the smart caching of PMML content by the JPMML-Converter library, which lets the memory usage scale sublinearly (with respect to the size and complexity of the model).

Also, the newly introduced r2pmml package is able to encode the same amount of information using fewer bytes than the pmml package. In this example, if the resulting files audit-r2pmml.pmml and audit-pmml.pmml are XML-tidied following the same procedure, then it becomes apparent that the former is approximately 10% smaller than the latter (6'106 vs 6'853 kB).


The r2pmml package depends on the RProtoBuf package for ProtoBuf serialization and the rJava package for Java invocation functionality. Both packages can be downloaded and installed from the CRAN repository using the R built-in function install.packages.

Here, the installation and configuration is played out on a blank GNU/Linux system (Fedora). All system-level dependencies are handled using the Yum software package manager.

RProtoBuf package

This package depends on curl and protobuf system libraries. It is worth mentioning that if the package is built from its source form (default behavior on *NIX systems), then all the required system libraries must be present both in their standard (no suffix) and development flavors (identified by the "-dev" or "-devel" suffix).

$ yum install curl curl-devel
$ yum install protobuf protobuf-devel

After that, the RProtoBuf package can be installed as usual:

install.packages("RProtoBuf")

If the system is missing the curl development library curl-devel, then the installation fails with the following error message:

checking for curl-config... no
Cannot find curl-config
ERROR: configuration failed for package ‘RCurl’
ERROR: dependency ‘RCurl’ is not available for package ‘RProtoBuf’

If the system is missing the protobuf development library protobuf-devel, then the installation fails with the following error message:

configure: error: ERROR: ProtoBuf headers required; use '-Iincludedir' in CXXFLAGS for unusual locations.
ERROR: configuration failed for package ‘RProtoBuf’

The format of ProtoBuf messages is defined by the proto file inst/proto/rexp.proto. Currently, the JPMML-Converter library uses the proto file that came with the RProtoBuf package version 0.4.2. As a word of caution, it is useless to force the r2pmml package to depend on any RProtoBuf package version older than that, because this proto file underwent incompatible changes between versions 0.4.1 and 0.4.2. The Java converter application throws an exception when the contents of the ProtoBuf input file do not match the expected ProtoBuf message format.

The version of a package can be verified using the function packageVersion:

packageVersion("RProtoBuf")

rJava package

This package depends on Java version 1.7.0 or newer.

$ yum install java-1.7.0-openjdk

The Java executable java must be available via system and/or user path. Everything should be good to go if the java version can be verified by launching the Java executable with the -version option:

$ java -version

After that, the rJava package can be installed as usual:

install.packages("rJava")