Training Random Forest Model in Apache Spark and Deploying It to Openscoring REST Web Service


In this tutorial we're building the following setup:

  1. We take the Titanic dataset 
  2. We are going to use a PySpark script based on this example (using Spark SQL DataFrames, MLlib and Pipelines) and modify it a little to suit our needs
  3. We train a random forest model using Apache Spark Pipelines
  4. We convert the model to standard PMML using JPMML-SparkML package
  5. We deploy the random forest model to Openscoring web service using Openscoring Python client library
  6. We get real-time scoring of test data directly from the Spark environment

Here's a visual overview of which JPMML/Openscoring components are used in this tutorial, as well as the other paths we provide from model training to model evaluation.

Training model on Apache Spark and deploying to Openscoring REST evaluator


Setting up the Environment

We have some good news here: we've done all the heavy lifting for you on Amazon Machine Images that you can use as a template to launch your own EC2 instance and try the same steps there. Please be aware that this AMI is a quick-and-dirty proof-of-concept setup rather than a production environment. Of course, you can also do the whole setup from scratch yourself; it's just a matter of tracing the dependencies and installing them one by one (you're going to need JDK 8, Spark 2.2, Maven, Git, Python, NumPy, pandas, Openscoring and the JPMML-SparkML package, just to name a few).

The AMIs are as follows (let me know if you can't access any of these regions and I'll make the image available in your region too):

  • EU (Frankfurt) - ami-0856fb979732e0016
  • EU (Ireland) - ami-0aa76e5c2adbaf682
  • US West (N. California) - ami-0c6ac011f918947e1
  • US East (N. Virginia) - ami-07f1b7187e58af6b4
  • Asia Pacific (Singapore) - ami-0ddae89443be18b52

When launching the AMI, you'll need to create a key pair to access the instance (AWS provides very good tutorials for this).


Launching Spark

Once you've logged in to the EC2 instance you just created based on our template, you can use the following commands to launch Apache Spark:

$ cd spark-2.2.1-bin-hadoop2.7/bin
$ ./pyspark --jars /home/ec2-user/jpmml-sparkml-package/target/jpmml-sparkml-package-1.3-SNAPSHOT.jar

This will open the PySpark prompt and also include the necessary JPMML-SparkML package JAR file, which takes care of converting the fitted pipeline to PMML.


Training the Random Forest Model with Titanic Data

Here comes the Python code for training the model, deploying it to the Openscoring web service running on the same EC2 instance, and getting the test scores. The Openscoring Python client library is used to deploy the model to the REST evaluator.
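A condensed sketch of what such a script looks like, meant to be run inside the pyspark shell launched earlier. The column selection, the train.csv path, and the exact helper names (toPMMLBytes() from the jpmml_sparkml module, deploy() on the Openscoring client) are assumptions here; verify them against the JPMML-SparkML package and Openscoring client versions installed on your instance:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Load the Titanic training data as a Spark SQL DataFrame
# (file path is an assumption; adjust to where the data lives on your instance)
df = spark.read.csv("train.csv", header = True, inferSchema = True) \
    .select("Survived", "Pclass", "Sex", "Age", "SibSp", "Fare", "Embarked") \
    .dropna()

# Index the categorical columns and the label, assemble the feature vector
sexIndexer = StringIndexer(inputCol = "Sex", outputCol = "SexIndex")
embarkedIndexer = StringIndexer(inputCol = "Embarked", outputCol = "EmbarkedIndex")
labelIndexer = StringIndexer(inputCol = "Survived", outputCol = "SurvivedIndex")
assembler = VectorAssembler(
    inputCols = ["Pclass", "SexIndex", "Age", "SibSp", "Fare", "EmbarkedIndex"],
    outputCol = "features")
rf = RandomForestClassifier(labelCol = "SurvivedIndex", featuresCol = "features")

# Train the whole preprocessing + model pipeline in one go
pipeline = Pipeline(stages = [sexIndexer, embarkedIndexer, labelIndexer, assembler, rf])
pipelineModel = pipeline.fit(df)

# Convert the fitted pipeline to PMML using the JPMML-SparkML package
from jpmml_sparkml import toPMMLBytes
pmmlBytes = toPMMLBytes(sc, df, pipelineModel)

# Deploy to the Openscoring service running on the same instance
from openscoring import Openscoring
os = Openscoring("http://localhost:8080/openscoring")
os.deploy("Titanic", pmmlBytes)
```

After the deploy call, the model is live under the "Titanic" id and can be scored with os.evaluate(), as shown below.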

Based on the predictions, the young female passenger had a pretty good chance of surviving the Titanic catastrophe, while the middle-aged man sadly didn't:

>>> arguments = {"Fare":7.25,"Sex":"female","Age":20,"SibSp":1,"Pclass":1,"Embarked":"Q"}
>>> print(os.evaluate("Titanic", arguments))
{u'probability(0)': 0.1742083085440686, u'probability(1)': 0.8257916914559313, u'pmml(prediction)': u'1', u'Survived': u'1', u'prediction': 1.0}

>>> arguments = {"Fare":7.25,"Sex":"male","Age":45,"SibSp":1,"Pclass":3,"Embarked":"S"}
>>> print(os.evaluate("Titanic", arguments))
{u'probability(0)': 0.8934427101921765, u'probability(1)': 0.10655728980782375, u'pmml(prediction)': u'0', u'Survived': u'0', u'prediction': 0.0}
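Under the hood, each evaluate() call is a thin wrapper around a JSON POST request to the model's REST endpoint. A small sketch of what the request body looks like (the "record-001" id is a hypothetical illustration; Openscoring's evaluation request carries an optional record id plus the arguments mapping):

```python
import json

# Build an Openscoring evaluation request body: an optional record "id"
# plus the "arguments" mapping, exactly as passed to os.evaluate() above
arguments = {"Fare": 7.25, "Sex": "male", "Age": 45, "SibSp": 1,
             "Pclass": 3, "Embarked": "S"}
request_body = json.dumps({"id": "record-001", "arguments": arguments})

# This JSON string is what gets POSTed to
# http://localhost:8080/openscoring/model/Titanic
print(request_body)
```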


Accessing the Model Using Openscoring REST API

You can access the Openscoring web service REST API (scroll down for the API spec) not only with the Python client library from Spark, but with any tool that can send custom HTTP requests. We prefer to use Postman for this. The previous post in our blog discussed using the Openscoring REST service in more detail.

In order to access Openscoring running on your EC2 instance from outside the machine itself, you need to find the security group associated with your EC2 instance and add one rule:

You need to choose Custom TCP, port 8080, Anywhere - this will allow you to access the Openscoring web service running on port 8080 from anywhere in the world (you can always choose 'My IP' to open it only to yourself). Now go back to the instances list, grab the public DNS name of your instance and create the following request in Postman:

  • Use POST
  • Set URL to: http://<public DNS>:8080/openscoring/model/Titanic/csv
  • Set Content-Type header to text/plain

You can use the following data as the request body (or compile your own based on the data dictionary):
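The original request body isn't reproduced here, but based on the input fields used earlier, a minimal CSV body could look like the following (the header row carries the active field names, and each subsequent row is one passenger to score; check the names against the model's data dictionary):

```
Pclass,Sex,Age,SibSp,Fare,Embarked
1,female,20,1,7.25,Q
3,male,45,1,7.25,S
```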


Submitting the request in Postman should result in something like this:

As per the model predictions, the chances of surviving the Titanic catastrophe were pretty bleak :-(

If you don't want to bother installing Postman, you can access these two URLs even from your browser:

http://<public DNS>:8080/openscoring/model/Titanic
http://<public DNS>:8080/openscoring/model/Titanic/pmml

The first one will give you a brief overview of the model inputs and outputs in JSON format, while the second one will give you the PMML representation of the whole random forest.

Feel free to let me know of any issues related to this guide in the comments section!