There seems to be a large gap between developing machine learning models and deploying them to production, and the gap seems especially wide for models created in R. In this tutorial, we train a simple logistic regression model on the Titanic dataset in R, export it to PMML using the R2PMML converter (a tool from the Openscoring JPMML family), and then deploy it on the Openscoring REST-based evaluation service.
We here at Openscoring believe in, and invest heavily in, PMML-provided interoperability between training and production environments, whose software stacks are usually very different: model development happens in R and Python, while production systems run on Java. We've developed a toolset of converters and evaluators, with PMML providing the interoperability in between.
Usually we try to provide a full setup as an Amazon Machine Image (AMI), but this time we're taking a slightly longer route, because an AMI doesn't lend itself to graphical user interfaces like RStudio. So this is what you need:
- R itself
- RStudio (there's a free desktop edition)
- Java JDK version 8.x (we'll be supporting JDK 9 soon). Although r2pmml is an R wrapper, it relies on Java libraries in the background.
- The Titanic dataset
- Good old Postman
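Because r2pmml calls Java in the background, a quick check from the R console that a `java` executable is on the PATH can save a confusing error later. The following is a minimal sketch using base R only; the helper name `has_executable` is made up for this example.

```r
# TRUE if an executable with the given name is found on the PATH.
has_executable <- function(name) {
  unname(nzchar(Sys.which(name)))
}

if (has_executable("java")) {
  # Ask Java to report its version.
  system2("java", "-version")
} else {
  message("No 'java' executable found on the PATH; install a JDK first.")
}
```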
Training the Model
The following script imports the necessary libraries and trains the model (make sure you get the Titanic dataset path right):
    # Installation preconditions
    install.packages("xgboost")
    library("xgboost")
    library("devtools")
    # GitHub no longer serves the git:// protocol, so the repository is fetched over HTTPS
    install_git("https://github.com/jpmml/r2pmml.git")
    library("r2pmml")

    # Reading data and training the model - you need to change the path!
    training.data.raw <- read.csv('C:/Users/karelk/Documents/jpmml/Titanic/Titanic.csv', header=T, na.strings=c(""))
    # Select the modelling columns first, so that `data` exists before it is modified
    data <- subset(training.data.raw, select=c(2,3,5,6,7,8,10))
    # Replace missing ages with the mean age
    data$Age[is.na(data$Age)] <- mean(data$Age, na.rm=T)
    data$Survived = as.factor(data$Survived)
    model <- glm(Survived ~., family=binomial(link='logit'), data=data)

    # Export the fitted model to PMML
    r2pmml(model, "Titanic-R.pmml")
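Before exporting, it can be worth checking in R that the fitted model returns sensible probabilities, so that they can later be compared with the scores coming back from Openscoring. The sketch below does not use the real Titanic file: `toy` is a small invented data frame with only two predictors, and the names `toy_model` and `new_passengers` are made up for this example. It repeats the same imputation and `glm` call on that data and then scores two new records.

```r
# Toy stand-in for the Titanic data; all values are invented for illustration.
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0),
  Sex      = factor(c("male", "female", "female", "male", "female", "male",
                      "male", "female", "female", "male", "female", "male")),
  Age      = c(22, 38, NA, 35, 27, NA, 54, 19, 40, 31, 8, 62)
)

# Same preparation steps as in the training script.
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)
toy$Survived <- as.factor(toy$Survived)

# Logistic regression on all remaining columns.
toy_model <- glm(Survived ~ ., family = binomial(link = "logit"), data = toy)

# Two new records that differ only in Sex.
new_passengers <- data.frame(
  Sex = factor(c("female", "male"), levels = levels(toy$Sex)),
  Age = c(30, 30)
)

# type = "response" returns the probability of the second factor level ("1").
survival_prob <- predict(toy_model, newdata = new_passengers, type = "response")
print(survival_prob)
```

With the real model, the same `predict(model, newdata = ..., type = "response")` call on a few rows gives reference values to compare with the deployed service.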
We're not going to discuss the quality of the model; it's kept simple for example purposes. My intention here is to show the conversion and deployment part, not to maximize prediction accuracy. There's a very thorough tutorial exploring the various relationships in the Titanic dataset in this YouTube playlist.
There are two ways to deploy the model, and both of them are much simpler than learning even basic R:
- The easiest way: Testing PMML Execution with Openscoring REST Service on Amazon EC2. In that tutorial we provide an Amazon Machine Image with the Openscoring REST evaluation service already listening on port 8080 once you start the image.
- Still an easy way, but with a few more dependencies: Deploying PMML Model to Openscoring REST Evaluation Service. In that tutorial we describe how to set up the Openscoring REST evaluation service on your own Windows computer and deploy the model.
In the screenshot below, I'm using Postman to deploy the model to the Openscoring service running on localhost:
And in the next picture, I'm already calculating the scores:
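The same two steps can also be scripted from R instead of clicked through in Postman. The sketch below assumes the `httr` and `jsonlite` packages are installed and that the service follows the endpoint pattern from the Openscoring README (`PUT` the PMML file to `/openscoring/model/{id}`, then `POST` a JSON record to the same URL). The model id `TitanicR`, the helper function names, and the field names and values in the example record are assumptions; the field names must match the input fields of your own PMML file.

```r
# Assumed base URL of a local Openscoring instance and a hypothetical model id.
base_url <- "http://localhost:8080/openscoring"
model_id <- "TitanicR"

# Build the URL of one model resource.
model_url <- function(base_url, model_id) {
  paste0(base_url, "/model/", model_id)
}

# Deploy: send the PMML file as the body of an HTTP PUT request.
deploy_model <- function(pmml_path, url) {
  httr::PUT(url, body = httr::upload_file(pmml_path, type = "text/xml"))
}

# Build an evaluation request: a record id plus named input values.
build_request <- function(record_id, arguments) {
  list(id = record_id, arguments = arguments)
}

# Score: send the request as JSON in the body of an HTTP POST request.
score_record <- function(request, url) {
  body <- as.character(jsonlite::toJSON(request, auto_unbox = TRUE))
  httr::POST(url, body = body, httr::content_type_json())
}

# Example record; field names are assumed, values are invented.
request <- build_request(
  "record-001",
  list(Pclass = 3, Sex = "male", Age = 22, SibSp = 1, Parch = 0, Fare = 7.25)
)

# With the service running, the calls would look like this:
# url <- model_url(base_url, model_id)
# httr::status_code(deploy_model("Titanic-R.pmml", url))
# httr::content(score_record(request, url))
```

Keeping the request building separate from the HTTP calls makes it easy to inspect the record before it is sent.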