There is a large gap between developing machine learning models and deploying them to production, and it is especially problematic for models created in R. In this tutorial, we go through the process of training a simple regression model on the Auto-MPG dataset in R, exporting it to PMML using R2PMML (a tool from the Openscoring JPMML family), and then deploying it on Apache Pig as a UDF (User Defined Function). UDFs are a great way to deploy predictive models close to your big data, avoiding the need to send the data over the network for scoring.
We at Openscoring believe in, and invest heavily in, the interoperability that PMML provides between training and production environments, where the software stacks are usually very different - for example, model development happens in R and Python while production systems run on Java. We've developed a toolset of converters and evaluators, with PMML providing the interoperability in between.
As usual, we provide the full setup as an Amazon Machine Image (AMI).
The image comes with R preinstalled (including all necessary packages like plyr, devtools and r2pmml), along with Apache Pig and JPMML-Evaluator for Pig (both of which in turn depend on many other software packages, like Java, Maven, JPMML-Evaluator and JPMML-Model). If you want to install everything yourself, expect some trial and error while installing the dependencies - some of which are non-trivial on EC2.
We've created the following images for different regions (let me know if you can't access any of these regions and I'll make the image available in your region too):
- EU (Frankfurt) - ami-011c1a70ec49fe054
- US West (N. California) - ami-073a84c441b7f905c
- US East (N. Virginia) - ami-0528869f6be13dbf6
When launching the AMI, you'll need to create a key pair to access it (AWS provides very good tutorials for this).
In the prepared image:
- R is on the PATH
- Apache Pig is located in /home/ec2-user/pig-0.17.0/bin/ and is also on the PATH
- JPMML-Evaluator for Pig is located in /home/ec2-user/jpmml-evaluator-pig/ and the precompiled JARs are already present in target/ directory
- All the necessary data is located in /home/ec2-user/sampledata directory
Training and Converting the Model
You can launch R by simply typing R.
The following script imports the necessary libraries and trains the model:
# Installation preconditions
library("plyr")
library("devtools")
library("r2pmml")
# If any of the commands fail, you need to do:
# install.packages("plyr")
# install.packages("devtools")
# install_git("git://github.com/jpmml/r2pmml.git")
# Devtools package has several non-trivial dependencies on EC2:
# sudo yum install openssl-devel
# sudo yum install libcurl-devel

# Load and prepare the Auto-MPG dataset
auto = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  quote = "\"", header = FALSE, na.strings = "?", row.names = NULL,
  col.names = c("mpg", "cylinders", "displacement", "horsepower", "weight",
    "acceleration", "model_year", "origin", "car_name"))
auto$origin = as.factor(auto$origin)
auto$car_name = NULL
auto = na.omit(auto)

# Train a model
auto.glm = glm(mpg ~ (. - horsepower - weight - origin) ^ 2
  + I(displacement / cylinders)
  + cut(horsepower, breaks = c(0, 50, 100, 150, 200, 250))
  + I(log(weight))
  + revalue(origin, replace = c("1" = "US", "2" = "Europe", "3" = "Japan")),
  data = auto)

# Export the model to PMML
r2pmml(auto.glm, "/home/ec2-user/sampledata/auto_glm.pmml")
As usual in these tutorials, we're not going to discuss the quality of the model - this tutorial is about converting and deploying the model, not about data analysis.
Creating a Simple Pig UDF and Evaluating Some Data
First of all, let's launch Pig and register the necessary JAR file:
~/pig-0.17.0/bin/pig -x local
And at the Pig prompt:
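The runtime JAR lives in the target/ directory of the jpmml-evaluator-pig checkout. The exact file name below is an assumption (the version suffix depends on the build), so adjust it to whatever your target/ directory actually contains:

```pig
-- Register the JPMML-Evaluator for Pig runtime JAR.
-- The version suffix is an assumption - check target/ for the actual file name.
REGISTER /home/ec2-user/jpmml-evaluator-pig/target/jpmml-evaluator-pig-runtime-1.0-SNAPSHOT.jar;
```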
Now let's create a simple UDF based on the PMML file we got from R (feel free to overwrite the sampledata/auto_glm.pmml file with what you got from R):
DEFINE AutoMPG org.jpmml.evaluator.pig.EvaluatorFunc('/home/ec2-user/sampledata/auto_glm.pmml');
Now we'll load some data and do the scoring:
AutoMPG_test = LOAD '/home/ec2-user/sampledata/mpg-test.csv' USING PigStorage(';')
    AS (cylinders:double, displacement:double, horsepower:double, weight:double,
        acceleration:double, model_year:double, origin:int, name:chararray);
DESCRIBE AutoMPG_test;
DUMP AutoMPG_test;
AutoMPG_calc = FOREACH AutoMPG_test GENERATE AutoMPG(*);
DESCRIBE AutoMPG_calc;
DUMP AutoMPG_calc;
Be warned that both DUMPs will output a few hundred lines of data, so if your scrollback buffer isn't long enough, you might lose the beginning of the output.
Creating PMML Functions Using the ArchiveBuilder UDF
Now we're going to take it to the META level and use our UDF to create UDFs :-) With Pig, you don't really need to go this route (as opposed to Apache Hive), but the solution is so elegant that we couldn't keep ourselves from implementing it!
First, let's create a file Udf.csv with the following content (it actually already exists in the sampledata/ directory):
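The file holds one record per UDF to generate: the fully qualified class name, the path to the PMML file, and the path where the generated JAR should be written (matching the Class_Name, PMML_File and Model_Jar_File columns used below). The class name and PMML path come from this tutorial; the JAR output path is an assumption - any writable location works:

```csv
io.openscoring.AutoMPG,/home/ec2-user/sampledata/auto_glm.pmml,/home/ec2-user/sampledata/AutoMPG.jar
```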
And from Pig prompt we need to execute the following commands (this assumes that the jpmml-evaluator-pig-runtime is registered):
Udf = LOAD '/home/ec2-user/sampledata/Udf.csv' USING PigStorage(',')
    AS (Class_Name:chararray, PMML_File:chararray, Model_Jar_File:chararray);
Udf_model_jar_file = FOREACH Udf GENERATE org.jpmml.evaluator.pig.ArchiveBuilderFunc(*);
DUMP Udf_model_jar_file;
What happens here is that:
- We load the class name from Udf.csv (there's no need for an actual .java file with this class definition to exist - its code is generated on the fly)
- We use the PMML file referenced in Udf.csv (make sure it exists at the path given)
- We output the generated and packaged code in the file Model_Jar_File at the specified path
- DUMP actually creates the JAR file and returns the full path to it
As the next step, register the JAR file and define a UDF based on it:
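The REGISTER statement takes the path produced by the ArchiveBuilder step, i.e. the Model_Jar_File value from Udf.csv (the exact path below is an assumption - use whatever path the DUMP returned):

```pig
-- Register the generated model JAR; the path is the Model_Jar_File value
-- from Udf.csv (assumed here), as returned by the DUMP above.
REGISTER /home/ec2-user/sampledata/AutoMPG.jar;
```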
DEFINE AutoMPG io.openscoring.AutoMPG;
And now we can do the same scoring part as shown for the simple function.