Quick and Easy Deployment of Machine Learning (ML) Models
A standards-based, open-source software suite for moving Apache Spark, R and Scikit-Learn models from “lab” to “factory”
Business value is created by finding a solution to a problem, and delivering it to customers.
The more engineering you need to do, the slower the pace, and the higher the cost. Choose a deterministic, ready-made solution (“buy”) instead of taking on an open-ended development and maintenance commitment (“build”).
Openscoring provides end-to-end, no-code/low-code workflows.
Humans outdo machines only on creative tasks. Creativity is involved in designing and assembling ML workflows, not running them.
Good software is an easier hire than a good data scientist or data engineer.
Openscoring provides self-documenting, self-testing, self-integrating ML models.
Finalized ML models are typically regarded as “black boxes” that can only deliver numeric predictions.
Yet, if you approach them right, they can become intellectual property assets, which give valuable insights into business processes.
Openscoring provides rich APIs for accessing each and every aspect of models and predictions.
Building a model evaluator instance from a PMML XML file:
import java.io.File;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.LoadingModelEvaluatorBuilder;
Evaluator evaluator = new LoadingModelEvaluatorBuilder()
    .load(new File("LogisticRegression.pmml"))
    .build();
Doing the same, plus transpiling the PMML XML markup into Java bytecode (backed by the Java PMML API) for a 5-15x performance improvement:
import java.io.IOException;
import org.jpmml.transpiler.FileTranspiler;
import org.jpmml.transpiler.Transpiler;
import org.jpmml.transpiler.TranspilerTransformer;
LoadingModelEvaluatorBuilder evaluatorBuilder = new LoadingModelEvaluatorBuilder()
    .load(new File("XGBoost.pmml"));

try {
    Transpiler transpiler = new FileTranspiler(null, new File("XGBoost.pmml.jar"));

    evaluatorBuilder = evaluatorBuilder
        .transform(new TranspilerTransformer(transpiler));
} catch(IOException ioe){
    // Ignored - the evaluator builder will fall back to the default (interpreted) evaluation mode
}

Evaluator evaluator = evaluatorBuilder.build();
Using the embedded verification dataset for self-testing and overall warm-up:
evaluator.verify();
Evaluating data records:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.dmg.pmml.FieldName;
import org.jpmml.evaluator.EvaluatorUtil;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.OutputField;
import org.jpmml.evaluator.TargetField;
Map<String, Object> userArguments = readArguments();

Map<FieldName, FieldValue> pmmlArguments = new HashMap<>();

List<? extends InputField> inputFields = evaluator.getInputFields();
for(InputField inputField : inputFields){
    Object userValue = userArguments.get((inputField.getName()).getValue());

    // Transform an arbitrary Java primitive value to a known-good PMML argument value
    FieldValue pmmlValue = inputField.prepare(userValue);

    pmmlArguments.put(inputField.getName(), pmmlValue);
}

// Evaluate
Map<FieldName, ?> pmmlResults = evaluator.evaluate(pmmlArguments);

Map<String, Object> userResults = new HashMap<>();

// Primary result(s) (e.g. y)
List<? extends TargetField> targetFields = evaluator.getTargetFields();
for(TargetField targetField : targetFields){
    Object targetValue = pmmlResults.get(targetField.getName());

    // Transform a PMML result value to a Java primitive value
    targetValue = EvaluatorUtil.decode(targetValue);

    userResults.put((targetField.getName()).getValue(), targetValue);
}

// Secondary results (e.g. probability(y), affinity(y), entityId(y))
List<? extends OutputField> outputFields = evaluator.getOutputFields();
for(OutputField outputField : outputFields){
    Object outputValue = pmmlResults.get(outputField.getName());

    userResults.put((outputField.getName()).getValue(), outputValue);
}

writeResults(userResults);
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline
import pandas
df = pandas.read_csv("Audit.csv")
pipeline = PMMLPipeline([
    ("transformer", ColumnTransformer([
        ("continuous", "passthrough", ["Age", "Hours", "Income"]),
        ("categorical", OneHotEncoder(), ["Employment", "Education", "Marital", "Occupation", "Gender", "Deductions"])
    ])),
    ("classifier", LogisticRegression(multi_class = "ovr"))
])
pipeline.fit(df, df["Adjusted"])
pipeline.verify(df.sample(10))
sklearn2pmml(pipeline, "LogisticRegression.pmml")
library("dplyr")
library("r2pmml")
df = read.csv("Audit.csv")
df$Adjusted = as.factor(df$Adjusted)
audit.glm = glm("Adjusted ~ .", data = df, family = "binomial")
audit.glm = verify(audit.glm, newdata = sample_n(df, 10))
r2pmml(audit.glm, "LogisticRegression.pmml")
import java.io.File
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.RFormula
import org.jpmml.sparkml.PMMLBuilder
import org.jpmml.sparkml.model.HasPredictionModelOptions
val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("Audit.csv")
val rFormula = new RFormula().setFormula("Adjusted ~ .")
val lr = new LogisticRegression().setLabelCol(rFormula.getLabelCol).setFeaturesCol(rFormula.getFeaturesCol)
val pipeline = new Pipeline().setStages(Array(rFormula, lr))
val pipelineModel = pipeline.fit(df)
val pmmlBuilder = new PMMLBuilder(df.schema, pipelineModel)
  .putOption(HasPredictionModelOptions.OPTION_KEEP_PREDICTIONCOL, false)
  .verify(df.sample(false, 0.01).limit(10))
pmmlBuilder.buildFile(new File("LogisticRegression.pmml"))
The Audit dataset (binary target; three continuous and six categorical features)
Running the Openscoring server application:
$ java -jar openscoring-server-executable-${version}.jar
Deploying a model from a PMML XML file, using it for evaluation (batch mode), and undeploying:
$ curl -X PUT --data-binary @LogisticRegression.pmml -H "Content-type: text/xml" http://localhost:8080/openscoring/model/MyAuditModel
$ curl -X POST --data-binary @Audit.csv -H "Content-type: text/plain; charset=UTF-8" http://localhost:8080/openscoring/model/MyAuditModel/csv > Audit-results.csv
$ curl -X DELETE http://localhost:8080/openscoring/model/MyAuditModel
Doing the same using the Openscoring-Python client library:
from openscoring import Openscoring
os = Openscoring(base_url = "http://localhost:8080/openscoring")
os.deployFile("MyAuditModel", "LogisticRegression.pmml")
os.evaluateCsvFile("MyAuditModel", "Audit.csv", "Audit-results.csv")
os.undeploy("MyAuditModel")
PAPIs 2018 tool demonstration: "Putting five ML models to production in five minutes"
The majority of Openscoring software is released under the terms and conditions of the GNU Affero General Public License (AGPL), version 3.0. AGPLv3 is a free software license [1]. AGPLv3 is very similar to the GNU General Public License (GPL), version 3, but comes with an additional provision, which addresses the use of software over a computer network.
If AGPLv3 is not acceptable, then it is possible to enter into a licensing agreement, which makes Openscoring software available under the terms and conditions of the BSD 3-Clause License. The re-licensing process is quick and easy (attainable by exchanging three e-mails), and protects the interests of both parties.
[1] “Free software” is a matter of liberty, not price. To understand the concept, you should think of “free” as in “free speech”, not as in “free beer”. For more information, please see The Free Software Definition.
Machine learning (ML) deals with small- and medium-size statistical and data mining models, whose inner workings can be easily reasoned about. In contrast, Artificial Intelligence (AI) deals with unfathomable neural networks.
ML is applicable to business problems that are based on structured datasets (spreadsheets, relational databases). AI is applicable to large unstructured datasets (media collections).
Statistics/ML may not sound as cool as deep learning/AI, but its economic impact is far greater.
ML models “at rest” can be captured and systematized using very simple data structures.
The Data Mining Group (DMG) is an independent, vendor-led consortium that works on the standardization of ML model data structures.
DMG released the Predictive Model Markup Language (PMML) in its current form (the 3.0+ schema version) in 2004, and has been continuously updating and maintaining it since then. DMG has also released the Portable Format for Analytics (PFA), but its uptake has been negligible due to technical complexity.
PMML uses high-level data structures, which abstract away all medium- and low-level data structures used by popular ML frameworks and applications.
For example, PMML defines singular tree and regression table data structures that can capture the entirety of decision tree, linear regression and logistic regression models in everyday use (Apache Spark, R, Scikit-Learn, LightGBM, XGBoost, etc).
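To make the regression table idea concrete, here is a hand-written, schematically simplified PMML fragment for a two-class logistic regression model (the field names echo the Audit dataset, but the coefficient values are made up for illustration), together with a few lines of standard-library Python that parse it and list the coefficients of each regression table:

```python
import xml.etree.ElementTree as ET

# A hand-written, simplified PMML fragment; element and attribute names follow
# the PMML schema, while the coefficient values are invented for illustration.
PMML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <DataDictionary numberOfFields="3">
    <DataField name="Age" optype="continuous" dataType="double"/>
    <DataField name="Income" optype="continuous" dataType="double"/>
    <DataField name="Adjusted" optype="categorical" dataType="integer"/>
  </DataDictionary>
  <RegressionModel functionName="classification" normalizationMethod="logit">
    <MiningSchema>
      <MiningField name="Age"/>
      <MiningField name="Income"/>
      <MiningField name="Adjusted" usageType="target"/>
    </MiningSchema>
    <RegressionTable targetCategory="1" intercept="-1.50">
      <NumericPredictor name="Age" coefficient="0.02"/>
      <NumericPredictor name="Income" coefficient="0.00001"/>
    </RegressionTable>
    <RegressionTable targetCategory="0" intercept="0.0"/>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_4"}

root = ET.fromstring(PMML)
for table in root.iterfind(".//pmml:RegressionTable", NS):
    # Collect the per-field coefficients of this regression table
    coefficients = {p.get("name"): float(p.get("coefficient"))
                    for p in table.iterfind("pmml:NumericPredictor", NS)}
    print(table.get("targetCategory"), float(table.get("intercept")), coefficients)
```

The entire decision function of the model lives in two `RegressionTable` elements; any tool that understands this one data structure can execute logistic regression models exported from Apache Spark, R or Scikit-Learn alike.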
PMML data structures can be extended by adding new elements and attributes to them. The evolution of data structures is carried out both in backwards- and forwards-compatible manner.
Any PMML document generated in the past 15 years is valid and usable today, and will likely remain so 15 years into the future.
PMML data structures balance expressiveness with executability.
A PMML document can be worked on (analyzed, executed) manually, or using a wide variety of tools from open-source and proprietary vendors. The default data format for PMML documents is XML, but they can equally well be stored in alternative text (JSON, YAML) or binary data formats.
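As a sketch of this format independence, PMML content can be converted mechanically between XML and, say, JSON. The following toy converter (standard-library Python, not part of any Openscoring product) turns an XML element into a nested, JSON-friendly dict:

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Recursively convert an XML element into a JSON-friendly dict:
    attributes become keys, child elements become lists under their tag name."""
    result = dict(element.attrib)
    for child in element:
        # Strip the namespace prefix from the tag, if any
        tag = child.tag.split("}")[-1]
        result.setdefault(tag, []).append(element_to_dict(child))
    return result

xml_fragment = """<RegressionTable targetCategory="1" intercept="-1.5">
  <NumericPredictor name="Age" coefficient="0.02"/>
</RegressionTable>"""

doc = {"RegressionTable": [element_to_dict(ET.fromstring(xml_fragment))]}
print(json.dumps(doc, indent = 2))
```

The point is not the converter itself, but that PMML's information content is plain attribute-value data, independent of any particular serialization format.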
Openscoring has established itself as the most innovative and most capable vendor of PMML tools.
Detailed guidance, feature requests, bug reports about specific products? Please open a new GitHub issue in the appropriate Java PMML API or Openscoring REST API repository.
Questions about PMML and its applicability to ML workflows? Please open a new thread in the JPMML Mailing List.
Other exciting opportunities? Please contact us privately.