Why Use PMML for Your Machine Learning Models?

PMML (Predictive Model Markup Language) is an XML-based standard created by the Data Mining Group (DMG) for storing predictive analytical models in a common format.

Although PMML support among the tools on the market ranges from correct implementations through mediocre ones to none at all, it is still pretty much the only open standard that provides a common denominator between different machine learning tools and packages. Yes, there's also PFA (Portable Format for Analytics), but a Google search for "PFA evaluator" suggests searching for "PFA calculator" instead...

So unless you can train your models and make predictions with the same tool and in the same environment, sooner or later you'll find yourself looking for ways to interoperate, for example to deploy R models in Java applications. When you research the options, PMML is probably one of them.

As we've been creating tools that convert R/Scikit-learn/Spark/TensorFlow models to PMML, and we also provide a PMML evaluator, we know a thing or two about it. The arguments below concentrate more on model deployment and management processes and less on bleeding-edge research into new predictive model types.


#1 PMML is an open standard

Every time there's a question whether to prefer an open standard or a proprietary format, we'd bet on the open standard.

We don't even want to elaborate or give examples, so as not to draw attention away from this single most important argument. It really beats us why someone would need bundles of JSON/Protobuf files without a clearly documented structure when all of this can be represented as a single, clear PMML file.


#2 PMML model is a text file

PMML is a text-based representation of your machine learning model - it's no longer a black box where data goes in and scores come out. Many of the arguments below stem from the fact that we have a textual representation of the model in the first place.


#3 PMML model is human-readable and editable

You can just open the file and see what type of model it is, what the inputs and outputs are, and which data transformations are applied.
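To illustrate, here is a minimal Python sketch that reads the same information a human would look for. The PMML document is hand-written and abbreviated for this example (it is not a real export), and `summarize` and `local_name` are hypothetical helper names, not part of any JPMML tool.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated PMML used only for illustration (not a real export).
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative linear regression"/>
  <DataDictionary>
    <DataField name="sepal_length" optype="continuous" dataType="double"/>
    <DataField name="petal_length" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="petal_length" usageType="target"/>
      <MiningField name="sepal_length"/>
    </MiningSchema>
    <RegressionTable intercept="-7.1">
      <NumericPredictor name="sepal_length" coefficient="1.86"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Top-level PMML elements that are not models; anything else is a model element.
NON_MODEL_ELEMENTS = {
    "Header", "MiningBuildTask", "DataDictionary",
    "TransformationDictionary", "Extension",
}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize(pmml_text):
    """Return the schema version, model types, declared fields and targets."""
    root = ET.fromstring(pmml_text)
    models = [
        local_name(child.tag)
        for child in root
        if local_name(child.tag) not in NON_MODEL_ELEMENTS
    ]
    fields = {}
    targets = []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "DataField":
            fields[element.get("name")] = element.get("dataType")
        elif name == "MiningField" and element.get("usageType") == "target":
            targets.append(element.get("name"))
    return {
        "version": root.get("version"),
        "models": models,
        "fields": fields,
        "targets": targets,
    }


summary = summarize(PMML_TEXT)
print(summary)
```

Matching on local names rather than on a fixed namespace keeps the sketch usable for files written against different PMML schema versions.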

Although you seldom want to go that route, you still have the option of a quick fix whenever it is really needed. For example, in the post about deploying PMML as a Hive UDF, we could quickly remove the parentheses from the output field names, as these are reserved characters in Hive.
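A quick fix of that kind might look like the Python sketch below. It is only an assumption about how such a fix could be scripted, not the procedure from the Hive post: the PMML fragment is hand-written and abbreviated, and `hive_safe` and `rename_output_fields` are hypothetical names, as is the underscore naming scheme. It only renames the `OutputField` elements themselves; if other parts of a document refer to those names, they would need the same treatment.

```python
import re
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
# Keep the PMML namespace as the default (unprefixed) namespace on output.
ET.register_namespace("", PMML_NS)

# Hand-written, abbreviated PMML fragment used only for illustration.
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary>
    <DataField name="species" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel functionName="classification">
    <MiningSchema>
      <MiningField name="species" usageType="target"/>
    </MiningSchema>
    <Output>
      <OutputField name="probability(setosa)" optype="continuous"
                   dataType="double" feature="probability" value="setosa"/>
      <OutputField name="probability(virginica)" optype="continuous"
                   dataType="double" feature="probability" value="virginica"/>
    </Output>
    <Node score="setosa"><True/></Node>
  </TreeModel>
</PMML>
"""


def hive_safe(name):
    """Replace parentheses with underscores, e.g. 'p(a)' becomes 'p_a'."""
    return re.sub(r"[()]", "_", name).strip("_")


def rename_output_fields(pmml_text):
    """Return (fixed PMML text, {old name: new name}) for renamed outputs."""
    root = ET.fromstring(pmml_text)
    renamed = {}
    for field in root.iter("{%s}OutputField" % PMML_NS):
        old_name = field.get("name")
        new_name = hive_safe(old_name)
        if new_name != old_name:
            field.set("name", new_name)
            renamed[old_name] = new_name
    return ET.tostring(root, encoding="unicode"), renamed


fixed_text, renamed = rename_output_fields(PMML_TEXT)
for old_name, new_name in renamed.items():
    print(old_name, "->", new_name)
```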


#4 PMML model is programmatically editable

In addition to being human-readable and editable, it's also editable by a program - you just need a proper XML parsing tool. JPMML converters use this feature extensively, from storing additional metadata (for debugging purposes) in the Extension block to converting binary splits in decision trees to equivalent multiway splits, which saves time and memory.
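As a small illustration of the metadata case, the Python sketch below adds an `Extension` element to the `Header` of a PMML document. This is not how the JPMML converters do it internally; it only shows that any XML library is enough. The metadata name and value, and the helper name `add_header_extension`, are made up for the example.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
# Keep the PMML namespace as the default (unprefixed) namespace on output.
ET.register_namespace("", PMML_NS)

# Hand-written, abbreviated PMML used only for illustration.
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Illustrative model"/>
  <DataDictionary>
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="y" usageType="target"/>
      <MiningField name="x"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="x" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def add_header_extension(pmml_text, name, value):
    """Return PMML text with an Extension(name, value) added to the Header."""
    root = ET.fromstring(pmml_text)
    header = root.find("{%s}Header" % PMML_NS)
    if header is None:
        raise ValueError("PMML document has no Header element")
    extension = ET.Element(
        "{%s}Extension" % PMML_NS, {"name": name, "value": value}
    )
    # The PMML schema lists Extension elements first among an element's children.
    header.insert(0, extension)
    return ET.tostring(root, encoding="unicode")


# "training-run-id" and "2024-01-demo" are hypothetical metadata values.
annotated_text = add_header_extension(PMML_TEXT, "training-run-id", "2024-01-demo")
print(annotated_text)
```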


#5 PMML model can be versioned

PMML models can be pretty-printed and checked into version control systems like Git or Subversion. It's hard to version binary things, but there are plenty of tools to version, diff (software engineers' lingo for 'finding the difference') and merge text-based formats. In this way, you can keep the history of your model's evolution, collaborate, or even maintain multiple branches of it.
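The Python sketch below shows why pretty-printing matters for diffs: two single-line PMML documents that differ in one coefficient are re-indented and then compared line by line. The documents are hand-written for the example, the file names `model_v1.pmml` and `model_v2.pmml` are hypothetical, and `ET.indent` requires Python 3.9 or later. In practice you would commit the pretty-printed files and let Git or Subversion do the diffing.

```python
import difflib
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"
# Keep the PMML namespace as the default (unprefixed) namespace on output.
ET.register_namespace("", PMML_NS)

# Hand-written single-line PMML; only the coefficient differs between versions.
TEMPLATE = (
    '<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4"><Header/>'
    '<DataDictionary>'
    '<DataField name="x" optype="continuous" dataType="double"/>'
    '<DataField name="y" optype="continuous" dataType="double"/>'
    '</DataDictionary>'
    '<RegressionModel functionName="regression"><MiningSchema>'
    '<MiningField name="y" usageType="target"/><MiningField name="x"/>'
    '</MiningSchema><RegressionTable intercept="0.5">'
    '<NumericPredictor name="x" coefficient="{coef}"/>'
    '</RegressionTable></RegressionModel></PMML>'
)
MODEL_V1 = TEMPLATE.format(coef="1.86")
MODEL_V2 = TEMPLATE.format(coef="1.91")


def pretty(pmml_text):
    """Re-indent a PMML document so that each element sits on its own line."""
    root = ET.fromstring(pmml_text)
    ET.indent(root, space="  ")  # available in Python 3.9+
    return ET.tostring(root, encoding="unicode") + "\n"


def model_diff(old_text, new_text):
    """Return a unified diff of two pretty-printed PMML documents."""
    old_lines = pretty(old_text).splitlines(keepends=True)
    new_lines = pretty(new_text).splitlines(keepends=True)
    return list(
        difflib.unified_diff(
            old_lines, new_lines,
            fromfile="model_v1.pmml", tofile="model_v2.pmml",
        )
    )


diff_lines = model_diff(MODEL_V1, MODEL_V2)
print("".join(diff_lines), end="")
```

Without the re-indentation step, each document would be a single line and the diff would flag the whole model as changed.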


#6 PMML model can be stored for a long time

Say you have a (regulatory?) requirement to produce repeatable predictions: whenever you have the original data points and a timestamp, you can find the PMML model that was in production at the time and get exactly the same result that you got originally. JPMML even has MathML support to retrieve the whole prediction as one MathML expression.

You don't have to worry whether you still have any programs to read the model and make predictions based on it.


#7 PMML models are backward compatible by design

New PMML schema versions allow you to express the same logic more concisely and elegantly. Older PMML schema versions, which require more markup to achieve the same objective, continue to be valid.

Quoting our CTO Villu Ruusmann: "Give us any PMML file that has been produced in past 10 years, and we can make a prediction with it. This statement will hold true also 10 years into the future."


#8 PMML representation provides interoperability between tools

PMML models are independent of tools and their versions. If a vendor changes its proprietary data format, your old models might no longer be supported or even readable - when developing converters, we regularly stumble upon small but sometimes breaking changes like this. That's not the case with PMML.


#9 PMML provides interoperability between teams and environments

PMML representation provides interoperability between training environments and deployment environments. You can train your model once and run it anywhere.

You don't have to force specific tools on data scientists just because your production environment is only able to use models created in, say, Python. You can have multiple sources of models, provided that they can all be exported to PMML. And you can execute the models in any environment that is capable of producing predictions based on PMML.

You might wonder how Openscoring/JPMML fits into this picture. We certainly make our bet on an open standard and have created tools to enable PMML export from many open platforms like R, Scikit-learn and Apache Spark.

Our tools produce PMML models and let you evaluate them in a multitude of environments. For plain Java applications we provide the JPMML-Evaluator library, and for Apache Hive and Pig we provide integration via UDFs. We also integrate with Apache Spark and offer a client-agnostic REST web service for PMML model evaluation.

We're also working on machine learning model management software, where you could convert, store, version and verify your models.