Using MADlib for Analytics

This topic describes how to configure the MADlib open-source library for scalable in-database analytics in Greenplum for Kubernetes.

About MADlib in Greenplum for Kubernetes

Unlike other Pivotal Greenplum distributions, Pivotal Greenplum for Kubernetes automatically installs the MADlib software as part of the Greenplum Docker image. For example, after initializing a new Greenplum cluster in Kubernetes, you can see that MADlib is available as an installed Debian package:

$ kubectl exec -it master-0 -- bash -c "dpkg -s madlib"
Package: madlib
Status: install ok installed
Priority: optional
Section: devel
Installed-Size: 31586
Architecture: amd64
Version: 1.15.1
Description: Apache MADlib is an Open-Source Library for Scalable in-Database Analytics
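For scripted checks, the Version field can be pulled out of the dpkg output shown above. The following is a minimal sketch, not part of Greenplum or MADlib: the function name madlib_pkg_version is hypothetical, and the usage comment assumes the master pod is named master-0 as in the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Greenplum or MADlib):
# reads `dpkg -s` output on stdin and prints the value of the Version field.
madlib_pkg_version() {
    awk -F': ' '$1 == "Version" { print $2; exit }'
}

# Example usage (assumes the master pod is named master-0):
#   kubectl exec master-0 -- dpkg -s madlib | madlib_pkg_version
```

The usage comment omits the -it flags because a pseudo-terminal can add carriage returns to the output, which would complicate parsing.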

To begin using MADlib, use the madpack utility to add the MADlib functions to your database, as described in the next section.

Adding MADlib Functions

To install the MADlib functions in a database, use the madpack utility. For example:

$ kubectl exec -it master-0 -- bash -c "source ./.bashrc; madpack -p greenplum install"
INFO : Detected Greenplum DB version 5.12.0.
INFO : *** Installing MADlib ***
INFO : MADlib tools version    = 1.15.1 (/usr/local/madlib/Versions/1.15.1/bin/../madpack/
INFO : MADlib database version = None (host=localhost:5432, db=gpadmin, schema=madlib)
INFO : Testing PL/Python environment...
INFO : > Creating language PL/Python...
INFO : > PL/Python environment OK (version: 2.7.12)
INFO : > Preparing objects for the following modules:
INFO : > - array_ops
INFO : > - bayes
INFO : > - crf
...
INFO : Installing MADlib:
INFO : > Created madlib schema
INFO : > Created madlib.MigrationHistory table
INFO : > Wrote version info in MigrationHistory table
INFO : MADlib 1.15.1 installed successfully in madlib schema.

This installs the MADlib functions into the default schema, named madlib, in the database shown in the log output (gpadmin in this example). For more information about using madpack, run madpack -h or see Greenplum MADlib Extension for Analytics in the Pivotal Greenplum Database documentation.
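To target a database other than the default, madpack accepts a connection string through its -c option, and its install-check command runs the MADlib test suite against an existing installation. The sketch below wraps both in one function. The function name madpack_in_pod, the DRY_RUN variable, and the database name analytics are hypothetical. The pod name master-0, the user gpadmin, and localhost:5432 are assumptions carried over from the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of Greenplum or MADlib): runs a madpack
# command inside the master pod against a chosen database.
# Assumptions: pod master-0, user gpadmin, host localhost, port 5432.
# Set DRY_RUN=1 to print the kubectl command instead of executing it.
madpack_in_pod() {
    db="$1"        # target database name
    action="$2"    # madpack command, for example install or install-check
    cmd="source ./.bashrc; madpack -p greenplum -c gpadmin@localhost:5432/${db} ${action}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'kubectl exec master-0 -- bash -c "%s"\n' "$cmd"
    else
        kubectl exec master-0 -- bash -c "$cmd"
    fi
}

# Example usage (the database name "analytics" is a placeholder):
#   madpack_in_pod analytics install
#   madpack_in_pod analytics install-check
```

Running the function first with DRY_RUN=1 shows the exact command before anything is changed in the cluster.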

Getting More Information

For more information about using MADlib, see: