Deploying PXF with Greenplum (Beta)

This section describes how to deploy a Greenplum for Kubernetes cluster that includes the Pivotal Extension Framework (PXF).

Note: Pivotal PXF in Greenplum for Kubernetes is a Beta feature.

About PXF on Greenplum for Kubernetes

When you deploy PXF to Greenplum for Kubernetes, the Greenplum Operator creates one or more dedicated pods, or replicas, to host the PXF server instances. This differs from Pivotal Greenplum deployed to other platforms, where a PXF server instance is deployed to each Greenplum segment host. With Greenplum for Kubernetes, you can choose to deploy as many PXF server replicas as needed to provide redundancy should a PXF pod fail and to distribute load.

You store all PXF configuration files for a Greenplum for Kubernetes cluster externally, on an S3 data source. The Greenplum for Kubernetes manifest file then specifies the S3 bucket-path to use for downloading the PXF configuration to all configured PXF servers.
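
Because the bucket contents are copied into the PXF_CONF directory on each PXF server, it can help to stage the configuration locally in the same layout before you upload it. The following is a minimal sketch of that idea. The pxf-staging directory name and the empty placeholder file are assumptions made for illustration; servers/default is the path this document uses for the default PXF server configuration.

```shell
# Illustrative sketch: stage a PXF configuration tree locally so that it
# mirrors the layout expected under PXF_CONF (/etc/pxf) on each PXF server.
# "pxf-staging" is a hypothetical directory name chosen for this example.
STAGING_DIR="${STAGING_DIR:-./pxf-staging}"

# servers/default holds the default PXF server configuration.
mkdir -p "$STAGING_DIR/servers/default"

# Empty placeholder standing in for a real, edited file such as minio-site.xml.
: > "$STAGING_DIR/servers/default/minio-site.xml"

# List the relative paths that would be uploaded to the bucket.
STAGED_FILES=$(cd "$STAGING_DIR" && find . -type f | sort)
echo "$STAGED_FILES"
```

Each path in the listing corresponds to the object path you would create in the bucket, relative to the bucket or folder named in the manifest file.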

When you install a new Greenplum cluster using the template PXF manifest file, workspace/samples/my-gp-with-pxf-instance.yaml, PXF is installed and initialized with a default (empty) PXF configuration directory. After deploying the cluster, you can customize the configuration by creating PXF server configurations for multiple data sources, and then redeploy with an updated manifest file to use the PXF configuration in your cluster.
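
One way to produce the updated manifest is to enable the pxfConf block that the template ships commented out. The sketch below does this with sed. It assumes the comment layout of the sample template (a block that starts at "#  pxfConf:" and ends at the "#      folder:" line); the enable_pxf_conf function name is hypothetical, and you would still need to edit the endpoint, bucket, and folder values afterwards.

```shell
# Print a copy of a manifest with the commented-out pxfConf block enabled.
# Only the leading "#" of lines inside the block is removed, which leaves the
# block indented as a child of "spec:".
enable_pxf_conf() {
    sed '/^#  pxfConf:/,/^#      folder:/ s/^#//' "$1"
}

# Demonstrate on a trimmed copy of the template's GreenplumPXFService spec.
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
spec:
  replicas: 2
#  pxfConf:
#    s3Source:
#      secret: "my-greenplum-pxf-configs"
#      endpoint: "s3.amazonaws.com"
#      bucket: "YOUR_S3_BUCKET_NAME"
#      folder: "YOUR_S3_BUCKET_FOLDER-Optional"
#
# Note: lines outside the pxfConf block stay commented.
EOF

ENABLED=$(enable_pxf_conf "$SAMPLE")
echo "$ENABLED"
rm -f "$SAMPLE"
```

The function writes to standard output rather than editing in place, so you can review the result before saving it as your manifest.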

Deploying a Cluster with PXF Enabled

Follow these steps to deploy a Greenplum for Kubernetes cluster with PXF enabled. You can deploy the PXF servers either in their default, initialized state or with an existing PXF configuration that is stored in an S3 bucket location.

See also Configuring PXF Servers for information about how to create and apply PXF server configurations to a Greenplum for Kubernetes cluster.

  1. Use the procedure described in Deploying or Redeploying a Greenplum Cluster to deploy the cluster, but use samples/my-gp-with-pxf-instance.yaml as the basis for your deployment. Copy the file into your workspace directory. For example:

    $ cd ./greenplum-for-kubernetes-*/workspace
    $ cp ./samples/my-gp-with-pxf-instance.yaml .
    
  2. Edit the file as necessary for your deployment. my-gp-with-pxf-instance.yaml includes properties to configure PXF in the basic Greenplum cluster:

    apiVersion: "greenplum.pivotal.io/v1"
    kind: "GreenplumCluster"
    metadata:
      name: my-greenplum
    spec:
      masterAndStandby:
        hostBasedAuthentication: |
          # host   all   gpadmin   1.2.3.4/32   trust
          # host   all   gpuser    0.0.0.0/0   md5
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        antiAffinity: "yes"
        workerSelector: {}
      segments:
        primarySegmentCount: 1
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 2G
        antiAffinity: "yes"
        workerSelector: {}
      pxf:
        serviceName: "my-greenplum-pxf"    
    ---
    apiVersion: "greenplum.pivotal.io/v1beta1"
    kind: "GreenplumPXFService"
    metadata:
      name: my-greenplum-pxf
    spec:
      replicas: 2
      cpu: "0.5"
      memory: "1Gi"
      workerSelector: {}
    #  pxfConf:
    #    s3Source:
    #      secret: "my-greenplum-pxf-configs"
    #      endpoint: "s3.amazonaws.com"
    #      bucket: "YOUR_S3_BUCKET_NAME"
    #      folder: "YOUR_S3_BUCKET_FOLDER-Optional"
    #
    # Note: If using pxfConf.s3Source, in addition to applying the above yaml be sure to create a secret using a command similar to:
    # kubectl create secret generic my-greenplum-pxf-configs --from-literal='access_key_id=XXX' --from-literal='secret_access_key=XXX'
    

    The entry:

      pxf:
        serviceName: "my-greenplum-pxf"    
    

    Indicates that the cluster will use the PXF service configuration named my-greenplum-pxf that follows at the end of the yaml file. The sample configuration creates two PXF replica pods for redundancy with minimal settings for CPU and memory. You can customize these values as needed, as well as the workerSelector value if you want to constrain the replica pods to labeled nodes in your cluster. See Greenplum PXF Service Properties for information about each available property.

  3. If you have an existing PXF configuration that you want to apply to the Greenplum for Kubernetes cluster, follow these additional steps to edit your manifest file and provide access to the configuration:

    1. Uncomment the pxfConf configuration properties at the end of the template file:

      pxfConf:
        s3Source:
          secret: "my-greenplum-pxf-configs"
          endpoint: "s3.amazonaws.com"
          bucket: "YOUR_S3_BUCKET_NAME"
          folder: "YOUR_S3_BUCKET_FOLDER-Optional"
      
    2. Set the endpoint:, bucket:, and folder: properties to specify the full S3 location that contains your PXF configuration files. All directories and files located in the specified S3 bucket-folder are copied into the PXF_CONF directory on each PXF server in the cluster. See Configuring PXF Servers for an example configuration that uses MinIO.

    3. Create a secret that can be used to authenticate access to the S3 bucket and folder that contains the PXF configuration directory. The name of the secret must match the name specified in the manifest file (secret: "my-greenplum-pxf-configs" by default). For example:

      $ kubectl create secret generic my-greenplum-pxf-configs --from-literal='access_key_id=AKIAIOSFODNN7EXAMPLE' \
          --from-literal='secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
      
      secret/my-greenplum-pxf-configs created
      

      The above command creates a secret named my-greenplum-pxf-configs using the S3 access and secret keys that you provide. Replace the access and secret key values with the actual values for your system. If necessary, use your S3 implementation documentation to generate a secret access key.

  4. Use the kubectl apply command with your modified PXF manifest file to send the deployment request to the Greenplum Operator. For example:

    $ kubectl apply -f ./my-gp-with-pxf-instance.yaml 
    
    greenplumcluster.greenplum.pivotal.io/my-greenplum created
    greenplumpxfservice.greenplum.pivotal.io/my-greenplum-pxf created
    

    If you are deploying another instance of a Greenplum cluster, specify the Kubernetes namespace where you want to deploy the new cluster. For example, if you previously deployed a cluster in the namespace gpinstance-1, you could deploy a second Greenplum cluster in the gpinstance-2 namespace using the command:

    $ kubectl apply -f ./my-gp-with-pxf-instance.yaml -n gpinstance-2
    

    The Greenplum Operator deploys the necessary Greenplum and PXF resources according to your specification, and also initializes the Greenplum cluster.

  5. Execute the following command to monitor the deployment of the cluster. While the cluster is initializing, its status is Pending:

    $ watch kubectl get all
    
    NAME                                      READY     STATUS    RESTARTS   AGE
    pod/greenplum-operator-79cddcf586-ctftb   1/1       Running   0          2m40s
    pod/master-0                              1/1       Running   0          23s
    pod/master-1                              1/1       Running   0          22s
    pod/my-greenplum-pxf-676fd6fdd7-825gq     0/1       Running   0          28s
    pod/my-greenplum-pxf-676fd6fdd7-mjt6w     0/1       Running   0          28s
    pod/segment-a-0                           1/1       Running   0          22s
    pod/segment-b-0                           1/1       Running   0          22s
    
    NAME                                                            TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
    service/agent                                                   ClusterIP      None             <none>        22/TCP          23s
    service/greenplum                                               LoadBalancer   10.104.112.154   <pending>     5432:32294/TCP  23s
    service/greenplum-validating-webhook-service-79cddcf586-ctftb   ClusterIP      10.105.7.189     <none>        443/TCP         2m38s
    service/kubernetes                                              ClusterIP      10.96.0.1        <none>        443/TCP         19m
    service/my-greenplum-pxf                                        ClusterIP      10.105.235.115   <none>        5888/TCP        29s
    
    NAME                                 READY     UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/greenplum-operator   1/1       1            1           2m40s
    deployment.apps/my-greenplum-pxf     0/2       2            0           28s
    
    NAME                                            DESIRED   CURRENT   READY     AGE
    replicaset.apps/greenplum-operator-79cddcf586   1         1         1         2m40s
    replicaset.apps/my-greenplum-pxf-676fd6fdd7     2         2         0         28s
    
    NAME                         READY     AGE
    statefulset.apps/master      2/2       23s
    statefulset.apps/segment-a   1/1       23s
    statefulset.apps/segment-b   1/1       23s
    
    NAME                                                 STATUS    AGE
    greenplumcluster.greenplum.pivotal.io/my-greenplum   Pending   29s
    
    NAME                                                        AGE
    greenplumpxfservice.greenplum.pivotal.io/my-greenplum-pxf   29s
    

    Note that the Greenplum PXF service, deployment, and replicas are created in addition to the Greenplum cluster.

  6. Describe your Greenplum cluster to verify that it was created successfully. The Phase should eventually transition to Running:

    $ kubectl describe greenplumClusters/my-greenplum
    
    Name:         my-greenplum
    Namespace:    default
    Labels:       <none>
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"greenplum.pivotal.io/v1","kind":"GreenplumCluster", "metadata":{"annotations":{},"name":"my-greenplum", "namespace":"default"...
    API Version:  greenplum.pivotal.io/v1
    Kind:         GreenplumCluster
    Metadata:
      Creation Timestamp:  2019-04-01T15:19:17Z
      Generation:          1
      Resource Version:    1469567
      Self Link:           /apis/greenplum.pivotal.io/v1/namespaces/default/greenplumclusters/my-greenplum
      UID:                 83e0bdfd-5491-11e9-a268-c28bb5ff3d1c
    Spec:
      Master And Standby:
        Anti Affinity:              yes
        Cpu:                        0.5
        Host Based Authentication:  # host   all   gpadmin   1.2.3.4/32   trust
    # host   all   gpuser    0.0.0.0/0   md5
    
        Memory:              800Mi
        Storage:             1G
        Storage Class Name:  standard
        Worker Selector:
      Segments:
        Anti Affinity:          yes
        Cpu:                    0.5
        Memory:                 800Mi
        Primary Segment Count:  1
        Storage:                2G
        Storage Class Name:     standard
        Worker Selector:
    Status:
      Instance Image:    greenplum-for-kubernetes:latest
      Operator Version:  greenplum-operator:latest
      Phase:             Running
    Events:
      Type    Reason                    Age   From               Message
      ----    ------                    ----  ----               -------
      Normal  CreatingGreenplumCluster  2m    greenplumOperator  Creating Greenplum cluster my-greenplum in default
      Normal  CreatedGreenplumCluster   8s    greenplumOperator  Successfully created Greenplum cluster my-greenplum in default
    

    If you are deploying a brand new cluster, the Greenplum Operator automatically initializes the Greenplum cluster. The Phase should eventually transition from Pending to Running and the Events should match the output above.


    Note: If you redeployed a previously-deployed Greenplum cluster, the phase will begin at Pending. The cluster uses its existing Persistent Volume Claims if they are available. In this case, the master and segment data directories will already exist in their former state. The master-0 pod automatically starts the Greenplum Cluster, after which the phase transitions to Running.

  7. At this point, you can work with the deployed Greenplum cluster by executing Greenplum utilities from within Kubernetes, or by using a locally-installed tool, such as psql, to access the Greenplum instance running in Kubernetes. Examine the PXF_CONF directory on master:

    $ kubectl exec -it master-0 -- bash -c "ls -R /etc/pxf"
    
    /etc/pxf:
    conf  keytabs  lib  logs  servers  templates
    
    /etc/pxf/conf:
    pxf-env.sh  pxf-log4j.properties  pxf-profiles.xml
    
    /etc/pxf/keytabs:
    
    /etc/pxf/lib:
    
    /etc/pxf/logs:
    
    /etc/pxf/servers:
    default
    
    /etc/pxf/servers/default:
    
    /etc/pxf/templates:
    adl-site.xml   hbase-site.xml  jdbc-site.xml    s3-site.xml
    core-site.xml  hdfs-site.xml   mapred-site.xml  wasbs-site.xml
    gs-site.xml    hive-site.xml   minio-site.xml   yarn-site.xml
    

    The above output shows that a default Pivotal PXF service has just been initialized, where the PXF_CONF directory (/etc/pxf) contains only the default subdirectories and template configuration files. If you applied an existing PXF configuration, verify that your custom PXF server configuration files are present. If you did not apply an existing PXF configuration, continue with the instructions in Configuring PXF Servers to verify basic PXF functionality in the new cluster.
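
If you prefer not to watch the output manually, the wait for the Running phase can be scripted. The following is a minimal sketch, not part of the product. It assumes that the Phase shown by kubectl describe is exposed at .status.phase of the GreenplumCluster resource, and the wait_for_phase function name and its defaults (60 attempts, 5 seconds apart) are hypothetical.

```shell
# Poll a GreenplumCluster until it reports the wanted phase or attempts run out.
# Assumption: the phase shown by "kubectl describe" is exposed at .status.phase.
# Arguments: cluster name, wanted phase, number of attempts, seconds between attempts.
wait_for_phase() (
    cluster=$1
    wanted=${2:-Running}
    attempts=${3:-60}
    interval=${4:-5}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        phase=$(kubectl get greenplumclusters "$cluster" -o jsonpath='{.status.phase}' 2>/dev/null)
        if [ "$phase" = "$wanted" ]; then
            echo "$cluster is $phase"
            exit 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "$cluster did not reach $wanted (last phase: ${phase:-unknown})" >&2
    exit 1
)

# Example call (not executed here):
#   wait_for_phase my-greenplum Running 60 5
```

The function body runs in a subshell, so its variables do not leak into your session and its exit status can be used directly in an if statement or a deployment script.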

Configuring PXF Servers

With Greenplum for Kubernetes, all PXF configuration files for a cluster are stored externally, on an S3 data source. The Greenplum for Kubernetes manifest file then specifies the S3 bucket-path to use for downloading the PXF configuration to all configured PXF servers. Any directories and files at the specified bucket-path are copied as-is to all PXF Servers configured for the cluster.

This procedure describes how to add or modify the PXF configuration for a Greenplum for Kubernetes cluster.

Prerequisites

This procedure uses MinIO as an example data source both for storing the PXF server configuration and for accessing remote data via PXF. If you want to follow along using the MinIO example, install the MinIO client, mc, on your local system. See the MinIO Client Quickstart Guide for installation instructions.

You should also have access to a Greenplum for Kubernetes deployment that includes PXF. See Deploying a Cluster with PXF Enabled.
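
The procedure that follows runs kubectl, helm, and mc from your local system. A quick preflight check, sketched below, can confirm that all three are on your PATH before you start. The check_tools function name is hypothetical.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() (
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    exit "$missing"
)

# The procedure in this section uses these three clients.
if check_tools kubectl helm mc; then
    echo "all required tools found"
else
    echo "install the missing tools before continuing"
fi
```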

Procedure

  1. To use MinIO as a sample data source, first install a standalone MinIO server to your cluster using helm. For example:

    $ helm install stable/minio
    
    NAME:   voting-coral
    LAST DEPLOYED: Wed Oct 16 07:44:38 2019
    NAMESPACE: default
    STATUS: DEPLOYED
    
    RESOURCES:
    ==> v1/ConfigMap
    NAME                DATA  AGE
    voting-coral-minio  1     0s
    
    ==> v1/Deployment
    NAME                READY  UP-TO-DATE  AVAILABLE  AGE
    voting-coral-minio  0/1    1           0          0s
    
    ==> v1/PersistentVolumeClaim
    NAME                STATUS  VOLUME                                    CAPACITY  ACCESS MODES  STORAGECLASS  AGE
    voting-coral-minio  Bound   pvc-a7d03ab0-4fda-4328-a1ed-2f603888ed13  10Gi      RWO           standard      0s
    
    ==> v1/Pod(related)
    NAME                                 READY  STATUS             RESTARTS  AGE
    voting-coral-minio-7fd9b4c78b-ddp97  0/1    ContainerCreating  0         0s
    
    ==> v1/Secret
    NAME                TYPE    DATA  AGE
    voting-coral-minio  Opaque  2     0s
    
    ==> v1/Service
    NAME                TYPE       CLUSTER-IP    EXTERNAL-IP  PORT(S)   AGE
    voting-coral-minio  ClusterIP  10.97.90.127  <none>       9000/TCP  0s
    
    ==> v1/ServiceAccount
    NAME                SECRETS  AGE
    voting-coral-minio  1        0s
    
    NOTES:
    
    voting-coral-minio.default.svc.cluster.local                                                              
    
    To access Minio from localhost, run the below commands:                                                   
    
    1. export POD_NAME=$(kubectl get pods --namespace default -l "release=voting-coral" -o jsonpath="{.items[0].metadata.name}")
    
    2. kubectl port-forward $POD_NAME 9000 --namespace default                                                
    
    You can now access Minio server on http://localhost:9000. Follow the below steps to connect to Minio server with mc client:                                                                                    
    
    3. mc ls voting-coral-minio-local                                                                         
    
    Alternately, you can use your browser or the Minio SDK to access the server - https://docs.minio.io/categories/17
    
  2. Execute the commands shown at the end of the MinIO deployment output to make the MinIO server accessible from the local host. Using the above output as an example:

    $ export POD_NAME=$(kubectl get pods --namespace default -l "release=voting-coral" -o jsonpath="{.items[0].metadata.name}")
    $ kubectl port-forward $POD_NAME 9000 --namespace default
    
    Forwarding from 127.0.0.1:9000 -> 9000
    Forwarding from [::1]:9000 -> 9000
    

    Also make note of the MinIO service name used within the cluster (“voting-coral-minio” in the above example). You will use this name when defining the MinIO endpoint in the PXF configuration.

  3. Follow these steps to create the sample data file and copy it to MinIO:

    1. Configure the mc client to use the MinIO server you just deployed:

      $ mc config host add minio http://localhost:9000 AKIAIOSFODNN7EXAMPLE wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
      
      Added `minio` successfully.
      

      The accessKey and secretKey values in the above command are the defaults used by the helm chart deployment.

    2. Make two new buckets, named pxf-config and pxf-data, to store the PXF configuration and the sample data:

      $ mc mb minio/pxf-config
      $ mc mb minio/pxf-data
      
      Bucket created successfully `minio/pxf-config`.
      Bucket created successfully `minio/pxf-data`.
      
    3. Create a delimited plain text data file named pxf_s3_simple.txt to provide the sample data:

      $ echo 'Prague,Jan,101,4875.33
      Rome,Mar,87,1557.39
      Bangalore,May,317,8936.99
      Beijing,Jul,411,11600.67' > ./pxf_s3_simple.txt
      
    4. Copy the sample data file to the MinIO bucket you created:

      $ mc cp ./pxf_s3_simple.txt minio/pxf-data
      
      ./pxf_s3_simple.txt:                      192 B / 192 B [===============] 100.00% 6.46 KiB/s 0s
      
  4. Follow these steps to create the PXF server configuration file to access MinIO, and store it on the MinIO server:

    1. Copy the template PXF MinIO configuration file from Greenplum for Kubernetes to your local host:

      $ kubectl cp master-0:/etc/pxf/templates/minio-site.xml ./minio-site.xml
      
    2. Open the copied template file in a text editor, and edit the file entries to access the MinIO server that you deployed with helm. The file contents should be similar to:

      <?xml version="1.0" encoding="UTF-8"?>
      <configuration>
          <property>
              <name>fs.s3a.endpoint</name>
              <value>http://voting-coral-minio:9000</value>
          </property>
          <property>
              <name>fs.s3a.access.key</name>
              <value>AKIAIOSFODNN7EXAMPLE</value>
          </property>
          <property>
              <name>fs.s3a.secret.key</name>
              <value>wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY</value>
          </property>
          <property>
              <name>fs.s3a.fast.upload</name>
              <value>true</value>
          </property>
          <property>
              <name>fs.s3a.path.style.access</name>
              <value>true</value>
          </property>
      </configuration>
      

      Be sure to change the first property value, fs.s3a.endpoint, to the URL of the MinIO service that was deployed in your cluster. The access.key and secret.key values in the above output are used in the default helm chart deployment. All other property values are the defaults provided in the template file.

    3. Save the file and exit your text editor.

    4. Copy your modified minio-site.xml file for use as the default PXF server configuration on all PXF pods deployed in your cluster. To do this, place it in the example pxf-config bucket under the /servers/default directory. For example:

      $ mc cp ./minio-site.xml minio/pxf-config/servers/default/minio-site.xml
      
      ./minio-site.xml:       643 B / 643 B [===============] 100.00% 6.46 KiB/s 0s
      

      At this point, you have deployed a MinIO server with sample data, and placed a sample PXF server configuration for the MinIO server in a location where it will be copied and used as the default PXF server configuration.

  5. Follow these steps to update your Greenplum cluster to use the new PXF server configuration file that you created and staged in MinIO:

    1. Move to the Greenplum for Kubernetes workspace directory you used to deploy the Greenplum cluster.
    2. Edit the manifest file for your cluster (for example, my-gp-with-pxf-instance.yaml) in a text editor.
    3. Uncomment and edit the pxfConf configuration properties at the end of the template file to describe the MinIO location where you copied the PXF configuration file. For example:

      pxfConf:
        s3Source:
          secret: "my-greenplum-pxf-configs"
          endpoint: "voting-coral-minio:9000"
          protocol: "http"
          bucket: "pxf-config"
      
    4. Create a secret that can be used to authenticate access to the S3 bucket and folder that contains the PXF configuration directory. The name of the secret must match the name specified in the manifest file (secret: "my-greenplum-pxf-configs" by default). For example:

      $ kubectl create secret generic my-greenplum-pxf-configs --from-literal='access_key_id=AKIAIOSFODNN7EXAMPLE' \
          --from-literal='secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
      
      secret/my-greenplum-pxf-configs created
      
    5. First delete the existing Greenplum for Kubernetes cluster deployment, and then apply the modified configuration:

      $ kubectl delete -f ./my-gp-with-pxf-instance.yaml --wait=false
      
      greenplumcluster.greenplum.pivotal.io "my-greenplum" deleted
      greenplumpxfservice.greenplum.pivotal.io "my-greenplum-pxf" deleted
      
      $ kubectl apply -f ./my-gp-with-pxf-instance.yaml
      
      greenplumcluster.greenplum.pivotal.io/my-greenplum unchanged
      greenplumpxfservice.greenplum.pivotal.io/my-greenplum-pxf
      
  6. Perform the remaining steps on the Greenplum master pod to create and query an external table that references the sample MinIO data:

    1. Open a bash shell on the master-0 pod:

      $ kubectl exec -it master-0 -- bash
      
    2. Start the psql subsystem:

      $ psql -d postgres
      
      psql (8.3.23)
      Type "help" for help.
      
      postgres=#
      
    3. Create the PXF extension in the database:

      postgres=# create extension pxf;
      
      CREATE EXTENSION
      
    4. Use the PXF s3:text profile to create a Greenplum Database external table that references the pxf_s3_simple.txt file that you just created and added to MinIO. This command omits the typical &SERVER=<server_name> option in the PXF location URL, because the procedure created only the default server configuration:

      postgres=# CREATE EXTERNAL TABLE pxf_s3_textsimple(location text, month text, num_orders int, total_sales float8) 
                 LOCATION ('pxf://pxf-data/pxf_s3_simple.txt?PROFILE=s3:text')
                 FORMAT 'TEXT' (delimiter=E',');
      
      CREATE EXTERNAL TABLE
      
    5. Query the external table to access the sample data stored on MinIO:

      postgres=# SELECT * FROM pxf_s3_textsimple;
      
       location  | month | num_orders | total_sales
      -----------+-------+------------+-------------
       Prague    | Jan   |        101 |     4875.33
       Rome      | Mar   |         87 |     1557.39
       Bangalore | May   |        317 |     8936.99
       Beijing   | Jul   |        411 |    11600.67
      (4 rows)
      

    If you receive any errors when querying the external table, verify the contents of the /etc/pxf/servers/default/minio-site.xml file on each PXF server in the cluster. Also use the mc client to verify the contents and location of the sample data file on MinIO.

    Further PXF troubleshooting information is available in the Greenplum Database documentation at Troubleshooting PXF.
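
Checking the configuration file on each PXF server can be scripted. The following is a minimal sketch under two assumptions: that PXF pod names begin with the GreenplumPXFService name (my-greenplum-pxf in this document, as in the kubectl get all output shown earlier), and that the file lives at /etc/pxf/servers/default/minio-site.xml on those pods. The show_pxf_server_config function name is hypothetical.

```shell
# Print the default MinIO server configuration found on every PXF pod.
# Assumption: PXF pod names begin with the GreenplumPXFService name.
# Arguments: PXF service name, path of the configuration file on the pod.
show_pxf_server_config() (
    service=${1:-my-greenplum-pxf}
    file=${2:-/etc/pxf/servers/default/minio-site.xml}
    status=0
    for pod in $(kubectl get pods -o name | grep "^pod/$service-"); do
        # "kubectl get pods -o name" prints pod/<name>; strip the prefix.
        echo "== ${pod#pod/}"
        kubectl exec "${pod#pod/}" -- cat "$file" || status=1
    done
    exit "$status"
)

# Example call (not executed here):
#   show_pxf_server_config my-greenplum-pxf
```

A non-zero exit status means that reading the file failed on at least one pod. If no pod matches the name prefix, the function prints nothing and returns 0, so an empty result is also worth investigating.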