Deploying PXF with Greenplum (Beta)

This section describes procedures for deploying a Greenplum for Kubernetes cluster that includes the Pivotal extension framework (PXF).

Note: Pivotal PXF in Greenplum for Kubernetes is a Beta feature.

About PXF on Greenplum for Kubernetes

When you deploy PXF to Greenplum for Kubernetes, the Greenplum Operator creates one or more dedicated pods, or replicas, to host the PXF server instances. This differs from Pivotal Greenplum deployed to other platforms, where a PXF server instance is deployed to each Greenplum segment host. With Greenplum for Kubernetes, you can choose to deploy as many PXF server replicas as needed to provide redundancy should a PXF pod fail.

When you install a new Greenplum cluster using the template PXF manifest file, workspace/samples/my-gp-with-pxf-instance.yaml, PXF is installed and initialized with a default (empty) PXF configuration directory. After deploying the cluster, you can customize the configuration by creating PXF server configurations for multiple data sources, using the instructions in the Pivotal Greenplum Documentation, such as Configuring PXF Hadoop Connectors (Optional), Configuring Connectors to Azure, Google Cloud Storage, Minio, and S3 Object Stores (Optional), and Configuring the JDBC Connector (Optional).

For subsequent deployments, you can export your customized PXF configuration directory ($PXF_CONF) to an S3 bucket, and then reference that bucket and folder in the Greenplum for Kubernetes deployment manifest. The contents of the specified S3 location are downloaded and used as the PXF_CONF directory when deploying the cluster.

Deploying PXF with the Default Configuration

Follow these steps to deploy PXF with the default, initialized configuration. You will need to modify the template PXF configuration files to access the external data sources required for your system.

  1. Use the procedure described in Deploying a New Greenplum Cluster to deploy the cluster, but use samples/my-gp-with-pxf-instance.yaml as the basis for your deployment. Copy the file into your workspace directory. For example:

    $ cd ./greenplum-for-kubernetes-*/workspace
    $ cp ./samples/my-gp-with-pxf-instance.yaml .
    
  2. Edit the file as necessary for your deployment. my-gp-with-pxf-instance.yaml adds the properties needed to configure PXF alongside a basic Greenplum cluster:

    apiVersion: "greenplum.pivotal.io/v1"
    kind: "GreenplumCluster"
    metadata:
      name: my-greenplum
    spec:
      masterAndStandby:
        hostBasedAuthentication: |
          # host   all   gpadmin   1.2.3.4/32   trust
          # host   all   gpuser    0.0.0.0/0   md5
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        antiAffinity: "yes"
        workerSelector: {}
      segments:
        primarySegmentCount: 1
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 2G
        antiAffinity: "yes"
        workerSelector: {}
      pxf:
        serviceName: "my-greenplum-pxf"    
    ---
    apiVersion: "greenplum.pivotal.io/v1beta1"
    kind: "GreenplumPXFService"
    metadata:
      name: my-greenplum-pxf
    spec:
      replicas: 2
      cpu: "0.5"
      memory: "1Gi"
      workerSelector: {}
    #  pxfConf:
    #    s3Source:
    #      secret: "my-greenplum-pxf-configs"
    #      endpoint: "s3.amazonaws.com"
    #      bucket: "YOUR_S3_BUCKET_NAME"
    #      folder: "YOUR_S3_BUCKET_FOLDER-Optional"
    #
    # Note: If using pxfConf.s3Source, in addition to applying the above yaml be sure to create a secret using a command similar to:
    # kubectl create secret generic my-greenplum-pxf-configs --from-literal='access_key_id=XXX' --from-literal='secret_access_key=XXX'
    

    The entry:

      pxf:
        serviceName: "my-greenplum-pxf"    
    

    indicates that the cluster will use the PXF service configuration named my-greenplum-pxf that follows at the end of the YAML file. The sample configuration creates two PXF replica pods for redundancy, with minimal CPU and memory settings. You can customize these values as needed, as well as the workerSelector value if you want to constrain the replica pods to labeled nodes in your cluster (see the example below). See Greenplum PXF Service Properties for information about each available property.


    The commented properties are used only after you have created a PXF configuration directory that you want to re-use when deploying the cluster. See Exporting a PXF Configuration to S3 for more information.
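
    As an example of using workerSelector, one way to constrain the PXF replicas to particular nodes is to label those nodes and reference that label in the manifest. The following is only a sketch; the label key and value (pxf-workers=true) are hypothetical, so substitute a label pair that actually exists on your nodes:

      # Label the Kubernetes nodes that should host PXF replicas:
      #   kubectl label node <node-name> pxf-workers=true
      #
      # Then reference the same label in the GreenplumPXFService spec:
      workerSelector:
        pxf-workers: "true"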

  3. Use the kubectl apply command with your modified PXF manifest file to send the deployment request to the Greenplum Operator. For example:

    $ kubectl apply -f ./my-gp-with-pxf-instance.yaml 
    
    greenplumcluster.greenplum.pivotal.io/my-greenplum created
    greenplumpxfservice.greenplum.pivotal.io/my-greenplum-pxf created
    

    If you are deploying another instance of a Greenplum cluster, specify the Kubernetes namespace where you want to deploy the new cluster. For example, if you previously deployed a cluster in the namespace gpinstance-1, you could deploy a second Greenplum cluster in the gpinstance-2 namespace using the command:

    $ kubectl apply -f ./my-gp-with-pxf-instance.yaml -n gpinstance-2
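
    If the target namespace does not already exist, create it before applying the manifest. For example:

    $ kubectl create namespace gpinstance-2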
    

    The Greenplum Operator deploys the necessary Greenplum and PXF resources according to your specification, and also initializes the Greenplum cluster.

  4. Execute the following command to monitor the deployment of the cluster. While the cluster is initializing, its status is Pending:

    $ watch kubectl get all
    
    NAME                                      READY     STATUS    RESTARTS   AGE
    pod/greenplum-operator-79cddcf586-ctftb   1/1       Running   0          2m40s
    pod/master-0                              1/1       Running   0          23s
    pod/master-1                              1/1       Running   0          22s
    pod/my-greenplum-pxf-676fd6fdd7-825gq     0/1       Running   0          28s
    pod/my-greenplum-pxf-676fd6fdd7-mjt6w     0/1       Running   0          28s
    pod/segment-a-0                           1/1       Running   0          22s
    pod/segment-b-0                           1/1       Running   0          22s
    
    NAME                                                            TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
    service/agent                                                   ClusterIP      None             <none>        22/TCP           23s
    service/greenplum                                               LoadBalancer   10.104.112.154   <pending>     5432:32294/TCP   23s
    service/greenplum-validating-webhook-service-79cddcf586-ctftb   ClusterIP      10.105.7.189     <none>        443/TCP          2m38s
    service/kubernetes                                              ClusterIP      10.96.0.1        <none>        443/TCP          19m
    service/my-greenplum-pxf                                        ClusterIP      10.105.235.115   <none>        5888/TCP         29s
    
    NAME                                 READY     UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/greenplum-operator   1/1       1            1           2m40s
    deployment.apps/my-greenplum-pxf     0/2       2            0           28s
    
    NAME                                            DESIRED   CURRENT   READY     AGE
    replicaset.apps/greenplum-operator-79cddcf586   1         1         1         2m40s
    replicaset.apps/my-greenplum-pxf-676fd6fdd7     2         2         0         28s
    
    NAME                         READY     AGE
    statefulset.apps/master      2/2       23s
    statefulset.apps/segment-a   1/1       23s
    statefulset.apps/segment-b   1/1       23s
    
    NAME                                                 STATUS    AGE
    greenplumcluster.greenplum.pivotal.io/my-greenplum   Pending   29s
    
    NAME                                                        AGE
    greenplumpxfservice.greenplum.pivotal.io/my-greenplum-pxf   29s
    

    Note that the Greenplum PXF service, deployment, and replicas are created in addition to the Greenplum cluster.
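
    To block until the PXF replicas report ready, rather than polling with watch, you can also run a standard rollout check against the PXF deployment:

    $ kubectl rollout status deployment/my-greenplum-pxf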

  5. Describe your Greenplum cluster to verify that it was created successfully. The Phase should eventually transition to Running:

    $ kubectl describe greenplumClusters/my-greenplum
    
    Name:         my-greenplum
    Namespace:    default
    Labels:       <none>
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"greenplum.pivotal.io/v1","kind":"GreenplumCluster", "metadata":{"annotations":{},"name":"my-greenplum", "namespace":"default"...
    API Version:  greenplum.pivotal.io/v1
    Kind:         GreenplumCluster
    Metadata:
      Creation Timestamp:  2019-04-01T15:19:17Z
      Generation:          1
      Resource Version:    1469567
      Self Link:           /apis/greenplum.pivotal.io/v1/namespaces/default/greenplumclusters/my-greenplum
      UID:                 83e0bdfd-5491-11e9-a268-c28bb5ff3d1c
    Spec:
      Master And Standby:
        Anti Affinity:              yes
        Cpu:                        0.5
        Host Based Authentication:  # host   all   gpadmin   1.2.3.4/32   trust
    # host   all   gpuser    0.0.0.0/0   md5
    
        Memory:              800Mi
        Storage:             1G
        Storage Class Name:  standard
        Worker Selector:
      Segments:
        Anti Affinity:          yes
        Cpu:                    0.5
        Memory:                 800Mi
        Primary Segment Count:  1
        Storage:                2G
        Storage Class Name:     standard
        Worker Selector:
    Status:
      Instance Image:    greenplum-for-kubernetes:latest
      Operator Version:  greenplum-operator:latest
      Phase:             Running
    Events:
      Type    Reason                    Age   From               Message
      ----    ------                    ----  ----               -------
      Normal  CreatingGreenplumCluster  2m    greenplumOperator  Creating Greenplum cluster my-greenplum in default
      Normal  CreatedGreenplumCluster   8s    greenplumOperator  Successfully created Greenplum cluster my-greenplum in default
    

    If you are deploying a brand new cluster, the Greenplum Operator automatically initializes the Greenplum cluster. The Phase should eventually transition from Pending to Running and the Events should match the output above.


    Note: If you redeploy a previously-deployed Greenplum cluster, the phase remains Pending while the Operator reuses the previous Persistent Volume Claims, if they are available. The master and segment data directories already exist in their former state, and the master-0 pod automatically starts the Greenplum cluster. The phase should then transition to Running.
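
    You can also describe the GreenplumPXFService resource shown earlier in the kubectl get all output to review the PXF service specification that the Operator applied (output not shown here):

    $ kubectl describe greenplumpxfservice/my-greenplum-pxf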

  6. At this point, you can work with the deployed Greenplum cluster by executing Greenplum utilities from within Kubernetes, or by using a locally-installed client tool, such as psql, to access the Greenplum instance. To begin PXF configuration, examine the PXF_CONF directory on the master pod:

    $ kubectl exec -it master-0 -- bash -c "ls -R /etc/pxf"
    
    /etc/pxf:
    conf  keytabs  lib  logs  servers  templates
    
    /etc/pxf/conf:
    pxf-env.sh  pxf-log4j.properties  pxf-profiles.xml
    
    /etc/pxf/keytabs:
    
    /etc/pxf/lib:
    
    /etc/pxf/logs:
    
    /etc/pxf/servers:
    default
    
    /etc/pxf/servers/default:
    
    /etc/pxf/templates:
    adl-site.xml   hbase-site.xml  jdbc-site.xml    s3-site.xml
    core-site.xml  hdfs-site.xml   mapred-site.xml  wasbs-site.xml
    gs-site.xml    hive-site.xml   minio-site.xml   yarn-site.xml
    

    The Pivotal PXF service has just been initialized, and the PXF_CONF directory (/etc/pxf) contains the subdirectories and template configuration files required to begin configuring PXF for your external data sources. Follow the instructions in Configuring PXF Hadoop Connectors (Optional), Configuring Connectors to Azure, Google Cloud Storage, Minio, and S3 Object Stores (Optional), or Configuring the JDBC Connector (Optional) in the Pivotal Greenplum Documentation. Be sure to test connectivity to each data source from within the Kubernetes cluster; a minimal connectivity test is sketched below.
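
    The following is only a sketch of one such test, assuming you have configured an S3 server directory under $PXF_CONF/servers, that the greenplum LoadBalancer service has an external IP, and that your pg_hba configuration permits the connection. The server name (pxf-s3), bucket, and file path are hypothetical placeholders; adjust the PROFILE and SERVER options for your data source:

    $ kubectl get service greenplum    # note the EXTERNAL-IP of the LoadBalancer
    $ psql -h <external-ip> -p 5432 -U gpadmin -d postgres -c "CREATE EXTENSION pxf;"
    $ psql -h <external-ip> -p 5432 -U gpadmin -d postgres -c "CREATE EXTERNAL TABLE pxf_s3_test (line text) LOCATION ('pxf://<bucket>/<folder>/test.txt?PROFILE=s3:text&SERVER=pxf-s3') FORMAT 'TEXT';"
    $ psql -h <external-ip> -p 5432 -U gpadmin -d postgres -c "SELECT count(*) FROM pxf_s3_test;"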

Exporting a PXF Configuration to S3

Follow these steps to export an existing PXF configuration to an S3 bucket, so that you can use the configuration in later deployments of Pivotal Greenplum for Kubernetes. This procedure requires a working PXF service configuration (a customized $PXF_CONF directory) in an existing Greenplum for Kubernetes cluster.

  1. Create a temporary directory, and copy in the complete PXF_CONF directory from your Greenplum for Kubernetes cluster. For example:

    $ mkdir ./pxf-temp
    $ kubectl cp master-0:/etc/pxf ./pxf-temp/
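
    Optionally, verify that the local copy contains the expected subdirectories (conf, servers, templates, and so on) before uploading it:

    $ ls -R ./pxf-temp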
    
  2. Copy the PXF_CONF contents from your temporary directory to an S3 bucket and folder. For example:

    $ aws s3 cp pxf-temp s3://<bucket>/<folder>/ --recursive
    
    upload: pxf-temp/conf/pxf-env.sh to s3://<bucket>/<folder>/conf/pxf-env.sh
    upload: pxf-temp/conf/pxf-log4j.properties to s3://<bucket>/<folder>/conf/pxf-log4j.properties
    ...
    
  3. Create a secrets file that can be used to authenticate access to the S3 bucket and folder that contains the PXF configuration directory. For example:

    $ kubectl create secret generic my-greenplum-pxf-configs --from-literal='access_key_id=<accessKey>' --from-literal='secret_access_key=<secretKey>'
    
    secret/my-greenplum-pxf-configs created
    

    The above command creates a secret named my-greenplum-pxf-configs using the S3 access and secret keys that you provide. Replace <accessKey> and <secretKey> with the actual S3 access and secret key values for your system. If necessary, use your S3 implementation documentation to generate a secret access key.
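
    You can confirm that the secret exists (the key values themselves are not displayed) with:

    $ kubectl get secret my-greenplum-pxf-configs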

  4. Edit the Greenplum for Kubernetes manifest file you use to deploy Greenplum and PXF. Uncomment or add the required PXF configuration properties:

      pxfConf:
        s3Source:
          secret: "my-greenplum-pxf-configs"
          endpoint: "s3.amazonaws.com"
          bucket: "YOUR_S3_BUCKET_NAME"
          folder: "YOUR_S3_BUCKET_FOLDER-Optional"
    

    Replace my-greenplum-pxf-configs with the actual secret that you created in the previous step (if you used a different name). Similarly, replace the remaining property values with the endpoint, bucket name, and folder in which you have placed the full contents of the PXF_CONF directory.


    When you deploy a new cluster with the above property values, the Greenplum Operator uses the secret and S3 location that you specified to download the exported contents and populate the PXF_CONF directory for the new cluster.

  5. Use the procedure described in Deploying a New Greenplum Cluster to deploy a new cluster, but use your modified manifest file containing the S3 PXF configuration.

  6. After the new cluster deployment is complete, validate your PXF configuration and test PXF connectivity. For example, examine the PXF_CONF directory to ensure that your customized PXF configuration was installed:

    $ kubectl exec -it master-0 -- bash -c "ls -R /etc/pxf"