Deploying GPText with Greenplum (Beta)

This section describes procedures for deploying a Pivotal Greenplum cluster with GPText on Kubernetes.

Note: GPText is a Beta feature with Pivotal Greenplum for Kubernetes.

About GPText with Pivotal Greenplum for Kubernetes

When you deploy GPText with Pivotal Greenplum for Kubernetes, the Greenplum Operator creates resources to run the Apache Solr Cloud and ZooKeeper instances necessary for using GPText. ZooKeeper can be deployed to multiple replica pods as needed for redundancy. Currently, Apache Solr Cloud can only be deployed to a single pod.

Note that ZooKeeper instances are not deployed on the Greenplum segment hosts. That arrangement, which the Pivotal Greenplum Text documentation describes as a ‘binding’ ZooKeeper cluster, is not used with Pivotal Greenplum for Kubernetes.

Deploying GPText with Pivotal Greenplum for Kubernetes

Follow these steps to deploy GPText with a new Pivotal Greenplum cluster on Kubernetes.

  1. Follow the procedure described in Deploying or Redeploying a Greenplum Cluster to deploy the cluster, but use samples/my-gp-with-gptext-instance.yaml as the basis for your deployment. Copy the file into your /workspace directory. For example:

    $ cd ./greenplum-for-kubernetes-*/workspace
    $ cp ./samples/my-gp-with-gptext-instance.yaml .
  2. Edit the file as necessary for your deployment. samples/my-gp-with-gptext-instance.yaml includes additional properties to configure Greenplum Text in the new cluster:

    apiVersion: ""
    kind: "GreenplumCluster"
    metadata:
      name: my-greenplum
    spec:
      masterAndStandby:
        hostBasedAuthentication: |
          # host   all   gpadmin    trust
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        workerSelector: {}
      segments:
        primarySegmentCount: 1
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        workerSelector: {}
      gptext:
        serviceName: "my-greenplum-gptext"
    ---
    apiVersion: ""
    kind: "GreenplumTextService"
    metadata:
      name: my-greenplum-gptext
    spec:
      solr:
        replicas: 1
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M
      zookeeper:
        replicas: 3
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M

    The entry:

        serviceName: "my-greenplum-gptext"

    indicates that the cluster uses the GPText service configuration named my-greenplum-gptext, which follows at the end of the YAML file. The sample configuration creates a single Solr pod (required) and three ZooKeeper replica pods (the minimum required for Apache Solr Cloud). Minimal CPU and memory settings are defined for each pod. You can customize these values as needed, and you can set the workerSelector value to constrain the replica pods to labeled nodes in your cluster. You can also customize storageClassName if you want to store GPText indexes on a different class of storage than Greenplum Database data.

  3. Use the kubectl apply command with your modified Greenplum manifest file to send the deployment request to the Greenplum Operator. For example:

    $ kubectl apply -f ./my-gp-with-gptext-instance.yaml
     created
     created

    If you are deploying another instance of a Greenplum cluster, specify the Kubernetes namespace where you want to deploy the new cluster. For example, if you previously deployed a cluster in the namespace gpinstance-1, you could deploy a second Greenplum cluster in the gpinstance-2 namespace using the command:

    $ kubectl apply -f ./my-gp-with-gptext-instance.yaml -n gpinstance-2

    The Greenplum Operator deploys the necessary Greenplum and GPText resources according to your specification, and also initializes the Greenplum cluster.

  4. Execute the following command to monitor the deployment of the cluster. While the cluster is initializing, the status is Pending:

    $ watch kubectl get all
    NAME                                      READY   STATUS    RESTARTS   AGE
    pod/greenplum-operator-6ff95b6b79-nw77p   1/1     Running   0          5m32s
    pod/master-0                              1/1     Running   0          2m26s
    pod/my-greenplum-gptext-solr-0            1/1     Running   0          2m33s
    pod/my-greenplum-gptext-zookeeper-0       1/1     Running   0          2m33s
    pod/my-greenplum-gptext-zookeeper-1       1/1     Running   0          2m30s

    NAME                                                            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
    service/agent                                                   ClusterIP      None            <none>        22/TCP                       2m26s
    service/greenplum                                               LoadBalancer                   <pending>     5432:31387/TCP               2m26s
    service/greenplum-validating-webhook-service-6ff95b6b79-nw77p   ClusterIP                      <none>        443/TCP                      5m30s
    service/kubernetes                                              ClusterIP                      <none>        443/TCP                      30m
    service/my-greenplum-gptext-solr                                ClusterIP      None            <none>        8983/TCP                     2m33s
    service/my-greenplum-gptext-zookeeper                           ClusterIP      None            <none>        2888/TCP,3888/TCP,2181/TCP   2m33s

    NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/greenplum-operator   1/1     1            1           5m32s

    NAME                                            DESIRED   CURRENT   READY   AGE
    replicaset.apps/greenplum-operator-6ff95b6b79   1         1         1       5m32s

    NAME                                             READY   AGE
    statefulset.apps/master                          1/1     2m26s
    statefulset.apps/my-greenplum-gptext-solr        1/1     2m33s
    statefulset.apps/my-greenplum-gptext-zookeeper   3/3     2m33s
    statefulset.apps/segment-a                       1/1     2m26s

    NAME                                                 STATUS    AGE
                                                         Running   2m33s

    NAME                                                            AGE
                                                                    2m33s

    Note that the Solr and ZooKeeper services are created along with the Greenplum Database cluster.

  5. Describe your Greenplum cluster to verify that it was created successfully. The Phase should eventually transition to Running:

    $ kubectl describe greenplumClusters/my-greenplum
    Name:         my-greenplum
    Namespace:    default
    Labels:       <none>
    Annotations:
    API Version:
    Kind:         GreenplumCluster
    Metadata:
      Creation Timestamp:  2020-05-13T22:12:50Z
      Generation:        3
      Resource Version:  2814
      Self Link:         /apis/
      UID:               697b412b-719d-446f-bc5a-af51c5d0ae00
    Spec:
      Gptext:
        Service Name:  my-greenplum-gptext
      Master And Standby:
        Cpu:                        0.5
        Host Based Authentication:  # host   all   gpadmin   trust
        Memory:              800Mi
        Storage:             1G
        Storage Class Name:  standard
        Worker Selector:
      Segments:
        Cpu:                    0.5
        Memory:                 800Mi
        Primary Segment Count:  1
        Storage:                1G
        Storage Class Name:     standard
        Worker Selector:
    Status:
      Instance Image:    greenplum-for-kubernetes:v2.0.0
      Operator Version:  greenplum-operator:v2.0.0
      Phase:             Running
    Events:              <none>

    If you are deploying a brand new cluster, the Greenplum Operator automatically initializes the Greenplum cluster. The Phase should eventually transition from Pending to Running and the Events should match the output above.

    Note: If you redeployed a previously deployed Greenplum cluster, the phase begins at Pending. The cluster reuses its existing Persistent Volume Claims if they are available, so the master and segment data directories already exist in their former state. The master-0 pod automatically starts the Greenplum cluster, after which the phase transitions to Running.

  6. At this point, you can work with the deployed Greenplum cluster by executing Greenplum utilities from within Kubernetes, or by using a locally installed tool, such as psql, to access the Greenplum instance running in Kubernetes. To validate the initial GPText service deployment configuration, follow the instructions in Verifying GPText. To begin working with GPText instead, see Working With GPText Indexes in the GPText documentation.
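
The monitoring in steps 4 and 5 can also be scripted rather than watched by hand. The following is a minimal sketch and not part of the product. The function name wait_for_phase, the timeout, and the polling interval are hypothetical, and the sketch assumes that the Phase value reported by kubectl describe is stored in the .status.phase field of the GreenplumCluster resource, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Sketch: poll a GreenplumCluster until it reports the Running phase.
# Assumption: the phase is readable at .status.phase (not confirmed by this page).

wait_for_phase() {
  local cluster="$1"               # cluster name, for example my-greenplum
  local namespace="${2:-default}"  # namespace the cluster was deployed to
  local timeout="${3:-600}"        # seconds to wait; hypothetical default
  local interval="${4:-5}"         # seconds between polls; hypothetical default
  local elapsed=0
  local phase=""

  while [ "$elapsed" -le "$timeout" ]; do
    # An empty result (resource not found yet, or field missing) is treated as "unknown".
    phase="$(kubectl get "greenplumClusters/$cluster" -n "$namespace" \
      -o jsonpath='{.status.phase}' 2>/dev/null)"
    if [ "$phase" = "Running" ]; then
      echo "cluster $cluster is Running"
      return 0
    fi
    echo "cluster $cluster phase: ${phase:-unknown}; waiting..."
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "timed out waiting for cluster $cluster" >&2
  return 1
}
```

For example, after sourcing the file that defines the function, wait_for_phase my-greenplum gpinstance-2 would poll the cluster deployed to the gpinstance-2 namespace and return a nonzero status if the cluster does not reach Running within the timeout.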

Verifying GPText

Follow these steps to quickly verify GPText operation in your new cluster, using downloaded sample data.

  1. Open a bash shell on the master-0 pod:

    $ kubectl exec -it master-0 -- bash
  2. Set the environment for accessing Greenplum Database and GPText tools:

    $ source /usr/local/greenplum-db/
    $ source /opt/gptext/
  3. Start the psql subsystem:

    $ psql -d postgres
    psql (9.4.24)
    Type "help" for help.
  4. Query the version of GPText that is installed:

    gpadmin=# select * from gptext.version();
     Greenplum Text Analytics 3.4.2
    (1 row)
  5. Execute the following series of commands to create an external index and add several PDF documents to the index:

    gpadmin=# SELECT * FROM gptext.create_index_external('gptext-docs');
    INFO:  Created index gptext-docs
    (1 row)
    gpadmin=# SELECT * FROM gptext.index_external(
    }', 'gptext-docs');
     dbid | num_docs
    ------+----------
        2 |       16
    (1 row)
    gpadmin=# SELECT * FROM gptext.commit_index('gptext-docs');
    (1 row)
  6. Perform a simple search to find the text “Solr” in the title field of the example external index:

    gpadmin=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
                                id                             |  score   | hs | rf
    -----------------------------------------------------------+----------+----+----
                                                               | 2.103843 |    |
    (1 row)
  7. Optionally, complete additional example tasks described in Using GPText in the Greenplum GPText documentation to learn more about GPText functionality. For example, perform the tutorials in Working With GPText Indexes or Querying GPText Indexes.

    Note: In Pivotal Greenplum for Kubernetes, the scripts used to set the environment for Greenplum Database and GPText are /usr/local/greenplum-db/ and /opt/gptext/, respectively. These paths differ from the paths used with GPText deployed to non-Kubernetes environments.
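
The interactive session in steps 1 through 4 can be condensed into a single command for a quick check. The following is a minimal sketch under stated assumptions: the function name gptext_version is hypothetical, and because the exact environment script paths depend on your installation, the sketch takes them as arguments instead of hard-coding them. It uses only kubectl exec and standard psql options (-t for tuples only, -A for unaligned output, -c to run one command).

```shell
#!/usr/bin/env bash
# Sketch: run the GPText version query on master-0 without an interactive shell.

gptext_version() {
  local gpdb_env="$1"     # path to the Greenplum Database environment script
  local gptext_env="$2"   # path to the GPText environment script

  # The quoted string is executed by bash inside the master-0 pod.
  kubectl exec master-0 -- bash -c \
    "source '$gpdb_env' && source '$gptext_env' && psql -d postgres -t -A -c 'select * from gptext.version();'"
}
```

Call the function with the two environment script paths that apply to your deployment. If the command prints a GPText version string, the GPText functions are installed and reachable from the master-0 pod.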