Deploying GPText with Greenplum (Beta)

This section describes procedures for deploying a Greenplum for Kubernetes cluster that includes Pivotal GPText.

Note: Pivotal GPText in Greenplum for Kubernetes is a Beta feature.

About GPText on Greenplum for Kubernetes

When you deploy GPText to Greenplum for Kubernetes, the Greenplum Operator creates resources to run the Apache Solr Cloud and ZooKeeper instances necessary for using GPText. ZooKeeper can be deployed to multiple replica pods as needed for redundancy. Currently, Apache Solr Cloud can only be deployed to a single pod.

Note that ZooKeeper instances are not deployed on the Greenplum segment hosts (a ‘binding’ ZooKeeper cluster), as described in the Pivotal Greenplum Text documentation.

Deploying GPText with Greenplum for Kubernetes

Follow these steps to deploy GPText with a new Greenplum for Kubernetes cluster.

  1. Use the procedure described in Deploying a New Greenplum Cluster to deploy the cluster, but use the samples/my-gp-with-gptext-instance.yaml file as the basis for your deployment. Copy the file into your /workspace directory. For example:

    $ cd ./greenplum-for-kubernetes-*/workspace
    $ cp ./samples/my-gp-with-gptext-instance.yaml .
    
  2. Edit the file as necessary for your deployment. samples/my-gp-with-gptext-instance.yaml includes additional properties to configure Greenplum Text in the new cluster:

    apiVersion: "greenplum.pivotal.io/v1"
    kind: "GreenplumCluster"
    metadata:
      name: my-greenplum
    spec:
      masterAndStandby:
        hostBasedAuthentication: |
          # host   all   gpadmin   1.2.3.4/32   trust
          # host   all   gpuser    0.0.0.0/0   md5
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        antiAffinity: "yes"
        workerSelector: {}
      segments:
        primarySegmentCount: 1
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        antiAffinity: "yes"
        workerSelector: {}
      gptext:
        serviceName: "my-greenplum-gptext"
    ---
    apiVersion: "greenplum.pivotal.io/v1beta1"
    kind: "GreenplumTextService"
    metadata:
      name: my-greenplum-gptext
    spec:
      solr:
        replicas: 1
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M
      zookeeper:
        replicas: 3
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M
    

    The entry:

      gptext:
        serviceName: "my-greenplum-gptext"    
    

    indicates that the cluster will use the GPText service configuration named my-greenplum-gptext, which follows at the end of the YAML file. The sample configuration creates a single Solr pod (required) and three ZooKeeper replica pods (the minimum required for Apache Solr Cloud). Minimal CPU and memory settings are defined for each pod. You can customize these values as needed, as well as the workerSelector value if you want to constrain the replica pods to labeled nodes in your cluster. You can also customize storageClassName if you need specialized storage for GPText indexes, separate from the storage used for Greenplum Database.
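If you use workerSelector to constrain the Solr and ZooKeeper pods to specific nodes, label the nodes first and reference that label in each workerSelector. The fragment below is a sketch only; the label gptext-workers: "true" is an assumed example, not a value defined by the product:

```yaml
# Hypothetical workerSelector fragment for the GreenplumTextService spec.
# Label candidate nodes first, for example:
#   kubectl label node <node-name> gptext-workers=true
spec:
  solr:
    workerSelector:
      gptext-workers: "true"
  zookeeper:
    workerSelector:
      gptext-workers: "true"
```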

  3. Use the kubectl apply command with your modified GPText manifest file to send the deployment request to the Greenplum Operator. For example:

    $ kubectl apply -f ./my-gp-with-gptext-instance.yaml 
    
    greenplumcluster.greenplum.pivotal.io/my-greenplum created
    greenplumtextservice.greenplum.pivotal.io/my-greenplum-gptext created
    

    If you are deploying another instance of a Greenplum cluster, specify the Kubernetes namespace where you want to deploy the new cluster. For example, if you previously deployed a cluster in the namespace gpinstance-1, you could deploy a second Greenplum cluster in the gpinstance-2 namespace using the command:

    $ kubectl apply -f ./my-gp-with-gptext-instance.yaml -n gpinstance-2
    

    The Greenplum Operator deploys the necessary Greenplum and GPText resources according to your specification, and also initializes the Greenplum cluster.

  4. Execute the following command to monitor the deployment of the cluster. While the cluster is initializing, the status will be Pending:

    $ watch kubectl get all
    
    NAME                                      READY     STATUS    RESTARTS   AGE
    pod/greenplum-operator-79cddcf586-ctftb   1/1       Running   0          11m
    pod/master-0                              1/1       Running   0          15s
    pod/master-1                              1/1       Running   0          15s
    pod/my-greenplum-gptext-solr-0            0/1       Running   0          17s
    pod/my-greenplum-gptext-zookeeper-0       1/1       Running   0          17s
    pod/my-greenplum-gptext-zookeeper-1       1/1       Running   0          12s
    pod/my-greenplum-gptext-zookeeper-2       0/1       Pending   0          0s
    pod/segment-a-0                           1/1       Running   0          15s
    pod/segment-b-0                           1/1       Running   0          15s
    
    NAME                                                            TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
    service/agent                                                   ClusterIP      None           <none>        22/TCP                       15s
    service/greenplum                                               LoadBalancer   10.100.229.5   <pending>     5432:32275/TCP               15s
    service/greenplum-validating-webhook-service-79cddcf586-ctftb   ClusterIP      10.105.7.189   <none>        443/TCP                      11m
    service/kubernetes                                              ClusterIP      10.96.0.1      <none>        443/TCP                      28m
    service/my-greenplum-gptext-solr                                ClusterIP      None           <none>        8983/TCP                     17s
    service/my-greenplum-gptext-zookeeper                           ClusterIP      None           <none>        2888/TCP,3888/TCP,2181/TCP   17s
    
    NAME                                 READY     UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/greenplum-operator   1/1       1            1           11m
    
    NAME                                            DESIRED   CURRENT   READY     AGE
    replicaset.apps/greenplum-operator-79cddcf586   1         1         1         11m
    
    NAME                                             READY     AGE
    statefulset.apps/master                          2/2       15s
    statefulset.apps/my-greenplum-gptext-solr        0/1       17s
    statefulset.apps/my-greenplum-gptext-zookeeper   2/3       17s
    statefulset.apps/segment-a                       1/1       15s
    statefulset.apps/segment-b                       1/1       15s
    
    NAME                                                 STATUS    AGE
    greenplumcluster.greenplum.pivotal.io/my-greenplum   Pending   17s
    
    NAME                                                            AGE
    greenplumtextservice.greenplum.pivotal.io/my-greenplum-gptext   17s
    

    Note that the Solr and ZooKeeper services are created along with the Greenplum Database cluster.

  5. Describe your Greenplum cluster to verify that it was created successfully. The Phase should eventually transition to Running:

    $ kubectl describe greenplumClusters/my-greenplum
    
    Name:         my-greenplum
    Namespace:    default
    Labels:       <none>
    Annotations:  kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"greenplum.pivotal.io/v1","kind":"GreenplumCluster","metadata":{"annotations":{},"name":"my-greenplum","namespace":"default"},"spec":{"gp...
    API Version:  greenplum.pivotal.io/v1
    Kind:         GreenplumCluster
    Metadata:
      Creation Timestamp:  2019-10-02T23:43:05Z
      Finalizers:
        stopcluster.greenplumcluster.pivotal.io
      Generation:        2
      Resource Version:  7399
      Self Link:         /apis/greenplum.pivotal.io/v1/namespaces/default/greenplumclusters/my-greenplum
      UID:               b25e90e5-3ac2-40d6-94cb-a8b159b8134a
    Spec:
      Gptext:
        Service Name:  my-greenplum-gptext
      Master And Standby:
        Anti Affinity:              no
        Cpu:                        0.5
        Host Based Authentication:  # host   all   gpadmin   1.2.3.4/32   trust
    # host   all   gpuser    0.0.0.0/0   md5
    
        Memory:              800Mi
        Storage:             1G
        Storage Class Name:  standard
        Worker Selector:
      Segments:
        Anti Affinity:          no
        Cpu:                    0.5
        Memory:                 800Mi
        Primary Segment Count:  1
        Storage:                1G
        Storage Class Name:     standard
        Worker Selector:
    Status:
      Instance Image:    greenplum-for-kubernetes:v1.7.2.dev.51.g4530ad36
      Operator Version:  greenplum-operator:v1.7.2.dev.51.g4530ad36
      Phase:             Pending
    Events:
      Type    Reason                    Age   From               Message
      ----    ------                    ----  ----               -------
      Normal  CreatingGreenplumCluster  4m    greenplumOperator  Creating Greenplum cluster my-greenplum in default
      Normal  CreatedGreenplumCluster   8s    greenplumOperator  Successfully created Greenplum cluster my-greenplum in default
    

    If you are deploying a brand new cluster, the Greenplum Operator automatically initializes the Greenplum cluster. The Phase should eventually transition from Pending to Running and the Events should match the output above.


    Note: If you redeployed a previously-deployed Greenplum cluster, the phase will begin at Pending. The cluster uses its existing Persistent Volume Claims if they are available. In this case, the master and segment data directories will already exist in their former state. The master-0 pod automatically starts the Greenplum Cluster, after which the phase transitions to Running.
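    If you prefer to wait for the transition from a script rather than re-running kubectl describe, the phase can be polled. The function below is a sketch; it assumes kubectl is on PATH and configured for the target cluster, and reads the .status.phase field shown in the describe output above:

```shell
# Sketch: poll the GreenplumCluster resource until its phase is Running.
# Assumes kubectl is configured for the cluster that owns the resource.
wait_for_running() {
  local name="${1:-my-greenplum}" phase=""
  while [ "$phase" != "Running" ]; do
    phase=$(kubectl get greenplumcluster "$name" -o jsonpath='{.status.phase}')
    echo "phase: ${phase:-unknown}"
    sleep 5
  done
}
```

    For example, `wait_for_running my-greenplum` returns once the cluster reports Running.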

  6. At this point, you can work with the deployed Greenplum cluster by executing Greenplum utilities from within Kubernetes, or by using a locally-installed tool, such as psql, to access the Greenplum instance running in Kubernetes. To validate the initial GPText service deployment configuration, follow the instructions in Verifying GPText. Or, to begin working with GPText, see Working With GPText Indexes in the Pivotal GPText documentation.
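    For a locally-installed psql, you can resolve the external IP of the greenplum LoadBalancer service shown earlier in the kubectl get all output. The helper below is a sketch; it assumes the load balancer has finished provisioning (the EXTERNAL-IP column no longer shows <pending>):

```shell
# Sketch: print a psql command line for the external Greenplum endpoint.
# Assumes the `greenplum` LoadBalancer service has an assigned external IP.
greenplum_psql_cmd() {
  local ip
  ip=$(kubectl get service greenplum \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  echo "psql -h ${ip} -p 5432 -U gpadmin"
}
```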

Verifying GPText

Follow these steps to quickly verify GPText operation in your new cluster, using downloaded sample data.

  1. Open a bash shell on the master-0 pod:

    $ kubectl exec -it master-0 bash
    
  2. Set the environment for accessing Greenplum Database and GPText tools:

    $ source /opt/gpdb/greenplum_path.sh
    $ source /opt/gptext/greenplum-text_path.sh
    
  3. Start the psql subsystem:

    $ psql -d postgres
    
    psql (8.3.23)
    Type "help" for help.
    
    postgres=#

  4. Query the version of GPText that is installed:

    gpadmin=# select * from gptext.version();
    
                version
    --------------------------------
     Greenplum Text Analytics 3.3.0
    (1 row)
    
  5. Execute the following series of commands to create an external index and add several PDF documents to the index:

    gpadmin=# SELECT * FROM gptext.create_index_external('gptext-docs');
    
    INFO:  Created index gptext-docs
     create_index_external
    -----------------------
     t
    (1 row)
    
    gpadmin=# SELECT * FROM gptext.index_external(
            '{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf,
              http://gptext.docs.pivotal.io/latest/topics/administering.html,
              http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html,
              http://gptext.docs.pivotal.io/latest/topics/function_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/guc_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/ha.html,
              http://gptext.docs.pivotal.io/latest/topics/index.html,
              http://gptext.docs.pivotal.io/latest/topics/indexes.html,
              http://gptext.docs.pivotal.io/latest/topics/intro.html,
              http://gptext.docs.pivotal.io/latest/topics/managed-schema.html,
              http://gptext.docs.pivotal.io/latest/topics/performance.html,
              http://gptext.docs.pivotal.io/latest/topics/queries.html,
              http://gptext.docs.pivotal.io/latest/topics/type_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/upgrading.html,
              http://gptext.docs.pivotal.io/latest/topics/utility_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/installing.html}', 'gptext-docs');
    
     dbid | num_docs
    ------+----------
        2 |       16
    (1 row)    
    
    gpadmin=# SELECT * FROM gptext.commit_index('gptext-docs');
    
     commit_index
    --------------
     t
    (1 row)
    
  6. Perform a simple search to find the text “Solr” in the title field of the example external index:

    gpadmin=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
    
                                id                             |  score   | hs | rf
    -----------------------------------------------------------+----------+----+----
     http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 2.103843 |    |
    (1 row)
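    The query argument uses standard Solr/Lucene query syntax, so the same call can carry boolean expressions. The following is a sketch; terms are illustrative, and any field names beyond title depend on your index schema:

```sql
-- Sketch: the same gptext.search() call with a boolean Solr query
-- over the title field; adjust terms and fields for your index schema.
SELECT * FROM gptext.search(
    TABLE(SELECT 1 SCATTER BY 1),
    'gptext-docs',
    'title:(Solr OR ZooKeeper)',
    null, null);
```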
    
  7. Optionally, complete additional example tasks described in Using GPText in the Pivotal GPText documentation to learn more about GPText functionality. For example, perform the tutorials in Working With GPText Indexes or Querying GPText Indexes.


    Note: In Greenplum for Kubernetes, the scripts used to set the environment for Greenplum Database and GPText are /opt/gpdb/greenplum_path.sh and /opt/gptext/greenplum-text_path.sh, respectively. These paths differ from the paths described in the Pivotal Greenplum or GPText documentation.