Deploying GPText with Greenplum

This section describes procedures for deploying a VMware Tanzu Greenplum cluster with GPText on Kubernetes.

About GPText with VMware Tanzu Greenplum for Kubernetes

When you deploy GPText with VMware Tanzu Greenplum for Kubernetes, the Greenplum Operator creates resources to run the Apache SolrCloud and ZooKeeper instances required by GPText. Solr and ZooKeeper can be deployed to multiple replica pods as needed for redundancy.

Note that ZooKeeper instances are not deployed on the Greenplum segment hosts (a ‘binding’ ZooKeeper cluster), as described in the VMware Tanzu Greenplum Text documentation.

Deploying a New Greenplum Cluster with GPText Enabled

Follow these steps to deploy GPText with a new VMware Tanzu Greenplum cluster on Kubernetes.

  1. Use the procedure described in Deploying or Redeploying a Greenplum Cluster to deploy the cluster, but use the samples/my-gp-with-gptext-instance.yaml file as the basis for your deployment. Copy the file into your /workspace directory. For example:

    $ cd ./greenplum-for-kubernetes-*/workspace
    $ cp ./samples/my-gp-with-gptext-instance.yaml .
    
  2. Edit the file as necessary for your deployment. The samples/my-gp-with-gptext-instance.yaml file includes additional properties that configure GPText in the new cluster:

    apiVersion: "greenplum.pivotal.io/v1"
    kind: "GreenplumCluster"
    metadata:
      name: my-greenplum
    spec:
      masterAndStandby:
        hostBasedAuthentication: |
          # host   all   gpadmin   0.0.0.0/0    trust
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        workerSelector: {}
      segments:
        primarySegmentCount: 1
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        workerSelector: {}
      gptext:
        serviceName: "my-greenplum-gptext"
    ---
    apiVersion: "greenplum.pivotal.io/v1beta1"
    kind: "GreenplumTextService"
    metadata:
      name: my-greenplum-gptext
    spec:
      solr:
        replicas: 2
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M
      zookeeper:
        replicas: 3
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M
    

    The entry:

      gptext:
        serviceName: "my-greenplum-gptext"    
    

    indicates that the cluster will use the GPText service configuration named my-greenplum-gptext, which follows at the end of the YAML file. The sample configuration creates two Solr replica pods and three ZooKeeper replica pods (the minimum required for SolrCloud). Minimal CPU and memory settings are defined for each pod. You can customize these values as needed, as well as the workerSelector value if you want to constrain the replica pods to labeled nodes in your cluster. You can also customize the storageClassName if you want to use different storage for GPText indexes than for Greenplum Database.
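    For example, to constrain the Solr replica pods to labeled nodes and store GPText indexes on a different storage class, you could modify the solr section as follows. The node label gptext-worker: "true" and the storage class fast-ssd are placeholders; substitute labels and storage classes that actually exist in your cluster:

```yaml
# Hypothetical customization of the solr section; the node label and
# storage class below must match resources in your own cluster.
solr:
  replicas: 2
  cpu: "0.5"
  memory: "1Gi"
  workerSelector:
    gptext-worker: "true"   # pods schedule only on nodes with this label
  storageClassName: fast-ssd
  storage: 1G
```

    The zookeeper section accepts the same workerSelector, storageClassName, and storage customizations.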

  3. Use the kubectl apply command with your modified Greenplum manifest file to send the deployment request to the Greenplum Operator. For example:

    $ kubectl apply -f ./my-gp-with-gptext-instance.yaml 
    
    greenplumcluster.greenplum.pivotal.io/my-greenplum created
    greenplumtextservice.greenplum.pivotal.io/my-greenplum-gptext created
    

    If you are deploying another instance of a Greenplum cluster, specify the Kubernetes namespace where you want to deploy the new cluster. For example, if you previously deployed a cluster in the namespace gpinstance-1, you could deploy a second Greenplum cluster in the gpinstance-2 namespace using the command:

    $ kubectl apply -f ./my-gp-with-gptext-instance.yaml -n gpinstance-2
    

    The Greenplum Operator deploys the necessary Greenplum and GPText resources according to your specification, and also initializes the Greenplum cluster.

  4. Execute the following command to monitor the deployment of the cluster. While the cluster is initializing, the status will be Pending:

    $ watch kubectl get all
    
    NAME                                      READY   STATUS    RESTARTS   AGE
    pod/greenplum-operator-6ff95b6b79-nw77p   1/1     Running   0          5m32s
    pod/master-0                              1/1     Running   0          2m26s
    pod/my-greenplum-gptext-solr-0            1/1     Running   0          2m33s
    pod/my-greenplum-gptext-solr-1            1/1     Running   0          2m33s
    pod/my-greenplum-gptext-zookeeper-0       1/1     Running   0          2m33s
    pod/my-greenplum-gptext-zookeeper-1       1/1     Running   0          2m30s
    pod/my-greenplum-gptext-zookeeper-2       1/1     Running   0          2m00s
    NAME                                                            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
    service/agent                                                   ClusterIP      None            <none>        22/TCP                       2m26s
    service/greenplum                                               LoadBalancer   10.109.56.155   <pending>     5432:31387/TCP               2m26s
    service/greenplum-validating-webhook-service-6ff95b6b79-nw77p   ClusterIP      10.109.191.15   <none>        443/TCP                      5m30s
    service/kubernetes                                              ClusterIP      10.96.0.1       <none>        443/TCP                      30m
    service/my-greenplum-gptext-solr                                ClusterIP      None            <none>        8983/TCP                     2m33s
    service/my-greenplum-gptext-zookeeper                           ClusterIP      None            <none>        2888/TCP,3888/TCP,2181/TCP   2m33s
    
    NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/greenplum-operator   1/1     1            1           5m32s
    
    NAME                                            DESIRED   CURRENT   READY   AGE
    replicaset.apps/greenplum-operator-6ff95b6b79   1         1         1       5m32s
    
    NAME                                             READY   AGE
    statefulset.apps/master                          1/1     2m26s
    statefulset.apps/my-greenplum-gptext-solr        2/2     2m33s
    statefulset.apps/my-greenplum-gptext-zookeeper   3/3     2m33s
    statefulset.apps/segment-a                       1/1     2m26s
    
    NAME                                                 STATUS    AGE
    greenplumcluster.greenplum.pivotal.io/my-greenplum   Running   2m33s
    
    NAME                                                            AGE
    greenplumtextservice.greenplum.pivotal.io/my-greenplum-gptext   2m33s
    

    Note that the Solr and ZooKeeper services are created along with the Greenplum Database cluster.

  5. Describe your Greenplum cluster to verify that it was created successfully. The Phase should eventually transition to Running:

    $ kubectl describe greenplumClusters/my-greenplum
    
    Name:         my-greenplum
    Namespace:    default
    Labels:       <none>
    Annotations:  API Version:  greenplum.pivotal.io/v1
    Kind:         GreenplumCluster
    Metadata:
      Creation Timestamp:  2020-05-13T22:12:50Z
      Finalizers:
        stopcluster.greenplumcluster.pivotal.io
      Generation:        3
      Resource Version:  2814
      Self Link:         /apis/greenplum.pivotal.io/v1/namespaces/default/greenplumclusters/my-greenplum
      UID:               697b412b-719d-446f-bc5a-af51c5d0ae00
    Spec:
      Gptext:
        Service Name:  my-greenplum-gptext
      Master And Standby:
        Cpu:                        0.5
        Host Based Authentication:  # host   all   gpadmin   0.0.0.0/0   trust
    
        Memory:              800Mi
        Storage:             1G
        Storage Class Name:  standard
        Worker Selector:
      Segments:
        Cpu:                    0.5
        Memory:                 800Mi
        Primary Segment Count:  1
        Storage:                1G
        Storage Class Name:     standard
        Worker Selector:
    Status:
      Instance Image:    greenplum-for-kubernetes:v2.0.0
      Operator Version:  greenplum-operator:v2.0.0
      Phase:             Running
    Events:              <none>
    

    If you are deploying a brand new cluster, the Greenplum Operator automatically initializes the Greenplum cluster. The Phase should eventually transition from Pending to Running and the Events should match the output above.


    Note: If you redeployed a previously-deployed Greenplum cluster, the phase will begin at Pending. The cluster uses its existing Persistent Volume Claims if they are available. In this case, the master and segment data directories will already exist in their former state. The master-0 pod automatically starts the Greenplum Cluster, after which the phase transitions to Running.
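    If you script your deployments, the describe-and-check step above can be automated by polling the cluster's status field. Below is a minimal sketch, assuming kubectl is configured for the target cluster and using the sample cluster name my-greenplum; the helper name wait_for_phase is hypothetical:

```shell
# Poll the GreenplumCluster resource until its status.phase reaches the
# desired value (e.g. Running), checking every 10 seconds for ~10 minutes.
wait_for_phase() {
  local cluster="$1" want="$2"
  local i phase
  for i in $(seq 1 60); do
    phase=$(kubectl get greenplumcluster "$cluster" -o jsonpath='{.status.phase}')
    if [ "$phase" = "$want" ]; then
      echo "cluster $cluster is $want"
      return 0
    fi
    sleep 10
  done
  echo "timed out waiting for $cluster (last phase: $phase)" >&2
  return 1
}
```

    For example, wait_for_phase my-greenplum Running blocks until the Greenplum Operator reports the cluster as initialized, or fails after the timeout.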

  6. At this point, you can work with the deployed Greenplum cluster by executing Greenplum utilities from within Kubernetes, or by using a locally-installed tool such as psql to access the Greenplum instance running in Kubernetes. To validate the initial GPText service deployment configuration, follow the instructions in Verifying GPText. Or, to begin working with GPText, see Working With GPText Indexes in the GPText documentation.

Adding GPText to an Existing Greenplum Cluster

Follow these steps to deploy a GPText Service and associate it with an existing Greenplum cluster deployment.

  1. Delete the existing Greenplum cluster, but leave the existing PVCs so that the data in your existing Greenplum cluster is preserved. For example:

    $ kubectl delete -f workspace/my-gp-instance.yaml
    

    Ensure that the GreenplumCluster and its associated resources are completely gone before continuing:

    $ watch kubectl get all
    
  2. Edit your manifest file to add a GreenplumTextService. Associate the new GreenplumTextService with the previously-existing GreenplumCluster resource by setting the GPText serviceName value. See the GPText Reference and GreenplumCluster Reference pages for more information and descriptions of the various configuration options. Below is an example manifest file:

    $ cat workspace/my-gp-instance.yaml
    
    ---
    apiVersion: "greenplum.pivotal.io/v1"
    kind: "GreenplumCluster"
    metadata:
      name: my-greenplum
    spec:
      masterAndStandby:
        hostBasedAuthentication: |
          # host   all   gpadmin   0.0.0.0/0    trust
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 1G
        workerSelector: {}
      segments:
        primarySegmentCount: 1
        memory: "800Mi"
        cpu: "0.5"
        storageClassName: standard
        storage: 2G
        workerSelector: {}
      gptext:
        serviceName: "my-greenplum-gptext"
    ---
    apiVersion: "greenplum.pivotal.io/v1beta1"
    kind: "GreenplumTextService"
    metadata:
      name: my-greenplum-gptext
    spec:
      solr:
        replicas: 2
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M
      zookeeper:
        replicas: 3
        cpu: "0.5"
        memory: "1Gi"
        workerSelector: {}
        storageClassName: standard
        storage: 100M
    
  3. Apply your updated manifest file to create the GreenplumCluster and GreenplumTextService resources:

    $ kubectl apply -f workspace/my-gp-instance.yaml
    
  4. Describe your Greenplum cluster to verify that it was created successfully.

    $ kubectl describe greenplumClusters/my-greenplum
    
  5. The following steps are required only if your cluster is configured to use a standby master. (If you do not use a standby master, skip to the next step.)

    1. Connect to the master-0 pod:

      $ kubectl exec -it master-0 -- /bin/bash
      
    2. As the gpadmin user, start the Greenplum cluster:

      $ su - gpadmin
      $ gpstart
      
    3. Create the required GPText configuration files and run the GPText setup commands by copying and pasting the commands below.


      Note: These commands assume that you used the default GPText service name, my-greenplum-gptext, as shown in the previous steps. If you used a different service name, substitute that name for my-greenplum-gptext in the commands below.

      gpadmin@master-0:~$ cat <<EOF > /greenplum/data-1/gptext.conf
      id,host,port,solrdir,zoocluster
      1,my-greenplum-gptext-solr-0.my-greenplum-gptext-solr,8983,/solr/data-1,"my-greenplum-gptext-zookeeper:2181/gptext"
      2,my-greenplum-gptext-solr-1.my-greenplum-gptext-solr,8983,/solr/data-1,"my-greenplum-gptext-zookeeper:2181/gptext"
      EOF
      gpadmin@master-0:~$ chmod 600 /greenplum/data-1/gptext.conf
      gpadmin@master-0:~$ cat <<EOF > /greenplum/data-1/zoo_cluster.conf
      id,host,port,confdir
      1,my-greenplum-gptext-zookeeper,2181,
      EOF
      gpadmin@master-0:~$ chmod 600 /greenplum/data-1/zoo_cluster.conf
      gpadmin@master-0:~$ cat <<EOF > /greenplum/data-1/gptxtenvs.conf
      envname,value
      GPTXTHOME,/opt/gptext
      GPTEXT_CUSTOM_CONFIG_DIR,/data/gptext_conf
      MOUNTED_BINARY,True
      EOF
      gpadmin@master-0:~$ chmod 600 /greenplum/data-1/gptxtenvs.conf
      gpadmin@master-0:~$ source /opt/gptext/greenplum-text_path.sh
      gpadmin@master-0:~$ gptext-installsql gpadmin
      
  6. At this point, you can work with the deployed Greenplum cluster by executing Greenplum utilities from within Kubernetes, or by using a locally-installed tool such as psql to access the Greenplum instance running in Kubernetes. To validate the initial GPText service deployment configuration, follow the instructions in Verifying GPText. Or, to begin working with GPText, see Working With GPText Indexes in the GPText documentation.

Verifying GPText

Follow these steps to quickly verify GPText operation in your new cluster, using downloaded sample data.

  1. Open a bash shell on the master-0 pod:

    $ kubectl exec -it master-0 -- bash
    
  2. Run gptext-state to confirm that the Solr nodes are running. If gptext-state reports some nodes as down, wait for them to start before proceeding.

    gpadmin@master-0:~$ gptext-state
    20200924:19:21:49:014027 gptext-state:master-0:gpadmin-[INFO]:-Execute GPText state ...
    20200924:19:21:50:014027 gptext-state:master-0:gpadmin-[INFO]:-Check zookeeper cluster state ...
    20200924:19:21:54:014027 gptext-state:master-0:gpadmin-[INFO]:-Check GPText cluster status...
    20200924:19:21:54:014027 gptext-state:master-0:gpadmin-[INFO]:-Current GPText Version: 3.4.3
    20200924:19:21:54:014027 gptext-state:master-0:gpadmin-[INFO]:-All nodes are up and running.
    
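    If you script this verification, you can poll gptext-state until it reports all nodes up rather than re-running it by hand. A minimal sketch, intended to run inside the master-0 pod where gptext-state is on the PATH; the helper name wait_for_gptext is hypothetical:

```shell
# Repeatedly run gptext-state until it reports all nodes up, checking
# every 10 seconds for up to ~5 minutes.
wait_for_gptext() {
  local i
  for i in $(seq 1 30); do
    if gptext-state 2>&1 | grep -q "All nodes are up and running"; then
      echo "GPText nodes are up"
      return 0
    fi
    sleep 10
  done
  echo "timed out waiting for GPText nodes" >&2
  return 1
}
```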
  3. Start the psql subsystem:

    $ psql
    
    psql (9.4.24)
    Type "help" for help.
    
    gpadmin=#
    
  4. Query the version of GPText that is installed:

    gpadmin=# select * from gptext.version();
    
                version
    --------------------------------
     Greenplum Text Analytics 3.4.2
    (1 row)
    
  5. Execute the following series of commands to create an external index and add a set of documents (a PDF file and several HTML pages) to the index:

    gpadmin=# SELECT * FROM gptext.create_index_external('gptext-docs');
    
    INFO:  Created index gptext-docs
     create_index_external
    -----------------------
     t
    (1 row)
    
    gpadmin=# SELECT * FROM gptext.index_external(
            '{http://gptext.docs.pivotal.io/archives/GPText-docs-213.pdf,
              http://gptext.docs.pivotal.io/latest/topics/administering.html,
              http://gptext.docs.pivotal.io/latest/topics/ext-indexes.html,
              http://gptext.docs.pivotal.io/latest/topics/function_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/guc_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/ha.html,
              http://gptext.docs.pivotal.io/latest/topics/index.html,
              http://gptext.docs.pivotal.io/latest/topics/indexes.html,
              http://gptext.docs.pivotal.io/latest/topics/intro.html,
              http://gptext.docs.pivotal.io/latest/topics/managed-schema.html,
              http://gptext.docs.pivotal.io/latest/topics/performance.html,
              http://gptext.docs.pivotal.io/latest/topics/queries.html,
              http://gptext.docs.pivotal.io/latest/topics/type_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/upgrading.html,
              http://gptext.docs.pivotal.io/latest/topics/utility_ref.html,
              http://gptext.docs.pivotal.io/latest/topics/installing.html}', 'gptext-docs');
    
     dbid | num_docs
    ------+----------
        2 |       16
    (1 row)    
    
    gpadmin=# SELECT * FROM gptext.commit_index('gptext-docs');
    
     commit_index
    --------------
     t
    (1 row)
    
  6. Perform a simple search to find the text “Solr” in the title field of the example external index:

    gpadmin=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
    
                                id                             |  score   | hs | rf
    -----------------------------------------------------------+----------+----+----
     http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 2.103843 |    |
    (1 row)
    
  7. Optionally, complete additional example tasks described in Using GPText in the Greenplum GPText documentation to learn more about GPText functionality. For example, perform the tutorials in Working With GPText Indexes or Querying GPText Indexes.


    Note: In VMware Tanzu Greenplum for Kubernetes, the scripts used to set the environment for Greenplum Database and GPText are /usr/local/greenplum-db/greenplum_path.sh and /opt/gptext/greenplum-text_path.sh, respectively. These paths differ from the paths used with GPText deployed to non-Kubernetes environments.