Troubleshooting Common Problems
- Enabling Debug Logging
- Read-Only File System Error
- Node Not Labeled
- Executable Not Found
- Sandbox Has Changed
- Connection Timed Out
- Unable to Connect to the Server
- Permission Denied Error when Stopping Greenplum
- Socket: too many open files
- No matches for CustomResourceDefinition on helm install/upgrade
- Cannot pull images
- Failures During Initialization (Pending State)
- Failures During Expansion (Pending State)
- Failures During Installation After An Incomplete Uninstallation
- VMware Tanzu Kubernetes Grid Integrated (TKGI) Edition Deployment Errors
- Troubleshooting GreenplumPXFService pod startup
- Troubleshooting GPText errors
Enabling Debug Logging
By default, VMware Tanzu Greenplum for Kubernetes logs info-level messages. You can obtain more detailed log messages for certain problems by changing the log level to debug. Note that changes to the logging level must be applied before the Greenplum Operator is installed.
To change the log level:
1. Go to the operator subdirectory of your VMware Tanzu Greenplum software directory. For example:
   $ cd ~/greenplum-for-kubernetes-*/operator
2. Open the values.yaml file in a text editor.
3. To change the default log level to debug, add the following line to the end of the file:
   logLevel: debug
   To revert to default logging, either remove this line or change it to read logLevel: info.
4. Install the Greenplum Operator to use the new logging level.
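For example, you can verify the setting and then install the operator with the new log level (a sketch, assuming the release name greenplum-operator and the directory layout of the downloaded software; the overrides file is optional):
$ cd ~/greenplum-for-kubernetes-*
$ grep logLevel operator/values.yaml        # confirm the setting was added
logLevel: debug
$ helm install greenplum-operator -f workspace/operator-values-overrides.yaml operator/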
Read-Only File System Error
Symptom:
The command kubectl logs <pod-name> shows the error:
install: cannot create directory '/sys/fs/cgroup/devices/kubepods': Read-only file system
Resolution:
The VMware Tanzu Greenplum deployment process requires the ability to map the host system's /sys/fs/cgroup directory onto each container's /sys/fs/cgroup. Ensure that no kernel security module (for example, AppArmor) uses a profile that disallows mounting /sys/fs/cgroup.
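To check whether AppArmor is active on a node, you can inspect it directly; a sketch, run on the node itself (the paths shown are standard on Ubuntu-based nodes):
$ cat /sys/module/apparmor/parameters/enabled      # prints Y if AppArmor is enabled
$ sudo aa-status | head -n 1                       # summarizes the loaded AppArmor profiles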
Node Not Labeled
Symptom:
node "gke-gpdb-test-default-pool-20a900ca-3trh" not labeled
Resolution:
This is a common output from GCP. It indicates that the node is already labeled correctly, so no label action was necessary.
Executable Not Found
Symptom:
executable not found in $PATH
Resolution:
This error appears on the Events tab of a container to indicate that xfs is not supported on Container-Optimized OS (COS). To resolve this, use the Ubuntu OS image on the node.
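For example, on GKE you can confirm each node's image and, if necessary, add an Ubuntu-based node pool (a sketch; the cluster and pool names are placeholders):
$ kubectl get nodes -o wide          # the OS-IMAGE column shows each node's image
$ gcloud container node-pools create ubuntu-pool \
    --cluster=<my-cluster> --image-type=UBUNTU_CONTAINERD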
Sandbox Has Changed
Symptom:
Sandbox has changed
Resolution:
This error appears on the Events tab of a container to indicate that sysctl settings have become corrupted or failed. To resolve this, remove the sysctl settings from the pod YAML file.
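To check whether a pod still carries sysctl settings, inspect its manifest; a minimal sketch, assuming a pod named segment-a-0:
$ kubectl get pod segment-a-0 -o yaml | grep -A 5 sysctls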
Connection Timed Out
Symptom:
getsocket() connection timed
Resolution:
This error can occur when accessing http://localhost:8001/ui, a kubectl proxy address. Make sure that there is a connection between the master and worker nodes where kubernetes-dashboard is running. A network tag on the nodes, such as gpcloud-internal, can establish a route among the nodes.
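On GCP, a network tag can be added to an existing node VM with gcloud; a sketch, using the node name reported by kubectl get nodes (the zone is a placeholder):
$ gcloud compute instances add-tags gke-gpdb-test-default-pool-20a900ca-3trh \
    --tags=gpcloud-internal --zone=<zone>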
Unable to Connect to the Server
Symptom:
kubectl get nodes -w
Unable to connect to the server: x509: certificate is valid for 10.100.200.1, 35.199.191.209, not 35.197.83.225
Resolution:
This error indicates that you have chosen to update the wrong Load Balancer. Each cluster has its own load balancer for the Kubernetes master, with a certificate for access. Refer to the workspace/samples/scripts/create_pks_cluster_on_gcp.bash script for Bash commands that help to determine the master IP address for a given cluster name, and the commands used to attach to a Load Balancer.
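To list the load balancer front ends and their IP addresses on GCP, so you can match the certificate's IP addresses to the correct cluster (a sketch):
$ gcloud compute forwarding-rules list
$ gcloud compute target-pools list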
Permission Denied Error when Stopping Greenplum
Symptom:
kubectl exec -it master bash
$ gpstop -u
# Exits with a Permission denied error:
20180828:14:34:53:002448 gpstop:master:gpadmin-[CRITICAL]:-Error occurred: Error Executing Command:
Command was: 'ssh -o 'StrictHostKeyChecking no' master ". /usr/local/greenplum-db/greenplum_path.sh; $GPHOME/bin/pg_ctl reload -D /greenplum/data-1"'
rc=1, stdout='', stderr='pg_ctl: could not send reload signal (PID: 2137): Permission denied'
Resolution:
This error occurs because of the ssh context that Docker uses. Commands that are issued to a process must use the same context as the originator of the process. This issue is fixed in recent Docker versions, but the fixes have not reached the latest Kubernetes release. To avoid this issue, use the same ssh context that you used to initialize the Greenplum cluster. For example, if you used a kubectl session to initialize Greenplum, then use another kubectl session to run gpstop and stop Greenplum.
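For example, stopping Greenplum from a fresh kubectl session (a sketch, assuming the master pod is named master-0 as elsewhere in this guide):
$ kubectl exec -it master-0 -- bash -c \
    "source /usr/local/greenplum-db/greenplum_path.sh && gpstop -a"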
Socket: too many open files
Symptom:
Executing any kubectl command yields an error similar to:
dial udp 1.2.3.4:53: socket: too many open files
Resolution:
Configure the underlying node to support a larger number of files. See Files in the Node Requirements documentation.
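To inspect and raise the open-file limits on a node, commands like the following can be used (a sketch; see the Node Requirements documentation for the recommended value):
$ ulimit -n                            # per-process open file limit in the current shell
$ cat /proc/sys/fs/file-max            # system-wide open file limit
$ sudo sysctl -w fs.file-max=<limit>   # raise the system-wide limit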
No matches for CustomResourceDefinition on helm install/upgrade
Symptom
When using a version of Greenplum for Kubernetes that is v2.0.0 or above with a Kubernetes version earlier than 1.16, helm install or upgrade fails with the following error:
unable to recognize "": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1"
Resolution
Upgrade Kubernetes to version 1.16 or higher and redeploy the operator.
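To confirm the Kubernetes server version before redeploying:
$ kubectl version --short      # the Server Version line shows the cluster version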
Cannot pull images
Symptom
When you try to deploy the Greenplum Operator using helm, you see an error similar to:
$ helm install greenplum-operator -f workspace/operator-values-overrides.yaml operator/
$ kubectl describe pod -l app=greenplum-operator
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned default/greenplum-operator-79bd8ccbc4-4lbxx to gke-oz-acceptance-default-pool-c7870f59-6h3f
Normal Pulling 1m (x2 over 1m) kubelet, default-pool-c7870f59-6h3f pulling image "greenplum-operator:v0.6.0.dev.103.gadfb9a1"
Warning Failed 1m (x2 over 1m) kubelet, default-pool-c7870f59-6h3f Failed to pull image "greenplum-operator:v0.6.0.dev.103.gadfb9a1": rpc error: code = Unknown desc = Error response from daemon: repository greenplum-operator not found: does not exist or no pull access
Warning Failed 1m (x2 over 1m) kubelet, default-pool-c7870f59-6h3f Error: ErrImagePull
Normal SandboxChanged 1m (x7 over 1m) kubelet, default-pool-c7870f59-6h3f Pod sandbox changed, it will be killed and re-created.
Normal BackOff 1m (x6 over 1m) kubelet, default-pool-c7870f59-6h3f Back-off pulling image "greenplum-operator:v0.6.0.dev.103.gadfb9a1"
Warning Failed 1m (x6 over 1m) kubelet, default-pool-c7870f59-6h3f Error: ImagePullBackOff
Resolution
This error indicates that the created regsecret does not have permission to pull images from the specified container registry, or that the secret was never created. Make sure the regsecret docker-registry secret is created and contains appropriate privileges to fetch images.
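The secret can be created with kubectl; a sketch, where the registry address and credentials are placeholders for your environment:
$ kubectl create secret docker-registry regsecret \
    --docker-server=<registry-url> \
    --docker-username=<username> \
    --docker-password=<password>
$ kubectl get secret regsecret      # verify that the secret now exists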
Failures During Initialization (Pending State)
Symptom
When creating a Greenplum instance, initialization can fail due to unavailable resources. For example:
- The number of available nodes may not be able to accommodate the specified segment count.
- Required labels may not be available on nodes, as specified in the configuration.
In either of the above cases, the segment pods will remain in the Pending state until the required resources become available.
Resolution
- Try to increase the number of nodes in your existing Kubernetes cluster. For example, add more node pools in your GKE cluster.
- If you cannot increase the node size, then reduce the CPU or memory resources allocated to each pod in the manifest file. Then, delete the existing GreenplumCluster, including PVCs, and re-apply the configuration.
- If enough nodes are available but pods remain in the Pending state, verify that your nodes are properly labeled to match your deployment specification, as shown in the sketch below.
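To see why a pod is stuck in Pending and to compare node labels against your deployment specification (the pod name is illustrative):
$ kubectl get pods -o wide            # find pods stuck in Pending
$ kubectl describe pod segment-a-0    # the Events section shows scheduling failures
$ kubectl get nodes --show-labels     # compare against the labels in your manifest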
Failures During Expansion (Pending State)
Symptom
When you try to expand a Greenplum instance, the expansion can fail due to unavailable resources. For example:
- The number of available nodes may not be able to accommodate the new segment count.
- Required labels may not be available on nodes, as specified in the configuration.
In either of the above cases, the segment pods will remain in the Pending state until the required resources become available.
Resolution
- Try to increase the number of nodes in your existing Kubernetes cluster. For example, add more node pools in your GKE cluster.
- If enough nodes are available but pods remain in the Pending state, verify that your nodes are properly labeled to match your deployment specification.
- After the new or reconfigured segment pods are up and Running, perform these steps to recover from the failed expansion process (see the combined session sketch below):
  1. Execute kubectl exec -it master-0 bash to log in to the master pod.
  2. Execute the rollback command gpexpand -r -D gpadmin to roll back any changes.
  3. Restart the cluster with gpstart -a if the Greenplum cluster has already stopped.
  4. Execute the command gpexpand -D gpadmin -i /tmp/gpexpand_config to perform the Greenplum expansion.
  5. Log out of master-0 and run kubectl delete job.batch/my-greenplum-gpexpand-job to delete the failed gpexpand jobs.
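Combined into a single session, the recovery might look like this (a sketch; the job name my-greenplum-gpexpand-job assumes a cluster named my-greenplum):
$ kubectl exec -it master-0 bash
gpadmin@master-0:~$ gpexpand -r -D gpadmin                          # roll back the failed expansion
gpadmin@master-0:~$ gpstart -a                                      # restart the cluster if it stopped
gpadmin@master-0:~$ gpexpand -D gpadmin -i /tmp/gpexpand_config     # re-run the expansion
gpadmin@master-0:~$ exit
$ kubectl delete job.batch/my-greenplum-gpexpand-job                # remove the failed expansion job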
Failures During Installation After An Incomplete Uninstallation
Symptom
After reinstalling the Greenplum Operator and attempting to create a Greenplum cluster, you see this error:
error: unable to recognize "workspace/my-gp-instance.yaml": no matches for kind "GreenplumCluster" in version "greenplum.pivotal.io/v1"
This might happen if you uninstalled the Greenplum Operator without deleting all previously deployed Greenplum cluster resources. To avoid this problem in the future, make sure to follow all of the steps in Uninstalling VMware Tanzu Greenplum for Kubernetes.
Resolution
Uninstall the Greenplum Operator helm chart and reinstall it.
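For example, with Helm 3 (a sketch, assuming the release name greenplum-operator used elsewhere in this guide):
$ helm uninstall greenplum-operator
$ kubectl get crd | grep greenplum      # verify whether the GreenplumCluster CRD is registered
$ helm install greenplum-operator operator/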
VMware Tanzu Kubernetes Grid Integrated (TKGI) Edition Deployment Errors
Greenplum Query Fails to Write an Outgoing Packet
Symptom:
The Greenplum cluster is initialized and running, but a query returns an error similar to:
ERROR: Interconnect error writing an outgoing packet: Invalid argument (seg0 slice1 <ip>:<port> pid=1193)
Resolution:
This error occurs when ports are not garbage collected quickly enough. The problem is commonplace in systems that have many containers on a single Kubernetes node, and the containers heavily use different ports to communicate with one another (as is the case with Greenplum segments).
To work around this problem, set the following sysctl attribute on the worker nodes:
net.ipv4.neigh.default.gc_thresh1 = 30000
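The setting can be applied at runtime and persisted across reboots; a sketch, run on each worker node:
$ sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=30000                              # apply immediately
$ echo 'net.ipv4.neigh.default.gc_thresh1 = 30000' | sudo tee -a /etc/sysctl.conf     # persist the setting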
Authorization Errors
Symptom
After a certificate change (after any URL change for the UAA domain name), you may see a 401 error from BOSH similar to:
bosh -e pks vms
Using environment '192.168.101.10' as anonymous user
Finding deployments:
Director responded with non-successful status code '401' response 'Not authorized: '/deployments'
'
Exit code 1
Resolution
Go to the credentials web page (similar to https://<ops manager>/infrastructure/director/credentials) and look for the Bosh command line credentials. The credentials look similar to:
{"credential":"BOSH_CLIENT=<Some User> BOSH_CLIENT_SECRET=<Some Secret> BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.168.101.10 bosh "}
In the uaac token owner get command, use:
* BOSH_CLIENT as the "Client ID"
* BOSH_CLIENT_SECRET as the "Client secret"
* "Admin" for the User name
* the associated password from Uaa Admin User Credentials
For example:
$ uaac token owner get
Client ID: ops_manager
Client secret: ********************************
User name: Admin
Password: ********************************
Cannot Access UAA
Symptom
You can access Ops Manager, but you have problems accessing UAA. For example:
pks login -a https://pks-0.gpcloud.gpdb.pivotal.io:9021 -u dummy -p <some password> -k
Error: Post https://pks-0.gpcloud.gpdb.pivotal.io:8443/oauth/token: dial tcp 35.197.67.138:8443: getsockopt: connection refused
Resolution
This problem can be a symptom of having recycled the VM running the TKGI API, such that the external IP address defined in the domain name is old. Use gcloud or Google Cloud Console to determine the current VM for the VMware Tanzu Kubernetes Grid Integrated (TKGI) Edition API. You can distinguish it because it has two labels, "job" and "instance-group", which both have the value "pivotal container service".
Get the external IP address of this VM and change the DNS definition to that external IP address. Use a command like:
$ watch dig '<my domain name>'
Wait to see that the DNS entry is updated for your local workstation. When it updates, try the pks login command again.
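To locate the TKGI API VM from the command line, a label filter like the following can be used (a sketch; the exact label value may differ in your environment):
$ gcloud compute instances list --filter='labels.job~pivotal'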
Troubleshooting GreenplumPXFService pod startup
Note: When diagnosing errors during GreenplumPXFService startup, it may help to resize the deployment to 1 replica to make it easier to find the most recent pod. Be sure to set the replicas back to the desired number (2 by default) once the issue has been resolved.
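One way to resize, assuming the PXF pods are managed by a Deployment named my-greenplum-pxf as in the examples below (alternatively, edit the replicas field of the GreenplumPXFService manifest and re-apply it):
$ kubectl scale deployment my-greenplum-pxf --replicas=1
$ kubectl scale deployment my-greenplum-pxf --replicas=2      # restore the default afterwards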
CreateContainerConfigError state
Symptom
When you attempt to deploy PXF, the GreenplumPXFService pods go into the CreateContainerConfigError state and you see an error similar to:
kubectl describe pod -l app=greenplum-pxf
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m48s default-scheduler Successfully assigned default/my-greenplum-pxf-8487655d9b-sxxsh to minikube
Warning Failed 49s (x12 over 2m47s) kubelet, minikube Error: secret "my-greenplum-pxf-config" not found
Normal Pulled 38s (x13 over 2m47s) kubelet, minikube Container image "greenplum-for-kubernetes:latest" already present on machine
Warning Failed 49s (x12 over 2m47s) kubelet, minikube Error: secret "my-greenplum-pxf-config" not found
Resolution
This is likely caused by one of two issues:
- The secret name is not specified correctly in the GreenplumPXFService manifest file.
- The secret does not exist in the Kubernetes API.
If it is the first issue, correct the manifest file and re-apply.
If it is the second issue, create the necessary secret, as sketched below. In this case, you will need to delete the GreenplumPXFService before re-applying it to ensure that the pods can locate the new secret.
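A generic secret with the expected name can be created from a local configuration file; a sketch, where the file path and the manifest file name are placeholders:
$ kubectl create secret generic my-greenplum-pxf-config --from-file=<path-to-config-file>
$ kubectl delete -f <my-greenplum-pxf-manifest>.yaml
$ kubectl apply -f <my-greenplum-pxf-manifest>.yaml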
Failure to download PXF configuration
Symptom
When you attempt to deploy PXF, the GreenplumPXFService pods go into the Error or CrashLoopBackOff state. When you check the logs for the GreenplumPXFService pod, you see an error similar to:
{"level":"INFO","ts":"2020-03-07T00:19:58.539Z","logger":"startPXF","msg":"downloading PXF configuration", "bucket", "my-bucket", "folder", "my-folder"}
{"level":"ERROR","ts":"2020-03-07T00:19:59.089Z","logger":"startPXF","msg":"PXF failed to start","error", <error message>
The exact error message will vary.
Resolution
This indicates that the PXF pod was unable to download configuration objects from S3.
- Check the error message to discover what went wrong. Some example error messages are:
> "error":"s3 list objects: Access denied."
>
> "error":"s3 list objects: The request signature we calculated does not match the signature you provided. Check your Google secret key and signing method."
>
> "error":"s3 list objects: The specified bucket does not exist."
>
> "error":"no objects found in pxfConf.s3Source location:"
- Check to make sure you are using the correct key to access your S3 data.
- Check the permissions on your bucket and files.
- Check your manifest for your GreenplumPXFService. Make sure the pxfConf.s3Source field contains the correct values for endpoint, bucket, and folder.
If the solution to the error involved editing the GreenplumPXFService manifest, simply re-apply the file and the pods will be recreated.
If the solution did not involve editing the manifest, such as changing the secret or making changes in S3, delete the GreenplumPXFService and re-apply to see the changes.
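The S3 side can be verified independently of Kubernetes with the AWS CLI; a sketch, where the bucket, folder, and endpoint are placeholders matching your pxfConf.s3Source settings:
$ aws s3 ls s3://<bucket>/<folder>/ --endpoint-url https://<endpoint>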
Warning from PXF conf directory structure
Symptom
When you check the logs of a GreenplumPXFService pod, you see the following warning:
warning: PXF S3 conf data doesn’t look like a PXF conf directory. Please check pxfConf.s3Source setting
Resolution
This warning is emitted when the PXF pod is able to download objects from S3, but they did not follow the expected structure of a PXF configuration directory, as described in the PXF documentation.
- Check your manifest for your GreenplumPXFService. Make sure the pxfConf.s3Source field contains the correct values for bucket and folder.
- Check the contents of S3. They should contain at least one of the following directories (a sync sketch follows the list):
conf/
keytabs/
lib/
logs/
servers/
templates/
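If you maintain a local PXF configuration directory, it can be synced to the expected S3 location; a sketch, with placeholder bucket and folder names:
$ aws s3 sync ./pxf-conf s3://<bucket>/<folder>/
$ aws s3 ls s3://<bucket>/<folder>/      # should now list conf/, lib/, servers/, and so on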
Troubleshooting GPText errors
Solr errors when running gptext.search
Symptom
The Solr service can take several minutes to start running and make existing indexes available to query after pods are created or recreated. Solr pods may be in the Running state and show that they are Ready, but some gptext commands may not be available. Users may encounter some of the following errors when running gptext.search:
gpadmin=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
ERROR: Function "gptext.__index_context_input(cstring)": Code: UNKNOWN, Message: '<CURL>:Couldn't resolve host name.' (UDF.cpp:215)
LINE 1: ... FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-do...
gpadmin=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
ERROR: Function "gptext.search(anytable,gptext.__index_context,text,text[],text)": There's no active replicas for shard 'shard1'Function: PickShardsReplicasAndNodeUrls, File: IndexSearchContext.cpp(361). (UDF.cpp:215) (seg0 slice2 10.0.1.19:40000 pid=252) (UDF.cpp:215)
gpadmin=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
id | score | hs | rf
-----------------------------------------------------------+-----------+----+----
http://gptext.docs.pivotal.io/latest/topics/type_ref.html | 0.8142733 | |
(1 row)
gpadmin=# SELECT * FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-docs', 'title:Solr', null, null);
ERROR: Function "gptext.__index_context_input(cstring)": Code: RUNTIME_ERROR, Message: 'Solr service down. No live nodes exist. Root: '/gptext'Expression: 'SolrConfig::Instance()->ChooseOneLiveNode(&live_node)'.' (UDF.cpp:215)
LINE 1: ... FROM gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gptext-do...
Resolution
Wait for Solr to become ready. Some options to track the progress:
gptext-state --details can be used to watch when Solr nodes become available and to track index status. Wait until the index status shows Green.
gpadmin@master-0:~$ gptext-state --details
20200924:18:38:07:013715 gptext-state:master-0:gpadmin-[INFO]:-Execute GPText state ...
20200924:18:38:07:013715 gptext-state:master-0:gpadmin-[INFO]:-Check zookeeper cluster state ...
20200924:18:38:11:013715 gptext-state:master-0:gpadmin-[INFO]:-Check GPText cluster status...
20200924:18:38:11:013715 gptext-state:master-0:gpadmin-[INFO]:-Current GPText Version: 3.4.3
20200924:18:38:11:013715 gptext-state:master-0:gpadmin-[INFO]:-All nodes are up and running.
20200924:18:38:12:013715 gptext-state:master-0:gpadmin-[INFO]:------------------------------------------------
20200924:18:38:12:013715 gptext-state:master-0:gpadmin-[INFO]:-Index state details.
20200924:18:38:12:013715 gptext-state:master-0:gpadmin-[INFO]:------------------------------------------------
20200924:18:38:12:013715 gptext-state:master-0:gpadmin-[INFO]:- database index name state
20200924:18:38:12:013715 gptext-state:master-0:gpadmin-[INFO]:- gpadmin gpadmin.public.gptext_demo Green
20200924:18:38:12:013715 gptext-state:master-0:gpadmin-[INFO]:-Done.
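Because readiness can take several minutes, the command can be polled from the master pod; a sketch:
gpadmin@master-0:~$ watch -n 10 gptext-state --details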
Inside the database, gptext.index_status can also be used for a specific index if you know it should have populated data. If the query result shows 0 rows, or the replica state shows down or recovering, the index is not ready yet.
gpadmin=# SELECT index_name, replica_state FROM gptext.index_status('gpadmin.public.gptext_demo');
index_name | replica_state
----------------------------+---------------
(0 rows)
gpadmin=# SELECT index_name, replica_state FROM gptext.index_status('gpadmin.public.gptext_demo');
index_name | replica_state
----------------------------+---------------
gpadmin.public.gptext_demo | down
gpadmin.public.gptext_demo | active
(2 rows)
When all replica states show active, the index should be ready.
gpadmin=# SELECT index_name, replica_state FROM gptext.index_status('gpadmin.public.gptext_demo');
index_name | replica_state
----------------------------+---------------
gpadmin.public.gptext_demo | active
gpadmin.public.gptext_demo | active
(2 rows)