Troubleshooting Common Problems
- Enabling Debug Logging
- Read-Only File System Error
- Could Not Find Tiller
- Node Not Labeled
- Forbidden Namespace or Unknown User
- Executable Not Found
- Sandbox Has Changed
- Connection Timed Out
- Unable to Connect to the Server
- Permission Denied Error when Stopping Greenplum
- Socket: too many open files
- PKS Deployment Errors
Enabling Debug Logging
By default, Greenplum for Kubernetes logs info-level messages. You can obtain more detailed log messages for certain problems by changing the log level to debug. Note that changes to the logging level must be applied before the Greenplum Operator is installed.
To change the log level:
1. Go to the operator subdirectory of your Greenplum for Kubernetes software directory. For example:
   $ cd ~/greenplum-for-kubernetes-*/operator
2. Open the values.yaml file in a text editor.
3. To change the default log level to debug, add the following line to the end of the file:
   logLevel: debug
   To revert to default logging, either remove this line or change it to read logLevel: info.
4. Install the Greenplum Operator to use the new logging level, as shown in the sketch below.
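For example, reusing the install command shown later in this guide (a sketch; the release name and overrides file are taken from that example and may differ in your environment):
$ cd ~/greenplum-for-kubernetes-*/
$ grep logLevel operator/values.yaml
logLevel: debug
$ helm install --name greenplum-operator -f workspace/operator-values-overrides.yaml operator/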
Read-Only File System Error
Symptom:
The command kubectl logs <pod-name> shows the error:
install: cannot create directory '/sys/fs/cgroup/devices/kubepods': Read-only file system
Resolution:
The Greenplum for Kubernetes deployment process requires the ability to map the host system’s /sys/fs/cgroup directory onto each container’s /sys/fs/cgroup. Ensure that no kernel security module (for example, AppArmor) uses a profile that disallows mounting /sys/fs/cgroup.
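For example, on an Ubuntu worker node you might check which AppArmor profiles are loaded and which security options the Docker daemon reports (a sketch; it assumes AppArmor tooling and Docker are installed on the node):
$ sudo aa-status
$ docker info --format '{{.SecurityOptions}}'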
Could Not Find Tiller
Symptom:
Error: could not find tiller
Resolution:
Remove any existing helm installation and re-install helm with sufficient privileges. The initialize_helm_rbac.yaml file is available in the top-level directory of the Greenplum for Kubernetes software distribution:
$ helm reset
$ kubectl create -f ./initialize_helm_rbac.yaml
$ helm init --service-account tiller --upgrade --wait
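To verify that Tiller is available again (a sketch; name=tiller is the label Helm applies to the Tiller deployment by default):
$ kubectl get pods --namespace kube-system -l name=tiller
$ helm version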
Node Not Labeled
Symptom:
node "gke-gpdb-test-default-pool-20a900ca-3trh" not labeled
Resolution:
This is common output on GCP. It indicates that the node is already labeled correctly, so no labeling action was necessary.
Forbidden Namespace or Unknown User
Symptom:
namespaces "default" is forbidden: User "system:serviceaccount:kube-system:default" cannot get namespaces in the namespace "default": Unknown user "system:serviceaccount:kube-system:default"
Resolution:
This message indicates that the Kubernetes system is v1.8 or later, which enables role-based access control. On these versions, Helm requires additional permissions (more than the default level). Execute these commands:
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
helm init --service-account tiller --upgrade --wait
Executable Not Found
Symptom:
executable not found in $PATH
Resolution:
This error appears on the Events tab of a container and indicates that xfs is not supported on Container-Optimized OS (COS). To resolve this, use the Ubuntu OS image on the node.
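For example, on GKE you might add a node pool that uses the Ubuntu node image (a sketch; the pool name, cluster name, and node count are placeholders, and you may also need to specify the cluster's zone or region):
$ gcloud container node-pools create ubuntu-pool \
    --cluster=my-gpdb-cluster \
    --image-type=UBUNTU \
    --num-nodes=3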
Sandbox Has Changed
Symptom:
Sandbox has changed
Resolution:
This error appears on the Events tab of a container and indicates that the sysctl settings have become corrupted or failed. To resolve this, remove the sysctl settings from the pod YAML file.
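To locate the settings to remove, you can search the manifest before editing it (a sketch; the file name is a placeholder for your pod YAML file):
$ grep -n -B2 -A4 sysctl my-greenplum-pod.yaml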
Connection Timed Out
Symptom:
getsocket() connection timed
Resolution:
This error can occur when accessing http://localhost:8001/ui, a kubectl proxy address. Make sure that there is a connection between the master and worker nodes where kubernetes-dashboard is running. A network tag on the nodes, such as gpcloud-internal, can establish a route among the nodes.
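For example, on GCP you might apply such a tag to a node (a sketch; the zone is a placeholder and the instance name is taken from the earlier example):
$ gcloud compute instances add-tags gke-gpdb-test-default-pool-20a900ca-3trh \
    --tags=gpcloud-internal --zone=us-west1-a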
Unable to Connect to the Server
Symptom:
kubectl get nodes -w
Unable to connect to the server: x509: certificate is valid for 10.100.200.1, 35.199.191.209, not 35.197.83.225
Resolution:
This error indicates that you have updated the wrong load balancer. Each cluster has its own load balancer for the Kubernetes master, with a certificate for access. Refer to the workspace/samples/scripts/create_pks_cluster_on_gcp.bash script for Bash commands that help determine the master IP address for a given cluster name, and the commands used to attach it to a load balancer.
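One way to confirm the master IP address for a given cluster is the PKS CLI (a sketch; it assumes you are already logged in with pks login, and the cluster name is a placeholder). The output lists the Kubernetes master IP(s) to compare against the load balancer configuration:
$ pks cluster my-gpdb-cluster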
Permission Denied Error when Stopping Greenplum
Symptom:
kubectl exec -it master bash
$ gpstop -u
#
# Exits with Permission denied error
#
20180828:14:34:53:002448 gpstop:master:gpadmin-[CRITICAL]:-Error occurred: Error Executing Command:
Command was: 'ssh -o 'StrictHostKeyChecking no' master ". /opt/gpdb/greenplum_path.sh; $GPHOME/bin/pg_ctl reload -D /greenplum/data-1"'
rc=1, stdout='', stderr='pg_ctl: could not send reload signal (PID: 2137): Permission denied'
Resolution:
This error occurs because of the ssh context that Docker uses. Commands issued to a process must use the same context as the originator of the process. This issue is fixed in recent Docker versions, but the fixes have not yet reached the latest Kubernetes release. To avoid this issue, use the same ssh context that you used to initialize the Greenplum cluster. For example, if you used a kubectl session to initialize Greenplum, then use another kubectl session to run gpstop and stop Greenplum.
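For example (a sketch; the pod name follows the symptom output above, and sourcing greenplum_path.sh may be unnecessary if the gpadmin environment already provides it):
$ kubectl exec -it master bash
$ source /opt/gpdb/greenplum_path.sh
$ gpstop -u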
Socket: too many open files
Symptom:
Executing any kubectl command yields an error similar to:
dial udp 1.2.3.4:53: socket: too many open files
Resolution:
Configure the underlying node to support a larger number of files. See Files in the Node Requirements documentation.
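As an illustration only (a sketch; the actual limits to use come from the Node Requirements documentation, and the values below are placeholders), the relevant settings are the kernel-wide and per-process open-file limits on each worker node:
$ sudo sysctl -w fs.file-max=500000
$ ulimit -n 65536
To persist such changes across reboots, add them to /etc/sysctl.conf and /etc/security/limits.conf respectively.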
PKS Deployment Errors
Greenplum Query Fails to Write an Outgoing Packet
Symptom:
The Greenplum cluster is initialized and running, but a query returns an error similar to:
ERROR: Interconnect error writing an outgoing packet: Invalid argument (seg0 slice1 <ip>:<port> pid=1193)
Resolution:
This error occurs when ports are not garbage-collected quickly enough. The problem is common in systems that run many containers on a single Kubernetes node, where the containers make heavy use of different ports to communicate with one another (as is the case with Greenplum segments).
To work around this problem, set the following sysctl attribute on the worker nodes:
net.ipv4.neigh.default.gc_thresh1 = 30000
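For example, to apply the setting immediately and persist it across reboots on a worker node (a sketch; the file name under /etc/sysctl.d is arbitrary):
$ sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=30000
$ echo 'net.ipv4.neigh.default.gc_thresh1 = 30000' | sudo tee /etc/sysctl.d/90-greenplum-neigh.conf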
Authorization Errors
Symptom
After a certificate change (for example, after any URL change for the UAA domain name), you may see a 401 error from BOSH similar to:
bosh -e pks vms
Using environment '192.168.101.10' as anonymous user
Finding deployments:
Director responded with non-successful status code '401' response 'Not authorized: '/deployments'
'
Exit code 1
Resolution
Go to the credentials web page (similar to https://<ops manager>/infrastructure/director/credentials) and look for the BOSH command line credentials. The credentials look similar to:
{"credential":"BOSH_CLIENT=<Some User> BOSH_CLIENT_SECRET=<Some Secret> BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.168.101.10 bosh "}
In the command for uaac token owner get, use:
* BOSH_CLIENT as the “Client ID”
* BOSH_CLIENT_SECRET as the “Client secret”
* “Admin” as the “User name”
* the associated password from the Uaa Admin User Credentials as the password
For example:
$ uaac token owner get
Client ID: ops_manager
Client secret: ********************************
User name: Admin
Password: ********************************
Cannot Access UAA
Symptom
You can access Ops Manager, but you have problems accessing UAA. For example:
pks login -a https://pks-0.gpcloud.gpdb.pivotal.io:9021 -u dummy -p <some password> -k
Error: Post https://pks-0.gpcloud.gpdb.pivotal.io:8443/oauth/token: dial tcp 35.197.67.138:8443: getsockopt: connection refused
Resolution
This problem can be a symptom of having recycled the VM running the PKS API, such that the external IP address defined in the domain name is out of date. Use gcloud or the Google Cloud Console to determine the current VM for the PKS API. You can distinguish it because it has two labels, “job” and “instance-group”, which both have the value “pivotal container service”.
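One possible way to list candidate VMs and their external IP addresses with gcloud (a sketch; the label filter is an assumption and may need adjusting for your environment):
$ gcloud compute instances list \
    --filter="labels.job:pivotal*" \
    --format="table(name, networkInterfaces[0].accessConfigs[0].natIP)"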
Get the external IP address for this VM and change the DNS definition to that external IP address. Then use a command like:
$ watch dig '<my domain name>'
Wait until the DNS entry is updated for your local workstation. When it updates, try the pks login command again.
Unexpected End of JSON Input
Symptom
You see the following error when you try to deploy the Greenplum Operator using helm:
$ helm install --name greenplum-operator -f workspace/operator-values-overrides.yaml operator/
Error: release greenplum-operator failed: Secret "regsecret" is invalid: data[.dockerconfigjson]: Invalid value: "<secret contents redacted>": unexpected end of JSON input
Resolution
This error indicates that the value specified for dockerRegistryKeyJson in ./workspace/operator-values-overrides.yaml is invalid or missing.
In order to download Greenplum images from a container image registry such as gcr.io, a key.json file is required to provide the authentication secrets. Make sure that the key.json file is in the correct location under the /operator directory, as described in the installation procedure.
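For example, you can regenerate the key file and copy it under the operator directory (a sketch; the service account email is a placeholder, and the destination path assumes the layout used in the installation procedure):
$ gcloud iam service-accounts keys create key.json \
    --iam-account=greenplum-image-pull@my-project.iam.gserviceaccount.com
$ cp key.json ~/greenplum-for-kubernetes-*/operator/key.json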
ImagePullBackOff Error While Deploying Operator
Symptom
When you try to deploy the Greenplum Operator using helm, you see an error similar to:
$ helm install --name greenplum-operator -f workspace/operator-values-overrides.yaml operator/
$ kubectl describe pod -l app=greenplum-operator
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned default/greenplum-operator-79bd8ccbc4-4lbxx to gke-oz-acceptance-default-pool-c7870f59-6h3f
Normal Pulling 1m (x2 over 1m) kubelet, default-pool-c7870f59-6h3f pulling image "greenplum-operator:v0.6.0.dev.103.gadfb9a1"
Warning Failed 1m (x2 over 1m) kubelet, default-pool-c7870f59-6h3f Failed to pull image "greenplum-operator:v0.6.0.dev.103.gadfb9a1": rpc error: code = Unknown desc = Error response from daemon: repository greenplum-operator not found: does not exist or no pull access
Warning Failed 1m (x2 over 1m) kubelet, default-pool-c7870f59-6h3f Error: ErrImagePull
Normal SandboxChanged 1m (x7 over 1m) kubelet, default-pool-c7870f59-6h3f Pod sandbox changed, it will be killed and re-created.
Normal BackOff 1m (x6 over 1m) kubelet, default-pool-c7870f59-6h3f Back-off pulling image "greenplum-operator:v0.6.0.dev.103.gadfb9a1"
Warning Failed 1m (x6 over 1m) kubelet, default-pool-c7870f59-6h3f Error: ImagePullBackOff
Resolution
This error indicates that the value you specified for dockerRegistryKeyJson in ./workspace/operator-values-overrides.yaml points to a service account key that does not have permission to pull images from the specified container registry.
To download Greenplum images from a container image registry such as gcr.io, a key.json file is required to provide the authentication secrets. Make sure that the key.json file contains a valid service account key with permission to pull images from the container registry, as described in Cluster Requirements (for GKE on GCP) or Cluster Requirements (for PKS on GCP).
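For example, on GCP you could grant the service account read access to the images stored for the project's registry (a sketch; the project ID and service account email are placeholders, and the role your registry actually requires may differ):
$ gcloud projects add-iam-policy-binding my-project \
    --member=serviceAccount:greenplum-image-pull@my-project.iam.gserviceaccount.com \
    --role=roles/storage.objectViewer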