Recovering Failed Nodes

Nodes may fail in a Kubernetes cluster for a variety of reasons, including drive failures, memory failures, and network failures. After a node fails, it is up to the Kubernetes cluster operator to recover the node and re-attach it to the cluster. While a node is down, the Kubernetes cluster operates in a degraded state, which can lead to resource constraints on a deployed Greenplum cluster. For example, segment pods previously scheduled on the failed node may not be re-scheduled on the remaining nodes. For these reasons, it is important to recover failed nodes in a timely fashion.
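
To gauge the impact while a node is down, you can check node status and look for Greenplum pods that cannot be scheduled. These are general-purpose checks; substitute the namespace where your Greenplum cluster is deployed:

$ kubectl get nodes
$ kubectl get pods -n <namespace> --field-selector=status.phase=Pending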

About Reapplying Node Labels

After a failed node has been re-created and re-attached to the Kubernetes cluster, manual steps may be necessary to incorporate it into the Greenplum cluster. Greenplum on Kubernetes relies on node labels for some functionality, and these labels are not automatically reapplied when a node is recreated. Therefore, you must manually reapply node labels for the following features in the Operator Manifest.
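
To see which labels a recovered node currently carries (after recreation it typically has none of the Greenplum-related labels), you can inspect it directly:

$ kubectl get node <node name> --show-labels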

Reapply workerSelector labels

If workerSelector is not specified in the manifest, no steps are required to reapply workerSelector labels. If workerSelector is specified in the manifest, you must reapply the appropriate workerSelector label to the new node to indicate whether it belongs to the masterAndStandby or segments workerSelector pool:

$ kubectl label node <node name> <key>=<value>
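
For example, assuming your manifest's segments workerSelector uses the hypothetical label key/value worker=segments, the command would be:

$ kubectl label node <node name> worker=segments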

Reapply antiAffinity labels

If antiAffinity is explicitly set to “no”, no steps are required to reapply antiAffinity node labels. If antiAffinity is set to “yes” or left blank (which defaults to “yes”) and a Greenplum cluster has been deployed, you must reapply the appropriate antiAffinity label(s) to the recovered node. Use the chart below to determine which antiAffinity label(s) to apply in each scenario.

Scenario: There is no masterAndStandby workerSelector specified in the manifest, OR the node has the masterAndStandby workerSelector label applied.
antiAffinity label to apply: masterAndStandby antiAffinity label

Scenario: There is no segments workerSelector specified in the manifest, OR the node has the segments workerSelector label applied.
antiAffinity label to apply: segments antiAffinity label

Scenario: antiAffinity is explicitly set to “no”.
antiAffinity label to apply: no antiAffinity labels are needed
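
To see which antiAffinity labels the existing nodes already carry, which can help confirm the correct label(s) for the recovered node, you can display them as columns. This assumes the label names shown in the sections below:

$ kubectl get nodes -L greenplum-affinity-<namespace>-master -L greenplum-affinity-<namespace>-segment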

Label Master and Standby Nodes

To apply the masterAndStandby antiAffinity label, use the following command:

$ kubectl label node <node name> greenplum-affinity-<namespace>-master=true
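
As an optional check, you can confirm the label was applied by listing the nodes that match it:

$ kubectl get nodes -l greenplum-affinity-<namespace>-master=true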

Label Segment Nodes

To apply the segments antiAffinity label, first determine whether the recovered node should be an “a” or “b” node. Count the existing “a” and “b” nodes by running:

$ kubectl get nodes --show-labels | grep greenplum-affinity-<namespace>-segment=a | wc -l  # Number of "a" nodes
$ kubectl get nodes --show-labels | grep greenplum-affinity-<namespace>-segment=b | wc -l  # Number of "b" nodes

If there are the same number of “a” nodes and “b” nodes, the new node could be either an “a” node or a “b” node. To apply the label, run:

$ kubectl label node <node name> greenplum-affinity-<namespace>-segment=<a or b>

If there are fewer “a” nodes than “b” nodes, the new node should be labeled as “a”. To apply the label, run:

$ kubectl label node <node name> greenplum-affinity-<namespace>-segment=a

If there are fewer “b” nodes than “a” nodes, the new node should be labeled as “b”. To apply the label, run:

$ kubectl label node <node name> greenplum-affinity-<namespace>-segment=b
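
The steps above can also be combined into a small script. The following is a minimal sketch, not part of the product, that counts the existing “a” and “b” nodes and labels the recovered node to keep the two pools balanced (labeling it “a” on a tie). It assumes the node name and namespace are passed as arguments and that the default antiAffinity label names are in use.

#!/bin/bash
# Usage: ./label-segment-node.sh <node name> <namespace>
NODE="$1"
NS="$2"

# Count the existing "a" and "b" segment nodes
A_COUNT=$(kubectl get nodes --show-labels | grep -c "greenplum-affinity-${NS}-segment=a")
B_COUNT=$(kubectl get nodes --show-labels | grep -c "greenplum-affinity-${NS}-segment=b")

# Label the recovered node as whichever side has fewer nodes ("a" on a tie)
if [ "$A_COUNT" -le "$B_COUNT" ]; then
  kubectl label node "$NODE" "greenplum-affinity-${NS}-segment=a"
else
  kubectl label node "$NODE" "greenplum-affinity-${NS}-segment=b"
fi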