Please note: This is a living document which we will update as and when required.
Quick links
1) Inspect the state
- Deploy debug tools
- Inspect VolumeSnapshots
- Inspect stuck processes on I/O
- Inspect Rok master
- Inspect Rok GW
- Inspect Recurring Runs
- Inspect stale locks
- Inspect Rok CSI Controller logs
- Identify Problematic Components
2) Gather Logs
3) How to Recover
- Task is stuck
- Node needs reboot
- The CSI node pod is stuck
- The CSI controller is stuck
- PV still staged on the node
- Rok has no master
- Stale locks
This article is the Arrikto Rok CSI Troubleshooting Runbook. It documents troubleshooting methods and solutions for issues such as:
- New pipelines are stuck without any progress
- Running pipelines never finish
- Rok snapshots become stuck
- Pods stuck in Initializing/Terminating state
- CSI node Pod stuck Terminating
- Rok has no master
How Rok Tasks Work
EKF uses Rok snapshot tasks to capture the state of pipeline steps and create Notebook backups. When a Rok task is created, it is picked up by the Rok task daemon (taskd) for execution. Each task has a scheduled timestamp at which it is supposed to run, meaning that tasks can be scheduled to run in the future. Rok snapshot policies use this feature to schedule their next run ahead of time. The task daemon uses three different pools of workers to run tasks: one for parent tasks (level 0 tasks), one for children (level 1 tasks), and one for grandchildren (level 2 tasks). Each worker pool has 16 workers, for a total of 48 across all task levels. The lifetime of a task is represented by its status, for example Running while the task executes or Error if it fails or is interrupted.
Rok snapshot tasks use Kubernetes VolumeSnapshots to snapshot Kubernetes resources. When snapshotting a resource such as a Notebook or Pod, which can use more than one PVC, Rok creates a parent task with a child task for each PVC. Each child task creates a VolumeSnapshot for that PVC in Kubernetes and waits for it to complete. Currently, Rok tasks do not use a timeout, meaning they run until the VolumeSnapshot either completes successfully or fails with a non-recoverable error. After the VolumeSnapshot completes, Rok creates a Rok API object with the snapshotted data and deletes the VolumeSnapshot. Finally, if the execution of a task is interrupted, e.g., because the Rok master Pod was restarted, taskd cleans up any previously running tasks by deleting the VolumeSnapshot (if it exists) and setting their status to Error.
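A quick way to observe this from the Kubernetes side is to list the VolumeSnapshots that child tasks create; the READYTOUSE column shows whether each snapshot has completed. This uses the standard VolumeSnapshot API, not a Rok-specific tool:
kubectl get volumesnapshots -A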
Inspect State
Deploy debug tools
1. Download the latest rok-tools manifest:
https://storage.googleapis.com/arrikto/downloads/rok-tools-debug.yaml
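For example, assuming curl is available on your workstation, download it under the filename that the sed command in the next step expects:
curl -fL -o rok-tools-eks.yaml https://storage.googleapis.com/arrikto/downloads/rok-tools-debug.yaml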
2. Rename rok-tools to rok-tools-debug:
sed 's/\(.*\): rok-tools/\1: rok-tools-debug/g' rok-tools-eks.yaml > rok-tools-debug.yaml
3. Apply the rok-tools-debug manifest:
kubectl apply -f rok-tools-debug.yaml
4. Exec into the pod:
kubectl exec -ti sts/rok-tools-debug -- bash
Inspect VolumeSnapshots
rok-inspect volumesnapshots
Inspect stuck processes on I/O
kubectl get pods -n rok -l app=rok-csi-node -o name | \
    while read csinode; do
        if [[ $(kubectl exec -n rok $csinode -c csi-node -- ps -eos | grep D) ]]; then
            echo $csinode
        fi
    done | \
    xargs -r -n1 kubectl get -n rok -o json | \
    jq -r '.metadata.name,.spec.nodeName' | paste - -

rok-csi-node-8vt6j    ip-192-168-152-207.eu-central-1.compute.internal
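To see the stuck processes themselves, exec into one of the reported Pods and list the processes in the D (uninterruptible I/O) state. Here ${csinode?} is a placeholder for one of the Pod names printed above, and the command assumes the same procps-style ps used above:
kubectl exec -n rok ${csinode?} -c csi-node -- ps -eo s,pid,args | grep '^D'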
Inspect Rok master
1. Find the master pod
kubectl get pods -n rok -l app=rok,role=master
2. Inspect election locks
kubectl exec -ti -n rok ds/rok -- \
3. Inspect members
kubectl exec -ti -n rok ds/rok -- \
Inspect Rok GW
1. Inspect task summary:
rok --all-accounts task-list --summary
Status Level 0 Level 1 Level 2 Total
2. List running tasks:
rok --all-accounts task-list --status running --progress
Task ID Account Bucket Action Status Scheduled At Progress Running time
3. Inspect policies:
rok --all-accounts policy-list --schedule
Policy ID Account Bucket Action Status Next Run At Description Schedule
Inspect Recurring Runs
kubectl get swf -A -o json | jq -c ".items[] | [.metadata.namespace, .metadata.name, .spec.trigger]"
["kubeflow-user","myrunf9mgh",{"periodicSchedule":{"intervalSecond":600}}]
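Recurring runs with short intervals keep taskd busy. Using the same jq path as above, the following sketch lists runs that trigger more often than once per hour (adjust the threshold as needed):
kubectl get swf -A -o json | jq -c '.items[] | select(.spec.trigger.periodicSchedule.intervalSecond != null and .spec.trigger.periodicSchedule.intervalSecond < 3600) | [.metadata.namespace, .metadata.name, .spec.trigger.periodicSchedule.intervalSecond]'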
Inspect stale locks
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 \
    --dlm-namespace election lock-list | grep -v -w up
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 \
    --dlm-namespace composer lock-list | grep -v -w up
Inspect Rok CSI Controller logs
kubectl logs -n rok rok-csi-controller-0 -c csi-controller | grep "still staged"
Identify Problematic Components
=> If it takes more than 15 minutes without any progress from the last timestamp, consider it stuck. This may be a false alarm for big snapshots.
=> csi-node is stuck
=> taskd is stuck
=> csi-controller is stuck: it does not process requests from the csi-snapshotter sidecar.
=> corresponding csi-node is stuck: csi-controller times out when trying to create the underlying snapshot.
=> CSI is stuck: after 30 minutes the VolumeSnapshot will become ERROR. In those cases the VolumeSnapshot cannot be deleted and the sidecar retries the request.
=> taskd queue is full
=> taskd queue is full => the system is under heavy load => check policies => check recurring runs => taskd is stuck
=> CSI node GC failed => restart the corresponding CSI node Pod
=> reboot the corresponding node
=> a dead member is holding the master_lease lock
=> the corresponding node is not ready
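As a quick triage step, the following sketch (assuming the v1 VolumeSnapshot API, where status.readyToUse reports completion) lists VolumeSnapshots that have not completed yet, together with their creation time:
kubectl get volumesnapshots -A -o json | jq -r '.items[] | select(.status.readyToUse != true) | [.metadata.namespace, .metadata.name, .metadata.creationTimestamp] | @tsv'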
Gather logs
Before taking any recovery action, gather the following logs:
- Create a temp dir
mkdir -p ekf-logs && pushd ekf-logs
- Get all volume snapshots
kubectl get volumesnapshots -A -o json > volumesnapshots.json
- Get all volumesnapshotcontents
kubectl get volumesnapshotcontent -A -o json > volumesnapshotcontent.json
- Inspect state
rok-inspect volumesnapshots > vs.txt
- Rok master pod logs
kubectl logs -n rok svc/rok -c rok > master.log
- Rok master internal logs
mkdir -p master
- Rok CSI controller logs
kubectl logs -n rok rok-csi-controller-0 --all-containers > rok-csi-controller.log
- Rok CSI node logs
kubectl get pods -n rok -l app=rok-csi-node -o name | cut -d/ -f2 | \
    while read pod; do
        # One log file per CSI node Pod, mirroring the controller command above
        kubectl logs -n rok ${pod} --all-containers > ${pod}.log
    done
- Create a tarball
popd && tar -czf ekf-logs.tar.gz ekf-logs
How to Recover
Task is stuck
- Delete Rok master pod
kubectl delete pods -n rok -l role=master,app=rok
- All running tasks will be aborted.
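- Once the new master Pod is Running, you can verify from the rok-tools-debug pod that tasks are being processed again by re-running the task summary:
rok --all-accounts task-list --summary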
Node needs reboot
- Specify the node
node=ip-100-75-8-21.ec2.internal
- Find the corresponding CSI node pod.
pod=$(kubectl get pods -n rok -lapp=rok-csi-node --field-selector=spec.nodeName=$node -o name)
- Exec into the pod:
kubectl exec -ti -n rok ${pod} -c csi-node -- bash
- Reboot the node
echo b > /proc/sysrq-trigger
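- After triggering the reboot, wait from your workstation for the node to become Ready again, for example:
kubectl get node ${node?} -w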
The CSI node pod is stuck
- Find the corresponding CSI node pod.
pod=rok-csi-node-4r6wk
- Delete the CSI node pod
kubectl delete pods -n rok ${pod?}
- To delete all Rok CSI node Pods
kubectl delete pods -n rok -l app=rok-csi-node
- All in-flight CSI operations, such as VolumeSnapshot creation, will be aborted.
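- The Pods are recreated automatically by their controller; confirm that a CSI node Pod is back up on every node:
kubectl get pods -n rok -l app=rok-csi-node -o wide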
The CSI controller is stuck
- Delete Rok CSI controller pod
kubectl delete pods -n rok rok-csi-controller-0
- This is harmless and will not impact any running jobs.
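- The Pod is recreated automatically; wait until it is Running again before retrying the operation:
kubectl get pods -n rok rok-csi-controller-0 -w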
PV still staged on the node
- Find out the affected node from the Rok CSI Controller error message:
GRPCAborted: Volume `pvc-7c2792a2-2b0d-4b06-84c5-fc7acb1c1791' is still staged on node `ip-100-75-8-21.ec2.internal'
- Find the CSI Node Pod running on this node:
kubectl get pods -n rok -l app=rok-csi-node --field-selector=spec.nodeName=ip-100-75-8-21.ec2.internal -o name
- Delete the corresponding CSI Node Pod:
kubectl delete -n rok pod/rok-csi-node-9ptt9
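- After the replacement Pod starts, confirm that the controller no longer reports the volume as staged (checking only recent log entries):
kubectl logs -n rok rok-csi-controller-0 -c csi-controller --since=5m | grep "still staged"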
Rok has no master
- Find the dead Pod/member that is holding the master_lease lock (see the example commands below)
- Find the corresponding node
- Reboot the node if it is not Ready (see Node needs reboot above)
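- For the first two steps, reuse the inspection commands from above: list the election locks to see which member holds master_lease, then map the Rok Pods to their nodes to find the dead one:
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace election lock-list
kubectl get pods -n rok -l app=rok -o wide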
Stale locks
- Break stale locks in the election namespace:
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace election lock-break -y
- Break stale locks in the composer namespace:
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace composer lock-break -y
- Break a dead client (member) in the election namespace:
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace election client-break -y --force --member-id ${ROK_MEMBER_ID?}
To break a specific lock in the composer namespace:
a. List the locks to find the lock to break, e.g., one with an unknown client state. Note the resource and the lock ID of that lock.
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace composer lock-list
b. Break the lock you identified:
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace composer lock-break --lock-id ${LOCK_ID} --resource ${RESOURCE} --force
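As a final check, list the locks again and make sure that only healthy (up) clients remain:
kubectl exec -ti -n rok-system sts/rok-operator -- \
    rok-dlm --etcd-endpoint http://rok-etcd.rok.svc.cluster.local:2379 --dlm-namespace composer lock-list | grep -v -w up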