Navigating Device Failures in Kubernetes Pods

Kubernetes has become the linchpin of modern container orchestration, but as AI/ML workloads surge and organizations lean into specialized accelerators like GPUs, FPGAs, and ASICs, the platform’s failure model is under pressure. Drawing on insights from Sergey Kanzhelev and Mrunal Patel’s KubeCon NA 2024 talk and the latest Kubernetes 1.30 developments, this article expands on failure modes, best practices, and the roadmap to robust device management.
The AI/ML boom and its impact on Kubernetes
The explosion of AI/ML workloads has introduced tightly coupled, latency-sensitive, and resource-hungry tasks into Kubernetes clusters. According to the 2024 Llama 3 paper, hardware failures, especially GPU ECC errors and VRAM faults, are a primary source of training interruptions. NVIDIA’s self-healing GeForce NOW infrastructure reports 19 remediation requests per 1,000 nodes per day; node failures and live migration of VMs underscore the operational burden of managing GPU fleets.
Spot instance offerings and power overcommit across public clouds have normalized intermittent device unavailability. Yet Kubernetes’s static resource model still treats devices as binary: either fully present or absent. There’s no native concept of partial failures (e.g., degraded PCIe lanes, ECC-corrected errors) or transient faults, which leads to blind spots in scheduling, allocation, and lifecycle management.
Understanding AI/ML workloads
AI/ML tasks typically require specific device types (e.g., NVIDIA A100 with MIG partitions or AMD MI250x), challenging Kubernetes assumptions. Broadly, workloads split into:
- Training: gang-scheduled, run-to-completion jobs that can span days or weeks. A single pod failure may force a full-step rollback unless sophisticated checkpointing and orchestration (e.g., MPI/Horovod) are in place.
- Inference: long-lived services or batch endpoints that may need hot model swaps, multi-node ensemble inference, or downloads of large weights (>10 GB) via sidecars or init containers (see the sketch after this list).
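The init-container pattern mentioned above is sketched below. The inference image, S3 bucket, and cache size are placeholders; adapt them to your environment and credential setup.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  initContainers:
    - name: fetch-weights
      image: amazon/aws-cli:2.15.0                 # any S3-capable client works here
      command: ["aws", "s3", "cp", "s3://example-models/llama-70b/", "/models/", "--recursive"]
      volumeMounts:
        - name: model-cache
          mountPath: /models
  containers:
    - name: inference
      image: example.com/inference-server:latest   # placeholder serving image
      resources:
        limits:
          nvidia.com/gpu: 1                        # one GPU via the device plugin
      volumeMounts:
        - name: model-cache
          mountPath: /models
  volumes:
    - name: model-cache
      emptyDir:
        sizeLimit: 60Gi                            # weights can exceed 50 GB; size accordingly
```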
Legacy assumptions vs. modern demands
Traditional Kubernetes designs assumed:
| Before | Now |
| --- | --- |
| Generic CPU capacity improves performance. | A specific device class is required (e.g., ROCm vs. CUDA). |
| Pod recreation is cheap. | Reallocating GPUs is time-consuming (device plugin restart + driver rebind). |
| Any node suffices. | Nodes must respect PCIe topology, NUMA alignment, and RDMA networks. |
| Stateless pods are easily replaced. | Pods form part of coordinated steps; synchronized restarts are critical. |
| Images are small and quick to pull. | Model artifacts can exceed 50 GB; image pull controllers, layer caching, and preloaded volumes are essential. |
| Long rollouts mask init costs. | Init containers for device drivers and firmware updates need rolling, live updates. |
| Idle time is acceptable. | GPU nodes are 10× costlier; wasted slots severely impact ROI. |
Why Kubernetes still reigns supreme
Despite these challenges, Kubernetes remains the default for AI/ML due to its rich ecosystem (Kubeflow, KEDA, Volcano), comprehensive security model (RBAC, SELinux, seccomp), and broad community support. Ongoing efforts, like the Dynamic Resource Allocation (DRA) API in KEP-5030 and dynamic device health in Kubernetes 1.30 alpha, demonstrate the platform’s evolution toward better device failure handling.
The current state of device failure handling
Failure modes: K8s infrastructure
Scheduling a pod with devices involves multiple actors:
- Device plugin starts on the node.
- Device plugin registers via gRPC to the kubelet.
- Kubelet reports capacity updates to the API server.
- Scheduler binds the pod to the node.
- Kubelet calls Allocate on the device plugin.
- Kubelet spawns the container with host device mounts and driver modules.
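For reference, this flow is kicked off by an ordinary pod spec that requests an extended resource advertised by the device plugin. A minimal sketch (the trainer image is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-step
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: example.com/trainer:latest    # placeholder training image
      resources:
        requests:
          nvidia.com/gpu: 1                # extended resource registered by the device plugin
        limits:
          nvidia.com/gpu: 1                # requests and limits must be equal for extended resources
```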
Each step can fail: plugin crashes, gRPC hangs, kubelet restarts, or CRI-O device binding errors. SIG Node’s roadmap includes:
- Systemd watchdog integration for kubelet (#127460).
- DRA plugin socket health checks (#128696).
- Device manager takeover & reliability (#127803).
- Graceful retries on plugin gRPC failures (#128043).
Best practices:
- Monitor plugin metrics (gRPC latency, Allocatable vs. Capacity).
- Isolate critical GPU nodes; avoid co-locating dev/test workloads.
- Use canary upgrades for drivers via DaemonSets.
- Leverage tolerations and graceful termination (terminationGracePeriodSeconds) for node flakiness, as in the sketch below.
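A minimal sketch of that last point, assuming a checkpoint-capable trainer: the pod rides out brief node flaps instead of being evicted immediately, and gets time to checkpoint before SIGKILL.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tolerant-trainer
spec:
  terminationGracePeriodSeconds: 300        # time to checkpoint before SIGKILL
  tolerations:
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300                # ride out brief node/plugin flaps instead of instant eviction
    - key: "node.kubernetes.io/unreachable"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300
  containers:
    - name: trainer
      image: example.com/trainer:latest     # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1
```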
Failure modes: device failed
Currently, when a GPU experiences hardware errors or firmware hangs, device plugins simply withdraw capacity. Kubernetes does not correlate GPU ECC errors with container restarts. Common DIY patterns:
Health controller
A custom controller tracks allocatable vs. capacity metrics. On threshold breach, it cordons and drains the node. Drawbacks: lack of workload context, coarse detection, and potential disruption of healthy pods.
Pod failure policy
Jobs emit special exit codes for device faults, and the Kubernetes Pod Failure Policy can then treat those exits as non-retriable. Limitations: there is no generic deviceFailed status, and the policy only works with restartPolicy: Never.
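A minimal sketch of this pattern. The exit code 42 is an arbitrary convention between the training code and the Job spec, not a Kubernetes standard, and the trainer image is a placeholder.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-with-device-fault-policy
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
      - action: FailJob                 # treat device faults as non-retriable
        onExitCodes:
          containerName: trainer
          operator: In
          values: [42]                  # convention: trainer exits 42 on an unrecoverable device fault
      - action: Ignore                  # infrastructure disruptions don't count against backoffLimit
        onPodConditions:
          - type: DisruptionTarget
  template:
    spec:
      restartPolicy: Never              # required when using podFailurePolicy
      containers:
        - name: trainer
          image: example.com/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```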
Custom pod watcher
Node-local agents poll NVIDIA DCGM, AMD ROCm SMI, or RDMA ibdiagnet to detect degradation or stalls. They map failing devices to pods via the Pod Resources API and delete pods to trigger rescheduling. This requires elevated privileges and external controllers to refill the scheduled count.
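A sketch of how such an agent is commonly deployed. The watcher image and the GPU node label are assumptions; the hostPath mount is how the kubelet’s Pod Resources gRPC socket is reached.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gpu-health-watcher
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gpu-health-watcher
  template:
    metadata:
      labels:
        app: gpu-health-watcher
    spec:
      serviceAccountName: gpu-health-watcher        # needs RBAC to delete the pods it flags
      nodeSelector:
        nvidia.com/gpu.present: "true"              # assumption: GPU nodes are labeled by feature discovery
      containers:
        - name: watcher
          image: example.com/gpu-health-watcher:latest   # hypothetical agent polling DCGM / ROCm SMI
          securityContext:
            privileged: true                        # device health interfaces usually require elevated access
          volumeMounts:
            - name: pod-resources
              mountPath: /var/lib/kubelet/pod-resources
              readOnly: true
      volumes:
        - name: pod-resources
          hostPath:
            path: /var/lib/kubelet/pod-resources    # kubelet's Pod Resources gRPC socket directory
```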
Failure modes: container code failed
Kubernetes handles OOM kills and non-zero exits uniformly: restart or fail. AI/ML training benefits from orchestrators (e.g., MPI operator, Volcano) that coordinate gang restarts and in-place container restarts to preserve device bindings, reducing cold starts.
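As a rough illustration of the gang-scheduling side, a Volcano Job can require all workers to be schedulable together and restart the whole gang when a pod is evicted. Treat the manifest below as a sketch based on the batch.volcano.sh/v1alpha1 API rather than a verified configuration.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ddp-training
spec:
  schedulerName: volcano
  minAvailable: 4                     # gang scheduling: all four workers or none
  policies:
    - event: PodEvicted
      action: RestartJob              # restart the whole gang, not just the evicted pod
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: example.com/trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```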
Failure modes: device degradation
Partial performance loss—like slowed FP16 throughput due to power capping glitches—cannot be expressed. Detection relies on job telemetry: step latency spikes or BDF (Bus/Device/Function) enumeration errors. No native remediation exists beyond node recycling or manual driver reloads.
Case study: real-world outages and lessons learned
In early 2024, a major cloud provider experienced a 30-minute outage in its A100 fleet due to a firmware regression. Automated remediation scripts triggered full node reboots, but without pod affinity rules, some training jobs lost checkpoints. The incident highlighted:
- Need for pod-level device health annotations.
- Importance of cross-pod checkpoint coordination.
- Value of canary firmware deployments with staged rollouts.
Emerging standards and third-party tools
Beyond built-in Kubernetes, projects like node-healthcheck-operator and Volcano offer richer health APIs. Industry groups are debating Device Health API standards under CNCF to unify signals across vendors—aligning on metrics such as ECC corrected vs. uncorrectable counts, PCI resets, and thermal throttling events.
Security implications of device management
Exposing device health and allocation APIs increases the attack surface. Best practices include:
- RBAC rules that restrict PodResources and NodeMetrics access.
- Running device plugins with minimal capabilities (no root).
- Enabling seccomp profiles on driver installation DaemonSets.
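A hedged fragment showing what the last two points can look like on a device-plugin DaemonSet. The image is a placeholder, and some plugins still need specific capabilities or privileged host access, so tighten incrementally.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: example-device-plugin
  template:
    metadata:
      labels:
        app: example-device-plugin
    spec:
      containers:
        - name: plugin
          image: example.com/device-plugin:latest   # placeholder image
          securityContext:
            runAsNonRoot: true                      # avoid root where the plugin supports it
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]                         # add back only what the plugin truly needs
            seccompProfile:
              type: RuntimeDefault                  # baseline syscall filtering
          volumeMounts:
            - name: device-plugins
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugins
          hostPath:
            path: /var/lib/kubelet/device-plugins   # plugin registration socket directory
```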
Roadmap
SIG Node is prioritizing extension points over specialized semantics to accommodate diverse workloads. Key initiatives:
K8s infrastructure enhancements
- Systemd watchdog and gRPC reconnection logic.
- Plugin socket health detectors (#128696).
- DRA takeover and ResourceSlice health metadata.
Device failure handling
- Implement KEP-4680: Resource Health Status in PodStatus.
- Integrate device events into Pod Failure Policy.
- Support deschedule semantics for always-restart pods.
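For context on KEP-4680 above: device health would surface on the pod itself. The status fragment below illustrates the general shape (field names follow the KEP proposal and may change while the feature is alpha).

```yaml
status:
  containerStatuses:
    - name: trainer
      allocatedResourcesStatus:
        - name: nvidia.com/gpu             # resource the container requested
          resources:
            - resourceID: GPU-7b44c6f2      # opaque device ID reported by the plugin or DRA driver
              health: Unhealthy             # Healthy | Unhealthy | Unknown
```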
Container code failure recovery
- In-place container restarts with preserved host-device mounts.
- Snapshot-and-restore of container state.
- Node-local restart policies to avoid full rescheduling.
Device degradation visibility
- Extend ResourceSlice to include performance counters.
- Define standard degraded status in DRA API.
- Collaborate with vendors on telemetry exporters (Prometheus metrics).
Join the conversation
Your feedback is vital. Participate in SIG Node meetings or contribute to KEPs on GitHub. Together we can build a more resilient Kubernetes for the next era of AI/ML.