Kubernetes 1.33: Job SuccessPolicy Goes GA

With the release of Kubernetes v1.33 in April 2025, the batch working group is proud to announce that the Job successPolicy API field has reached General Availability (GA). This milestone means production-grade stability and backwards-compatible API assurances, and there is no longer any need to enable a feature gate. Teams running large-scale batch workloads, particularly in scientific simulation, AI/ML, and High-Performance Computing (HPC), can now rely on successPolicy as a core primitive for smarter Job termination logic.

About Job’s SuccessPolicy

By default, a Kubernetes Job is marked Complete only once the number of successful Pods reaches .spec.completions. In indexed Jobs, each Pod gets a unique numeric index in 0..completions-1. The .spec.successPolicy field lets you override this "all-or-nothing" behavior by defining early-exit rules: a minimum number of succeeded indexes (succeededCount), a specific set of indexes that must succeed (succeededIndexes), or both. As soon as any rule is satisfied, the Job controller marks the Job as succeeded and terminates all remaining Pods.

Why GA Status Matters

Reaching GA means the successPolicy API is covered by the Kubernetes deprecation policy for stable APIs, so no disruptive schema changes are expected within batch/v1. For enterprises running Kubernetes in regulated or heavily audited environments, this stability is crucial. It also makes the feature usable on managed Kubernetes services such as GKE, EKS, and AKS, where alpha or beta features may be disabled or unsupported.

Technical Specifications

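The GA spec for .spec.successPolicy holds a list of rules. Each rule has two optional subfields, and at least one of them must be set:

rules[].succeededCount (integer): Minimum number of succeeded indexes required. When combined with succeededIndexes, only the listed indexes are counted.
rules[].succeededIndexes (string): Set of Pod indexes, written as comma-separated numbers and ranges such as "0,2-3". Used on its own, every listed index must succeed for the rule to match.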
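
To make the rule semantics concrete, here is a minimal Python sketch of how a single rule, and then an ordered list of rules, could be evaluated against the set of succeeded indexes. It is an illustration only, not the Job controller's code: the helper names parse_indexes, rule_met, and first_matching_rule are hypothetical, and the sketch assumes succeededIndexes uses the comma-separated interval string format of the upstream API reference.

```python
from typing import Optional


def parse_indexes(spec: str) -> set:
    """Expand an interval string such as "0,2-3" into {0, 2, 3}."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            indexes.update(range(int(first), int(last) + 1))
        else:
            indexes.add(int(part))
    return indexes


def rule_met(rule: dict, succeeded: set) -> bool:
    """Evaluate one rule (a dict shaped like the manifest entry)."""
    index_spec = rule.get("succeededIndexes")
    count = rule.get("succeededCount")
    if index_spec is None and count is None:
        raise ValueError("a rule needs succeededIndexes, succeededCount, or both")
    if index_spec is None:
        # Count only: any index may contribute to the total.
        return len(succeeded) >= count
    wanted = parse_indexes(index_spec)
    if count is None:
        # Indexes only: every listed index must have succeeded.
        return wanted <= succeeded
    # Both: count successes among the listed indexes only.
    return len(succeeded & wanted) >= count


def first_matching_rule(rules: list, succeeded: set) -> Optional[int]:
    """Return the position of the first satisfied rule, or None."""
    for position, rule in enumerate(rules):
        if rule_met(rule, succeeded):
            return position
    return None


rules = [{"succeededIndexes": "0"}, {"succeededCount": 3}]
print(first_matching_rule(rules, {1, 2}))     # no rule satisfied yet
print(first_matching_rule(rules, {1, 2, 3}))  # second rule (position 1) matches
```

Because rules are checked in order and the first match wins, listing the most specific rule first keeps the outcome easy to reason about.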

Under the hood, the Job controller in kube-controller-manager watches the status of each indexed Pod, tracks which indexes have succeeded, and evaluates the policy rules in order. Once any rule is satisfied, the remaining rules are ignored: the controller adds a SuccessCriteriaMet condition to the Job's status, terminates the Pods that are still running, and then adds the Complete condition.
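
For tooling that watches Jobs, the condition sequence can be read straight from the Job's status. The following Python sketch classifies a Job from its status.conditions list; the helper name job_phase, the returned labels, and the sample objects are hypothetical, and only the condition type and status fields are relied on.

```python
def job_phase(job: dict) -> str:
    """Classify a Job (shaped like `kubectl get job -o json` output)."""
    conditions = job.get("status", {}).get("conditions", [])
    # Condition status is the string "True", not a boolean.
    active = {c["type"] for c in conditions if c.get("status") == "True"}
    if "Failed" in active:
        return "failed"
    if "Complete" in active:
        return "complete"
    if "SuccessCriteriaMet" in active:
        # A success policy rule matched; remaining Pods are being terminated.
        return "success criteria met"
    return "running"


# Hypothetical sample objects for illustration.
early_exit = {"status": {"conditions": [
    {"type": "SuccessCriteriaMet", "status": "True"},
]}}
finished = {"status": {"conditions": [
    {"type": "SuccessCriteriaMet", "status": "True"},
    {"type": "Complete", "status": "True"},
]}}

print(job_phase(early_exit))
print(job_phase(finished))
```

Checking Complete before SuccessCriteriaMet matters here, because a finished Job carries both conditions.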

How it works

Here’s an example of a 10-Pod indexed Job that exits successfully as soon as any single Pod finishes:

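apiVersion: batch/v1
kind: Job
metadata:
  name: example-early-exit
spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed   # successPolicy only applies to indexed Jobs
  successPolicy:
    rules:
    - succeededCount: 1     # any one succeeded index is enough
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker        # placeholder workload, replace with your own
        image: busybox:1.36
        command: ["sh", "-c", "sleep 5"]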

In this configuration, as soon as a Pod with any index succeeds, the Job gains the SuccessCriteriaMet condition and all other Pods are terminated. For scenarios where only the leader Pod (index 0) dictates Job success, you can combine both fields, which restricts the count to the listed indexes:

spec:
  parallelism: 10
  completions: 10
  completionMode: Indexed
  successPolicy:
    rules:
    - succeededIndexes: "0"   # leader index, written as an interval string
      succeededCount: 1

Real-world Use Cases and Performance Benchmarks

Many large organizations have already begun piloting successPolicy in production:

CERN uses indexed Jobs to run parametrized physics simulations. By early-exiting when a representative subset finishes, they cut average cluster runtime by ~40%.
GenomeCloud processes thousands of sequencing jobs daily. With succeededCount thresholds, they reduce wasted compute and lower monthly cloud spend by 25%.
AI Labs Inc. orchestrates hyperparameter sweeps. Specifying a minimal quorum of successes accelerates convergence detection, improving resource utilization by 30%.

Benchmarks show that the additional controller logic adds under 5 ms of scheduling latency per Pod event and uses only kilobytes of extra memory in kube-controller-manager, making it lightweight enough for even resource-constrained clusters.

Compatibility and Migration Considerations

If you have been using the alpha or beta version of this feature (feature gate JobSuccessPolicy), no API version changes are required to migrate to GA. Simply ensure your clusters are running v1.33 or later and remove any manual feature-gate toggling. Existing manifests that already set successPolicy keep working unchanged, and legacy manifests without successPolicy continue behaving as before, requiring all completions.

For automation and CI pipelines, make sure any manifests applied with kubectl apply or rendered from Helm charts are validated against the batch/v1 schema, including the successPolicy field. Running kubectl explain job.spec.successPolicy will show the GA documentation after upgrading.
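
As a lightweight pre-check before manifests reach the cluster, a CI step could lint the successPolicy block itself. The sketch below is a hypothetical helper, lint_success_policy, that works on an already parsed manifest (a plain dict) and covers only a few constraints: indexed completion mode, at least one field per rule, the string format of succeededIndexes, and the index range. It is not a substitute for API server validation.

```python
def lint_success_policy(job: dict) -> list:
    """Return human-readable problems; an empty list means no findings."""
    spec = job.get("spec", {})
    policy = spec.get("successPolicy")
    if policy is None:
        return []  # legacy manifest: nothing to check

    problems = []
    if spec.get("completionMode") != "Indexed":
        problems.append("successPolicy requires completionMode: Indexed")

    completions = spec.get("completions")
    for i, rule in enumerate(policy.get("rules") or []):
        indexes = rule.get("succeededIndexes")
        count = rule.get("succeededCount")
        if indexes is None and count is None:
            problems.append(f"rules[{i}]: needs succeededIndexes or succeededCount")
        if indexes is None:
            continue
        if not isinstance(indexes, str):
            problems.append(f"rules[{i}]: succeededIndexes must be a string")
            continue
        try:
            # The upper bound of each "a-b" part (or the lone number) is enough.
            highest = max(int(p.split("-")[-1]) for p in indexes.split(","))
        except ValueError:
            problems.append(f"rules[{i}]: cannot parse {indexes!r}")
            continue
        if isinstance(completions, int) and highest >= completions:
            problems.append(
                f"rules[{i}]: index {highest} is outside 0..{completions - 1}"
            )
    return problems


# Hypothetical manifest, already loaded into a dict by the pipeline.
manifest = {
    "spec": {
        "completions": 10,
        "completionMode": "Indexed",
        "successPolicy": {"rules": [{"succeededIndexes": "0,12"}]},
    }
}
for problem in lint_success_policy(manifest):
    print(problem)
```

Returning a list of messages rather than raising on the first problem lets the pipeline report every finding in one run.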

Looking Ahead: Roadmap and Next Steps

Building on GA successPolicy, the WG-Batch is exploring:

Checkpointing & Resumability – allow indexed Jobs to resume from saved state after preemption or node failures (KEP-4501).
Cross-Namespace Coordination – enable Jobs in different namespaces to share success criteria via ConfigMaps or CRDs.
TTL for Early-Exited Jobs – auto-cleanup Jobs that exit early, based on successPolicy, after a configurable TTL.

Contributions are welcome via the KEP repo and SIG Apps proposals.

Learn more

Official docs: Success policy
KEP: Job success/completion policy

Get involved

This enhancement was driven by the WG-Batch in collaboration with SIG Apps. Join the conversation on Slack, subscribe to the working-group mailing list, and attend the biweekly community meetings to propose, review, or test new features.