In-Depth Overview of JobSet: A Unified API for Distributed ML Training and HPC on Kubernetes

Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)
JobSet is an open source API for representing and orchestrating distributed jobs. Its goal is to provide a single, unified API for distributed machine learning (ML) training and high-performance computing (HPC) workloads on top of Kubernetes. This article explores JobSet's architecture, its technical details, and its role in managing containerized workloads across groups of accelerators ("compute islands").
Why JobSet?
The recent improvements in Kubernetes’ batch ecosystem have played a pivotal role in attracting ML engineers to Kubernetes. The platform’s ability to seamlessly schedule and manage distributed training jobs makes it an excellent match for the heavy demands of large-scale ML and HPC workloads.
Modern large ML models, especially large language models (LLMs), often exceed the memory capacity of a single GPU or TPU host. Consequently, model training is distributed over tens of thousands of accelerator chips spread across hundreds or thousands of nodes. The containerized training processes run concurrently on these hosts, using communication primitives such as all-gather and all-reduce to synchronize gradients across nodes.
Kubernetes’ inherent strengths in container orchestration, resource scheduling, and lifecycle management make it an ideal platform to support these distributed jobs. Its extensibility allows developers to create custom APIs and controllers, though innovative use cases such as distributed ML still push the limits of existing Kubernetes primitives.
Historically, orchestration frameworks like Kubeflow have provided framework-specific APIs such as PyTorchJob, TFJob, and MPIJob. However, these framework-specific solutions have fragmented the distributed training ecosystem. JobSet tackles this by using the more general Job API as a building block and extending it with the capabilities modern HPC and ML workloads require, such as multiple pod templates per workload, job grouping across network topologies, and controlled startup sequencing.
How JobSet Works
JobSet treats a distributed workload as an aggregate of several Kubernetes Jobs. This strategy allows users to define distinct pod templates for various roles like a leader, workers, or parameter servers. By incorporating the abstraction of a ReplicatedJob—a Job Template paired with a specified number of replicas—JobSet simplifies the task of deploying identical child jobs across separate accelerator islands, eliminating the need for cumbersome scripting or management via Helm charts.
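As a minimal sketch of this multi-template pattern, a JobSet can declare one single-pod leader Job alongside a set of identical worker Jobs. The names, image, and counts below are illustrative placeholders, not part of the TPU example later in this article:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training            # placeholder name
spec:
  replicatedJobs:
  - name: leader            # one coordinator with its own pod template
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: leader
              image: example.com/trainer:latest   # placeholder image
  - name: workers           # identical child Jobs stamped from one template
    replicas: 4
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: example.com/trainer:latest   # placeholder image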
Key features of JobSet include:
- Replicated Jobs: Modern data centers typically house groups of homogeneous accelerators connected by high-bandwidth network links. JobSet partitions a large distributed workload into smaller, identical child Jobs, one per accelerator island. Keeping most traffic inside an island, rather than on the slower inter-island data center network, improves performance for distributed data parallel (DDP) training strategies.
- Automatic Headless Service Management: Communication between pods is critical for synchronizing training workloads. JobSet automatically configures headless services, ensuring that pod-to-pod communication via hostnames is reliable and seamlessly managed throughout the job lifecycle.
- Configurable Success and Failure Policies: Users can fine-tune when a JobSet is considered complete and how many restarts are allowed in case of failure. For instance, an administrator can require that all pods in a specific group complete successfully before the overall job is terminated, and have the JobSet automatically recover from failures up to a restart budget (see the sketch after this list).
- Exclusive Placement per Topology Domain: For latency-sensitive workloads, JobSet enforces exclusive 1:1 assignment of child jobs to topology domains such as racks or accelerator islands. This optimal placement ensures that inter-node communication occurs primarily over high-bandwidth interconnects within the same domain, with minimal dependency on lower-bandwidth data center networks.
- Integration with Kueue: JobSet supports submission via Kueue, which enables cluster oversubscription, workload queuing, and multi-tenancy. This integration prevents partial scheduling scenarios and potential deadlocks, thereby increasing overall cluster efficiency (a labeling example also follows this list).
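To make the last two features concrete, the fragments below are sketches against the v1alpha2 API rather than drop-in configuration. The first marks a JobSet successful once every child Job of a ReplicatedJob named workers completes, and recreates failed child Jobs up to three times:

spec:
  successPolicy:
    operator: All              # every target ReplicatedJob must complete
    targetReplicatedJobs:
    - workers                  # name of a ReplicatedJob in this JobSet
  failurePolicy:
    maxRestarts: 3             # tolerate up to 3 restarts on failure

The second submits a JobSet through Kueue by labeling it with a LocalQueue name (user-queue is a placeholder; the queue must already exist in the namespace):

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue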
Example Use Case: Distributed ML Training on Multiple TPU Slices with Jax
Leveraging TPU Multislice Architecture
This example demonstrates a JobSet configuration that runs a TPU Multislice workload across 4 TPU v5e slices. Because each child job must be scheduled entirely within a single slice, such workloads depend on precise pod placement and resource partitioning. The workload uses Jax, an ML framework with native just-in-time (JIT) compilation via OpenXLA; PyTorch/XLA remains a viable alternative.
The sample YAML spec below shows how JobSet expresses these scheduling requirements, giving each child job exclusive use of its assigned TPU slice with minimal manual configuration.
# Run a simple Jax workload on 4 TPU slices
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Assign exclusive TPU slice usage
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # Set to number of TPU slices
    template:
      spec:
        parallelism: 2 # Number of VMs per TPU slice
        completions: 2 # Number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
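Applied to a cluster with four available TPU v5e slices, this manifest creates 4 child Jobs of 2 pods each, 8 pods in total; since each pod requests 4 TPU chips, jax.device_count() should report 32 global devices once every host has joined the Jax distributed runtime.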
Deeper Technical Analysis
JobSet's design reflects the precision that contemporary distributed computing demands. By decomposing a single large job into a set of smaller, interrelated jobs, JobSet provides granular control over resource allocation and network topology:
- Resource Isolation and Control: The ability to specify node selectors and exclusive topology policies allows administrators to isolate workloads within specific network domains. This is particularly crucial when workloads are highly sensitive to network latency and require optimized, high-bandwidth communication channels.
- Policy-Driven Scheduling: With configurable success and failure policies, JobSet allows the orchestration engine to automatically manage retries and job restarts. These features reduce the operational overhead required to monitor distributed jobs and increase the resilience of ML training pipelines.
- Declarative Job Replication: Replicated job templates via the ReplicatedJob abstraction drastically reduce the complexity of managing large-scale deployments. Developers can stamp out uniform job replicas without manually duplicating configuration, while still giving different roles different container images, resource requests, and security settings; each replica is also addressable by a predictable hostname, as sketched below.
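Combined with the automatic headless service management described earlier, each replica's pods get stable hostnames derived from the JobSet name, the ReplicatedJob name, the child Job index, and the pod index. The fragment below sketches how a container in the TPU example might address a peer; the environment variable is hypothetical, and the exact hostname format should be verified against the JobSet documentation:

env:
- name: COORDINATOR_ADDRESS    # hypothetical variable, for illustration
  # <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<jobset-name>:<port>
  value: "multislice-workers-0-0.multislice:8080"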
Expert Opinions and Industry Impact
Leaders in cloud computing and distributed systems praise JobSet for its potential to standardize distributed job orchestration. Experts note that by abstracting infrastructure complexities, JobSet can accelerate the development cycle for ML models and streamline HPC operations.
According to several industry analysts, unified APIs like JobSet could be instrumental in the next generation of AI research, where distributed workflows need scalable, resilient, and manageable orchestration layers. The tight integration with Kubernetes and Kueue also aligns with the broader industry shift towards container-native architectures that emphasize portability and cross-cloud compatibility.
Future Work, Trends, and Community Involvement
The roadmap for JobSet includes several promising enhancements. Planned features include improved inter-pod communication protocols, dynamic resource scaling, enhanced observability, and additional configurability for heterogeneous workloads. Ongoing community contributions are key to its evolution, and interested developers are encouraged to review the JobSet roadmap for a detailed list of upcoming features.
In addition, further improvements are anticipated in integrating JobSet with emerging cloud-native paradigms, such as serverless compute and edge computing. These integrations are expected to broaden its applicability beyond conventional data centers, making it a cornerstone in the orchestration of distributed systems across hybrid environments.
Contributors and community members can engage with the project via its repository, the mailing list, or on the Slack channel. Feedback, bug reports, and feature suggestions are highly welcomed, as they ensure that JobSet remains at the cutting edge of distributed workload orchestration.
Conclusion
JobSet represents a significant leap forward in the orchestration of large-scale distributed workloads. By providing a unified API that leverages Kubernetes’ native capabilities, it bridges the gap between ML training, HPC, and emerging distributed system needs. With its advanced features, flexible configurations, and a strong community backing, JobSet is poised to become the de facto solution for orchestrating multi-node distributed jobs in today’s complex compute environments.