Revolutionizing Distributed Workloads with JobSet on Kubernetes

Authors: Daniel Vega-Myhre (Google), Abdullah Gharaibeh (Google), Kevin Hannon (Red Hat)
Introducing JobSet
JobSet is an innovative open source API that streamlines the representation and orchestration of distributed jobs on Kubernetes. Designed to serve both distributed machine learning (ML) training and high-performance computing (HPC) workloads, JobSet provides a unified API model that enables advanced configurations and optimizations for large-scale, containerized workloads. In today’s era of accelerating artificial intelligence research and growing HPC demands, this API offers a refined approach to managing distributed resources and scheduling tasks across extensive clusters.
Why JobSet?
The Kubernetes community has recently enhanced its native batch processing ecosystem, attracting ML engineers with its robust scheduling and resource management capabilities. The emergence of colossal ML models—especially large language models (LLMs)—has underscored the need to distribute workloads across tens of thousands of accelerator chips. With models often spanning thousands of hosts, the conventional single-host scheduling paradigms no longer suffice.
Containerized model training code deployed on Kubernetes leverages distributed communication primitives such as all-gather and all-reduce. By sharding model parameters and datasets across nodes, these workloads can synchronize gradients and share metadata effectively. Kubernetes excels at scheduling and managing such containerized applications, enabling seamless lifecycle management and supporting user-defined APIs, objects, and controllers to create custom distributed training solutions.
Despite the advances with Kubernetes native Job APIs, existing primitives and fragmented training orchestration solutions (for example, Kubeflow’s multiple framework-specific custom resources) have struggled to keep pace with evolving distributed training paradigms. JobSet addresses these gaps by building on the stable Job API to provide additional abstractions necessary for modern ML and HPC use cases.
How JobSet Works
JobSet models a distributed workload as a group of Kubernetes Jobs, allowing users to specify multiple pod templates for different roles (such as a driver, workers, parameter servers, and more). The mechanism hinges on the notion of a ReplicatedJob. Essentially, a ReplicatedJob pairs a Job template with a desired number of replicas, enabling the declarative creation of identical child jobs across different accelerator islands. This eliminates the need for cumbersome scripting or complex Helm chart configurations when generating multiple variants of the same job.
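To make the ReplicatedJob idea concrete, here is a minimal sketch of a JobSet that stamps out four identical child Jobs from a single Job template. The names and image are placeholders; only the overall shape (apiVersion jobset.x-k8s.io/v1alpha2, spec.replicatedJobs, replicas, and a standard Job template) is taken from the JobSet API used in the full example later in this post.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: example              # placeholder name
spec:
  replicatedJobs:
  - name: workers
    replicas: 4              # create four identical child Jobs from the template below
    template:                # a standard Kubernetes Job template
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            containers:
            - name: worker
              image: busybox # placeholder image
              command: ["echo", "hello from a JobSet worker"]
            restartPolicy: Never

Each replica is created as an ordinary Kubernetes Job, so familiar Job fields such as parallelism, completions, and backoffLimit apply to every child Job.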
Additional JobSet features include:
- Replicated Jobs: In large-scale data centers, specialized hardware accelerators like GPUs and TPUs are often clustered in specific topology domains (or islands) interconnected with high-bandwidth networks (e.g., NVLink or ICI). JobSet allows workloads to be partitioned into multiple identical child jobs, each targeting a specific accelerator island, so that inter-node communications remain local and efficient.
- Automatic Headless Service Management: Facilitating direct pod-to-pod communication, JobSet automatically creates and manages headless services. This enhances connectivity by ensuring that pods can communicate using stable hostnames throughout the lifespan of the workload.
- Configurable Success and Failure Policies: Administrators can define precise conditions under which a JobSet is marked complete, for example only after all pods in a specific ReplicatedJob finish successfully, and can cap how many times a failed JobSet is restarted. This protects long-running distributed training jobs, which can restart and resume from the most recent checkpoint after a failure (see the configuration sketch after this list).
- Exclusive Placement per Topology Domain: To further optimize performance, JobSet can enforce a 1:1 mapping between a child job and a topology domain, ensuring that each job's pods are co-located on the same physical or network domain. This is particularly beneficial for distributed data-parallel training, minimizing cross-domain communication overhead while exploiting high-bandwidth intra-island connectivity.
- Integration with Kueue: JobSet integrates with Kueue, a job queuing system that enables cluster oversubscription, workload queuing, deadlock prevention, and multi-tenancy, supporting efficient resource planning and workload execution in shared environments (also shown in the sketch below).
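The following sketch shows how these policies and the Kueue hook might be combined in a single manifest. The failurePolicy block and the exclusive-topology annotation mirror the TPU example later in this post; the successPolicy fields (operator, targetReplicatedJobs) and the kueue.x-k8s.io/queue-name label are assumptions based on the JobSet and Kueue APIs, so verify the exact names and values against the documentation for your versions.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: training-run
  labels:
    # Assumption: submit the JobSet to a Kueue LocalQueue named "ml-queue"
    kueue.x-k8s.io/queue-name: ml-queue
  annotations:
    # Give each child Job exclusive use of one topology domain (here, a node pool)
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  # Assumption: mark the JobSet complete only when every pod in the
  # "workers" ReplicatedJob finishes successfully
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - workers
  # Recreate failed child Jobs up to 10 times before marking the JobSet failed,
  # so a long-running training job can resume from its latest checkpoint
  failurePolicy:
    maxRestarts: 10
  replicatedJobs:
  - name: workers
    replicas: 4
    template:
      spec:
        backoffLimit: 0
        template:
          spec:
            containers:
            - name: worker
              image: busybox # placeholder image
              command: ["sh", "-c", "echo training step && sleep 30"]
            restartPolicy: Never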
Technical Deep Dive: Distributed Orchestration and Communication
From a technical perspective, one of JobSet’s most significant contributions is its abstraction of a distributed training workload into multiple Kubernetes Jobs. Each ReplicatedJob functions as a fully defined job template, allowing for fine-grained customization such as:
- Multi-template Pods: Different pods within the same workload may require varying containers, resources, or failure policies. The driver-worker model is a prime example where the driver may need a different lifecycle compared to workers.
- Job Groups and Topology Awareness: In large-scale deployments, network topology becomes a critical factor. Job groups allow administrators to control pod locality, ensuring that communication-intensive tasks occur over low-latency, high-bandwidth paths while minimizing cross-rack or cross-data center traffic.
- Inter-Pod Communication Management: By automatically deploying headless services, JobSet ensures that pods can reliably establish connections with one another. This is essential for the collective operations that underpin communication frameworks like MPI, Ray, or Spark, where synchronization across pods is a must (see the hostname sketch after this list).
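As an illustration of the stable-hostname point, the fragment below shows how a worker container could be pointed at a driver pod by name. It assumes JobSet's default naming scheme, in which each pod receives a hostname of the form <jobset>-<replicatedJob>-<jobIndex>-<podIndex> under a headless service named after the JobSet; the JobSet name "trainer", the "driver" ReplicatedJob, and the port are all hypothetical, and the exact hostname pattern should be checked against the JobSet documentation.

# Fragment of a worker pod template inside a hypothetical JobSet named "trainer"
# that also defines a "driver" ReplicatedJob.
containers:
- name: worker
  image: busybox            # placeholder image
  env:
  - name: MASTER_ADDR
    # driver ReplicatedJob, Job index 0, pod index 0, headless service "trainer"
    value: trainer-driver-0-0.trainer
  - name: MASTER_PORT
    value: "29500"          # hypothetical rendezvous port
  command: ["sh", "-c", "echo connecting to $MASTER_ADDR:$MASTER_PORT"]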
Performance Optimization and Topology Considerations
In modern data center environments, balancing network latency and throughput is key to efficient distributed computing. JobSet’s design allows for:
- Locality Optimizations: By mapping child jobs to distinct accelerator islands (such as specific racks or TPU slices), the framework ensures that the majority of high-frequency, low-latency operations occur locally, with minimal dependency on the slower data center network.
- Custom Resource Constraints: By controlling node selection with nodeSelector parameters and affinity rules, JobSet caters to hardware-specific configurations, whether that means grouping nodes with NVLink-connected GPUs or TPU pods interconnected via ICI mesh networks.
- Configurable Parallelism and Completion: Fine-tuning parameters like parallelism, completions, and backoffLimit allows engineers to precisely define workload execution strategies. These parameters are crucial in scenarios where minimizing downtime and expediting recovery from failures are paramount. A fragment illustrating both settings for a GPU island follows this list.
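The fragment below sketches how those knobs might look for a child Job targeting a GPU island. The node label cloud.google.com/gke-accelerator and the accelerator type are GKE-specific assumptions used for illustration; substitute whatever labels identify accelerator nodes in your cluster.

# Fragment of a ReplicatedJob's Job template targeting a GPU island
template:
  spec:
    parallelism: 8        # run 8 pods at once, one per GPU host
    completions: 8        # the child Job succeeds when all 8 pods complete
    backoffLimit: 0       # fail fast; let the JobSet-level failurePolicy drive restarts
    template:
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-tesla-a100   # assumed GKE label
        containers:
        - name: trainer
          image: busybox  # placeholder image
          command: ["sh", "-c", "echo training shard on $(hostname)"]
          resources:
            limits:
              nvidia.com/gpu: 8
        restartPolicy: Never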
Example Use Case: Distributed ML Training on Multiple TPU Slices with Jax
The following YAML example demonstrates a JobSet specification designed to run a TPU multislice workload on 4 TPU v5e slices. Each of the 4 child Jobs runs 2 pods (one per TPU VM) that each request 4 TPU chips, so the workload spans 32 TPU v5e chips in total. This configuration leverages Jax via OpenXLA for efficient, JIT-compiled distributed ML training. Although the example specifically targets Jax, similar configurations can be achieved using PyTorch/XLA or other frameworks that support TPU-based computations.
# Run a simple Jax workload on multiple TPU slices
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Assign each child Job exclusive usage of a TPU slice
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
  - name: workers
    replicas: 4 # One replica per TPU slice
    template:
      spec:
        parallelism: 2 # Number of VMs per TPU slice
        completions: 2 # Number of VMs per TPU slice
        backoffLimit: 0
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              ports:
              - containerPort: 8471
              - containerPort: 8080
              securityContext:
                privileged: true
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
Future Work, Industry Impact, and Getting Involved
JobSet has an active roadmap with numerous planned features aimed at further enhancing distributed ML and HPC workflows. Future enhancements include advanced management of heterogeneous workloads, enhanced monitoring and logging features tailored for distributed training, and deeper integration with emerging container orchestration tools.
The ongoing collaboration between engineers from Google and Red Hat, together with contributions from the open source community, underscores the strong industry support for JobSet. Its evolving feature set is expected to drive innovation in areas such as fault tolerance, dynamic resource scaling, and even automated recovery systems for production-critical environments.
We invite developers, researchers, and industry experts to contribute ideas, report issues, and help expand documentation. Join the conversation on the JobSet repository, participate on the mailing list, or connect on Slack.
Technical Expert Opinions and Ecosystem Perspectives
Recent discussions among Kubernetes and cloud computing experts highlight JobSet as a pivotal tool for bridging the gap between theoretical orchestration models and practical deployment realities. Architects emphasize the importance of customizable resource allocation and topology-aware scheduling to accommodate increasingly network-sensitive and compute-intensive workloads. With growing demands from both AI research and HPC industries, JobSet’s modular and extensible architecture is well-positioned to influence future orchestration strategies and best practices.
Moreover, several thought leaders in the field have noted that integrating JobSet with monitoring tools and distributed tracing systems could yield deeper insights into cluster performance, potentially guiding further optimizations in both software and hardware design.
Conclusion
JobSet represents a significant step forward in the orchestration of distributed training and HPC workloads on Kubernetes. By addressing existing scheduling inefficiencies and introducing advanced configuration options, it offers a robust framework for both early-stage research and large-scale production deployments. We encourage the community to participate in its ongoing evolution and help shape the future of distributed computing.