This guide describes how to configure different Charmed Kubeflow (CKF) workloads, such as Notebooks, Pipeline steps, and distributed jobs, so that they are scheduled onto specific nodes using Kubernetes scheduling features such as node selectors, affinity, and tolerations.
Requirements
- A CKF deployment and access to the Kubeflow dashboard. See Get started for more details.
- An underlying Kubernetes (K8s) cluster with multiple nodes, with the relevant node labels (and taints, if used) already applied.
Notebooks
You can configure Notebooks to be scheduled on specific nodes via the Notebooks page in the Kubeflow dashboard when creating a new Notebook.
To do so, configure the Affinity and Toleration settings during Notebook creation by:
- Clicking on `+ Create Notebook`.
- Scrolling to the bottom and expanding `Advanced Options`.
- Configuring the `Affinity` and `Tolerations` sections.
Configuring the Notebook creation page is intended only for admins. See this guide for more details.
If your cluster setup uses taints, see Leverage PodDefaults for more details.
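For illustration, below is a minimal sketch of such a PodDefault, which adds a toleration to any Notebook Pod that selects it from the Configurations section of the Notebook creation page. It assumes the admission webhook in your CKF version supports tolerations in PodDefaults; the namespace, label, and taint values are only examples.

```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: tolerate-pool1
  namespace: your-user-namespace
spec:
  desc: Tolerate the sku=pool1:NoSchedule taint
  selector:
    matchLabels:
      # Label applied to Notebook Pods that select this configuration
      tolerate-pool1: "true"
  tolerations:
    - key: sku
      operator: Equal
      value: pool1
      effect: NoSchedule
```

Notebooks that select this configuration get the matching label applied to their Pods, which triggers the webhook to inject the toleration.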
Pipeline steps
K8s-specific configurations, such as `nodeSelector` and tolerations, can be set on a Kubeflow Pipeline step via the kfp-kubernetes Python package.
The following example sets both a `nodeSelector` and a toleration on a Pipeline step:
from kfp import dsl
from kfp.kubernetes import add_node_selector, add_toleration

@dsl.component(base_image="python:3.12")
def print_node_name():
    """Print the Node's hostname."""
    import socket
    print("Node name: %s" % socket.gethostname())

@dsl.pipeline
def node_scheduling_pipeline():
    print_node_task = print_node_name()
    # Schedule the step on nodes labelled sku=pool-1
    task = add_node_selector(print_node_task, "sku", "pool-1")
    # Tolerate any NoSchedule taint with the key "sku" on those nodes
    task = add_toleration(task, key="sku", operator="Exists", effect="NoSchedule")
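To actually run the example, one option is to compile the pipeline to IR YAML with the KFP SDK and upload it through the Pipelines page of the dashboard; the output file name below is only illustrative.

```python
from kfp import compiler

# Compile the pipeline above into an IR YAML file that can be uploaded
# via the Kubeflow dashboard or submitted with the KFP client.
compiler.Compiler().compile(node_scheduling_pipeline, "node_scheduling_pipeline.yaml")
```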
Distributed training
Distributed training in CKF is achieved via the Katib and Training Operator components.
Katib Trials can be implemented with different job types, which may use default settings defined in Trial Templates. These can include standard K8s Jobs or distributed training jobs run via the Training Operator.
All Trial definitions ultimately configure a `PodSpec` for the Trial's Pods. To accommodate the above scheduling use cases, you need to configure the `nodeSelector` and `tolerations` fields of that `PodSpec`.
Below is an example of a `TFJob` that can be used in a Trial definition and satisfies all the above criteria:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector: # Scheduling
            pool: pool1
          tolerations: # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector: # Scheduling
            pool: pool1
          tolerations: # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              resources:
                limits:
                  nvidia.com/gpu: 1
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
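For reference, the sketch below shows how such a `TFJob` can be embedded in a Katib Experiment's `trialTemplate` so that Trial Pods follow the same scheduling constraints. The experiment name, objective metric, search parameters, and image are illustrative assumptions, not part of the guide above, and assume the trainer reports an accuracy metric.

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tfjob-node-scheduling   # illustrative name
  namespace: your-user-namespace
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy   # assumes the trainer logs this metric
  algorithm:
    algorithmName: random
  maxTrialCount: 3
  parallelTrialCount: 1
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training job
        reference: learning_rate
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                nodeSelector: # Scheduling
                  pool: pool1
                tolerations: # Scheduling
                  - effect: NoSchedule
                    key: sku
                    operator: Equal
                    value: pool1
                containers:
                  - name: tensorflow
                    image: gcr.io/your-project/your-image
                    command:
                      - python
                      - -m
                      - trainer.task
                      - --batch_size=32
                      - --training_steps=1000
                      - --learning_rate=${trialParameters.learningRate}
```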
KServe InferenceServices
KServe `InferenceServices` expose `PodSpec` attributes that can be used for configuring advanced scheduling scenarios. See the example below for more details:
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
spec:
predictor:
model:
modelFormat:
name: sklearn
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
nodeSelector: # Scheduling
sku: pool-1
tolerations: # Scheduling
- key: "sku"
operator: "Exists"
effect: "NoSchedule"