This guide describes how to configure Charmed Kubeflow (CKF) workloads, such as Notebooks, Pipeline steps, and distributed training jobs, to follow specific scheduling patterns, for example running on labelled node pools or tolerating node taints.

Requirements

  1. A CKF deployment and access to the Kubeflow dashboard. See Get started for more details.
  2. An underlying Kubernetes (K8s) cluster with multiple nodes, where the target nodes are labelled and, where needed, tainted, as in the sketch below.
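
The examples in this guide schedule workloads onto nodes labelled sku: pool-1 or pool: pool1 and tolerate a NoSchedule taint with key sku. The following is a minimal sketch of such a node, shown as the resulting Node object rather than the kubectl label and kubectl taint commands that would normally set it up; the node name and label values are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: worker-1        # illustrative node name
  labels:
    sku: pool-1         # matched by the nodeSelector examples below
    pool: pool1
spec:
  taints:
    - key: sku          # matched by the tolerations examples below
      value: pool1
      effect: NoSchedule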

Notebooks

You can configure Notebooks to be scheduled on specific nodes via the Notebooks page in the Kubeflow dashboard when creating a new Notebook.

To do so, configure the Affinity and Tolerations settings during Notebook creation by:

  1. Clicking on + Create Notebook.
  2. Scrolling to the bottom and expanding Advanced Options.
  3. Configuring the Affinity and Tolerations sections.

Customising the options shown on the Notebook creation page is intended for admins only. See this guide for more details.

If your cluster setup uses taints, see Leverage PodDefaults for more details.
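
PodDefaults can apply a toleration to every Notebook Pod created in a namespace, so users do not have to configure it for each Notebook. The following is a minimal sketch, assuming the PodDefault CRD in your CKF version supports the tolerations field; the PodDefault name, selector label, and taint key are illustrative:

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: tolerate-sku-taint          # illustrative name
  namespace: your-user-namespace
spec:
  desc: Tolerate the sku taint on dedicated nodes
  selector:
    matchLabels:
      tolerate-sku-taint: "true"    # Pods carrying this label get the toleration
  tolerations:                      # assumes tolerations support in the PodDefault CRD
    - key: "sku"
      operator: "Exists"
      effect: "NoSchedule"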

Pipeline steps

K8s-specific settings for a Kubeflow Pipelines step, such as nodeSelector and tolerations, can be configured via the kfp-kubernetes Python package.

The following example sets both a nodeSelector and a toleration on a Pipeline step:

from kfp import dsl
from kfp.kubernetes import add_node_selector, add_toleration


@dsl.component(base_image="python:3.12")
def print_node_name():
    """Print the Node's hostname."""
    import socket

    print("Node name: %s" % socket.gethostname())


@dsl.pipeline
def node_scheduling_pipeline():
    # Schedule the step on nodes labelled sku=pool-1 and tolerate the sku taint
    print_node_task = print_node_name()
    task = add_node_selector(print_node_task, "sku", "pool-1")
    task = add_toleration(task, key="sku", operator="Exists", effect="NoSchedule")
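
When the pipeline is compiled and run, these calls populate the nodeSelector and tolerations fields of the Pod created for that step, so the step is scheduled onto the matching nodes.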

Distributed training

Distributed training in CKF is achieved via the Katib and Training Operator components.

Katib Trials can be implemented with different job types, which may use default settings defined in Trial Templates. These include standard K8s Jobs as well as distributed training jobs run via the Training Operator.

All Trial definitions ultimately configure a PodSpec for the Trial’s Pods. To accommodate the above scheduling use cases, you need to configure the nodeSelector and tolerations of the PodSpec.

Below is an example of a TFJob that can be used in a Trial definition and satisfies all the above criteria:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector:  # Scheduling
            pool: pool1
          tolerations:   # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector:  # Scheduling
            pool: pool1
          tolerations:   # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              resources:
                limits:
                  nvidia.com/gpu: 1
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
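
For context, a Trial definition such as this TFJob is embedded in a Katib Experiment under spec.trialTemplate.trialSpec. Below is a minimal sketch of such an Experiment, assuming a random search; the Experiment name, objective metric, and tuned parameter are illustrative, and only a single Worker replica spec is shown to keep the example short:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tfjob-experiment            # illustrative name
  namespace: your-user-namespace
spec:
  maxTrialCount: 3
  parallelTrialCount: 1
  objective:
    type: maximize
    objectiveMetricName: accuracy   # illustrative metric
  algorithm:
    algorithmName: random
  parameters:
    - name: batch_size              # illustrative tuned parameter
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "64"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: batchSize
        description: Batch size passed to the training job
        reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                nodeSelector:       # Scheduling
                  pool: pool1
                tolerations:        # Scheduling
                  - effect: NoSchedule
                    key: sku
                    operator: Equal
                    value: pool1
                containers:
                  - name: tensorflow
                    image: gcr.io/your-project/your-image
                    command:
                      - python
                      - -m
                      - trainer.task
                      - --batch_size=${trialParameters.batchSize}
                      - --training_steps=1000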

KServe InferenceServices

KServe InferenceServices expose PodSpec attributes that can be used to configure advanced scheduling scenarios, as shown in the following example:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    nodeSelector:  # Scheduling
      sku: pool-1
    tolerations:   # Scheduling
      - key: "sku"
        operator: "Exists"
        effect: "NoSchedule"
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
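
The transformer and explainer components of an InferenceService expose the same PodSpec attributes, so the same nodeSelector and tolerations settings can be applied to them as well.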
