This guide describes how to configure Charmed Kubeflow (CKF) workloads, such as Notebooks, Pipeline steps, and distributed training jobs, to follow specific scheduling patterns, for example running on labelled node pools or tolerating node taints.

Requirements

  1. A CKF deployment and access to the Kubeflow dashboard. See Get started for more details.
  2. An underlying Kubernetes (K8s) cluster with multiple nodes, where the target nodes are labelled and, where needed, tainted, as in the sketch below.
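
The examples in this guide schedule workloads onto nodes labelled sku: pool-1 or pool: pool1 and tolerate a NoSchedule taint with key sku. The following is a minimal sketch of such a node, shown as the resulting Node object rather than the kubectl label and kubectl taint commands that would normally set it up; the node name and label values are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: worker-1        # illustrative node name
  labels:
    sku: pool-1         # matched by the nodeSelector examples below
    pool: pool1
spec:
  taints:
    - key: sku          # matched by the tolerations examples below
      value: pool1
      effect: NoSchedule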

Notebooks

You can configure Notebooks to be scheduled on specific nodes via the Notebooks page in the Kubeflow dashboard when creating a new Notebook.

To do so, configure the Affinity and Tolerations settings during Notebook creation by:

  1. Clicking on + Create Notebook.
  2. Scrolling to the bottom and expanding Advanced Options.
  3. Configuring the Affinity and Tolerations sections.

Customising the options shown on the Notebook creation page is intended for admins only. See this guide for more details.

If your cluster setup uses taints, see Leverage PodDefaults for more details.
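
PodDefaults can apply a toleration to every Notebook Pod created in a namespace, so users do not have to configure it for each Notebook. The following is a minimal sketch, assuming the PodDefault CRD in your CKF version supports the tolerations field; the PodDefault name, selector label, and taint key are illustrative:

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: tolerate-sku-taint          # illustrative name
  namespace: your-user-namespace
spec:
  desc: Tolerate the sku taint on dedicated nodes
  selector:
    matchLabels:
      tolerate-sku-taint: "true"    # Pods carrying this label get the toleration
  tolerations:                      # assumes tolerations support in the PodDefault CRD
    - key: "sku"
      operator: "Exists"
      effect: "NoSchedule"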

Pipeline steps

K8s-specific settings for a Kubeflow Pipelines step, such as nodeSelector and tolerations, can be configured via the kfp-kubernetes Python package.

The following example sets both a nodeSelector and a toleration on a Pipeline step:

from kfp import dsl
from kfp.kubernetes import add_node_selector, add_toleration


@dsl.component(base_image="python:3.12")
def print_node_name():
    """Print the Node's hostname."""
    import socket

    print("Node name: %s" % socket.gethostname())


@dsl.pipeline
def node_scheduling_pipeline():
    # Schedule the step on nodes labelled sku=pool-1 and tolerate the sku taint
    print_node_task = print_node_name()
    task = add_node_selector(print_node_task, "sku", "pool-1")
    task = add_toleration(task, key="sku", operator="Exists", effect="NoSchedule")
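
When the pipeline is compiled and run, these calls populate the nodeSelector and tolerations fields of the Pod created for that step, so the step is scheduled onto the matching nodes.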

Distributed training

Distributed training in CKF is achieved via the Katib and Training Operator components.

Katib Trials can be implemented with different job types, which may use default settings defined in Trial Templates. These include standard K8s Jobs as well as distributed training jobs run via the Training Operator.

All Trial definitions ultimately configure a PodSpec for the Trial’s Pods. To accommodate the above scheduling use cases, you need to configure the nodeSelector and tolerations of the PodSpec.

Below is an example of a TFJob that can be used in a Trial definition and satisfies all the above criteria:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector:  # Scheduling
            pool: pool1
          tolerations:   # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector:  # Scheduling
            pool: pool1
          tolerations:   # Scheduling
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              resources:
                limits:
                  nvidia.com/gpu: 1
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
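
For context, a Trial definition such as this TFJob is embedded in a Katib Experiment under spec.trialTemplate.trialSpec. Below is a minimal sketch of such an Experiment, assuming a random search; the Experiment name, objective metric, and tuned parameter are illustrative, and only a single Worker replica spec is shown to keep the example short:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tfjob-experiment            # illustrative name
  namespace: your-user-namespace
spec:
  maxTrialCount: 3
  parallelTrialCount: 1
  objective:
    type: maximize
    objectiveMetricName: accuracy   # illustrative metric
  algorithm:
    algorithmName: random
  parameters:
    - name: batch_size              # illustrative tuned parameter
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "64"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: batchSize
        description: Batch size passed to the training job
        reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                nodeSelector:       # Scheduling
                  pool: pool1
                tolerations:        # Scheduling
                  - effect: NoSchedule
                    key: sku
                    operator: Equal
                    value: pool1
                containers:
                  - name: tensorflow
                    image: gcr.io/your-project/your-image
                    command:
                      - python
                      - -m
                      - trainer.task
                      - --batch_size=${trialParameters.batchSize}
                      - --training_steps=1000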

KServe InferenceServices

KServe InferenceServices expose PodSpec attributes that can be used to configure advanced scheduling scenarios, as shown in the following example:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    nodeSelector:  # Scheduling
      sku: pool-1
    tolerations:   # Scheduling
      - key: "sku"
        operator: "Exists"
        effect: "NoSchedule"
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
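
The transformer and explainer components of an InferenceService expose the same PodSpec attributes, so the same nodeSelector and tolerations settings can be applied to them as well.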
