Best practices for production deployments of COS Lite

Juju compatibility

COS Lite requires Juju 3.1 to function properly. To set up cross-model, cross-controller relations with existing Juju controllers and models, we therefore recommend upgrading your existing controllers to either the latest Juju 3 version (3.1.5 at the time of writing) or the latest version in the 2.9 track (2.9.44 at the time of writing).

Topology

Deploy in isolation

COS Lite should at the very least be deployed in its own model, but preferably on its own substrate with its own controller. This limits the blast radius of any malfunction in the workloads you observe or in the observability stack itself. We strongly recommend using a separate three-node MicroK8s cluster.

COS Alerter

COS Alerter should be deployed apart from COS Lite itself, on separate infrastructure and preferably on completely different hardware. The purpose of the alerter is to let operators know whenever the routing of notifications from COS Lite stops working, which prevents a false sense of security.

Avoid pulling data cross-model

Cross-model relations using the prometheus_scrape interface should be avoided. Instead, deploy a Grafana agent in each of the models you want to observe and let the agents be a fan-in point pushing the data to COS. This makes for a less error-prone networking topology that is easier to reason about, especially at scale.
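
The fan-in pattern described above could look roughly like the following for a single observed Kubernetes model. This is a sketch under assumptions that are not stated on this page: the agent charm is taken to be grafana-agent-k8s, COS Lite is assumed to expose offers named prometheus-receive-remote-write and loki-logging, and the model and controller names (my-app, cos-controller:admin/cos) are hypothetical placeholders. The helper only prints the commands, so they can be reviewed and adapted before being run.

```shell
#!/bin/sh
# Print (not run) the Juju commands that wire one observed model to COS Lite.
# Assumptions: charm "grafana-agent-k8s"; offers "prometheus-receive-remote-write"
# and "loki-logging" exist in the COS model. All argument values are placeholders.
print_fan_in_commands() {
    observed_model=$1   # model to observe, e.g. "my-app"
    cos_offer_base=$2   # controller:user/model hosting COS, e.g. "cos-controller:admin/cos"
    cat <<EOF
juju switch ${observed_model}
juju deploy grafana-agent-k8s
juju consume ${cos_offer_base}.prometheus-receive-remote-write
juju consume ${cos_offer_base}.loki-logging
juju integrate grafana-agent-k8s prometheus-receive-remote-write
juju integrate grafana-agent-k8s loki-logging
EOF
}

print_fan_in_commands my-app cos-controller:admin/cos
```

The charms being observed still need to be related to the agent in their own model; only the agent holds the cross-model relations to COS.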

Networking

Ingress

MetalLB, or an equivalent load balancer, should be configured on the Kubernetes environment COS is running on. As part of the COS Lite bundle, Traefik is deployed and configured to provide ingress for the bundle components. Make sure the load balancer provides Traefik with a static IP, or some other identity that remains stable over time.

Egress

Some charms require external connectivity for the COS Lite bundle to function correctly.

As a common requirement, the environment should be able to reach:

  • Charmhub;
  • the Juju registry;
  • Snapcraft.
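
A quick way to verify these requirements from a node in the environment is a small reachability probe. The sketch below is illustrative only: the three URLs are assumptions about where these services are reached and are not taken from this page, so replace them with the endpoints that apply to your environment (for example, a proxy or mirror).

```shell
#!/bin/sh
# Endpoint URLs are assumptions, not taken from this page; adjust as needed.
ENDPOINTS="https://api.charmhub.io https://registry.jujucharms.com https://api.snapcraft.io"

# probe URL: succeeds if any HTTP response comes back within 10 seconds.
probe() {
    curl --silent --output /dev/null --max-time 10 "$1"
}

# check_egress: prints one line per endpoint; returns non-zero if any probe failed.
check_egress() {
    failed=0
    for url in $ENDPOINTS; do
        if probe "$url"; then
            echo "ok   $url"
        else
            echo "FAIL $url"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling check_egress on a cluster node prints one line per endpoint and exits non-zero if any of them could not be reached, which makes it easy to use in a pre-deployment check.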

Some charms also access other, charm-specific URLs by default. To disable the functionality that requires those URLs, refer to the documentation for the relevant charms.

Controller routing

If the network topology is anything other than flat, the Juju controllers need to be bootstrapped with the controller-external-ips or controller-external-name configuration options, or both, so that the controllers can communicate over routable identities for your cross-controller relations. For example:

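# Expose the controller through a load balancer on a routable IP.
# The list value is quoted so the shell does not treat [...] as a glob pattern.
juju bootstrap microk8s uk8s \
  --config controller-service-type=loadbalancer \
  --config 'controller-external-ips=[10.0.0.2]'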

Storage

Set up distributed storage

Note: Do not use the hostpath-storage MicroK8s add-on in production, for two reasons:

  • PersistentVolumeClaims created by the hostpath storage provisioner are bound to the local node, so it is impossible to move them to a different node.
  • A hostpath volume can grow beyond the capacity set in the volume claim manifest.

Instead, you could use the rook-ceph add-on together with MicroCeph. See the MicroCeph tutorial.

Storage volume

You should come up with an appropriate storage overlay for your use case. For example, a deployment that handles roughly:

  • 1M samples/min with 150 targets
  • 100k log lines/min for about 150 targets

has a growth rate of about 50 GB per day under normal operations. So, if you want a retention interval of about two months, you’ll need 3 TB of storage for the telemetry alone.
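
The arithmetic behind that estimate is simply the daily growth rate multiplied by the retention period. A minimal sketch, assuming "about two months" means 60 days; the function name is illustrative, and you would plug in the growth rate measured in your own deployment:

```shell
#!/bin/sh
# storage_gb GROWTH_GB_PER_DAY RETENTION_DAYS
# Prints the capacity needed for telemetry, in GB (integer arithmetic).
storage_gb() {
    echo $(( $1 * $2 ))
}

# 50 GB/day kept for 60 days -> 3000 GB, i.e. about 3 TB
storage_gb 50 60
```

This covers the telemetry only; any headroom for growth in the number of targets has to be added on top.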

