Progressive Delivery in Kubernetes: Argo Rollouts & Istio

How we moved from Big Bang deployments to controlled Canary releases using traffic splitting and automated analysis.


An Aspiring DevOps Engineer passionate about automation, CI/CD, and cloud technologies. On a journey to simplify and optimize development workflows.

Welcome back to the Building a Production-Grade SRE Platform on Kubernetes series.

In the previous posts, we built a secure and observable platform:

  • Part 1: Infrastructure (GKE)

  • Part 2: GitOps Engine (ArgoCD)

  • Part 3: Observability (LGTM)

  • Part 4: The CI/CD Factory

  • Part 5: Zero Trust Security (Kyverno & Istio)

We have secured the "who" (Identity) and the "where" (Infrastructure). Now, we need to fix the "how" of releasing software.

Until now, we have used standard Kubernetes Deployments, which rely on the "Rolling Update" strategy. While it ensures zero downtime, it has a major flaw: Lack of Control. Once you run git push, the new version rolls out to everyone with no gate in between. If version 2.0 has a critical bug, 100% of your users can be hitting it before you have a chance to roll back.

In this Part 6, we implement Progressive Delivery (Canary Releases). Instead of replacing the old version instantly, we route just 20% of traffic to the new version, verify it works, and only then shift the rest of the traffic over in stages.


The Tech Stack

  • Orchestrator: Argo Rollouts (Kubernetes Controller)

  • Traffic Manager: Istio (VirtualService & DestinationRule)

  • Visualization: Argo Rollouts Dashboard

  • Target App: manifest-gen


Step 1: The Foundation (Installing the Controller)

First, we needed to install the Argo Rollouts controller. As always, we used the App-of-Apps pattern.

File: kubernetes/bootstrap/rollouts.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: argo-rollouts
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/anantvaid/otel-platform-infra.git
    targetRevision: main
    path: kubernetes/platform/rollouts
  destination:
    server: https://kubernetes.default.svc
    namespace: argo-rollouts
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

File: kubernetes/platform/rollouts/Chart.yaml

apiVersion: v2
name: argo-rollouts
version: 1.0.0
dependencies:
  - name: argo-rollouts
    version: 2.40.5
    repository: https://argoproj.github.io/argo-helm

File: kubernetes/platform/rollouts/values.yaml

argo-rollouts:
  dashboard:
    enabled: true
    service:
      type: ClusterIP
  controller:
    metrics:
      enabled: true
      serviceMonitor:
        enabled: true

We enabled the Dashboard service in values.yaml so we could visualize the traffic split in real time.


Step 2: The Network Plumbing (Istio Integration)

This is where the magic happens. Standard Kubernetes canary deployments are imprecise because they rely on replica counts (e.g., 1 canary pod out of 10 gives roughly 10% of traffic).

By integrating Istio, we can split traffic by precise percentages, regardless of how many pods are running.
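To make the replica-count limitation concrete, here is a small sketch in plain shell arithmetic. The function names are made up for this illustration and are not part of any tool. Without a mesh, the canary's share is approximately canary pods divided by total pods, so the smallest step you can take is one pod.

```shell
#!/bin/sh
# Illustration only: how coarse replica-based canaries are.
# Function names are hypothetical helpers, not part of any tool.

# Approximate share of traffic (integer percent) the canary receives
# when traffic is spread evenly across all pods.
canary_share() {
  total_pods=$1
  canary_pods=$2
  echo $(( canary_pods * 100 / total_pods ))
}

# Smallest non-zero canary share reachable with a given pod count.
smallest_step() {
  canary_share "$1" 1
}

for pods in 1 4 10; do
  echo "pods=$pods smallest_step=$(smallest_step "$pods")%"
done
```

With the single replica used by the Rollout in this post, a pod-count canary could only be 0% or 100%. The mesh is what makes a 20% step possible.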

File: kubernetes/apps/manifest-gen/networking.yaml

# 1. The Stable Service (Live Traffic)
apiVersion: v1
kind: Service
metadata:
  name: manifest-gen-stable
spec:
  selector:
    app: manifest-gen
  ports:
  - name: http
    port: 80          # port the load generator calls in Step 4
    targetPort: 8080  # containerPort of the manifest-gen pods
---
# 2. The Canary Service (Preview Traffic)
apiVersion: v1
kind: Service
metadata:
  name: manifest-gen-canary
spec:
  selector:
    app: manifest-gen
  ports:
  - name: http
    port: 80
    targetPort: 8080
---
# 3. VirtualService (The Router)
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: manifest-gen-vs
spec:
  hosts:
  - manifest-gen-stable
  http:
  - name: primary  # Route name referenced by the Rollout
    route:
    - destination:
        host: manifest-gen-stable
        subset: stable  # Subsets are defined in a DestinationRule (not shown)
      weight: 100  # Starts at 100% Stable
    - destination:
        host: manifest-gen-stable
        subset: canary
      weight: 0    # Starts at 0% Canary

This setup gives Argo Rollouts a knob to turn. When we start a rollout, Argo will dynamically update the weight in this VirtualService.
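The VirtualService above routes to stable and canary subsets, and in Istio subsets are defined in a DestinationRule, which this post does not show. A minimal sketch of what such a manifest could look like follows, wrapped in a shell function so it can be printed or redirected to a file. The resource name manifest-gen-dr and the function name are assumptions, not taken from the original repository.

```shell
#!/bin/sh
# Sketch: print a DestinationRule that defines the two subsets the
# VirtualService refers to. The name "manifest-gen-dr" is an assumption.
render_destination_rule() {
  cat <<'EOF'
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata:
  name: manifest-gen-dr        # hypothetical name
spec:
  host: manifest-gen-stable    # same host the VirtualService routes to
  subsets:
  - name: stable
    labels:
      app: manifest-gen        # Argo Rollouts adds a pod-template-hash label at runtime
  - name: canary
    labels:
      app: manifest-gen
EOF
}

render_destination_rule
```

For Argo Rollouts to manage these subsets, the Rollout's trafficRouting.istio section would also need a destinationRule entry with name, canarySubsetName, and stableSubsetName.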


Step 3: The Rollout Strategy

We replaced our standard Deployment with a Rollout, a custom resource provided by Argo Rollouts. This tells Argo exactly how to introduce the new version.

File: kubernetes/apps/manifest-gen/rollout.yaml

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: manifest-gen
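spec:
  replicas: 1
  selector:              # Required: must match the pod template labels
    matchLabels:
      app: manifest-gen
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: manifest-gen-vs
            routes:
            - primary # The route name in VirtualService to modify the weights
          # Subset-based routes (stable/canary) also need a destinationRule
          # entry here (name, canarySubsetName, stableSubsetName).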
      steps:
      - setWeight: 20  # 1. Send 20% traffic to Canary
      - pause: {}      # 2. WAIT indefinitely for human approval
      - setWeight: 50  # 3. Increase to 50%
      - pause: {duration: 30s} # 4. Wait 30s automatically
      - setWeight: 100 # 5. Full rollout
  template:
    metadata:
      labels:
        app: manifest-gen
        istio.io/dataplane-mode: ambient
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: manifest-gen
        image: anantvaid4/manifest-generator-api:v1
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "100m"
            memory: "64Mi"

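Once ArgoCD has synced the Rollout, its state can be inspected with the Argo Rollouts kubectl plugin:

kubectl argo rollouts get rollout manifest-gen -n manifest-gen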


Step 4: The Execution (Traffic Shift)

1. Triggering the Rollout

To start the canary process, we simply updated the container image in our Git repository. This single line change triggers the entire workflow.

File: kubernetes/apps/manifest-gen/rollout.yaml

spec:
  template:
    spec:
      containers:
      - name: manifest-gen
        # Changing the image triggers the Rollout 👇
        image: ealen/echo-server:latest

Note: In a real-world scenario, you would typically just bump the version tag (e.g., v1.0.0 → v1.1.0). For this demo, I switched to ealen/echo-server intentionally. This returns a completely different JSON response, making it very easy to spot which requests hit the Canary vs. the Stable version during our tests.

Once the canary weight reaches 20%, the rollout hits the pause: {} step and enters a Suspended state, holding there until someone verifies the canary and promotes it.

2. Visualizing the Split

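We checked the rollout from the CLI and then opened the Argo Rollouts Dashboard, which the kubectl plugin serves locally (by default at http://localhost:3100).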

kubectl argo rollouts get rollout manifest-gen -n manifest-gen
kubectl argo rollouts dashboard -n argo-rollouts &

At this point, the VirtualService was automatically updated by the controller:

  • Stable: 80%

  • Canary: 20%
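The weights can also be read straight from the VirtualService. The sketch below wraps a kubectl jsonpath query in a helper; the function name is hypothetical, and it assumes kubectl points at the cluster and that the VirtualService lives in the manifest-gen namespace used by the other commands in this post.

```shell
#!/bin/sh
# Sketch: print "<subset>=<weight>" for each destination of the first HTTP route.
# Assumes cluster access and the manifest-gen namespace.
show_weights() {
  kubectl get virtualservice manifest-gen-vs -n manifest-gen \
    -o jsonpath='{range .spec.http[0].route[*]}{.destination.subset}={.weight}{"\n"}{end}'
}

# Usage (needs cluster access):
#   show_weights
```

During the first pause this would be expected to show the stable subset at 80 and the canary subset at 20.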

3. Verifying with Load Generator

We spun up a temporary load generator pod (carefully adding security overrides!) to hit the service.

kubectl run load-gen -n manifest-gen \
  --image=curlimages/curl --restart=Never --rm -it \
  --overrides='{"spec": {"securityContext": {"runAsNonRoot": true, "runAsUser": 1000}}}' \
  -- /bin/sh -c "while true; do curl -s 'http://manifest-gen-stable.manifest-gen.svc.cluster.local:80/generate?kind=service&name=backend'; echo ''; echo '--------------------------------'; sleep 0.5; done"

Result:

Istio enforces the traffic split rules, routing about 20% of incoming requests to the canary subset (roughly 1 in 5 over a large enough sample) while keeping the remaining 80% on the stable version.
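One way to check the split empirically is to tally which responses came from the canary. The sketch below reads one response per line on standard input and counts the lines matching a pattern that identifies the canary. The function name and the choice of pattern are assumptions for illustration; pick a string that only appears in the echo-server's output.

```shell
#!/bin/sh
# Sketch: tally canary vs. stable responses read from stdin.
# Assumes one response per line; $1 is a pattern that only canary responses match.
tally_split() {
  pattern=$1
  awk -v pat="$pattern" '
    $0 ~ pat { canary++ }   # line looks like a canary response
    { total++ }             # every line counts as one response
    END {
      if (total == 0) { print "no responses"; exit 1 }
      printf "canary=%d stable=%d canary_pct=%d\n", canary, total - canary, canary * 100 / total
    }'
}

# Usage with captured responses (file name is hypothetical):
#   tally_split 'some-canary-only-string' < responses.txt
```

Because weighted routing is applied per request, a short sample can drift from 20%; the figure settles as the number of requests grows.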

4. The Promotion

Satisfied with the results, we promoted the rollout via the UI (the CLI equivalent is kubectl argo rollouts promote manifest-gen -n manifest-gen). The canary weight moved to 50%, paused for 30 seconds, and then went to 100%.

The promotion succeeded: the controller shifted 100% of traffic to the new revision and scaled down the previous ReplicaSet to zero.
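For a scripted promotion, the two plugin commands can be chained. This is a sketch under assumptions: the helper name is hypothetical, the 120s timeout is an arbitrary choice, and the Argo Rollouts kubectl plugin must be installed.

```shell
#!/bin/sh
# Sketch: promote the paused rollout, then block until it finishes.
# The 120s timeout is arbitrary; adjust it to cover the remaining steps.
promote_and_wait() {
  kubectl argo rollouts promote manifest-gen -n manifest-gen &&
    kubectl argo rollouts status manifest-gen -n manifest-gen --timeout 120s
}

# Usage (needs cluster access and the plugin):
#   promote_and_wait
```

The status subcommand watches the rollout until it completes or the timeout expires, which makes it convenient as a gate in a pipeline.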


The Takeaway

Progressive Delivery is the difference between "deploying" and "releasing."

  • Deploying is installation (putting bits on disk).

  • Releasing is giving traffic to users.

By separating these two concepts using Argo Rollouts and Istio, we have built a safety net. If v2 had been broken, only about 20% of requests would have failed, and we could have aborted immediately, sending all traffic back to the stable version without a full cluster rollback.
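The abort path mentioned above can be sketched with the same kubectl plugin. The helper name is hypothetical, and the commands assume the rollout lives in the manifest-gen namespace.

```shell
#!/bin/sh
# Sketch: stop a bad canary and show the resulting state.
# abort sends traffic back to the stable version.
abort_canary() {
  kubectl argo rollouts abort manifest-gen -n manifest-gen &&
    kubectl argo rollouts get rollout manifest-gen -n manifest-gen
}

# Usage (needs cluster access and the plugin):
#   abort_canary
```

After an abort the Rollout is typically reported as Degraded until the desired image changes again. In a GitOps setup like this one, that would usually mean reverting the image change in Git.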

Status: Progressive Delivery Complete. Next Up: Part 7: FinOps & Cost Visibility (Kubecost)


Building a Production-Grade SRE Platform on Kubernetes

Part 6 of 8

This series explores how to design and operate a production-grade SRE platform on Kubernetes, covering infrastructure, GitOps, observability, security, SLOs, service mesh, and chaos engineering.
