FinOps in Kubernetes - Taming the Cloud Bill with Kubecost

How cost visibility and rightsizing were implemented in our SRE platform without bloating the cluster.

An Aspiring DevOps Engineer passionate about automation, CI/CD, and cloud technologies. On a journey to simplify and optimize development workflows.

Welcome back to the Building a Production-Grade SRE Platform on Kubernetes series.

Let’s recap what we’ve built so far:

  • Infrastructure & GitOps: Automated via Terraform and ArgoCD.

  • Observability: Deep visibility with the LGTM stack.

  • Security & Delivery: Zero Trust with Kyverno/Istio, and Canary rollouts via Argo Rollouts.

We have built an absolute beast of a platform. But beasts need feeding, and in the cloud, they feed on your credit card.

The biggest challenge in a shared Kubernetes cluster is Cost Attribution. When your AWS or GCP bill arrives, it just says "Compute Engine: $5,000." It doesn't tell you which team, which microservice, or which rogue pod is chewing up that money.

In Phase 7, we implement FinOps using Kubecost. We are going to shine a spotlight on our cluster spend, down to the exact namespace and deployment, empowering developers to engineer cost-efficiently.


The Tech Stack

  • FinOps Engine: Kubecost (via Helm)

  • Metrics Backend: Existing Prometheus (from our Phase 3 LGTM stack)

  • Ingress: Kubernetes Gateway API (HTTPRoute)

  • GitOps: ArgoCD


Step 1: The Architectural Decision

Before we write any YAML, let's look at how Kubecost actually works under the hood. It needs three things: cluster state, usage metrics, and cloud billing rates.

![](https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/66ad16df5a2d0578e94f4b33/6b0eca5b-814e-48ab-91f5-6346255283cb.png align="middle")

By default, Kubecost installs its own bundled Prometheus and Grafana. Running multiple Prometheus instances in a single cluster is a classic anti-pattern: it wastes CPU, RAM, and storage on redundant scraping.

Since we already built a robust Observability stack in Phase 3, we configured Kubecost to disable its internal metrics components and point directly to our existing Prometheus instance instead.

File: kubernetes/platform/finops/kubecost/Chart.yaml

apiVersion: v2
name: kubecost
version: 1.0.0
dependencies:
  - name: cost-analyzer
    version: 1.108.1
    repository: https://kubecost.github.io/cost-analyzer/

File: kubernetes/platform/finops/kubecost/values.yaml

cost-analyzer:
  kubecostToken: "public-demo-token-sre-portfolio"
  global:
    prometheus:
      enabled: false # Disabled bundled Prometheus
      
      # Pointing to our existing LGTM Prometheus
      fqdn: http://observability-stack-kube-p-prometheus.monitoring.svc.cluster.local:9090

    grafana:
      enabled: false # Disabled bundled Grafana
      proxy: false

  ingress:
    enabled: false # Disabled default ingress

This is what Platform Engineering is about: integrating tools thoughtfully so the platform remains lean. We also track the Helm dependency cleanly in a Chart.yaml file, pinning the cost-analyzer to version 1.108.1.
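Because Kubecost now depends entirely on the shared Prometheus, it is worth confirming that the endpoint in values.yaml answers queries. Below is a minimal sketch using the standard Prometheus HTTP API (`/api/v1/query`). The function names are hypothetical, and the `up` query is only a generic liveness probe, not something Kubecost requires.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# FQDN copied from values.yaml above.
PROMETHEUS_FQDN = (
    "http://observability-stack-kube-p-prometheus.monitoring.svc.cluster.local:9090"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def query_succeeded(response_body: str) -> bool:
    """Return True when a Prometheus API response reports status 'success'."""
    return json.loads(response_body).get("status") == "success"


def check_prometheus(base_url: str = PROMETHEUS_FQDN, timeout: float = 5.0) -> bool:
    """Run the `up` query; this only resolves from inside the cluster network."""
    with urlopen(build_query_url(base_url, "up"), timeout=timeout) as resp:
        return query_succeeded(resp.read().decode("utf-8"))
```

Calling `check_prometheus()` from a pod in the cluster should return True if the service name and port are right; from a laptop the cluster-local DNS name will not resolve.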


Step 2: The GitOps Deployment

We wired this up using our standard ArgoCD App-of-Apps pattern.

File: kubernetes/bootstrap/finops.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: finops-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/anantvaid/otel-platform-infra.git
    targetRevision: main
    path: kubernetes/platform/finops/kubecost
  destination:
    server: https://kubernetes.default.svc
    namespace: kubecost
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Within minutes, ArgoCD picked up the manifest and synced the finops-stack to a Healthy and Synced state.

![](https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/66ad16df5a2d0578e94f4b33/ff2f0f8b-6d8c-4aee-b7ef-95c5a66d3916.png align="middle")
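The same "Healthy and Synced" check can be scripted, for example as a gate in a pipeline. This is a minimal sketch that reads the `status.sync.status` and `status.health.status` fields of an Application resource. The sample JSON is a hypothetical, trimmed-down stand-in for real output, and the function name is our own.

```python
import json


def is_synced_and_healthy(app: dict) -> bool:
    """Check the sync and health status fields reported on an Application."""
    status = app.get("status", {})
    sync = status.get("sync", {}).get("status")
    health = status.get("health", {}).get("status")
    return sync == "Synced" and health == "Healthy"


# Hypothetical, trimmed-down output of:
#   kubectl get application finops-stack -n argocd -o json
sample = json.loads("""
{
  "metadata": {"name": "finops-stack"},
  "status": {
    "sync": {"status": "Synced"},
    "health": {"status": "Healthy"}
  }
}
""")

print(is_synced_and_healthy(sample))
```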


Step 3: Exposing the Dashboard (Gateway API)

Instead of using legacy Ingress controllers, we continue to embrace the modern Kubernetes Gateway API. We created an HTTPRoute that exposes the Kubecost dashboard through our existing external-gateway.

File: kubernetes/platform/finops/kubecost/httproute.yaml

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kubecost-route
  namespace: kubecost
spec:
  parentRefs:
  - name: external-gateway
    namespace: default
  hostnames:
  - "cost.techtalkswithanant.online"
  rules:
  - backendRefs:
    - name: finops-stack-cost-analyzer
      port: 9090

Traffic hitting cost.techtalkswithanant.online is routed to the finops-stack-cost-analyzer service on port 9090.
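One thing to watch: the route lives in the kubecost namespace while the Gateway lives in default, so the Gateway's listener has to allow routes from other namespaces before the route is accepted. A quick way to confirm attachment is to read the route's `status.parents[].conditions`. The sketch below assumes the JSON comes from something like `kubectl get httproute kubecost-route -n kubecost -o json`; the sample data and the function name are hypothetical.

```python
def route_is_attached(route: dict) -> bool:
    """True if every parent reports Accepted and ResolvedRefs as "True"."""
    parents = route.get("status", {}).get("parents", [])
    if not parents:
        return False  # No Gateway controller has processed the route yet.
    for parent in parents:
        conditions = {c["type"]: c["status"] for c in parent.get("conditions", [])}
        if conditions.get("Accepted") != "True":
            return False
        if conditions.get("ResolvedRefs") != "True":
            return False
    return True


# Hypothetical status block for a route that attached successfully.
sample_route = {
    "status": {
        "parents": [
            {
                "parentRef": {"name": "external-gateway", "namespace": "default"},
                "conditions": [
                    {"type": "Accepted", "status": "True"},
                    {"type": "ResolvedRefs", "status": "True"},
                ],
            }
        ]
    }
}

print(route_is_attached(sample_route))
```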


Step 4: The Value Realization (What Kubecost Actually Does)

Once the UI was live, we instantly unlocked critical FinOps capabilities. But first, how does it actually calculate these costs?

The Calculation Model

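Kubecost doesn't just guess. It takes the billing rate from your cloud provider, looks at your pod's footprint, and charges each resource at the greater of what you requested and what you actually used. A pod that requests 1 CPU but uses 0.2 is still billed for 1 CPU, because that capacity is reserved for it.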

![](https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/66ad16df5a2d0578e94f4b33/077fc60e-fdb5-421e-b984-72cfbc1cfc0a.png align="middle")
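The max-of-request-and-usage rule is easy to express in a few lines. This is a simplified sketch, not Kubecost's actual code: the class, the function, and the hourly rates are all made-up illustrations.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    requested: float    # amount requested, e.g. CPU cores or GiB of RAM
    used: float         # average amount actually used over the window
    hourly_rate: float  # price per unit per hour (hypothetical figure)


def billable_cost(sample: ResourceSample, hours: float) -> float:
    """Charge for whichever is larger: the request or the real usage."""
    return max(sample.requested, sample.used) * sample.hourly_rate * hours


# CPU is over-requested (billed on the request);
# RAM usage exceeds the request (billed on the usage).
cpu = ResourceSample(requested=0.5, used=0.2, hourly_rate=0.03)
ram = ResourceSample(requested=1.0, used=1.5, hourly_rate=0.004)

pod_cost = billable_cost(cpu, hours=24) + billable_cost(ram, hours=24)
print(f"${pod_cost:.3f} per day")
```

Here the CPU is billed at 0.5 cores even though only 0.2 were used, while the RAM is billed at the 1.5 GiB actually consumed.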

1. Granular Cost Allocation (The "Who")

With the math running in the background, we can look at our dashboard and immediately see the cumulative cost broken down by namespace over the last 7 days.

![](https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/66ad16df5a2d0578e94f4b33/595f700f-96c4-4007-8bd1-6597489d0cae.png align="middle")

We can see exact fractional dollar amounts attributed to specific namespaces, and it even calculates "Efficiency" percentages based on requested vs. utilized resources.
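To make the namespace rollup concrete, here is a rough sketch of the aggregation. It treats efficiency as plain usage divided by request for a single resource, which is a simplification we chose for illustration; the pod figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-pod figures:
# (namespace, cost in USD, CPU cores requested, CPU cores used)
pods = [
    ("monitoring", 1.20, 2.0, 1.5),
    ("monitoring", 0.40, 0.5, 0.1),
    ("kubecost", 0.30, 0.5, 0.25),
]


def summarize(pods):
    """Roll pod figures up to namespace level: total cost and usage/request ratio."""
    totals = defaultdict(lambda: {"cost": 0.0, "requested": 0.0, "used": 0.0})
    for namespace, cost, requested, used in pods:
        totals[namespace]["cost"] += cost
        totals[namespace]["requested"] += requested
        totals[namespace]["used"] += used
    return {
        ns: {
            "cost": round(t["cost"], 2),
            # Efficiency is undefined when nothing was requested.
            "efficiency_pct": (
                round(100 * t["used"] / t["requested"], 1) if t["requested"] else None
            ),
        }
        for ns, t in totals.items()
    }


for namespace, row in summarize(pods).items():
    print(namespace, row)
```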

2. Right-Sizing Recommendations (The "Waste")

Developers notoriously over-provision resources out of fear of getting OOMKilled. Kubecost analyzes historical usage patterns and provides actionable recommendations to adjust CPU/Memory requests and limits. It identifies the "Idle" capacity that we are paying for but not using.
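The idea behind a right-sizing recommendation can be sketched as "take a high percentile of observed usage and add headroom." This is one common heuristic, not necessarily the algorithm Kubecost uses; the percentile, the headroom factor, and the sample data are all assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend_request(usage_samples, pct=95, headroom=1.2):
    """Suggest a request: a high usage percentile plus a safety margin."""
    return round(percentile(usage_samples, pct) * headroom, 3)


# Hypothetical memory usage samples in MiB for a pod that requests 1024 MiB.
usage_mib = [210, 220, 215, 230, 400, 225, 218, 222, 240, 228]
current_request = 1024

suggested = recommend_request(usage_mib)
idle = current_request - suggested
print(f"suggested request: {suggested} MiB, idle capacity: {idle} MiB")
```

Using a high percentile rather than the average keeps the one-off spike in the samples inside the recommended request, which is what protects the pod from being OOMKilled after the request is lowered.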

3. Budgets and Alerting (The "Guardrails")

Instead of finding out we overspent 30 days later, Kubecost allows us to set daily budgets per namespace. If an app suddenly starts consuming 5x its normal memory due to a leak, an alert is fired directly to our engineering teams.
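The guardrail logic itself is simple: compare each namespace's daily spend against its budget and flag the overruns. The sketch below shows only that comparison, with hypothetical namespaces and dollar amounts; it is not Kubecost's alert configuration format.

```python
def check_budgets(daily_spend, budgets):
    """Return one alert message per namespace whose spend exceeds its daily budget."""
    alerts = []
    for namespace, spent in sorted(daily_spend.items()):
        budget = budgets.get(namespace)
        # Namespaces without a budget are skipped rather than flagged.
        if budget is not None and spent > budget:
            alerts.append(
                f"{namespace}: spent ${spent:.2f}, over the ${budget:.2f} daily budget"
            )
    return alerts


# Hypothetical figures in USD per day.
budgets = {"monitoring": 5.00, "kubecost": 1.00}
daily_spend = {"monitoring": 7.25, "kubecost": 0.40, "sandbox": 3.10}

for message in check_budgets(daily_spend, budgets):
    print(message)
```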


The Takeaway

If you can't measure it, you can't improve it.

FinOps isn't just a finance department concern; it's an engineering discipline. By integrating Kubecost deeply into our SRE platform, we've transformed "cost" from an abstract monthly bill into an observable, real-time engineering metric alongside latency, traffic, errors, and saturation.

Status: Phase 7 Complete. Next Up: Phase 8 – Chaos Engineering (The Finale). We have spent 7 phases building a secure, observable, and cost-efficient platform. Now, we are going to intentionally break it to prove it actually survives production outages.

