FinOps in Kubernetes - Taming the Cloud Bill with Kubecost
How cost visibility and rightsizing were implemented in our SRE platform without bloating the cluster.

Welcome back to the Building a Production-Grade SRE Platform on Kubernetes series.
Let’s recap what we’ve built so far:
Infrastructure & GitOps: Automated via Terraform and ArgoCD.
Observability: Deep visibility with the LGTM stack.
Security & Delivery: Zero Trust with Kyverno/Istio, and Canary rollouts via Argo Rollouts.
We have built an absolute beast of a platform. But beasts need feeding, and in the cloud, they feed on your credit card.
The biggest challenge in a shared Kubernetes cluster is Cost Attribution. When your AWS or GCP bill arrives, it just says "Compute Engine: $5,000." It doesn't tell you which team, which microservice, or which rogue pod is chewing up that money.
In Phase 7, we implement FinOps using Kubecost. We are going to shine a spotlight on our cluster spend, down to the exact namespace and deployment, empowering developers to engineer cost-efficiently.
The Tech Stack
FinOps Engine: Kubecost (via Helm)
Metrics Backend: Existing Prometheus (from our Phase 3 LGTM stack)
Ingress: Kubernetes Gateway API (HTTPRoute)
GitOps: ArgoCD
Step 1: The Architectural Decision
Before we write any YAML, let's look at how Kubecost actually works under the hood. It needs three things: cluster state, usage metrics, and cloud billing rates.

By default, when you install Kubecost, it attempts to install its own bundled version of Prometheus and Grafana. Running a second Prometheus next to one that already scrapes the cluster is a classic anti-pattern: it wastes CPU, RAM, and storage on redundant scraping.
Since we already built a robust Observability stack in Phase 3, we configured Kubecost to disable its internal metrics components and point directly to our existing Prometheus instance instead.
File: kubernetes/platform/finops/kubecost/Chart.yaml
apiVersion: v2
name: kubecost
version: 1.0.0
dependencies:
  - name: cost-analyzer
    version: 1.108.1
    repository: https://kubecost.github.io/cost-analyzer/
File: kubernetes/platform/finops/kubecost/values.yaml
cost-analyzer:
  kubecostToken: "public-demo-token-sre-portfolio"
  global:
    prometheus:
      enabled: false # Disabled bundled Prometheus
      # Pointing to our existing LGTM Prometheus
      fqdn: http://observability-stack-kube-p-prometheus.monitoring.svc.cluster.local:9090
    grafana:
      enabled: false # Disabled bundled Grafana
      proxy: false
  ingress:
    enabled: false # Disabled default ingress
This is what Platform Engineering is about: integrating tools thoughtfully so the platform remains lean. We also track the Helm dependency cleanly in a Chart.yaml file, pinning the cost-analyzer to version 1.108.1.
Step 2: The GitOps Deployment
We wired this up using our standard ArgoCD App-of-Apps pattern.
File: kubernetes/bootstrap/finops.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: finops-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/anantvaid/otel-platform-infra.git
    targetRevision: main
    path: kubernetes/platform/finops/kubecost
  destination:
    server: https://kubernetes.default.svc
    namespace: kubecost
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
Within minutes, ArgoCD picked up the manifest and synced the finops-stack to a Healthy and Synced state.

Step 3: Exposing the Dashboard (Gateway API)
Instead of using the older Ingress API, we continue to embrace the modern Kubernetes Gateway API. We created an HTTPRoute that attaches to our shared external-gateway to securely expose the Kubecost dashboard.
File: kubernetes/platform/finops/kubecost/httproute.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kubecost-route
  namespace: kubecost
spec:
  parentRefs:
    - name: external-gateway
      namespace: default
  hostnames:
    - "cost.techtalkswithanant.online"
  rules:
    - backendRefs:
        - name: finops-stack-cost-analyzer
          port: 9090
Traffic hitting cost.techtalkswithanant.online is routed to the finops-stack-cost-analyzer service on port 9090.
Step 4: The Value Realization (What Kubecost Actually Does)
Once the UI was live, we instantly unlocked critical FinOps capabilities. But first, how does it actually calculate these costs?
The Calculation Model
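Kubecost doesn't just guess. It takes the billing rate from your cloud provider, looks at your pod's footprint, and bills each resource (CPU, memory) at the greater of what you requested and what you actually used. A pod that requests 0.5 CPU but uses only 0.2 is still charged for 0.5, because that capacity is reserved for it and unavailable to anyone else.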
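To make the "maximum of request and usage" rule concrete, here is a minimal Python sketch. It is not Kubecost's code: the function names are ours, and the hourly rates ($0.03 per vCPU and $0.004 per GiB of RAM) are made-up numbers for illustration, not real cloud prices.

```python
def billable_amount(requested: float, used: float) -> float:
    """Charge for whichever is larger: the reservation or the real usage."""
    return max(requested, used)


def hourly_pod_cost(cpu_request: float, cpu_usage: float,
                    ram_request_gib: float, ram_usage_gib: float,
                    cpu_rate: float, ram_rate: float) -> float:
    """Hourly cost of one pod, given per-vCPU and per-GiB hourly rates."""
    cpu_cost = billable_amount(cpu_request, cpu_usage) * cpu_rate
    ram_cost = billable_amount(ram_request_gib, ram_usage_gib) * ram_rate
    return cpu_cost + ram_cost


# Hypothetical pod: over-requests CPU (0.5 requested, 0.2 used)
# and bursts past its memory request (1.0 GiB requested, 1.5 GiB used).
cost = hourly_pod_cost(
    cpu_request=0.5, cpu_usage=0.2,
    ram_request_gib=1.0, ram_usage_gib=1.5,
    cpu_rate=0.03, ram_rate=0.004,  # assumed rates, not real pricing
)
print(f"hourly cost: ${cost:.4f}")
```

In this example the CPU is billed on the request (0.5 vCPU) and the memory on the usage (1.5 GiB), which is why both over-provisioning and unplanned bursts show up in the bill.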

1. Granular Cost Allocation (The "Who")
With the math running in the background, we can look at our dashboard and immediately see the cumulative cost broken down by namespace over the last 7 days.

We can see exact fractional dollar amounts attributed to specific namespaces, and it even calculates "Efficiency" percentages based on requested vs. utilized resources.
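As a rough illustration of how an "Efficiency" percentage can be derived, the sketch below divides usage by request for each resource and then weights the two ratios by how much each resource costs. We are assuming this cost-weighted form here; Kubecost's exact handling of edge cases (for example, workloads with no requests set) may differ, and all names and numbers are hypothetical.

```python
def resource_efficiency(used: float, requested: float) -> float:
    """Fraction of the requested resource that was actually used."""
    if requested <= 0:
        return 0.0  # assumption: treat "no request" as 0% rather than dividing by zero
    return used / requested


def cost_weighted_efficiency(cpu_cost: float, cpu_eff: float,
                             ram_cost: float, ram_eff: float) -> float:
    """Combine CPU and RAM efficiency, weighting each by its share of the cost."""
    total_cost = cpu_cost + ram_cost
    if total_cost == 0:
        return 0.0
    return (cpu_cost * cpu_eff + ram_cost * ram_eff) / total_cost


# Hypothetical namespace: CPU costs $3.00 and is 40% used,
# RAM costs $1.00 and is 50% used.
cpu_eff = resource_efficiency(used=0.2, requested=0.5)
ram_eff = resource_efficiency(used=0.5, requested=1.0)
overall = cost_weighted_efficiency(3.0, cpu_eff, 1.0, ram_eff)
print(f"overall efficiency: {overall:.1%}")
```

Weighting by cost means an under-used expensive resource drags the score down more than an under-used cheap one.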
2. Right-Sizing Recommendations (The "Waste")
Developers notoriously over-provision resources out of fear of getting OOMKilled. Kubecost analyzes historical usage patterns and provides actionable recommendations to adjust CPU/Memory requests and limits. It identifies the "Idle" capacity that we are paying for but not using.
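The idea behind a right-sizing recommendation can be sketched in a few lines of Python. This is a simplified stand-in, not Kubecost's actual algorithm: we assume the recommendation is a high percentile of observed usage divided by a target utilization, and the 95th percentile and 80% target are illustrative defaults we picked.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(usage_samples: list[float], pct: int = 95,
                      target_utilization: float = 0.8) -> float:
    """Suggest a request so that typical peak usage sits at the target utilization."""
    return percentile(usage_samples, pct) / target_utilization


# Hypothetical CPU usage samples (in cores) for a pod that requests 2.0 cores.
samples = [0.10, 0.12, 0.15, 0.11, 0.40, 0.13, 0.12, 0.14, 0.11, 0.10]
suggested = recommend_request(samples)
print(f"current request: 2.00 cores, suggested: {suggested:.2f} cores")
```

The headroom from the target utilization is what keeps the recommendation from trading a cost problem for an OOMKill or throttling problem.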
3. Budgets and Alerting (The "Guardrails")
Instead of finding out we overspent 30 days later, Kubecost allows us to set daily budgets per namespace. If an app suddenly starts consuming 5x its normal memory due to a leak, its cost climbs past the budget and an alert is fired directly to our engineering teams.
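Conceptually, a per-namespace daily budget is just a threshold comparison. The Python sketch below shows that logic only; it does not call Kubecost's alerting API, and the namespace names, budgets, and spend figures are hypothetical.

```python
def namespaces_over_budget(daily_spend: dict[str, float],
                           daily_budget: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (namespace, spend, budget) for every namespace above its daily budget."""
    breaches = []
    for namespace, budget in sorted(daily_budget.items()):
        spend = daily_spend.get(namespace, 0.0)  # no recorded spend counts as zero
        if spend > budget:
            breaches.append((namespace, spend, budget))
    return breaches


# Hypothetical figures in dollars per day.
budgets = {"checkout": 5.0, "monitoring": 10.0}
spend = {"checkout": 7.5, "monitoring": 8.0}

for namespace, spent, budget in namespaces_over_budget(spend, budgets):
    print(f"ALERT: {namespace} spent ${spent:.2f} against a ${budget:.2f} daily budget")
```

Evaluating this daily rather than monthly is the whole point: the feedback loop shrinks from a billing cycle to a single day.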
The Takeaway
If you can't measure it, you can't improve it.
FinOps isn't just a finance department concern; it's an engineering discipline. By integrating Kubecost deeply into our SRE platform, we've transformed "cost" from an abstract monthly bill into an observable, real-time engineering metric alongside latency, traffic, errors, and saturation.
Status: Phase 7 Complete. Next Up: Phase 8 – Chaos Engineering (The Finale). We have spent 7 phases building a secure, observable, and cost-efficient platform. Now, we are going to intentionally break it to prove it actually survives production outages.



