Designing a Cost-Constrained, Production-Grade GKE Cluster with Terraform


An Aspiring DevOps Engineer passionate about automation, CI/CD, and cloud technologies. On a journey to simplify and optimize development workflows.

Kubernetes tutorials often focus on getting a cluster running.
Site Reliability Engineering focuses on what happens after that.

Welcome to Part 1 of this blog series. In this post, I’ll walk through how I designed and provisioned a production-grade Google Kubernetes Engine (GKE) cluster using Terraform under real-world constraints: limited quotas, cost optimization, least-privilege IAM, and operational readiness for GitOps and observability.

This is not a “hello world” cluster. It is the foundation of an SRE platform designed to support GitOps, service mesh, observability, SLOs, and chaos engineering in later parts of this series.


Why This Isn’t a Typical GKE Setup

Before writing any Terraform, I defined the constraints this cluster must operate under:

  • Cost-aware: Must run close to free tier using Spot VMs

  • Production-aligned: No default networks or node pools

  • Secure by design: No secrets or credentials committed to Git

  • GitOps-ready: Designed for ArgoCD-based deployments

  • Failure-tolerant: Node loss should be expected, not catastrophic

  • Explicit over implicit: No reliance on GKE defaults

These constraints drive every design decision in this article.


Terraform Project Structure

I prefer simple, flat Terraform layouts for early infrastructure phases to keep behavior explicit and debuggable.

.
├── providers.tf     # Provider and version pinning
├── vpc.tf           # Networking
├── gke.tf           # GKE cluster configuration
├── variables.tf     # Inputs (no secrets here)
├── outputs.tf       # Useful cluster outputs
└── .gitignore       # Prevent credential leaks

Each file has a single responsibility, making reviews and troubleshooting simpler.


Provider Configuration and Version Pinning

providers.tf :

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "7.13.0"
    }
  }
}

provider "google" {
  project     = var.project_id
  region      = var.region
  credentials = file("tf-key.json")
}

Why does version pinning matter?

  • Provider upgrades can introduce breaking behavior

  • Pinning versions ensures reproducible plans

  • Credential usage is explicit and auditable
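If an exact pin feels too strict, a pessimistic constraint is a common middle ground. The sketch below is an alternative to the block above, not the configuration used in this project; the required_version value is an assumption.

```hcl
terraform {
  # Assumed minimum Terraform CLI version; adjust to the version you actually run.
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source = "hashicorp/google"
      # "~> 7.13.0" accepts 7.13.x patch releases only (>= 7.13.0, < 7.14.0).
      version = "~> 7.13.0"
    }
  }
}
```

Either way, terraform init records the exact provider version and its checksums in .terraform.lock.hcl. Committing that file is what keeps plans reproducible across machines.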


Networking: Custom VPC with VPC-Native GKE

Default VPCs optimize for convenience, not production parity.

This cluster uses:

  • A custom VPC

  • A custom subnet

  • Secondary IP ranges for Pods and Services

This enables VPC-native (IP alias) networking, which is effectively mandatory for modern GKE features.

vpc.tf :

resource "google_compute_network" "vpc_network" {
  name                    = "${var.cluster_name}-vpc"
  auto_create_subnetworks = false
}
# Subnet configuration omitted for brevity (see repo)

VPC-native networking supports:

  • Higher pod density

  • Predictable IP allocation

  • Service mesh compatibility

  • Workload Identity (Most important)
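The subnet is omitted above, so here is a minimal sketch of what a subnet with secondary ranges could look like. The resource name vpc_subnet matches the reference in gke.tf, but the CIDR blocks and the range names "pods" and "services" are assumptions for illustration, not values from the repo.

```hcl
resource "google_compute_subnetwork" "vpc_subnet" {
  name    = "${var.cluster_name}-subnet"
  region  = var.region
  network = google_compute_network.vpc_network.id

  # Primary range: node IP addresses (assumed CIDR).
  ip_cidr_range = "10.0.0.0/20"

  # Secondary range for Pod IPs (hypothetical name and CIDR).
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"
  }

  # Secondary range for Service (ClusterIP) addresses (hypothetical name and CIDR).
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"
  }
}
```

With ranges like these, the cluster resource would point at them by name through an ip_allocation_policy block (cluster_secondary_range_name and services_secondary_range_name).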


GKE Cluster Design (Single Zonal, Explicit, Minimal)

This cluster is zonal, not regional - a conscious trade-off.

gke.tf :

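resource "google_container_cluster" "primary" {
  name     = var.cluster_name
  location = var.zone # a single zone makes this a zonal cluster

  networking_mode = "VPC_NATIVE"
  network         = google_compute_network.vpc_network.id
  subnetwork      = google_compute_subnetwork.vpc_subnet.id

  # Security: Enable Workload Identity (required in Part 2)
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Lets terraform destroy remove the cluster; consider true once it hosts real workloads
  deletion_protection = false

  # Create the smallest possible default pool, then delete it in favor of a separately managed pool
  remove_default_node_pool = true
  initial_node_count       = 1
}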

Why Zonal?

My first attempt used a regional cluster and failed due to disk quota exhaustion.

A regional cluster implicitly replicates nodes across three zones, so the default 100 GB boot disk per node becomes 3 × 100 GB = 300 GB, silently exceeding the 250 GB disk quota on my free-tier project. A zonal cluster keeps disk usage within quota.
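The quota arithmetic can also be written down as a Terraform check so a future change does not reintroduce the problem. This is only a sketch: every variable name below is hypothetical, and check blocks require Terraform 1.5 or later.

```hcl
# Hypothetical inputs for illustration; none of these names come from the project.
variable "zones_used" {
  type        = number
  description = "1 for a zonal cluster, 3 for a typical regional cluster"
  default     = 1
}

variable "nodes_per_zone" {
  type    = number
  default = 2
}

variable "disk_size_gb" {
  type    = number
  default = 50
}

variable "disk_quota_gb" {
  type    = number
  default = 250
}

locals {
  # Zonal: 1 x 2 x 50 = 100 GB. The failed regional attempt: 3 x 1 x 100 = 300 GB.
  total_disk_gb = var.zones_used * var.nodes_per_zone * var.disk_size_gb
}

check "disk_quota" {
  assert {
    condition     = local.total_disk_gb <= var.disk_quota_gb
    error_message = "Planned node disks (${local.total_disk_gb} GB) exceed the ${var.disk_quota_gb} GB quota."
  }
}
```

A failed check is reported as a warning during plan and apply rather than blocking the run, so it acts as an early signal, not a hard gate.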


Automated Node Management

gke.tf (inside the node pool resource):

  management {
    auto_repair  = true
    auto_upgrade = true
  }

These settings ensure:

  • Failed nodes are replaced automatically

  • Security patches are applied without manual intervention

For an SRE platform, hands-off node management is non-negotiable.
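In the Google provider, management is a block of google_container_node_pool, not a top-level block of the cluster resource. The sketch below shows the placement. The pool name example_pool and the upgrade_settings values are assumptions, included only to show how upgrades can be kept gentle on a small pool.

```hcl
resource "google_container_node_pool" "example_pool" {
  name       = "example-pool" # hypothetical name
  cluster    = google_container_cluster.primary.name
  location   = var.zone
  node_count = 2

  # Node auto-repair and auto-upgrade are configured per node pool.
  management {
    auto_repair  = true
    auto_upgrade = true
  }

  # Assumed values: add one extra node during an upgrade and never take
  # an existing node away before its replacement is ready.
  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }
}
```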


Node Pool Configuration (The “Spot” Strategy)

Instead of defining nodes inline in the cluster resource (which ties their lifecycle to the cluster), I used a separate google_container_node_pool resource. This lets me rotate or replace nodes without recreating the cluster.

gke.tf (Node Pool):

resource "google_container_node_pool" "spot_nodes" {
  cluster    = google_container_cluster.primary.name
  location   = var.zone
  node_count = 2

  node_config {
    machine_type = "e2-standard-4"

    # 50 GB is the sweet spot: small enough for quotas, big enough for container images
    disk_size_gb = 50
    disk_type    = "pd-balanced"

    # The Money Saver
    spot = true

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
    tags = ["gke-node", "${var.cluster_name}-node"]
  }
}

Why Spot VMs? Spot VMs cost ~70–90% less than on-demand instances. For an SRE platform, failure is a design input. If this cluster can’t survive Spot eviction, it’s not production-ready.


Secrets and Variables: Nothing Sensitive in Git

Environment-specific and sensitive values (like project_id) are:

  • Defined as variables without defaults

  • Supplied via terraform.tfvars

  • Explicitly excluded via .gitignore

This prevents:

  • Accidental credential leaks

  • Public repository exposure

  • Tight coupling between code and environment

terraform.tfvars:

project_id = "<GCP_PROJECT_ID>"
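For reference, here is a sketch of what the matching variables.tf could look like. The variable names are the ones referenced elsewhere in this post; the descriptions and the defaults for region, zone, and cluster_name are assumptions based on the commands used later.

```hcl
# No default: Terraform refuses to plan until a value is supplied,
# so the project ID never has to live in version control.
variable "project_id" {
  type        = string
  description = "GCP project ID, supplied via terraform.tfvars"
}

variable "region" {
  type        = string
  description = "Region for regional resources such as the subnet"
  default     = "us-central1" # assumed
}

variable "zone" {
  type        = string
  description = "Zone for the zonal GKE cluster and node pool"
  default     = "us-central1-a" # assumed
}

variable "cluster_name" {
  type        = string
  description = "Name of the cluster, also used as a prefix for network resources"
  default     = "sre-portfolio-cluster" # assumed
}
```

Terraform also reads values from environment variables named TF_VAR_<name>, for example TF_VAR_project_id, which suits CI systems where a tfvars file is inconvenient.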

Principle of Least Privilege: Terraform With a Dedicated Service Account

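One of the most common anti-patterns in infrastructure provisioning is running Terraform with application-default credentials (ADC) tied to a personal user account. Changes are then attributed to an individual who usually holds far broader permissions than Terraform needs, which makes them risky and hard to audit.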

For this platform, I explicitly avoided ADC and instead used a dedicated Terraform service account with scoped permissions.

Step 1: Create a Dedicated Terraform Service Account

gcloud iam service-accounts create terraform-deployer \
  --display-name="Terraform Infrastructure Deployer"

This service account exists only to provision infrastructure.


Step 2: Grant Required IAM Roles (Nothing More)

I avoided "Owner" roles. Instead, I granted only what was needed: container.admin, compute.admin, storage.admin and iam.serviceAccountUser.

export PROJECT_ID="sre-portfolio-platform"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:terraform-deployer@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/container.admin"

# ... (Repeat for compute.admin, iam.serviceAccountUser, storage.admin)

Why these roles?

Role                      Purpose
container.admin           Create and manage GKE clusters
compute.admin             Manage VPCs, subnets, disks, and VM resources
iam.serviceAccountUser    Allow Terraform to attach service accounts
storage.admin             Required for GKE-managed storage and state operations
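I used gcloud for this bootstrap, but the same service account and role set could be expressed in Terraform and applied once by a project admin. The sketch below is an alternative, not what this project does; the resource names and the deployer_roles local are hypothetical.

```hcl
# Hypothetical bootstrap configuration, applied once with admin credentials.
variable "project_id" {
  type = string
}

resource "google_service_account" "terraform_deployer" {
  project      = var.project_id
  account_id   = "terraform-deployer"
  display_name = "Terraform Infrastructure Deployer"
}

locals {
  # The four roles listed in the table above, and nothing more.
  deployer_roles = [
    "roles/container.admin",
    "roles/compute.admin",
    "roles/iam.serviceAccountUser",
    "roles/storage.admin",
  ]
}

# One additive binding per role; other members of each role are left untouched.
resource "google_project_iam_member" "deployer" {
  for_each = toset(local.deployer_roles)

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:${google_service_account.terraform_deployer.email}"
}
```

Keeping the role list in one place makes it easy to review in a pull request when a new permission is requested.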

Step 3: Secure the Key

I generated a key for Terraform but ensured it never touches Git.

gcloud iam service-accounts keys create tf-key.json \
  --iam-account=terraform-deployer@$PROJECT_ID.iam.gserviceaccount.com

echo "tf-key.json" >> .gitignore

This keeps Terraform authentication explicit, auditable, and revocable.


Step 4: Configure Terraform to Use the Service Account

providers.tf:

provider "google" {
  project     = var.project_id
  region      = var.region
  credentials = file("tf-key.json")
}

This ensures that Terraform never falls back to application-default credentials and that all infrastructure actions are tied to a machine identity.


Cluster Access Validation

After provisioning, I validated access using:

gcloud container clusters get-credentials sre-portfolio-cluster \
  --zone us-central1-a \
  --project sre-portfolio-platform

This writes a kubeconfig entry with the cluster's credentials to $HOME/.kube/config.

This confirmed:

  • Cluster creation succeeded

  • IAM bindings were correct

  • The control plane was reachable

  • Nodes joined successfully
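The outputs.tf file from the project structure is not shown in this post. A minimal sketch, assuming it exposes the values needed for the access step, might look like this; the output names are hypothetical.

```hcl
output "cluster_name" {
  description = "Name of the GKE cluster"
  value       = google_container_cluster.primary.name
}

output "cluster_location" {
  description = "Zone the cluster runs in"
  value       = google_container_cluster.primary.location
}

output "cluster_endpoint" {
  description = "IP address of the cluster control plane"
  value       = google_container_cluster.primary.endpoint
}

# Convenience output: the exact command to configure kubectl for this cluster.
output "get_credentials_command" {
  description = "Run this to write a kubeconfig entry for the cluster"
  value       = "gcloud container clusters get-credentials ${google_container_cluster.primary.name} --zone ${google_container_cluster.primary.location} --project ${var.project_id}"
}
```

After an apply, terraform output -raw get_credentials_command prints the command ready to paste, which avoids retyping the cluster name, zone, and project.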


Deployment and Validation

Apply the infrastructure:

terraform apply

Configure local access:

gcloud container clusters get-credentials sre-portfolio-cluster \
  --zone us-central1-a

Verify Nodes:

kubectl get nodes -o wide

You should see something like this:

Spot nodes registered successfully and joined the cluster.


Key Takeaways

  • Cost constraints expose architectural assumptions

  • GKE defaults are not production-safe

  • Smaller disks and zonal clusters matter under quotas

  • Spot VMs are viable for non-critical platforms

  • Infrastructure design is about trade-offs, not checklists


What’s Next

In Part 2, I’ll bootstrap GitOps from scratch by installing:

  • ArgoCD (The GitOps Engine)

  • External Secrets (The Secret Manager)

  • Gateway API (The Modern Ingress)

Stay Tuned!

Building a Production-Grade SRE Platform on Kubernetes

Part 1 of 8

This series explores how to design and operate a production-grade SRE platform on Kubernetes, covering infrastructure, GitOps, observability, security, SLOs, service mesh, and chaos engineering.
