Designing a Production-Grade GKE Cluster with Terraform

Kubernetes tutorials often focus on getting a cluster running.
Site Reliability Engineering focuses on what happens after that.

Welcome to part one of this blog series. In this post, I’ll walk through how I designed and provisioned a production-grade Google Kubernetes Engine (GKE) cluster using Terraform, under real-world constraints: limited quotas, cost optimization, least-privilege IAM, and operational readiness for GitOps and observability.

This is not a hello world cluster. It is the foundation of an SRE platform designed to support GitOps, service mesh, observability, SLOs, and chaos engineering in later phases of this series.

Why This Isn’t a Typical GKE Setup

Before writing any Terraform, I defined the constraints this cluster must operate under:

Cost-aware: Must run close to free tier using Spot VMs
Production-aligned: No default networks or node pools
Secure by design: No secrets or credentials committed to Git
GitOps-ready: Designed for ArgoCD-based deployments
Failure-tolerant: Node loss should be expected, not catastrophic
Explicit over implicit: No reliance on GKE defaults

These constraints drive every design decision in this article.

Terraform Project Structure

I prefer simple, flat Terraform layouts for early infrastructure phases to keep behavior explicit and debuggable.

.
├── providers.tf     # Provider and version pinning
├── vpc.tf           # Networking
├── gke.tf           # GKE cluster configuration
├── variables.tf     # Inputs (no secrets here)
├── outputs.tf       # Useful cluster outputs
└── .gitignore       # Prevent credential leaks

Each file has a single responsibility, making reviews and troubleshooting simpler.

Provider Configuration and Version Pinning

providers.tf :

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "7.13.0"
    }
  }
}

provider "google" {
  project     = var.project_id
  region      = var.region
  credentials = file("tf-key.json")
}

Why does version pinning matter?

Provider upgrades can introduce breaking behavior
Pinning versions ensures reproducible plans
Credential usage is explicit and auditable
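
As a rough illustration of how constraint strings behave, the sketch below repeats the exact pin from above and notes looser alternatives in comments. The required_version line is my own assumption (the original configuration does not show one), and its range is illustrative only.

```hcl
terraform {
  # Assumption: also constrain the Terraform CLI itself.
  # The range here is illustrative, not taken from the original setup.
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source = "hashicorp/google"

      # Exact pin, as used in this post: only 7.13.0 is accepted.
      version = "7.13.0"

      # Looser alternatives, for comparison:
      #   "~> 7.13.0"  accepts 7.13.x patch releases only
      #   "~> 7.13"    accepts 7.13 and later 7.x releases, but not 8.0
    }
  }
}
```

An exact pin trades convenience for reproducibility: upgrades only happen when the version string is changed in a reviewed commit.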

Networking: Custom VPC with VPC-Native GKE

Default VPCs optimize for convenience, not production parity.

This cluster uses:

A custom VPC
A custom subnet
Secondary IP ranges for Pods and Services

This enables VPC-native (IP alias) networking, which is effectively mandatory for modern GKE features.

vpc.tf :

resource "google_compute_network" "vpc_network" {
  name                    = "${var.cluster_name}-vpc"
  auto_create_subnetworks = false
}
# Subnet configuration omitted for brevity (see repo)
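
For readers without the repo at hand, here is a minimal sketch of what such a subnet could look like. The resource name vpc_subnet matches the reference used in gke.tf, but the CIDR blocks and the range names "pods" and "services" are assumptions chosen for illustration, not the values from the actual repository.

```hcl
# Hypothetical subnet for the custom VPC.
# CIDRs and secondary range names are illustrative assumptions.
resource "google_compute_subnetwork" "vpc_subnet" {
  name          = "${var.cluster_name}-subnet"
  region        = var.region
  network       = google_compute_network.vpc_network.id
  ip_cidr_range = "10.0.0.0/20" # primary range: node IPs

  # Secondary range for Pod IPs (alias IPs)
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.16.0.0/14"
  }

  # Secondary range for Service ClusterIPs
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.20.0.0/20"
  }
}

# To make the cluster use these named ranges instead of letting GKE
# pick its own, the cluster resource would reference them by name:
#
#   ip_allocation_policy {
#     cluster_secondary_range_name  = "pods"
#     services_secondary_range_name = "services"
#   }
```

Sizing the Pod range generously matters more than the others, since every Pod consumes an address from it.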

VPC-native networking supports:

Higher pod density
Predictable IP allocation
Service mesh compatibility
Workload Identity (Most important)

GKE Cluster Design (Single Zonal, Explicit, Minimal)

This cluster is zonal, not regional - a conscious trade-off.

gke.tf :

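resource "google_container_cluster" "primary" {
  name     = var.cluster_name
  location = var.zone # a zone (not a region) makes the cluster zonal

  networking_mode = "VPC_NATIVE"
  network         = google_compute_network.vpc_network.id
  subnetwork      = google_compute_subnetwork.vpc_subnet.id

  # Security: Enable Workload Identity (Required in Phase two)
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Allow `terraform destroy` to remove the cluster
  deletion_protection = false

  # Create the smallest possible default pool, then delete it,
  # so that nodes are managed by a separate node pool resource
  remove_default_node_pool = true
  initial_node_count       = 1
}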

Why Zonal?

My first attempt used a regional cluster and failed due to disk quota exhaustion.

A regional cluster replicates each node pool across three zones by default, so node boot disks multiply: 3 × 100 GB (the default boot disk size) = 300 GB, which silently exceeds the 250 GB free-tier disk quota. A zonal cluster keeps the footprint within quota.
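
The arithmetic can be written down as a small, standalone Terraform sketch. The local names are hypothetical, and the regional case assumes one node per zone, which is what the 3 × 100 GB figure above implies.

```hcl
locals {
  quota_gb = 250 # disk quota cited in this post

  # Regional attempt: 3 zones x 1 node per zone x 100 GB default boot disk
  regional_disk_gb = 3 * 1 * 100 # 300 GB

  # Zonal design: 1 zone x 2 nodes x 50 GB (the node pool used in this post)
  zonal_disk_gb = 1 * 2 * 50 # 100 GB
}

output "regional_fits_quota" {
  value = local.regional_disk_gb <= local.quota_gb # false: 300 > 250
}

output "zonal_fits_quota" {
  value = local.zonal_disk_gb <= local.quota_gb # true: 100 <= 250
}
```

The zonal layout leaves about 150 GB of headroom for persistent volumes in later phases.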

Automated Node Management

gke.tf (inside the node pool resource):

  management {
    auto_repair  = true
    auto_upgrade = true
  }

These settings ensure:

Failed nodes are replaced automatically
Security patches are applied without manual intervention

For an SRE platform, hands-off node management is non-negotiable.

Node Pool Configuration (The “Spot” Strategy)

Instead of configuring nodes inline on the cluster resource (which couples node changes to the cluster's own lifecycle), I used a separate google_container_node_pool resource. This lets us rotate or replace nodes without destroying the cluster.

gke.tf (Node Pool):

resource "google_container_node_pool" "spot_nodes" {
  cluster    = google_container_cluster.primary.name
  location   = var.zone
  node_count = 2

  node_config {
    machine_type = "e2-standard-4"

    # 50GB is the sweet spot: Small enough for quotas, big enough for Docker images
    disk_size_gb = 50
    disk_type    = "pd-balanced"

    # The Money Saver
    spot = true

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
    tags = ["gke-node", "${var.cluster_name}-node"]
  }
}

Why Spot VMs?

Spot VMs cost ~70–90% less than on-demand instances. For an SRE platform, failure is a design input. If this cluster can’t survive Spot eviction, it’s not production-ready.

Secrets and Variables: Nothing Sensitive in Git

Sensitive or environment-specific values (like project_id) are:

Defined as variables without defaults
Supplied via terraform.tfvars
Explicitly excluded via .gitignore

This prevents:

Accidental credential leaks
Public repository exposure
Tight coupling between code and environment

terraform.tfvars:

project_id = "<GCP_PROJECT_ID>"
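
A matching variables.tf might look roughly like the sketch below. The variable names come from the references used throughout this post (var.project_id, var.region, var.zone, var.cluster_name); the descriptions and the defaults are assumptions, with the default values borrowed from the commands shown later in this post.

```hcl
variable "project_id" {
  description = "GCP project that hosts the cluster"
  type        = string
  # No default on purpose: the value must be supplied via
  # terraform.tfvars, which is excluded from Git.
}

variable "region" {
  description = "Region for regional resources such as the subnet"
  type        = string
  default     = "us-central1" # assumption, derived from the zone below
}

variable "zone" {
  description = "Zone for the zonal GKE cluster and its node pool"
  type        = string
  default     = "us-central1-a"
}

variable "cluster_name" {
  description = "Name of the GKE cluster, also used as a resource prefix"
  type        = string
  default     = "sre-portfolio-cluster"
}
```

Because project_id has no default, terraform plan fails fast with a prompt or an error when terraform.tfvars is missing, rather than silently targeting the wrong project.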

Principle of Least Privilege: Terraform With a Dedicated Service Account

One of the most common anti-patterns in infrastructure provisioning is running Terraform with Application Default Credentials (ADC) tied to a personal user account. This ties infrastructure changes to one person's broad permissions and makes them hard to audit.

For this platform, I explicitly avoided ADC and instead used a dedicated Terraform service account with scoped permissions.

Step 1: Create a Dedicated Terraform Service Account

gcloud iam service-accounts create terraform-deployer \
  --display-name="Terraform Infrastructure Deployer"

This service account exists only to provision infrastructure.

Step 2: Grant Required IAM Roles (Nothing More)

I avoided "Owner" roles. Instead, I granted only what was needed: container.admin, compute.admin, storage.admin and iam.serviceAccountUser.

export PROJECT_ID="sre-portfolio-platform"

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:terraform-deployer@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/container.admin"

# ... (Repeat for compute.admin, iam.serviceAccountUser, storage.admin)

Why these roles?

Role	Purpose
container.admin	Create and manage GKE clusters
compute.admin	Manage VPCs, subnets, disks, and VM resources
iam.serviceAccountUser	Allow Terraform to attach service accounts
storage.admin	Required for GKE-managed storage and state operations
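
I granted these roles with gcloud, but the same set can be expressed declaratively. The following is only a sketch of that alternative: it assumes a project administrator applies it from a separate configuration (the terraform-deployer account cannot grant its own roles), and the resource and local names are hypothetical.

```hcl
locals {
  # The four roles listed in the table above
  terraform_roles = [
    "roles/container.admin",
    "roles/compute.admin",
    "roles/iam.serviceAccountUser",
    "roles/storage.admin",
  ]
}

# One non-authoritative binding per role; other members of each
# role are left untouched.
resource "google_project_iam_member" "terraform_deployer" {
  for_each = toset(local.terraform_roles)

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:terraform-deployer@${var.project_id}.iam.gserviceaccount.com"
}
```

Keeping the role list in one place makes it easy to review exactly what the deployer account can do.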

Step 3: Secure the Key

I generated a key for Terraform but ensured it never touches Git.

gcloud iam service-accounts keys create tf-key.json \
  --iam-account="terraform-deployer@$PROJECT_ID.iam.gserviceaccount.com"

echo "tf-key.json" >> .gitignore

This keeps Terraform authentication explicit, auditable, and revocable.

Step 4: Configure Terraform to Use the Service Account

providers.tf:

provider "google" {
  project     = var.project_id
  region      = var.region
  credentials = file("tf-key.json")
}

This ensures that Terraform never uses application-default credentials and that all infrastructure actions are tied to a machine identity.

Cluster Access Validation

After provisioning, I validated access using:

gcloud container clusters get-credentials sre-portfolio-cluster \
  --zone us-central1-a \
  --project sre-portfolio-platform

This writes a kubeconfig entry with the cluster credentials to $HOME/.kube/config.

This confirmed:

Cluster creation succeeded
IAM bindings were correct
The control plane was reachable
Nodes joined successfully
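
To avoid retyping the cluster name and zone, the access command can also be emitted from outputs.tf. This is a sketch under the assumption that the cluster resource is named primary, as in gke.tf; the output names are hypothetical.

```hcl
output "cluster_name" {
  description = "Name of the GKE cluster"
  value       = google_container_cluster.primary.name
}

output "cluster_endpoint" {
  description = "Control plane endpoint"
  value       = google_container_cluster.primary.endpoint
  sensitive   = true # redacted in plan and apply output
}

output "get_credentials_command" {
  description = "Command that configures local kubectl access"
  value       = "gcloud container clusters get-credentials ${google_container_cluster.primary.name} --zone ${google_container_cluster.primary.location} --project ${var.project_id}"
}
```

After an apply, terraform output -raw get_credentials_command prints the command ready to paste into a shell.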

Deployment and Validation

Apply the infrastructure:

terraform apply

Configure local access:

gcloud container clusters get-credentials sre-portfolio-cluster \
  --zone us-central1-a

Verify Nodes:

kubectl get nodes -o wide

You should see the Spot nodes registered and joined to the cluster, each with a Ready status.

Key Takeaways

Cost constraints expose architectural assumptions
GKE defaults are not production-safe
Smaller disks and zonal clusters matter under quotas
Spot VMs are viable for non-critical platforms
Infrastructure design is about trade-offs, not checklists

What’s Next

In Part 2, I’ll bootstrap GitOps from scratch by installing:

ArgoCD (The GitOps Engine)
External Secrets (The Secret Manager)
Gateway API (The Modern Ingress)

Stay Tuned!
