Designing a Cost-Constrained, Production-Grade GKE Cluster with Terraform

Kubernetes tutorials often focus on getting a cluster running.
Site Reliability Engineering focuses on what happens after that.
Welcome to part one of this blog series. In this post, I’ll walk through how I designed and provisioned a production-grade Google Kubernetes Engine (GKE) cluster using Terraform under real-world constraints: limited quotas, cost optimization, least-privilege IAM, and operational readiness for GitOps and observability.
This is not a hello world cluster. It is the foundation of an SRE platform designed to support GitOps, service mesh, observability, SLOs, and chaos engineering in later phases of this series.
Why This Isn’t a Typical GKE Setup
Before writing any Terraform, I defined the constraints this cluster must operate under:
Cost-aware: Must run close to free tier using Spot VMs
Production-aligned: No default networks or node pools
Secure by design: No secrets or credentials committed to Git
GitOps-ready: Designed for ArgoCD-based deployments
Failure-tolerant: Node loss should be expected, not catastrophic
Explicit over implicit: No reliance on GKE defaults
These constraints drive every design decision in this article.
Terraform Project Structure
I prefer simple, flat Terraform layouts for early infrastructure phases to keep behavior explicit and debuggable.
.
├── providers.tf # Provider and version pinning
├── vpc.tf # Networking
├── gke.tf # GKE cluster configuration
├── variables.tf # Inputs (no secrets here)
├── outputs.tf # Useful cluster outputs
└── .gitignore # Prevent credential leaks
Each file has a single responsibility, making reviews and troubleshooting simpler.
Provider Configuration and Version Pinning
providers.tf:
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "7.13.0"
    }
  }
}

provider "google" {
  project     = var.project_id
  region      = var.region
  credentials = file("tf-key.json")
}
Why does version pinning matter?
Provider upgrades can introduce breaking behavior
Pinning versions ensures reproducible plans
Credential usage is explicit and auditable
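If an exact pin feels too rigid, a pessimistic constraint is one possible middle ground. The sketch below is a variant of the terraform block in providers.tf, not the configuration used for this cluster, and the required_version value is an assumption rather than something taken from the repo.

```hcl
terraform {
  # Assumed minimum CLI version; adjust to whatever your team actually runs.
  required_version = ">= 1.5.0"

  required_providers {
    google = {
      source = "hashicorp/google"
      # "~> 7.13.0" accepts 7.13.x patch releases but rejects 7.14.0 and later.
      version = "~> 7.13.0"
    }
  }
}
```

Either way, Terraform records the exact provider version it selected in .terraform.lock.hcl, so committing that file is what keeps plans reproducible across machines.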
Networking: Custom VPC with VPC-Native GKE
Default VPCs optimize for convenience, not production parity.
This cluster uses:
A custom VPC
A custom subnet
Secondary IP ranges for Pods and Services
This enables VPC-native (IP alias) networking, which is effectively mandatory for modern GKE features.
vpc.tf:
resource "google_compute_network" "vpc_network" {
  name                    = "${var.cluster_name}-vpc"
  auto_create_subnetworks = false
}
# Subnet configuration omitted for brevity (see repo)
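For readers without the repo at hand, a subnet with secondary ranges for Pods and Services might look roughly like the sketch below. The resource name vpc_subnet matches the reference used in gke.tf, but the subnet name, the CIDR blocks, and the range names "pods" and "services" are placeholders I chose for illustration, not the values from the actual project.

```hcl
# Illustrative only: CIDR blocks and range names are assumptions.
resource "google_compute_subnetwork" "vpc_subnet" {
  name          = "${var.cluster_name}-subnet"
  region        = var.region
  network       = google_compute_network.vpc_network.id
  ip_cidr_range = "10.0.0.0/20" # primary range, used for node IPs

  # Secondary range that Pods draw their alias IPs from
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.16.0.0/14"
  }

  # Secondary range for ClusterIP Services
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.20.0.0/20"
  }
}
```

With ranges named this way, the cluster resource could select them through an ip_allocation_policy block that sets cluster_secondary_range_name and services_secondary_range_name.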
VPC-native networking supports:
Higher pod density
Predictable IP allocation
Service mesh compatibility
Workload Identity (Most important)
GKE Cluster Design (Single Zonal, Explicit, Minimal)
This cluster is zonal, not regional - a conscious trade-off.
gke.tf:
resource "google_container_cluster" "primary" {
  name     = var.cluster_name
  location = var.zone # a zone (not a region) makes this a zonal cluster

  networking_mode = "VPC_NATIVE"
  network         = google_compute_network.vpc_network.id
  subnetwork      = google_compute_subnetwork.vpc_subnet.id

  # Security: enable Workload Identity (required in Phase two)
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  deletion_protection = false

  # Drop the default node pool; nodes come from a separately managed pool
  remove_default_node_pool = true
  initial_node_count       = 1
}
Why Zonal?
My first attempt used a regional cluster and failed due to disk quota exhaustion.
A regional cluster replicates its nodes across three zones, so the node disks multiply implicitly (3 × 100 GB = 300 GB) and exceed the 250 GB free-tier disk quota. A zonal cluster keeps us within budget.
Automated Node Management
gke.tf:
# The management block belongs inside the google_container_node_pool resource
management {
  auto_repair  = true
  auto_upgrade = true
}
These settings ensure:
Failed nodes are replaced automatically
Security patches are applied without manual intervention
For an SRE platform, hands-off node management is non-negotiable.
Node Pool Configuration (The “Spot” Strategy)
Instead of configuring nodes inline on the cluster resource (which ties their lifecycle to the cluster), I used a separate google_container_node_pool resource. This lets me rotate or replace nodes without destroying the cluster.
gke.tf (Node Pool):
resource "google_container_node_pool" "spot_nodes" {
  cluster    = google_container_cluster.primary.name
  location   = var.zone
  node_count = 2

  node_config {
    machine_type = "e2-standard-4"

    # 50GB is the sweet spot: small enough for quotas, big enough for container images
    disk_size_gb = 50
    disk_type    = "pd-balanced"

    # The Money Saver
    spot = true

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]

    tags = ["gke-node", "${var.cluster_name}-node"]
  }
}
Why Spot VMs? Spot VMs typically cost 60–91% less than on-demand instances, at the price of being reclaimable at any time. For an SRE platform, failure is a design input. If this cluster can’t survive Spot eviction, it’s not production-ready.
Secrets and Variables: Nothing Sensitive in Git
Sensitive values (like project_id) are:
Defined as variables without defaults
Supplied via terraform.tfvars
Explicitly excluded via .gitignore
This prevents:
Accidental credential leaks
Public repository exposure
Tight coupling between code and environment
terraform.tfvars:
project_id = "<GCP_PROJECT_ID>"
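The matching declarations in variables.tf could look like the sketch below. The four variable names come from the var.* references used throughout this post; the descriptions are my own wording and the real file may differ. Leaving out default forces each value to be supplied from outside the code.

```hcl
# variables.tf: inputs only, no defaults and no secrets
variable "project_id" {
  description = "GCP project ID, supplied via terraform.tfvars"
  type        = string
}

variable "region" {
  description = "Region for the provider and the subnet"
  type        = string
}

variable "zone" {
  description = "Zone for the zonal cluster and its node pool"
  type        = string
}

variable "cluster_name" {
  description = "Name of the GKE cluster, also used as a prefix for network resources"
  type        = string
}
```

If terraform.tfvars is missing a value, terraform plan prompts for it instead of silently falling back to a baked-in default.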
Principle of Least Privilege: Terraform With a Dedicated Service Account
One of the most common anti-patterns in infrastructure provisioning is running Terraform with Application Default Credentials (ADC) tied to a personal user account. This ties infrastructure changes to an individual rather than a dedicated identity, which is risky and hard to audit.
For this platform, I explicitly avoided ADC and instead used a dedicated Terraform service account with scoped permissions.
Step 1: Create a Dedicated Terraform Service Account
gcloud iam service-accounts create terraform-deployer \
--display-name="Terraform Infrastructure Deployer"
This service account exists only to provision infrastructure.
Step 2: Grant Required IAM Roles (Nothing More)
I avoided "Owner" roles. Instead, I granted only what was needed: container.admin, compute.admin, storage.admin and iam.serviceAccountUser.
export PROJECT_ID="sre-portfolio-platform"
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:terraform-deployer@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/container.admin"
# ... (Repeat for compute.admin, iam.serviceAccountUser, storage.admin)
Why these roles?
| Role | Purpose |
| --- | --- |
| container.admin | Create and manage GKE clusters |
| compute.admin | Manage VPCs, subnets, disks, and VM resources |
| iam.serviceAccountUser | Allow Terraform to attach service accounts |
| storage.admin | Required for GKE-managed storage and state operations |
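The same four bindings can also be written declaratively. This is an alternative to the gcloud commands, not what I ran for this cluster, and it would have to be applied once by a bootstrap identity that is allowed to change project IAM, since the deployer account cannot grant its own roles. The local and resource names are illustrative.

```hcl
locals {
  # The four roles from the table above
  terraform_deployer_roles = toset([
    "roles/container.admin",
    "roles/compute.admin",
    "roles/iam.serviceAccountUser",
    "roles/storage.admin",
  ])
}

# One additive binding per role; existing project bindings are left untouched
resource "google_project_iam_member" "terraform_deployer" {
  for_each = local.terraform_deployer_roles

  project = var.project_id
  role    = each.value
  member  = "serviceAccount:terraform-deployer@${var.project_id}.iam.gserviceaccount.com"
}
```

Keeping the role list in one place makes it easy to review what the deployer can do and to remove a role later.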
Step 3: Secure the Key
I generated a key for Terraform but ensured it never touches Git.
gcloud iam service-accounts keys create tf-key.json \
  --iam-account="terraform-deployer@${PROJECT_ID}.iam.gserviceaccount.com"
echo "tf-key.json" >> .gitignore
This keeps Terraform authentication explicit, auditable, and revocable.
Step 4: Configure Terraform to Use the Service Account
providers.tf:
provider "google" {
  project     = var.project_id
  region      = var.region
  credentials = file("tf-key.json")
}
This ensures that Terraform never uses application-default credentials and that all infrastructure actions are tied to a machine identity.
Cluster Access Validation
After provisioning, I validated access using:
gcloud container clusters get-credentials sre-portfolio-cluster \
--zone us-central1-a \
--project sre-portfolio-platform
This writes the cluster credentials to the kubeconfig file at $HOME/.kube/config.
This confirmed:
Cluster creation succeeded
IAM bindings were correct
The control plane was reachable
Nodes joined successfully
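The outputs.tf file from the project structure can surface the values that the access command needs, so nobody has to retype them. A minimal sketch, assuming these output names (they are illustrative and not taken from the repo):

```hcl
# outputs.tf: values that are handy after terraform apply
output "cluster_name" {
  description = "Name of the GKE cluster"
  value       = google_container_cluster.primary.name
}

output "cluster_location" {
  description = "Zone the cluster runs in"
  value       = google_container_cluster.primary.location
}

output "get_credentials_command" {
  description = "Ready-to-paste command for configuring kubectl"
  value       = "gcloud container clusters get-credentials ${google_container_cluster.primary.name} --zone ${google_container_cluster.primary.location} --project ${var.project_id}"
}
```

Running terraform output get_credentials_command after an apply then prints the command with the real cluster name, zone, and project filled in.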
Deployment and Validation
Apply the infrastructure:
terraform init
terraform apply
Configure local access:
gcloud container clusters get-credentials sre-portfolio-cluster \
--zone us-central1-a
Verify Nodes:
kubectl get nodes -o wide
You should see both nodes listed with a Ready status, confirming that the Spot nodes registered successfully and joined the cluster.
Key Takeaways
Cost constraints expose architectural assumptions
GKE defaults are not production-safe
Smaller disks and zonal clusters matter under quotas
Spot VMs are viable for non-critical platforms
Infrastructure design is about trade-offs, not checklists
What’s Next
In Part 2, I’ll bootstrap GitOps from scratch by installing:
ArgoCD (The GitOps Engine)
External Secrets (The Secret Manager)
Gateway API (The Modern Ingress)
Stay Tuned!




