The LGTM Stack: From Blind Containers to Full Visibility

Stop running containers blindly. A step-by-step runbook for deploying the LGTM observability stack (Loki, Grafana, Tempo, Prometheus) via GitOps.
In Part 1, we laid the foundation: a cost-optimized GKE cluster with modern networking using Gateway API. In Part 2, we installed the engine: ArgoCD, GitOps, and a secure bootstrap flow. Now, in Part 3, we turn on the lights.
We are moving from "it works" to "I can see how it works."
Why Observability Is Not Optional
A Kubernetes cluster without observability is a liability.
Pods crash - you don’t know why
Latency spikes - you can’t pinpoint where
Errors happen - you don’t know who caused them
Logs alone aren’t enough. Metrics alone lie without context. Tracing alone is useless without correlation.
This is why we deploy the LGTM stack - a deliberately decoupled, production-proven observability model.
The Stack (LGTM)
| Signal | Tool | Purpose |
| --- | --- | --- |
| Logs | Loki | Centralized, label-based logging |
| Graphs | Grafana | Visualization |
| Traces | Tempo | Distributed tracing backend |
| Metrics | Prometheus | Time-series metrics & alerting |
To generate real data, we deploy the OpenTelemetry Astronomy Shop - a microservices demo with enough complexity to surface real SRE problems.
1. GitOps Strategy: "App-of-Apps"
This is not a “helm install” tutorial. Everything is deployed via ArgoCD, following a two-level App-of-Apps pattern.
The goal is simple:
Separate orchestration, platform concerns, and business workloads - explicitly.
The Repository Structure: We reorganized our repo to support the "App-of-Apps" pattern. A single root application manages the entire cluster state.
kubernetes/
├── bootstrap/                  # The Control Tower
│   ├── root-app.yaml           # The parent that manages the children
│   ├── apps.yaml               # Child app for business workloads
│   └── observability.yaml      # Child app for platform tools
├── platform/                   # Infrastructure Tools
│   └── observability/          # Wrapper Chart.yaml for LGTM Stack
└── apps/                       # Business Logic
    └── otel-demo/              # The Astronomy Shop Manifests
This structure lets us answer, instantly:
What bootstraps the cluster?
What is platform-owned?
What is application-owned?
GitOps Topology: The Real App-of-Apps Model
At the top sits a single root application. Its only responsibility is to create other ArgoCD applications.
This is the only kubectl apply command you run by hand. It kickstarts the App-of-Apps:
kubectl apply -f kubernetes/bootstrap/root-app.yaml
Root Application (The Control Tower)
kubernetes/bootstrap/root-app.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/anantvaid/otel-platform-infra.git
    targetRevision: HEAD
    path: kubernetes/bootstrap # Watches the bootstrap folder
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
The root app itself does not deploy workloads. Instead, it watches the bootstrap/ directory and spawns child applications automatically.
Child 1: Observability Platform
kubernetes/bootstrap/observability.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: observability-stack
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/anantvaid/otel-platform-infra.git
    targetRevision: HEAD
    # Directory for observability apps
    path: kubernetes/platform/observability
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - ServerSideApply=true
      - CreateNamespace=true
This application owns the observability stack. It renders the wrapper Helm chart stored under kubernetes/platform/observability and deploys it into the monitoring namespace.
Child 2: Business Workloads
kubernetes/bootstrap/apps.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: apps-parent
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/anantvaid/otel-platform-infra.git
    targetRevision: HEAD
    # Directory for Otel demo apps
    path: kubernetes/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
This application owns:
OpenTelemetry Astronomy Shop
App related Gateway routes
Health checks
Future services
Observability upgrades never block app rollouts - and app failures never destabilize the platform.
Why This Matters (SRE Perspective)
This structure gives us:
Clear ownership boundaries
Independent blast radii
Safer rollbacks
Predictable GitOps behavior
2. Platform Observability: The LGTM Wrapper Chart
Rather than deploying each tool separately, we define a wrapper Helm chart in platform/observability.
Why?
One ArgoCD application
One lifecycle
One rollback surface
This allows us to manage all three charts (kube-prometheus-stack, which bundles Prometheus and Grafana, plus Loki and Tempo) as a single ArgoCD application.
kubernetes/platform/observability/Chart.yaml:
apiVersion: v2
name: observability-stack
type: application
version: 1.0.0
dependencies:
  - name: kube-prometheus-stack
    version: 80.10.0
    repository: https://prometheus-community.github.io/helm-charts
  - name: loki
    version: 6.49.0
    repository: https://grafana.github.io/helm-charts
  - name: tempo
    version: 1.24.1
    repository: https://grafana.github.io/helm-charts
This mirrors how real platform teams manage observability - as infrastructure, not as a sidecar.
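With a wrapper chart, settings for each dependency live in the wrapper's values.yaml, nested under a key that matches the dependency name in Chart.yaml. Here is a minimal sketch of that layout. It assumes a values.yaml sits next to Chart.yaml (this post does not show that file) and that you want Grafana to start with the Tempo data source already registered. The uid is a hypothetical label; the URL is the in-cluster Tempo address used elsewhere in this post.

```yaml
# kubernetes/platform/observability/values.yaml
# Assumed file: not shown in the original repository listing.

# Top-level keys must match the dependency names declared in Chart.yaml.
kube-prometheus-stack:
  grafana:
    # Provision Tempo as a data source at startup instead of adding it by hand.
    additionalDataSources:
      - name: Tempo
        type: tempo
        uid: tempo        # hypothetical uid; any stable value works
        access: proxy
        url: http://observability-stack-tempo.monitoring.svc.cluster.local:3200
```

Keeping data sources in Git means a Grafana reinstall comes back with the same wiring, which fits the "one lifecycle" goal of the wrapper chart.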
Secure Access: Grafana via Gateway API
Port-forwarding is not observability. Grafana is exposed using the Gateway API, maintaining parity with how production traffic flows.
kubernetes/platform/observability/templates/grafana-route.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana-route
  namespace: monitoring
spec:
  parentRefs:
    - name: external-gateway
      namespace: default
  hostnames:
    - "grafana.techtalkswithanant.online"
  rules:
    - backendRefs:
        - name: observability-stack-grafana
          port: 80
kubernetes/platform/observability/templates/grafana-health-check.yaml:
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: grafana-health-check
  namespace: monitoring
spec:
  default:
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    logConfig:
      enabled: true
    config:
      type: TCP
      tcpHealthCheck:
        port: 3000
  targetRef:
    group: ""
    kind: Service
    name: observability-stack-grafana
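
The TCP check above only proves that port 3000 accepts connections. If you would rather have the load balancer confirm that Grafana is actually answering requests, an HTTP check against Grafana's /api/health endpoint is one option. This is a sketch of that variant, not the policy this series deploys; the thresholds are simply copied from the TCP version.

```yaml
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: grafana-health-check
  namespace: monitoring
spec:
  default:
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    config:
      type: HTTP
      httpHealthCheck:
        port: 3000                  # Grafana container port, same as the TCP check
        requestPath: /api/health    # Grafana's built-in health endpoint
  targetRef:
    group: ""
    kind: Service
    name: observability-stack-grafana
```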

3. The "Subject": OpenTelemetry Astronomy Shop
Observability without traffic is pointless.
We deploy the OpenTelemetry demo - but with an important optimization.
Disable Bundled Components:
kubernetes/apps/otel-demo.yaml:
source:
  chart: opentelemetry-demo
  repoURL: https://open-telemetry.github.io/opentelemetry-helm-charts
  helm:
    values: |
      # Disable bundled components to save CPU/RAM
      prometheus:
        enabled: false
      grafana:
        enabled: false
      jaeger:
        enabled: false
      # Point the collector to OUR centralized Tempo
      opentelemetry-collector:
        config:
          exporters:
            otlp:
              endpoint: "observability-stack-tempo.monitoring.svc.cluster.local:4317"
              tls:
                insecure: true
Why?
Save CPU & RAM
Avoid duplicate data
Force everything through our stack
Point Traces to Central Tempo
endpoint: "observability-stack-tempo.monitoring.svc.cluster.local:4317"
Now logs, metrics, and traces flow only through the platform layer.
Secure Access: OpenTelemetry Shop app via Gateway API
To access the application using a user-friendly domain name, let’s create a gateway route for the application service.
kubernetes/apps/shop-route.yaml:
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: shop-route
  namespace: otel-demo
spec:
  parentRefs:
    - name: external-gateway
      namespace: default
  hostnames:
    - "shop.techtalkswithanant.online"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: frontend-proxy
          port: 8080
In GKE’s Gateway API model, simply exposing a Service is not enough. The Google load balancer needs an explicit definition of what healthy traffic looks like.
So instead of relying on implicit behavior, I defined it explicitly using a HealthCheckPolicy.
kubernetes/apps/shop-health-check.yaml:
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: shop-health-check
  namespace: otel-demo
spec:
  default:
    checkIntervalSec: 10
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    logConfig:
      enabled: true
    config:
      type: TCP
      tcpHealthCheck:
        port: 8080
  targetRef:
    group: ""
    kind: Service
    name: frontend-proxy

Crucial Infrastructure Tweak:
HTTPS is Not Automatic
One final catch: Simply creating an HTTPRoute does not automatically enable HTTPS. By default, the Gateway we built in Phase 2 was only listening for argocd. For shop and grafana to work securely, we have to update the gateway.yaml to include listeners for these new subdomains.
Without this update, the Google Load Balancer wouldn't know which SSL certificate to present for the new domains, resulting in insecure (HTTP) connections or certificate errors.
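The listener update described above might look like the following. Treat it as a sketch under assumptions: the Gateway name and namespace come from the parentRefs used in this post, but the gatewayClassName, the listener names, and the Secret names are hypothetical. The real gateway.yaml from the earlier phase may attach certificates differently (for example, through Google-managed certificates).

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-gateway
  namespace: default
spec:
  gatewayClassName: gke-l7-global-external-managed   # assumption: use the class chosen in Part 1
  listeners:
    # ...keep the existing argocd listener here unchanged...
    - name: https-shop                               # hypothetical listener name
      protocol: HTTPS
      port: 443
      hostname: "shop.techtalkswithanant.online"
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: shop-tls                           # hypothetical Secret holding the certificate
      allowedRoutes:
        namespaces:
          from: All                                  # assumption: routes live in otel-demo, not default
    - name: https-grafana                            # hypothetical listener name
      protocol: HTTPS
      port: 443
      hostname: "grafana.techtalkswithanant.online"
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: grafana-tls                        # hypothetical Secret holding the certificate
      allowedRoutes:
        namespaces:
          from: All                                  # assumption: routes live in monitoring, not default
```

Each listener pairs one hostname with one certificate, which is how the load balancer knows what to present for each subdomain.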
4. SRE War Stories: The "Gotchas"
This wasn't a "happy path" tutorial. We hit real-world errors. Here is how we fixed them.
The "Ghost Password" (State Drift)
Issue: The Grafana admin password in the Kubernetes Secret didn't match the internal SQLite database because of a Helm chart reinstall. I was locked out of my own dashboard.
Fix: I bypassed the UI and reset the password directly inside the running container using the CLI.
kubectl exec -it -n monitoring deployment/observability-stack-grafana -c grafana \
  -- grafana cli admin reset-admin-password <new-password>

The "Silent" Traces
Issue: The Astronomy Shop was running, but Grafana showed "No Data" for traces.
Root Cause: The demo app defaults to sending traces to localhost. It didn't know where our centralized Tempo service lived.
Fix: We explicitly overrode the OTLP endpoint in the Helm values to point to the correct internal DNS:
endpoint: "observability-stack-tempo.monitoring.svc.cluster.local:4317"
Filtering the Noise
Issue: The load-generator robot was creating thousands of traces per minute, burying our manual tests.
Fix: We used TraceQL in Grafana to filter the noise and find only "human" traffic.
Query:
{ resource.service.name = "frontend" && resource.service.name != "load-generator" }
5. The Grand Tour: Verifying the Stack
Now that the "App of Apps" is synced and the Gateway is configured, let's take a tour of our new platform.
A. The Control Tower (ArgoCD)
First, verify the GitOps hierarchy. Navigate to your ArgoCD UI (https://argocd.techtalkswithanant.online). Instead of a flat list of random applications, you should see a clear tree structure:
root-app: The parent controller.
observability-stack: Managing Prometheus, Grafana, Loki, and Tempo.
apps-parent: Managing the OpenTelemetry Demo.

B. Generating Traffic (The Shop)
Open a new tab and go to https://shop.techtalkswithanant.online. This isn't just a static page; it's a microservices-backed e-commerce store.
Action: Add a "Telescope" to your cart.
Action: Proceed to checkout.
Action: Refresh the page a few times.
Every click you just made sent a request through the Frontend Envoy Proxy, which propagated W3C Trace Context headers to the Checkout Service, which called the Payment Service.


C. Visualizing the Traces (Grafana)
Now, let's see those clicks in real-time.
Navigate to https://grafana.techtalkswithanant.online.
Log in (the default user is usually admin).
Go to the Explore tab on the left sidebar.
Select Tempo as your data source. (If it is not listed, add the Tempo data source manually with the URL http://observability-stack-tempo.monitoring.svc.cluster.local:3200)
Run the query we refined earlier to filter out the load-generator bots:
{ resource.service.name = "frontend" && resource.service.name != "load-generator" }
Click on one of the dots in the scatter plot.
The Payoff: You should see a full Waterfall Trace. You can visually see the request entering the frontend, hitting the checkout service, waiting for the currency converter, and finally persisting to the database.
This is the "Aha!" moment. We aren't guessing why the checkout is slow; we can see exactly which microservice spans are taking the most time.

6. The Final Outcome
We now have full visibility into our cluster.
Shop URL: https://shop.techtalkswithanant.online (Generates Traffic)
Grafana URL: https://grafana.techtalkswithanant.online (Visualizes Traffic)
We can now trace a single user click from the Frontend load balancer, through the Checkout service, down to the Payment service.
What's Next? (Phase 4)
We have the Platform (GKE + Gateway), the Engine (ArgoCD), and the Eyes (Observability).
Phase 4 is "The Factory." Currently, we push YAML to Git manually. In the next phase, we will build a CI/CD pipeline. We will move from pushing manifests to pushing Go code, with GitHub Actions building the Docker image, pushing it to Artifact Registry, and updating the Helm chart automatically.




