Dynamic PR environments for services on EKS

TL;DR

At peak load, 20+ open pull requests map to 20+ ephemeral namespaces on a single shared dev EKS cluster — each running 5–6 services — all driven from one GitOps repo.
A 3-layer Helm value cascade (base → environment → service) cut roughly 45% of the YAML lines that used to live as copy-pasted per-service files.
The hard parts aren’t “make it work” — they’re cleanup (orphan PVs, stuck namespaces, retained EBS volumes) and the corners of multi-source ArgoCD Applications that the official docs glide past.

The problem

For a team shipping 30–50 PRs a week, a cluster-per-PR model means hundreds of cluster lifecycles per month and a sustained AWS bill that’s impossible to attribute back to specific work. Worse, “cluster-level” failures (a bad ALB controller version, a sealed-secrets key mismatch) turn into debugging repeated for every PR instead of a single investigation.

Per-namespace previews fit the actual shape of what reviewers need: same cluster, same ingress, same DNS zone. Each PR gets its own services in a sealed-off namespace addressable by a predictable URL. The interesting work is making namespace creation declarative and cleanup reliable.

Architecture

Three things drive the system: (1) a GitOps repo whose directory tree mirrors the deploy model, (2) ArgoCD ApplicationSets that turn that tree into N independently-syncing Applications, and (3) Helm charts whose values resolve in a deterministic 3-layer cascade. PRs slot in by creating and deleting directories.

The handoff worth flagging: the application CI never deploys. It commits the new image tag to the GitOps repo and lets ArgoCD do the work. Deploy authority lives in one place; rollback is git revert.

The 3-layer Helm value cascade

helm template --values a.yaml --values b.yaml --values c.yaml merges files in order, later wins. The directory contract enforces three layers:

Figure: Three-layer Helm value cascade flowing left to right: base-values.yaml (shared across all environments) → environment common-values.yaml (per-environment domains, CORS, auth IDs) → service-specific values.yaml (image tag, ports, env vars, secret refs) → rendered manifests with later-wins semantics.

environments/base-values.yaml — values shared across all environments: ECR registry, ingress class, default resource limits, common env vars.
environments/<env>/common-values.yaml — per-environment: domains, CORS origins, edge CDN, auth IDs.
environments/<env>/<service>.yaml — per-service: image tag, ports, env vars, secret refs.

Concretely, the three layers for BACKEND-SERVICE in develop:

# 1. environments/base-values.yaml — shared across ALL environments
global:
  registry: <account-id>.dkr.ecr.<region>.amazonaws.com
  ingressClass: alb
resources:
  requests: { cpu: 100m, memory: 256Mi }
  limits:   { cpu: 500m, memory: 512Mi }

# 2. environments/develop/common-values.yaml — this environment
global:
  domain: dev.example.com
  authProvider: auth0-dev
ingress:
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing

# 3. environments/develop/BACKEND-SERVICE.yaml — this service
image:
  tag: pr-1234-abc123
service:
  port: 8080
sealedSecrets:
  enabled: true
  name: backend-sealed
  keys: [DATABASE_URL, JWT_SECRET]

Before the cascade, each (environment × service) had its own values file duplicating the registry, ingress annotations, and resource defaults. After the cascade those duplicates collapse into the base and per-environment files, leaving the service files thin. Net effect on this codebase was roughly 45% fewer YAML lines — measured across values files and the ApplicationSet inventory that was also consolidated.
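
To make the later-wins merge concrete, here is a minimal Python sketch of how the three layers above resolve. It is an approximation, not Helm's actual implementation: deep_merge and resolve are illustrative names, and the sketch models only two rules (nested maps merge key by key; scalars and lists from the later file replace the earlier value wholesale).

```python
from functools import reduce


def deep_merge(base, override):
    """Return a new dict in which `override` wins, merging nested maps key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale by the later layer.
            merged[key] = value
    return merged


def resolve(*layers):
    """Fold the layers left to right, so the last file has the final say."""
    return reduce(deep_merge, layers, {})


# The three layers from the example above, written as plain dicts.
base = {
    "global": {
        "registry": "<account-id>.dkr.ecr.<region>.amazonaws.com",
        "ingressClass": "alb",
    },
    "resources": {
        "requests": {"cpu": "100m", "memory": "256Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    },
}
environment = {
    "global": {"domain": "dev.example.com", "authProvider": "auth0-dev"},
}
service = {
    "image": {"tag": "pr-1234-abc123"},
    "service": {"port": 8080},
}

values = resolve(base, environment, service)
print(sorted(values["global"]))
```

The point to notice is that the global map ends up with keys from both the base and the environment file, which is what lets the service file stay thin.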

ApplicationSet patterns: list vs git-directory generator

ApplicationSets generate N Applications from a generator. Two generator patterns cover everything in this setup:

List generator — explicit (service, environment) pairs, one block per deploy target. Used for long-lived environments where every new entry should be a deliberate PR.

Git-directory generator — a glob like environments/pr-* becomes the list. The filesystem is the inventory.

Figure: Two ApplicationSet generator patterns side by side. Left: list generator for long-lived environments, where one dev-cluster ApplicationSet fans out to per-service Applications. Right: git-directory generator for PR environments, where one ApplicationSet fans out to one Application per open PR directory.

The git-directory generator template, in skeleton:

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
spec:
  goTemplate: true
  generators:
  - git:
      repoURL: https://github.com/.../k8s-deployments.git
      revision: HEAD
      directories:
      - path: environments/pr-*
  template:
    metadata:
      name: '{{.path.basename}}-backend'
    spec:
      destination:
        namespace: '{{.path.basename}}'
Subtle: you need goTemplate: true for .path.basename to resolve. Without it the ApplicationSet controller uses its older, non-Go template syntax ({{path.basename}}, no leading dot), so the names never substitute and no Applications get generated, quietly and with no error.
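
What the git-directory generator computes can be approximated in a few lines. The sketch below is Python, not ArgoCD code: pr_applications is a hypothetical helper, the directory list stands in for the repo tree at HEAD, and the "-backend" suffix mirrors the template above. It assumes the glob only matches directories directly under environments/.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def pr_applications(directories, pattern="environments/pr-*", suffix="backend"):
    """Map matching repo directories to (name, namespace) pairs, one per PR."""
    wanted = PurePosixPath(pattern)
    apps = []
    for directory in sorted(directories):
        path = PurePosixPath(directory)
        # Only direct children of the pattern's parent count; nested dirs do not.
        if path.parent == wanted.parent and fnmatchcase(path.name, wanted.name):
            apps.append(
                {
                    "name": f"{path.name}-{suffix}",  # '{{.path.basename}}-backend'
                    "namespace": path.name,           # '{{.path.basename}}'
                    "path": str(path),
                }
            )
    return apps


# Hypothetical repo tree: two open PRs plus a long-lived environment.
tree = [
    "environments/develop",
    "environments/pr-1234",
    "environments/pr-1301",
    "environments/pr-1234/secrets",
]
for app in pr_applications(tree):
    print(app["name"], "->", app["namespace"])
```

Deleting environments/pr-1234 from the tree removes its entry from the result, which is the property the whole PR lifecycle leans on: the filesystem is the inventory.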

Multi-source Applications and the `$values` ref pattern

A single ArgoCD Application can combine multiple sources. I use three: (1) the Helm chart, (2) a $values ref that supplies layered value files without rendering itself, (3) a raw YAML directory for non-Helm manifests (typically SealedSecrets).

sources:
- repoURL: https://github.com/.../k8s-deployments.git
  path: charts/BACKEND-SERVICE
  helm:
    valueFiles:
    - $values/environments/base-values.yaml
    - $values/environments/develop/common-values.yaml
    - $values/environments/develop/BACKEND-SERVICE.yaml
- repoURL: https://github.com/.../k8s-deployments.git
  ref: values
- repoURL: https://github.com/.../k8s-deployments.git
  path: environments/develop
  directory:
    include: BACKEND-SERVICE-secrets.yaml

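Two failure modes that are easy to miss. In my setup the $values ref had to point at the same repo as the chart source; a cross-repo $values failed silently, even though the ArgoCD docs describe combining a chart from one repo with value files from another, so treat this as dependent on version and configuration. And the ref: source must not have a path:, or ArgoCD will try to render it as a manifest source and produce confusing duplicate-resource errors.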
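
Both rules are cheap to check before a sync. The following is a rough Python sketch of a pre-commit style lint over an already-parsed sources list; lint_sources is a hypothetical helper, not an ArgoCD feature, and the example.invalid URLs are placeholders. It encodes the two rules as stated above, so relax the same-repo check if cross-repo refs work in your version.

```python
def lint_sources(sources):
    """Return human-readable problems for the two pitfalls described above."""
    problems = []
    refs = {s["ref"]: s for s in sources if "ref" in s}

    # Rule 1: a ref-only source should not also carry a path.
    for name, source in refs.items():
        if "path" in source:
            problems.append(f"ref '{name}' also sets path '{source['path']}'")

    # Rule 2: every $ref used in valueFiles must exist and share the chart's repo.
    for source in sources:
        for value_file in source.get("helm", {}).get("valueFiles", []):
            if not value_file.startswith("$"):
                continue
            ref_name = value_file[1:].split("/", 1)[0]
            ref = refs.get(ref_name)
            if ref is None:
                problems.append(f"{value_file} uses undefined ref '{ref_name}'")
            elif ref.get("repoURL") != source.get("repoURL"):
                problems.append(f"{value_file} crosses repos via ref '{ref_name}'")
    return problems


REPO = "https://example.invalid/k8s-deployments.git"  # placeholder URL
sources = [
    {
        "repoURL": REPO,
        "path": "charts/BACKEND-SERVICE",
        "helm": {"valueFiles": ["$values/environments/base-values.yaml"]},
    },
    {"repoURL": REPO, "ref": "values"},
]
print(lint_sources(sources))
```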

Conditional sealed-secret injection in the chart

The cascade carries a sealedSecrets block per service (you saw it in layer 3 above). The chart template uses that block to conditionally inject secretKeyRef env vars only for services that opt in — no per-service deployment template:

# charts/BACKEND-SERVICE/templates/deployment.yaml (excerpt)
env:
{{- range $key, $value := .Values.env }}
- name: {{ $key }}
  value: {{ $value | quote }}
{{- end }}
{{- /* Assumes the chart's own values.yaml defaults sealedSecrets.enabled to false,
       so services that omit the block do not hit a nil-pointer error here. */}}
{{- if .Values.sealedSecrets.enabled }}
{{- range $key := .Values.sealedSecrets.keys }}
- name: {{ $key }}
  valueFrom:
    secretKeyRef:
      name: {{ $.Values.sealedSecrets.name }}
      key: {{ $key }}
{{- end }}
{{- end }}

Two design choices baked in here. Plaintext env vars and sealed-secret-backed env vars use the same Helm input shape — adding a secret is a values-file edit, not a chart edit. And services with no secrets carry zero secret-related YAML; the if short-circuits the whole block.
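
To see what the template yields for an opt-in versus an opt-out service, here is a small Python emulation of the env block. It is a sketch for reasoning about the output, not a replacement for helm template: render_env is a hypothetical function, and the LOG_LEVEL variable is an invented example value.

```python
def render_env(values):
    """Build the container env list the chart template would produce."""
    entries = []

    # Plaintext vars: Helm's range over a map visits keys in sorted order.
    plain = values.get("env") or {}
    for key in sorted(plain):
        entries.append({"name": key, "value": str(plain[key])})

    # Secret-backed vars: emitted only when the service opts in.
    secrets = values.get("sealedSecrets") or {}
    if secrets.get("enabled"):
        for key in secrets.get("keys", []):
            entries.append(
                {
                    "name": key,
                    "valueFrom": {
                        "secretKeyRef": {"name": secrets["name"], "key": key}
                    },
                }
            )
    return entries


backend = {
    "env": {"LOG_LEVEL": "debug"},  # invented example variable
    "sealedSecrets": {
        "enabled": True,
        "name": "backend-sealed",
        "keys": ["DATABASE_URL", "JWT_SECRET"],
    },
}
no_secrets = {"env": {"LOG_LEVEL": "debug"}}

print(len(render_env(backend)), len(render_env(no_secrets)))
```

The opt-out service produces only its plaintext entries, with no secretKeyRef anywhere in the result.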

Cleanup: the namespace-reaper

ArgoCD’s prune: true deletes Applications and their direct children when the source directory disappears. What it doesn’t delete:

PVCs/PVs with reclaimPolicy: Retain (anything stateful — Postgres, Mongo, Redis with AOF).
EBS volumes that backed those PVs once the cluster has forgotten about them.
Namespaces stuck in Terminating because a finalizer hung.

I wrote a small Go controller — namespace-reaper — that runs on the dev cluster, lists namespaces matching app-pr-*, and for each:

Checks whether the corresponding directory still exists in the GitOps repo (GitHub API).
If the directory is gone and the namespace has been idle past a TTL, force-deletes PVCs, removes finalizers from stuck namespaces, and tags orphan EBS volumes for the next daily cleanup pass.

The complexity isn’t in the controller — it’s in the decision tree of “is this safe to delete now?” The rule I landed on: never delete unless (a) the GitOps directory is gone AND (b) the namespace has had no reconcile activity for at least 24h. Both conditions, not either.
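
The rule itself is small enough to state as code. The controller is written in Go; the snippet below is only a Python sketch of the decision, with safe_to_delete, its parameters, and IDLE_TTL as illustrative names. It assumes the caller has already looked up whether the GitOps directory exists and when the namespace last saw reconcile activity.

```python
from datetime import datetime, timedelta, timezone

IDLE_TTL = timedelta(hours=24)  # "no reconcile activity for at least 24h"


def safe_to_delete(directory_exists, last_reconcile, now, ttl=IDLE_TTL):
    """Both conditions must hold: directory gone AND namespace idle past the TTL."""
    if directory_exists:
        return False  # the PR is still open from GitOps' point of view
    return now - last_reconcile >= ttl


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # arbitrary example time
idle_30h = now - timedelta(hours=30)
idle_2h = now - timedelta(hours=2)

print(safe_to_delete(False, idle_30h, now))  # directory gone, long idle
print(safe_to_delete(False, idle_2h, now))   # directory gone, recently active
print(safe_to_delete(True, idle_30h, now))   # directory still present
```

Keeping the check a pure function of its inputs makes the "both conditions, not either" rule easy to unit test separately from the GitHub and Kubernetes calls that feed it.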

Closing

This pattern fits when a fleet of services share infrastructure assumptions, preview environments are part of the review loop, and the cost of standing up a cluster per PR clearly outweighs the cost of sharing one. It’s not the right shape for monolithic single-service apps (overkill) or for environments that need cluster-level isolation — multi-tenancy, GPU pools, strict security boundaries.

The thing worth investing in earliest: convention-over-config on the directory layout. Half the value of this system is that a new service is a PR adding three files in known locations — and that property only emerges if the conventions are obvious to anyone joining the team.