GovTech

Orcrist: A GitOps Platform on GKE for Mission-Critical Workloads

Role: Technical Product Lead, Platform Engineering · 2025–2026

Inherited a platform that was thin on security, reliability, scalability, and observability. Rebuilt it around GitOps, policy-as-code, and zero-trust defaults.


Platform team: 5 Senior FTEs
Workload Identity + KMS: Zero-trust
FluxCD + OCI artifacts: GitOps-native

Team

Tungi Dang, Technical Product Lead, Platform Engineering

A platform held together by tribal knowledge

On paper, Orcrist already had a platform. In practice it had four problems at once. Secrets were in the wrong places. Clusters had no shared guardrails. Nobody had a clean picture of capacity. And every incident started with ten minutes of figuring out what was even running.

For workloads serving law enforcement, that wasn’t good enough. My job was to fix the foundation without freezing delivery for the application teams already shipping on top of it.

Security: stop trusting the cluster

We started with plaintext credentials in repos, drifted IAM, and Terraform PRs that nobody really reviewed. The fixes were unglamorous.

External Secrets Operator in front of GCP Secret Manager, so no application ever sees a long-lived credential.
Workload Identity and KMS end-to-end, killing the static service account keys that had been floating around.
Kyverno at admission for image provenance, required labels, and a hard no on privileged containers.
Checkov in Cloud Build and GitHub Actions, blocking non-compliant IaC changes before they could merge. Terraform, OpenTofu, and Pulumi all run through the same gates.
NetworkPolicies flipped to default-deny, with per-workload egress rules.
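
To make the admission rules concrete, here is a minimal sketch of the logic behind two of them (required labels, no privileged containers), written in Python for illustration only. The real controls were Kyverno policies, not this code; the label names and the `violations` function are assumptions made up for the example.

```python
from typing import Any

# Hypothetical label set; the actual required labels are not listed in this write-up.
REQUIRED_LABELS = {"app.kubernetes.io/name", "owner"}


def violations(pod: dict[str, Any]) -> list[str]:
    """Return human-readable policy violations for a Pod manifest (as a dict)."""
    found: list[str] = []

    # Rule 1: every required label must be present.
    labels = pod.get("metadata", {}).get("labels", {}) or {}
    missing = sorted(REQUIRED_LABELS - set(labels))
    if missing:
        found.append("missing required labels: " + ", ".join(missing))

    # Rule 2: no container (init or regular) may run privileged.
    spec = pod.get("spec", {}) or {}
    containers = list(spec.get("initContainers", [])) + list(spec.get("containers", []))
    for container in containers:
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            found.append(f"container {container.get('name', '?')} is privileged")

    return found


if __name__ == "__main__":
    pod = {
        "metadata": {"labels": {"owner": "team-a"}},
        "spec": {"containers": [{"name": "web", "securityContext": {"privileged": True}}]},
    }
    for problem in violations(pod):
        print(problem)
```

An admission controller would reject the request whenever the list is non-empty, which is what removes the dependency on a reviewer spotting the problem.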

Secrets exposure went down quickly. The bigger win was that none of the controls relied on someone remembering to apply them.

Reliability: Git is the only way in

The cluster and the Git history disagreed. Hotfixes sat in someone’s shell. Rolling back a release meant guessing.

FluxCD became the only path. Every change to a cluster (app code, config, policy, secret references) goes through a PR, ships as an OCI artifact, and lands via Kustomizations or Helm releases. Local, staging, and prod now use the same manifests promoted through Git, not three drifting variants.
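
The drift problem is easy to state in code. The sketch below compares what Git declares with what a cluster reports and names every resource that differs or exists on only one side. Flux does this reconciliation itself; the function names and the dict-of-manifests shape here are assumptions for illustration, not how Flux is implemented.

```python
import hashlib
import json


def fingerprint(manifest: dict) -> str:
    """Stable SHA-256 of a manifest, independent of key order."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted(desired: dict[str, dict], live: dict[str, dict]) -> list[str]:
    """Names of resources that differ between Git (desired) and the cluster (live)."""
    names = sorted(set(desired) | set(live))
    return [
        name
        for name in names
        if name not in desired                # applied by hand, never committed
        or name not in live                   # committed, never applied
        or fingerprint(desired[name]) != fingerprint(live[name])
    ]


if __name__ == "__main__":
    desired = {"deploy/web": {"replicas": 2}, "cm/app": {"k": "v"}}
    live = {"deploy/web": {"replicas": 3}, "cm/app": {"k": "v"}, "job/hotfix": {}}
    print(drifted(desired, live))
```

With Git as the only way in, the goal is for this list to stay empty without anyone having to check.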

For application teams, that translated to predictable deploys, real rollbacks, and an audit trail that holds up to both engineering and compliance review.

Scalability: paved roads, not bespoke onboarding

Every previous application team had been onboarded by hand. I reframed the platform as a set of capabilities with owners, SLOs, and adoption metrics, and gave teams a paved road: a template repo, a Kustomization, a Helm chart, a CI pipeline, and observability already wired in.

OCI-first artifacts let the same build promote across environments without a rebuild. Standards for Kustomizations and Helm releases meant onboarding stopped being a negotiation and became a checklist.
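
Promotion without a rebuild comes down to moving an immutable reference from one environment to the next. A minimal sketch, assuming each environment pins its artifact by digest; the `promote` function, the `pins` mapping, and the registry name are hypothetical.

```python
import re

# An OCI reference pinned by content digest, e.g. registry.example/app@sha256:<64 hex chars>.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def promote(pins: dict[str, str], source: str, target: str) -> dict[str, str]:
    """Return new pins where `target` runs exactly the artifact `source` runs."""
    ref = pins[source]
    if not DIGEST_REF.match(ref):
        # A tag such as ":latest" can change underneath us, so refuse to promote it.
        raise ValueError(f"{source} is pinned to a mutable reference: {ref}")
    return {**pins, target: ref}


if __name__ == "__main__":
    digest = "a" * 64  # placeholder digest for the example
    pins = {
        "staging": f"registry.example/app@sha256:{digest}",
        "prod": f"registry.example/app@sha256:{'b' * 64}",
    }
    print(promote(pins, "staging", "prod")["prod"])
```

Because the digest identifies the content, what reaches prod is byte-for-byte what was tested in staging.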

The data plane scales with the workloads. Kafka for streams, Spark for batch, Temporal for anything that has to survive a restart. The platform team is no longer the bottleneck on every new tenant.

Observability: one story per incident

The telemetry was already there. The problem was that no two teams used it the same way. Grafana, Prometheus, and Loki were live, but every team had reinvented its own dashboards and labels.

We standardised the conventions: metric names, log fields, trace context, alert routing. SLOs got defined per capability, with error budgets that actually shape release decisions. Runbooks and ADRs live next to the code, so the answer to “why did we do it this way” is in the repo, not in someone’s head.
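
For readers new to error budgets, the arithmetic is short. The sketch below assumes a request-based SLO; the 99.9% target, the request counts, and the 10% release threshold are made-up numbers for illustration, not the platform's actual figures.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left in the window (negative once overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


def can_release(remaining: float, threshold: float = 0.10) -> bool:
    """Hypothetical gate: hold risky releases when little budget is left."""
    return remaining > threshold


if __name__ == "__main__":
    # A 99.9% target over 1,000,000 requests allows about 1,000 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"budget remaining: {remaining:.0%}, release allowed: {can_release(remaining)}")
```

A gate like `can_release` is one way a budget can shape release decisions; the threshold is a policy choice per capability.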

Incident calls now start with the SLO, not with a tour of six dashboards.

Running it like a product

The team is five senior engineers across platform, infrastructure, and cloud, running Jira sprints. I prioritise the roadmap the way an application team would prioritise theirs: capability definitions, golden paths, deprecation timelines, migration plans. Infrastructure, foundation, and application teams get the same updates at the same cadence, so changes don’t arrive as surprises.

The point of writing runbooks, ADRs, and platform docs is the same as the point of the rest of the work: make the next person’s job easier than yours was.

What changed

Secrets exposure reduced through External Secrets Operator, Workload Identity, and KMS.
Policy and IaC checks enforced in CI, not left to reviewer discretion.
GitOps is the only way into a cluster, with reviewable and reversible changes.
Onboarding shortened from bespoke setup to a documented golden path.
No more drift between local, staging, and prod now that OCI artifacts promote unchanged.
Incidents start from the SLO, not from a hunt across dashboards.

Tags: GovTech, Platform Engineering, Cloud Platform, Developer Experience, Multi-tenant Platform

