Hi, my name is

Anand Kumar.

I make AI infrastructure boring to operate.

I lead infrastructure teams, from internet-scale data orgs to founding-engineer roles at AI-native startups. Hands-on with Kubernetes, AWS, and the operational practices that turn experiments into production systems.

About

Focus: AI infrastructure — Kubernetes at scale, GPU operations, inference platforms, and the operational practices that ship experiments to production at speed.

I've spent 18 years working at the seams of distributed systems — multi-tenant Splunk + Hadoop + ML at Zynga, AdTech infra across LATAM at Yahoo, founding the stack at a Mark Pincus AI-gaming startup. The shape keeps changing; the job stays the same: make complex systems boring to run.

What I care about: cost-aware architecture, blameless on-call, small teams that ship. What I don't: tool tribalism, ceremony over outcomes, premature standardization.

Based in San Jose.

Experience

  1. Oct 2022 — Present
    San Jose, CA

    Engineering Manager, Infrastructure · ErthAI

    • AWS
    • EKS
    • Lambda
    • API Gateway
    • Terraform
    • ArgoCD
    • GitOps
    • BetterStack

    • Founding infrastructure hire at a Mark Pincus AI-gaming company, where I recruited and built the infrastructure team.
    • Shipped a multi-platform Discord poker game (peak 30K DAU).
    • Open-sourced the Stem Studio engine, an AI-assisted, no-code web game engine.
    • Owned the AWS footprint end-to-end with Terraform + Terragrunt: EKS, RDS, Route 53, Lambda, Amplify previews, ALB Ingress.
    • Stood up the SRE practice: defined SLIs/SLOs per service, instrumented BetterStack for synthetic uptime monitoring and a public status page, ran a follow-the-sun on-call rotation with time-based paging escalation, and led blameless post-mortems for Sev1 incidents.
    • Re-architected compute, storage, and ingress (including aggressive cross-AZ traffic reduction) to deliver a 70% reduction in monthly AWS spend.
  2. Jun 2016 — Oct 2022
    Sunnyvale, CA

    Sr. Principal DevOps Engineer · Yahoo

    • Hadoop
    • Spark
    • Solr
    • Kafka
    • YARN

    • Built and led a team of 7–10 engineers (US + LATAM) inside Yahoo's AI & Data organization, operating the Orion data platform across multiple LATAM countries to support an AdTech business generating ~$150M in annual revenue.
    • Architected and operated a Hadoop ecosystem processing 100B+ records daily across 300+ nodes — Spark as the primary compute engine, Solr for search, plus YARN, Hive, Kafka, and HBase — powering workloads like Mobile Marketing Insights for audience segmentation.
    • Led the Red Hat–Yahoo integration, starting with 5G Mobile Edge Compute (MEC): deployed enterprise edge platforms with AI/ML capabilities to telco customers.
    • Owned platform reliability, capacity planning, and on-call practices across a distributed organization spanning Sunnyvale and multiple LATAM sites.
  3. Apr 2011 — Jun 2016
    San Francisco, CA

    Senior DevOps Engineer, Big Data · Zynga

    • Splunk
    • Hadoop
    • HBase
    • AWS

    • Architected and operated a multi-tenant Splunk platform spanning bare-metal and AWS, indexing 15+ TB/day from 50,000+ forwarders with high resilience.
    • Built and operated multiple HA Hadoop, HBase, and ML clusters totaling ~1 PB of storage, powering game telemetry, analytics, and ML across Zynga's portfolio.
    • Designed multi-tenant isolation that let multiple game studios share the platform while preserving data segregation and per-tenant performance.
  4. Jan 2008 — Apr 2011
    San Jose, CA

    DevOps Engineer · Dell SonicWALL

    • Nagios
    • Cacti
    • Virtualization

    • Managed the engineering datacenter and built and operated observability platforms (Nagios, Cacti) for the engineering organization.
    • Stood up virtualization infrastructure that supported product test environments across the engineering team.

Stack

Cloud & Containers

  • AWS
  • Kubernetes
  • EKS
  • ECS
  • Helm
  • Docker
  • Istio

IaC & GitOps

  • Terraform
  • Terragrunt
  • ArgoCD
  • Sealed Secrets

CI/CD

  • GitHub Actions
  • GitLab CI
  • Jenkins

Distributed Data

  • Hadoop
  • Spark
  • Kafka
  • Solr
  • YARN
  • HBase

Observability

  • Splunk
  • OpenTelemetry
  • Grafana / Loki
  • CloudWatch
  • BetterStack

AI Infrastructure

  • Together AI
  • Multi-tier routing

Languages

  • Python
  • Go
  • Shell

Open Source

Stem Studio Engine

Web game engine — AI-assisted, no-code

The open-source core of Stem Studio: a TypeScript/WebGPU game engine that lets creators ship complete games through natural-language prompts, without writing engine code.

  • TypeScript
  • WebGPU
  • Go
  • AI-assisted
buildwithstem.com

Independent Projects

PicCanvas

AI image-generation product, built solo

Production inference on Together AI with a multi-tier model-routing layer that dispatches requests across model sizes to balance cost, latency, and quality.

  • Together AI
  • AWS
  • Multi-tier routing
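
As a rough illustration of what a multi-tier routing layer can look like, here is a minimal Go sketch. It is not the PicCanvas implementation: the tier names, cost units, latencies, and quality scores are all hypothetical, and the policy shown (cheapest tier that meets a quality floor and a latency budget) is only one plausible way to balance cost, latency, and quality.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Tier is one model size class. Names and numbers below are hypothetical.
type Tier struct {
	Name        string
	CostPerCall float64       // relative cost units, not real pricing
	P95Latency  time.Duration // assumed typical latency
	Quality     int           // subjective score, higher is better
}

// Request carries the caller's constraints for a single generation.
type Request struct {
	MinQuality int
	MaxLatency time.Duration
}

// ErrNoTier is returned when no tier satisfies both constraints.
var ErrNoTier = errors.New("no tier satisfies the request")

// Route returns the cheapest tier that meets the quality floor and latency budget.
func Route(tiers []Tier, r Request) (Tier, error) {
	var best Tier
	found := false
	for _, t := range tiers {
		if t.Quality < r.MinQuality || t.P95Latency > r.MaxLatency {
			continue // tier is too weak or too slow for this request
		}
		if !found || t.CostPerCall < best.CostPerCall {
			best, found = t, true
		}
	}
	if !found {
		return Tier{}, ErrNoTier
	}
	return best, nil
}

// exampleTiers returns made-up tiers for illustration only.
func exampleTiers() []Tier {
	return []Tier{
		{Name: "small", CostPerCall: 1, P95Latency: 2 * time.Second, Quality: 4},
		{Name: "medium", CostPerCall: 3, P95Latency: 4 * time.Second, Quality: 7},
		{Name: "large", CostPerCall: 10, P95Latency: 9 * time.Second, Quality: 9},
	}
}

func main() {
	tier, err := Route(exampleTiers(), Request{MinQuality: 6, MaxLatency: 5 * time.Second})
	if err != nil {
		fmt.Println("routing failed:", err)
		return
	}
	fmt.Println("routed to:", tier.Name)
}
```

Returning an explicit error when nothing fits leaves the fallback decision (relax the latency budget, queue the request, or reject it) to the caller rather than hiding it inside the router.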

Beacon

A better UI for Google Analytics

Syncs GA4 data into DynamoDB and presents it through a clean dashboard — KPIs, charts, user profiles, and behavior analytics.

  • TypeScript
  • GA4
  • DynamoDB
  • AWS

Shipyard

Release management dashboard

Helps teams promote code through branch-based preview environments, with a single pane for tracking what's deployed where across release stages.

  • JavaScript
  • CI/CD
  • Release Mgmt

k8s-controller-namespace-reaper

Kubernetes controller (Go)

Automatically cleans up orphaned PR preview environments — reclaims cluster resources when feature branches die.

  • Go
  • Kubernetes
  • Controller
  • Automation
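
The core decision such a reaper has to make can be sketched in a few lines of Go. This is an illustrative sketch only, not the controller's actual code: the `Namespace` struct is a stand-in for the real Kubernetes object, the label key `preview.example.com/branch` is hypothetical, and a real controller would list namespaces and issue deletes through the Kubernetes API rather than work on in-memory slices.

```go
package main

import (
	"fmt"
	"sort"
)

// Namespace is a minimal stand-in for a Kubernetes namespace object.
type Namespace struct {
	Name   string
	Labels map[string]string
}

// branchLabel is a hypothetical label that marks a namespace as a PR preview
// and records which branch it was created for.
const branchLabel = "preview.example.com/branch"

// Orphaned returns the sorted names of preview namespaces whose branch
// is no longer in the set of live branches.
func Orphaned(namespaces []Namespace, liveBranches map[string]bool) []string {
	var out []string
	for _, ns := range namespaces {
		branch, isPreview := ns.Labels[branchLabel]
		if !isPreview {
			continue // never touch namespaces that are not labeled as previews
		}
		if !liveBranches[branch] {
			out = append(out, ns.Name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	namespaces := []Namespace{
		{Name: "kube-system"},
		{Name: "pr-101", Labels: map[string]string{branchLabel: "feature/login"}},
		{Name: "pr-102", Labels: map[string]string{branchLabel: "feature/old-ui"}},
	}
	live := map[string]bool{"feature/login": true}
	fmt.Println("would delete:", Orphaned(namespaces, live))
}
```

Requiring an explicit label before a namespace is even considered keeps system and long-lived namespaces out of the reaper's reach by construction.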

k8s-controller-firebase-sync

Kubernetes controller (Go)

Keeps Firebase authorized domains in sync with your Ingress hosts. Eliminates manual toggling between the Firebase console and the cluster.

  • Go
  • Kubernetes
  • Firebase
  • Controller
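
A sync like this usually reduces to a set difference between desired and current state. The Go sketch below shows only that reconciliation step, under stated assumptions: the hostnames are made up, the function name `Diff` is hypothetical, and whether the real controller removes stale domains (or protects unmanaged ones such as `localhost`) is not specified here. Reading Ingress hosts and updating Firebase would go through their respective APIs, which are omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// Diff compares Ingress hosts (desired) with Firebase authorized domains (current).
// It returns the domains to add and the domains to remove, each sorted.
func Diff(ingressHosts, authorized []string) (add, remove []string) {
	desired := make(map[string]bool, len(ingressHosts))
	for _, h := range ingressHosts {
		desired[h] = true
	}
	current := make(map[string]bool, len(authorized))
	for _, d := range authorized {
		current[d] = true
	}
	for h := range desired {
		if !current[h] {
			add = append(add, h) // host is served by the cluster but not yet authorized
		}
	}
	for d := range current {
		if !desired[d] {
			remove = append(remove, d) // domain is authorized but no Ingress serves it
		}
	}
	sort.Strings(add)
	sort.Strings(remove)
	return add, remove
}

func main() {
	add, remove := Diff(
		[]string{"app.example.com", "pr-42.example.com"},
		[]string{"app.example.com", "pr-7.example.com"},
	)
	fmt.Println("add:", add, "remove:", remove)
}
```

Sorting both results keeps the output deterministic, which makes the reconcile loop easier to log and to test.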

What's next?

Let's talk infrastructure.