
From Manual Scripts to CyberRange Platform: Designing a Production-Grade Lab System

How we replaced fragile VMware runbooks with a resilient cyber lab platform that hit 99.9% uptime and sub-five-minute provisioning.

systems-design · infrastructure · education

The Problem: Fragile VMware Scripts and Human Runbooks

I inherited a VMware-based lab environment held together by bash fragments, wiki pages, and tribal knowledge. Faculty requested classes via email. Teaching assistants ran scripts from their laptops. Students opened support tickets for broken VMs, DNS issues, or incorrect images. Incidents were invisible until classes started.

Failure meant entire cohorts missing lab time, faculty losing trust, and me babysitting jobs at 2 AM. We needed a real platform.

Design Goals and Constraints

We defined clear objectives before writing a line of code:

  • Reliability: 99.9% uptime during lab windows and auto-recovery from transient VMware faults.
  • Provisioning time: Less than five minutes from "Create Lab" to "Student VM is ready" so instructors could iterate quickly.
  • Security & FERPA: Least-privilege access, audit trails, and SAML integration with the campus IdP.
  • Operational simplicity: A single engineer should be able to run the system without heroics.
  • Cost visibility: Right-size clusters, archive unused templates, and surface per-course cost in dashboards.

High-Level Architecture

Next.js App  -->  API Gateway  -->  Service Layer (Node)
                                  |--> Provisioning Queue (Redis)
                                  |--> Worker Fleet (Go)
                                  |--> VMware vCenter
                                  |--> Postgres + Mongo (state + metrics)
                                  |--> Observability Stack (Grafana + Loki + Tempo)

  • Frontend: Next.js App Router with a design-system-powered UI.
  • API Layer: Node services handling auth, validation, and orchestration.
  • Worker Fleet: Idempotent Go workers talking to vCenter, calling Terraform modules, and streaming status events.
  • Queues: Redis Streams for backpressure and retry semantics.
  • Storage: Postgres for transactional state, Mongo for high-cardinality metrics.
  • Observability: Prometheus exporters, Loki structured logs, Grafana dashboards, PagerDuty alerts.

Provisioning Pipeline: Create Class → VM Ready

  1. Instructor submits a "Create Class" form.
  2. API validates quotas, templates, and SAML groups.
  3. Payload is enqueued with a deterministic job ID.
  4. Worker picks up the job, marks it in progress, and requests resources from vCenter.
  5. Terraform module deploys VMs, attaches networking, applies security baselines.
  6. Health checks verify guest customization, DNS, and domain join.
  7. Success event updates Postgres, triggers notifications, and exposes a real-time progress UI.
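The deterministic job ID in step 3 is what makes the rest of the pipeline safe to retry. A minimal sketch of how such an ID might be derived (the `jobID` helper and its field names are illustrative, not the platform's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// jobID derives a stable identifier from the fields that define a class.
// Submitting the same "Create Class" form twice yields the same ID, so the
// queue and the workers can deduplicate instead of provisioning twice.
func jobID(courseID, template, term string) string {
	sum := sha256.Sum256([]byte(strings.Join([]string{courseID, template, term}, "|")))
	return "prov-" + hex.EncodeToString(sum[:])[:16]
}

func main() {
	// Two submissions of the same payload collide on purpose.
	fmt.Println(jobID("CS101", "ubuntu-22.04-sec", "2024-fall"))
	fmt.Println(jobID("CS101", "ubuntu-22.04-sec", "2024-fall"))
}
```

Because the ID doubles as the Redis Streams message key and the vCenter resource-name prefix, a retried job lands on the same resources it created the first time.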

Resilience Patterns

  • Idempotent jobs: Job IDs map to resource names, so retries never double-provision.
  • Backoff + circuit breakers: vCenter rate limits are respected, and failing clusters are quarantined.
  • Compensating actions: Partial failures roll back provisioned VMs and emit structured error codes.

Security & SSO in an Academic Setting

We federated with the university SAML IdP. Attribute mappings drive RBAC:

  • student → read-only VM access.
  • ta → limited provisioning for courses they assist.
  • faculty → full provisioning, class management, invite controls.
  • admin → platform operations, audit tools, feature flags.

Session hardening included SameSite=strict cookies, rotating session keys, CSRF tokens, and audit logging for every privileged action.
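The attribute-to-role mapping reduces to a small lookup plus a "most privileged wins" rule. A sketch under assumed attribute values and permission names (the campus IdP's actual schema differs):

```go
package main

import "fmt"

// Permissions granted to each RBAC role, from least to most privileged.
// The permission strings are illustrative.
var rolePerms = map[string][]string{
	"student": {"vm:read"},
	"ta":      {"vm:read", "vm:provision:assigned-courses"},
	"faculty": {"vm:read", "vm:provision", "class:manage", "class:invite"},
	"admin":   {"vm:read", "vm:provision", "class:manage", "class:invite", "platform:ops", "audit:read", "flags:write"},
}

// roleFromSAML picks the most privileged role present in the IdP's
// affiliation attributes, defaulting to student.
func roleFromSAML(affiliations []string) string {
	rank := map[string]int{"student": 0, "ta": 1, "faculty": 2, "admin": 3}
	role := "student"
	for _, a := range affiliations {
		if r, ok := rank[a]; ok && r > rank[role] {
			role = a
		}
	}
	return role
}

func main() {
	// A TA who is also enrolled as a student gets the TA role.
	role := roleFromSAML([]string{"student", "ta"})
	fmt.Println(role, rolePerms[role])
}
```

Defaulting unknown affiliations to `student` keeps the failure mode least-privileged, which is the posture FERPA work pushes you toward anyway.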

Observability at 2 AM

We treated provisioning like any mission-critical pipeline:

  • Structured logs: JSON with correlation_id, course_id, vm_id for easy search.
  • Metrics: Provisioning latency, success rate, retry counts, cluster saturation.
  • Tracing: Tempo traces from API → worker → vCenter to pinpoint slow calls.
  • Dashboards: Course health overview, per-cluster capacity, SLA compliance, support ticket volume.

During week one we hit a vCenter API rate limit. Metrics showed a latency spike, logs pointed to a specific cluster, and we throttled requests with a hotfix. Provisioning resumed within ten minutes.

Impact & Lessons Learned

  • Provisioning time dropped from four hours to under five minutes.
  • Support tickets related to broken labs fell by 95%.
  • Faculty trusted the system enough to run live assessments.
  • Observability reduced mean time to resolution from hours to minutes.

If I shipped v2, I would add multi-region failover, autoscaling for worker pools, and a richer student UI for self-service troubleshooting. But we hit our SLAs and finally slept through lab nights. 💤