Infrastructure · System Design · March 9, 2026 · 12 min read

How I Built a Message-Triggered On-Demand Docker Platform

Start containers when a message arrives • Run isolated workloads • Stop and delete them after completion


Tushar Badgujar

Backend Engineer (5+ yrs) building scalable SaaS systems

System Flow at a Glance

👤 User Request → API Server → 📋 Job Queue → ⚙️ Worker Node → 🐳 Docker Container → Response
Motivation

Why I Built This

Traditional SaaS architectures run on always-on infrastructure — every user shares the same servers, the same databases, and the same risk. Even when no one is using the system, you're paying full price for idle compute.

I needed something different. A system where a container starts only when a message or request arrives, runs the required workload in isolation, and then gets stopped and deleted after completion. That gave me a practical way to reduce idle infrastructure cost while keeping execution isolated per job.

💸

Cost Waste

Idle resources consuming budget 24/7

🔒

Shared Risk

One bug affects all tenants simultaneously

📈

Manual Scaling

Adding capacity required human intervention


Engineering

System Architecture

The platform splits into three layers: a Control Server (the brain), Worker Nodes (the muscle), and a Reverse Proxy (the router). Each layer has a single responsibility and communicates via a job queue.

Control Server

  • Auth & request validation
  • Queue job dispatch
  • Container state tracking
  • Resource availability

Worker Nodes

  • Docker Engine runtime
  • Consumes queue jobs
  • Container lifecycle mgmt
  • Reports health status

Traefik Proxy

  • Dynamic route discovery
  • Listens to Docker events
  • TLS termination
  • Zero-downtime routing
provisioner.service.js
const Docker = require("dockerode");
const docker = new Docker({ socketPath: '/var/run/docker.sock' });

async function provisionUserEnvironment(userId, config) {
  const container = await docker.createContainer({
    Image: config.image || "standard-node-env:latest",
    name: `user_inst_${userId}`,
    Env: [`INSTANCE_ID=${userId}`, `MAX_TIMEOUT=3600`],
    HostConfig: {
      Memory: 1024 * 1024 * 512, // 512MB RAM
      CpuQuota: 50000,            // 0.5 CPU
      AutoRemove: true,           // Auto-destroy on exit
      NetworkMode: "isolated_nw"
    },
    Labels: {
      "traefik.enable": "true",
      [`traefik.http.routers.${userId}.rule`]: `PathPrefix("/u/${userId}")`
    }
  });

  await container.start();
  const data = await container.inspect();
  return { id: data.Id, ip: data.NetworkSettings.Networks.isolated_nw.IPAddress };
}

Deep Dive

How It Works — Step by Step

👤
Step 01

Request Received

User hits the API endpoint. The control server authenticates and validates the request, then creates a job record in the database.

📋
Step 02

Job Pushed to Queue

The job payload (userId, image, config) is published to a LavishMQ queue. This decouples the API from actual container execution.
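As a sketch, the publish step might look like the following. The queue name (`provision-jobs`), the payload fields, and the helpers `buildJobPayload`/`publishJob` are illustrative, and the channel is assumed to expose an amqplib-style `sendToQueue`:

```javascript
// Illustrative job payload for Step 02; field names are assumptions, not the
// platform's exact schema.
function buildJobPayload(userId, image, config = {}) {
  return JSON.stringify({
    userId,
    image: image || "standard-node-env:latest",
    config,
    enqueuedAt: Date.now(),
    attempts: 0, // bumped on every retry (see Failure Handling)
  });
}

// channel is assumed to expose an amqplib-style sendToQueue(queue, buffer, opts).
function publishJob(channel, userId, image, config) {
  const payload = buildJobPayload(userId, image, config);
  channel.sendToQueue("provision-jobs", Buffer.from(payload), {
    persistent: true, // survive a broker restart
  });
}
```

Marking the message persistent is what lets in-flight work survive a broker restart, which the failure-handling section relies on.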

⚙️
Step 03

Worker Picks the Job

An idle worker node consumes the next available message from LavishMQ. Message acknowledgements and the competing-consumers pattern keep job distribution clean without a central orchestrator.
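A minimal worker loop, again assuming an amqplib-style channel (the queue name and `runJob` callback are illustrative):

```javascript
// Parse and sanity-check a job message; malformed payloads are rejected.
function parseJob(buffer) {
  const job = JSON.parse(buffer.toString());
  if (!job.userId || !job.image) throw new Error("malformed job payload");
  return job;
}

// Worker loop sketch. prefetch(1) means each worker holds at most one
// unacknowledged job at a time: the competing-consumers pattern.
async function startWorker(channel, runJob) {
  await channel.prefetch(1);
  await channel.consume("provision-jobs", async (msg) => {
    try {
      const job = parseJob(msg.content);
      await runJob(job);               // provision container, run to completion
      channel.ack(msg);                // ack only after success
    } catch (err) {
      channel.nack(msg, false, false); // reject; broker routes to retry/DLQ
    }
  });
}
```

Acking only after the job finishes is what makes node failure recoverable: an unacked message is re-delivered to another worker when the connection drops.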

🐳
Step 04

Container Spins Up

The worker calls the Docker SDK to create and start a container with resource limits, env vars, and Traefik routing labels.

▶️
Step 05

Task Executed

The container runs its workload — whether a WhatsApp bot, an AI agent, or a background job. Logs stream back in real time.
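Real-time log streaming can be sketched with dockerode's `container.logs`; the line-splitting helper is my own illustration, not part of the platform:

```javascript
// Buffers arbitrary chunks and emits complete lines; a partial line waits for
// the next chunk. Used below to turn a raw log stream into per-line events.
function makeLineSplitter(onLine) {
  let buf = "";
  return (chunk) => {
    buf += chunk.toString();
    let idx;
    while ((idx = buf.indexOf("\n")) >= 0) {
      onLine(buf.slice(0, idx));
      buf = buf.slice(idx + 1);
    }
  };
}

// Sketch of real-time log streaming via dockerode. demuxStream splits Docker's
// multiplexed stdout/stderr frames; here both land in the same PassThrough.
async function streamLogs(docker, container, onLine) {
  const { PassThrough } = require("stream");
  const raw = await container.logs({ follow: true, stdout: true, stderr: true });
  const out = new PassThrough();
  docker.modem.demuxStream(raw, out, out);
  out.on("data", makeLineSplitter(onLine));
}
```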

🗑️
Step 06

Container Destroyed

On completion (or timeout), the container exits. AutoRemove ensures it's immediately cleaned up with zero manual intervention.


Scalability

Scaling Model

Horizontal scaling is native to this architecture. The queue acts as the central nervous system — adding a new worker node is the only step required to increase throughput. No config changes, no restarts, no coordination layer.

  • LavishMQ competing consumers for job distribution
  • Stateless workers — add/remove freely
  • Queue depth as the scaling signal
  • Worker autoscale via load metrics
Job Queue (LavishMQ)
  ├─ Worker 1 → Container A
  ├─ Worker 2 → Container B
  └─ Worker 3 → Container C

Each worker independently picks a job → spins up its own container → scale horizontally by adding more workers
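The "queue depth as the scaling signal" idea can be sketched as a small decision function. `checkQueue` is amqplib's way of reading `messageCount`; the per-worker capacity and bounds are illustrative tuning knobs:

```javascript
// Decide how many workers to run from queue depth alone. perWorkerCapacity,
// min, and max are illustrative knobs, not values from the platform.
function desiredWorkers(queueDepth, perWorkerCapacity, min = 1, max = 8) {
  const needed = Math.ceil(queueDepth / perWorkerCapacity);
  return Math.min(max, Math.max(min, needed));
}

// Poll-loop sketch: an amqplib channel's checkQueue resolves { messageCount }.
// scaleTo is a hypothetical callback into whatever provisions worker nodes.
async function autoscaleTick(channel, scaleTo) {
  const { messageCount } = await channel.checkQueue("provision-jobs");
  await scaleTo(desiredWorkers(messageCount, 10));
}
```

Clamping to a minimum keeps at least one warm worker, and the maximum caps spend during bursts.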


Applications

Real Use Cases

WhatsApp Automation

Each user's WhatsApp bot runs in an isolated container. Session state, webhooks, and memory are fully contained per tenant.

Background Workers

CPU-intensive tasks like PDF generation, video processing, or data aggregation run in ephemeral containers without blocking the main API.

AI Agents

Spawn a dedicated container per AI workload — each with its own model config, context window, and tool calls. True agent isolation.

Multi-Tenant SaaS

Every customer gets their own compute environment. Impossible for one tenant to read another's data or exhaust shared resources.


Philosophy

Design Principles

Isolation over Shared Systems

Containers provide hard boundaries. No shared memory, no shared file system, no cross-tenant leakage. Each environment is a fortress.

Event-Driven Architecture

Nothing is polled. Jobs flow through the queue as events. Workers react. The system is asynchronous by design and scales naturally.

On-Demand Compute

Infrastructure exists only when it's needed. No idle cost. No over-provisioning. Resources appear, execute, and vanish.

Horizontal Scalability

Scale-out over scale-up. Add more workers, not bigger machines. Linear cost scaling with zero architectural changes.


Resilience

Failure Handling

Retry Mechanism

Failed jobs are re-queued with exponential back-off up to 3 attempts before being moved to a dead-letter queue for inspection.
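The retry policy above can be sketched as a pure decision function plus a republish step. Queue names and the base delay are illustrative; the actual delay would be enforced by a TTL queue or delayed exchange, which is not shown:

```javascript
// Exponential back-off with 3 attempts, then dead-letter, matching the text.
// attempts = failures so far; base delay of 1s is an illustrative choice.
function nextAction(attempts, maxAttempts = 3, baseDelayMs = 1000) {
  if (attempts >= maxAttempts) return { action: "dead-letter" };
  return { action: "retry", delayMs: baseDelayMs * 2 ** (attempts - 1) };
}

// On failure: republish with an incremented attempt counter, either to a
// (hypothetical) retry queue or to the dead-letter queue for inspection.
function handleFailure(channel, job) {
  const attempts = (job.attempts || 0) + 1;
  const decision = nextAction(attempts);
  const body = Buffer.from(JSON.stringify({ ...job, attempts }));
  const queue =
    decision.action === "retry" ? "provision-jobs-retry" : "provision-jobs-dlq";
  channel.sendToQueue(queue, body, { persistent: true });
  return decision;
}
```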

Container Crash Recovery

If a container crashes unexpectedly, the worker detects the exit code, marks the job as failed, and triggers a retry or alert.
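Exit-code detection can be sketched with dockerode's `container.wait()`, which resolves with the container's `StatusCode` on exit. The classification helper and `markJob` callback are illustrative:

```javascript
// A non-zero status means the workload crashed or was killed; 137 is the
// classic SIGKILL case, often the OOM killer.
function classifyExit(statusCode) {
  if (statusCode === 0) return "success";
  if (statusCode === 137) return "killed";
  return "failed";
}

// Sketch: dockerode's container.wait() resolves with { StatusCode } once the
// container exits. markJob is a hypothetical job-store callback that records
// the outcome and triggers the retry or alert path.
async function superviseContainer(container, markJob) {
  const { StatusCode } = await container.wait();
  await markJob(classifyExit(StatusCode), StatusCode);
}
```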

Timeout Handling

Every container has a MAX_TIMEOUT env var. A watchdog process polls for overdue containers and forcefully terminates them.
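A watchdog sweep might look like the following dockerode sketch. One caveat: `listContainers()` does not expose env vars, so this version reads a hypothetical `max_timeout` label rather than the `MAX_TIMEOUT` env var; inspecting each container individually would work too:

```javascript
// True when the container has run longer than its allowed timeout.
// Docker reports Created as unix seconds.
function isOverdue(createdUnixSec, timeoutSec, nowMs = Date.now()) {
  return nowMs / 1000 - createdUnixSec > timeoutSec;
}

// Watchdog sweep sketch with dockerode. The "max_timeout" label is an
// assumption of this sketch; listContainers() cannot see env vars.
async function watchdogSweep(docker) {
  const running = await docker.listContainers(); // running containers only
  for (const info of running) {
    const timeoutSec = Number((info.Labels || {})["max_timeout"] || 3600);
    if (isOverdue(info.Created, timeoutSec)) {
      await docker.getContainer(info.Id).kill(); // AutoRemove cleans up the rest
    }
  }
}
```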

Node Failure Fallback

If a worker node goes down, its in-flight jobs remain unacknowledged in the queue and are re-delivered to healthy workers once a heartbeat monitor detects the failure.


Honesty

Trade-offs

Every architectural decision is a trade-off. This system shines in isolation and scalability, but comes with real costs worth understanding.

Cold Start Latency (~1-3s) vs Reduced Infrastructure Cost

Containers take one to three seconds to spin up. For latency-sensitive flows, this is non-trivial. The trade-off pays off for workloads longer than a few seconds.

Full Isolation vs Overhead per Tenant

Running a separate container per user consumes more RAM and CPU than a shared-thread model. You're trading efficiency for safety.

Queue Delay vs Backpressure & Throughput

High concurrency means jobs wait in queue. The queue prevents overload but adds latency under burst traffic.


Analysis

Traditional SaaS vs On-Demand Containers

Feature               | Traditional SaaS     | On-Demand Containers
----------------------|----------------------|-------------------------
Infrastructure cost   | Always-on (high)     | Pay per use (low)
Resource utilization  | Low (idle waste)     | High (on-demand)
User isolation        | Shared / risky       | Full container isolation
Scaling               | Manual / slow        | Queue-driven auto-scale
Cold start            | None                 | ~1-3 seconds
Multi-tenancy         | Complex / shared DB  | Native per-container
Debug visibility      | Centralized logs     | Per-container logs
Infrastructure ops    | Moderate             | Higher (orchestration)

War Stories

Challenges Faced

01

Managing Container Lifecycle

Containers that hung silently were the hardest to debug. Solution: a dedicated watchdog service that polls active containers every 30s and forcibly kills any that exceed their timeout or have exited unexpectedly.

02

Dynamic Routing with Traefik

Traefik's Docker label-based routing works beautifully — but only when labels are applied at container creation time. Changing routes on a running container requires a full restart. Understanding this constraint early saved significant debugging time.

03

Multi-Server Coordination

When workers run on separate physical machines, ensuring a job doesn't get picked up twice requires careful queue design. LavishMQ consumer acknowledgements and delivery semantics solved this cleanly — only one worker receives and completes each job.

04

Debugging Distributed Systems

A bug that only manifests when Worker 2 processes a specific job type is genuinely hard to reproduce. The investment in per-container structured logging (shipped to a central service) paid for itself within the first week.


Impact

Results & Impact

~40%

Reduction in infra cost vs always-on

50+

Concurrent containers supported

<3s

Avg container cold start time

99.2%

Job success rate (with retries)

* Numbers are approximate from load tests and production usage. Exact results depend on workload and hardware.


Summary

Key Takeaways

  • On-demand containers dramatically reduce cost. Pay only for active compute; idle cost drops to near-zero for workloads with natural pause patterns.
  • Queue-based scaling is elegant and powerful. LavishMQ plus stateless workers gives you horizontal scalability with no central orchestrator, no config changes, and no downtime.
  • Isolation improves reliability and security. One container crash doesn't impact other tenants. Container boundaries are the best multi-tenancy primitive available today.
  • Build the observability layer first. Distributed systems are only debuggable if you invest in logging, metrics, and alerting before you need them.

Want to build scalable systems like this?

I'm available for engineering consulting, system design reviews, and backend architecture work.

Let's Talk