How I Built a Message-Triggered On-Demand Docker Platform
Start containers when a message arrives • Run isolated workloads • Stop and delete them after completion
Tushar Badgujar
Backend Engineer (5+ yrs) building scalable SaaS systems
System Flow at a Glance
Why I Built This
Traditional SaaS architectures run on always-on infrastructure — every user shares the same servers, the same databases, and the same risk. Even when no one is using the system, you're paying full price for idle compute.
I needed something different. A system where a container starts only when a message or request arrives, runs the required workload in isolation, and then gets stopped and deleted after completion. That gave me a practical way to reduce idle infrastructure cost while keeping execution isolated per job.
Cost Waste
Idle resources consuming budget 24/7
Shared Risk
One bug affects all tenants simultaneously
Manual Scaling
Adding capacity required human intervention
System Architecture
The platform splits into three layers: a Control Server (the brain), Worker Nodes (the muscle), and a Reverse Proxy (the router). Each layer has a single responsibility and communicates via a job queue.
Control Server
- Auth & request validation
- Queue job dispatch
- Container state tracking
- Resource availability
Worker Nodes
- Docker Engine runtime
- Consumes queue jobs
- Container lifecycle mgmt
- Reports health status
Traefik Proxy
- Dynamic route discovery
- Listens to Docker events
- TLS termination
- Zero-downtime routing
const Docker = require("dockerode");
const docker = new Docker({ socketPath: '/var/run/docker.sock' });
async function provisionUserEnvironment(userId, config) {
const container = await docker.createContainer({
Image: config.image || "standard-node-env:latest",
name: `user_inst_${userId}`,
Env: [`INSTANCE_ID=${userId}`, `MAX_TIMEOUT=3600`],
HostConfig: {
Memory: 1024 * 1024 * 512, // 512MB RAM
CpuQuota: 50000, // 0.5 CPU (with the default 100ms CpuPeriod)
AutoRemove: true, // Auto-destroy on exit
NetworkMode: "isolated_nw"
},
Labels: {
"traefik.enable": "true",
[`traefik.http.routers.${userId}.rule`]: `PathPrefix("/u/${userId}")`
}
});
await container.start();
const data = await container.inspect();
return { id: data.Id, ip: data.NetworkSettings.Networks.isolated_nw.IPAddress };
}

How It Works — Step by Step
Request Received
User hits the API endpoint. The control server authenticates and validates the request, then creates a job record in the database.
Job Pushed to Queue
The job payload (userId, image, config) is published to a LavishMQ queue. This decouples the API from actual container execution.
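The dispatch step can be sketched as follows — a minimal example assuming LavishMQ is reached over AMQP via the `amqplib` package, with a queue named `container-jobs`; both names are illustrative, not from the original setup:

```javascript
// Build the job payload the API publishes for a user request.
// Pure function, so it can be unit-tested without a broker.
function buildJobPayload(userId, config) {
  return Buffer.from(JSON.stringify({
    userId,
    image: config.image || "standard-node-env:latest",
    config,
    attempt: 1,          // used later for retry accounting
    enqueuedAt: Date.now(),
  }));
}

// Publish the payload to the job queue. Assumes an AMQP-compatible
// broker; `amqplib` is required lazily so the pure logic above
// works even without the dependency installed.
async function dispatchJob(userId, config, amqpUrl = "amqp://localhost") {
  const amqp = require("amqplib");
  const conn = await amqp.connect(amqpUrl);
  const ch = await conn.createChannel();
  await ch.assertQueue("container-jobs", { durable: true });
  ch.sendToQueue("container-jobs", buildJobPayload(userId, config), {
    persistent: true, // survive a broker restart
  });
  await ch.close();
  await conn.close();
}
```

Declaring the queue durable and the messages persistent means accepted jobs survive a broker restart — which is what makes the "job record first, execution later" decoupling safe.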
Worker Picks the Job
An idle worker node consumes the next available message from LavishMQ. Message acknowledgements and the competing-consumers pattern keep job distribution clean without a central orchestrator.
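A worker-side consumer might look like this — a sketch reusing the illustrative `container-jobs` queue name. `prefetch(1)` is what makes competing consumers fair: a worker only receives a new message after acknowledging the previous one:

```javascript
// Parse and validate a raw queue message. Pure, so it can be
// tested without a broker; throws on malformed payloads.
function parseJob(content) {
  const job = JSON.parse(content.toString());
  if (!job.userId || !job.image) {
    throw new Error("malformed job payload");
  }
  return job;
}

// Consume jobs one at a time. `runContainer` is a placeholder for
// the Docker provisioning shown earlier.
async function startWorker(channel, runContainer) {
  await channel.assertQueue("container-jobs", { durable: true });
  channel.prefetch(1); // at most one unacknowledged job per worker
  channel.consume("container-jobs", async (msg) => {
    try {
      const job = parseJob(msg.content);
      await runContainer(job);
      channel.ack(msg);                // done: remove from queue
    } catch (err) {
      channel.nack(msg, false, false); // reject; broker can dead-letter it
    }
  });
}
```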
Container Spins Up
The worker calls the Docker SDK to create and start a container with resource limits, env vars, and Traefik routing labels.
Task Executed
The container runs its workload — whether a WhatsApp bot, an AI agent, or a background job. Logs stream back in real time.
Container Destroyed
On completion (or timeout), the container exits. AutoRemove ensures it's immediately cleaned up with zero manual intervention.
Scaling Model
Horizontal scaling is native to this architecture. The queue acts as the central nervous system — adding a new worker node is the only step required to double throughput. No config changes, no restarts, no coordination layer.
Each worker independently picks a job and spins up its own container; throughput scales horizontally simply by adding more workers.
Real Use Cases
WhatsApp Automation
Each user's WhatsApp bot runs in an isolated container. Session state, webhooks, and memory are fully contained per tenant.
Background Workers
CPU-intensive tasks like PDF generation, video processing, or data aggregation run in ephemeral containers without blocking the main API.
AI Agents
Spawn a dedicated container per AI workload — each with its own model config, context window, and tool calls. True agent isolation.
Multi-Tenant SaaS
Every customer gets their own compute environment. Impossible for one tenant to read another's data or exhaust shared resources.
Design Principles
Isolation over Shared Systems
Containers provide hard boundaries. No shared memory, no shared file system, no cross-tenant leakage. Each environment is a fortress.
Event-Driven Architecture
Nothing is polled. Jobs flow through the queue as events. Workers react. The system is asynchronous by design and scales naturally.
On-Demand Compute
Infrastructure exists only when it's needed. No idle cost. No over-provisioning. Resources appear, execute, and vanish.
Horizontal Scalability
Scale-out over scale-up. Add more workers, not bigger machines. Linear cost scaling with zero architectural changes.
Failure Handling
Retry Mechanism
Failed jobs are re-queued with exponential back-off up to 3 attempts before being moved to a dead-letter queue for inspection.
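The retry policy above can be sketched as pure scheduling logic — the base delay is an illustrative value, not from the original system:

```javascript
const MAX_ATTEMPTS = 3;
const BASE_DELAY_MS = 1000;

// Exponential back-off: 1s, 2s, 4s for attempts 1..3.
function backoffDelayMs(attempt) {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

// Decide what to do with a job that just failed.
function nextAction(job) {
  if (job.attempt >= MAX_ATTEMPTS) {
    return { action: "dead-letter" }; // park for human inspection
  }
  return {
    action: "requeue",
    delayMs: backoffDelayMs(job.attempt),
    job: { ...job, attempt: job.attempt + 1 },
  };
}
```

Keeping the attempt counter inside the job payload (rather than in the broker) makes the policy portable across brokers and trivial to test.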
Container Crash Recovery
If a container crashes unexpectedly, the worker detects the exit code, marks the job as failed, and triggers a retry or alert.
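Exit-code handling on the worker can be sketched as a small classifier — the mapping below reflects common Docker conventions (0 = success, 137 = SIGKILL such as an OOM-kill or forced stop, 143 = SIGTERM), not the author's exact rules:

```javascript
// Map a container exit code to a job outcome.
function classifyExit(exitCode) {
  if (exitCode === 0) return "success";
  if (exitCode === 137) return "killed";     // SIGKILL: OOM or forced stop
  if (exitCode === 143) return "terminated"; // SIGTERM: graceful stop
  return "failed";                           // application error -> retry/alert
}

// With dockerode, the worker can await the exit code like this
// (sketch; needs a Docker daemon to actually run):
async function watchContainer(container) {
  const { StatusCode } = await container.wait();
  return classifyExit(StatusCode);
}
```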
Timeout Handling
Every container has a MAX_TIMEOUT env var. A watchdog process polls for overdue containers and forcefully terminates them.
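The watchdog's overdue check is mostly pure logic — a sketch where each tracked container is assumed to carry a `startedAtMs` timestamp and a `timeoutSec` derived from its `MAX_TIMEOUT` env var:

```javascript
// Return the containers that have outlived their timeout.
function findOverdue(containers, nowMs = Date.now()) {
  return containers.filter(
    (c) => nowMs - c.startedAtMs > c.timeoutSec * 1000
  );
}

// Watchdog loop sketch: every 30s, kill whatever is overdue.
// `docker` would be a dockerode instance; not invoked here.
function startWatchdog(docker, listActive) {
  setInterval(async () => {
    for (const c of findOverdue(await listActive())) {
      await docker.getContainer(c.id).kill(); // SIGKILL; AutoRemove cleans up
    }
  }, 30_000);
}
```

Because the containers are created with `AutoRemove: true`, the watchdog only needs to kill — deletion happens for free.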
Node Failure Fallback
If a worker node goes down, its in-flight jobs remain in the queue or are re-delivered by a heartbeat monitor, picked up by healthy workers.
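Node-failure detection can be sketched as a staleness check over worker heartbeats — the 15-second threshold is an illustrative value:

```javascript
const HEARTBEAT_TIMEOUT_MS = 15_000;

// A worker is considered dead once its last heartbeat is too old.
// `heartbeats` maps workerId -> last-seen timestamp (ms).
function deadWorkers(heartbeats, nowMs = Date.now()) {
  return Object.entries(heartbeats)
    .filter(([, lastSeenMs]) => nowMs - lastSeenMs > HEARTBEAT_TIMEOUT_MS)
    .map(([workerId]) => workerId);
}
```

Note that with AMQP-style brokers, a dead worker's unacknowledged messages are re-queued automatically when its connection drops, so the heartbeat monitor mainly covers bookkeeping (marking jobs, alerting) rather than re-delivery itself.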
Trade-offs
Every architectural decision is a trade-off. This system shines in isolation and scalability, but comes with real costs worth understanding.
Cold-Start Latency
Containers take a second or more to spin up. For latency-sensitive flows this is non-trivial; the trade-off pays off for workloads that run longer than a few seconds.
Resource Overhead
Running a separate container per user consumes more RAM and CPU than a shared-thread model. You're trading efficiency for safety.
Queue Backpressure
Under high concurrency, jobs wait in the queue. The queue prevents overload but adds latency during burst traffic.
Traditional SaaS vs On-Demand Containers
| Feature | Traditional SaaS | On-Demand Containers |
|---|---|---|
| Infrastructure cost | Always-on (high) | Pay per use (low) |
| Resource utilization | Low (idle waste) | High (on-demand) |
| User isolation | Shared / risky | Full container isolation |
| Scaling | Manual / slow | Queue-driven auto-scale |
| Cold start | None | ~1-3 seconds |
| Multi-tenancy | Complex / shared DB | Native per-container |
| Debug visibility | Centralized logs | Per-container logs |
| Infrastructure ops | Moderate | Higher (orchestration) |
Challenges Faced
Managing Container Lifecycle
Containers that hung silently were the hardest to debug. Solution: a dedicated watchdog service that polls active containers every 30s and forcibly kills any that exceed their timeout or have exited unexpectedly.
Dynamic Routing with Traefik
Traefik's Docker label-based routing works beautifully — but only when labels are applied at container creation time. Changing routes on a running container requires a full restart. Understanding this constraint early saved significant debugging time.
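Because routes are fixed at creation time, it helps to centralize label construction in one helper — a sketch producing the same PathPrefix rule style used in the provisioning code earlier (the service port is an illustrative addition):

```javascript
// Build the Traefik labels a container needs at creation time.
// Routes cannot be changed on a running container, so everything
// routing-related must be decided here, before `createContainer`.
function buildTraefikLabels(userId, { port = 3000 } = {}) {
  return {
    "traefik.enable": "true",
    [`traefik.http.routers.${userId}.rule`]: `PathPrefix("/u/${userId}")`,
    [`traefik.http.services.${userId}.loadbalancer.server.port`]: String(port),
  };
}
```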
Multi-Server Coordination
When workers run on separate physical machines, ensuring a job doesn't get picked up twice requires careful queue design. LavishMQ consumer acknowledgements and delivery semantics solved this cleanly — only one worker receives and completes each job.
Debugging Distributed Systems
A bug that only manifests when Worker 2 processes a specific job type is genuinely hard to reproduce. The investment in per-container structured logging (shipped to a central service) paid for itself within the first week.
Results & Impact
- Reduction in infra cost vs always-on
- Concurrent containers supported
- Avg container cold-start time
- Job success rate (with retries)
* Numbers are approximate from load tests and production usage. Exact results depend on workload and hardware.
Key Takeaways
- On-demand containers dramatically reduce cost — Pay only for active compute. Idle cost drops to near-zero for workloads with natural pause patterns.
- Queue-based scaling is elegant and powerful — LavishMQ + stateless workers gives you horizontal scalability with no central orchestrator, no config changes, and no downtime.
- Isolation improves reliability and security — One container crash doesn't impact other tenants. Container boundaries are the best multi-tenancy primitive available today.
- Build the observability layer first — Distributed systems are only debuggable if you invest in logging, metrics, and alerting before you need them.
Want to build scalable systems like this?
I'm available for engineering consulting, system design reviews, and backend architecture work.
Let's Talk