# Objective

- Identify and quantify bottlenecks in the current polling-based architecture
- Determine whether streaming is the right solution
- If yes, outline the least-effort path to implementation
## Part 1: Current Polling Architecture

### How It Works

Workers (Runtimes) poll the State Manager every `poll_interval` (default: 1 second) asking for work:
```mermaid
sequenceDiagram
    participant RT as Runtime
    participant SM as State Manager
    participant DB as MongoDB
    loop Every 1 second
        RT->>SM: POST /states/enqueue
        SM->>DB: Query for CREATED states
        SM->>DB: Update status to QUEUED
        SM-->>RT: Return states (or empty array)
    end
```
### Key Configuration

| Parameter | Default | Purpose |
|---|---|---|
| `poll_interval` | 1 s | Time between polls |
| `batch_size` | 16 | States fetched per poll |
| `workers` | 4 | Parallel executors per runtime |
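The runtime-side loop can be sketched roughly as follows. This is a minimal sketch, not the SDK's actual code: `enqueue` stands in for the `POST /states/enqueue` call and `handle` for the node executor, both injected so the sketch runs without a server.

```python
import time

def poll_loop(enqueue, handle, poll_interval=1.0, max_polls=None):
    """Minimal polling loop: ask the State Manager for a batch of
    states, execute each one, and sleep when the queue is empty."""
    polls = 0
    while max_polls is None or polls < max_polls:
        states = enqueue()              # up to batch_size states, or []
        for state in states:
            handle(state)
        if not states:
            time.sleep(poll_interval)   # idle poll: wait a full interval
        polls += 1

# Usage with a stubbed transport: two busy polls, then one idle poll.
queue = [["s1", "s2"], ["s3"], []]
done = []
poll_loop(lambda: queue.pop(0) if queue else [], done.append,
          poll_interval=0, max_polls=3)
# done == ["s1", "s2", "s3"]
```

The `time.sleep` on the empty branch is exactly where the latency and wasted-request bottlenecks below come from.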
## Part 2: Bottleneck Analysis

### Bottleneck 1: Latency

**Problem**: States wait up to `poll_interval` before being picked up.
| Scenario | Latency |
|---|---|
| Best case (state created just before poll) | ~0 ms |
| Worst case (state created just after poll) | ~1000 ms |
| Average | ~500 ms |
**Impact**: For latency-sensitive workflows (real-time processing, user-facing operations), a 500 ms average delay may be unacceptable.
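The ~500 ms average follows from states arriving uniformly at random within the poll window, so the expected wait is half the interval. A quick simulation confirms it:

```python
import random

random.seed(0)
poll_interval = 1.0  # seconds (the default)

# A state created at a uniformly random point in the window waits
# until the next poll, so its wait is uniform on [0, poll_interval].
waits = [random.uniform(0, poll_interval) for _ in range(100_000)]
avg_wait = sum(waits) / len(waits)
# avg_wait is close to 0.5 seconds, i.e. ~500 ms
```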
### Bottleneck 2: Wasted Requests

**Problem**: Runtimes poll even when no work is available.

**Analysis** (per runtime):

- Polls per minute: 60
- If the queue is empty 80% of the time: 48 wasted requests/minute
- With 100 runtimes: 4,800 wasted requests/minute
**Impact**: Unnecessary network traffic and State Manager load.
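The same arithmetic in code form, so the assumptions (idle fraction, fleet size) are explicit and easy to vary:

```python
poll_interval = 1.0   # seconds between polls
idle_fraction = 0.80  # fraction of polls returning an empty array
runtimes = 100

polls_per_minute = 60 / poll_interval                  # per runtime
wasted_per_runtime = polls_per_minute * idle_fraction  # 48/minute
wasted_fleet_wide = wasted_per_runtime * runtimes      # 4800/minute
```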
### Bottleneck 3: Database Load

**Problem**: Each poll triggers a MongoDB query, even when idle.

**Current query pattern**:

```python
find_one_and_update(
    {"status": "CREATED", "node_name": {"$in": nodes}},
    {"$set": {"status": "QUEUED"}}
)
```
**Analysis**:

- Queries per minute per runtime: 60
- With 100 runtimes: 6,000 queries/minute (regardless of actual work)

**Impact**: MongoDB load scales with runtime count, not workload.
### Bottleneck 4: Scalability Ceiling

**Problem**: Adding more runtimes linearly increases polling load.

`Total requests/sec = num_runtimes / poll_interval`
| Runtimes | Requests/sec to State Manager |
|---|---|
| 10 | 10 |
| 100 | 100 |
| 1000 | 1000 |
**Impact**: The State Manager becomes a bottleneck before compute is saturated.
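The formula as a one-liner makes the point concrete: load is a function of fleet size only, never of how much work actually exists.

```python
def polling_load(num_runtimes, poll_interval=1.0):
    """Requests/sec hitting the State Manager from polling alone."""
    return num_runtimes / poll_interval

# Doubling the fleet doubles State Manager traffic even if every
# poll comes back empty.
# polling_load(10) == 10.0, polling_load(1000) == 1000.0
```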
### Bottleneck Summary

| Bottleneck | Severity | Affects |
|---|---|---|
| Latency (~500 ms avg) | Medium | Real-time workloads |
| Wasted requests | Low | Network/cost |
| Database load | Medium | MongoDB at scale |
| Scalability ceiling | High | Large deployments |
## Part 3: Should We Move to Streaming?

### Decision Matrix

| Factor | Polling | Streaming | Winner |
|---|---|---|---|
| Latency | ~500 ms | ~10 ms | Streaming |
| Idle resource usage | High | Near zero | Streaming |
| Implementation complexity | Simple | Moderate | Polling |
| Proxy/firewall compatibility | Excellent | Good (SSE) / Moderate (WS) | Polling |
| Failure recovery | Stateless, trivial | Requires reconnect logic | Polling |
| Scalability | O(n) requests | O(1) connections | Streaming |
### When Polling is Sufficient

- Small deployments (< 50 runtimes)
- Batch/async workloads where 500 ms latency is acceptable
- Environments with restrictive proxies
- Teams preferring operational simplicity

### When Streaming is Needed

- Large deployments (100+ runtimes)
- Real-time or low-latency requirements
- High state throughput (1000+ states/minute)
- Cost-sensitive environments (reduce unnecessary requests)
### Recommendation

**Yes, streaming is worth pursuing**, for the following reasons:

1. **Scalability**: Polling fundamentally doesn't scale; streaming does.
2. **Latency**: A ~50x improvement (500 ms to 10 ms) enables new use cases.
3. **Efficiency**: Eliminates wasted requests in idle scenarios.
4. **Industry standard**: Most distributed task systems use push-based models.

However, polling should remain available as a fallback for compatibility.
## Part 4: Implementation Options

Three viable approaches exist. The choice depends on current needs vs. future extensibility.

### Option A: Server-Sent Events (SSE) - Unidirectional

**How it works**: The server pushes states to workers over HTTP. Workers respond via the existing REST endpoints.
```mermaid
flowchart LR
    SM[State Manager] -->|SSE: push states| RT[Runtime]
    RT -->|HTTP POST: /executed| SM
    RT -->|HTTP POST: /errored| SM
```
**Pros:**

- Simpler implementation
- Works through most proxies/load balancers
- Built-in browser reconnection (if the dashboard uses it)
- Lower server resource usage

**Cons:**

- One-way only: responses still require HTTP calls
- Cannot stream execution output back
- Adding bidirectionality later requires new infrastructure
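On the worker side, SSE is just line-delimited text over a long-lived HTTP response. A sketch of the event parsing, assuming states arrive as JSON in `data:` fields; the payload shape and any `/states/stream` endpoint are hypothetical, not Exosphere's actual API:

```python
import json

def iter_sse_states(lines):
    """Yield one decoded state per SSE event. `lines` can be any
    iterable of text lines, e.g. an HTTP client's line iterator
    over a streaming GET."""
    data = []
    for line in lines:
        if line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "" and data:   # a blank line terminates one event
            yield json.loads("\n".join(data))
            data = []

# Usage over a canned stream (no server needed):
raw = ['data: {"state_id": "s1"}', "", 'data: {"state_id": "s2"}', ""]
states = list(iter_sse_states(raw))
# states == [{"state_id": "s1"}, {"state_id": "s2"}]
```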
### Option B: WebSockets - Bidirectional

**How it works**: Full-duplex connection; both state delivery and responses flow over the same connection.
```mermaid
flowchart LR
    SM[State Manager] <-->|WebSocket: states + responses| RT[Runtime]
```
**Pros:**

- Single persistent connection for everything
- Can stream execution results/logs back in real time
- Lower latency for responses (no per-request HTTP overhead)
- Future-proof for additional use cases

**Cons:**

- More complex implementation
- Some proxies require special configuration
- Requires ping/pong keepalive handling
- Higher per-connection memory on the server
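With WebSockets, both the state and its response travel as frames on one connection. A sketch of the per-frame handling on the worker, with an invented message shape (not Exosphere's actual protocol):

```python
import json

def handle_frame(raw, execute):
    """Run one incoming state and build the reply frame to send back
    on the same connection. Message fields are illustrative."""
    msg = json.loads(raw)
    try:
        result = execute(msg["payload"])
        reply = {"type": "executed", "state_id": msg["state_id"],
                 "result": result}
    except Exception as exc:
        reply = {"type": "errored", "state_id": msg["state_id"],
                 "error": str(exc)}
    return json.dumps(reply)

# Usage:
ok = json.loads(handle_frame('{"state_id": "s1", "payload": 2}',
                             lambda p: p * 2))
# ok["type"] == "executed" and ok["result"] == 4
bad = json.loads(handle_frame('{"state_id": "s2", "payload": 2}',
                              lambda p: 1 / 0))
# bad["type"] == "errored"
```

The equivalent of `/executed` and `/errored` becomes a message type rather than a separate HTTP call, which is where the per-response latency win comes from.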
### Option C: Message Queue / Durable Streams

**How it works**: The State Manager publishes states to a message broker; workers consume from the queue. This decouples producers from consumers entirely.
```mermaid
flowchart LR
    SM[State Manager] -->|Publish| MQ[(Message Queue)]
    MQ -->|Consume| RT1[Runtime 1]
    MQ -->|Consume| RT2[Runtime 2]
    MQ -->|Consume| RT3[Runtime 3]
    RT1 -->|HTTP POST| SM
```
**Technology Options:**

| Technology | Latency | Durability | Complexity | Managed Options |
|---|---|---|---|---|
| Redis Streams | ~1-5 ms | Configurable (AOF/RDB) | Low | AWS ElastiCache, Redis Cloud |
| Apache Kafka | ~5-10 ms | Strong (replicated log) | High | Confluent, AWS MSK |
| NATS JetStream | ~1-2 ms | Strong (replicated) | Medium | Synadia Cloud |
**Pros:**

- **Decoupling**: The State Manager doesn't need to track worker connections
- **Built-in durability**: Messages persist if workers are temporarily down
- **Consumer groups**: Built-in work distribution (no custom claiming logic)
- **Horizontal scaling**: Add workers without any State Manager changes
- **Replay capability**: Messages can be replayed for recovery
- **Battle-tested**: Production-grade queueing semantics

**Cons:**

- **Additional infrastructure**: A new component to deploy and operate
- **Cost**: Managed services have fees; self-hosting requires expertise
- **Latency variance**: The network hop to the broker adds some latency
- **Overkill for small deployments**: Adds complexity for simple setups
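The "consumer groups" point is the key one: the broker replaces the atomic `find_one_and_update` claim. A toy model of the semantics (deliberately simplified; real brokers like Redis Streams or NATS JetStream add durability and redelivery of unacked entries):

```python
from collections import deque

class MiniStream:
    """Toy model of consumer-group semantics (Redis's XADD /
    XREADGROUP / XACK): each entry is delivered to exactly one
    consumer in the group and stays 'pending' until acknowledged."""
    def __init__(self):
        self.entries = deque()
        self.pending = {}                 # entry_id -> consumer holding it

    def add(self, entry_id, payload):     # ~ XADD
        self.entries.append((entry_id, payload))

    def read_group(self, consumer):       # ~ XREADGROUP
        if not self.entries:
            return None
        entry_id, payload = self.entries.popleft()
        self.pending[entry_id] = consumer
        return entry_id, payload

    def ack(self, entry_id):              # ~ XACK
        self.pending.pop(entry_id, None)

# Usage: two runtimes share one stream with no custom claiming logic.
s = MiniStream()
s.add("1-0", "state-a"); s.add("2-0", "state-b")
a = s.read_group("rt1")   # ("1-0", "state-a")
b = s.read_group("rt2")   # ("2-0", "state-b")
s.ack(a[0])
# only "2-0" remains pending until rt2 acks it
```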
### Message Queue: Technology Recommendations

- **If starting fresh**: NATS JetStream. Lowest latency, simple to operate, lightweight. Good for most Exosphere deployments.
- **If already using Redis**: Redis Streams. No new infrastructure, familiar tooling. Good for teams with Redis expertise.
- **If enterprise/high-throughput**: Apache Kafka. The industry standard for high-volume streaming. Good for 10K+ states/second or strict ordering requirements.
## Next Steps

1. **Decide: should we move to streaming?**
   - Is ~500 ms latency acceptable for our use cases?
   - Are we hitting scalability limits with polling?
   - Do users need real-time responsiveness?
2. **If yes, gather requirements:**
   - Do we need bidirectional streaming (logs, progress) in the next 6-12 months?
   - Do we expect 100+ runtimes at scale?
   - Is durability/replay a requirement?
   - Do we already have Redis/Kafka/NATS infrastructure?
3. **Based on requirements, shortlist options:**
   - Small scale, no durability needed → SSE or WebSocket
   - Large scale or durability needed → Message Queue
4. Validate infrastructure prerequisites for the shortlisted options.
5. Determine the simplest path forward based on constraints and team expertise.
## References

**Codebase:**

- Current polling code: `python-sdk/exospherehost/runtime.py` (`_enqueue` method)
- State query logic: `state-manager/app/controller/enqueue_states.py`

**Option A - SSE:**

**Option B - WebSocket:**

**Option C - Message Queues:**

**Common (MongoDB Change Streams):**