Infrastructure + scenario runner showcasing happy paths and many failure modes of working with queues in distributed systems
`scenarios/` contains self-contained setups to exercise typical queue behaviors. Each scenario lives in its own file under `scenarios/scenarios/{short,long}` and uses its own isolated SQS queue + DLQ and DynamoDB tables, all provisioned by the shared `terraform/` configuration. Consumers and producers run as local Python subprocesses.
Scenario naming conventions
- Scenario CLI name is derived from the filename (underscores become hyphens).
- Terraform keys are used as the file stem when possible (e.g., `dup.py` for the duplicate-delivery scenario).
- Scenarios that reuse another scenario's infra prefix their file stem with that terraform key (e.g., `dup-side-effects.py` → CLI `dup-side-effects`).
Fast (finish in seconds):

- Happy path (`happy`, `make scenario-happy`): messages flow through cleanly, all processed and deleted.
- Crash recovery (`crash`, `make scenario-crash`): consumer crashes after receiving; observe redelivery and recovery.
- Duplicate delivery (`dup`, `make scenario-dup`): slow processing triggers the visibility timeout, causing redelivery; DynamoDB-backed idempotency prevents duplicate side effects.
- Business idempotency (`business`, `make scenario-business`): two different message IDs represent the same logical work; consumer dedupes on a business key (e.g., `order_id`).
- Poison messages (`poison`, `make scenario-poison`): messages with a poison marker are rejected and redrive to the DLQ after max retries.
- Partial batch failure (`poison-partial-batch`, `make scenario-poison-partial-batch`): consumer receives batches of 10 messages containing both good and poison messages; good messages are processed and deleted individually while poison messages are retried and eventually land in the DLQ.
- External side effects (`dup-side-effects`, `make scenario-dup-side-effects`): consumer uses the check-before-doing pattern to handle external side effects; proves no duplicate side effects occur even when crashing after the side effect but before updating status.
- Graceful shutdown (`dup-graceful-shutdown`, `make scenario-dup-graceful-shutdown`): consumer receives SIGTERM mid-processing and finishes in-flight work before exiting cleanly.
- FIFO ordering (`fifo-order`, `make scenario-fifo-order`): FIFO queue with a single message group verifies in-order delivery using a shared log file.
- Version ordering (`version-order`, `make scenario-version-order`): standard queue + version checks apply only newer versions from out-of-order delivery.
Slow (5–10 minutes):

- Backpressure / auto-scaling (`backpressure`, `make scenario-backpressure`): a continuous producer floods the queue; the runner monitors queue depth and spawns additional consumers until equilibrium.
- Purge timing (`happy-purge-timing`, `make scenario-happy-purge-timing`): diagnostic scenario verifying SQS's 60-second async purge behavior and documenting the danger window for message loss.
Not yet implemented:
- Out-of-order processing: demonstrate that standard SQS makes no ordering guarantees; show version/timestamp reconciliation for correct final state.
- Backlog drain after outage: consumer offline while messages pile up; verify burst recovery, idempotency under load, and stale message handling.
- DLQ redrive: messages fail and land in the DLQ; apply a fix; use `StartMessageMoveTask` to replay; verify successful processing and an empty DLQ.
| Directory | Purpose |
|---|---|
| `consumer/` | Python SQS consumer with chaos/failure knobs |
| `producer/` | Local Python script to enqueue synthetic messages with batching and rate limiting |
| `scripts/` | Build/push helper (`build_and_push.sh`), producer wrapper (`run_producer.sh`), env loader (`set_env.sh`) |
| `terraform/` | Infrastructure definitions with local state |
| `scenarios/` | Scenario runner and modular scenario files |
| `plans/` | Planning/design documents |
- AWS CLI configured with a profile that can create SQS/DynamoDB resources
- Terraform >= 1.5
- Python 3.11+
1. Provision infrastructure (local Terraform state in `terraform/`):

   ```bash
   make infra-up
   ```

   This writes a `.env` at repo root with `QUEUE_URL`, `DLQ_ARN`, `AWS_REGION`, `MESSAGE_STATUS_TABLE`, and `MESSAGE_SIDE_EFFECTS_TABLE` (pointing at the happy-path resources for ad-hoc testing).

2. Run the producer locally to send messages:

   ```bash
   ./scripts/run_producer.sh --n 25 --rate 5
   ```

   The wrapper auto-loads `.env`/Terraform outputs for `QUEUE_URL`/`AWS_REGION`/`AWS_PROFILE`, creates a venv in `producer/.venv`, installs deps, and runs `produce.py`. Or manually:

   ```bash
   cd producer
   python -m venv .venv && source .venv/bin/activate
   pip install -r requirements.txt
   QUEUE_URL=$(terraform -chdir=../terraform output -raw queue_url) python produce.py --n 25 --rate 5 --queue-url "$QUEUE_URL"
   ```

3. Run a consumer locally:

   ```bash
   source .env
   cd consumer
   python -m venv .venv && source .venv/bin/activate
   pip install -r requirements.txt
   python consume.py
   ```

4. Run scenarios (see Scenarios below):

   ```bash
   make scenario-happy
   ```

5. Tear down:

   ```bash
   make infra-down
   ```
```
Producer (local) ──▶ SQS Queue ──▶ Consumer (local) ──▶ DynamoDB (status + completed)
                         │
                         ▼
              Dead-Letter Queue (after 2 failed receives)
```
- Idempotency: DynamoDB tables track message IDs. The `message-status` table records processing state (STARTED/COMPLETED) with attempt counts. The `message-side-effects` table provides a separate completion record for cross-consumer deduplication. The consumer also maintains an in-memory LRU cache for fast local duplicate detection.
- Long-polling: Consumer uses a configurable wait time (default 10s) to reduce API calls.
- Batching: Producer batches up to 10 messages per SQS API call.
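The batching behavior can be sketched as a chunking helper that groups payloads into `SendMessageBatch` entries of at most 10 (the SQS per-call cap). This is a minimal sketch; the helper name `make_batches` is hypothetical and the real `produce.py` may differ:

```python
import json
import uuid

def make_batches(payloads, batch_size=10):
    """Group payloads into SQS batch entries; SendMessageBatch caps at 10."""
    batch = []
    for payload in payloads:
        batch.append({
            "Id": str(uuid.uuid4()),          # per-call entry id, must be unique within the batch
            "MessageBody": json.dumps(payload),
        })
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                                 # trailing partial batch
        yield batch

# With boto3, each chunk would be sent as:
#   sqs = boto3.client("sqs")
#   for entries in make_batches(payloads):
#       sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```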
- Long-polls SQS, parses JSON payloads, tracks state in DynamoDB, deletes messages after processing.
- Extracts a custom
idfield from the message payload if present, otherwise falls back to the SQSMessageId. - Handles stale receipt handles gracefully (logs a warning instead of crashing).
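The id-fallback behavior can be sketched as follows (hypothetical helper name; the real `consume.py` may implement this differently):

```python
import json

def extract_id(message):
    """Return the dedupe id for an SQS message dict: prefer a custom
    'id' field in the JSON payload, fall back to the SQS MessageId."""
    try:
        payload = json.loads(message.get("Body", ""))
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):   # payload was valid JSON but not an object
        payload = {}
    return payload.get("id") or message["MessageId"]
```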
| Variable | Default | Description |
|---|---|---|
| `QUEUE_URL` | (required) | SQS queue URL to poll |
| `AWS_REGION` | SDK default | Region override |
| `WAIT_TIME_SECONDS` | `10` | Long-poll wait (seconds) |
| `MAX_MESSAGES` | `1` | Messages per receive call (capped at 10) |
| `CRASH_RATE` | `0.0` | Probability (0–1) of a random crash per message |
| `CRASH_AFTER_RECEIVE` | `0` | If non-zero, crash deterministically after receiving but before the side effect |
| `IDEMPOTENCY_CACHE_SIZE` | `1000` | In-memory LRU cache size for duplicate detection (0 = disabled) |
| `MESSAGE_LIMIT` | `0` | Stop after processing N messages (0 = unlimited) |
| `IDLE_TIMEOUT_SECONDS` | `0` | Exit if no messages arrive for this many seconds (0 = disabled) |
| `REJECT_PAYLOAD_MARKER` | (empty) | Reject messages whose payload contains this marker (for poison message testing) |
| `MESSAGE_STATUS_TABLE` | (optional) | DynamoDB table name for status tracking (our bookkeeping) |
| `MESSAGE_SIDE_EFFECTS_TABLE` | (optional) | DynamoDB table name for side effects (simulates an external service) |
| `SIDE_EFFECT_DELAY_SECONDS` | `0.0` | Delay before the side effect to simulate a slow external service (for visibility timeout testing) |
| `CRASH_AFTER_SIDE_EFFECT` | `0` | Crash after the side effect but before marking complete (for testing idempotency) |
| `SIDE_EFFECT_LOG_PATH` | (empty) | If set, append a payload field as a line to this file (used for the FIFO ordering scenario) |
| `SIDE_EFFECT_LOG_FIELD` | `sequence` | Payload field to append when `SIDE_EFFECT_LOG_PATH` is set |
| `VERSIONING_ENABLED` | `0` | Enable version-based reconciliation (used for the version ordering scenario) |
| `VERSION_STATE_PATH` | (empty) | Path to the version state JSON file (required when `VERSIONING_ENABLED=1`) |
| `VERSION_LOG_PATH` | (empty) | Path to a log file that records applied versions (required when `VERSIONING_ENABLED=1`) |
| `VERSION_KEY` | `version` | Payload field name for version values |
| `ENTITY_KEY` | `entity_id` | Payload field name for the entity identifier |
| `BUSINESS_IDEMPOTENCY_FIELD` | (empty) | Payload field name to dedupe external side effects (e.g., `order_id`) |
| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARNING, ERROR) |
```bash
./scripts/run_producer.sh --n 100 --rate 20
```

- `--n` — number of messages to send (default 10)
- `--rate` — messages/sec limit (`0` = as fast as possible)
- `--batch-size` — messages per SQS batch call (max 10, default 10)
- `--poison-count` — number of poison messages to inject at random indices (default 0)
- `--poison-every N` — make every Nth message poison (e.g., 3 means indices 2, 5, 8...); overrides `--poison-count`
- `--queue-url` — override the queue URL (defaults to the `QUEUE_URL` env var)
- `--region` — AWS region (defaults to the `AWS_REGION` env var)
- `--profile` — AWS profile (defaults to the `AWS_PROFILE` env var)

Each message contains a JSON payload with a UUID `id` and a random `work` value (1–1000). Poison messages additionally carry `"poison": true`.
- Crashy worker: `CRASH_RATE=0.2` — consumer exits intermittently; messages redeliver from the queue
- Crash before side effect: `CRASH_AFTER_RECEIVE=1` — crashes after receiving but before performing the side effect
- Crash after side effect: `CRASH_AFTER_SIDE_EFFECT=1` — crashes after the side effect but before marking complete; tests the check-before-doing pattern
- Burst load: `./scripts/run_producer.sh --n 1000 --rate 0` — observe SQS metrics and processing throughput
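A knob like `CRASH_RATE` can be implemented as a per-message coin flip; this is a hypothetical sketch, not the exact code in `consume.py`:

```python
import random
import sys

def maybe_crash(crash_rate):
    """Exit the process with probability crash_rate per message.
    Simulates an abrupt consumer crash; SQS will redeliver the
    in-flight message after the visibility timeout expires."""
    if random.random() < crash_rate:
        sys.exit(1)
```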
- Terraform writes a `.env` at repo root with `QUEUE_URL`, `DLQ_ARN`, `AWS_REGION`, `MESSAGE_STATUS_TABLE`, and `MESSAGE_SIDE_EFFECTS_TABLE` (using the happy-path scenario resources).
- If you prefer manual setup, copy `.env.example` to `.env` and fill in your account-specific values; scripts/loaders will pick it up.
| Variable | Default | Description |
|---|---|---|
| `aws_region` | `us-west-1` | AWS region for all resources |
| `project_name` | `sqs-demo` | Base name for created resources |
| `queue_visibility_timeout` | `10` | SQS visibility timeout (seconds) |
| `queue_receive_wait` | `10` | SQS long-poll wait time (seconds) |
- Prereqs: infrastructure provisioned (`make infra-up`), AWS CLI + Terraform on PATH.
- Quick start (venv is created automatically):

  ```bash
  make scenarios-fast   # run fast scenarios
  make scenarios-slow   # run slow scenarios
  make scenarios        # run everything (fast then slow)
  ```

- Individual scenarios:

  ```bash
  make scenario-happy
  make scenario-crash
  make scenario-business
  make scenario-fifo-order
  make scenario-version-order
  ```

- Pass extra arguments via `ARGS`:

  ```bash
  make scenario-happy ARGS="--count 10 --batch-size 5"
  ```
```bash
make validate
```

Run `make help` to see all targets. Key ones:
| Target | Purpose |
|---|---|
| `make preflight` | Check local environment for required tools |
| `make infra-up` | Provision infrastructure |
| `make infra-down` | Destroy infrastructure |
| `make venv` | Create/update the project venv |
| `make validate` | Validate scenario infrastructure is healthy |
| `make scenario-*` | Run individual scenarios |
| `make scenarios-fast` | Run fast scenarios |
| `make scenarios-slow` | Run slow scenarios |
| `make scenarios` | Run all scenarios (fast then slow) |
```bash
make infra-down
```

This section documents deeper design considerations that emerged from building and testing these scenarios.
This codebase uses two DynamoDB tables to model how production systems handle external side effects (like calling Stripe, sending emails, etc.) that cannot be transacted with your own database:
| Table | Role | Real-world analogy |
|---|---|---|
| `message_side_effects` | The external side effect — represents the external service's record | Stripe's ledger saying "$50 was transferred" |
| `message_status` | Our bookkeeping — tracks that we confirmed the side effect happened | Our database saying "we verified Stripe did the transfer" |
The check-before-doing pattern:

1. `mark_started()` → record STARTED in the status table
2. CHECK: does `message_id` exist in the side_effects table?
   - If YES → side effect already done, skip to step 5
3. DO SIDE EFFECT: write to the side_effects table (simulates the external call)
4. [crash possible here]
5. Update status to COMPLETED (our bookkeeping)
6. Delete from SQS
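The steps above can be sketched in-memory, with plain dicts standing in for the two DynamoDB tables and a callable standing in for the external call (names here are illustrative, not the repo's actual API):

```python
status_table = {}        # stands in for message_status (our bookkeeping)
side_effects_table = {}  # stands in for message_side_effects (the external record)

def process(message_id, payload, do_side_effect):
    status_table[message_id] = "STARTED"                      # step 1
    if message_id not in side_effects_table:                  # step 2: check first
        side_effects_table[message_id] = do_side_effect(payload)  # step 3
        # step 4: [crash possible here — side effect done, status not yet COMPLETED]
    status_table[message_id] = "COMPLETED"                    # step 5
    # step 6 would delete the message from SQS

charges = []  # records how many times the "external" call actually ran
process("m-1", {"amount": 50}, lambda p: charges.append(p) or "charged")
process("m-1", {"amount": 50}, lambda p: charges.append(p) or "charged")  # redelivery
# the second delivery finds the side effect already recorded and skips it
```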
Why this matters:
You can't transact with external systems. When you call Stripe, you can't wrap that HTTP call in a database transaction. So if you crash after Stripe charges the card but before you update your records, you need a way to know "did the charge already happen?" on retry.
The message_side_effects table simulates this — it's the "external system's record" of what happened. On retry, you check it first to avoid duplicate side effects.
If the consumer crashes between steps 3 and 5:
- Side effect is recorded (Stripe charged the card)
- But our status table doesn't know yet
- Message becomes visible again in SQS
- New consumer picks it up
- Step 2 check sees side effect exists → skips the external call
- Updates status to COMPLETED
- Deletes from SQS
No duplicate charge. The pattern works because we check before doing.
When your consumer needs to write to multiple systems (e.g., charge via Stripe, send an email, update a database), you can't wrap everything in a single transaction.
Common patterns:
| Pattern | How it works | Trade-offs |
|---|---|---|
| Outbox | Write business data + outbox record in one DB transaction. A separate process reads the outbox and calls external systems, marking entries as processed. | Adds complexity (outbox reader), but guarantees the intent is durably recorded before external calls. |
| Saga | Break the operation into steps with compensating actions. If step 3 fails, run compensation for steps 1–2 (e.g., refund the charge). | Works for multi-system workflows, but compensation logic can be complex and some actions aren't reversible. |
| Idempotency keys to external systems | Pass a stable key (e.g., `Idempotency-Key` header to Stripe). On retry, the external system returns the cached result instead of re-executing. | Relies on external systems supporting idempotency. Most payment APIs do; arbitrary HTTP endpoints may not. |
The outbox pattern is particularly common: you never call an external system unless the intent is already durably recorded, and you never lose track of what succeeded or failed.
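A minimal outbox sketch, using SQLite as a stand-in for the application database (table names and payloads are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, processed INTEGER DEFAULT 0)"
)

# 1. Business write + outbox record land in ONE transaction, so the
#    intent is durably recorded before any external call is attempted.
with conn:
    conn.execute("INSERT INTO orders VALUES ('order-123')")
    conn.execute("INSERT INTO outbox (payload) VALUES ('charge order-123')")

# 2. A separate reader drains unprocessed outbox rows, calls the
#    external system, and marks each row processed.
sent = []
rows = conn.execute("SELECT id, payload FROM outbox WHERE processed = 0").fetchall()
for row_id, payload in rows:
    sent.append(payload)  # the real external call (Stripe, email, ...) goes here
    conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
conn.commit()
```

If the reader crashes after the external call but before marking the row processed, the row is re-read on restart, so the external call itself still needs to be idempotent.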
There are two distinct duplication problems:
| Type | Cause | Solution |
|---|---|---|
| Delivery idempotency | SQS redelivers the same message (visibility timeout, crash before delete) | Dedupe on `message_id` — the ID the producer assigned to the message |
| Business idempotency | Producer sends the same logical work twice with different message IDs (user double-clicks, retry logic, bugs) | Dedupe on a business key like `order_id` or `payment_intent_id` |
This demo implements delivery idempotency: the `message_side_effects` table is keyed on `message_id`, which the producer generates as a UUID. If the same message is redelivered, the check-before-doing pattern skips the side effect.

But if the producer sends two messages with different UUIDs that both say "charge order-123", the consumer will process both — because they have different `message_id` values. To catch this, you need the producer to include a business-level idempotency key in the payload:
```json
{
  "id": "uuid-1",
  "order_id": "order-123",
  "action": "charge"
}
```

Then the consumer dedupes on `order_id` + `action`, not just `message_id`.
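Business-key dedupe can be sketched like this, with a set standing in for a DynamoDB conditional write (names are illustrative):

```python
seen_business_keys = set()  # stands in for a DynamoDB conditional put

def handle(payload, charge):
    """Dedupe on the business key (order_id, action), not the message id."""
    key = (payload["order_id"], payload["action"])
    if key in seen_business_keys:
        return "skipped"
    charge(payload)               # the external side effect
    seen_business_keys.add(key)
    return "charged"

charges = []
first = handle({"id": "uuid-1", "order_id": "order-123", "action": "charge"}, charges.append)
second = handle({"id": "uuid-2", "order_id": "order-123", "action": "charge"}, charges.append)
# different message ids, same logical work: the second call is skipped
```

In production the check-and-record would be a single atomic conditional write, since two consumers could otherwise pass the check concurrently.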
-
Stripe requires callers to provide an
Idempotency-Keyheader. If you don't, duplicate charges are your responsibility. -
Payment systems typically use
payment_intent_idororder_idas the idempotency key, not the message/event ID. The message is just a delivery mechanism. -
Event-driven architectures often treat events as "facts that happened" and push idempotency responsibility to consumers. If you receive
OrderCreatedtwice with different event IDs but the sameorder_id, you're expected to dedupe. -
API gateways sometimes generate client-side idempotency keys and reject duplicates at the edge before they hit the queue.
The key insight: message-level idempotency protects against infrastructure problems (redelivery), but business-level idempotency protects against application problems (duplicate requests). Most production systems need both.