Infrastructure + scenario runner showcasing happy paths and many failure modes of working with queues in distributed systems
`scenarios/` contains self-contained setups to exercise typical queue behaviors. Each scenario lives in its own file under `scenarios/scenarios/{short,long}` and uses its own isolated SQS queue + DLQ and DynamoDB tables, all provisioned by the shared `terraform/` configuration. Consumers and producers run as local Python subprocesses.
Scenario naming conventions
- Scenario CLI name is derived from the filename (underscores become hyphens).
- Terraform keys are used as the file stem when possible (e.g., `dup.py` for the duplicate-delivery scenario).
- Scenarios that reuse another scenario's infra prefix their file stem with that terraform key (e.g., `dup-side-effects.py` → CLI `dup-side-effects`).
Fast (finish in seconds):

- Happy path (`happy`, `make scenario-happy`): messages flow through cleanly, all processed and deleted.
- Crash recovery (`crash`, `make scenario-crash`): consumer crashes after receiving; observe redelivery and recovery.
- Duplicate delivery (`dup`, `make scenario-dup`): slow processing triggers the visibility timeout, causing redelivery; DynamoDB-backed idempotency prevents duplicate side effects.
- Business idempotency (`business`, `make scenario-business`): two different message IDs represent the same logical work; consumer dedupes on a business key (e.g., `order_id`).
- Poison messages (`poison`, `make scenario-poison`): messages with a poison marker are rejected and redrive to the DLQ after max retries.
- Partial batch failure (`poison-partial-batch`, `make scenario-poison-partial-batch`): consumer receives batches of 10 messages containing both good and poison messages; good messages are processed and deleted individually while poison messages are retried and eventually land in the DLQ.
- External side effects (`dup-side-effects`, `make scenario-dup-side-effects`): consumer uses the check-before-doing pattern to handle external side effects; proves no duplicate side effects occur even when crashing after the side effect but before updating status.
- Graceful shutdown (`dup-graceful-shutdown`, `make scenario-dup-graceful-shutdown`): consumer receives SIGTERM mid-processing and finishes in-flight work before exiting cleanly.
- FIFO ordering (`fifo-order`, `make scenario-fifo-order`): FIFO queue with a single message group verifies in-order delivery using a shared log file.
- Version ordering (`version-order`, `make scenario-version-order`): standard queue + version checks apply only newer versions from out-of-order delivery.
Slow (5–10 minutes):

- Backpressure / auto-scaling (`backpressure`, `make scenario-backpressure`): a continuous producer floods the queue; the runner monitors queue depth and spawns additional consumers until equilibrium.
- Purge timing (`happy-purge-timing`, `make scenario-happy-purge-timing`): diagnostic scenario verifying SQS's 60-second async purge behavior and documenting the danger window for message loss.
Not yet implemented:
- Out-of-order processing: demonstrate that standard SQS makes no ordering guarantees; show version/timestamp reconciliation for correct final state.
- Backlog drain after outage: consumer offline while messages pile up; verify burst recovery, idempotency under load, and stale message handling.
- DLQ redrive: messages fail and land in the DLQ; apply a fix; use `StartMessageMoveTask` to replay; verify successful processing and an empty DLQ.
| Directory | Purpose |
|---|---|
| `consumer/` | Python SQS consumer with chaos/failure knobs |
| `producer/` | Local Python script to enqueue synthetic messages with batching and rate limiting |
| `scripts/` | Build/push helper (`build_and_push.sh`), producer wrapper (`run_producer.sh`), env loader (`set_env.sh`) |
| `terraform/` | Infrastructure definitions with local state |
| `scenarios/` | Scenario runner and modular scenario files |
| `plans/` | Planning/design documents |
- AWS CLI configured with a profile that can create SQS/DynamoDB resources
- Terraform >= 1.5
- Python 3.11+
1. Provision infrastructure (local Terraform state in `terraform/`):

   ```bash
   make infra-up
   ```

   This writes a `.env` at repo root with `QUEUE_URL`, `DLQ_ARN`, `AWS_REGION`, `MESSAGE_STATUS_TABLE`, and `MESSAGE_SIDE_EFFECTS_TABLE` (pointing at the happy-path resources for ad-hoc testing).

2. Run the producer locally to send messages:

   ```bash
   ./scripts/run_producer.sh --n 25 --rate 5
   ```

   The wrapper auto-loads `.env`/Terraform outputs for `QUEUE_URL`/`AWS_REGION`/`AWS_PROFILE`, creates a venv in `producer/.venv`, installs deps, and runs `produce.py`. Or manually:

   ```bash
   cd producer
   python -m venv .venv && source .venv/bin/activate
   pip install -r requirements.txt
   QUEUE_URL=$(terraform -chdir=../terraform output -raw queue_url) python produce.py --n 25 --rate 5 --queue-url "$QUEUE_URL"
   ```

3. Run a consumer locally:

   ```bash
   source .env
   cd consumer
   python -m venv .venv && source .venv/bin/activate
   pip install -r requirements.txt
   python consume.py
   ```

4. Run scenarios (see Scenarios below):

   ```bash
   make scenario-happy
   ```

5. Tear down:

   ```bash
   make infra-down
   ```
```
Producer (local) ──▶ SQS Queue ──▶ Consumer (local) ──▶ DynamoDB (status + completed)
                         │
                         ▼
              Dead-Letter Queue (after 2 failed receives)
```
- Idempotency: DynamoDB tables track message IDs. The `message-status` table records processing state (STARTED/COMPLETED) with attempt counts. The `message-side-effects` table provides a separate completion record for cross-consumer deduplication. The consumer also maintains an in-memory LRU cache for fast local duplicate detection.
- Long-polling: Consumer uses a configurable wait time (default 10s) to reduce API calls.
- Batching: Producer batches up to 10 messages per SQS API call.
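The batching behavior can be sketched as a chunking helper that groups payloads into `SendMessageBatch` entries of at most 10 (the SQS per-call cap). This is a minimal sketch; the helper name `make_batches` is hypothetical and the real `produce.py` may differ:

```python
import json
import uuid

def make_batches(payloads, batch_size=10):
    """Group payloads into SQS batch entries; SendMessageBatch caps at 10."""
    batch = []
    for payload in payloads:
        batch.append({
            "Id": str(uuid.uuid4()),          # per-call entry id, must be unique within the batch
            "MessageBody": json.dumps(payload),
        })
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                                 # trailing partial batch
        yield batch

# With boto3, each chunk would be sent as:
#   sqs = boto3.client("sqs")
#   for entries in make_batches(payloads):
#       sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
```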
- Long-polls SQS, parses JSON payloads, tracks state in DynamoDB, deletes messages after processing.
- Extracts a custom
idfield from the message payload if present, otherwise falls back to the SQSMessageId. - Handles stale receipt handles gracefully (logs a warning instead of crashing).
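The id-fallback behavior can be sketched as follows (hypothetical helper name; the real `consume.py` may implement this differently):

```python
import json

def extract_id(message):
    """Return the dedupe id for an SQS message dict: prefer a custom
    'id' field in the JSON payload, fall back to the SQS MessageId."""
    try:
        payload = json.loads(message.get("Body", ""))
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):   # payload was valid JSON but not an object
        payload = {}
    return payload.get("id") or message["MessageId"]
```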
| Variable | Default | Description |
|---|---|---|
| `QUEUE_URL` | (required) | SQS queue URL to poll |
| `AWS_REGION` | SDK default | Region override |
| `WAIT_TIME_SECONDS` | `10` | Long-poll wait (seconds) |
| `MAX_MESSAGES` | `1` | Messages per receive call (capped at 10) |
| `CRASH_RATE` | `0.0` | Probability (0–1) of a random crash per message |
| `CRASH_AFTER_RECEIVE` | `0` | If non-zero, crash deterministically after receiving but before the side effect |
| `IDEMPOTENCY_CACHE_SIZE` | `1000` | In-memory LRU cache size for duplicate detection (0 = disabled) |
| `MESSAGE_LIMIT` | `0` | Stop after processing N messages (0 = unlimited) |
| `IDLE_TIMEOUT_SECONDS` | `0` | Exit if no messages arrive for this many seconds (0 = disabled) |
| `REJECT_PAYLOAD_MARKER` | (empty) | Reject messages whose payload contains this marker (for poison message testing) |
| `MESSAGE_STATUS_TABLE` | (optional) | DynamoDB table name for status tracking (our bookkeeping) |
| `MESSAGE_SIDE_EFFECTS_TABLE` | (optional) | DynamoDB table name for side effects (simulates an external service) |
| `SIDE_EFFECT_DELAY_SECONDS` | `0.0` | Delay before the side effect to simulate a slow external service (for visibility timeout testing) |
| `CRASH_AFTER_SIDE_EFFECT` | `0` | Crash after the side effect but before marking complete (for testing idempotency) |
| `SIDE_EFFECT_LOG_PATH` | (empty) | If set, append a payload field as a line to this file (used for the FIFO ordering scenario) |
| `SIDE_EFFECT_LOG_FIELD` | `sequence` | Payload field to append when `SIDE_EFFECT_LOG_PATH` is set |
| `VERSIONING_ENABLED` | `0` | Enable version-based reconciliation (used for the version ordering scenario) |
| `VERSION_STATE_PATH` | (empty) | Path to the version state JSON file (required when `VERSIONING_ENABLED=1`) |
| `VERSION_LOG_PATH` | (empty) | Path to a log file that records applied versions (required when `VERSIONING_ENABLED=1`) |
| `VERSION_KEY` | `version` | Payload field name for version values |
| `ENTITY_KEY` | `entity_id` | Payload field name for the entity identifier |
| `BUSINESS_IDEMPOTENCY_FIELD` | (empty) | Payload field name to dedupe external side effects (e.g., `order_id`) |
| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARNING, ERROR) |
```bash
./scripts/run_producer.sh --n 100 --rate 20
```

- `--n` — number of messages to send (default 10)
- `--rate` — messages/sec limit (`0` = as fast as possible)
- `--batch-size` — messages per SQS batch call (max 10, default 10)
- `--poison-count` — number of poison messages to inject at random indices (default 0)
- `--poison-every N` — make every Nth message poison (e.g., 3 means indices 2, 5, 8...); overrides `--poison-count`
- `--queue-url` — override the queue URL (defaults to the `QUEUE_URL` env var)
- `--region` — AWS region (defaults to the `AWS_REGION` env var)
- `--profile` — AWS profile (defaults to the `AWS_PROFILE` env var)

Each message contains a JSON payload with a UUID `id` and a random `work` value (1–1000). Poison messages additionally carry `"poison": true`.
- Crashy worker: `CRASH_RATE=0.2` — consumer exits intermittently; messages redeliver from the queue
- Crash before side effect: `CRASH_AFTER_RECEIVE=1` — crashes after receiving but before performing the side effect
- Crash after side effect: `CRASH_AFTER_SIDE_EFFECT=1` — crashes after the side effect but before marking complete; tests the check-before-doing pattern
- Burst load: `./scripts/run_producer.sh --n 1000 --rate 0` — observe SQS metrics and processing throughput
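A knob like `CRASH_RATE` can be implemented as a per-message coin flip; this is a hypothetical sketch, not the exact code in `consume.py`:

```python
import random
import sys

def maybe_crash(crash_rate):
    """Exit the process with probability crash_rate per message.
    Simulates an abrupt consumer crash; SQS will redeliver the
    in-flight message after the visibility timeout expires."""
    if random.random() < crash_rate:
        sys.exit(1)
```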
- Terraform writes a `.env` at repo root with `QUEUE_URL`, `DLQ_ARN`, `AWS_REGION`, `MESSAGE_STATUS_TABLE`, and `MESSAGE_SIDE_EFFECTS_TABLE` (using the happy-path scenario resources).
- If you prefer manual setup, copy `.env.example` to `.env` and fill in your account-specific values; scripts/loaders will pick it up.
| Variable | Default | Description |
|---|---|---|
| `aws_region` | `us-west-1` | AWS region for all resources |
| `project_name` | `sqs-demo` | Base name for created resources |
| `queue_visibility_timeout` | `10` | SQS visibility timeout (seconds) |
| `queue_receive_wait` | `10` | SQS long-poll wait time (seconds) |
- Prereqs: infrastructure provisioned (`make infra-up`), AWS CLI + Terraform on PATH.
- Quick start (venv is created automatically):

  ```bash
  make scenarios-fast   # run fast scenarios
  make scenarios-slow   # run slow scenarios
  make scenarios        # run everything (fast then slow)
  ```

- Individual scenarios:

  ```bash
  make scenario-happy
  make scenario-crash
  make scenario-business
  make scenario-fifo-order
  make scenario-version-order
  ```

- Pass extra arguments via `ARGS`:

  ```bash
  make scenario-happy ARGS="--count 10 --batch-size 5"
  ```
```bash
make validate
```

Run `make help` to see all targets. Key ones:
| Target | Purpose |
|---|---|
| `make preflight` | Check local environment for required tools |
| `make infra-up` | Provision infrastructure |
| `make infra-down` | Destroy infrastructure |
| `make venv` | Create/update the project venv |
| `make validate` | Validate scenario infrastructure is healthy |
| `make scenario-*` | Run individual scenarios |
| `make scenarios-fast` | Run fast scenarios |
| `make scenarios-slow` | Run slow scenarios |
| `make scenarios` | Run all scenarios (fast then slow) |
```bash
make infra-down
```

This section documents deeper design considerations that emerged from building and testing these scenarios.
This codebase uses two DynamoDB tables to model how production systems handle external side effects (like calling Stripe, sending emails, etc.) that cannot be transacted with your own database:
| Table | Role | Real-world analogy |
|---|---|---|
| `message_side_effects` | The external side effect — represents the external service's record | Stripe's ledger saying "$50 was transferred" |
| `message_status` | Our bookkeeping — tracks that we confirmed the side effect happened | Our database saying "we verified Stripe did the transfer" |
The check-before-doing pattern:

1. `mark_started()` → record STARTED in the status table
2. CHECK: does `message_id` exist in the side_effects table?
   - If YES → side effect already done, skip to step 5
3. DO SIDE EFFECT: write to the side_effects table (simulates the external call)
4. [crash possible here]
5. Update status to COMPLETED (our bookkeeping)
6. Delete from SQS
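The steps above can be sketched in-memory, with plain dicts standing in for the two DynamoDB tables and a callable standing in for the external call (names here are illustrative, not the repo's actual API):

```python
status_table = {}        # stands in for message_status (our bookkeeping)
side_effects_table = {}  # stands in for message_side_effects (the external record)

def process(message_id, payload, do_side_effect):
    status_table[message_id] = "STARTED"                      # step 1
    if message_id not in side_effects_table:                  # step 2: check first
        side_effects_table[message_id] = do_side_effect(payload)  # step 3
        # step 4: [crash possible here — side effect done, status not yet COMPLETED]
    status_table[message_id] = "COMPLETED"                    # step 5
    # step 6 would delete the message from SQS

charges = []  # records how many times the "external" call actually ran
process("m-1", {"amount": 50}, lambda p: charges.append(p) or "charged")
process("m-1", {"amount": 50}, lambda p: charges.append(p) or "charged")  # redelivery
# the second delivery finds the side effect already recorded and skips it
```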
Why this matters:
You can't transact with external systems. When you call Stripe, you can't wrap that HTTP call in a database transaction. So if you crash after Stripe charges the card but before you update your records, you need a way to know "did the charge already happen?" on retry.
The message_side_effects table simulates this — it's the "external system's record" of what happened. On retry, you check it first to avoid duplicate side effects.
If the consumer crashes between steps 3 and 5:
- Side effect is recorded (Stripe charged the card)
- But our status table doesn't know yet
- Message becomes visible again in SQS
- New consumer picks it up
- Step 2 check sees side effect exists → skips the external call
- Updates status to COMPLETED
- Deletes from SQS
No duplicate charge. The pattern works because we check before doing.
When your consumer needs to write to multiple systems (e.g., charge via Stripe, send an email, update a database), you can't wrap everything in a single transaction.
Common patterns:
| Pattern | How it works | Trade-offs |
|---|---|---|
| Outbox | Write business data + outbox record in one DB transaction. A separate process reads the outbox and calls external systems, marking entries as processed. | Adds complexity (outbox reader), but guarantees the intent is durably recorded before external calls. |
| Saga | Break the operation into steps with compensating actions. If step 3 fails, run compensation for steps 1–2 (e.g., refund the charge). | Works for multi-system workflows, but compensation logic can be complex and some actions aren't reversible. |
| Idempotency keys to external systems | Pass a stable key (e.g., `Idempotency-Key` header to Stripe). On retry, the external system returns the cached result instead of re-executing. | Relies on external systems supporting idempotency. Most payment APIs do; arbitrary HTTP endpoints may not. |
The outbox pattern is particularly common: you never call an external system unless the intent is already durably recorded, and you never lose track of what succeeded or failed.
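A minimal outbox sketch, using SQLite as a stand-in for the application database (table names and payloads are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, processed INTEGER DEFAULT 0)"
)

# 1. Business write + outbox record land in ONE transaction, so the
#    intent is durably recorded before any external call is attempted.
with conn:
    conn.execute("INSERT INTO orders VALUES ('order-123')")
    conn.execute("INSERT INTO outbox (payload) VALUES ('charge order-123')")

# 2. A separate reader drains unprocessed outbox rows, calls the
#    external system, and marks each row processed.
sent = []
rows = conn.execute("SELECT id, payload FROM outbox WHERE processed = 0").fetchall()
for row_id, payload in rows:
    sent.append(payload)  # the real external call (Stripe, email, ...) goes here
    conn.execute("UPDATE outbox SET processed = 1 WHERE id = ?", (row_id,))
conn.commit()
```

If the reader crashes after the external call but before marking the row processed, the row is re-read on restart, so the external call itself still needs to be idempotent.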
There are two distinct duplication problems:
| Type | Cause | Solution |
|---|---|---|
| Delivery idempotency | SQS redelivers the same message (visibility timeout, crash before delete) | Dedupe on `message_id` — the ID the producer assigned to the message |
| Business idempotency | Producer sends the same logical work twice with different message IDs (user double-clicks, retry logic, bugs) | Dedupe on a business key like `order_id` or `payment_intent_id` |
This demo implements delivery idempotency: the `message_side_effects` table is keyed on `message_id`, which the producer generates as a UUID. If the same message is redelivered, the check-before-doing pattern skips the side effect.

But if the producer sends two messages with different UUIDs that both say "charge order-123", the consumer will process both — because they have different `message_id` values. To catch this, you need the producer to include a business-level idempotency key in the payload:
```json
{
  "id": "uuid-1",
  "order_id": "order-123",
  "action": "charge"
}
```

Then the consumer dedupes on `order_id` + `action`, not just `message_id`.
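Business-key dedupe can be sketched like this, with a set standing in for a DynamoDB conditional write (names are illustrative):

```python
seen_business_keys = set()  # stands in for a DynamoDB conditional put

def handle(payload, charge):
    """Dedupe on the business key (order_id, action), not the message id."""
    key = (payload["order_id"], payload["action"])
    if key in seen_business_keys:
        return "skipped"
    charge(payload)               # the external side effect
    seen_business_keys.add(key)
    return "charged"

charges = []
first = handle({"id": "uuid-1", "order_id": "order-123", "action": "charge"}, charges.append)
second = handle({"id": "uuid-2", "order_id": "order-123", "action": "charge"}, charges.append)
# different message ids, same logical work: the second call is skipped
```

In production the check-and-record would be a single atomic conditional write, since two consumers could otherwise pass the check concurrently.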
-
Stripe requires callers to provide an
Idempotency-Keyheader. If you don't, duplicate charges are your responsibility. -
Payment systems typically use
payment_intent_idororder_idas the idempotency key, not the message/event ID. The message is just a delivery mechanism. -
Event-driven architectures often treat events as "facts that happened" and push idempotency responsibility to consumers. If you receive
OrderCreatedtwice with different event IDs but the sameorder_id, you're expected to dedupe. -
API gateways sometimes generate client-side idempotency keys and reject duplicates at the edge before they hit the queue.
The key insight: message-level idempotency protects against infrastructure problems (redelivery), but business-level idempotency protects against application problems (duplicate requests). Most production systems need both.