Step Functions — Concept
What it is
AWS Step Functions = serverless workflow orchestrator. You define a state machine in JSON (Amazon States Language); Step Functions runs it: tasks, choices, parallel branches, retries, error handling, waits.
Why it exists
Stringing Lambdas with SNS/SQS quickly becomes spaghetti — no visibility, no built-in retries, no easy human approval steps. Step Functions gives you visual workflows, state, error handling, long-running waits, and integration with 200+ AWS services.
Two workflow types
| | Standard | Express |
|---|---|---|
| Max duration | 1 year | 5 minutes |
| Execution rate | 2,000 / s (start rate) | 100,000+ / s |
| Pricing | per state transition | per request + duration (cheaper at high volume) |
| Execution semantics | exactly-once | at-least-once (Asynchronous) or at-most-once (Synchronous) |
| Use | long, durable, human approval, infrequent | high-volume event-driven |
State types
- Task — invoke service (Lambda, ECS, SNS, SQS, DynamoDB, …).
- Choice — branch based on input.
- Wait — pause until X seconds / until timestamp.
- Parallel — run branches in parallel.
- Map — iterate over a list.
- Pass / Succeed / Fail — control flow.
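As a rough illustration of how these state types fit together, the sketch below builds a small Amazon States Language definition as a Python dict and serializes it to JSON. The Lambda ARN, the state names, and the `dangling_targets` helper are all hypothetical, chosen only for this example; the helper is a local sanity check, not part of any AWS SDK.

```python
import json

# Hypothetical Lambda ARN, used only for illustration.
CHECK_STOCK_ARN = "arn:aws:lambda:us-east-1:123456789012:function:check-stock"

definition = {
    "Comment": "Illustrative workflow: Task -> Choice -> Wait -> Succeed",
    "StartAt": "CheckStock",
    "States": {
        # Task: invoke a service (here, a Lambda function).
        "CheckStock": {
            "Type": "Task",
            "Resource": CHECK_STOCK_ARN,
            "Next": "InStock?",
        },
        # Choice: branch on a field of the state input.
        "InStock?": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.inStock", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "WaitForRestock",
        },
        # Wait: pause for a fixed number of seconds, then check again.
        "WaitForRestock": {"Type": "Wait", "Seconds": 3600, "Next": "CheckStock"},
        # Succeed: terminal state.
        "Done": {"Type": "Succeed"},
    },
}


def dangling_targets(defn):
    """Return transition targets that do not name a state in defn["States"]."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        if "Default" in state:
            targets.add(state["Default"])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(t for t in targets if t not in states)


# JSON text that could be supplied as a state machine definition.
asl_json = json.dumps(definition, indent=2)
```

Calling `dangling_targets(definition)` before deploying is a cheap way to catch a misspelled `Next` target; the service performs its own validation when the state machine is created.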
Integrations
- Direct integrations with AWS services: no Lambda glue needed (e.g., `arn:aws:states:::dynamodb:putItem`).
- `.sync` suffix = wait for the service job to complete (Glue, ECS task, EMR step, Batch job).
- `.waitForTaskToken` = pause until a callback (great for human approval).
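The three integration patterns differ only in the suffix of the Task `Resource`. A minimal sketch follows, assuming hypothetical table, job, and queue names; the `integration_pattern` helper is illustrative and not an AWS API.

```python
# Request-response: call the service and move on as soon as it replies.
put_item = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "Orders",  # hypothetical table name
        "Item": {"orderId": {"S.$": "$.orderId"}},
    },
    "Next": "RunEtl",
}

# Run a job (.sync): wait until the Glue job run finishes.
run_etl = {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Parameters": {"JobName": "nightly-etl"},  # hypothetical job name
    "Next": "AskApproval",
}

# Callback (.waitForTaskToken): pause until something returns the token.
ask_approval = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        # Hypothetical queue URL.
        "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/approvals",
        "MessageBody": {
            "taskToken.$": "$$.Task.Token",  # token taken from the context object
            "orderId.$": "$.orderId",
        },
    },
    "End": True,
}


def integration_pattern(resource):
    """Classify a Task resource ARN by its integration-pattern suffix."""
    if resource.endswith(".waitForTaskToken"):
        return "callback"
    if resource.endswith(".sync") or resource.endswith(".sync:2"):
        return "run-a-job"
    return "request-response"
```

In the callback case, the execution resumes when the approver's system calls the `SendTaskSuccess` (or `SendTaskFailure`) API with the token it received in the message.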
Error handling
- Per-task `Retry` with intervals / max attempts / exponential backoff.
- `Catch` blocks for specific errors.
- Failed executions visible in console with full history.
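To make the retry fields concrete, here is a hedged sketch of a Task with one `Retry` rule and one `Catch` rule, plus a small helper that computes the nominal wait before each retry. The Lambda ARN, the state names, and `retry_delays` are assumptions for illustration; the helper ignores the optional `MaxDelaySeconds` and `JitterStrategy` fields.

```python
charge_card = {
    "Type": "Task",
    # Hypothetical Lambda ARN.
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
            "IntervalSeconds": 2,  # wait before the first retry
            "MaxAttempts": 3,      # number of retries, not total attempts
            "BackoffRate": 2.0,    # multiplier applied after each retry
        }
    ],
    "Catch": [
        {
            # Anything not handled by Retry is routed to a fallback state.
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.error",
            "Next": "NotifyFailure",  # hypothetical state name
        }
    ],
    "Next": "ShipOrder",  # hypothetical state name
}


def retry_delays(retrier):
    """Nominal seconds waited before each retry, using the ASL default values."""
    interval = retrier.get("IntervalSeconds", 1)
    attempts = retrier.get("MaxAttempts", 3)
    rate = retrier.get("BackoffRate", 2.0)
    return [interval * rate ** n for n in range(attempts)]


delays = retry_delays(charge_card["Retry"][0])  # [2.0, 4.0, 8.0]
```

With these values the task is retried after roughly 2, 4, and 8 seconds; if the third retry also fails, the `Catch` rule sends the execution to the fallback state with the error details stored under `$.error`.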
When to use vs alternatives
| Use ... | Instead of ... | When ... |
|---|---|---|
| Step Functions Standard | chained Lambdas | Long workflow (hours/days), visibility, retries, human approval |
| Step Functions Express | chained Lambdas | High-volume short workflows, event processing |
| EventBridge | Step Functions | Simple "event → fanout" routing, no orchestration |
| SQS | Step Functions | Decoupling and buffering only |
| AWS Batch | Step Functions | Long-running compute jobs (vs orchestration) |
Common exam scenarios
- "Multi-step order workflow with approval after 24 h" → Standard workflow with `.waitForTaskToken`.
- "Image processing pipeline: thumbnail → label → DB update with retries" → Express (or Standard).
- "Run hundreds of parallel data tasks with map state" → Step Functions Map (distributed).
- "Coordinate Glue job → wait for it → run Lambda" → Task `.sync` on Glue job.
- "Need visual diagram of the workflow + error history" → Step Functions console.
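For the parallel data tasks scenario, a Map state in distributed mode might look roughly like the sketch below. The worker ARN, the `MaxConcurrency` value, and the `map_mode` helper are assumptions for illustration only.

```python
# Hypothetical worker Lambda; the name is illustrative only.
WORKER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-item"

process_all = {
    "Type": "Map",
    "ItemsPath": "$.items",  # array in the state input to iterate over
    "MaxConcurrency": 200,   # illustrative cap on concurrent iterations
    "ItemProcessor": {
        # DISTRIBUTED runs each iteration as a child workflow execution.
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
        "StartAt": "ProcessItem",
        "States": {
            "ProcessItem": {"Type": "Task", "Resource": WORKER_ARN, "End": True}
        },
    },
    "End": True,
}


def map_mode(state):
    """Return the processing mode of a Map state; INLINE when none is set."""
    config = state.get("ItemProcessor", {}).get("ProcessorConfig", {})
    return config.get("Mode", "INLINE")
```

Leaving out `ProcessorConfig` gives an inline Map, which runs iterations inside the parent execution and suits small lists; distributed mode is the one aimed at large fan-out.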
Exam tip
"Orchestrate" / "workflow" / "human approval" / "retry & catch" → Step Functions. For "event routing" → EventBridge. For "simple decoupling" → SQS.