☁︎SAA-C03

Redshift

Redshift — Concept

What it is

Amazon Redshift = a managed petabyte-scale data warehouse (OLAP). Columnar storage, massively parallel processing (MPP), SQL on TB–PB of historical data.

Why it exists

RDS/Aurora are OLTP (transactional, row-based) and can't efficiently scan billions of rows. Redshift is OLAP: columnar storage, compression, parallel scans — for BI dashboards, reporting, analytics over historical data.

Architecture

  • Cluster = 1 leader node + many compute nodes.
  • Leader parses SQL, plans, aggregates results.
  • Compute nodes hold partitions ("slices") and run in parallel.
  • Node types: RA3 (managed storage, separate compute/storage), DC2 (legacy SSD).

Redshift Serverless

  • No cluster sizing — scale automatically.
  • Pay per RPU-hour.
  • Best for variable / unpredictable analytics workloads.

Spectrum

  • Query data directly in S3 without loading.
  • Schema in Glue Data Catalog.
  • Good for data lake federation; cheap because compute is separate from S3 storage.

Concurrency

  • Concurrency Scaling = adds extra clusters on demand for read peaks; you get free credits.
  • Workload Management (WLM) = queues with priorities; auto-WLM available.

Loading & integrations

  • COPY from S3 (best practice, parallel).
  • Federated Query: Postgres / RDS / Aurora live tables.
  • Redshift Streaming Ingestion from Kinesis / MSK.
  • AWS DMS for migration from other DBs.

Backups

  • Automated snapshots to S3 (1-day default, configurable up to 35).
  • Cross-region copy for DR.

Security

  • VPC, KMS encryption, IAM auth, audit logs to S3.
  • Enhanced VPC routing for COPY/UNLOAD over VPC.

When to use vs alternatives

Use ...Instead of ...When ...
RedshiftAuroraTB–PB analytical reporting, complex aggregations
AthenaRedshiftAd-hoc queries on S3, low/no setup, pay per query
EMR / SparkRedshiftHeavy custom transformations, ML pipelines
OpenSearchRedshiftLog search and full-text
DynamoDBRedshiftOLTP, not analytics
Redshift ServerlessProvisionedVariable / unknown load, less ops
Redshift SpectrumCOPY into RedshiftData lake; want to query S3 directly

Common exam scenarios

  1. "BI dashboards over 50 TB of historical data"Redshift.
  2. "Ad-hoc SQL on data already in S3 / no infra"Athena.
  3. "Query S3 + warehouse together"Redshift Spectrum.
  4. "Streaming ingestion of clickstream into warehouse"Kinesis → Redshift Streaming Ingestion.
  5. "Variable analytics workload, no ops"Redshift Serverless.
  6. "Federated SQL over Aurora + Redshift without ETL"Federated Query.

Exam tip

Athena = serverless SQL over S3, ad-hoc, cheap, no infra. Redshift = persistent warehouse, faster on big joins/aggregations at scale, you load data in. EMR = full Hadoop/Spark stack for custom big-data pipelines.

References