# Architecture Overview
This project is an investigation + observability kit for data pipelines: it captures immutable raw events, indexes them for fast querying, measures lag, detects duplicates, and supports replay into a queue for downstream processing.
It is intentionally production-shaped, but not a full ETL/warehouse pipeline.
## Core Goals
- Immutable raw storage: keep the original event payloads as they arrived.
- Idempotent ingestion: stable `event_id`, dedupe gate, safe retries (see the sketch after this list).
- Fast investigation: query by entity/time windows, find gaps/late arrivals, see which sources/types are problematic.
- Replay: push historical events back into a queue for reprocessing.
- Traceability: aggregate outputs store input hashes and sample `event_id`s.
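The idempotency goal rests on two pieces: a stable `event_id` derived from the payload, and a dedupe gate that rejects repeats. Here is a minimal sketch of how they could fit together, assuming a SHA-256 over the canonicalized payload and a DynamoDB conditional write; the `events` table name and `pk`/`sk` layout are illustrative, not the project's actual schema:

```python
import hashlib
import json

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
events_table = dynamodb.Table("events")  # hypothetical table name


def stable_event_id(payload: dict) -> str:
    """Derive a deterministic id from the payload so retries produce the same value."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dedupe_gate(event_id: str) -> bool:
    """Return True if this event_id is new, False if it was already ingested."""
    try:
        events_table.put_item(
            Item={"pk": f"EVENT#{event_id}", "sk": "META"},  # illustrative key layout
            ConditionExpression="attribute_not_exists(pk)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # duplicate: safe to acknowledge without writing again
        raise
```

Because the id is a function of the payload itself, a retried request hashes to the same `event_id` and the conditional write turns the duplicate into a no-op.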
## High-level flow
### 1. Event Ingestion
Producer → Ingest API (`POST /ingest`)
- Validates incoming event payload
- Computes a stable `event_id` (for idempotency)
- Writes to three destinations (sketched below):
    - S3: Raw JSON with immutable object key
    - DynamoDB: Metadata and index records for fast querying
    - CloudWatch: Metrics and structured logs
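A sketch of what the three-destination write could look like with boto3; the bucket name, field names (`entity_id`, `event_time`), key layout, and metric namespace are placeholders rather than the project's real configuration:

```python
import json
import time

import boto3

s3 = boto3.client("s3")
cloudwatch = boto3.client("cloudwatch")
dynamodb = boto3.resource("dynamodb")
events_table = dynamodb.Table("events")  # hypothetical table name

RAW_BUCKET = "raw-events-bucket"  # hypothetical bucket name


def write_event(payload: dict, event_id: str) -> None:
    """Write one validated event to S3 (raw), DynamoDB (index), and CloudWatch (metrics)."""
    # S3: immutable raw copy, keyed by event_id so a retry rewrites identical bytes
    raw_key = f"raw/{payload['entity_id']}/{event_id}.json"
    s3.put_object(Bucket=RAW_BUCKET, Key=raw_key,
                  Body=json.dumps(payload).encode("utf-8"))

    # DynamoDB: index record that supports entity/time-window queries
    events_table.put_item(
        Item={
            "pk": f"ENTITY#{payload['entity_id']}",
            "sk": f"TS#{payload['event_time']}#{event_id}",
            "event_id": event_id,
            "s3_key": raw_key,
            "source": payload.get("source"),
            "received_at": int(time.time()),
        }
    )

    # CloudWatch: one datapoint per ingested event
    cloudwatch.put_metric_data(
        Namespace="Pipeline/Ingest",
        MetricData=[{"MetricName": "EventsIngested", "Value": 1, "Unit": "Count"}],
    )
```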
### 2. Investigation & Replay
Investigator → Replay API (`POST /replay`)
- Queries DynamoDB for historical events (by entity, time range, etc.)
- Sends selected events to an SQS queue for reprocessing (see the sketch below)
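A sketch of the replay path, assuming the index uses the `ENTITY#` partition key and `TS#`-prefixed sort key from the ingestion sketch above; the table name and queue URL are hypothetical:

```python
import json

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
sqs = boto3.client("sqs")

events_table = dynamodb.Table("events")  # hypothetical table name
REPLAY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/replay-queue"  # hypothetical


def replay(entity_id: str, start_ts: str, end_ts: str) -> int:
    """Re-enqueue every indexed event for an entity within a time range."""
    # Query the index by entity and time window (pagination omitted for brevity)
    response = events_table.query(
        KeyConditionExpression=Key("pk").eq(f"ENTITY#{entity_id}")
        & Key("sk").between(f"TS#{start_ts}", f"TS#{end_ts}")
    )
    sent = 0
    for item in response["Items"]:
        # Keep the message small: the processor re-reads the raw payload from S3
        sqs.send_message(
            QueueUrl=REPLAY_QUEUE_URL,
            MessageBody=json.dumps({"event_id": item["event_id"], "s3_key": item["s3_key"]}),
        )
        sent += 1
    return sent
```

Sending only the `event_id` and `s3_key` keeps the queue payload small and lets the processor always work from the immutable raw copy.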
### 3. Event Processing (Optional)
Processor Lambda ← SQS Queue
- Loads raw event from S3
- Normalizes data to minimal schema
- Computes placeholder aggregate
- Stores versioned results with input hashes for auditability (sketched below)
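A sketch of a processor handler under the same assumptions as above; the bucket, table, and field names are placeholders, and the pass-through `value` aggregate stands in for real domain logic:

```python
import hashlib
import json

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
results_table = dynamodb.Table("aggregates")  # hypothetical results table

RAW_BUCKET = "raw-events-bucket"  # hypothetical bucket name


def handler(event, context):
    """SQS-triggered Lambda: load raw event, normalize, aggregate, store with input hash."""
    for record in event["Records"]:
        message = json.loads(record["body"])

        # Load the immutable raw payload from S3
        raw = s3.get_object(Bucket=RAW_BUCKET, Key=message["s3_key"])["Body"].read()
        payload = json.loads(raw)

        # Normalize to a minimal schema (placeholder field names)
        normalized = {
            "entity_id": payload["entity_id"],
            "event_time": payload["event_time"],
            "value": payload.get("value", 0),
        }

        # Placeholder aggregate plus an input hash and sample event_id for auditability
        results_table.put_item(
            Item={
                "pk": f"AGG#{normalized['entity_id']}",
                "sk": f"V#{context.aws_request_id}",  # one version per invocation
                "aggregate": normalized["value"],
                "input_hash": hashlib.sha256(raw).hexdigest(),
                "sample_event_ids": [message["event_id"]],
            }
        )
```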
## Key extension points
- Schema: extend allowed fields / validation in `src/shared/schema.py` (see the sketch after this list)
- Partitioning: change S3 key layout and DynamoDB PK/SK format in shared utilities
- Replay output: customize message payload for your downstream processors
- Processor: replace placeholder aggregates with domain logic; keep input hashes for auditability
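As an example of the schema extension point, here is a hypothetical allow-list validator; the real `src/shared/schema.py` may be organized differently, but extending allowed fields and validation typically amounts to editing a module like this:

```python
# Hypothetical shape of src/shared/schema.py; field names are illustrative.
ALLOWED_FIELDS = {"entity_id", "event_time", "source", "value"}
REQUIRED_FIELDS = {"entity_id", "event_time"}


def validate(payload: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the payload is acceptable."""
    errors = []
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    unknown = payload.keys() - ALLOWED_FIELDS
    if unknown:
        errors.append(f"unknown fields: {sorted(unknown)}")
    return errors
```

Adding a field would then mean extending `ALLOWED_FIELDS` (and `REQUIRED_FIELDS` if it is mandatory) and making sure the ingest handler and processor carry it through.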