Ingest API¶
The Ingest API is the entry point of the Pipeline Investigation Kit.
Its responsibility is intentionally narrow:
- accept events
- deduplicate them safely
- store raw data immutably
- record metadata for investigation
It does not transform, enrich, or aggregate data.
Endpoint¶
POST /ingest
Request Payload¶
{
"source": "demo",
"event_type": "heartbeat",
"entity_id": "user_123",
"event_time": "2025-12-28T23:59:59Z",
"payload": {
"steps": 10
}
}
Required Fields¶
| Field | Description |
|---|---|
source |
Origin system (service, app, vendor) |
event_type |
Logical event type |
entity_id |
Entity identifier (user, device, account, etc.) |
event_time |
When the event actually occurred (ISO-8601) |
payload |
Arbitrary JSON payload |
All fields are required.
Event ID Strategy¶
A stable event_id is generated server-side using a deterministic hash:
hash(source + event_type + entity_id + event_time + payload)
This guarantees:
- idempotent ingestion
- safe client retries
- deterministic deduplication
- reproducible replay
Clients do not send event_id.
Deduplication Behavior¶
Deduplication is based on event_id.
- First occurrence →
ACCEPTED - Subsequent occurrences →
DUPLICATE
Duplicates are:
- expected
- recorded
- observable
They are not errors.
Storage Model¶
Raw Events (S3)¶
- Stored as immutable JSON objects
- Never overwritten or deleted by the system
-
Partitioned by:
-
source
- event_type
- event_date
- ingest_date
- hour
Example key:
raw/source=demo/event_type=heartbeat/event_date=2025-12-28/ingest_date=2025-12-29/hour=23/event_id=....json
Metadata (DynamoDB)¶
Each ingest writes a metadata record containing:
- event identifiers
- entity and time info
- ingest timestamp
- ingest lag
- status (
ACCEPTED/DUPLICATE) - S3 location
- payload hash
This table is optimized for investigation queries.
Ingest Lag¶
The system computes:
ingest_lag_ms = ingest_time - event_time
This allows detection of:
- late arrivals
- backfills
- delayed syncs
Lag is indexed and queryable.
Responses¶
Accepted Event¶
{
"event_id": "...",
"status": "ACCEPTED",
"ingest_lag_ms": 72936580,
"s3_key": "raw/..."
}
Duplicate Event¶
{
"event_id": "...",
"status": "DUPLICATE",
"ingest_lag_ms": 72956085,
"s3_key": "raw/..."
}
The same event_id is returned for duplicates.
DRY_RUN Mode¶
When DRY_RUN=true:
- payload is validated
event_idis generated- no S3 writes
- no DynamoDB writes
- metrics are still emitted
This enables safe testing in production.
Observability¶
Metrics¶
IngestRequestCountDuplicateCountIngestLagMs
Logs¶
- structured JSON
- include event_id and entity_id
- safe for correlation
Failure Handling¶
- Invalid payload → rejected
- Missing fields → rejected
- Internal errors → surfaced
No data is silently dropped.
Design Guarantees¶
- Raw data is immutable
- Deduplication is deterministic
- Ingest is idempotent
- No side effects beyond storage
When to Use Ingest¶
- capturing raw events
- recording late or out-of-order data
- preserving investigation context
- debugging upstream failures
If you are unsure whether to ingest something: ingest it.