Developer Guide – Troubleshooting¶

This section documents real failure modes you are likely to see, how to diagnose them, and what not to do.

This project is intentionally transparent: if something looks wrong, the system is probably telling you something important.

Core Troubleshooting Principle¶

Never “fix” before you understand.

Use:

logs
metrics
raw data
replay

Do not patch blindly.

Ingest Issues¶

❗ Ingest returns `DUPLICATE` unexpectedly¶

This is not an error.

Likely causes:

client retries
network timeouts
upstream replay

What to check:

same event_id
same payload hash
dedupe table entries

✅ Expected behavior.
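To confirm that a `DUPLICATE` response really is a retry of the same event, compare the event_id and payload hash of the two submissions. The following is a minimal sketch, assuming the hash is SHA-256 over canonical JSON (sorted keys, compact separators); the project's actual hashing scheme may differ, and `payload_hash` and `is_same_event` are hypothetical helper names.

```python
import hashlib
import json


def payload_hash(payload: dict) -> str:
    """Hash a payload independently of key order (assumed scheme, not confirmed)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_same_event(a: dict, b: dict) -> bool:
    """True when two submissions share both event_id and payload hash."""
    return a.get("event_id") == b.get("event_id") and payload_hash(a) == payload_hash(b)


first = {"event_id": "evt-1", "value": 10}
retry = {"value": 10, "event_id": "evt-1"}    # same content, different key order
changed = {"event_id": "evt-1", "value": 11}  # same event_id, different payload

print(is_same_event(first, retry))    # True
print(is_same_event(first, changed))  # False
```

If the second case shows up in practice (same event_id, different hash), that is worth investigating rather than treating as a routine retry.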

❗ Ingest accepts but no data appears downstream¶

Check:

S3 bucket → raw object exists?
EventsTable → metadata row exists?
Status field (ACCEPTED vs DUPLICATE)

If the raw object exists, ingest is working; look downstream (replay, SQS, processor) instead.

Replay Issues¶

❗ Replay returns `sent = 0`¶

This is the most common confusion point.

Possible reasons:

no events in the time window
all events filtered out
include_duplicates=false and every matching event is DUPLICATE
rows missing s3_bucket or s3_key

What to do:

aws dynamodb scan --table-name EVENTS_TABLE

Here EVENTS_TABLE stands for the deployed name of EventsTable. Inspect the rows manually.
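When inspecting rows by hand, it can help to label each row with the reason replay would skip it. The sketch below mirrors the list of possible reasons above. It assumes rows have already been converted to plain Python dicts, that each row carries a `timestamp` field holding an ISO-8601 UTC string (a hypothetical field name), and that the `status`, `s3_bucket`, and `s3_key` fields are named as in this guide. It is an illustration, not the replay function's actual logic.

```python
from collections import Counter


def skip_reason(row: dict, start: str, end: str, include_duplicates: bool = False):
    """Return why a row would not be sent, or None if it looks sendable."""
    ts = row.get("timestamp", "")
    # ISO-8601 UTC strings in the same format compare correctly as plain strings.
    if not (start <= ts <= end):
        return "outside time window"
    if row.get("status") == "DUPLICATE" and not include_duplicates:
        return "duplicate excluded"
    if not row.get("s3_bucket") or not row.get("s3_key"):
        return "missing S3 reference"
    return None


def summarize(rows, start, end, include_duplicates=False):
    """Count rows per skip reason; sendable rows are counted as 'sendable'."""
    return Counter(
        skip_reason(r, start, end, include_duplicates) or "sendable" for r in rows
    )


rows = [
    {"timestamp": "2024-01-01T00:00:00Z", "status": "ACCEPTED", "s3_bucket": "b", "s3_key": "k"},
    {"timestamp": "2024-01-01T00:05:00Z", "status": "DUPLICATE", "s3_bucket": "b", "s3_key": "k"},
    {"timestamp": "2024-01-01T00:10:00Z", "status": "ACCEPTED"},
    {"timestamp": "2023-12-31T00:00:00Z", "status": "ACCEPTED", "s3_bucket": "b", "s3_key": "k"},
]
summary = summarize(rows, "2024-01-01T00:00:00Z", "2024-01-01T23:59:59Z")
for reason, count in sorted(summary.items()):
    print(f"{reason}: {count}")
```

A summary with zero `sendable` rows explains `sent = 0` directly and points at which of the reasons applies.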

❗ Replay scans items but sends nothing¶

Check:

status field
missing S3 references
limit reached early

Replay is conservative by design.

Processor Issues¶

❗ Processor not consuming messages¶

Checklist:

Is EnableProcessor=true?
Does the event source mapping exist?
Is the SQS queue empty?

aws lambda list-event-source-mappings

❗ Processor runs but aggregates are wrong¶

This is expected during investigation.

Check:

multiple aggregate versions
input hashes
sample event IDs

Aggregates are diagnostic outputs, not truth.

❗ Processor errors but queue drains¶

This is dangerous.

Fix immediately:

sam deploy --parameter-overrides EnableProcessor=false

This stops further consumption, so messages still in the queue are not lost.

SQS Issues¶

❗ Messages disappear¶

Possible causes:

processor enabled
message in flight (hidden until its visibility timeout expires)
DLQ not configured (by design)

Always inspect the queue before enabling the processor.

❗ SQS stays empty after replay¶

Check:

replay logs
sent count
IAM permissions (sqs:SendMessage)

DynamoDB Issues¶

❗ Scan works but query doesn’t¶

Likely:

wrong index
wrong key condition
wrong partition key

Remember:

PK = ENTITY#<id>
SK = TS#<timestamp>#EID#<id>
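A query fails silently when the key strings do not match the stored format exactly, so it can help to build them in one place. The sketch below assumes that the `<id>` in PK is an entity ID, that the `<id>` after `EID#` is the event ID, and that timestamps are ISO-8601 strings; all three are assumptions, and the function names are hypothetical.

```python
def partition_key(entity_id: str) -> str:
    """Build the partition key in the ENTITY#<id> format."""
    return f"ENTITY#{entity_id}"


def sort_key(timestamp: str, event_id: str) -> str:
    """Build the full sort key in the TS#<timestamp>#EID#<id> format."""
    return f"TS#{timestamp}#EID#{event_id}"


def sort_key_prefix(timestamp_prefix: str) -> str:
    """Build a sort-key prefix, e.g. to match every event on one day."""
    return f"TS#{timestamp_prefix}"


print(partition_key("42"))                             # ENTITY#42
print(sort_key("2024-01-01T00:00:00Z", "evt-1"))       # TS#2024-01-01T00:00:00Z#EID#evt-1
print(sort_key_prefix("2024-01-01"))                   # TS#2024-01-01
```

The prefix form pairs with a key condition such as `PK = :pk AND begins_with(SK, :prefix)`, which only works against the table or index whose keys actually use this layout.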

❗ Unexpected aggregate overwrites¶

Aggregates are append-only by design.

If you see overwrites:

check table schema
check sort key versioning
verify code changes

Logging & Metrics Issues¶

❗ No logs in CloudWatch¶

Check:

correct log group name
correct region
execution role has the AWSLambdaBasicExecutionRole managed policy attached

❗ Metrics missing¶

Ensure:

EMF logs emitted
namespace PipelineInvestigationKit
correct dimensions

Metrics are written via logs in CloudWatch Embedded Metric Format (EMF), so if logs are missing, metrics are missing too.
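For reference when checking log output, this is roughly what a well-formed EMF log line looks like. The sketch builds one using the `PipelineInvestigationKit` namespace named above; the metric name `EventsReplayed`, the `Stage` dimension, and the `emf_record` helper are hypothetical and only illustrate the shape of the record.

```python
import json
import time


def emf_record(metric_name, value, dimensions,
               namespace="PipelineInvestigationKit", unit="Count"):
    """Return one EMF-formatted JSON log line for a single metric."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set listing every dimension key.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value is a top-level member named after the metric.
        metric_name: value,
    }
    # Each dimension key must also appear as a top-level member.
    record.update(dimensions)
    return json.dumps(record)


line = emf_record("EventsReplayed", 3, {"Stage": "replay"})
print(line)
```

If a log line is missing the `_aws` block, or a dimension named in `Dimensions` has no matching top-level member, CloudWatch will not extract the metric as intended even though the log line itself appears.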

Local vs Cloud Confusion¶

❗ Works locally but not in AWS¶

Common causes:

missing IAM permission
missing env var
wrong resource name

Compare the results of the two side by side:

sam local invoke
aws lambda invoke
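One way to narrow down a missing env var is to diff the variables used locally against those on the deployed function. The sketch below assumes you have both sets as plain dicts, for example from the JSON file passed to `sam local invoke --env-vars` and from the `Environment.Variables` section returned by `aws lambda get-function-configuration`. The variable names in the example are hypothetical.

```python
def diff_env(local: dict, cloud: dict) -> dict:
    """Report variables that are missing on either side or differ in value."""
    shared = set(local) & set(cloud)
    return {
        "missing_in_cloud": sorted(set(local) - set(cloud)),
        "missing_locally": sorted(set(cloud) - set(local)),
        "different": sorted(k for k in shared if local[k] != cloud[k]),
    }


local_env = {"TABLE_NAME": "events-local", "QUEUE_URL": "http://localhost/q", "DRY_RUN": "true"}
cloud_env = {"TABLE_NAME": "events-prod", "QUEUE_URL": "http://localhost/q"}

for kind, names in diff_env(local_env, cloud_env).items():
    print(f"{kind}: {names}")
```

Differences in value are often expected (resource names differ per environment); variables listed under `missing_in_cloud` are the usual culprits.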

Golden Debugging Path¶

When confused, always do this in order:

Inspect raw S3 data
Inspect EventsTable rows
Replay with DRY_RUN
Inspect SQS messages
Enable processor briefly
Inspect aggregates

Never skip steps.

What NOT to Do¶

❌ Delete raw data
❌ Rewrite aggregates
❌ Disable dedupe
❌ Replay without scoping
❌ Enable processor blindly

When to Escalate¶

Escalate if:

raw data missing
ingest fails consistently
IAM denies expected access

Otherwise, the system is likely behaving correctly.

Developer Guide Complete ✅¶

You now have:

Architecture
Quickstart
Deployment
Configuration
Usage
Troubleshooting

This is a complete, production-grade investigation toolkit.