Developer Guide – Troubleshooting¶
This section documents real failure modes you are likely to see, how to diagnose them, and what not to do.
This project is intentionally transparent: if something looks wrong, the system is probably telling you something important.
Core Troubleshooting Principle¶
Never “fix” before you understand.
Use:
- logs
- metrics
- raw data
- replay
Do not patch blindly.
Ingest Issues¶
❗ Ingest returns DUPLICATE unexpectedly¶
This is not an error.
Likely causes:
- client retries
- network timeouts
- upstream replay
What to check:
- same event_id
- same payload hash
- dedupe table entries
✅ Expected behavior.
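To confirm that a DUPLICATE response really is a retry of the same event, the two deliveries can be compared offline. This is a minimal sketch, not the project's dedupe logic: the guide does not say how the payload hash is computed, so SHA-256 over canonically serialized JSON is an assumption, and payload_hash and is_same_event are hypothetical helper names.

```python
import hashlib
import json


def payload_hash(payload):
    """Hash a payload independent of key order (assumed scheme: SHA-256 of canonical JSON)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_same_event(a, b):
    """True when two deliveries share an event_id and carry identical payload content."""
    return (
        a["event_id"] == b["event_id"]
        and payload_hash(a["payload"]) == payload_hash(b["payload"])
    )


# Illustrative deliveries: the retry has the same content in a different key order.
first = {"event_id": "evt-1", "payload": {"amount": 10, "currency": "EUR"}}
retry = {"event_id": "evt-1", "payload": {"currency": "EUR", "amount": 10}}
print(is_same_event(first, retry))
```

If the ids match but the hashes differ, the DUPLICATE is worth a closer look before treating it as a harmless retry.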
❗ Ingest accepts but no data appears downstream¶
Check:
- S3 bucket → raw object exists?
- EventsTable → metadata row exists?
- Status field (ACCEPTED vs DUPLICATE)
If the raw object exists, ingest is working and the problem is downstream.
Replay Issues¶
❗ Replay returns sent = 0¶
This is the most common confusion point.
Possible reasons:
- no events in time window
- all events filtered out
- include_duplicates=false
- missing s3_bucket or s3_key
What to do:
aws dynamodb scan --table-name EVENTS_TABLE
Inspect rows manually.
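When inspecting rows, it can help to tally why each one would be skipped. The sketch below mirrors the reasons listed above; it is not the replay implementation. It assumes the rows have already been converted from DynamoDB JSON into plain dicts, and the ts attribute name and numeric timestamps are hypothetical. Only status, s3_bucket, and s3_key come from this guide.

```python
from collections import Counter


def skip_reason(row, start_ts, end_ts, include_duplicates=False):
    """Return why a row would not be replayed, or None if it would be sent."""
    ts = row.get("ts")  # hypothetical attribute name for the event timestamp
    if ts is None or not (start_ts <= ts <= end_ts):
        return "outside time window"
    if row.get("status") == "DUPLICATE" and not include_duplicates:
        return "duplicate excluded"
    if not row.get("s3_bucket") or not row.get("s3_key"):
        return "missing s3_bucket or s3_key"
    return None


def summarize(rows, start_ts, end_ts, include_duplicates=False):
    """Count rows per skip reason; rows that would be sent are counted as 'sent'."""
    return Counter(
        skip_reason(row, start_ts, end_ts, include_duplicates) or "sent"
        for row in rows
    )


# Illustrative rows only; real rows come from the table scan.
rows = [
    {"ts": 100, "status": "ACCEPTED", "s3_bucket": "raw", "s3_key": "a.json"},
    {"ts": 110, "status": "DUPLICATE", "s3_bucket": "raw", "s3_key": "a.json"},
    {"ts": 120, "status": "ACCEPTED", "s3_bucket": "raw"},
    {"ts": 999, "status": "ACCEPTED", "s3_bucket": "raw", "s3_key": "b.json"},
]
print(dict(summarize(rows, start_ts=0, end_ts=500)))
```

A tally with no "sent" entries explains sent = 0 without touching the pipeline.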
❗ Replay scans items but sends nothing¶
Check:
- status field
- missing S3 references
- limit reached early
Replay is conservative by design.
Processor Issues¶
❗ Processor not consuming messages¶
Checklist:
- Is EnableProcessor=true?
- Does event source mapping exist?
- Is SQS empty?
aws lambda list-event-source-mappings
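The JSON printed by that command can be checked mechanically. A minimal sketch, assuming the output has been captured as a string: mapping_problems is a hypothetical helper, and the ARN and UUID are made up. EventSourceMappings, EventSourceArn, UUID, and State are fields of the real CLI output.

```python
import json


def mapping_problems(cli_output, queue_arn):
    """List reasons the processor may not be consuming from the given queue."""
    mappings = json.loads(cli_output).get("EventSourceMappings", [])
    matching = [m for m in mappings if m.get("EventSourceArn") == queue_arn]
    if not matching:
        return ["no event source mapping points at this queue"]
    return [
        f"mapping {m.get('UUID')} is {m.get('State')}, not Enabled"
        for m in matching
        if m.get("State") != "Enabled"
    ]


# Hypothetical ARN and UUID, shaped like the CLI's JSON output.
QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:example-queue"
sample = json.dumps({
    "EventSourceMappings": [
        {
            "UUID": "11111111-2222-3333-4444-555555555555",
            "EventSourceArn": QUEUE_ARN,
            "State": "Disabled",
        }
    ]
})
print(mapping_problems(sample, QUEUE_ARN))
```

An empty list means a mapping exists and is enabled, so the next suspect is an empty queue.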
❗ Processor runs but aggregates are wrong¶
This is expected during investigation.
Check:
- multiple aggregate versions
- input hashes
- sample event IDs
Aggregates are diagnostic outputs, not truth.
❗ Processor errors but queue drains¶
This is dangerous.
Fix immediately:
sam deploy --parameter-overrides EnableProcessor=false
This stops consumption without losing messages.
SQS Issues¶
❗ Messages disappear¶
Possible causes:
- processor enabled
- message in flight (hidden until its visibility timeout expires)
- DLQ not configured (by design)
Always inspect before enabling processor.
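One way to inspect without consuming anything is to read the queue's approximate counters, for example from the Attributes map returned by aws sqs get-queue-attributes. The sketch below only interprets those counters; describe_queue is a hypothetical helper, and the wording of its messages is illustrative.

```python
def describe_queue(attributes):
    """Interpret the approximate counters from an SQS Attributes map (values are strings)."""
    visible = int(attributes.get("ApproximateNumberOfMessages", 0))
    in_flight = int(attributes.get("ApproximateNumberOfMessagesNotVisible", 0))
    if visible == 0 and in_flight == 0:
        return "empty: messages were consumed and deleted, expired, or never sent"
    if in_flight > 0:
        return f"{in_flight} in flight (hidden by a consumer), {visible} visible"
    return f"{visible} visible and waiting, none in flight"


print(describe_queue({"ApproximateNumberOfMessages": "0",
                      "ApproximateNumberOfMessagesNotVisible": "3"}))
```

Messages that are in flight have not been lost; they become visible again if the consumer does not delete them before the visibility timeout ends.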
❗ SQS stays empty after replay¶
Check:
- replay logs
- sent count
- IAM permissions (sqs:SendMessage)
DynamoDB Issues¶
❗ Scan works but query doesn’t¶
Likely:
- wrong index
- wrong key condition
- wrong partition key
Remember:
PK = ENTITY#<id>
SK = TS#<timestamp>#EID#<id>
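Building the keys in code avoids typos in the prefixes. This is a sketch under assumptions: the attribute names PK and SK are taken from the pattern above, the first id is read as an entity id and the second as an event id, and the timestamp format is not specified in this guide. make_keys and query_params are hypothetical helpers; the parameter names in the returned dict are those of the DynamoDB Query API.

```python
def make_keys(entity_id, timestamp, event_id):
    """Compose the partition and sort key strings from the documented pattern."""
    return {
        "PK": f"ENTITY#{entity_id}",
        "SK": f"TS#{timestamp}#EID#{event_id}",
    }


def query_params(table_name, entity_id, ts_prefix=""):
    """Build Query arguments for one entity, optionally narrowed by a timestamp prefix."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :sk)",
        "ExpressionAttributeValues": {
            ":pk": {"S": f"ENTITY#{entity_id}"},
            ":sk": {"S": f"TS#{ts_prefix}"},
        },
    }


# Illustrative values only.
print(make_keys("42", "2024-01-01T00:00:00Z", "evt-1"))
print(query_params("EVENTS_TABLE", "42", ts_prefix="2024-01"))
```

A query needs an exact match on the partition key; only the sort key accepts a prefix or range condition, which is why a scan can succeed where a query with a partial PK returns nothing.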
❗ Unexpected aggregate overwrites¶
Aggregates are append-only by design.
If you see overwrites:
- check table schema
- check sort key versioning
- verify code changes
Logging & Metrics Issues¶
❗ No logs in CloudWatch¶
Check:
- correct log group name
- correct region
- IAM role includes AWSLambdaBasicExecutionRole
❗ Metrics missing¶
Ensure:
- EMF logs emitted
- namespace PipelineInvestigationKit
- correct dimensions
Metrics are written via logs.
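For comparison with what the functions actually log, here is a minimal sketch of a well-formed EMF record. The _aws envelope follows the CloudWatch Embedded Metric Format; the metric name ReplaySent and the dimension Function are hypothetical and may not match what this project emits.

```python
import json
import time


def emf_record(metric_name, value, dimensions,
               namespace="PipelineInvestigationKit", unit="Count"):
    """Build one EMF log record; CloudWatch extracts the metric when it is printed to the log."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since the epoch
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # one dimension set, listed by key name
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        **dimensions,        # every dimension key must also appear at the top level
        metric_name: value,  # so must the metric value
    }


# In a Lambda function, printing the JSON line to stdout is enough.
print(json.dumps(emf_record("ReplaySent", 0, {"Function": "replay"})))
```

If a logged record names a dimension in Dimensions but omits it at the top level, or uses a different namespace, the metric will not show up where you expect it.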
Local vs Cloud Confusion¶
❗ Works locally but not in AWS¶
Common causes:
- missing IAM permission
- missing env var
- wrong resource name
Compare the two invocations side by side:
sam local invoke
aws lambda invoke
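A missing or different environment variable is easy to spot with a small diff. This sketch assumes both environments are available as plain dicts, for example the local env file and the Environment.Variables map from aws lambda get-function-configuration. diff_env and the variable names are hypothetical.

```python
def diff_env(local_env, deployed_env):
    """Report variables absent from the deployed function and variables whose values differ."""
    missing = sorted(set(local_env) - set(deployed_env))
    different = sorted(
        name
        for name in local_env
        if name in deployed_env and local_env[name] != deployed_env[name]
    )
    return {"missing": missing, "different": different}


# Illustrative values only.
local_env = {"EVENTS_TABLE": "events-local", "RAW_BUCKET": "raw-local", "DRY_RUN": "true"}
deployed_env = {"EVENTS_TABLE": "events-prod", "RAW_BUCKET": "raw-local"}
print(diff_env(local_env, deployed_env))
```

Different values are often intentional (resource names differ per environment); missing names usually are not.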
Golden Debugging Path¶
When confused, always do this in order:
- Inspect raw S3 data
- Inspect EventsTable rows
- Replay with DRY_RUN
- Inspect SQS messages
- Enable processor briefly
- Inspect aggregates
Never skip steps.
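The ordering can be enforced with a tiny runner that stops at the first failed check. This is only a sketch: run_in_order is a hypothetical helper, and the lambdas are stand-ins for real checks against S3, EventsTable, and SQS.

```python
def run_in_order(steps):
    """Run (name, check) pairs in order; stop at the first check that returns False."""
    completed = []
    for name, check in steps:
        if not check():
            return completed, name
        completed.append(name)
    return completed, None


# Stand-in checks; replace each lambda with a real inspection.
steps = [
    ("raw S3 data", lambda: True),
    ("EventsTable rows", lambda: True),
    ("replay with DRY_RUN", lambda: False),
    ("SQS messages", lambda: True),
]
completed, failed_at = run_in_order(steps)
print(completed, failed_at)
```

Stopping at the first failure keeps later steps, such as enabling the processor, from running against a pipeline that is already known to be broken upstream.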
What NOT to Do¶
❌ Delete raw data
❌ Rewrite aggregates
❌ Disable dedupe
❌ Replay without scoping
❌ Enable processor blindly
When to Escalate¶
Escalate if:
- raw data missing
- ingest fails consistently
- IAM denies expected access
Otherwise, the system is likely behaving correctly.
Developer Guide Complete ✅¶
You now have:
- Architecture
- Quickstart
- Deployment
- Configuration
- Usage
- Troubleshooting
This is a complete, production-grade investigation toolkit.