reasoning framework your devs can run every time they pick up a new event-driven requirement (source adapter → canonical → target adapter; Kafka in the middle; DLQ ready)
- Mark Kendall
- 2 days ago
- 5 min read
Love it — you’ve got the infra and the code patterns down. Here’s a lightweight, repeatable reasoning framework your devs can run every time they pick up a new event-driven requirement (source adapter → canonical → target adapter; Kafka in the middle; DLQ ready). It’s designed to be fast to fill, hard to get wrong, and easy to review.
The 12-Box Adapter Canvas (fill this before coding)
1. Business Outcome & KPI
• What changes for the user/system?
• KPI/SLO: latency ___ ms, success rate %, cost ceiling $/month.
• Owner/on-call: ______
2. Trigger & Source
• Event type(s): create/update/delete/batch/command.
• Ingress: API Gateway | Kafka | SQS | EventBridge | Cron.
• Ordering requirement? (per customer/order/etc.) Key: ______
3. Upstream Contract
• Schema/spec link + version.
• Required vs optional fields.
• Max payload size & rate (p95 TPS).
• Idempotency key available? (eventId, externalId)
4. Canonical Model
• Canonical object: e.g., TMF.ProductOrder.v5
• Versioning plan (back/forward compat).
• Field map table (source→canonical, defaults, units).
5. Validation Rules
• Syntactic: JSON/schema validation, enums, formats.
• Semantic: cross-field rules, referential checks, duplicates.
• PII redaction/masking policy.
6. Transform & Enrichment
• Lookups (cache/DB/service).
• Derived fields.
• Timezone/units normalization.
7. Routing & Topics
• Publish topic(s): eip.<domain>.<event>.in
• Partition key & header conventions (tenant, correlationId, schemaVer).
• Fan-out rules (who consumes, why).
8. Idempotency & Exactly-Once Story
• Idempotency store/strategy (key: ______; TTL: ___ days).
• Side-effect guards (read-before-write? conditional upserts?).
• Outbox / transactional publish (if needed).
9. Failure Taxonomy & DLQ
• Retryable vs non-retryable vs poison.
• Retry policy (attempts/backoff/timeout).
• DLQ topic naming: eip.<domain>.<stage>.dlq + reason code schema.
• Replay procedure (who runs it, how).
10. Performance & Cost
• Lambda memory/timeout targets.
• Batch size/prefetch/concurrency caps (per partition key if ordering).
• p95/p99 targets; load-test plan.
11. Security & Compliance
• AuthN/Z (JWT, mTLS, IAM).
• Secrets (provider, rotation, scope).
• Data retention & audit fields.
• Least-privilege IAM topic/table policies.
12. Observability & Ops
• Logs: structured JSON w/ correlationId, eventId.
• Metrics: accepts/validations/failures/latency/cost.
• Tracing: span boundaries (ingress → transform → publish).
• Dashboards/alerts (thresholds + runbooks).
• Rollout/rollback (canary %, feature flags).
⸻
Map the Canvas → Implementation (mechanical steps)
1. Contracts First
• Check in upstream schema + canonical schema + mapping table.
• Add version gates: reject unknown future versions or tolerate?
2. Choose Event Source
• Ordered stream per key → Kafka consumer Lambda (partitioned).
• Request/response → API Gateway Lambda.
• Fan-out timers → EventBridge Scheduler.
3. Name & Configure Topics
• eip.<domain>.<entity>.<verb>.in → canonical;
• <...>.dlq for each processing stage; headers carry schemaVer, correlationId.
4. Idempotency Guard
• Pick key (eventId/externalId+source).
• Implement “seen” check (fast store) + safe upsert on target.
5. Lambda Knobs (per NFR)
• Timeout = p99 path + margin.
• Memory = transform CPU needs; tune cost/latency.
• Batch size & concurrency: preserve ordering per key.
• Retries: align to taxonomy.
6. Observability
• Emit accept, validate_fail, transform_fail, publish_ok, publish_fail.
• One dashboard per adapter; SLO alerts wire to on-call.
7. Security
• Scope secrets to adapter; least-privilege topic ACLs.
• Mask PII in logs; tag data classes.
8. Deploy & Prove
• Contract tests (schema compat), mapping tests, golden payloads, chaos tests (bad enums, missing fields).
• Load test to target TPS with real partition key skews.
• Canary on a small traffic slice; bake; then ramp.
⸻
Fast Decision Helpers
Trigger chooser
• Need strict ordering per customer/order? → Kafka w/ partition key.
• Synchronous API to caller? → API Gateway + immediate validate/publish.
• Scheduled jobs/batches? → EventBridge Scheduler.
DLQ policy
• Put full payload + error envelope (stage, reason, stack hash, firstSeen, attemptCount).
• Alert on DLQ creation > N/min; provide replay tool that re-publishes after fix.
Retries
• Validation errors → no retry, direct DLQ.
• Transient downstream/partition throttling → N retries w/ exponential backoff.
• Unknown errors → small N, then DLQ; open an incident if rate spikes.
Ordering vs parallelism
• If ordering matters, set consumer concurrency to number of partitions and keep per-key affinity; never open >1 concurrent worker on same partition.
⸻
Definition of Done (adapter)
• Canvas completed & reviewed.
• Schemas + mapping committed & versioned.
• Golden test vectors for happy/sad paths.
• Idempotency guard proven with duplicate replay.
• DLQ message contract & replay runbook documented.
• Dashboards + alerts deployed and firing in test.
• Load test at p95 TPS with 2× burst passes SLO.
• Rollback plan tested (disable consumer/canary flip).
• Security review (secrets, PII, IAM).
• Cost estimate documented (≈ $/million messages).
⸻
Fill-in Templates (drop into your repo)
1) Adapter Canvas (Markdown)
# Adapter: <Domain>-<Entity>-<Verb>
Owner: <team> • On-call: <rotor> • SLO: <latency/availability> • Target Cost:<$>
## 1. Business Outcome & KPI
<one paragraph + KPI table>
## 2. Trigger & Source
Type: <API/Kafka/...> • Ordering: <none|per key> (key: <...>) • Expected TPS: <...>
## 3. Upstream Contract
Spec: <link+version> • Required fields: <...> • Max size: <...> • Idempotency key: <...>
## 4. Canonical Model
Object: <...> v<...> • Compat policy: <back/forward>
Mapping table: source.field → canonical.field (default, transform)
## 5. Validation Rules
- Syntax: <json schema link>
- Semantic: <rules>
- PII: <masking>
## 6. Transform & Enrichment
Lookups: <...> • Derived fields: <...>
## 7. Routing & Topics
Publish to: `eip.<domain>.<entity>.<verb>.in`
Headers: correlationId, schemaVer, tenant, idempotencyKey
## 8. Idempotency & Side-Effects
Key: <...> • Store: <...> • Strategy: <first-write-wins|upsert|compare-and-swap>
## 9. Failure & DLQ
Taxonomy: <...> • Retries: <...> • DLQ: `eip.<domain>.<stage>.dlq` • Replay runbook: <link>
## 10. Performance & Cost
Lambda: memory <...>, timeout <...>, batch <...>, concurrency <...>
Targets: p95 <...> ms, p99 <...> ms • Est. $/MM msgs: <...>
## 11. Security & Compliance
Auth: <...> • IAM: <...> • Secrets: <...> • Retention: <...>
## 12. Observability & Ops
Logs: <fields> • Metrics: <list> • Traces: <spans> • Alerts: <thresholds> • Rollout: <canary/flags>
2) Mapping Table (CSV for quick diffs)
source_field,canonical_field,required,default,transform,notes
id,order.id,required,,,
timestamp,order.occurredAt,required,,parseIsoToUtc(),
...
3) Error Envelope (JSON contract)
{
"correlationId": "<uuid>",
"stage": "validate|transform|publish|target",
"reasonCode": "SCHEMA_MISSING_FIELD|DOWNSTREAM_429|UNKNOWN",
"message": "<human readable>",
"attempt": 3,
"firstSeen": "2025-09-10T20:00:00Z",
"payload": { /* original */ },
"context": { "schemaVer":"v5", "tenant":"acme" }
}
4) Idempotency Key Rule (YAML)
idempotency:
key: "${sourceSystem}:${eventType}:${externalId}"
store: "dynamodb|redis|kafka-header-index"
ttlDays: 30
mode: "first-write-wins"
⸻
Mini Example (worked in 60 seconds)
Use case: TMF ProductOrder.create from Source A → canonical → publish to Kafka → Target B consumes and opens/updates Salesforce Case.
• Ordering: per customerId (partition key = customerId).
• Idempotency: ${source}:${eventType}:${orderId} in a 30-day store.
• Validation: TMF v5 JSON Schema + semantic rule “lineItems[].quantity > 0”.
• Topics:
• eip.orders.productorder.create.in (canonical)
• eip.orders.validate.dlq, eip.orders.transform.dlq, eip.orders.target.dlq
• Lambda knobs: timeout 10s, memory 512MB, batch 100, reserved concurrency = number of partitions expected to be hot (protects target).
• Metrics: accepts/sec, validations_failed/sec, dlq/sec, p95 end-to-end latency, cost/msg.
• Alerts: DLQ > 5/min for 5 min; p95 > SLO for 10 min; missing traffic for 15 min.
⸻
How to use this in practice
1. Print the 12-Box Canvas, fill it in your design issue.
2. Commit the schemas + mapping CSV + error envelope + idempotency YAML next to your Lambda code.
3. In PR review, require: canvas present, golden vectors, DLQ contract, dashboard link.
4. After merge, run the canary and prove replay-from-DLQ works before full ramp.
If you want, I can drop these templates into a ZIP with a starter README and folder structure next.
Comments