🛡️
Reliability & Production Engineering
Day 2 · Production Systems · 45 min
Exactly-Once vs At-Least-Once:
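- True exactly-once delivery cannot be guaranteed in a distributed system: a sender cannot tell a lost message from a lost acknowledgment, so it must either retry (risking duplicates) or give up (risking loss)
- In practice: at-least-once delivery + idempotent processing = effectively exactly-once
- Idempotent processing: handling the same message more than once leaves the system in the same state as handling it once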
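The combination of at-least-once delivery and idempotent processing can be sketched as a handler that records each message ID and ignores redeliveries. This is a minimal in-memory illustration: `Ledger`, `apply_debit`, and the `message_id` key are hypothetical names, and a production version would keep the processed-ID record and the state change in the same database transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """In-memory stand-in for a DB table with a unique key on message_id (assumed design)."""
    balances: dict = field(default_factory=dict)
    processed: set = field(default_factory=set)

    def apply_debit(self, message_id: str, account: str, amount: int) -> bool:
        """Apply a debit once per message_id; return False for a redelivery."""
        if message_id in self.processed:
            return False  # duplicate delivery: state is left untouched
        self.balances[account] = self.balances.get(account, 0) - amount
        self.processed.add(message_id)
        return True


ledger = Ledger(balances={"acct-1": 100})
ledger.apply_debit("msg-42", "acct-1", 30)  # first delivery is applied
ledger.apply_debit("msg-42", "acct-1", 30)  # redelivery is ignored
print(ledger.balances["acct-1"])  # 70
```

The broker is free to deliver `msg-42` any number of times; the balance only moves once.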
Eventual Consistency:
- On-chain state and off-chain DB will temporarily diverge
- This is expected and acceptable
- Reconciliation jobs bring them back in sync
- Design UI/UX for this: show "pending" states, don't promise instant finality
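A reconciliation job of the kind described above might look like the following sketch. It assumes the chain is the source of truth and that both sides can be summarized as a map from transaction ID to status; `Status`, `reconcile`, and the sample IDs are hypothetical names for illustration only.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    FAILED = "failed"


def reconcile(db_rows: dict, chain_status: dict) -> list:
    """Update off-chain rows to match the chain; return the tx ids that changed.

    Rows the chain has not reported yet are left as they are, so the UI
    keeps showing a pending state instead of claiming finality.
    """
    changed = []
    for tx_id, status in db_rows.items():
        observed = chain_status.get(tx_id)
        if observed is not None and observed != status:
            db_rows[tx_id] = observed  # assumption: chain wins on disagreement
            changed.append(tx_id)
    return changed


db = {"tx1": Status.PENDING, "tx2": Status.PENDING, "tx3": Status.CONFIRMED}
chain = {"tx1": Status.CONFIRMED, "tx3": Status.CONFIRMED}
print(reconcile(db, chain))
```

Running the job on a schedule bounds how long the two views can disagree; `tx2` stays pending until the chain reports on it.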
Circuit Breaker:
CLOSED → (failures > threshold) → OPEN
OPEN → (timeout expires) → HALF-OPEN
HALF-OPEN → (success) → CLOSED
HALF-OPEN → (failure) → OPEN
Use for: RPC node calls, Visa API, external services.
When open: fail fast, don't queue up requests to a dead service.
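The state transitions above can be sketched as a small wrapper around any outbound call. This is an illustrative sketch, not a reference implementation: `CircuitBreaker`, `CircuitOpenError`, and the default `threshold` and `timeout_s` values are assumptions, and it counts consecutive failures (a success resets the counter), which the notes do not specify.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, threshold: int = 3, timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so tests need not sleep
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.timeout_s:
                # Fail fast: do not touch the unhealthy service at all.
                raise CircuitOpenError("circuit open")
            self.state = "HALF-OPEN"  # timeout expired: allow one trial call
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        # Success in CLOSED or HALF-OPEN resets the breaker.
        self.state = "CLOSED"
        self.failures = 0
        return result

    def _on_failure(self) -> None:
        self.failures += 1
        # A failed trial call reopens immediately; otherwise open once
        # the failure count exceeds the threshold.
        if self.state == "HALF-OPEN" or self.failures > self.threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

A caller would wrap each outbound request, for example `breaker.call(lambda: rpc_client.get_block())` with a hypothetical `rpc_client`, and treat `CircuitOpenError` as an immediate, cheap failure.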
Rate Limiting:
- Token bucket for API endpoints
- Per-user rate limits for card authorizations
- Per-signer rate limits for blockchain transactions (nonce management)
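A token bucket, and the per-user or per-signer variant built on it, can be sketched as below. The class names, rates, and capacities are hypothetical; the sketch is single-process and not thread-safe, and it does not attempt nonce management, which would sit alongside the per-signer limit.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at rate_per_s tokens per second, up to capacity."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity  # start full so short bursts are allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        # Lazily add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class KeyedLimiter:
    """One bucket per key, e.g. per user id or per signer address."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, key: str) -> bool:
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_s, self.capacity, self.clock)
            self.buckets[key] = bucket
        return bucket.allow()
```

One `TokenBucket` can guard an API endpoint as a whole, while separate `KeyedLimiter` instances apply independent limits per user and per signer.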
Backpressure:
- If the settlement queue grows too fast, slow down accepting new authorizations
- Better to reject at the edge than overwhelm the system
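Rejecting at the edge can be as simple as checking the backlog depth before admitting new work. The sketch below uses a hard cap for brevity; `AdmissionController`, `max_depth`, and the method names are hypothetical, and a real system might throttle gradually rather than cut off at a single limit.

```python
from collections import deque
from typing import Optional


class AdmissionController:
    """Rejects new authorizations while the settlement backlog is too deep."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self.settlement_queue = deque()

    def try_accept(self, auth_id: str) -> bool:
        if len(self.settlement_queue) >= self.max_depth:
            # Reject at the edge; the caller can answer with a retryable error.
            return False
        self.settlement_queue.append(auth_id)
        return True

    def settle_one(self) -> Optional[str]:
        """Drain one item from the backlog; return None when it is empty."""
        if not self.settlement_queue:
            return None
        return self.settlement_queue.popleft()


ctrl = AdmissionController(max_depth=2)
print(ctrl.try_accept("auth-1"), ctrl.try_accept("auth-2"), ctrl.try_accept("auth-3"))
```

Once settlement catches up and `settle_one` drains the queue below `max_depth`, new authorizations are accepted again.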
Key Points
- At-least-once + idempotency = effectively exactly-once
- Design for eventual consistency between chain and DB
- Circuit breakers prevent cascading failures
- Rate limit at multiple levels: API, per-user, per-signer
- Backpressure: reject at edge rather than overwhelm internals