Reliability & Production Engineering

Day 2 · Production Systems · 45 min

Exactly-Once vs At-Least-Once:

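  • True exactly-once delivery cannot be guaranteed in a distributed system: a sender that gets no acknowledgment cannot tell a lost message from a lost ack, so it must either retry (risking a duplicate) or give up (risking a loss)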
  • In practice: at-least-once delivery + idempotent processing = effectively exactly-once
  • Idempotent processing: processing the same message more than once leaves the system in the same state as processing it once
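One way to make a handler idempotent is to record each message ID and skip any ID already seen. The sketch below is illustrative only: the Ledger class, apply_settlement, and the message fields are hypothetical names, and an in-memory set stands in for what would normally be a unique-key table in the database.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """In-memory stand-in for a database with a unique key on message_id."""
    balances: dict = field(default_factory=dict)
    processed_ids: set = field(default_factory=set)

    def apply_settlement(self, message_id: str, account: str, amount: int) -> bool:
        """Apply a settlement once; return False if the message is a duplicate."""
        if message_id in self.processed_ids:
            return False  # redelivery: skip the side effect
        # In a real database, the dedup record and the balance update
        # would be committed in the same transaction.
        self.balances[account] = self.balances.get(account, 0) + amount
        self.processed_ids.add(message_id)
        return True


ledger = Ledger()
first = ledger.apply_settlement("msg-1", "acct-A", 500)
second = ledger.apply_settlement("msg-1", "acct-A", 500)  # same message redelivered
```

The second call returns False and the balance stays at 500, which is the "effectively exactly-once" outcome under at-least-once delivery.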
Eventual Consistency:
  • On-chain state and off-chain DB will temporarily diverge
  • This is expected and acceptable
  • Reconciliation jobs bring them back in sync
  • Design UI/UX for this: show "pending" states, don't promise instant finality
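A reconciliation job can be as simple as comparing the two views and correcting the database where they disagree. The sketch below assumes the chain is the source of truth and that both sides are keyed by a transaction ID; the reconcile function and the status strings are hypothetical.

```python
def reconcile(chain_state: dict, db_state: dict) -> dict:
    """Return the updates needed to make db_state agree with chain_state.

    Both arguments map a transaction ID to a status string. Transactions
    that exist only in the database are left alone: they may simply not
    have reached the chain yet, so they stay "pending".
    """
    corrections = {}
    for tx_id, chain_status in chain_state.items():
        if db_state.get(tx_id) != chain_status:
            corrections[tx_id] = chain_status
    return corrections


chain = {"tx1": "confirmed", "tx2": "confirmed", "tx3": "failed"}
db = {"tx1": "confirmed", "tx2": "pending", "tx4": "pending"}

fixes = reconcile(chain, db)
db.update(fixes)  # apply the corrections to the off-chain view
```

Here tx2 moves from pending to confirmed, tx3 is added as failed, and tx4 stays pending until the chain reports on it.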
Circuit Breakers:
CLOSED → (failures > threshold) → OPEN
OPEN → (timeout expires) → HALF-OPEN
HALF-OPEN → (success) → CLOSED
HALF-OPEN → (failure) → OPEN
Use for: RPC node calls, Visa API, external services.
When open: fail fast, don't queue up requests to a dead service.
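The state machine above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not a production implementation: the class name, the parameter names, and the injectable clock are assumptions made for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"  # timeout expired: allow a trial call
            else:
                raise CircuitOpenError("failing fast, service assumed down")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or failures > threshold, opens the circuit.
        if self.state == "HALF-OPEN" or self.failures > self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def flaky_rpc():
    raise ConnectionError("node unreachable")


for _ in range(3):
    try:
        breaker.call(flaky_rpc)
    except ConnectionError:
        pass
state_after_failures = breaker.state

now[0] = 11.0  # past the reset timeout
recovered = breaker.call(lambda: "block 123")
```

After three failures with a threshold of two, the breaker is OPEN; once the timeout has passed, the next successful call closes it again.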

Rate Limiting:

  • Token bucket for API endpoints
  • Per-user rate limits for card authorizations
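  • Per-signer rate limits for blockchain transactions, so each signing key issues its nonces in order and does not build up a backlog of pending transactions (nonce management)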
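A token bucket allows short bursts up to its capacity while enforcing an average rate over time. The sketch below is a minimal single-threaded version; the class name, the parameters, the per-user dictionary, and the allow_authorization helper are all hypothetical.

```python
import time


class TokenBucket:
    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend tokens if enough remain."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


now = [0.0]  # fake clock so the example is deterministic
user_buckets = {}


def allow_authorization(user_id: str) -> bool:
    """Per-user limit: each user gets an independent bucket."""
    if user_id not in user_buckets:
        user_buckets[user_id] = TokenBucket(2, 1.0, clock=lambda: now[0])
    return user_buckets[user_id].allow()


burst = [allow_authorization("alice") for _ in range(3)]
other_user = allow_authorization("bob")  # unaffected by alice's usage
now[0] = 1.0                             # one second later, one token refilled
after_wait = allow_authorization("alice")
```

The same class could back the API-level and per-signer limits by keying the dictionary on endpoint or signer address instead of user ID.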
Backpressure:
  • If the settlement queue grows faster than it drains, slow down accepting new authorizations
  • Better to reject at the edge than overwhelm the system
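A bounded queue is one simple way to apply backpressure: once it is full, new work is refused at the edge instead of piling up. The sketch below uses Python's standard queue module; the queue size and the accept_authorization function are hypothetical.

```python
import queue

# Bounded queue: maxsize is the most unsettled work we are willing to hold.
settlement_queue = queue.Queue(maxsize=3)


def accept_authorization(auth_id: str) -> bool:
    """Enqueue for settlement, or reject immediately if the queue is full."""
    try:
        settlement_queue.put_nowait(auth_id)
        return True
    except queue.Full:
        return False  # the caller would turn this into a "try again later" response


results = [accept_authorization(f"auth-{i}") for i in range(5)]

# A settlement worker drains one item, which frees capacity at the edge.
settled = settlement_queue.get_nowait()
accepted_after_drain = accept_authorization("auth-5")
```

The first three authorizations are accepted and the next two are rejected; once a worker drains an item, intake resumes.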

Key Points

  • At-least-once + idempotency = effectively exactly-once
  • Design for eventual consistency between chain and DB
  • Circuit breakers prevent cascading failures
  • Rate limit at multiple levels: API, per-user, per-signer
  • Backpressure: reject at edge rather than overwhelm internals
