System Design Interview
Questions & Answers
🌱 Fundamental Concepts (Q1–Q14)
System design is the process of defining the architecture, components, modules, interfaces, and data flow of a system to satisfy specified requirements. In interviews, it tests whether you can think holistically about building large-scale, reliable, and maintainable software.
- What interviewers assess: Ambiguity handling, requirement clarification, trade-off reasoning, component knowledge, and communication skills.
- Typical format: 45–60 minutes. Open-ended prompt like "Design Twitter" or "Design a URL shortener."
- Framework to follow: Clarify requirements → Estimate scale → High-level design → Deep dive components → Identify bottlenecks → Propose solutions.
| Feature | Vertical Scaling (Scale Up) | Horizontal Scaling (Scale Out) |
|---|---|---|
| Method | Add more CPU/RAM/SSD to one machine | Add more machines to the pool |
| Limit | Hardware ceiling — finite | Theoretically unlimited |
| Downtime | Often requires restart | No downtime (add nodes live) |
| Complexity | Simple — no code changes | Higher — needs load balancer, stateless design |
| Cost | Expensive at high end | Commodity servers — cost-effective |
| Fault tolerance | Single point of failure | Redundant — node failure is survivable |
| Best for | Databases (often), early stage | Web/app servers, stateless services |
A load balancer distributes incoming traffic across multiple servers to prevent any single server from becoming overloaded. It acts as a reverse proxy and is the entry point for all client requests.
- Round Robin: Requests distributed sequentially to each server. Simple, good for uniform workloads.
- Weighted Round Robin: More powerful servers get proportionally more traffic.
- Least Connections: Routes to server with fewest active connections. Good for varying request durations.
- IP Hash: Same client IP always goes to the same server. Useful for session stickiness.
- Random: Pick a server randomly. Simple and effective at scale.
- Resource Based: Route based on actual CPU/memory usage. Most accurate but needs health reporting.
L4 vs L7: Layer 4 balancers operate at the transport layer (TCP/UDP) — fast but unaware of content. Layer 7 balancers operate at the application layer — can route by URL, headers, cookies, enabling advanced rules.
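The selection logic behind two of these algorithms can be sketched in a few lines of Python. This is a minimal in-process illustration, not a real balancer; the class and server names are hypothetical:

```python
from itertools import count

class RoundRobinBalancer:
    """Cycle through servers in order, one request each."""
    def __init__(self, servers):
        self.servers = servers
        self._counter = count()

    def pick(self):
        return self.servers[next(self._counter) % len(self.servers)]

class LeastConnectionsBalancer:
    """Route to the server currently holding the fewest active connections."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1  # the caller decrements when the request completes
        return server

rr = RoundRobinBalancer(["a", "b", "c"])
print([rr.pick() for _ in range(4)])  # ['a', 'b', 'c', 'a']
```

A real L4/L7 balancer also tracks health checks and removes dead servers from the pool before picking.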
Caching stores copies of expensive-to-compute data closer to the requester to reduce latency and database load. A cache hit serves data from memory in microseconds; a cache miss falls back to the slower data source.
| Strategy | How it works | Best for |
|---|---|---|
| Cache-Aside (Lazy) | App checks cache → miss → load from DB → populate cache | Read-heavy, infrequently updated data |
| Write-Through | Write to cache AND DB synchronously | Read-heavy, consistency important |
| Write-Behind (Write-Back) | Write to cache immediately, DB asynchronously | Write-heavy, tolerable data lag |
| Read-Through | Cache handles DB read on miss — app only talks to cache | Simplified app code |
| Refresh-Ahead | Proactively refresh cache before expiry | Predictable access patterns |
Cache eviction policies: LRU (Least Recently Used — most common), LFU (Least Frequently Used), FIFO, TTL (Time To Live — expire after N seconds).
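The cache-aside flow from the table can be sketched in Python. Plain dicts stand in for Redis and the database here; the key names and TTL are illustrative:

```python
import time

db = {"user:1": {"name": "Ada"}}   # stand-in for the database
cache = {}                         # stand-in for Redis: key -> (value, expires_at)
TTL = 60                           # seconds

def get_user(key):
    """Cache-aside: check cache, fall back to DB on miss, then populate cache."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]            # cache hit: served from memory
    value = db.get(key)            # cache miss: load from the slower data source
    if value is not None:
        cache[key] = (value, time.time() + TTL)
    return value
```

Note that on a popular key expiring, many requests can miss at once (a cache stampede); production systems add locking or probabilistic early refresh on top of this basic flow.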
A CDN is a geographically distributed network of servers (Points of Presence / PoPs) that cache and serve content from the location closest to the user — reducing latency and origin server load.
- Static assets: Images, JS, CSS, fonts — cache with long TTL (months).
- Dynamic content: Can be cached short-term (seconds) with cache-control headers.
- Benefits: Reduced latency (ms vs 100ms+ round trips), DDoS protection, reduced bandwidth cost, SSL offloading.
- Providers: Cloudflare, AWS CloudFront, Fastly, Akamai, Vercel Edge.
An index is a separate data structure (usually a B-tree or hash) that the database maintains to speed up data retrieval. Without an index, the database scans every row (full table scan — O(n)). With an index, it jumps directly to matching rows (O(log n) for B-tree).
- When to add indexes: Columns frequently used in WHERE, JOIN ON, ORDER BY, GROUP BY.
- Types: Single-column, composite (multi-column), unique, partial (filtered), full-text, spatial.
- Write overhead: Every INSERT/UPDATE/DELETE must update all affected indexes — too many indexes slow writes.
- Composite index column order matters: Index on (A, B) helps queries filtering on A or (A, B), but not B alone.
- Index selectivity: High-cardinality columns (email, ID) benefit most. Low-cardinality (boolean, gender) rarely worth indexing alone.
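The O(n) scan vs O(log n) index distinction can be demonstrated with a sorted list and binary search, which is roughly what a B-tree's leaf level gives you. The table and column names here are made up for illustration:

```python
import bisect

rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(100_000)]

# "Index" on email: a sorted list of (email, row position) pairs.
email_index = sorted((r["email"], i) for i, r in enumerate(rows))
index_keys = [k for k, _ in email_index]

def find_by_email(email):
    """Index lookup: binary search, O(log n), instead of scanning every row, O(n)."""
    i = bisect.bisect_left(index_keys, email)
    if i < len(index_keys) and index_keys[i] == email:
        return rows[email_index[i][1]]
    return None
```

This also makes the write overhead tangible: every INSERT would have to splice a new entry into `email_index` to keep it sorted, which is exactly the cost the bullet above warns about.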
| Feature | SQL (Relational) | NoSQL |
|---|---|---|
| Schema | Fixed — must define upfront | Flexible — schema-less |
| Scaling | Vertical (+ read replicas) | Horizontal sharding natively |
| ACID | ✅ Full support | Varies (eventual → strong) |
| Joins | ✅ Built-in | ❌ Must embed or reference manually |
| Best for | Complex relationships, financial, reporting | Flexible schema, high write throughput, huge scale |
NoSQL database types and when to use each:
- Document (MongoDB, Firestore): Flexible JSON documents. Good for catalogs, user profiles, CMS.
- Key-Value (Redis, DynamoDB): Ultra-fast simple lookups. Sessions, caching, leaderboards.
- Column-family (Cassandra, HBase): Write-optimized, time-series. IoT, logs, analytics.
- Graph (Neo4j, Amazon Neptune): Relationship traversal. Social networks, recommendations, fraud detection.
- Search (Elasticsearch, OpenSearch): Full-text search with relevance scoring.
A message queue is a component that enables asynchronous communication between services by holding messages until a consumer processes them. The producer doesn't wait for the consumer to finish.
Problems it solves:
- Decoupling: Producer and consumer don't need to know about each other.
- Load leveling: Absorbs traffic spikes — queue fills up, workers process steadily.
- Reliability: If consumer crashes, messages persist in queue and can be retried.
- Async processing: User gets instant response; email sending / PDF generation happens in background.
- Fan-out: One message delivered to multiple consumers (e.g., new user → send email + provision account + log analytics).
Tools: RabbitMQ (AMQP, complex routing), Kafka (high-throughput streaming, replay), AWS SQS (managed, simple), Redis Streams, Google Pub/Sub.
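The decoupling and async-processing points can be shown with Python's stdlib `queue.Queue` standing in for a real broker like RabbitMQ or SQS. This is a single-process sketch; the message contents are hypothetical:

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for the message broker
done = []

def worker():
    """Consumer: process messages until a None sentinel arrives."""
    while True:
        msg = jobs.get()
        if msg is None:
            break
        done.append(f"emailed {msg}")  # e.g. slow email sending happens here

t = threading.Thread(target=worker)
t.start()

# Producer: enqueue and return immediately -- it never waits for the worker.
for user in ["ada", "grace"]:
    jobs.put(user)
jobs.put(None)  # sentinel: tell the worker to stop
t.join()
print(done)  # ['emailed ada', 'emailed grace']
```

A real broker adds what an in-process queue cannot: persistence across crashes, acknowledgements, retries, and delivery to consumers on other machines.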
The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously. Since network partitions are unavoidable in real networks, the practical choice is between consistency and availability whenever a partition occurs:
- C — Consistency: Every read receives the most recent write or an error. All nodes see the same data at the same time.
- A — Availability: Every request to a non-failing node receives a response, though not necessarily the most recent data.
- P — Partition Tolerance: The system continues operating even when network partitions (node communication failures) occur.
| Choice | Trade-off | Real examples |
|---|---|---|
| CP | May reject requests to stay consistent | HBase, MongoDB (strong), Zookeeper |
| AP | May return stale data — eventual consistency | Cassandra, DynamoDB, CouchDB |
| CA | Cannot tolerate network partition — single node | Traditional RDBMS (single instance) |
| Feature | Monolith | Microservices |
|---|---|---|
| Deployment | Single deployable unit | Each service deploys independently |
| Scaling | Scale everything together | Scale only bottleneck services |
| Team ownership | Shared codebase — coordination needed | Each team owns their service |
| Technology | Single language/framework | Each service can use best tool for the job |
| Latency | In-process function calls (fast) | Network calls between services (slower) |
| Complexity | Simpler initially | Distributed systems complexity |
| Fault isolation | One bug can crash everything | Failures contained to one service |
| Best for | Small teams, early stage, simple domains | Large teams, complex domains, high scale |
An API Gateway is a single entry point for all client requests in a microservices architecture. It handles cross-cutting concerns so individual services don't have to implement them.
- Authentication & Authorization: Validate JWT/API keys before forwarding to services.
- Rate Limiting: Throttle clients to prevent abuse.
- Routing: Route /api/users to User Service, /api/orders to Order Service.
- SSL Termination: Decrypt HTTPS at the gateway; internal traffic can use HTTP.
- Request/Response Transformation: Translate protocols, aggregate multiple service calls.
- Caching: Cache responses at the edge.
- Logging & Monitoring: Centralized observability for all traffic.
Tools: AWS API Gateway, Kong, NGINX, Traefik, Envoy.
Replication copies data from one database (primary) to one or more databases (replicas) for high availability and read scalability.
| Type | How | Trade-off |
|---|---|---|
| Synchronous | Commit only after replica confirms write | Strong consistency, slower writes |
| Asynchronous | Commit immediately, replica catches up | Fast writes, possible replication lag (stale reads) |
| Semi-synchronous | Wait for at least one replica to confirm | Balance of both |
| Logical | Replicate row-level changes | Flexible, selective tables |
| Physical/Streaming | Copy raw WAL (write-ahead log) bytes | Fast, same DB version required |
| Feature | Forward Proxy | Reverse Proxy |
|---|---|---|
| Sits between | Client and internet | Internet and servers |
| Represents | The client (hides client identity) | The server (hides server identity) |
| Client knows about it? | Usually yes (configured) | Transparent to client |
| Use cases | VPNs, content filtering, anonymity | Load balancing, SSL termination, caching, security |
| Examples | Squid, corporate proxies | NGINX, Cloudflare, AWS ALB |
A reverse proxy is the workhorse of modern web architecture. NGINX acting as a reverse proxy can handle: SSL termination (HTTPS → HTTP internally), static file serving, caching, compression, rate limiting, and load balancing — all before the request reaches your app server.
Back-of-envelope estimations help size systems during design interviews. Memorise a few approximate reference numbers: main memory reference ~100 ns; SSD random read ~100 µs; HDD seek ~10 ms; round trip within a datacenter ~0.5 ms; cross-continental round trip ~150 ms. Also remember 1 day ≈ 86,400 seconds, so 1M requests/day ≈ 12 RPS and 100M requests/day ≈ 1,200 RPS.
⚡ Intermediate Topics (Q15–Q28)
This is a classic system design question. Walk through each layer systematically.
Requirements: Shorten URL → 7-char code. Redirect short → original. 100M URLs/day created, 10B redirects/day (100:1 read:write). Links expire optionally. Analytics.
- ID generation: Auto-increment ID → Base62 encode (a-zA-Z0-9) to get 7-char code. Or pre-generate 7-char tokens in batches.
- Database schema: short_code (PK), long_url, user_id, created_at, expires_at, click_count.
- Redirect flow: GET /{code} → check Redis → check MySQL → 301/302 redirect. Cache hot codes in Redis with 24h TTL.
- 301 vs 302: 301 (permanent) is cached by browsers — reduces future load but loses analytics. 302 (temporary) hits your server every time — better for analytics.
- Analytics: Log clicks to Kafka → stream processor → ClickHouse/BigQuery for analysis.
- Scale: 10B redirects/day ≈ 115,000 RPS — multiple app server replicas + Redis cluster handles this comfortably.
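The Base62 encoding step above is short enough to write out. A minimal sketch, assuming an auto-increment integer ID as input:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n: int) -> str:
    """Encode a non-negative auto-increment ID as a Base62 short code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, 62)   # peel off the least-significant base-62 digit
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))
```

Seven Base62 characters give 62⁷ ≈ 3.5 trillion distinct codes, which comfortably covers 100M new URLs/day for decades.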
Consistent hashing is a technique for distributing data across servers so that adding or removing a server requires remapping only a minimal fraction of keys — unlike simple modular hashing which remaps everything.
The problem with key % N: If you have 4 servers and add a 5th, key % 4 vs key % 5 gives different results for almost every key → massive cache invalidation, thundering herd.
Consistent hashing solution:
- Imagine a ring (hash space 0 to 2³²).
- Hash each server to a position on the ring.
- Hash each key to a position on the ring.
- Each key is served by the first server clockwise from its position.
- Adding/removing a server only affects keys between that server and its predecessor — typically 1/N of all keys.
- Virtual nodes (vnodes): Each physical server maps to multiple positions on the ring for better load distribution and smaller rebalancing impact.
Sharding (horizontal partitioning) splits a database table across multiple database servers. Each shard holds a subset of the data — enabling horizontal scaling beyond what a single machine can handle.
| Strategy | How | Pros | Cons |
|---|---|---|---|
| Range-based | Shard by value range (A-M, N-Z or dates) | Simple, range queries easy | Hotspots if data not uniform |
| Hash-based | shard = hash(key) % N | Even distribution | Range queries span all shards |
| Directory-based | Lookup table maps key → shard | Flexible, easy to rebalance | Lookup table is a bottleneck/SPOF |
| Geographic | Data lives in user's region | Low latency, data residency | Cross-region queries difficult |
Sharding challenges: Cross-shard JOINs are expensive. Transactions spanning shards need distributed transaction protocols (2PC). Resharding when adding nodes is complex. Choose the shard key carefully — changing it later is very painful.
Rate limiting controls how many requests a client (by IP, user ID, or API key) can make in a given time window — protecting APIs from abuse, DoS attacks, and fair usage enforcement.
| Algorithm | How it works | Best for |
|---|---|---|
| Token Bucket | Bucket holds N tokens, fills at rate R. Each request consumes 1 token. | Smooth average with bursting allowed |
| Leaky Bucket | Requests queue; processed at fixed rate. Overflow dropped. | Smooth outflow, no burst |
| Fixed Window Counter | Count requests per N-second window. Reset on tick. | Simple, but edge spikes possible |
| Sliding Window Log | Track timestamps of each request. Count those in last N seconds. | Accurate but memory-heavy |
| Sliding Window Counter | Hybrid — approximate sliding window with low memory | Best accuracy/memory balance |
At scale, rate limiting state must be stored in a shared layer (Redis) accessible by all API server instances — a single server's memory won't work when you have many replicas.
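The token bucket from the table is simple to implement. A single-process sketch; as the paragraph above notes, at scale the token count would live in shared Redis (commonly updated atomically via a Lua script) rather than in local memory:

```python
import time

class TokenBucket:
    """Bucket holds up to `capacity` tokens, refilled at `rate` tokens/second."""
    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # each request consumes one token
            return True
        return False
```

Because a full bucket can be drained at once, bursts up to `capacity` are allowed while the long-run average stays at `rate`, which is exactly the trade-off the table describes.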
A notification system needs to send messages across multiple channels (push, email, SMS) reliably, at scale, with minimal latency.
- User preferences: Store per-user channel + opt-out settings. Check before sending.
- Deduplication: Use idempotency keys to prevent duplicate sends on retry.
- Priority queues: Separate queues for real-time (OTP, alerts) vs batch (newsletters).
- Retry logic: Exponential backoff for failed sends. Dead letter queue for permanently failed.
- Template engine: Personalised content per user, localised per language.
- Delivery tracking: Webhooks from providers → update delivery status in DB → analytics.
| Feature | Push | Pull (Polling) |
|---|---|---|
| Who initiates | Server pushes to client | Client polls server |
| Latency | Near real-time | Up to poll interval |
| Server load | Must manage connections | Predictable (polling rate) |
| Client complexity | Must maintain connection | Simple HTTP requests |
| Missed data | Must handle reconnects | Client controls what it needs |
| Examples | WebSockets, SSE, push notifications | REST polling, long-polling, cron jobs |
Hybrid approaches:
- Long polling: Client makes request, server holds it open until data available (or timeout). Better than short polling but still HTTP-based.
- Server-Sent Events (SSE): One-directional push over HTTP. Server streams updates, client can't send back. Good for dashboards, live feeds.
- WebSockets: Full duplex — both sides can send at any time. Best for chat, gaming, collaborative tools. Higher infrastructure complexity.
Eventual consistency means that if no new updates are made, all replicas will eventually converge to the same value — but at any given moment, different nodes may return different data (stale reads).
- Read-your-writes: After a user writes, route their subsequent reads to the primary (or the same replica) to avoid seeing stale data. Common for profile updates.
- Monotonic reads: Ensure a user never sees older data than they've already seen — use sticky sessions or consistent hashing to route to same replica.
- CRDT (Conflict-free Replicated Data Types): Data structures that can be merged without conflicts — counters, sets, last-write-wins registers.
- Idempotent operations: Make operations safe to replay — critical when retrying after network failures.
- Sagas: For distributed transactions, break into steps with compensating actions for rollback instead of 2PC.
The circuit breaker pattern prevents cascading failures in distributed systems by stopping calls to a failing service and allowing it time to recover — like an electrical circuit breaker that trips to prevent damage.
- Fallback behaviour: When OPEN, return cached data, default value, or friendly error instead of hanging.
- Libraries: Netflix Hystrix (legacy), Resilience4j (Java), Polly (.NET), pybreaker (Python).
- Service mesh: Envoy, Istio implement circuit breaking transparently at infrastructure level.
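The CLOSED → OPEN → HALF_OPEN cycle can be sketched in a small class. This is a simplified illustration (hypothetical names, no thread safety), not a substitute for Resilience4j or a service mesh:

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures; trial call after `reset_after` seconds."""
    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means CLOSED

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # OPEN: fail fast instead of hanging on a dead service
            self.opened_at = None      # HALF_OPEN: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0              # success resets the failure count
        return result
```

The fallback is where the cached data or default value mentioned above gets returned.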
- Data structure: In-memory hash table (O(1) get/set). Keys are strings; values can be strings, lists, hashes, sets, sorted sets.
- Persistence: AOF (Append Only File) — log every write command for durability. RDB (point-in-time snapshots) — faster recovery but potential data loss. Both can be combined.
- Single-threaded event loop: Redis processes commands sequentially — no locking needed, very fast. Network I/O is multiplexed (epoll/kqueue).
- Replication: Primary-replica async replication. Replica streams AOF from primary.
- Cluster mode: Data sharded across 16,384 hash slots distributed among nodes. Each key is assigned to a slot via CRC16(key) % 16384.
- Eviction policies: LRU, LFU, random, TTL-based. Configure maxmemory-policy.
- Pub/Sub: Lightweight message broadcasting — publishers send to channels, subscribers receive.
In microservices, services spin up and down dynamically (containers, auto-scaling). Hard-coded IPs/ports break immediately. Service discovery lets services find each other dynamically by name.
| Pattern | How | Examples |
|---|---|---|
| Client-side discovery | Client queries registry, picks instance, calls directly | Eureka + Ribbon (Netflix OSS) |
| Server-side discovery | Client calls load balancer → LB queries registry | AWS ALB + ECS, Kubernetes Service |
| DNS-based | Service registered as DNS record; clients resolve normally | Consul DNS, Kubernetes DNS |
- Service registry tools: Consul, Zookeeper, etcd, AWS Cloud Map, Kubernetes built-in.
- Health checks: Registry periodically pings services; removes unhealthy instances from pool.
- Service mesh (Envoy/Istio): Handles discovery, load balancing, mTLS, circuit breaking at the sidecar proxy layer — no code changes needed.
A Saga is a sequence of local transactions where each step publishes an event that triggers the next step. If any step fails, compensating transactions are run to undo previous steps — instead of using a two-phase commit (2PC) which locks resources across services.
| Pattern | Choreography | Orchestration |
|---|---|---|
| Coordinator | None — services react to events | Central saga orchestrator |
| Coupling | Lower | Services coupled to orchestrator |
| Visibility | Hard to trace across services | Easy — orchestrator tracks state |
CQRS (Command Query Responsibility Segregation) separates write operations (commands) from read operations (queries) — using different models, and often different databases, for each.
Event Sourcing stores the history of all events that led to the current state instead of just the current state. The current state is derived by replaying events.
- Benefits: Full audit trail, time-travel queries, scale reads/writes independently, rebuild any view by replaying events.
- Challenges: Eventual consistency between write and read models, higher complexity, event schema evolution.
- Good for: Financial transactions, audit systems, collaborative editing, high-traffic e-commerce.
- Tools: EventStoreDB, Kafka (as event log), Axon Framework.
- Eliminate single points of failure (SPOF): Every component should have at least one backup. Load balancer pairs (active-passive or active-active). Multi-AZ database deployments.
- Redundancy: N+1 redundancy minimum. Critical systems N+2 or more. Active-active (both handle traffic) vs active-passive (standby takes over).
- Health checks & auto-healing: Load balancers and orchestrators (Kubernetes) automatically route away from unhealthy instances and restart failed containers.
- Geographic distribution: Multi-region deployment. DNS failover routes traffic to healthy region if one region goes down.
- Graceful degradation: System continues with reduced functionality rather than total failure. Feature flags to disable expensive features under load.
- Timeouts & retries: Never make calls without timeouts. Retry with exponential backoff + jitter. Distinguish retryable (5xx, network) from non-retryable (4xx) errors.
- Bulkhead pattern: Isolate resources (thread pools, connection pools) per dependency so one slow service can't exhaust all resources.
- Chaos engineering: Deliberately inject failures (kill random pods, add latency) to validate resilience. Netflix Chaos Monkey pioneered this.
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars are Logs, Metrics, and Traces.
| Pillar | What it is | Tools | Answers |
|---|---|---|---|
| Logs | Timestamped records of discrete events | ELK Stack, Loki, CloudWatch | "What happened?" |
| Metrics | Numeric measurements over time (CPU, latency, error rate) | Prometheus, Grafana, Datadog | "How is the system performing?" |
| Traces | End-to-end request journey across services (spans) | Jaeger, Zipkin, AWS X-Ray, Tempo | "Where is the bottleneck?" |
- Golden signals (Google SRE): Latency, Traffic (requests/sec), Errors (rate), Saturation (resource utilisation). Monitor these for every service.
- SLI/SLO/SLA: Service Level Indicator (what you measure), Service Level Objective (target %, e.g. 99.9% uptime), Service Level Agreement (contract with penalty).
- Error budgets: If SLO is 99.9% uptime, you have 0.1% = ~43 min/month error budget. Once consumed, freeze new deployments until budget resets.
- Alerting: Alert on symptoms (user-visible impact), not causes. Avoid alert fatigue — too many alerts and engineers ignore them all.
🔥 Advanced Architecture (Q29–Q40)
Scale: 500M users, 500M tweets/day, 28B timeline reads/day (reads >> writes, 56:1 ratio). This is the canonical fan-out problem.
Two approaches for feed generation:
| Approach | Fan-out on Write (Push) | Fan-out on Read (Pull) |
|---|---|---|
| When happens | On tweet creation — push to all followers' feeds | On read — merge tweets from all followees |
| Read speed | ⚡ Instant — pre-computed feed in Redis | 🐌 Slow — N DB queries per read |
| Write cost | High — celebrity with 10M followers = 10M writes | Low — one write |
| Problem | Celebrity problem (huge write amplification) | Hot read problem (all followers read at once) |
Twitter's hybrid approach: Fan-out on write for regular users (push tweets to follower Redis feeds). Fan-out on read for celebrities (>10K followers) — inject their tweets at read time. This limits write amplification while keeping reads fast.
- Feed storage: Redis sorted set per user (score = timestamp). Only keep last ~800 tweets per user. Older tweets fetched from DB on scroll.
- Feed service: Merges pre-computed feed with celebrity tweets, applies ranking/algorithm, paginates with cursor.
- Ranking: Engagement signals (likes, retweets, replies, recency, relationship strength) scored per item.
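The hybrid fan-out can be sketched with plain dicts standing in for the Redis structures. The threshold, names, and data shapes here are illustrative only:

```python
FOLLOWER_CAP = 10_000  # above this, treat the author as a "celebrity"

followers = {"alice": ["bob", "carol"]}  # author -> followers
feeds = {}              # user -> [(timestamp, tweet)]; stand-in for per-user Redis sorted sets
celebrity_tweets = {}   # celebrity -> [(timestamp, tweet)]; merged at read time

def post(author, ts, text):
    """Fan-out on write for regular users; defer celebrities to read time."""
    if len(followers.get(author, [])) > FOLLOWER_CAP:
        celebrity_tweets.setdefault(author, []).append((ts, text))
        return  # no write amplification: one stored tweet, zero feed writes
    for f in followers.get(author, []):
        feeds.setdefault(f, []).append((ts, text))

def read_feed(user, following_celebs=()):
    """Merge the pre-computed feed with followed celebrities' tweets, newest first."""
    items = list(feeds.get(user, []))
    for c in following_celebs:
        items.extend(celebrity_tweets.get(c, []))
    return [text for ts, text in sorted(items, reverse=True)]
```

The ranking step described above would replace the simple timestamp sort in `read_feed`.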
Requirements: Store/retrieve files of any size, 1B users, sync across devices, share files, 10 PB storage.
- Chunking: Split large files into 4MB chunks. Upload chunks in parallel. Only re-upload changed chunks on edit (delta sync).
- Deduplication: Hash each chunk (SHA-256). If hash exists in store, reference existing chunk — save storage for common files.
- Metadata DB: Tracks files, directories, versions, sharing permissions, chunk references.
- Object storage: S3-compatible (AWS S3, GCS, Azure Blob) — durable, cheap, infinitely scalable.
- Sync protocol: Long polling or WebSocket to notify clients of changes. Conflict resolution: last-write-wins or create conflict copy.
- CDN for downloads: Serve frequently accessed files from CDN edge nodes.
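Chunking plus hash-based deduplication fits in a few lines. A dict stands in for object storage here, and the chunk size is shrunk for illustration (the design above uses 4 MB):

```python
import hashlib

chunk_store = {}  # sha256 hex digest -> bytes; stand-in for S3-style object storage

def upload(data: bytes, chunk_size: int) -> list[str]:
    """Split into fixed-size chunks, store each unique chunk once, return the manifest."""
    manifest = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # dedup: skip if this chunk already exists
        manifest.append(digest)
    return manifest

def download(manifest: list[str]) -> bytes:
    """Reassemble a file from its ordered list of chunk hashes."""
    return b"".join(chunk_store[d] for d in manifest)
```

Delta sync falls out of the same structure: editing one chunk changes one hash in the manifest, so only that chunk is re-uploaded.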
A multi-region distributed rate limiter must coordinate counters across geographically distributed nodes without introducing prohibitive latency or single points of failure.
- Centralized Redis: All API servers write to a single Redis cluster. Simple but the Redis cluster becomes a potential bottleneck and adds latency for distant regions.
- Local + global hybrid: Each region maintains a local counter (fast). Periodically sync to global count (async). Allow small over-limit bursts between sync cycles.
- Token bucket with sync: Each node holds a portion of the token budget. When depleted, request more from a central token service.
- CRDTs for counters: Use increment-only CRDT counters that can be merged from multiple nodes without coordination. Eventually consistent but accurate over time.
- Sliding window at the edge: Implement rate limiting at CDN/edge (Cloudflare Workers, Fastly Compute) — closest to users, before traffic hits your servers.
- Accept slight over-limiting: In a globally distributed system, perfect accuracy requires coordination that adds latency. For most APIs, being slightly over or under limit briefly is acceptable.
Real-time collaboration requires handling concurrent edits from multiple users simultaneously without losing changes or corrupting the document.
- Operational Transformation (OT): Google Docs' original approach. Each operation is transformed against concurrent operations to preserve intent. Complex to implement correctly.
- CRDT (Conflict-free Replicated Data Type): Newer approach (Figma, Linear). Data structures designed so concurrent edits automatically merge without conflict. More predictable.
- Architecture: WebSocket connection per user to a document server. Document server broadcasts operations to all connected users. Operations stored as event log.
- Cursors & presence: Broadcast cursor positions via WebSocket. Each user's cursor shown in real-time with their colour/name.
- Document sharding: Assign each document to a document server. Use consistent hashing so all users of the same document connect to the same server.
- Persistence: Operations appended to append-only log. Document state snapshotted periodically. On load, replay from latest snapshot + recent ops.
- Offline support: Queue operations locally, merge on reconnect using OT/CRDT.
A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is in a set. It can have false positives (says "yes" when answer is "no") but never false negatives (never says "no" when answer is "yes").
- How it works: A bit array of m bits + k hash functions. To add element: set bits at k hash positions. To check: if all k positions are set → "probably in set". If any is 0 → "definitely not in set".
- Space advantage: 1 billion items stored with 1% false positive rate needs ~1.14 GB vs ~8 GB for a hash set.
Real-world use cases:
- URL shortener: Check if a short code is already taken before DB query — avoid expensive lookups for definitely-new keys.
- Username availability: Instantly check if a username might exist. Only query DB on "might exist" response.
- Spam filter: Check if an email domain is in a list of known spam domains.
- Database caching: Don't bother checking cache if you know data isn't there (cache miss avoidance). Used by RocksDB, Cassandra, HBase.
- Cryptocurrency: Bitcoin uses Bloom filters for lightweight node wallet filtering.
- Password breachlist: Check if a password appears in a breached password list without storing the full list.
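The bit-array mechanism described above is compact enough to implement directly. SHA-256 with a per-function salt stands in for the k independent hash functions, and `m` and `k` here are toy values rather than tuned ones:

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = 0  # a Python int used as an m-bit array

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "probably in set"; any bit clear -> "definitely not in set".
        return all(self.bits >> p & 1 for p in self._positions(item))
```

For real workloads, m and k are derived from the expected item count and target false-positive rate (roughly 9.6 bits per item for 1%).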
Generating unique IDs at scale across distributed servers without coordination is a classic system design problem. IDs should ideally be unique, sortable by time, and generated without a single coordinator.
| Approach | Structure | Pros | Cons |
|---|---|---|---|
| UUID v4 | 128-bit random | Simple, no coordination | Not sortable, large, random index fragmentation |
| Auto-increment DB | Sequential integer | Sortable, compact | Single point of failure, bottleneck |
| Twitter Snowflake | 41-bit timestamp + 10-bit machine ID + 12-bit sequence | Sortable, distributed, 4096/ms/machine | Clock skew issues, 69-year range |
| ULID | 48-bit time + 80-bit random | Sortable, URL-safe, 128-bit | Slightly larger than Snowflake |
| Ticket Server | Dedicated DB server serving ranges | Simple, sequential | SPOF if not replicated |
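The Snowflake layout from the table can be assembled with bit shifts. A simplified single-node sketch (Twitter's published epoch, no clock-rollback handling):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit ID: 41-bit ms timestamp | 10-bit machine ID | 12-bit per-ms sequence."""
    EPOCH = 1288834974657  # Twitter's custom epoch (2010-11-04), in milliseconds

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024  # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            ms = int(time.time() * 1000) - self.EPOCH
            if ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 4096 IDs per ms per machine
                if self.sequence == 0:                       # exhausted: spin to the next ms
                    while ms <= self.last_ms:
                        ms = int(time.time() * 1000) - self.EPOCH
            else:
                self.sequence = 0
            self.last_ms = ms
            return (ms << 22) | (self.machine_id << 12) | self.sequence
```

Because the timestamp occupies the high bits, IDs sort by creation time, which keeps B-tree index inserts sequential (unlike UUID v4).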
Requirements: 5 suggestions per keystroke. Return in <100ms. Rank by popularity. Handle 10M queries/day.
- Trie data structure: Each node represents a character. At terminal nodes, store top-N suggestions ranked by frequency. Prefix traversal gives all suggestions starting with typed text.
- Frequency tracking: Log all search queries → stream to Kafka → aggregate weekly/daily → update suggestion scores.
- Caching strategy: Most popular prefixes (first 1-3 chars) cover most traffic. Cache these in Redis sorted sets (score = frequency). Remaining prefixes hit Trie DB.
- Offline aggregation: Batch job (MapReduce/Spark) computes top suggestions per prefix weekly. Incremental updates for trending terms.
- Filtering: Remove offensive/adult terms. Support personalisation (user's past searches ranked higher).
- Latency target: CDN serves static prefix cache → <10ms. API server with Redis → <50ms. Accept eventual consistency for suggestion counts.
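The core trick, storing pre-ranked top-N suggestions at every prefix node, can be sketched as follows (class names and the naive insert-time sorting are illustrative; the batch job above would precompute these lists):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # pre-computed top suggestions for this prefix: (freq, query)

class AutocompleteTrie:
    """Top-N completions stored at each prefix node -> O(len(prefix)) lookups."""
    def __init__(self, n: int = 5):
        self.root = TrieNode()
        self.n = n

    def insert(self, query: str, freq: int):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
            node.top.append((freq, query))
            node.top.sort(reverse=True)  # highest frequency first
            del node.top[self.n:]        # keep only the top N

    def suggest(self, prefix: str):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for f, q in node.top]
```

Lookup cost is independent of how many queries are stored, which is what makes the <100ms target achievable.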
- Upload: Direct-to-S3 upload via pre-signed URL (bypasses app servers). Resume with multipart upload for large files.
- Transcoding: Convert raw video to multiple resolutions (240p, 480p, 720p, 1080p, 4K) and formats (HLS, DASH) using FFmpeg workers. HLS splits video into ~10-second segments for adaptive streaming.
- Adaptive bitrate streaming (ABR): Client player switches quality dynamically based on bandwidth. Starts low, increases as connection stabilises.
- CDN delivery: Video segments cached at CDN PoPs globally. Popular videos served from edge — no origin hit.
- Metadata DB: PostgreSQL for video metadata (title, description, tags, counts). Elasticsearch for search indexing.
- Comments & likes: Cassandra for comments (high write throughput). Redis for like counts (approximate is fine).
- Recommendation: ML-based (watch history, engagement signals) — separate ML pipeline.
Leader election is the process of designating one node in a cluster as the coordinator (leader) — responsible for decision-making, while others are followers. When the leader fails, a new leader must be elected quickly.
- Raft consensus algorithm: Nodes can be leader, follower, or candidate. Candidates request votes from peers. First to get majority becomes leader. Leader sends heartbeats to prevent new elections. Used by etcd, CockroachDB, TiKV.
- Paxos: The theoretical foundation. More complex than Raft — harder to understand and implement correctly. Multi-Paxos used by Google Chubby, Spanner.
- Zookeeper (ZAB): Atomic broadcast protocol. Clients race to create an ephemeral node at /leader. First to create it wins. Others watch for deletion to trigger re-election.
- Bully algorithm: Node with the highest ID sends an election message to all higher-ID nodes. If none responds, it becomes leader. Simple but causes high message overhead.
Payment systems require the highest levels of consistency, reliability, and security. Money cannot be lost or double-charged.
- Idempotency keys: Each payment request carries a unique key. Duplicate requests return the same result without processing twice. Stored in DB with TTL.
- Ledger (double-entry bookkeeping): Every debit has a corresponding credit. Ledger is append-only — never update/delete records. Balance = sum of all transactions for an account.
- Exactly-once delivery: Use idempotency + outbox pattern. Write payment record + event in same DB transaction. Separate poller reads outbox and publishes events.
- Retry strategy: Network failures are common with payment providers. Retry with exponential backoff, checking idempotency to prevent double charge.
- PCI DSS compliance: Never store raw card numbers. Use tokenization (Stripe Elements handles card data before it touches your servers).
- Reconciliation: Daily batch job reconciles your ledger against payment processor statements to catch discrepancies.
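The idempotency-key check can be sketched in a few lines. A dict stands in for the DB table here, and the key/result shapes are hypothetical; a real implementation must make the key insert and the charge atomic (same DB transaction, or a SETNX-style reservation) to be safe under concurrency:

```python
payments_db = {}  # idempotency_key -> stored result; stand-in for a DB table
charges = []      # side effects that actually reached the payment provider

def charge(idempotency_key: str, amount_cents: int) -> dict:
    """Replay-safe charge: a duplicate key returns the stored result, no second charge."""
    if idempotency_key in payments_db:
        return payments_db[idempotency_key]   # retry detected: replay the first outcome
    charges.append(amount_cents)              # the real provider call would happen here
    result = {"status": "succeeded", "amount": amount_cents}
    payments_db[idempotency_key] = result
    return result
```

This is why retrying with exponential backoff (as recommended above) is safe: the retry carries the same key, so the customer is charged exactly once.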
Core challenge: Real-time location tracking of millions of drivers, fast matching of drivers to riders, ETA calculation, and price surging — all at city scale.
- Location storage: Driver location updates every ~5 seconds via mobile. Store in Redis with geospatial indexing (GEOADD, GEODIST, GEORADIUS). Redis holds recent locations; DB holds historical trips.
- Matching algorithm: When a rider requests → find nearest N available drivers using a geospatial query → rank by ETA, rating, vehicle type → offer to top candidate → if declined, offer to next. Time limit per offer (10–15s).
- Geospatial indexing: Divide city into S2 cells (Google S2 library). Map drivers to cells. Search radiating cells outward from pickup point.
- WebSocket for real-time: Persistent WebSocket connection per driver and rider. Location updates, trip status, notifications pushed via WebSocket.
- Trip state machine: REQUESTED → DRIVER_ACCEPTED → DRIVER_ARRIVED → TRIP_STARTED → COMPLETED → PAYMENT_PROCESSED.
- Surge pricing: Demand (ride requests per area) / supply (available drivers per area) ratio → multiplier applied. Updated every ~1 minute per geohash zone.
- ETA calculation: Routing engine (OSRM, Google Maps API) calculates drive time. Factor in real-time traffic data.
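The core of the matching step is a nearest-neighbour ranking. A naive pure-Python sketch using straight-line (haversine) distance; real systems use the S2/geohash cells and road-network ETAs described above, and the driver records here are hypothetical:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

def nearest_drivers(pickup, drivers, n=3):
    """Rank available drivers by straight-line distance to the pickup point."""
    lat, lon = pickup
    return sorted(drivers, key=lambda d: haversine_km(lat, lon, d["lat"], d["lon"]))[:n]
```

In production, this candidate list would then be re-ranked by routing-engine ETA, since straight-line distance ignores one-way streets, rivers, and traffic.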
- Premature optimisation: Building for 1B users when you have 1K. Complexity is expensive to maintain. Build what you need now, design for horizontal scaling later.
- Distributed monolith: Microservices that are tightly coupled and must deploy together. Gets the worst of both worlds — network latency without independence.
- Chatty services: Service A makes 20 calls to Service B to complete one user request. High latency, fragile. Fix with aggregation, batching, or combining services.
- Shared database between services: Multiple services directly accessing the same DB schema. Breaks service independence — schema changes break all consumers. Each service should own its data store.
- Synchronous chains: Request must wait for A → B → C → D. One slow service blocks everything. Use async (queues) where response isn't immediately needed.
- Missing idempotency: Operations that aren't safe to retry. Crucial for payments, notifications, and any external API call.
- N+1 queries: Fetching a list then making one query per item. Use JOINs, eager loading, or batch APIs.
- No graceful degradation: System shows error page when a non-critical service is down. Feature flags and fallbacks maintain core functionality.
- Over-relying on a single cache layer: If Redis goes down and your app has no fallback to DB, everything breaks. Plan for cache failure.
- Not testing at scale: Load tests reveal bottlenecks invisible at low traffic. Use tools like k6, Gatling, or Locust before launch.