# Scaling Architecture Decisions

## Architecture Decision Records (ADR)

**Purpose:** Document key architectural decisions for the 50k user scaling initiative.

**Format:** Context, Decision, Alternatives Considered, Consequences

**Status:** Each decision is marked as Accepted, Rejected, Superseded, or Deprecated

---

## ADR-001: Redis for Distributed Caching

**Date:** January 13, 2025
**Status:** Accepted
**Decision Makers:** Engineering Team

### Context

We need a fast, distributed caching layer to:

- Reduce API calls to Optic Odds (6000/min limit)
- Speed up frequently accessed data (sports, leagues, odds)
- Enable horizontal scaling (share cache across instances)
- Support pub/sub for WebSocket broadcasting

### Decision

**Use Redis as the primary caching layer with a tiered TTL strategy.**

- **Provider:** Redis Cloud (Railway Redis for <1k users)
- **Configuration:** Connection pooling (50 max connections)
- **Cache Strategy:** HOT (5min), WARM (1hr), COLD (24hr)
- **Eviction Policy:** allkeys-lru (least recently used)
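
A minimal cache-aside sketch of this strategy, assuming the `redis-py` client; the key naming and the `fetch_from_optic_odds` helper are hypothetical stand-ins. It also shows the fallback to a direct API call when Redis is unavailable (see Mitigation below):

```python
import json

import redis

# Tier TTLs in seconds: HOT (5 min), WARM (1 hr), COLD (24 hr).
TTL = {"hot": 300, "warm": 3600, "cold": 86400}

# Pooled client, matching the 50-connection cap from the decision above.
pool = redis.ConnectionPool(host="localhost", port=6379, max_connections=50, decode_responses=True)
cache = redis.Redis(connection_pool=pool)


def fetch_from_optic_odds(fixture_id: str) -> dict:
    """Stand-in for the real Optic Odds client call (hypothetical)."""
    return {}


def get_odds(fixture_id: str, tier: str = "hot") -> dict:
    """Cache-aside lookup: Redis first, direct API on a miss or on Redis failure."""
    key = f"odds:{fixture_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        fresh = fetch_from_optic_odds(fixture_id)
        cache.set(key, json.dumps(fresh), ex=TTL[tier])
        return fresh
    except redis.RedisError:
        # Mitigation from the Consequences section: fall back to a direct API call.
        return fetch_from_optic_odds(fixture_id)
```

The `allkeys-lru` eviction policy itself is server-side configuration (the `maxmemory-policy` directive), not application code.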
### Alternatives Considered

#### Alternative 1: Memcached

**Pros:**
- Simpler than Redis
- Slightly faster for pure key-value
- Lower memory overhead

**Cons:**
- No pub/sub (critical for WebSocket)
- No data structures (lists, sets, sorted sets)
- No persistence (data lost on restart)
- Less popular (fewer tools, support)

**Verdict:** Rejected due to lack of pub/sub.

#### Alternative 2: In-Memory (Python dict)

**Pros:**
- Zero latency
- No external dependency
- Free

**Cons:**
- Not distributed (can't share across instances)
- Lost on restart
- Limited by server RAM
- No horizontal scaling

**Verdict:** Rejected - doesn't scale.

#### Alternative 3: Database-backed cache (Supabase table)

**Pros:**
- Persistent
- Already have Supabase
- No extra cost

**Cons:**
- Much slower (50-100ms vs 1-2ms)
- Adds load to database
- No pub/sub
- Connection overhead

**Verdict:** Rejected - too slow.

### Consequences

**Positive:**
- 1-2ms cache lookups (vs 200ms+ API calls)
- Pub/sub enables WebSocket horizontal scaling
- Persistent caching (survives restart with RDB)
- Rich data structures for complex caching
- Industry standard (well-documented, supported)

**Negative:**
- Additional cost ($15-180/month depending on scale)
- Single point of failure (mitigated with Redis Cluster in Phase 3)
- Memory limits (mitigated with eviction policy)

**Mitigation:**
- Implement Redis Cluster for HA in Phase 3 (10k+ users)
- Monitor cache hit rate and evictions
- Fall back to the direct API on Redis failure

---

## ADR-002: WebSocket over Server-Sent Events (SSE)

**Date:** January 13, 2025
**Status:** Accepted

### Context

Need real-time odds updates without constant polling. Options:
- WebSocket (bidirectional)
- Server-Sent Events (SSE, unidirectional)
- Long Polling (outdated)
- HTTP/2 Server Push (deprecated)

### Decision

**Use WebSocket with Redis pub/sub for horizontal scaling.**

- **Protocol:** WebSocket (ws://)
- **Library:** FastAPI native WebSocket support
- **Scaling:** Redis pub/sub for cross-instance messaging
- **Reconnection:** Exponential backoff (2s, 5s, 10s, 30s)
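
A minimal sketch of the fan-out path under this decision, assuming FastAPI's native WebSocket support and the `redis-py` asyncio client; the endpoint path and the `odds:updates` channel name are hypothetical:

```python
import asyncio

import redis.asyncio as aioredis
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
# Every app instance subscribes to the same channel, so a publish from any
# instance reaches clients connected to all instances (horizontal scaling).
redis_client = aioredis.Redis(host="localhost", port=6379, decode_responses=True)


@app.websocket("/ws/odds")
async def odds_websocket(websocket: WebSocket):
    await websocket.accept()
    pubsub = redis_client.pubsub()
    await pubsub.subscribe("odds:updates")

    async def forward_updates() -> None:
        # Push every message published on the Redis channel down this socket.
        async for message in pubsub.listen():
            if message["type"] == "message":
                await websocket.send_text(message["data"])

    forwarder = asyncio.create_task(forward_updates())
    try:
        while True:
            # Bidirectional: the client sends subscription changes (e.g. game IDs) here.
            await websocket.receive_text()
    except WebSocketDisconnect:
        pass
    finally:
        forwarder.cancel()
        await pubsub.unsubscribe("odds:updates")
```

Broadcasting an update is then a single `await redis_client.publish("odds:updates", payload)` from whichever worker fetched fresh odds.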
### Alternatives Considered

#### Alternative 1: Server-Sent Events (SSE)

**Pros:**
- Simpler than WebSocket
- Built-in reconnection
- Works over HTTP (easier for proxies)
- Lower server overhead

**Cons:**
- One-way only (server -> client)
- Can't send user actions over the same connection
- Limited browser support for large messages
- HTTP overhead (headers on every message)

**Verdict:** Rejected - need bidirectional for user subscriptions.

#### Alternative 2: Optimized Polling

**Pros:**
- Simplest to implement
- Already working
- No connection management

**Cons:**
- Wastes bandwidth (empty responses)
- Higher latency (poll interval = min latency)
- Server load (constant requests)
- Not truly real-time

**Verdict:** Rejected - not scalable to 50k users.

### Consequences

**Positive:**
- True real-time updates (<500ms latency)
- Bidirectional (client can subscribe to specific games)
- Efficient (one connection, many messages)
- Redis pub/sub enables horizontal scaling

**Negative:**
- More complex than SSE
- Connection management overhead
- Requires load balancer sticky sessions
- Redis pub/sub adds dependency

**Mitigation:**
- Implement connection pooling and health checks
- Graceful degradation to polling if WebSocket fails
- Heartbeat ping/pong to detect dead connections

---

## ADR-003: PgBouncer for Connection Pooling

**Date:** January 13, 2025
**Status:** Accepted

### Context

Supabase Free tier = 100 max connections. At scale:

- 10 FastAPI instances × 50 connections = 500 connections needed
- Without pooling, we hit connection limits at ~200 concurrent users

### Decision

**Deploy PgBouncer in transaction pooling mode.**

- **Pooling Mode:** Transaction (not session)
- **Max Client Connections:** 10,000
- **Pool Size:** 50-100 actual DB connections
- **Deployment:** Docker container on Railway
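
A sketch of how each FastAPI instance would connect through PgBouncer with asyncpg (the DSN and pool sizes are illustrative); `statement_cache_size=0` is needed because transaction pooling does not support asyncpg's named prepared statements, one of the feature caveats noted under Consequences:

```python
import asyncpg

# Hypothetical DSN: application traffic goes through PgBouncer on port 6432,
# while migrations keep using the direct DATABASE_URL (see Mitigation below).
PGBOUNCER_DSN = "postgresql://app_user:password@pgbouncer.railway.internal:6432/postgres"


async def init_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        dsn=PGBOUNCER_DSN,
        min_size=5,
        max_size=50,              # per-instance cap; PgBouncer multiplexes onto 50-100 real connections
        statement_cache_size=0,   # required in transaction pooling mode (no prepared-statement reuse)
    )
```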
### Alternatives Considered

#### Alternative 1: Upgrade Supabase to higher tier

**Pros:**
- No additional setup
- Managed by Supabase

**Cons:**
- Expensive ($25/mo -> $599/mo for 400 connections)
- Still limited (hard cap at 400)
- Doesn't solve fundamental pooling issue

**Verdict:** Rejected - not cost-effective.

#### Alternative 2: Reduce asyncpg pool size per instance

**Pros:**
- No changes needed
- Free

**Cons:**
- Limits throughput per instance
- Doesn't scale (same problem, just delayed)
- Connection queuing under load

**Verdict:** Rejected - kicks the can down the road.

#### Alternative 3: PgPool-II

**Pros:**
- More features than PgBouncer
- Load balancing built-in

**Cons:**
- More complex configuration
- Higher overhead
- Overkill for our use case

**Verdict:** Rejected - PgBouncer simpler and sufficient.

### Consequences

**Positive:**
- 10,000 client connections -> 100 DB connections
- Scales to 50k+ users without DB upgrade
- Reduced connection overhead
- Transaction mode = better performance

**Negative:**
- Additional infrastructure to manage
- Single point of failure (mitigate with HA setup)
- Transaction mode incompatible with some PostgreSQL features (e.g., LISTEN/NOTIFY, session-level prepared statements)

**Mitigation:**
- Deploy 2 PgBouncer instances for HA (Phase 2)
- Monitor connection pool utilization
- Keep direct DATABASE_URL for migrations

---

## ADR-004: Tiered Caching by Game Status

**Date:** January 13, 2025
**Status:** Accepted

### Context

Not all games need the same refresh frequency:
- Live games: Odds change every 30s
- Upcoming games: Odds change every 5min
- Scheduled games: Odds change every 30min+

Uniform 5min cache = wasted API calls for scheduled games, stale data for live games.

### Decision

**Implement tiered caching based on game status.**

- **Live:** 30s Redis + 10s client
- **Starting Soon (<2hr):** 2min Redis + 1min client
- **Upcoming (2-24hr):** 5min Redis + 2min client
- **Scheduled (>24hr):** 30min Redis + 10min client
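
A sketch of the classification logic this tiering requires (the `GameStatusClassifier` utility named under Mitigation below); function names and the TTL table are illustrative, with values mirroring the Redis tiers above:

```python
from datetime import datetime, timedelta, timezone

# Redis TTLs in seconds, mirroring the tiers in the decision above.
REDIS_TTL = {"live": 30, "starting_soon": 120, "upcoming": 300, "scheduled": 1800}


def classify_game(start_time: datetime, is_live: bool, now: datetime | None = None) -> str:
    """Map a fixture to a cache tier from its status and time to kickoff."""
    now = now or datetime.now(timezone.utc)
    if is_live:
        return "live"
    time_to_start = start_time - now
    if time_to_start <= timedelta(hours=2):
        return "starting_soon"
    if time_to_start <= timedelta(hours=24):
        return "upcoming"
    return "scheduled"


def redis_ttl_for(start_time: datetime, is_live: bool) -> int:
    return REDIS_TTL[classify_game(start_time, is_live)]
```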
### Alternatives Considered

#### Alternative 1: Uniform 5min cache

**Pros:**
- Simpler
- Already implemented

**Cons:**
- Live games too stale (5min old data)
- Scheduled games over-fetched (wasted API calls)
- Doesn't scale efficiently

**Verdict:** Rejected - one-size-fits-all doesn't work.

#### Alternative 2: User-configurable refresh

**Pros:**
- User controls freshness

**Cons:**
- Complexity for users
- Unpredictable API usage
- Cache fragmentation

**Verdict:** Rejected - unnecessary complexity.

### Consequences

**Positive:**
- Fresh data for live games (30s latency)
- Reduced API calls for scheduled games (-70%)
- Better UX (right freshness for context)
- Scalable to 50k users within API limits

**Negative:**
- More complex caching logic
- Need to classify game status on every request
- Cache key management more complex

**Mitigation:**
- Create GameStatusClassifier utility for consistent logic
- Cache game status with fixture data
- Monitor cache hit rate per tier

---

## ADR-005: Materialized View for Sidebar

**Date:** January 13, 2025
**Status:** Accepted

### Context

Sidebar query aggregates across sports, leagues, events:

- Current: 450ms average (JOIN + COUNT + GROUP BY on 50k+ rows)
- Target: <100ms for better UX
- Updated: Every 5 minutes (not real-time critical)

### Decision

**Use PostgreSQL materialized view refreshed every 5 minutes.**

```sql
CREATE MATERIALIZED VIEW sidebar_sports_mv AS
SELECT sport_key, COUNT(...), ...
FROM odds_sports ...
GROUP BY ...;

REFRESH MATERIALIZED VIEW CONCURRENTLY sidebar_sports_mv;
```
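
A sketch of the refresh job referenced under Mitigation, assuming asyncpg and a hypothetical DSN; in Phase 2 this would likely run as a Celery beat task instead (see ADR-009):

```python
import asyncio

import asyncpg

# CONCURRENTLY avoids locking readers, but requires a unique index on the view.
REFRESH_SQL = "REFRESH MATERIALIZED VIEW CONCURRENTLY sidebar_sports_mv;"


async def refresh_sidebar_mv_forever(dsn: str, interval_seconds: int = 300) -> None:
    """Refresh the sidebar materialized view every 5 minutes (hypothetical job)."""
    while True:
        conn = await asyncpg.connect(dsn)
        try:
            await conn.execute(REFRESH_SQL)
        finally:
            await conn.close()
        await asyncio.sleep(interval_seconds)
```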
### Alternatives Considered

#### Alternative 1: Direct query on every request

**Pros:**
- Always fresh data
- No maintenance

**Cons:**
- Slow (450ms)
- Database load on every request
- Doesn't scale

**Verdict:** Rejected - too slow.

#### Alternative 2: Cache aggregation result in Redis
**Pros:**
- Fast (2ms)
- Easy to implement

**Cons:**
- Still need to compute aggregation somewhere
- Cache invalidation complexity
- Data consistency issues

**Verdict:** Rejected - doesn't solve underlying problem.

#### Alternative 3: Application-level aggregation
**Pros:**
- Full control

**Cons:**
- Load all data into memory
- Inefficient
- Duplicates database logic

**Verdict:** Rejected - database is better at aggregation.

### Consequences

**Positive:**
- Query time: 450ms -> <10ms (45x faster)
- Reduced database load
- PostgreSQL handles refresh efficiently
- CONCURRENT refresh = no locking

**Negative:**
- Data up to 5 minutes stale (acceptable for sidebar)
- Additional storage for materialized view
- Requires refresh job

**Mitigation:**
- Background job refreshes every 5min
- Can manually refresh on demand
- Monitor refresh duration

---

## ADR-006: Railway + Supabase over Self-Hosted

**Date:** January 13, 2025
**Status:** Accepted

### Context

Infrastructure decision:

- **Managed:** Railway (app) + Supabase (DB) + Redis Cloud
- **Self-Hosted:** DigitalOcean droplets + self-managed PostgreSQL + Redis

Cost difference: ~$200/month at 50k users

### Decision

**Use managed services (Railway + Supabase + Redis Cloud).**

### Alternatives Considered

#### Alternative 1: Self-hosted on DigitalOcean
**Pros:**
- $378/month cheaper at 50k users
- Full control

**Cons:**
- Requires dedicated DevOps engineer ($8k+/mo)
- Manual scaling, monitoring, backups
- No managed database features (PITR, auto-failover)
- Higher operational risk

**Verdict:** Rejected - DevOps cost > savings.

#### Alternative 2: Serverless (AWS Lambda + RDS)
**Pros:**
- Auto-scaling
- Pay-per-use

**Cons:**
- Cold starts hurt latency
- Complex debugging
- Vendor lock-in
- Similar cost to Railway

**Verdict:** Rejected - complexity not worth it.

### Consequences

**Positive:**
- Zero DevOps overhead
- Auto-scaling, managed backups, monitoring included
- Better developer productivity
- 99.9% SLA (Supabase Team)
- Focus on features, not infrastructure

**Negative:**
- $200/month more expensive than self-hosted
- Less control over infrastructure
- Vendor dependency

**Mitigation:**
- Infrastructure cost is <1% of revenue - acceptable
- Document migration path if needed in future
- Use standard PostgreSQL (easy to migrate)

---

## ADR-007: TanStack Query over Redux for Client State

**Date:** January 13, 2025 (reaffirming existing decision)
**Status:** Accepted

### Context

Client-side state management for server data:
- Existing: TanStack Query for all API data
- Alternative: Redux + RTK Query

### Decision

**Continue using TanStack Query for all server state.**

- Already implemented and working well
- Activity-based refetch optimization in place
- Query key structure well-designed

### Alternatives Considered

#### Alternative 1: Redux + RTK Query
**Pros:**
- Centralized state management
- Dev tools

**Cons:**
- More boilerplate
- Over-engineering for our use case
- TanStack Query already working
- Migration cost with no benefit

**Verdict:** Rejected - if it ain't broke, don't fix it.

### Consequences

**Positive:**
- Less code than Redux
- Built-in caching, refetching, optimistic updates
- Activity-based optimization already implemented
- Team already familiar **Negative:**
- Client state (auth, UI) in Context API (not ideal)
- No time-travel debugging

**Mitigation:**
- Keep using Context API for non-server state (acceptable for our scale)
- Consider Zustand if client state becomes complex

---

## ADR-008: Virtual Scrolling for Sidebar Leagues

**Date:** January 13, 2025
**Status:** Accepted

### Context

Sidebar can have 100+ leagues (NCAA, internationals):
- Rendering all at once = 3-5 second lag on mobile
- Especially bad on low-end devices
- Need smooth 60fps scrolling

### Decision

**Use react-window for virtual scrolling.**

- Only render visible items + 5 overscan
- 60px rows for mobile touch targets
- AutoSizer for responsive height

### Alternatives Considered

#### Alternative 1: Render all leagues
**Pros:**
- Simpler code
- Works for small lists

**Cons:**
- Lags on 100+ leagues
- Poor mobile UX
- Wasted memory

**Verdict:** Rejected - doesn't scale.

#### Alternative 2: Pagination
**Pros:**
- Simple
- Reduces initial render

**Cons:**
- Extra clicks to see more leagues
- Poor UX for browsing
- Doesn't solve scroll performance

**Verdict:** Rejected - worse UX.

#### Alternative 3: react-virtualized (older library)
**Pros:**
- More features than react-window

**Cons:**
- Larger bundle size
- More complex API
- react-window is newer, lighter

**Verdict:** Rejected - react-window sufficient.

### Consequences

**Positive:**
- Smooth 60fps scrolling
- Instant render even with 200+ leagues
- Lower memory usage
- Better mobile UX

**Negative:**
- Slightly more complex code
- CSS styling limitations (fixed height rows)

**Mitigation:**
- Keep row component simple
- Use AutoSizer for responsive height

---

## ADR-009: Celery for Background Jobs (Phase 2)

**Date:** January 13, 2025
**Status:** Planned (not yet implemented)

### Context

Background tasks needed:
- Refresh live odds every 30s
- Refresh upcoming odds every 5min
- Refresh sidebar materialized view every 5min
- Clean up stale cache hourly

### Decision

**Use Celery with Redis broker for background jobs.**

- **Broker:** Redis (already have)
- **Priority Queues:** High (live), Medium (upcoming), Low (scheduled)
- **Worker Count:** 2-8 depending on phase
- **Monitoring:** Flower dashboard
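
A sketch of what this setup could look like; the module path, queue names, task names, and Redis database index are hypothetical, and the beat schedule mirrors the cadences listed under Context:

```python
from celery import Celery

# Hypothetical broker URL reusing the existing Redis instance (separate DB index).
app = Celery("wagerbabe", broker="redis://localhost:6379/1")

# Route tasks to the three priority queues named in the decision.
app.conf.task_routes = {
    "tasks.refresh_live_odds": {"queue": "high"},
    "tasks.refresh_upcoming_odds": {"queue": "medium"},
    "tasks.refresh_sidebar_view": {"queue": "low"},
}

# Beat schedule matching the refresh cadences from the Context section.
app.conf.beat_schedule = {
    "live-odds-every-30s": {"task": "tasks.refresh_live_odds", "schedule": 30.0},
    "upcoming-odds-every-5min": {"task": "tasks.refresh_upcoming_odds", "schedule": 300.0},
    "sidebar-view-every-5min": {"task": "tasks.refresh_sidebar_view", "schedule": 300.0},
}


@app.task(name="tasks.refresh_live_odds", bind=True, max_retries=3)
def refresh_live_odds(self):
    try:
        ...  # fetch live odds from Optic Odds and write them to the Redis cache
    except Exception as exc:
        # Exponential backoff retries, as described under Mitigation.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
```

Workers would then be started per queue (for example, a dedicated worker consuming only the `high` queue for live odds) and monitored through Flower.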
### Alternatives Considered

#### Alternative 1: APScheduler (in-process)

**Pros:**
- Simpler setup
- Already using for some tasks
- No extra dependencies

**Cons:**
- Not distributed (can't scale across instances)
- Jobs lost if instance crashes
- No priority queues
- Limited monitoring

**Verdict:** Rejected - doesn't scale.

#### Alternative 2: AWS Lambda cron
**Pros:**
- Serverless
- Auto-scaling

**Cons:**
- Vendor lock-in
- Cold starts
- More complex deployment
- Not integrated with our stack

**Verdict:** Rejected - over-engineering.

### Consequences

**Positive:**
- Distributed task processing
- Priority queues
- Retry logic
- Flower monitoring dashboard
- Scales horizontally (add workers)

**Negative:**
- Additional complexity
- Requires Redis (already have)
- Worker instances cost money

**Mitigation:**
- Start with 2 workers, scale as needed
- Monitor task success rate
- Implement exponential backoff retries

---

## Summary Table

| ADR | Decision | Status | Phase |
|-----|----------|--------|-------|
| 001 | Redis for Caching | Accepted | 1 |
| 002 | WebSocket over SSE | Accepted | 2 |
| 003 | PgBouncer | Accepted | 1 |
| 004 | Tiered Caching | Accepted | 1 |
| 005 | Materialized View | Accepted | 1 |
| 006 | Managed over Self-Hosted | Accepted | All |
| 007 | TanStack Query | Accepted | Existing |
| 008 | Virtual Scrolling | Accepted | 1 |
| 009 | Celery | Planned | 2 |

---

**Last Updated:** January 13, 2025
**Next Review:** After Phase 1 completion
**Format:** Based on [ADR template by Michael Nygard](https://github.com/joelparkerhenderson/architecture-decision-record)