# Scaling Architecture Decisions

## Architecture Decision Records (ADR)

**Purpose:** Document key architectural decisions for the 50k user scaling initiative.

**Format:** Context, Decision, Alternatives Considered, Consequences

**Status:** Each decision is marked as Accepted, Rejected, Superseded, or Deprecated

---

## ADR-001: Redis for Distributed Caching

**Date:** January 13, 2025
**Status:** Accepted
**Decision Makers:** Engineering Team

### Context

We need a fast, distributed caching layer to:

- Reduce API calls to Optic Odds (6000/min limit)
- Speed up frequently accessed data (sports, leagues, odds)
- Enable horizontal scaling (share cache across instances)
- Support pub/sub for WebSocket broadcasting

### Decision

**Use Redis as the primary caching layer with a tiered TTL strategy.**

- **Provider:** Redis Cloud (Railway Redis for <1k users)
- **Configuration:** Connection pooling (50 max connections)
- **Cache Strategy:** HOT (5min), WARM (1hr), COLD (24hr)
- **Eviction Policy:** allkeys-lru (least recently used)
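
A minimal cache-aside sketch of this strategy, assuming the `redis-py` client; the key naming and the `fetch_from_optic_odds` helper are hypothetical stand-ins. It also shows the fallback to a direct API call when Redis is unavailable (see Mitigation below):

```python
import json

import redis

# Tier TTLs in seconds: HOT (5 min), WARM (1 hr), COLD (24 hr).
TTL = {"hot": 300, "warm": 3600, "cold": 86400}

# Pooled client, matching the 50-connection cap from the decision above.
pool = redis.ConnectionPool(host="localhost", port=6379, max_connections=50, decode_responses=True)
cache = redis.Redis(connection_pool=pool)


def fetch_from_optic_odds(fixture_id: str) -> dict:
    """Stand-in for the real Optic Odds client call (hypothetical)."""
    return {}


def get_odds(fixture_id: str, tier: str = "hot") -> dict:
    """Cache-aside lookup: Redis first, direct API on a miss or on Redis failure."""
    key = f"odds:{fixture_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        fresh = fetch_from_optic_odds(fixture_id)
        cache.set(key, json.dumps(fresh), ex=TTL[tier])
        return fresh
    except redis.RedisError:
        # Mitigation from the Consequences section: fall back to a direct API call.
        return fetch_from_optic_odds(fixture_id)
```

The `allkeys-lru` eviction policy itself is server-side configuration (the `maxmemory-policy` directive), not application code.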
### Alternatives Considered

#### Alternative 1: Memcached

**Pros:**
- Simpler than Redis
- Slightly faster for pure key-value
- Lower memory overhead

**Cons:**
- No pub/sub (critical for WebSocket)
- No data structures (lists, sets, sorted sets)
- No persistence (data lost on restart)
- Less popular (fewer tools, support)

**Verdict:** Rejected due to lack of pub/sub.

#### Alternative 2: In-Memory (Python dict)

**Pros:**
- Zero latency
- No external dependency
- Free

**Cons:**
- Not distributed (can't share across instances)
- Lost on restart
- Limited by server RAM
- No horizontal scaling

**Verdict:** Rejected - doesn't scale.

#### Alternative 3: Database-backed cache (Supabase table)

**Pros:**
- Persistent
- Already have Supabase
- No extra cost

**Cons:**
- Much slower (50-100ms vs 1-2ms)
- Adds load to database
- No pub/sub
- Connection overhead

**Verdict:** Rejected - too slow.

### Consequences

**Positive:**
- 1-2ms cache lookups (vs 200ms+ API calls)
- Pub/sub enables WebSocket horizontal scaling
- Persistent caching (survives restart with RDB)
- Rich data structures for complex caching
- Industry standard (well-documented, supported)

**Negative:**
- Additional cost ($15-180/month depending on scale)
- Single point of failure (mitigated with Redis Cluster in Phase 3)
- Memory limits (mitigated with eviction policy)

**Mitigation:**
- Implement Redis Cluster for HA in Phase 3 (10k+ users)
- Monitor cache hit rate and evictions
- Fall back to the direct API on Redis failure

---

## ADR-002: WebSocket over Server-Sent Events (SSE)

**Date:** January 13, 2025
**Status:** Accepted

### Context

Need real-time odds updates without constant polling. Options:
- WebSocket (bidirectional)
- Server-Sent Events (SSE, unidirectional)
- Long Polling (outdated)
- HTTP/2 Server Push (deprecated)

### Decision

**Use WebSocket with Redis pub/sub for horizontal scaling.**

- **Protocol:** WebSocket (ws://)
- **Library:** FastAPI native WebSocket support
- **Scaling:** Redis pub/sub for cross-instance messaging
- **Reconnection:** Exponential backoff (2s, 5s, 10s, 30s)
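
A minimal sketch of the fan-out path under this decision, assuming FastAPI's native WebSocket support and the `redis-py` asyncio client; the endpoint path and the `odds:updates` channel name are hypothetical:

```python
import asyncio

import redis.asyncio as aioredis
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
# Every app instance subscribes to the same channel, so a publish from any
# instance reaches clients connected to all instances (horizontal scaling).
redis_client = aioredis.Redis(host="localhost", port=6379, decode_responses=True)


@app.websocket("/ws/odds")
async def odds_websocket(websocket: WebSocket):
    await websocket.accept()
    pubsub = redis_client.pubsub()
    await pubsub.subscribe("odds:updates")

    async def forward_updates() -> None:
        # Push every message published on the Redis channel down this socket.
        async for message in pubsub.listen():
            if message["type"] == "message":
                await websocket.send_text(message["data"])

    forwarder = asyncio.create_task(forward_updates())
    try:
        while True:
            # Bidirectional: the client sends subscription changes (e.g. game IDs) here.
            await websocket.receive_text()
    except WebSocketDisconnect:
        pass
    finally:
        forwarder.cancel()
        await pubsub.unsubscribe("odds:updates")
```

Broadcasting an update is then a single `await redis_client.publish("odds:updates", payload)` from whichever worker fetched fresh odds.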
### Alternatives Considered

#### Alternative 1: Server-Sent Events (SSE)

**Pros:**
- Simpler than WebSocket
- Built-in reconnection
- Works over HTTP (easier for proxies)
- Lower server overhead

**Cons:**
- One-way only (server -> client)
- Can't send user actions over the same connection
- Limited browser support for large messages
- HTTP overhead (headers on every message)

**Verdict:** Rejected - need bidirectional for user subscriptions.

#### Alternative 2: Optimized Polling

**Pros:**
- Simplest to implement
- Already working
- No connection management

**Cons:**
- Wastes bandwidth (empty responses)
- Higher latency (poll interval = min latency)
- Server load (constant requests)
- Not truly real-time

**Verdict:** Rejected - not scalable to 50k users.

### Consequences

**Positive:**
- True real-time updates (<500ms latency)
- Bidirectional (client can subscribe to specific games)
- Efficient (one connection, many messages)
- Redis pub/sub enables horizontal scaling

**Negative:**
- More complex than SSE
- Connection management overhead
- Requires load balancer sticky sessions
- Redis pub/sub adds dependency

**Mitigation:**
- Implement connection pooling and health checks
- Graceful degradation to polling if WebSocket fails
- Heartbeat ping/pong to detect dead connections

---

## ADR-003: PgBouncer for Connection Pooling

**Date:** January 13, 2025
**Status:** Accepted

### Context

Supabase Free tier = 100 max connections. At scale:

- 10 FastAPI instances × 50 connections = 500 connections needed
- Without pooling, we hit connection limits at ~200 concurrent users

### Decision

**Deploy PgBouncer in transaction pooling mode.**

- **Pooling Mode:** Transaction (not session)
- **Max Client Connections:** 10,000
- **Pool Size:** 50-100 actual DB connections
- **Deployment:** Docker container on Railway
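
A sketch of how each FastAPI instance would connect through PgBouncer with asyncpg (the DSN and pool sizes are illustrative); `statement_cache_size=0` is needed because transaction pooling does not support asyncpg's named prepared statements, one of the feature caveats noted under Consequences:

```python
import asyncpg

# Hypothetical DSN: application traffic goes through PgBouncer on port 6432,
# while migrations keep using the direct DATABASE_URL (see Mitigation below).
PGBOUNCER_DSN = "postgresql://app_user:password@pgbouncer.railway.internal:6432/postgres"


async def init_pool() -> asyncpg.Pool:
    return await asyncpg.create_pool(
        dsn=PGBOUNCER_DSN,
        min_size=5,
        max_size=50,              # per-instance cap; PgBouncer multiplexes onto 50-100 real connections
        statement_cache_size=0,   # required in transaction pooling mode (no prepared-statement reuse)
    )
```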
### Alternatives Considered

#### Alternative 1: Upgrade Supabase to higher tier

**Pros:**
- No additional setup
- Managed by Supabase

**Cons:**
- Expensive ($25/mo -> $599/mo for 400 connections)
- Still limited (hard cap at 400)
- Doesn't solve fundamental pooling issue

**Verdict:** Rejected - not cost-effective.

#### Alternative 2: Reduce asyncpg pool size per instance

**Pros:**
- No changes needed
- Free

**Cons:**
- Limits throughput per instance
- Doesn't scale (same problem, just delayed)
- Connection queuing under load

**Verdict:** Rejected - kicks the can down the road.

#### Alternative 3: PgPool-II

**Pros:**
- More features than PgBouncer
- Load balancing built-in

**Cons:**
- More complex configuration
- Higher overhead
- Overkill for our use case

**Verdict:** Rejected - PgBouncer simpler and sufficient.

### Consequences

**Positive:**
- 10,000 client connections -> 100 DB connections
- Scales to 50k+ users without DB upgrade
- Reduced connection overhead
- Transaction mode = better performance

**Negative:**
- Additional infrastructure to manage
- Single point of failure (mitigate with HA setup)
- Transaction mode incompatible with some PostgreSQL features (e.g., LISTEN/NOTIFY, session-level prepared statements)

**Mitigation:**
- Deploy 2 PgBouncer instances for HA (Phase 2)
- Monitor connection pool utilization
- Keep direct DATABASE_URL for migrations

---

## ADR-004: Tiered Caching by Game Status

**Date:** January 13, 2025
**Status:** Accepted

### Context

Not all games need the same refresh frequency:
- Live games: Odds change every 30s
- Upcoming games: Odds change every 5min
- Scheduled games: Odds change every 30min+

Uniform 5min cache = wasted API calls for scheduled games, stale data for live games.

### Decision

**Implement tiered caching based on game status.**

- **Live:** 30s Redis + 10s client
- **Starting Soon (<2hr):** 2min Redis + 1min client
- **Upcoming (2-24hr):** 5min Redis + 2min client
- **Scheduled (>24hr):** 30min Redis + 10min client
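
A sketch of the classification logic this tiering requires (the `GameStatusClassifier` utility named under Mitigation below); function names and the TTL table are illustrative, with values mirroring the Redis tiers above:

```python
from datetime import datetime, timedelta, timezone

# Redis TTLs in seconds, mirroring the tiers in the decision above.
REDIS_TTL = {"live": 30, "starting_soon": 120, "upcoming": 300, "scheduled": 1800}


def classify_game(start_time: datetime, is_live: bool, now: datetime | None = None) -> str:
    """Map a fixture to a cache tier from its status and time to kickoff."""
    now = now or datetime.now(timezone.utc)
    if is_live:
        return "live"
    time_to_start = start_time - now
    if time_to_start <= timedelta(hours=2):
        return "starting_soon"
    if time_to_start <= timedelta(hours=24):
        return "upcoming"
    return "scheduled"


def redis_ttl_for(start_time: datetime, is_live: bool) -> int:
    return REDIS_TTL[classify_game(start_time, is_live)]
```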
### Alternatives Considered

#### Alternative 1: Uniform 5min cache

**Pros:**
- Simpler
- Already implemented

**Cons:**
- Live games too stale (5min old data)
- Scheduled games over-fetched (wasted API calls)
- Doesn't scale efficiently

**Verdict:** Rejected - one-size-fits-all doesn't work.

#### Alternative 2: User-configurable refresh

**Pros:**
- User controls freshness

**Cons:**
- Complexity for users
- Unpredictable API usage
- Cache fragmentation

**Verdict:** Rejected - unnecessary complexity.

### Consequences

**Positive:**
- Fresh data for live games (30s latency)
- Reduced API calls for scheduled games (-70%)
- Better UX (right freshness for context)
- Scalable to 50k users within API limits

**Negative:**
- More complex caching logic
- Need to classify game status on every request
- Cache key management more complex

**Mitigation:**
- Create GameStatusClassifier utility for consistent logic
- Cache game status with fixture data
- Monitor cache hit rate per tier

---

## ADR-005: Materialized View for Sidebar

**Date:** January 13, 2025
**Status:** Accepted

### Context

Sidebar query aggregates across sports, leagues, events:

- Current: 450ms average (JOIN + COUNT + GROUP BY on 50k+ rows)
- Target: <100ms for better UX
- Updated: Every 5 minutes (not real-time critical)

### Decision

**Use PostgreSQL materialized view refreshed every 5 minutes.**

```sql
CREATE MATERIALIZED VIEW sidebar_sports_mv AS
SELECT sport_key, COUNT(...), ...
FROM odds_sports ...
GROUP BY ...;

REFRESH MATERIALIZED VIEW CONCURRENTLY sidebar_sports_mv;
```
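
A sketch of the refresh job referenced under Mitigation, assuming asyncpg and a hypothetical DSN; in Phase 2 this would likely run as a Celery beat task instead (see ADR-009):

```python
import asyncio

import asyncpg

# CONCURRENTLY avoids locking readers, but requires a unique index on the view.
REFRESH_SQL = "REFRESH MATERIALIZED VIEW CONCURRENTLY sidebar_sports_mv;"


async def refresh_sidebar_mv_forever(dsn: str, interval_seconds: int = 300) -> None:
    """Refresh the sidebar materialized view every 5 minutes (hypothetical job)."""
    while True:
        conn = await asyncpg.connect(dsn)
        try:
            await conn.execute(REFRESH_SQL)
        finally:
            await conn.close()
        await asyncio.sleep(interval_seconds)
```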
### Alternatives Considered

#### Alternative 1: Direct query on every request

**Pros:**
- Always fresh data
- No maintenance

**Cons:**
- Slow (450ms)
- Database load on every request
- Doesn't scale

**Verdict:** Rejected - too slow.

#### Alternative 2: Cache aggregation result in Redis
**Pros:**
- Fast (2ms)
- Easy to implement

**Cons:**
- Still need to compute aggregation somewhere
- Cache invalidation complexity
- Data consistency issues

**Verdict:** Rejected - doesn't solve underlying problem.

#### Alternative 3: Application-level aggregation
**Pros:**
- Full control

**Cons:**
- Load all data into memory
- Inefficient
- Duplicates database logic

**Verdict:** Rejected - database is better at aggregation.

### Consequences

**Positive:**
- Query time: 450ms -> <10ms (45x faster)
- Reduced database load
- PostgreSQL handles refresh efficiently
- CONCURRENT refresh = no locking

**Negative:**
- Data up to 5 minutes stale (acceptable for sidebar)
- Additional storage for materialized view
- Requires refresh job

**Mitigation:**
- Background job refreshes every 5min
- Can manually refresh on demand
- Monitor refresh duration

---

## ADR-006: Railway + Supabase over Self-Hosted

**Date:** January 13, 2025
**Status:** Accepted

### Context

Infrastructure decision:

- **Managed:** Railway (app) + Supabase (DB) + Redis Cloud
- **Self-Hosted:** DigitalOcean droplets + self-managed PostgreSQL + Redis

Cost difference: ~$200/month at 50k users

### Decision

**Use managed services (Railway + Supabase + Redis Cloud).**

### Alternatives Considered

#### Alternative 1: Self-hosted on DigitalOcean
**Pros:**
- $378/month cheaper at 50k users
- Full control

**Cons:**
- Requires dedicated DevOps engineer ($8k+/mo)
- Manual scaling, monitoring, backups
- No managed database features (PITR, auto-failover)
- Higher operational risk

**Verdict:** Rejected - DevOps cost > savings.

#### Alternative 2: Serverless (AWS Lambda + RDS)
**Pros:**
- Auto-scaling
- Pay-per-use

**Cons:**
- Cold starts hurt latency
- Complex debugging
- Vendor lock-in
- Similar cost to Railway

**Verdict:** Rejected - complexity not worth it.

### Consequences

**Positive:**
- Zero DevOps overhead
- Auto-scaling, managed backups, monitoring included
- Better developer productivity
- 99.9% SLA (Supabase Team)
- Focus on features, not infrastructure

**Negative:**
- $200/month more expensive than self-hosted
- Less control over infrastructure
- Vendor dependency

**Mitigation:**
- Infrastructure cost is <1% of revenue - acceptable
- Document migration path if needed in future
- Use standard PostgreSQL (easy to migrate)

---

## ADR-007: TanStack Query over Redux for Client State

**Date:** January 13, 2025 (reaffirming existing decision)
**Status:** Accepted

### Context

Client-side state management for server data:
- Existing: TanStack Query for all API data
- Alternative: Redux + RTK Query

### Decision

**Continue using TanStack Query for all server state.**

- Already implemented and working well
- Activity-based refetch optimization in place
- Query key structure well-designed

### Alternatives Considered

#### Alternative 1: Redux + RTK Query
**Pros:**
- Centralized state management
- Dev tools

**Cons:**
- More boilerplate
- Over-engineering for our use case
- TanStack Query already working
- Migration cost with no benefit

**Verdict:** Rejected - if it ain't broke, don't fix it.

### Consequences

**Positive:**
- Less code than Redux
- Built-in caching, refetching, optimistic updates
- Activity-based optimization already implemented
- Team already familiar **Negative:**
- Client state (auth, UI) in Context API (not ideal)
- No time-travel debugging

**Mitigation:**
- Keep using Context API for non-server state (acceptable for our scale)
- Consider Zustand if client state becomes complex

---

## ADR-008: Virtual Scrolling for Sidebar Leagues

**Date:** January 13, 2025
**Status:** Accepted

### Context

Sidebar can have 100+ leagues (NCAA, internationals):
- Rendering all at once = 3-5 second lag on mobile
- Especially bad on low-end devices
- Need smooth 60fps scrolling

### Decision

**Use react-window for virtual scrolling.**

- Only render visible items + 5 overscan
- 60px rows for mobile touch targets
- AutoSizer for responsive height

### Alternatives Considered

#### Alternative 1: Render all leagues
**Pros:**
- Simpler code
- Works for small lists

**Cons:**
- Lags on 100+ leagues
- Poor mobile UX
- Wasted memory

**Verdict:** Rejected - doesn't scale.

#### Alternative 2: Pagination
**Pros:**
- Simple
- Reduces initial render

**Cons:**
- Extra clicks to see more leagues
- Poor UX for browsing
- Doesn't solve scroll performance

**Verdict:** Rejected - worse UX.

#### Alternative 3: react-virtualized (older library)
**Pros:**
- More features than react-window

**Cons:**
- Larger bundle size
- More complex API
- react-window is newer, lighter

**Verdict:** Rejected - react-window sufficient.

### Consequences

**Positive:**
- Smooth 60fps scrolling
- Instant render even with 200+ leagues
- Lower memory usage
- Better mobile UX

**Negative:**
- Slightly more complex code
- CSS styling limitations (fixed height rows)

**Mitigation:**
- Keep row component simple
- Use AutoSizer for responsive height

---

## ADR-009: Celery for Background Jobs (Phase 2)

**Date:** January 13, 2025
**Status:** Planned (not yet implemented)

### Context

Background tasks needed:
- Refresh live odds every 30s
- Refresh upcoming odds every 5min
- Refresh sidebar materialized view every 5min
- Clean up stale cache hourly

### Decision

**Use Celery with Redis broker for background jobs.**

- **Broker:** Redis (already have)
- **Priority Queues:** High (live), Medium (upcoming), Low (scheduled)
- **Worker Count:** 2-8 depending on phase
- **Monitoring:** Flower dashboard
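
A sketch of what this setup could look like; the module path, queue names, task names, and Redis database index are hypothetical, and the beat schedule mirrors the cadences listed under Context:

```python
from celery import Celery

# Hypothetical broker URL reusing the existing Redis instance (separate DB index).
app = Celery("wagerbabe", broker="redis://localhost:6379/1")

# Route tasks to the three priority queues named in the decision.
app.conf.task_routes = {
    "tasks.refresh_live_odds": {"queue": "high"},
    "tasks.refresh_upcoming_odds": {"queue": "medium"},
    "tasks.refresh_sidebar_view": {"queue": "low"},
}

# Beat schedule matching the refresh cadences from the Context section.
app.conf.beat_schedule = {
    "live-odds-every-30s": {"task": "tasks.refresh_live_odds", "schedule": 30.0},
    "upcoming-odds-every-5min": {"task": "tasks.refresh_upcoming_odds", "schedule": 300.0},
    "sidebar-view-every-5min": {"task": "tasks.refresh_sidebar_view", "schedule": 300.0},
}


@app.task(name="tasks.refresh_live_odds", bind=True, max_retries=3)
def refresh_live_odds(self):
    try:
        ...  # fetch live odds from Optic Odds and write them to the Redis cache
    except Exception as exc:
        # Exponential backoff retries, as described under Mitigation.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)
```

Workers would then be started per queue (for example, a dedicated worker consuming only the `high` queue for live odds) and monitored through Flower.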
### Alternatives Considered

#### Alternative 1: APScheduler (in-process)

**Pros:**
- Simpler setup
- Already using for some tasks
- No extra dependencies

**Cons:**
- Not distributed (can't scale across instances)
- Jobs lost if instance crashes
- No priority queues
- Limited monitoring

**Verdict:** Rejected - doesn't scale.

#### Alternative 2: AWS Lambda cron
**Pros:**
- Serverless
- Auto-scaling

**Cons:**
- Vendor lock-in
- Cold starts
- More complex deployment
- Not integrated with our stack

**Verdict:** Rejected - over-engineering.

### Consequences

**Positive:**
- Distributed task processing
- Priority queues
- Retry logic
- Flower monitoring dashboard
- Scales horizontally (add workers)

**Negative:**
- Additional complexity
- Requires Redis (already have)
- Worker instances cost money

**Mitigation:**
- Start with 2 workers, scale as needed
- Monitor task success rate
- Implement exponential backoff retries

---

## Summary Table

| ADR | Decision | Status | Phase |
|-----|----------|--------|-------|
| 001 | Redis for Caching | Accepted | 1 |
| 002 | WebSocket over SSE | Accepted | 2 |
| 003 | PgBouncer | Accepted | 1 |
| 004 | Tiered Caching | Accepted | 1 |
| 005 | Materialized View | Accepted | 1 |
| 006 | Managed over Self-Hosted | Accepted | All |
| 007 | TanStack Query | Accepted | Existing |
| 008 | Virtual Scrolling | Accepted | 1 |
| 009 | Celery | Planned | 2 |

---

**Last Updated:** January 13, 2025
**Next Review:** After Phase 1 completion
**Format:** Based on [ADR template by Michael Nygard](https://github.com/joelparkerhenderson/architecture-decision-record)