
# Scaling Roadmap

A 4-phase infrastructure plan to reach 50,000 concurrent users.

**Target Capacity:** from 145 to 50,000 concurrent users


## Implementation Phases

| Phase | Status | Focus | Target Users |
|-------|--------|-------|--------------|
| Phase 1: Foundation | Complete | Database pooling, Redis caching, basic optimization | 1,000 - 10,000 |
| Phase 2: API Resilience | In Progress | WebSocket scaling, background jobs, request coalescing | 10,000 - 25,000 |
| Phase 3: Distribution | Planned | CDN integration, read replicas, geographic distribution | 25,000 - 50,000 |
| Phase 4: Enterprise | Planned | Event streaming, CQRS, microservices architecture | 50,000+ |

## Full Roadmap Document

---

# Scaling Roadmap: 50 to 50,000 Users

**Roadmap Version:** 1.0

**Created:** January 13, 2025
**Timeline:** 4 months to production-ready
**Target Capacity:** 50,000 concurrent users

---

## Strategic Overview

### Scaling Philosophy

**Principle:** Start simple, scale smart, measure everything.

We're implementing a phased approach that prioritizes:

1. **Quick Wins First** - High impact, low complexity (sidebar, caching)
2. **Foundation Before Features** - Solid infrastructure before optimization
3. **Data-Driven Decisions** - Measure before and after each phase
4. **Horizontal Over Vertical** - Scale out, not up
5. **Managed Over Self-Hosted** - Developer productivity > cost savings

### Success Criteria

**Technical Metrics:**
- API response time <200ms (p95)
- Cache hit rate >95%
- Database queries <50ms (p95)
- WebSocket supports 50k connections
- 99.9% uptime (8.76 hours downtime/year max)

**Business Metrics:**
- Infrastructure cost <1% of revenue
- Zero downtime during normal operations
- Support event-driven traffic spikes (10x normal)
- Horizontal scaling without code changes

---

## Phase Timeline

| Phase | Duration | User Capacity | Key Deliverables | Status |
|-------|----------|---------------|------------------|--------|
| Phase 0 | Week 0 | 50 | Documentation, baseline | Done |
| Phase 1 | Weeks 1-2 | 1,000 | Sidebar, caching, PgBouncer | In Progress |
| Phase 2 | Weeks 3-4 | 10,000 | WebSocket, background jobs | Planned |
| Phase 3 | Month 2 | 25,000 | CDN, APM, read replicas | Planned |
| Phase 4 | Months 3-4 | 50,000 | Event streaming, CQRS | Planned |

---

## Phase 0: Documentation & Baseline

**Duration:** Week 0 (January 13-19, 2025)
**Status:** Complete

### Objectives
- Document current state with hard metrics
- Create comprehensive cost analysis
- Define success criteria for each phase
- Establish monitoring baseline

### Deliverables
- `docs/scaling/README.md` - Overview and quick reference
- `docs/scaling/CURRENT_STATE.md` - Baseline metrics
- `docs/scaling/COST_ANALYSIS.md` - Financial projections
- `docs/scaling/SCALING_ROADMAP.md` (this file)
- `docs/scaling/ARCHITECTURE_DECISIONS.md` - ADRs
- `docs/scaling/metrics/` - Benchmarks and tracking
- `docs/scaling/infrastructure/` - Setup guides

### Success Metrics
- All baseline metrics documented
- Team aligned on priorities (sidebar first)
- Cost projections validated
- Monitoring strategy defined

---

## Phase 1: Foundation (Weeks 1-2)

**Duration:** 2 weeks (January 20 - February 2, 2025)
**Target Capacity:** 1,000 concurrent users
**Focus:** Sidebar optimization, tiered caching, database foundation

### Week 1: Sidebar Optimization & Database

#### Objective

Optimize the sidebar to load instantly, filter out non-bettable sports, prioritize American sports, and set up the database for scale.

---

#### Task 1.1: Enhanced Sidebar Service

**File:** `server/app/services/shared/sidebar_cache_service.py`
**Estimated Time:** 4 hours

**Changes:**

```python
# json/datetime are stdlib; db_pool and self.redis are assumed to be wired up
# elsewhere in this service module.
import json
from datetime import datetime


class SidebarCacheService:

    def filter_bettable_sports(self, sports_data: list) -> list:
        """Remove sports/leagues with zero bettable games."""
        return [
            sport for sport in sports_data
            if sport.get('total_games', 0) > 0
        ]

    def prioritize_american_sports(self, sports_data: list) -> list:
        """Sort: NFL, NBA, MLB, NHL, NCAAF, NCAAB first, then others alphabetically."""
        priority_keys = ['nfl', 'nba', 'mlb', 'nhl', 'ncaaf', 'ncaab']
        priority = [s for s in sports_data if s['key'] in priority_keys]
        others = sorted(
            [s for s in sports_data if s['key'] not in priority_keys],
            key=lambda x: x['display_name']
        )
        return priority + others

    async def get_sidebar_data_optimized(self, force_refresh: bool = False) -> dict:
        """Get sidebar data with filtering, prioritization, and 5-15min tiered cache."""
        cache_key = "wagerbabe:sidebar:optimized"

        # Try cache first
        if not force_refresh:
            cached = await self.redis.get(cache_key)
            if cached:
                return json.loads(cached)

        # Fetch from materialized view (much faster than aggregation)
        async with db_pool.acquire() as conn:
            sports = await conn.fetch("""
                SELECT * FROM sidebar_sports_mv
                ORDER BY priority ASC, display_name ASC
            """)

        # Transform and filter
        sports_data = [dict(row) for row in sports]
        sports_data = self.filter_bettable_sports(sports_data)
        sports_data = self.prioritize_american_sports(sports_data)

        # Cache with tiered TTL (5min for live counts, 15min for structure)
        has_live_games = any(s.get('live_count', 0) > 0 for s in sports_data)
        ttl = 300 if has_live_games else 900  # 5min or 15min

        # Cache the full response so cache hits and misses return the same shape;
        # default=str handles datetime columns coming back from the view.
        response = {
            'sports': sports_data,
            'cache_info': {
                'cached_at': datetime.now().isoformat(),
                'ttl': ttl,
                'has_live_games': has_live_games
            }
        }
        await self.redis.setex(cache_key, ttl, json.dumps(response, default=str))

        return response
```

**Success Criteria:**
- Only sports with games displayed
- American sports always at top
- Cache hit rate >85%
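
To sanity-check the filtering and prioritization rules, a small pytest sketch like the following could accompany the task. It assumes the helpers stay free of service state and that the module is importable as `app.services.shared.sidebar_cache_service`; the `make_service` helper and test data are illustrative only.

```python
# test_sidebar_cache_service.py - hypothetical test sketch for Task 1.1
from app.services.shared.sidebar_cache_service import SidebarCacheService


def make_service() -> SidebarCacheService:
    # Bypass __init__ (which may wire Redis/DB); the helpers under test are pure.
    return SidebarCacheService.__new__(SidebarCacheService)


def test_filter_bettable_sports_drops_empty_sports():
    svc = make_service()
    sports = [
        {"key": "nfl", "display_name": "NFL", "total_games": 12},
        {"key": "cricket", "display_name": "Cricket", "total_games": 0},
    ]
    assert [s["key"] for s in svc.filter_bettable_sports(sports)] == ["nfl"]


def test_prioritize_american_sports_orders_priority_then_alphabetical():
    svc = make_service()
    sports = [
        {"key": "soccer_epl", "display_name": "EPL"},
        {"key": "nba", "display_name": "NBA"},
        {"key": "aussie_rules", "display_name": "AFL"},
        {"key": "nfl", "display_name": "NFL"},
    ]
    ordered = [s["key"] for s in svc.prioritize_american_sports(sports)]
    assert ordered == ["nba", "nfl", "aussie_rules", "soccer_epl"]
```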

---

#### Task 1.2: Database Materialized View

**File:** `server/migrations/create_sidebar_materialized_view.sql`
**Estimated Time:** 2 hours

**SQL:**
```sql
-- Drop existing view if any
DROP MATERIALIZED VIEW IF EXISTS sidebar_sports_mv CASCADE;

-- Create materialized view for sidebar data
CREATE MATERIALIZED VIEW sidebar_sports_mv AS
SELECT
    s.id,
    s.sport_key,
    s.display_name,
    s.group_name,
    s.priority,
    s.has_outrights,
    s.is_active,
    COUNT(DISTINCT e.id) FILTER (
        WHERE e.betting_enabled = true
          AND e.commence_time > NOW()
          AND e.status != 'cancelled'
    ) AS games_count,
    COUNT(DISTINCT e.id) FILTER (
        WHERE e.is_live = true
          AND e.betting_enabled = true
    ) AS live_count,
    MIN(e.commence_time) FILTER (
        WHERE e.commence_time > NOW()
          AND e.betting_enabled = true
    ) AS next_game_time,
    MAX(e.updated_at) AS last_updated
FROM odds_sports s
LEFT JOIN odds_events e ON s.sport_key = e.sport_key
WHERE s.is_active = true
GROUP BY s.id, s.sport_key, s.display_name, s.group_name, s.priority, s.has_outrights, s.is_active;

-- Create unique index for fast lookups
CREATE UNIQUE INDEX idx_sidebar_sports_mv_sport_key ON sidebar_sports_mv (sport_key);

-- Create index for sorting
CREATE INDEX idx_sidebar_sports_mv_priority ON sidebar_sports_mv (priority, display_name);

-- Auto-refresh materialized view every 5 minutes (requires pg_cron or manual job)
-- This will be handled by background worker in Phase 2
```

**Background Refresh Job (add to existing or create new):**
```python
# server/app/background/refresh_sidebar_mv.py
from apscheduler.schedulers.asyncio import AsyncIOScheduler


async def refresh_sidebar_materialized_view():
    """Refresh sidebar materialized view every 5 minutes."""
    async with db_pool.acquire() as conn:
        await conn.execute("""
            REFRESH MATERIALIZED VIEW CONCURRENTLY sidebar_sports_mv
        """)
    logger.info("Refreshed sidebar materialized view")


# Register with scheduler
scheduler.add_job(
    refresh_sidebar_materialized_view,
    'interval',
    minutes=5,
    id='refresh_sidebar_mv'
)
```

**Success Criteria:**
- Sidebar query time <10ms (from 450ms)
- Automatic refresh every 5 minutes
- No manual aggregation queries
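
As a quick way to verify the "<10ms (from 450ms)" criterion, a rough timing sketch like this can be pointed at the database. The DSN, file name, and the idea of comparing against the old aggregation query are assumptions, not part of the migration itself.

```python
# benchmark_sidebar_mv.py - rough timing sketch, not a production benchmark
import asyncio
import time

import asyncpg


async def time_query(dsn: str, sql: str) -> float:
    """Return elapsed wall-clock time for a single query, in milliseconds."""
    conn = await asyncpg.connect(dsn)
    try:
        start = time.perf_counter()
        await conn.fetch(sql)
        return (time.perf_counter() - start) * 1000
    finally:
        await conn.close()


if __name__ == "__main__":
    dsn = "postgresql://user:pass@localhost:5432/wagerbabe"  # placeholder DSN
    elapsed = asyncio.run(time_query(dsn, "SELECT * FROM sidebar_sports_mv"))
    print(f"sidebar_sports_mv read: {elapsed:.1f}ms")
```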

---

#### Task 1.3: Sidebar API Enhancements

**File:** `server/app/api/v1/endpoints/sidebar.py`
**Estimated Time:** 3 hours

**Changes:**
```python
@router.get("/sports", response_model=SidebarResponse)
async def get_sidebar_sports(
    filter: str = Query("bettable", enum=["all", "bettable"]),
    sort: str = Query("priority", enum=["priority", "alphabetical", "popular"]),
    force_refresh: bool = False,
    current_user: User = Depends(get_current_user)
):
    """
    Get optimized sidebar sports data.

    - **filter**: 'all' or 'bettable' (default: bettable)
    - **sort**: 'priority' (American sports first), 'alphabetical', or 'popular' (most games)
    - **force_refresh**: Bypass cache

    Returns:
    - Sports with leagues and game counts
    - Cache metadata
    - Performance stats
    """
    sidebar_service = SidebarCacheService()
    data = await sidebar_service.get_sidebar_data_optimized(force_refresh)

    # Apply filters
    if filter == "bettable":
        data['sports'] = sidebar_service.filter_bettable_sports(data['sports'])

    # Apply sorting
    if sort == "priority":
        data['sports'] = sidebar_service.prioritize_american_sports(data['sports'])
    elif sort == "alphabetical":
        data['sports'] = sorted(data['sports'], key=lambda x: x['display_name'])
    elif sort == "popular":
        data['sports'] = sorted(data['sports'], key=lambda x: x['games_count'], reverse=True)

    # Add cache headers for HTTP caching / CDN
    headers = {
        "Cache-Control": "public, max-age=300",  # 5 minutes
        "ETag": hashlib.md5(json.dumps(data).encode()).hexdigest(),
        "X-Cache-Hit": "true" if not force_refresh else "false"
    }

    return Response(
        content=json.dumps(data),
        media_type="application/json",
        headers=headers
    )
```

**Success Criteria:**
- API response <100ms (cached)
- HTTP cache headers for CDN
- Flexible filtering and sorting
- Cache metadata included
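
A possible follow-up (not part of the task above) is to honour `If-None-Match` so browsers and a future CDN can revalidate with a 304 instead of re-downloading the payload. A minimal sketch, assuming the same MD5 ETag scheme as the endpoint; the helper name is hypothetical:

```python
# Hypothetical helper for conditional responses; names are illustrative.
import hashlib
import json

from fastapi import Request, Response


def etag_json_response(request: Request, data: dict, max_age: int = 300) -> Response:
    body = json.dumps(data)
    etag = hashlib.md5(body.encode()).hexdigest()
    # Client already has this exact payload cached - return 304 with no body.
    if request.headers.get("if-none-match") == etag:
        return Response(status_code=304, headers={"ETag": etag})
    return Response(
        content=body,
        media_type="application/json",
        headers={"Cache-Control": f"public, max-age={max_age}", "ETag": etag},
    )
```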

---

#### Task 1.4: PgBouncer Setup

**File:** `docker-compose.pgbouncer.yml` + `pgbouncer.ini`
**Estimated Time:** 4 hours

**Docker Compose:**
```yaml
version: '3.8'

services:
  pgbouncer:
    image: edoburu/pgbouncer:latest
    container_name: wagerbabe-pgbouncer
    ports:
      - "6432:6432"
    environment:
      - DATABASE_URL=${DATABASE_URL}
      - POOL_MODE=transaction
      - MAX_CLIENT_CONN=10000
      - DEFAULT_POOL_SIZE=50
      - MIN_POOL_SIZE=20
      - RESERVE_POOL_SIZE=10
      - RESERVE_POOL_TIMEOUT=5
      - MAX_DB_CONNECTIONS=100
      - SERVER_IDLE_TIMEOUT=600
      - SERVER_LIFETIME=3600
    volumes:
      - ./pgbouncer.ini:/etc/pgbouncer/pgbouncer.ini
    healthcheck:
      test: ["CMD", "pg_isready", "-h", "localhost", "-p", "6432"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped
```

**PgBouncer Config:**
```ini
[databases]
wagerbabe = ${DATABASE_URL}

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 50
min_pool_size = 20
reserve_pool_size = 10
reserve_pool_timeout = 5
max_db_connections = 100
server_reset_query = DISCARD ALL
server_idle_timeout = 600
server_lifetime = 3600
server_connect_timeout = 15
query_timeout = 30
ignore_startup_parameters = extra_float_digits
admin_users = postgres
stats_users = postgres
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
```

**Environment Variable Update:**
```bash
# .env (update DATABASE_URL to point to PgBouncer)
DATABASE_URL=postgresql://user:pass@pgbouncer:6432/wagerbabe # Was: ...@supabase:5432/...
DIRECT_DATABASE_URL=postgresql://user:pass@supabase:5432/wagerbabe # Keep for migrations
```

**Success Criteria:**
- 10,000 client connections -> 100 DB connections
- No connection timeouts under load
- Health check endpoint working
- Connection stats accessible
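
On the application side, the pool then points at PgBouncer instead of Postgres directly. One real gotcha: in transaction pooling mode, asyncpg's prepared-statement cache must be disabled. A minimal sketch, assuming the `db_pool` used elsewhere in this roadmap is an asyncpg pool (pool sizes are illustrative):

```python
import asyncpg


async def create_app_pool(database_url: str) -> asyncpg.Pool:
    # database_url points at PgBouncer (port 6432), not directly at Postgres.
    return await asyncpg.create_pool(
        dsn=database_url,
        min_size=10,
        max_size=50,
        # Required with PgBouncer transaction pooling: prepared statements
        # cannot survive across pooled server connections.
        statement_cache_size=0,
    )
```

Connection stats can then be inspected from PgBouncer's admin console (for example `SHOW POOLS;` as one of the configured `admin_users`).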

---

#### Task 1.5: Client - Virtual Scrolling Sidebar

**File:** `client/src/components/primitives/sidebar.tsx`
**Estimated Time:** 5 hours

**Install Dependencies:**
```bash
npm install react-window react-window-infinite-loader react-virtualized-auto-sizer
```

**Implementation:**
```typescript
import { FixedSizeList as List } from 'react-window';
import AutoSizer from 'react-virtualized-auto-sizer';
import React, { useMemo } from 'react';
// SportRow, LeagueRow, SidebarSkeleton, SidebarItem and useSidebarSports come from the app.

export function VirtualizedSidebar() {
  const { data: sidebarData, isLoading } = useSidebarSports();

  // Flatten sports and leagues for virtual list
  const flattenedItems = useMemo(() => {
    if (!sidebarData?.sports) return [];
    const items: SidebarItem[] = [];
    sidebarData.sports.forEach((sport) => {
      items.push({ type: 'sport', data: sport });
      if (sport.is_expanded) {
        sport.leagues?.forEach((league) => {
          if (league.games_count > 0) {
            // Only show leagues with games
            items.push({ type: 'league', data: league, sportKey: sport.key });
          }
        });
      }
    });
    return items;
  }, [sidebarData]);

  const Row = ({ index, style }: { index: number; style: React.CSSProperties }) => {
    const item = flattenedItems[index];
    return (
      <div style={style}>
        {item.type === 'sport' ? (
          <SportRow sport={item.data} />
        ) : (
          <LeagueRow league={item.data} sportKey={item.sportKey} />
        )}
      </div>
    );
  };

  if (isLoading) {
    return <SidebarSkeleton />;
  }

  return (
    <div className="h-full">
      <AutoSizer>
        {({ height, width }) => (
          <List
            height={height}
            itemCount={flattenedItems.length}
            itemSize={60} // 60px per row (mobile-optimized)
            width={width}
            overscanCount={5} // Render 5 extra items for smooth scrolling
          >
            {Row}
          </List>
        )}
      </AutoSizer>
    </div>
  );
}
```

**Success Criteria:**
- Smooth scrolling with 100+ leagues
- Only render visible items
- <50ms render time
- Mobile-optimized (44px touch targets)

---

#### Task 1.6: Sidebar TanStack Query Hook
**File:** `client/src/lib/hooks/use-sidebar-data.ts`
**Estimated Time:** 2 hours

**Implementation:**
```typescript
import { useQuery, useQueryClient } from '@tanstack/react-query';
import { useEffect } from 'react';

export function useSidebarSports(options = {}) {
  const queryClient = useQueryClient();

  return useQuery({
    queryKey: ['sidebar', 'sports'],
    queryFn: async () => {
      const response = await fetch('/api/v1/sidebar/sports?filter=bettable&sort=priority');
      if (!response.ok) throw new Error('Failed to fetch sidebar data');
      return response.json();
    },
    staleTime: 15 * 60 * 1000,  // 15 minutes
    cacheTime: 30 * 60 * 1000,  // 30 minutes
    refetchOnWindowFocus: false,
    refetchOnMount: false,
    refetchOnReconnect: true,
    ...options
  });
}

// Prefetch on app load
export function usePrefetchSidebar() {
  const queryClient = useQueryClient();

  useEffect(() => {
    queryClient.prefetchQuery({
      queryKey: ['sidebar', 'sports'],
      queryFn: async () => {
        const response = await fetch('/api/v1/sidebar/sports?filter=bettable&sort=priority');
        return response.json();
      }
    });
  }, [queryClient]);
}

// Invalidate on WebSocket event
export function useInvalidateSidebarOnUpdate() {
  const queryClient = useQueryClient();
  const { lastMessage } = useOddsWebSocket();

  useEffect(() => {
    if (lastMessage?.type === 'sports_updated') {
      queryClient.invalidateQueries({ queryKey: ['sidebar', 'sports'] });
    }
  }, [lastMessage, queryClient]);
}
```

**Success Criteria:**
- No localStorage caching (rely on TanStack Query)
- 15min stale time
- Prefetch on app mount
- Invalidate on WebSocket update

---

### Week 1 Success Criteria

- Sidebar loads <100ms (cached), <300ms (fresh)
- Only sports with bettable games shown
- American sports prioritized (NFL, NBA, MLB, NHL first)
- Virtual scrolling handles 100+ leagues smoothly
- Database supports 500 concurrent connections via PgBouncer
- Cache hit rate >85% for sidebar
- Materialized view refreshes every 5 minutes

---

### Week 2: Tiered Caching & API Efficiency

#### Objective
Implement smart caching based on game status, batch API requests, optimize database queries, and reduce external API calls by 60%.

---

#### Task 2.1: Game Status Classifier
**File:** `server/app/utils/game_status_classifier.py`
**Estimated Time:** 3 hours

**Implementation:**
```python
from enum import Enum
from datetime import datetime, timedelta


class GameStatus(Enum):
    LIVE = "live"                    # Currently in progress
    STARTING_SOON = "starting_soon"  # <2 hours until start
    UPCOMING = "upcoming"            # 2-24 hours
    SCHEDULED = "scheduled"          # >24 hours
    COMPLETED = "completed"          # Game over
    CANCELLED = "cancelled"          # Cancelled


class GameStatusClassifier:

    @staticmethod
    def classify(game: dict) -> GameStatus:
        """Determine game status for tiered caching."""
        if game.get('is_live'):
            return GameStatus.LIVE
        if game.get('status') == 'completed':
            return GameStatus.COMPLETED
        if game.get('status') == 'cancelled':
            return GameStatus.CANCELLED

        commence_time = game.get('commence_time')
        if not commence_time:
            return GameStatus.SCHEDULED
        if isinstance(commence_time, str):
            commence_time = datetime.fromisoformat(commence_time.replace('Z', '+00:00'))

        time_until_start = commence_time - datetime.now(commence_time.tzinfo)

        if time_until_start.total_seconds() < 0:
            return GameStatus.LIVE  # Should have started
        elif time_until_start.total_seconds() < 7200:   # 2 hours
            return GameStatus.STARTING_SOON
        elif time_until_start.total_seconds() < 86400:  # 24 hours
            return GameStatus.UPCOMING
        else:
            return GameStatus.SCHEDULED

    @staticmethod
    def get_cache_ttl(status: GameStatus) -> int:
        """Get cache TTL in seconds based on game status."""
        return {
            GameStatus.LIVE: 30,            # 30 seconds (very fresh)
            GameStatus.STARTING_SOON: 120,  # 2 minutes
            GameStatus.UPCOMING: 300,       # 5 minutes
            GameStatus.SCHEDULED: 1800,     # 30 minutes
            GameStatus.COMPLETED: 86400,    # 24 hours (static)
            GameStatus.CANCELLED: 86400     # 24 hours (static)
        }[status]

    @staticmethod
    def get_client_stale_time(status: GameStatus) -> int:
        """Get TanStack Query staleTime in milliseconds."""
        return {
            GameStatus.LIVE: 10000,           # 10 seconds
            GameStatus.STARTING_SOON: 60000,  # 1 minute
            GameStatus.UPCOMING: 120000,      # 2 minutes
            GameStatus.SCHEDULED: 600000,     # 10 minutes
            GameStatus.COMPLETED: 3600000,    # 1 hour
            GameStatus.CANCELLED: 3600000     # 1 hour
        }[status]
```

**Success Criteria:**
- Accurate classification of game status
- Tiered TTL based on urgency
- Client and server TTLs aligned
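
As a quick illustration of how one classification drives both the Redis TTL and the client staleTime, here is a hypothetical usage sketch (the game dict is made up; the import path matches the file above):

```python
from datetime import datetime, timedelta, timezone

from app.utils.game_status_classifier import GameStatus, GameStatusClassifier

# A game starting in one hour should be classified STARTING_SOON.
game = {
    "is_live": False,
    "status": "scheduled",
    "commence_time": (datetime.now(timezone.utc) + timedelta(hours=1)).isoformat(),
}

status = GameStatusClassifier.classify(game)
assert status is GameStatus.STARTING_SOON

print(GameStatusClassifier.get_cache_ttl(status))           # 120 (seconds, Redis TTL)
print(GameStatusClassifier.get_client_stale_time(status))   # 60000 (ms, TanStack staleTime)
```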

---

#### Task 2.2: Tiered Redis Caching in Odds Service

**File:** `server/app/services/odds/optic_odds_service.py`
**Estimated Time:** 4 hours

**Implementation:**
```python
from app.utils.game_status_classifier import GameStatusClassifier, GameStatus


class OpticOddsService:

    async def get_fixture_with_tiered_cache(self, fixture_id: str) -> dict:
        """Get fixture with cache TTL based on game status."""
        # Try cache first
        cache_key = f"wagerbabe:odds:fixture:{fixture_id}"
        cached = await self.redis_cache.get(cache_key)

        if cached:
            fixture_data = json.loads(cached)
            # Check if cache is still valid based on game status
            status = GameStatusClassifier.classify(fixture_data)
            cache_age = time.time() - fixture_data.get('_cached_at', 0)
            ttl = GameStatusClassifier.get_cache_ttl(status)
            if cache_age < ttl:
                logger.info(f"Cache hit for fixture {fixture_id}, status={status.value}, age={cache_age}s")
                return fixture_data

        # Cache miss or stale - fetch fresh data
        logger.info(f"Cache miss for fixture {fixture_id}, fetching from API")
        fixture_data = await self.optic_odds_client.get_fixture(fixture_id)

        # Classify and cache with appropriate TTL
        status = GameStatusClassifier.classify(fixture_data)
        ttl = GameStatusClassifier.get_cache_ttl(status)
        fixture_data['_cached_at'] = time.time()
        fixture_data['_cache_status'] = status.value

        await self.redis_cache.setex(
            cache_key,
            ttl,
            json.dumps(fixture_data)
        )
        logger.info(f"Cached fixture {fixture_id} with status={status.value}, ttl={ttl}s")
        return fixture_data

    async def batch_refresh_by_status(self, status: GameStatus):
        """Background job to refresh all games of a specific status."""
        # Get all cached fixtures with this status
        pattern = "wagerbabe:odds:fixture:*"
        cursor = 0
        refreshed = 0

        while True:
            cursor, keys = await self.redis_cache.scan(cursor, match=pattern, count=100)
            for key in keys:
                cached = await self.redis_cache.get(key)
                if cached:
                    fixture_data = json.loads(cached)
                    if fixture_data.get('_cache_status') == status.value:
                        # Refresh this fixture
                        fixture_id = key.split(':')[-1]
                        await self.get_fixture_with_tiered_cache(fixture_id)
                        refreshed += 1
            if cursor == 0:
                break

        logger.info(f"Refreshed {refreshed} {status.value} fixtures")
        return refreshed
```

**Success Criteria:**
- Live games cache for 30s
- Upcoming games cache for 5min
- Scheduled games cache for 30min
- Cache hit rate >85%
- API calls reduced by 60%
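
How `batch_refresh_by_status` gets scheduled is deferred to Phase 2's background jobs; one plausible wiring, reusing the APScheduler instance from Task 1.2, might look like the sketch below. The `scheduler` and `odds_service` names and the per-status intervals are assumptions.

```python
from app.utils.game_status_classifier import GameStatus

# Refresh cadence per status tier - live odds refresh most aggressively.
REFRESH_INTERVALS_SECONDS = {
    GameStatus.LIVE: 30,
    GameStatus.STARTING_SOON: 120,
    GameStatus.UPCOMING: 300,
}

for status, seconds in REFRESH_INTERVALS_SECONDS.items():
    scheduler.add_job(
        odds_service.batch_refresh_by_status,
        'interval',
        seconds=seconds,
        args=[status],
        id=f'refresh_{status.value}_fixtures',
    )
```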

---

### Week 2 Success Criteria

- API usage <3000 req/min (50% of limit)
- Cache hit rate >90% for scheduled games, >70% for live
- Database queries <50ms p95
- Request deduplication reduces API calls by 40% (see the coalescing sketch after this list)
- All top 5 slow queries optimized
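
The request-deduplication target pairs with the Request Batching task tracked later in this document. Purely as an illustration (an assumption, not the implemented service), a minimal in-process coalescer collapses concurrent identical fetches into a single upstream call:

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class RequestCoalescer:
    """Collapse concurrent identical requests into a single upstream call."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, asyncio.Task] = {}

    async def fetch(self, key: str, loader: Callable[[], Awaitable[Any]]) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the upstream request...
            task = asyncio.create_task(loader())
            self._in_flight[key] = task
            # ...and the entry is cleared once it finishes, success or failure.
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # Every concurrent caller awaits the same task and shares its result.
        return await task


# Usage sketch: concurrent callers for the same fixture share one upstream fetch.
# coalescer = RequestCoalescer()
# data = await coalescer.fetch(
#     f"fixture:{fixture_id}",
#     lambda: odds_service.get_fixture_with_tiered_cache(fixture_id),
# )
```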

---

### Phase 1 Final Success Criteria

**Performance:**

- Sidebar: <100ms (cached), <300ms (fresh)
- API endpoints: <200ms p95
- Database: <50ms p95
- Cache hit rate: >90%

**Scalability:**
- Database: 500+ concurrent connections via PgBouncer
- API usage: <3000 req/min (50% buffer)
- Capacity: 1,000 concurrent users

**Cost:**
- Monthly cost: ~$102/month
- API costs: $0 (still in free tier)

---

## Phase 2: Real-Time Infrastructure (Weeks 3-4)

**Target Capacity:** 10,000 concurrent users
**Focus:** WebSocket horizontal scaling, background jobs, monitoring

### Week 3: WebSocket & Redis Pub/Sub
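
Full details live in `phases/phase-2-realtime.md`, but the core idea is that each API instance subscribes to a shared Redis channel and relays messages to its locally connected WebSocket clients, so instances can be added horizontally. A minimal sketch; the channel name, client registry, and Redis URL are assumptions:

```python
import asyncio

import redis.asyncio as redis
from fastapi import WebSocket

# WebSockets connected to *this* API instance (registered in the ws endpoint).
local_clients: set[WebSocket] = set()


async def relay_odds_updates(redis_url: str = "redis://localhost:6379") -> None:
    client = redis.from_url(redis_url)
    pubsub = client.pubsub()
    await pubsub.subscribe("wagerbabe:odds_updates")

    async for message in pubsub.listen():
        if message["type"] != "message":
            continue
        payload = message["data"].decode()
        # Fan the update out to every socket attached to this instance.
        await asyncio.gather(
            *(ws.send_text(payload) for ws in list(local_clients)),
            return_exceptions=True,
        )
```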
### Week 4: Background Jobs & Monitoring

[Details in `phases/phase-2-realtime.md`]

---

## Phase 3: Enterprise Features (Month 2)

**Target Capacity:** 25,000 concurrent users
**Focus:** CDN, APM, read replicas, predictive caching

[Details in `phases/phase-3-enterprise.md`]

---

## Phase 4: Advanced Architecture (Months 3-4)

**Target Capacity:** 50,000 concurrent users
**Focus:** Event streaming, CQRS, multi-region, machine learning

[Details in `phases/phase-4-advanced.md`]

---

## Progress Tracking

### Current Status (Updated Weekly)

**Phase 1 Progress:** 15% complete (3/19 tasks done)

| Task | Status | ETA |
|------|--------|-----|
| Enhanced Sidebar Service | Not Started | Jan 20 |
| Database Materialized View | Not Started | Jan 21 |
| Sidebar API Enhancements | Not Started | Jan 22 |
| PgBouncer Setup | Not Started | Jan 23-24 |
| Virtual Scrolling Sidebar | Not Started | Jan 25-26 |
| Sidebar TanStack Query Hook | Not Started | Jan 27 |
| Game Status Classifier | Not Started | Jan 28 |
| Tiered Redis Caching | Not Started | Jan 29 |
| Request Batching | Not Started | Jan 30 |
| DB Query Optimization | Not Started | Jan 31 |
| Dynamic TanStack Config | Not Started | Feb 1-2 |

---

## Risk Mitigation

### High-Risk Items

1. **Database storage at 97%**
   - Mitigation: Archive `odds_history` data immediately
   - Timeline: This week
2. **PgBouncer deployment complexity**
   - Mitigation: Test in development first
   - Fallback: Direct connection with increased pool size
3. **API quota exceeded during development**
   - Mitigation: Use staging environment with rate limiting
   - Monitoring: Alert at 70% of quota

---

## Related Documentation

- **Current State:** [CURRENT_STATE.md](CURRENT_STATE.md)
- **Cost Analysis:** [COST_ANALYSIS.md](COST_ANALYSIS.md)
- **Architecture Decisions:** [ARCHITECTURE_DECISIONS.md](ARCHITECTURE_DECISIONS.md)
- **Performance Benchmarks:** [metrics/PERFORMANCE_BENCHMARKS.md](metrics/PERFORMANCE_BENCHMARKS.md)

---

**Last Updated:** January 13, 2025
**Next Review:** January 20, 2025 (Phase 1 kickoff)
**Maintained By:** Engineering Team