WAGERBABE DOCS: Scaling Roadmap

# Current State Assessment

**Assessment Date:** January 13, 2025

**System Version:** v2.0 (Post-FastAPI Migration)
**Baseline For:** 50k User Scaling Initiative

---

## Infrastructure Snapshot

### Application Servers

**Hosting Platform:** Railway

  • Instance Count: 1 (single instance, no redundancy)
  • Instance Specs: 2 vCPU, 4GB RAM
  • Operating System: Linux (containerized)
  • Runtime: Python 3.11, Node.js 18
  • Deployment: Auto-deploy from main branch
  • Scaling: Manual only (no autoscaling configured)
  • Geographic Region: us-west-1
  • Uptime (Last 30 days): 99.2% (~6 hours downtime)

**Capacity:**
  • Max Concurrent Requests: ~100 req/s (estimated)
  • Typical Load: 5-10 req/s during business hours
  • Peak Load: 25 req/s (Saturday evenings)
  • Burst Capacity: None (crashes at >150 req/s)

**Known Bottlenecks:**
  • Single instance = single point of failure
  • No load balancing
  • No horizontal scaling capability
  • Memory spikes during cache refresh

---

### Database

**Provider:** Supabase
**Tier:** Free Plan
**Database:** PostgreSQL 15.1

**Specifications:**
  • Max Concurrent Connections: 100
  • Current Average Connections: 15-25
  • Peak Connections: 45 (approaching limit)
  • Storage Used: 487MB / 500MB (97.4% full)
  • Paused After Inactivity: 7 days
  • Backup Retention: None (free tier)
  • Point-in-Time Recovery: Not available

**Connection Pool (asyncpg):**
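The settings below map onto an `asyncpg` pool created roughly as in this sketch. The helper name and DSN handling are illustrative, not the actual module layout:

```python
# Pool settings as currently deployed (values from this assessment).
POOL_CONFIG = {
    "min_size": 10,                           # Min Pool Size
    "max_size": 50,                           # Max Pool Size
    "timeout": 30,                            # Connection Timeout (seconds)
    "max_inactive_connection_lifetime": 600,  # Idle Timeout (seconds)
}

async def create_pool(dsn: str):
    """Create the shared application pool (called once at startup)."""
    import asyncpg  # local import so the config above is importable standalone
    return await asyncpg.create_pool(dsn, **POOL_CONFIG)
```

Note the max pool size (50) is half the Supabase connection cap (100), which leaves no headroom for a second app instance without PgBouncer.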
  • Min Pool Size: 10 connections
  • Max Pool Size: 50 connections
  • Connection Timeout: 30 seconds
  • Idle Timeout: 600 seconds (10 min)

**Performance:**
  • Average Query Time: 35ms (p50)
  • p95 Query Time: 180ms
  • p99 Query Time: 450ms
  • Slow Query Threshold: >500ms
  • Slow Queries/Day: ~50 queries

**Table Sizes:**

| Table | Row Count | Size |
|-------|-----------|------|
| users | 145 | 12MB |
| bets | 1,234 | 48MB |
| odds_events | 3,567 | 156MB |
| odds_data | 45,892 | 189MB |
| odds_history | 89,234 | 67MB |
| game_results | 2,341 | 15MB |

**Critical Issues:**
  • Storage nearly full (97.4%)
  • No automated backups
  • Connection limit problematic for scaling
  • No read replicas
  • No connection pooler (PgBouncer)

---

### Caching Infrastructure

**Provider:** Railway Redis
**Version:** Redis 7.2
**Memory:** 256MB allocated

**Configuration:**
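On the client side, these limits correspond roughly to a redis-py setup like the sketch below. The helper name and socket timeout are assumptions; the eviction policy itself is server-side configuration (`maxmemory-policy allkeys-lru`), not a client option:

```python
# Client-side settings matching the allocation above.
REDIS_CONFIG = {
    "max_connections": 50,   # matches the 50-connection cap
    "decode_responses": True,
    "socket_timeout": 5,     # assumed value; not measured in this assessment
}

def create_client(url: str):
    """Build the shared redis-py client (hypothetical helper name)."""
    import redis  # local import so the config dict is importable standalone
    return redis.Redis.from_url(url, **REDIS_CONFIG)
```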
  • Eviction Policy: allkeys-lru (least recently used)
  • Persistence: RDB snapshots every 5 minutes
  • Max Connections: 50
  • Current Connections: 5-12 average
  • Connection Pool: 50 max (redis-py)

**Cache Performance (Last 7 Days):**
  • Hit Rate: 62.3% (needs improvement)
  • Miss Rate: 37.7%
  • Evictions: 1,247 keys/day (memory pressure)
  • Keys Stored: ~8,500 active keys
  • Average Key Size: 15KB
  • Memory Usage: 187MB / 256MB (73%)

**Cache Breakdown by Prefix:**

| Prefix | Keys | Hit Rate | Avg TTL |
|--------|------|----------|---------|
| wagerbabe:odds:* | 4,200 | 58% | 5 min |
| wagerbabe:cache:* | 2,800 | 71% | 15 min |
| wagerbabe:session:* | 1,100 | 89% | 30 min |
| wagerbabe:sidebar:* | 400 | 45% | 30 min |

**Critical Issues:**
  • Low cache hit rate (target: >90%)
  • Frequent evictions due to memory pressure
  • No Redis clustering/HA
  • Sidebar cache particularly ineffective (45%)
  • No pub/sub for WebSocket (in-memory only)

---

### External API Integration

**Provider:** Optic Odds API
**Plan:** Free Tier
**Rate Limit:** 6,000 requests/minute

**Current Usage (Last 7 Days):**
  • Total Requests: 89,543 requests
  • Daily Average: 12,792 requests/day
  • Requests per Minute: 8.9 avg, 34 peak
  • Percentage of Limit: 3.3% average, 13.4% peak
  • Cost: $0 (within free tier)

**Request Breakdown:**

| Endpoint | Requests/Day | % of Total |
|----------|--------------|------------|
| /fixtures | 8,400 | 65.7% |
| /leagues | 2,100 | 16.4% |
| /sports | 1,500 | 11.7% |
| /sportsbooks | 600 | 4.7% |
| /markets | 192 | 1.5% |

**API Response Times:**
  • Average: 234ms
  • p95: 450ms
  • p99: 680ms
  • Timeout Rate: 0.3% (3-4 timeouts/day)

**Projected Usage at Scale:**

| User Count | Requests/Day | % of Limit | Status |
|------------|--------------|------------|--------|
| 100 | 25,000 | 6% | SAFE |
| 1,000 | 200,000 | 50% | MONITOR |
| 10,000 | 1,600,000 | 400% | EXCEEDED |
| 50,000 | 8,000,000 | 2000% | CRITICAL |

**Critical:** Without caching improvements, we'll hit API limits at ~1,000 concurrent users.

---

### WebSocket Infrastructure

**Current Implementation:** In-memory connection tracking
**Library:** FastAPI WebSocket support (native)

**Capacity:**
  • Max Concurrent Connections: ~100 (tested)
  • Current Peak Connections: 12
  • Connection Duration: 5-45 minutes average
  • Disconnection Rate: 15% (reconnection required)

**Message Traffic:**
  • Messages Sent/Min: 45 average
  • Messages Received/Min: 12 average
  • Average Message Size: 1.2KB
  • Bandwidth: ~50KB/s outbound

**Limitations:**
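These limitations all trace back to the in-memory connection registry, which in outline looks like the sketch below (class and attribute names are illustrative, not the actual module):

```python
import asyncio
from typing import Any

class ConnectionManager:
    """In-memory WebSocket registry. State lives in this process,
    so it cannot be shared across horizontally scaled instances."""

    def __init__(self) -> None:
        self.active: list[Any] = []  # live WebSocket objects

    async def connect(self, ws: Any) -> None:
        await ws.accept()
        self.active.append(ws)

    def disconnect(self, ws: Any) -> None:
        self.active.remove(ws)

    async def broadcast(self, message: str) -> None:
        # Fans out to connections on *this* instance only.
        for ws in list(self.active):
            await ws.send_text(message)

manager = ConnectionManager()
```

In a FastAPI endpoint this is used as `await manager.connect(websocket)` inside an `@app.websocket(...)` handler. Because `manager.active` lives in one process, a second instance would start with its own empty registry, which is why Redis pub/sub is a prerequisite for scaling out.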
  • Single Instance Limitation: All connections on one server
  • No Horizontal Scaling: Can't distribute across instances
  • No Redis Pub/Sub: In-memory state prevents multi-instance
  • No Connection Recovery: Clients must reconnect after server restart
  • No Message Persistence: Messages lost if client offline

**Critical:** WebSocket architecture cannot scale beyond 500-1,000 concurrent connections on a single instance.

---

## Performance Baseline

### API Endpoint Response Times (p95)

| Endpoint | Current (p95) | Target | Gap |
|----------|---------------|--------|-----|
| GET /sidebar/sports | 450ms | <100ms | -350ms |
| POST /optic-odds/fixtures | 680ms | <200ms | -480ms |
| GET /optic-odds/fixtures/{id} | 320ms | <150ms | -170ms |
| GET /optic-odds/sports | 280ms | <100ms | -180ms |
| GET /optic-odds/leagues | 310ms | <150ms | -160ms |
| GET /dashboard/metrics | 520ms | <200ms | -320ms |
| POST /auth/login | 180ms | <200ms | OK |

**Overall API Performance:**
  • p50 (median): 245ms
  • p95: 520ms
  • p99: 850ms
  • p99.9: 1,200ms (timeout territory)
  • Error Rate: 0.8% (mostly timeouts)
  • Timeout Threshold: 30 seconds

### Client-Side Performance

**Lighthouse Scores (Mobile):**
  • Performance: 67/100
  • First Contentful Paint: 2.1s
  • Largest Contentful Paint: 3.8s
  • Time to Interactive: 4.2s
  • Cumulative Layout Shift: 0.14

**Bundle Sizes:**
  • Initial Load: 487KB (target: <200KB)
  • First Load JS: 234KB
  • Total Page Weight: 1.2MB

**TanStack Query Performance:**
  • Cache Hit Rate: 68% (client-side)
  • Background Refetch: Every 2-10 minutes (activity-based)
  • Cache Stale Time: 2-15 minutes (varies by endpoint)
  • Prefetch Success Rate: 82%

---

### Cache Performance Deep Dive

**Redis Cache Statistics:**

**Hit/Miss Breakdown:**
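As a sanity check, the headline rates can be recomputed from the raw counters; the same arithmetic applies to `keyspace_hits`/`keyspace_misses` from Redis `INFO`:

```python
# Recomputing the 7-day headline figures from the raw counters.
hits, misses = 90_431, 54_803
total = hits + misses          # 145,234 operations

hit_rate = hits / total * 100  # ~62.3%
daily_misses = misses // 7     # ~7,829 misses/day

print(f"hit rate: {hit_rate:.1f}%, misses/day: {daily_misses:,}")
```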
```
Total Cache Operations (7 days): 145,234
├── Cache Hits: 90,431 (62.3%)
├── Cache Misses: 54,803 (37.7%)
└── Cache Errors: 0 (0%)
```

**Cache by Data Type:**
| Data Type | Hit Rate | Avg Age | TTL |
|-----------|----------|---------|-----|
| Sports List | 78% | 12 min | 15 min |
| League Data | 71% | 8 min | 15 min |
| Fixture Odds (Live) | 42% | 3 min | 5 min |
| Fixture Odds (Upcoming) | 65% | 7 min | 10 min |
| Sidebar Data | 45% | 6 min | 30 min |
| User Sessions | 89% | 18 min | 30 min |

**Why Low Hit Rates?**
1. **Sidebar Cache:** Full refresh instead of partial updates
2. **Live Odds:** Short TTL (5min) with high request variance
3. **Memory Pressure:** Evicting popular keys due to 256MB limit
4. **No Prefetching:** Reactive caching only (miss = fetch)
5. **Poor Key Design:** Not leveraging cache hierarchy

**Cost of Low Hit Rate:**
- **Extra API Calls:** ~20,000/day (could be cached)
- **Extra DB Queries:** ~15,000/day (could be cached)
- **Increased Latency:** 400ms average for cache misses
- **Revenue Impact:** None currently, but limits scalability

---

### Database Performance

**Query Performance (Last 24 Hours):**

**Top 5 Slowest Queries:**
1. `SELECT * FROM odds_data WHERE event_id IN (...)` - 680ms avg
2. `SELECT ... FROM odds_events JOIN odds_data ...` - 520ms avg
3. `Sidebar sports aggregation (no materialized view)` - 450ms avg
4. `SELECT ... FROM bets WHERE user_id = ... ORDER BY ...` - 380ms avg
5. `Full-text search on teams` - 340ms avg

**Query Volume:**
- **Total Queries/Day:** 45,000
- **SELECTs:** 40,500 (90%)
- **INSERTs:** 3,000 (6.7%)
- **UPDATEs:** 1,200 (2.7%)
- **DELETEs:** 300 (0.7%)

**Index Usage:**
- **Total Indexes:** 23
- **Used Indexes:** 18 (78%)
- **Unused Indexes:** 5 (wasted storage)
- **Missing Indexes:** 7 (identified by query planner)

**Connection Pool Health:**
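These figures can be sampled at runtime from asyncpg's public pool accessors (`get_size()`, `get_idle_size()`, `get_max_size()`); the reporting helper itself is a sketch:

```python
def pool_health(pool) -> dict:
    """Snapshot utilization for an asyncpg pool.

    `pool` can be any object exposing asyncpg's Pool accessors,
    which makes this easy to wire into a /health endpoint.
    """
    size = pool.get_size()
    idle = pool.get_idle_size()
    in_use = size - idle
    return {
        "size": size,
        "in_use": in_use,
        "utilization_pct": round(in_use / pool.get_max_size() * 100, 1),
    }
```

At the observed Saturday peak (45 of 50 pooled connections in use) this reports 90% utilization, matching the figure below.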
- **Pool Utilization:** 30-50% average
- **Peak Utilization:** 90% (Saturday evenings)
- **Connection Wait Time:** 0ms (no queuing yet)
- **Idle Connections:** 40-60% at any time

---

## User Metrics

### Concurrent Users

**Current Capacity:** ~50 concurrent users tested
**Target Capacity:** 50,000 concurrent users

**Historical Peaks:**
| Date | Peak Users | Duration |
|------|------------|----------|
| Jan 7 | 42 | 2 hours |
| Jan 6 | 38 | 1.5 hours |
| Dec 30 | 51 | 3 hours (feature launch) |

**User Behavior Patterns:**
- **Session Duration:** 18 minutes average
- **Pages per Session:** 12 average
- **Bets per Session:** 2.3 average
- **Peak Hours:** 6pm-11pm EST (75% of traffic)
- **Mobile vs Desktop:** 68% mobile, 32% desktop

**Projected User Load:**
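The projections below follow a simple linear model: roughly 0.5 requests per user per minute at baseline (higher during peak), with DB queries at about 60% of requests. Both ratios are assumptions fitted to observed traffic, not measured constants:

```python
def project_load(users: int, req_per_user_min: float = 0.5,
                 db_ratio: float = 0.6) -> tuple[int, int]:
    """Project (requests/min, DB queries/min) for a given user count.

    req_per_user_min and db_ratio are assumed ratios; peak-hour
    traffic runs closer to 0.8 requests per user per minute.
    """
    requests = round(users * req_per_user_min)
    return requests, round(requests * db_ratio)

print(project_load(200))  # business hours: 200 users -> (100, 60)
print(project_load(50))   # off-peak: 50 users -> (25, 15)
```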
| Time | Users | Requests/Min | DB Queries/Min |
|------|-------|--------------|----------------|
| Off-Peak (3am-8am) | 50 | 25 | 15 |
| Business Hours | 200 | 100 | 60 |
| **Peak (8pm)** | **500** | **400** | **250** |
| Event-Driven Spike | 1,200 | 1,000 | 600 |

---

## Critical Pain Points

### High Priority (P0)

1. **Database Storage: 97.4% Full**
   - **Impact:** Service will fail when storage hits 100%
   - **Timeline:** ~2 weeks at current growth
   - **Solution:** Upgrade to Supabase Pro or archive old data
2. **Single Point of Failure**
   - **Impact:** One server crash = total downtime
   - **Probability:** Monthly (based on Railway uptime)
   - **Solution:** Multi-instance deployment + load balancer
3. **WebSocket Scalability**
   - **Impact:** Can't scale beyond 500-1,000 connections
   - **Current:** 12 peak connections (lots of headroom)
   - **Solution:** Redis pub/sub for multi-instance support

### Medium Priority (P1)

4. **Cache Hit Rate: 62% (Target: >90%)**
   - **Impact:** Extra API calls, slower response times
   - **Cost:** Not financial yet, but limits scaling
   - **Solution:** Tiered caching + aggressive prefetching
5. **No Database Connection Pooling (PgBouncer)**
   - **Impact:** Connection exhaustion at scale
   - **Current:** 45/100 peak (45% utilization)
   - **Solution:** Deploy PgBouncer for 1,000+ effective connections
6. **API Quota Risk at Scale**
   - **Impact:** Will exceed free tier at 1,000 users
   - **Projected Cost:** $300/month at 50k users
   - **Solution:** Aggressive caching (reduce API calls by 80%)

### Low Priority (P2)

7. **No Monitoring/Alerting**
   - **Impact:** Blind to issues until users complain
   - **Solution:** Add monitoring endpoints + basic alerting
8. **No Load Balancing**
   - **Impact:** Can't distribute traffic across instances
   - **Solution:** Railway load balancer or external (HAProxy)
9. **Client Bundle Size: 487KB**
   - **Impact:** Slow initial load on mobile
   - **Solution:** Code splitting + lazy loading

---

## Growth Trajectory

### Historical Growth (Last 90 Days)

| Metric | 90 Days Ago | 60 Days Ago | 30 Days Ago | Today | Trend |
|--------|-------------|-------------|-------------|-------|-------|
| Total Users | 45 | 78 | 112 | 145 | +222% |
| Daily Active | 8 | 18 | 32 | 42 | +425% |
| Bets Placed/Day | 12 | 45 | 98 | 156 | +1200% |
| API Calls/Day | 3,200 | 6,800 | 10,200 | 12,792 | +300% |
| DB Storage | 145MB | 287MB | 398MB | 487MB | +236% |

**Growth Rate:**
- **Users:** 1.6 new users/day
- **Bets:** 4.2% compound daily growth
- **API Calls:** Linear with user count
- **Storage:** 1.5MB/day

**Projected Timeline to Limits:**
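These timelines are simple runway divisions. As a sanity check, the stated 1.5MB/day rate would actually exhaust the remaining ~13MB of storage in about 9 days, so the ~14-day estimate below implies growth closer to 1MB/day; worth reconciling at the next assessment:

```python
def days_until_limit(limit: float, used: float, growth_per_day: float) -> float:
    """Days of runway left at a constant growth rate."""
    return (limit - used) / growth_per_day

# Storage: 500MB cap, 487MB used
print(round(days_until_limit(500, 487, 1.5), 1))   # 8.7 days at 1.5MB/day
print(round(days_until_limit(500, 487, 0.93), 1))  # ~14 days implies <1MB/day
```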
- **Database Storage Full:** 14 days (if no cleanup)
- **Connection Limit (100):** 120 days (at current growth)
- **API Free Tier Exceeded:** 180 days (without caching)
- **Single Instance CPU Limit:** 90 days

---

## Immediate Actions Required

### This Week
1. Create scaling documentation (this file)
2. Archive old odds_history data (free up 50MB)
3. Set up database monitoring (alert at 95% storage)
4. Begin Phase 1 implementation (sidebar optimization)

### Next 2 Weeks (Phase 1)
1. Implement tiered Redis caching
2. Deploy PgBouncer
3. Create materialized view for sidebar
4. Optimize top 5 slow queries
5. Add basic monitoring endpoints

### Next Month (Phase 2)
1. WebSocket scaling with Redis pub/sub
2. Background job infrastructure (Celery)
3. Upgrade to Supabase Pro
4. Add read replicas (if supported)

---

## Data Sources

This assessment was compiled from:
- Railway metrics dashboard
- Supabase dashboard analytics
- Redis INFO command output
- Application logs (last 7 days)
- Optic Odds API dashboard
- Manual load testing (k6 scripts)
- Google Analytics (user behavior)

**Next Assessment:** January 20, 2025 (weekly during Phase 1)

---

**Assessment Prepared By:** Engineering Team
**Last Updated:** January 13, 2025
**Confidence Level:** High (based on production data)