
# Product Brief: WagerBabe Scaling Initiative

**Date:** January 13, 2025

**Author:** babe
**Context:** Enterprise scaling initiative

---

## Executive Summary

WagerBabe has secured committed agents ready to migrate their user bases from competing platforms, but technical capacity is the bottleneck. With funding and marketing secured, the platform needs infrastructure scaling to unlock controlled onboarding of waiting users. The goal is capacity-driven growth: onboard users as fast as the system can reliably handle them, reaching 50,000 concurrent users.

---

## Core Vision

### Problem Statement

**The Opportunity:** Multiple sports betting agents have committed to migrating their customer books to WagerBabe from existing platforms. These agents represent thousands of active users ready to transition.

**The Bottleneck:** Current infrastructure can only support ~50 concurrent users reliably. System limitations prevent us from capturing this existing demand:

  • Database at 97.4% capacity - will fail at 100%
  • Single point of failure - one server crash = total downtime for all users
  • Connection limits - 100 max database connections exhausted at ~200 users
  • API quota risk - will exceed free tier limits at 1,000 users without caching improvements
  • No horizontal scaling - can't distribute load across multiple instances

**Business Impact:** Every day we can't onboard these committed users:
  • Lost revenue - Users waiting represent $5/user/month in unrealized revenue
  • Agent frustration - Committed agents may lose confidence or explore alternatives
  • Competitive risk - Users might settle into competing platforms while waiting
  • Reputation risk - Word spreads if the platform can't handle growth

### Proposed Solution

**Phased Infrastructure Scaling with Controlled User Onboarding**

Execute a 4-phase infrastructure transformation that progressively increases system capacity while validating stability at each gate. Onboard committed agent users in controlled batches only after proving system reliability at each capacity level.

**Rollout Strategy:**
  • Phase 0 (Testing): Validate with 10-100 test users before any production onboarding
  • Controlled Batches: Onboard 500-1,000 users at a time
  • Validation Gates: Monitor system performance for 3-7 days after each batch
  • Adjust & Iterate: Scale infrastructure based on observed bottlenecks before next batch
  • Risk Posture: Conservative (Option A) - prove stability before expanding capacity

**Infrastructure Evolution:**
  1. Phase 1 (Weeks 1-2): Foundation - sidebar optimization, tiered caching, PgBouncer -> 1,000 user capacity
  2. Phase 2 (Weeks 3-4): Real-time - WebSocket horizontal scaling, background jobs -> 10,000 user capacity
  3. Phase 3 (Month 2): Enterprise - CDN, APM, read replicas -> 25,000 user capacity
  4. Phase 4 (Months 3-4): Advanced - event streaming, CQRS optimization -> 50,000 user capacity

**Key Principle:** System capacity gates user growth, not the reverse. Never onboard beyond proven capacity.

---

### Key Differentiators

**Why Agents Are Choosing WagerBabe Over Competitors (Buckeye, etc.):**

1. **Modern, Mobile-First Experience**
   - Built from the ground up for mobile (68% of betting happens on mobile)
   - Fast, responsive, app-like PWA experience
   - Competitors stuck with legacy desktop-first UIs

2. **Built by Agents, For Agents**
   - Deep understanding of agent workflows and pain points
   - Features designed around real agent needs, not corporate assumptions
   - Agent Hub built specifically for book management

3. **Superior User Experience**
   - Faster load times, smoother interactions
   - Intuitive navigation vs clunky competitor interfaces
   - Better odds display and betting slip UX

4. **Competitive Economics**
   - $4-6 PPH (per head) pricing - competitive with industry standard
   - Better margins for agents vs existing platforms

**Primary Competitor:** Buckeye (legacy platform with outdated UX)

---

## Target Users

### Primary Users: Sports Bettors (End Users)

**Profile:**
  • Active sports bettors placing regular wagers
  • 68% mobile, 32% desktop - mobile is primary platform
  • Session behavior: 18 min average, 12 pages, 2.3 bets per session
  • Peak activity: 6pm-11pm EST (75% of traffic)
  • Value mobile speed, ease of use, and real-time odds

**Current Pain Points (on legacy platforms like Buckeye):**
  • Slow, clunky mobile interfaces
  • Outdated UX that feels like 2010
  • Hard to navigate during live games
  • Poor performance on mobile devices

**What WagerBabe Offers:**
  • Fast, modern mobile-first experience
  • Real-time odds updates via WebSocket
  • Intuitive betting slip and navigation
  • Smooth scrolling, instant interactions

### Secondary Users: Sports Betting Agents

**Profile:**
  • Professional bookmakers managing customer books
  • 2 master agents committed (each with network of sub-agents)
  • Manage dozens to hundreds of active bettors
  • Need efficient tools for customer management, settlements, reporting

**Current Pain Points (on platforms like Buckeye):**
  • Legacy agent tools, poor UX
  • Manual settlement processes
  • Limited reporting and analytics
  • Inflexible platform, can't adapt to their needs

**What WagerBabe Offers:**
  • Purpose-built Agent Hub (cashier, customer management, reports)
  • Tuesday settlement cycles with compliance tracking
  • Mobile-optimized agent interface
  • Built by agents who understand the workflow

---

## Success Metrics

### Phase 1 Target (2-3 Months): 10,000 Users

**Technical Metrics:**
  • API response time <200ms (p95)
  • Cache hit rate >90%
  • Database queries <50ms (p95)
  • System uptime 99.9%
  • Zero critical incidents during onboarding batches

**Business Metrics:**
  • Successfully onboard 10,000 users in controlled batches
  • 1,000 user batches validated with 3-7 day monitoring windows
  • Agent satisfaction: No major complaints about platform stability
  • User retention: >85% week-over-week active users
  • Revenue: $40k-60k/month ($4-6 PPH × 10k users)

**Validation Gates:**
  • Each 1k batch requires 3-7 days of stable operation before next batch
  • No degradation in performance metrics during scaling
  • Agent approval before next batch onboarding

### Long-Term Target (1-2 Years): 50,000 Users

**Technical Metrics:**
  • Support 50,000 concurrent users
  • WebSocket capacity: 50k+ connections
  • API response time <200ms under load
  • 99.9% uptime SLA

**Business Metrics:**
  • Revenue: $200k-300k/month ($4-6 PPH × 50k users)
  • Infrastructure cost: <1% of revenue ($1,364/mo is roughly 0.5-0.7% of the projected $200k-300k)
  • Multiple agent partnerships established
  • Market position: Competitive alternative to Buckeye

### Key Performance Indicators

**Growth Tracking:**
  • Weekly active users (WAU)
  • User onboarding rate (users/week)
  • Agent acquisition rate

**Technical Health:**
  • System uptime %
  • API response time (p95)
  • Error rate
  • Cache hit rate

**Business Health:**
  • Monthly Recurring Revenue (MRR)
  • Revenue per user (PPH)
  • Agent satisfaction scores
  • User retention rate

---

## MVP Scope: Phase 1 Foundation (1,000 Users)

**Objective:** Scale infrastructure to reliably support 1,000 concurrent users while maintaining all existing features and performance.

### Core Features (Must Maintain During Scaling)

1. **Mobile-First Betting Experience**
   - Non-negotiable - this is our competitive advantage vs Buckeye
   - Fast load times (<3s initial, <1s navigation)
   - Smooth scrolling sidebar (virtual scrolling for 100+ leagues)
   - Instant betting slip interactions
   - Real-time odds updates

2. **Sports Sidebar & Navigation**
   - Filter to only bettable sports
   - Prioritize American sports (NFL, NBA, MLB, NHL)
   - Game counts per league
   - Fast loading (<100ms cached, <300ms fresh)

3. **Real-Time Odds Display**
   - Live odds updates via WebSocket (30s latency for live games)
   - Tiered caching (live 30s, upcoming 5min, scheduled 30min)
   - Multiple sportsbook odds comparison

4. **Betting Slip Functionality**
   - Add/remove bets smoothly
   - Calculate parlays correctly (worked example after this list)
   - Submit bets reliably (zero data loss)

5. **Agent Hub (Critical for Agent Retention)**
   - Customer management (CRUD operations)
   - Cashier interface (balance management)
   - Tuesday settlement cycles
   - Basic reporting

6. **Authentication & User Management**
   - JWT-based secure authentication
   - Agent vs user role separation
   - Session management
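
Core feature 4 calls out correct parlay calculation as a must-keep behavior during scaling; since it is a correctness gate rather than a performance one, the arithmetic is worth pinning down. A minimal sketch in Python using American odds (function names are illustrative, not the actual betting-slip code):

```python
def american_to_decimal(odds: int) -> float:
    """Convert American odds (e.g. -110, +150) to decimal odds."""
    return 1 + (odds / 100 if odds > 0 else 100 / abs(odds))


def parlay_payout(stake: float, legs: list[int]) -> float:
    """Total payout for a parlay: stake times the product of each leg's decimal odds."""
    combined = 1.0
    for leg in legs:
        combined *= american_to_decimal(leg)
    return round(stake * combined, 2)


# Example: $10 on a two-leg parlay at -110 and +150 pays ~$47.73
print(parlay_payout(10, [-110, 150]))
```
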
### Phase 1 Infrastructure Deliverables

**Week 1: Database & Sidebar Optimization**

  1. Enhanced sidebar service (filtering, prioritization)
  2. Database materialized view for sidebar
  3. Sidebar API enhancements
  4. PgBouncer setup (10k connections -> 100 DB connections)
  5. Virtual scrolling in sidebar component
  6. Sidebar TanStack Query hook

**Week 2: Tiered Caching & API Efficiency**
  7. Game status classifier (live, upcoming, scheduled)
  8. Tiered Redis caching (30s to 30min based on game status; see the sketch after this list)
  9. Request batching & deduplication
  10. Database query optimization
  11. Dynamic TanStack Query configuration
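
Items 7-8 above combine a game status classifier with status-dependent TTLs. A minimal sketch, assuming a `redis-py` client and hypothetical helpers (`classify_game`, `cache_odds`); the one-hour "upcoming" threshold, key names, and TTLs are assumptions the Phase 1 implementation would pin down:

```python
import json
from datetime import datetime, timezone

import redis  # redis-py client, assumed available in the FastAPI service

# TTLs mirror the tiers above: live 30s, upcoming 5min, scheduled 30min
TTL_BY_STATUS = {"live": 30, "upcoming": 300, "scheduled": 1800}

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def classify_game(start_time: datetime, finished: bool) -> str:
    """Item 7: classify a game as live, upcoming (<1h out), or scheduled."""
    now = datetime.now(timezone.utc)
    if not finished and start_time <= now:
        return "live"
    if (start_time - now).total_seconds() < 3600:
        return "upcoming"
    return "scheduled"


def cache_odds(game_id: str, odds: dict, status: str) -> None:
    """Item 8: write odds to Redis with a status-dependent TTL."""
    r.setex(f"odds:{game_id}", TTL_BY_STATUS[status], json.dumps(odds))


def get_odds(game_id: str) -> dict | None:
    """Read-through helper: return cached odds, or None on a miss."""
    raw = r.get(f"odds:{game_id}")
    return json.loads(raw) if raw else None
```
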
**Success Criteria for Phase 1:**

  • Sidebar loads <100ms (cached)
  • API usage <3,000 req/min (50% of limit)
  • Cache hit rate >90%
  • Database queries <50ms (p95)
  • Support 500 concurrent DB connections via PgBouncer
  • All existing features work perfectly at 1,000 users

### Out of Scope for Phase 1

**Deferred to Phase 2+ (Not needed for 1k users):**
  • WebSocket horizontal scaling (works fine for <1k users)
  • CDN for static assets (nice-to-have, not critical)
  • Advanced monitoring/APM (use built-in dashboards for now)
  • Read replicas (single DB sufficient for 1k users)
  • Background job clustering (APScheduler sufficient)

**Future Vision (Phase 3-4):**
  • Event streaming (Kafka/RabbitMQ)
  • CQRS pattern
  • Microservices architecture
  • Multi-region deployment
  • Advanced caching strategies (predictive pre-caching)

### Onboarding Process

**Manual User Creation:**
  • Agent creates user accounts one-by-one via Agent Hub
  • Simple form: username, password, contact info
  • No bulk import needed yet (deferred to Phase 2+)
  • Agents manage their own user books

**Validation Process:**
  1. Test with 10-100 users (internal/friendly agents)
  2. Monitor for 3-7 days
  3. If stable -> onboard first 1,000 user batch
  4. Monitor for 7 days
  5. If stable -> proceed to Phase 2 planning

---

## Timeline Constraints

**Phase 1 Target: 2-3 months to 10,000 users**
  • Month 1: Complete Phase 1 infrastructure (Weeks 1-2), validate with test users
  • Months 2-3: Controlled onboarding in 1,000 user batches with 3-7 day validation windows

**Long-Term Target: 1-2 years to 50,000 users**
  • Months 1-3: Phase 1 + Phase 2 (10k capacity)
  • Months 4-6: Phase 3 (25k capacity) if growth continues
  • Year 2: Phase 4 (50k capacity) as needed

**Flexibility:** "Pay-as-you-go" approach - build the next phase only when current capacity approaches its limits. The system breathes and scales based on actual demand, not projections.

---

## Risks and Assumptions

### Critical Assumptions

1. **Agent Commitment Holds**
   - Assumes 2 master agents follow through with user migration
   - Mitigation: Maintain strong agent relationships, deliver on promises

2. **Users Accept Platform**
   - Assumes Buckeye users will adapt to WagerBabe UX
   - Mitigation: Mobile-first UX is superior - should exceed expectations

3. **$4-6 PPH Revenue Model**
   - Assumes industry-standard pricing holds
   - Mitigation: Pricing is competitive and validated by market

4. **Manual Onboarding Scales**
   - Assumes one-by-one user creation is acceptable for 1k batches
   - Mitigation: Can build bulk import tools in Phase 2 if needed

### Technical Risks

**High Risk:**
1. **Database Storage (97.4% full)**
   - Impact: Service fails at 100%
   - Timeline: ~2 weeks
   - Mitigation: Archive old odds_history data immediately, upgrade to Supabase Pro

2. **Single Point of Failure**
   - Impact: One server crash = total downtime
   - Mitigation: Phase 1 includes redundancy and PgBouncer HA setup

**Medium Risk:**

1. **API Quota Exceeded**
   - Impact: Service degradation if we hit the 6k req/min limit
   - Mitigation: Aggressive caching in Phase 1 reduces API calls by 60%

2. **Migration Complexity**
   - Impact: Users face friction during account creation
   - Mitigation: Simple onboarding process, agent training

**Low Risk:**

1. **Infrastructure Cost Overruns**
   - Impact: Budget exceeded
   - Mitigation: Cost is <1% of revenue - negligible risk

### Business Risks

**High Priority:**

1. **Agent Retention**
   - Risk: Agents get frustrated if the platform is unstable
   - Mitigation: Conservative rollout, validation gates, zero tolerance for critical bugs

2. **Competitive Response**
   - Risk: Buckeye improves its UX in response
   - Mitigation: Speed to market, build a moat with the superior mobile experience

**Medium Priority:**

1. **User Churn During Migration**
   - Risk: Users drop off during the transition
   - Mitigation: Smooth onboarding, agent support, superior UX retention

---

## Enterprise Operations & Reliability

**Philosophy:** Scale like Google, not like a startup. Build reliability, observability, and operational excellence into the foundation - not bolted on later.

### Site Reliability Engineering (SRE) Framework

**Service Level Objectives (SLOs):**

**Phase 1 SLOs (1,000 users):**
  • Availability: 99.9% uptime per month (≤ 43 minutes downtime)
  • Latency: 95% of API requests complete in <200ms
  • Error Rate: <2% of requests result in 5xx errors
  • Data Durability: Zero data loss on bet submissions

**Phase 2+ SLOs (10,000+ users):**
  • Availability: 99.95% uptime per month (≤ 21 minutes downtime)
  • Latency: 95% of API requests <150ms, 99% <500ms
  • Error Rate: <1% of requests result in 5xx errors
  • WebSocket: 99.9% message delivery rate

**Service Level Indicators (SLIs):**
  • API response time (p50, p95, p99)
  • HTTP status code distribution (2xx, 4xx, 5xx)
  • Database query latency
  • Cache hit rate
  • WebSocket connection success rate
  • Bet submission success rate

**Error Budget:**
  • 99.9% uptime = 0.1% error budget = 43 minutes/month
  • If error budget exhausted: STOP new feature deployments, focus on reliability
  • Track error budget burn rate (alerts if burning too fast; see the sketch after this list)
  • Monthly error budget review and adjustment
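
The burn-rate tracking above is simple enough to encode directly; a minimal sketch of the arithmetic (the ~43-minute budget is 0.1% of a 30-day month, and the 10%/day threshold mirrors the alert definition later in this section):

```python
MINUTES_PER_MONTH = 30 * 24 * 60                   # 43,200 minutes in a 30-day month
SLO = 0.999                                        # 99.9% availability target
ERROR_BUDGET_MIN = MINUTES_PER_MONTH * (1 - SLO)   # ~43.2 minutes of downtime allowed


def burn_rate(downtime_minutes: float, days_elapsed: float) -> float:
    """Fraction of the monthly error budget consumed per day so far."""
    return (downtime_minutes / ERROR_BUDGET_MIN) / days_elapsed


# Example: 10 minutes of downtime in the first 2 days of the month ~= 12%/day
if burn_rate(downtime_minutes=10, days_elapsed=2) > 0.10:
    print("Error budget burning >10%/day - pause feature work, investigate reliability")
```
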
---

### Observability & Monitoring

**Three Pillars Implementation:**

**1. Structured Logging**

  • Format: JSON with consistent schema
  • Required Fields: timestamp, trace_id, user_id, request_id, severity, service, message
  • Centralized: All logs aggregated (Datadog, CloudWatch, or LogDNA)
  • Retention: 30 days hot, 90 days archived
  • Correlation: Trace IDs link related logs across services

**Example Log Entry:**

```json
{ "timestamp": "2025-01-13T18:45:23.123Z", "trace_id": "abc123", "user_id": "user_456", "request_id": "req_789", "severity": "ERROR", "service": "betting-api", "endpoint": "/api/v1/bets", "error": "Database connection timeout", "latency_ms": 5230
}
```
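
One way to emit entries in that shape from the FastAPI service is a JSON formatter on Python's standard `logging` module; a minimal sketch, with correlation IDs passed via `extra` (in practice they would be injected by request middleware):

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the required correlation fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "trace_id": getattr(record, "trace_id", None),
            "user_id": getattr(record, "user_id", None),
            "request_id": getattr(record, "request_id", None),
            "severity": record.levelname,
            "service": "betting-api",
            "message": record.getMessage(),
        })


logger = logging.getLogger("betting-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Correlation IDs arrive via `extra` (in practice, set by request middleware)
logger.error("Database connection timeout",
             extra={"trace_id": "abc123", "user_id": "user_456", "request_id": "req_789"})
```
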
**2. Metrics & Dashboards**

- **Real-time Dashboards:** Grafana or Datadog with 1-minute refresh
- **Key Metrics:**
  - Request rate (req/min)
  - Error rate (errors/min)
  - Latency percentiles (p50, p95, p99)
  - Database connection pool utilization
  - Cache hit/miss rates
  - Redis memory usage
  - Active WebSocket connections
- **Business Metrics:**
  - Bets placed/min
  - Revenue/hour
  - Active users (concurrent)

**3. Distributed Tracing**
- **Phase 1:** Basic request tracing (trace ID in logs)
- **Phase 2+:** Full distributed tracing (Jaeger, Zipkin, or Datadog APM)
- **Track:** Request flow from client -> API -> database -> cache
- **Use Cases:** Debug slow requests, identify bottlenecks

**Alerting Philosophy:**

**Actionable Alerts Only:**
- Every alert must have a runbook (what to do)
- Alerts go to on-call engineer (not ignored)
- No "FYI" alerts - only actionable issues **Alert Definitions:** **Critical (Page On-Call):**
- API error rate >5% for 5 minutes
- Database connections >90% for 5 minutes
- API latency p95 >1000ms for 5 minutes
- Service down (health check fails)

**High (Slack/Email):**
- API error rate >2% for 10 minutes
- Cache hit rate <80% for 15 minutes
- Database queries >100ms p95 for 10 minutes
- Error budget burn rate >10%/day

**Medium (Dashboard):**
- API quota usage >70%
- Database storage >90%
- Redis memory >85%

**On-Call Rotation:**
- **Phase 1:** Single engineer on-call 24/7 (week rotations)
- **Phase 2+:** Primary + backup on-call with escalation
- **Compensation:** On-call pay or comp time
- **Tools:** PagerDuty, Opsgenie, or VictorOps

---

### Load Testing & Capacity Planning

**Continuous Load Testing Strategy:**

**Testing Cadence:**
- **Weekly:** Automated load tests at 1.5x current peak
- **Before Phase Gate:** Load test at 2x target capacity
- **After Incidents:** Reproduce the load conditions that caused the issue

**Load Testing Scenarios:**

**Scenario 1: Normal Load**
- Simulate typical user behavior (18 min session, 2.3 bets)
- Ramp up to expected concurrent users over 10 minutes
- Sustain for 30 minutes
- **Success:** All SLOs met

**Scenario 2: Peak Load (2x Normal)**
- Simulate peak hours (6pm-11pm EST)
- Ramp to 2x concurrent users
- Sustain for 60 minutes
- **Success:** Graceful degradation, no crashes

**Scenario 3: Spike Load**
- Simulate traffic spike (major sporting event)
- 0 -> 5x users in 5 minutes
- Sustain for 30 minutes
- **Success:** Auto-scaling works, no user-facing errors

**Scenario 4: Sustained Load**
- Simulate growth (10% more users daily for 7 days)
- **Success:** No performance degradation over time

**Tools:**
- **k6** (primary - scriptable, CI/CD integration)
- **Locust** (secondary - Python-based, distributed; a sample scenario is sketched after this list)
- **Artillery** (API-specific load testing)
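
As noted on the Locust item above, a Scenario 1-style profile can be expressed as a small Python script; the endpoints, wait times, and browse-to-bet ratio here are illustrative assumptions, not the real API surface:

```python
from locust import HttpUser, task, between


class TypicalBettor(HttpUser):
    """Approximates Scenario 1: browsing-heavy sessions with occasional bets (~5:1)."""

    wait_time = between(30, 90)  # seconds between actions, roughly an 18-minute session

    @task(5)
    def browse_odds(self):
        # Hypothetical routes - substitute the real API surface
        self.client.get("/api/v1/sidebar")
        self.client.get("/api/v1/odds?league=nfl")

    @task(1)
    def place_bet(self):
        self.client.post("/api/v1/bets", json={
            "game_id": "example-game",
            "selection": "home",
            "stake": 10,
        })
```

The 10-minute ramp and 30-minute sustain would be set at run time (for example via Locust's `--users` and `--spawn-rate` options).
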
**Capacity Forecasting:**

- Weekly capacity reports (current usage vs limits)
- Alerts when approaching 80% capacity (any resource)
- 3-month growth projections based on onboarding rate
- Pre-provision infrastructure 30 days before hitting 80%

---

### Deployment & Release Strategy

**Deployment Philosophy:**
- **Zero-downtime deployments** - users never see outages
- **Rollback in <5 minutes** - fast recovery from bad deploys
- **Feature flags** - decouple deployment from release
- **Database migrations** - run while the system is live

**Phase 1: Manual Deployments with Validation**

**Pre-Deployment Checklist:**
1. All tests pass (unit, integration, e2e)
2. Load test at 2x capacity passes
3. Database migrations tested on staging
4. Rollback plan documented
5. Monitoring dashboards ready
6. On-call engineer available

**Deployment Process:**
1. Deploy to staging, validate
2. Run smoke tests (critical paths work)
3. Deploy to production during low-traffic window
4. Monitor error rates for 30 minutes
5. If error rate spikes >2%: immediate rollback

**Rollback Procedure:**
1. Revert to previous Docker image/commit
2. Rollback database migrations (if safe)
3. Clear caches to prevent stale data
4. Post-mortem within 24 hours

**Phase 2+: Automated Deployments**

**Canary Releases:**
- Deploy to 5% of users first
- Monitor for 15 minutes
- If metrics healthy -> 25% -> 50% -> 100%
- Auto-rollback if error rate >2%

**Feature Flags:**
- All risky features behind flags (LaunchDarkly, Unleash, custom)
- Enable for internal users first
- Gradual rollout: 1% -> 10% -> 50% -> 100%
- Kill switch: disable feature instantly if issues (see the sketch after this list)
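
If the custom option is chosen over LaunchDarkly/Unleash, deterministic hashing gives sticky percentage rollouts with very little code; a minimal sketch where flag names and percentages are illustrative:

```python
import hashlib

# Rollout percentage per flag; bump 1 -> 10 -> 50 -> 100, or set to 0 as a kill switch
FLAGS = {"new_betting_slip": 10}


def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user into [0, 100) so rollouts stay sticky per user."""
    pct = FLAGS.get(flag, 0)
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < pct


# Usage: gate the risky code path
if is_enabled("new_betting_slip", user_id="user_456"):
    pass  # render the new experience
```
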
**Blue/Green Deployments (Phase 3+):**

- Two identical production environments
- Deploy to "green" while "blue" serves traffic
- Switch traffic instantly
- Keep "blue" for instant rollback --- ### Disaster Recovery & Business Continuity **Recovery Objectives:** **RTO (Recovery Time Objective):**
- **Phase 1:** <2 hours from total outage to operational
- **Phase 2+:** <30 minutes from total outage to operational
- **Critical systems:** <15 minutes (betting, agent cashier)

**RPO (Recovery Point Objective):**
- **Bets:** Zero data loss (synchronous writes, transaction logs)
- **User data:** <5 minutes (database backup every 5 min)
- **Odds data:** <30 minutes (can re-fetch from API)

**Backup Strategy:**

**Database Backups (Supabase):**
- **Frequency:**
  - Phase 1: Daily backups (Supabase Free/Pro)
  - Phase 2+: Hourly snapshots, 7-day retention
- **Testing:** Monthly restore drills (restore to staging)
- **Retention:** 30 days hot, 90 days archived

**Application Backups:**
- **Docker Images:** All tagged images retained 90 days
- **Code:** Git is source of truth (GitHub)
- **Configuration:** Infrastructure as Code (IaC) in repo

**Redis Backups:**
- **RDB Snapshots:** Every 5 minutes
- **Acceptable Data Loss:** Cache can be rebuilt from DB

**Disaster Recovery Drills:**

**Quarterly DR Exercises:**
- **Q1:** Database restore from backup
- **Q2:** Full service recovery from zero (infrastructure rebuild)
- **Q3:** Database corruption scenario (point-in-time recovery)
- **Q4:** Multi-failure scenario (database + app server down)

**Runbooks for Common Failures:**
- Database connection exhaustion
- API quota exceeded
- Redis out of memory
- Deployment rollback procedure
- Database restore procedure
- DNS/CDN failure
- Third-party API outage (Optic Odds)

---

### Security & Compliance

**Security Baseline (Phase 1):**

**Automated Security Scanning:**
- **Dependency Scanning:** Snyk or Dependabot (weekly scans)
- **Container Scanning:** Scan Docker images for vulnerabilities
- **Code Scanning:** GitHub Advanced Security or SonarQube
- **Secret Scanning:** Prevent secrets in commits (git-secrets)

**Secrets Management:**
- **Never Hardcode:** All secrets in environment variables
- **Rotation:** API keys rotated every 90 days
- **Access Control:** Secrets encrypted at rest (Railway/Supabase handles this)
- **Phase 2+:** Vault or AWS Secrets Manager for centralized secrets

**Rate Limiting:**
- **Login Endpoint:** 5 attempts/min per IP
- **API Endpoints:** 100 req/min per user (burst: 200)
- **Agent Endpoints:** 200 req/min per agent
- **Public Endpoints:** 1000 req/min total (Cloudflare handles this) - application-level enforcement is sketched after this list
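
Cloudflare covers the edge, but the per-endpoint limits above also need enforcement in the FastAPI app itself. A minimal sketch using the `slowapi` package, keyed by client IP (per-user keying would substitute the authenticated user ID, and the routes shown are placeholders):

```python
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)  # key requests by client IP
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)


@app.post("/api/v1/auth/login")   # placeholder route
@limiter.limit("5/minute")        # mirrors the login limit above
async def login(request: Request):
    return {"status": "ok"}


@app.get("/api/v1/odds")          # placeholder route
@limiter.limit("100/minute")      # mirrors the per-user API limit (keyed by IP in this sketch)
async def get_odds(request: Request):
    return {"odds": []}
```
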
**Audit Logging:**

- **Agent Actions:** Log all financial operations (cashier, settlements)
- **User Actions:** Log bet placements, withdrawals
- **Admin Actions:** Log all admin operations
- **Retention:** 1 year minimum (compliance)

**DDoS Protection:**
- **Cloudflare:** Free tier provides basic DDoS protection
- **Phase 2+:** Cloudflare Pro for advanced DDoS mitigation
- **Rate Limiting:** API-level rate limiting prevents abuse

**Compliance Considerations:**

**Phase 1 (Nice to Have):**
- GDPR compliance (if EU users)
- Data retention policies
- User data export/deletion

**Phase 2+ (Required for Scale):**
- SOC 2 Type II (for enterprise agents)
- Penetration testing (quarterly)
- Security audits (annual)

---

### Data Management at Scale

**Data Retention Policy:**

**Hot Data (PostgreSQL):**
- **Bets:** Retain all (required for settlements)
- **User Data:** Retain active users indefinitely
- **Odds Data (Current):** 7 days in main tables
- **Odds Data (Historical):** Move to archive after 7 days
- **Logs:** 30 days in primary storage

**Cold Data (Archival):**
- **Odds History >7 days:** Archive to S3/Glacier (cheap storage)
- **Inactive Users (>1 year):** Soft delete, archive data
- **Old Logs:** Compress and archive to S3, retain 90 days

**Archival Strategy (Immediate - Database 97.4% Full):**

**Priority 1 (This Week):**
1. Archive `odds_history` older than 7 days to S3/export
2. Delete archived rows from database
3. **Expected Savings:** ~50MB (10% storage freed)

**Priority 2 (Phase 1):**
4. Set up automated archival job (weekly; sketched after this list)
5. Compress old odds data before archival
6. Document restoration procedure
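
The weekly job in item 4 can be little more than an export-then-delete pass; a minimal sketch using `psycopg2` and `boto3`, where the DSN, bucket name, `created_at` column, and 7-day cutoff are all assumptions to adjust:

```python
import csv
import io
from datetime import datetime, timedelta, timezone

import boto3      # S3 client for the archive target
import psycopg2   # direct Postgres access for the export/delete pass

CUTOFF = datetime.now(timezone.utc) - timedelta(days=7)

conn = psycopg2.connect("postgresql://user:pass@host:5432/wagerbabe")  # placeholder DSN
s3 = boto3.client("s3")

with conn, conn.cursor() as cur:
    # Export rows older than the cutoff to CSV in memory
    cur.execute("SELECT * FROM odds_history WHERE created_at < %s", (CUTOFF,))
    buf = io.StringIO()
    csv.writer(buf).writerows(cur.fetchall())

    # Ship the export to cheap storage, then reclaim database space
    s3.put_object(Bucket="wagerbabe-archive",                    # illustrative bucket name
                  Key=f"odds_history/{CUTOFF:%Y%m%d}.csv",
                  Body=buf.getvalue())
    cur.execute("DELETE FROM odds_history WHERE created_at < %s", (CUTOFF,))
```

A plain `VACUUM` afterward lets PostgreSQL reuse the freed space.
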
**Data Quality Monitoring:**

- **Anomaly Detection:** Alert if bet counts drop >50% (potential bug)
- **Data Validation:** Check for null required fields
- **Consistency Checks:** Verify settlement totals match bet totals

**Sharding Strategy (Phase 4+):**
- **When:** Single database can't handle 50k+ users
- **How:** Shard by agent_id (each agent's users on same shard) - see the sketch after this list
- **Benefit:** Agents' data stays together, easier queries
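
Routing by agent keeps shard selection trivial; a minimal sketch of deterministic selection (shard count and connection strings are placeholders):

```python
import hashlib

SHARD_DSNS = [
    "postgresql://shard0.internal/wagerbabe",  # placeholder connection strings
    "postgresql://shard1.internal/wagerbabe",
    "postgresql://shard2.internal/wagerbabe",
]


def shard_for_agent(agent_id: str) -> str:
    """All of an agent's users (and their bets) resolve to the same shard."""
    digest = hashlib.sha256(agent_id.encode()).hexdigest()
    return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]
```

A lookup table or consistent hashing would likely be preferable in practice, since plain modulo forces data moves whenever a shard is added.
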
---

### Chaos Engineering (Phase 2+)

**Testing Resilience Through Controlled Failure:**

**Chaos Experiments:**

- **Kill Random API Instance:** Does the load balancer recover?
- **Delay Database Queries:** Does connection pool handle it?
- **Fill Redis Memory:** Does eviction policy work?
- **Simulate API Quota Exceeded:** Does fallback work?

**GameDay Exercises:**
- Quarterly "break production on purpose" drills
- Test incident response, on-call, runbooks
- Netflix-style: simulate outages during business hours

---

## Enterprise Operations Roadmap

### Phase 1 (Weeks 1-2): Operations Foundation
**Must Have:**
- Structured logging with trace IDs
- Real-time dashboards (request rate, error rate, latency)
- Critical alerts with runbooks (error rate, database connections, latency)
- Basic load testing (k6 scripts for normal + peak scenarios)
- Deployment checklist and rollback procedure
- Daily database backups (Supabase Pro)
- Archive old odds_history data (free 50MB storage)
- Dependency scanning (Snyk/Dependabot)
- Rate limiting on public endpoints

**Nice to Have (Defer if Needed):**
- On-call rotation setup (can use email alerts initially)
- Load test automation in CI/CD

---

### Phase 2 (Weeks 3-4): Advanced Observability
**Add:**
- Distributed tracing (Datadog APM or Jaeger)
- On-call rotation with PagerDuty
- Canary deployments for major changes
- Feature flags for risky features
- Hourly database snapshots
- Automated load testing in CI/CD
- Quarterly DR drill (database restore)

---

### Phase 3 (Month 2): Enterprise-Grade Operations
**Add:**
- SOC 2 Type II preparation
- Penetration testing (external firm)
- Automated rollback on error rate spike
- Blue/green deployment infrastructure
- Chaos engineering experiments
- Multi-region planning

---

### Phase 4 (Months 3-4): Scale Hardening
**Add:**
- Multi-region active-passive
- Advanced chaos engineering (GameDays)
- Data sharding strategy
- Predictive capacity planning models

---

## Supporting Materials

This Product Brief incorporates analysis from:

**Scaling Documentation:**
- **[SCALING_ROADMAP.md](scaling/SCALING_ROADMAP.md)** - Detailed 4-phase implementation plan
- **[CURRENT_STATE.md](scaling/CURRENT_STATE.md)** - Baseline infrastructure metrics and bottlenecks
- **[COST_ANALYSIS.md](scaling/COST_ANALYSIS.md)** - Financial projections showing infrastructure <1% of revenue
- **[ARCHITECTURE_DECISIONS.md](scaling/ARCHITECTURE_DECISIONS.md)** - 9 ADRs documenting technical choices

**Key Findings from Analysis:**
- Infrastructure costs scale sub-linearly (per-user cost drops 73% from 100 to 50k users)
- Current capacity: ~50 concurrent users
- Target capacity: 50,000 concurrent users (1000x increase)
- ROI: 146-220x at full scale
- Managed services (Railway + Supabase) preferred over self-hosted for developer productivity

**Existing Platform Features:**
- FastAPI server with JWT authentication
- Next.js client with mobile-first responsive design
- Agent Hub with customer management, cashier, and reporting
- Real-time odds via WebSocket
- Tuesday settlement cycles

---

_This Product Brief captures the vision and requirements for the WagerBabe Scaling Initiative._
_It was created through collaborative discovery and reflects the unique needs of this enterprise project._

**Next Steps:** Transform this brief into a detailed PRD that will guide architecture decisions and epic/story breakdown.