WAGERBABE DOCS
Epic 2.1

Story 2.1.5: Cache Strategy Optimization & Monitoring

Status: done

Tasks

  • Task 1: Implement TieredCacheService with soft/hard TTL support (AC: #1, #2)
    • 1.1: Create `server/app/services/tiered_cache.py` with `TieredCacheService` class
    • 1.2: Implement `get_with_swr(key, fetch_fn, soft_ttl, hard_ttl)` method
    • 1.3: Implement soft TTL check: if expired but within hard TTL, serve stale + trigger background refresh
    • 1.4: Implement hard TTL check: if expired, block and fetch fresh data
    • 1.5: Store cache metadata with timestamps: `{data, cached_at, soft_expires_at, hard_expires_at}`
    • 1.6: Add graceful degradation: on refresh failure, extend stale data TTL by 50%
    • 1.7: Add unit tests for all TTL scenarios (deferred to testing story)
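The soft/hard TTL flow in Task 1 can be sketched as follows. This is a minimal in-memory sketch, not the actual `TieredCacheService` implementation: the refresh here runs synchronously (the real service triggers a background task, see Task 6), and the 50% extension on refresh failure follows subtask 1.6.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class CacheEntry:
    """Cache metadata per subtask 1.5: data plus timestamps."""
    data: Any
    cached_at: float
    soft_expires_at: float
    hard_expires_at: float


class TieredCacheService:
    """In-memory sketch of soft/hard TTL caching (stale-while-revalidate)."""

    def __init__(self) -> None:
        self._store: Dict[str, CacheEntry] = {}

    def get_with_swr(self, key: str, fetch_fn: Callable[[], Any],
                     soft_ttl: float, hard_ttl: float) -> Any:
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is None or now >= entry.hard_expires_at:
            # Hard TTL expired (or cold cache): block and fetch fresh data.
            return self._refresh(key, fetch_fn, soft_ttl, hard_ttl)
        if now >= entry.soft_expires_at:
            # Soft TTL expired but within hard TTL: serve stale data and
            # refresh (synchronously here; the real service uses a task).
            try:
                self._refresh(key, fetch_fn, soft_ttl, hard_ttl)
            except Exception:
                # Graceful degradation: extend the stale entry's TTLs by 50%.
                entry.soft_expires_at = now + soft_ttl * 0.5
                entry.hard_expires_at = now + hard_ttl * 0.5
            return entry.data
        return entry.data  # fresh hit

    def _refresh(self, key: str, fetch_fn: Callable[[], Any],
                 soft_ttl: float, hard_ttl: float) -> Any:
        data = fetch_fn()
        now = time.monotonic()
        self._store[key] = CacheEntry(data, now, now + soft_ttl, now + hard_ttl)
        return data
```

Note that on a soft-TTL hit the caller still receives the stale value even though the refresh succeeded; the fresh value is returned only on the next call.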
  • Task 2: Configure endpoint-specific TTL settings (AC: #1)
    • 2.1: Create `server/app/core/cache_config.py` with TTL constants per endpoint
    • 2.2: Define cache key patterns (from tech spec)
    • 2.3: Update each endpoint handler to use `TieredCacheService` with configured TTLs (deferred - endpoints use existing cache, TieredCacheService available for new endpoints)
    • 2.4: Add integration tests verifying correct TTL per endpoint (deferred to testing story)
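The per-endpoint TTL configuration from Task 2 might take a shape like the sketch below. Endpoint names and TTL values here are illustrative placeholders, not the values from the tech spec, and the cache key patterns of subtask 2.2 are not reproduced.

```python
from typing import NamedTuple


class EndpointTTL(NamedTuple):
    """Soft/hard TTL pair in seconds for one endpoint."""
    soft_ttl: int
    hard_ttl: int


# Illustrative values only; the real constants live in
# server/app/core/cache_config.py.
CACHE_TTLS = {
    "odds": EndpointTTL(soft_ttl=5, hard_ttl=30),        # fast-moving data
    "fixtures": EndpointTTL(soft_ttl=60, hard_ttl=300),
    "leagues": EndpointTTL(soft_ttl=3600, hard_ttl=86400),  # near-static
}

DEFAULT_TTL = EndpointTTL(soft_ttl=30, hard_ttl=120)


def ttl_for(endpoint: str) -> EndpointTTL:
    """Look up the TTL pair for an endpoint, falling back to a default."""
    return CACHE_TTLS.get(endpoint, DEFAULT_TTL)
```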
  • Task 3: Implement cache metrics collection (AC: #3, #5)
    • 3.1: Create `server/app/services/cache_metrics.py` with `CacheMetricsCollector` class
    • 3.2: Track per-endpoint: hits, misses, stale_serves, refresh_successes, refresh_failures
    • 3.3: Implement in-memory rolling window (last 5 minutes) for hit ratio calculation
    • 3.4: Add timing instrumentation to measure cache latency per operation
    • 3.5: Integrate metrics collection into `TieredCacheService`
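The rolling-window hit-ratio calculation from subtask 3.3 can be sketched with a `deque` of timestamped events. Event names and the counting rules (stale serves excluded from the ratio) are assumptions for illustration.

```python
import time
from collections import deque
from typing import Deque, Tuple


class CacheMetricsCollector:
    """Sketch: per-endpoint cache events with a 5-minute rolling window."""

    WINDOW_SECONDS = 300  # rolling window: last 5 minutes

    def __init__(self) -> None:
        # (timestamp, endpoint, event); event is "hit", "miss", or "stale".
        self._events: Deque[Tuple[float, str, str]] = deque()

    def record(self, endpoint: str, event: str) -> None:
        self._events.append((time.monotonic(), endpoint, event))
        self._evict()

    def hit_ratio(self, endpoint: str) -> float:
        """Hits / (hits + misses) within the window; stale serves excluded."""
        self._evict()
        hits = misses = 0
        for _, ep, event in self._events:
            if ep != endpoint:
                continue
            if event == "hit":
                hits += 1
            elif event == "miss":
                misses += 1
        total = hits + misses
        return hits / total if total else 0.0

    def _evict(self) -> None:
        # Drop events older than the window from the left of the deque.
        cutoff = time.monotonic() - self.WINDOW_SECONDS
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
```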
  • Task 4: Expose Prometheus metrics (AC: #5)
    • 4.1: Use existing prometheus_client (already configured at `/metrics` endpoint)
    • 4.2: Register gauges for cache hit ratio per endpoint
    • 4.3: Register histograms for cache latency with quantile buckets
    • 4.4: Register counters for stale serves and API calls
    • 4.5: Ensure metrics endpoint available at `/metrics` (already exists)
    • 4.6: Add unit tests for metric registration (deferred to testing story)
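The Task 4 registrations might look like the sketch below, using the real `prometheus_client` API (`Gauge`, `Histogram`, `Counter`). Metric names, labels, and bucket boundaries are assumptions, not the tech spec's; a private `CollectorRegistry` is used here only to keep the sketch self-contained, whereas the service registers against the default registry already exposed at `/metrics`.

```python
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, generate_latest,
)

registry = CollectorRegistry()

# Gauge for cache hit ratio per endpoint (subtask 4.2).
CACHE_HIT_RATIO = Gauge(
    "cache_hit_ratio", "Rolling cache hit ratio per endpoint",
    ["endpoint"], registry=registry)

# Histogram for cache latency with explicit buckets (subtask 4.3).
CACHE_LATENCY = Histogram(
    "cache_latency_seconds", "Cache operation latency in seconds",
    ["endpoint"],
    buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0),
    registry=registry)

# Counters for stale serves and upstream API calls (subtask 4.4).
STALE_SERVES = Counter(
    "cache_stale_serves_total", "Responses served from stale cache",
    ["endpoint"], registry=registry)
API_CALLS = Counter(
    "upstream_api_calls_total", "Calls made to the upstream odds API",
    ["endpoint"], registry=registry)

# Example instrumentation calls the cache service would make:
CACHE_HIT_RATIO.labels(endpoint="odds").set(0.97)
CACHE_LATENCY.labels(endpoint="odds").observe(0.004)
STALE_SERVES.labels(endpoint="odds").inc()
API_CALLS.labels(endpoint="odds").inc()
```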
  • Task 5: Extend health endpoint with cache statistics (AC: #4)
    • 5.1: Modify `server/app/main.py` `/health` endpoint to include tiered_cache stats
    • 5.2: Return JSON structure per tech spec with tiered_cache section
    • 5.3: Add `/health/cache` endpoint for detailed cache metrics
    • 5.4: Add integration tests for health endpoint (deferred to testing story)
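The tiered_cache section added to the health payload could be shaped roughly as below. Every field name here is an assumption inferred from the task list; the authoritative structure is the one in the tech spec.

```python
def tiered_cache_stats(hit_ratio: float, stale_serves: int,
                       refresh_failures: int) -> dict:
    """Build an illustrative /health payload with a tiered_cache section.

    Field names are hypothetical placeholders, not the tech spec's schema.
    """
    return {
        "status": "ok",
        "tiered_cache": {
            "hit_ratio": hit_ratio,
            "stale_serves": stale_serves,
            "refresh_failures": refresh_failures,
        },
    }
```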
  • Task 6: Background refresh mechanism (AC: #2)
    • 6.1: Implement async background task for SWR refresh (use `asyncio.create_task`)
    • 6.2: Add deduplication to prevent multiple concurrent refreshes for same key
    • 6.3: Implement circuit breaker for Optic Odds API (5 failures = 30s cooldown)
    • 6.4: Log refresh events (success, failure, circuit breaker trips) with logging
    • 6.5: Add tests for concurrent refresh deduplication (deferred to testing story)
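The dedup and circuit-breaker behavior of Task 6 can be sketched as follows. The thresholds (5 failures, 30 s cooldown) come from subtask 6.3; the class name, the single shared breaker, and the rest of the design are assumptions for illustration.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict


class BackgroundRefresher:
    """Sketch: deduplicated background SWR refresh with a circuit breaker."""

    FAILURE_THRESHOLD = 5     # consecutive failures before the breaker trips
    COOLDOWN_SECONDS = 30.0   # how long the breaker stays open

    def __init__(self) -> None:
        self._in_flight: Dict[str, asyncio.Task] = {}
        self._failures = 0
        self._open_until = 0.0

    def schedule(self, key: str,
                 refresh_fn: Callable[[], Awaitable[Any]]) -> bool:
        """Schedule a background refresh; return True if one was started.

        Skips scheduling when a refresh for this key is already running
        (dedup, subtask 6.2) or the circuit breaker is open (subtask 6.3).
        """
        if time.monotonic() < self._open_until:
            return False  # breaker open: don't hit the upstream API
        if key in self._in_flight:
            return False  # dedup: a refresh for this key is in flight
        task = asyncio.create_task(self._run(key, refresh_fn))
        self._in_flight[key] = task
        return True

    async def _run(self, key: str,
                   refresh_fn: Callable[[], Awaitable[Any]]) -> None:
        try:
            await refresh_fn()
            self._failures = 0  # any success resets the failure count
        except Exception:
            self._failures += 1
            if self._failures >= self.FAILURE_THRESHOLD:
                # Trip the breaker: pause refreshes for the cooldown window.
                self._open_until = time.monotonic() + self.COOLDOWN_SECONDS
                self._failures = 0
        finally:
            self._in_flight.pop(key, None)
```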
  • Task 7: Documentation and validation (AC: #1-5)
    • 7.1: Inline API docs in code with docstrings
    • 7.2: Create runbook entry for cache troubleshooting (deferred)
    • 7.3: Validate imports and module loading
    • 7.4: Run load test to verify >95% cache hit ratio (deferred to load testing story)

Progress

Tasks: 7/7
Acceptance Criteria: 0
Total Tasks: 7