WAGERBABE DOCS
Epic 2.1

Story 2.1.5: Cache Strategy Optimization & Monitoring

Status: done

Tasks

  • Task 1: Implement TieredCacheService with soft/hard TTL support (AC: #1, #2)
    • 1.1: Create `server/app/services/tiered_cache.py` with `TieredCacheService` class
    • 1.2: Implement `get_with_swr(key, fetch_fn, soft_ttl, hard_ttl)` method
    • 1.3: Implement soft TTL check: if expired but within hard TTL, serve stale + trigger background refresh
    • 1.4: Implement hard TTL check: if expired, block and fetch fresh data
    • 1.5: Store cache metadata with timestamps: `{data, cached_at, soft_expires_at, hard_expires_at}`
    • 1.6: Add graceful degradation: on refresh failure, extend stale data TTL by 50%
    • 1.7: Add unit tests for all TTL scenarios (deferred to testing story)
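The soft/hard TTL flow in Task 1 can be sketched as follows. This is a minimal in-memory sketch, not the actual `TieredCacheService` implementation: the refresh here runs synchronously (the real service triggers a background task, see Task 6), and the 50% extension on refresh failure follows subtask 1.6.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class CacheEntry:
    """Cache metadata per subtask 1.5: data plus timestamps."""
    data: Any
    cached_at: float
    soft_expires_at: float
    hard_expires_at: float


class TieredCacheService:
    """In-memory sketch of soft/hard TTL caching (stale-while-revalidate)."""

    def __init__(self) -> None:
        self._store: Dict[str, CacheEntry] = {}

    def get_with_swr(self, key: str, fetch_fn: Callable[[], Any],
                     soft_ttl: float, hard_ttl: float) -> Any:
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is None or now >= entry.hard_expires_at:
            # Hard TTL expired (or cold cache): block and fetch fresh data.
            return self._refresh(key, fetch_fn, soft_ttl, hard_ttl)
        if now >= entry.soft_expires_at:
            # Soft TTL expired but within hard TTL: serve stale data and
            # refresh (synchronously here; the real service uses a task).
            try:
                self._refresh(key, fetch_fn, soft_ttl, hard_ttl)
            except Exception:
                # Graceful degradation: extend the stale entry's TTLs by 50%.
                entry.soft_expires_at = now + soft_ttl * 0.5
                entry.hard_expires_at = now + hard_ttl * 0.5
            return entry.data
        return entry.data  # fresh hit

    def _refresh(self, key: str, fetch_fn: Callable[[], Any],
                 soft_ttl: float, hard_ttl: float) -> Any:
        data = fetch_fn()
        now = time.monotonic()
        self._store[key] = CacheEntry(data, now, now + soft_ttl, now + hard_ttl)
        return data
```

Note that on a soft-TTL hit the caller still receives the stale value even though the refresh succeeded; the fresh value is returned only on the next call.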
  • Task 2: Configure endpoint-specific TTL settings (AC: #1)
    • 2.1: Create `server/app/core/cache_config.py` with TTL constants per endpoint
    • 2.2: Define cache key patterns (from tech spec)
    • 2.3: Update each endpoint handler to use `TieredCacheService` with configured TTLs (deferred - endpoints use existing cache, TieredCacheService available for new endpoints)
    • 2.4: Add integration tests verifying correct TTL per endpoint (deferred to testing story)
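The per-endpoint TTL configuration from Task 2 might take a shape like the sketch below. Endpoint names and TTL values here are illustrative placeholders, not the values from the tech spec, and the cache key patterns of subtask 2.2 are not reproduced.

```python
from typing import NamedTuple


class EndpointTTL(NamedTuple):
    """Soft/hard TTL pair in seconds for one endpoint."""
    soft_ttl: int
    hard_ttl: int


# Illustrative values only; the real constants live in
# server/app/core/cache_config.py.
CACHE_TTLS = {
    "odds": EndpointTTL(soft_ttl=5, hard_ttl=30),        # fast-moving data
    "fixtures": EndpointTTL(soft_ttl=60, hard_ttl=300),
    "leagues": EndpointTTL(soft_ttl=3600, hard_ttl=86400),  # near-static
}

DEFAULT_TTL = EndpointTTL(soft_ttl=30, hard_ttl=120)


def ttl_for(endpoint: str) -> EndpointTTL:
    """Look up the TTL pair for an endpoint, falling back to a default."""
    return CACHE_TTLS.get(endpoint, DEFAULT_TTL)
```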
  • Task 3: Implement cache metrics collection (AC: #3, #5)
    • 3.1: Create `server/app/services/cache_metrics.py` with `CacheMetricsCollector` class
    • 3.2: Track per-endpoint: hits, misses, stale_serves, refresh_successes, refresh_failures
    • 3.3: Implement in-memory rolling window (last 5 minutes) for hit ratio calculation
    • 3.4: Add timing instrumentation to measure cache latency per operation
    • 3.5: Integrate metrics collection into `TieredCacheService`
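The rolling-window hit-ratio calculation from subtask 3.3 can be sketched with a `deque` of timestamped events. Event names and the counting rules (stale serves excluded from the ratio) are assumptions for illustration.

```python
import time
from collections import deque
from typing import Deque, Tuple


class CacheMetricsCollector:
    """Sketch: per-endpoint cache events with a 5-minute rolling window."""

    WINDOW_SECONDS = 300  # rolling window: last 5 minutes

    def __init__(self) -> None:
        # (timestamp, endpoint, event); event is "hit", "miss", or "stale".
        self._events: Deque[Tuple[float, str, str]] = deque()

    def record(self, endpoint: str, event: str) -> None:
        self._events.append((time.monotonic(), endpoint, event))
        self._evict()

    def hit_ratio(self, endpoint: str) -> float:
        """Hits / (hits + misses) within the window; stale serves excluded."""
        self._evict()
        hits = misses = 0
        for _, ep, event in self._events:
            if ep != endpoint:
                continue
            if event == "hit":
                hits += 1
            elif event == "miss":
                misses += 1
        total = hits + misses
        return hits / total if total else 0.0

    def _evict(self) -> None:
        # Drop events older than the window from the left of the deque.
        cutoff = time.monotonic() - self.WINDOW_SECONDS
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
```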
  • Task 4: Expose Prometheus metrics (AC: #5)
    • 4.1: Use existing prometheus_client (already configured at `/metrics` endpoint)
    • 4.2: Register gauges for cache hit ratio per endpoint
    • 4.3: Register histograms for cache latency with quantile buckets
    • 4.4: Register counters for stale serves and API calls
    • 4.5: Ensure metrics endpoint available at `/metrics` (already exists)
    • 4.6: Add unit tests for metric registration (deferred to testing story)
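The Task 4 registrations might look like the sketch below, using the real `prometheus_client` API (`Gauge`, `Histogram`, `Counter`). Metric names, labels, and bucket boundaries are assumptions, not the tech spec's; a private `CollectorRegistry` is used here only to keep the sketch self-contained, whereas the service registers against the default registry already exposed at `/metrics`.

```python
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, generate_latest,
)

registry = CollectorRegistry()

# Gauge for cache hit ratio per endpoint (subtask 4.2).
CACHE_HIT_RATIO = Gauge(
    "cache_hit_ratio", "Rolling cache hit ratio per endpoint",
    ["endpoint"], registry=registry)

# Histogram for cache latency with explicit buckets (subtask 4.3).
CACHE_LATENCY = Histogram(
    "cache_latency_seconds", "Cache operation latency in seconds",
    ["endpoint"],
    buckets=(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0),
    registry=registry)

# Counters for stale serves and upstream API calls (subtask 4.4).
STALE_SERVES = Counter(
    "cache_stale_serves_total", "Responses served from stale cache",
    ["endpoint"], registry=registry)
API_CALLS = Counter(
    "upstream_api_calls_total", "Calls made to the upstream odds API",
    ["endpoint"], registry=registry)

# Example instrumentation calls the cache service would make:
CACHE_HIT_RATIO.labels(endpoint="odds").set(0.97)
CACHE_LATENCY.labels(endpoint="odds").observe(0.004)
STALE_SERVES.labels(endpoint="odds").inc()
API_CALLS.labels(endpoint="odds").inc()
```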
  • Task 5: Extend health endpoint with cache statistics (AC: #4)
    • 5.1: Modify `server/app/main.py` `/health` endpoint to include tiered_cache stats
    • 5.2: Return JSON structure per tech spec with tiered_cache section
    • 5.3: Add `/health/cache` endpoint for detailed cache metrics
    • 5.4: Add integration tests for health endpoint (deferred to testing story)
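The tiered_cache section added to the health payload could be shaped roughly as below. Every field name here is an assumption inferred from the task list; the authoritative structure is the one in the tech spec.

```python
def tiered_cache_stats(hit_ratio: float, stale_serves: int,
                       refresh_failures: int) -> dict:
    """Build an illustrative /health payload with a tiered_cache section.

    Field names are hypothetical placeholders, not the tech spec's schema.
    """
    return {
        "status": "ok",
        "tiered_cache": {
            "hit_ratio": hit_ratio,
            "stale_serves": stale_serves,
            "refresh_failures": refresh_failures,
        },
    }
```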
  • Task 6: Background refresh mechanism (AC: #2)
    • 6.1: Implement async background task for SWR refresh (use `asyncio.create_task`)
    • 6.2: Add deduplication to prevent multiple concurrent refreshes for same key
    • 6.3: Implement circuit breaker for Optic Odds API (5 failures = 30s cooldown)
    • 6.4: Log refresh events (success, failure, circuit breaker trips) with logging
    • 6.5: Add tests for concurrent refresh deduplication (deferred to testing story)
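The dedup and circuit-breaker behavior of Task 6 can be sketched as follows. The thresholds (5 failures, 30 s cooldown) come from subtask 6.3; the class name, the single shared breaker, and the rest of the design are assumptions for illustration.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict


class BackgroundRefresher:
    """Sketch: deduplicated background SWR refresh with a circuit breaker."""

    FAILURE_THRESHOLD = 5     # consecutive failures before the breaker trips
    COOLDOWN_SECONDS = 30.0   # how long the breaker stays open

    def __init__(self) -> None:
        self._in_flight: Dict[str, asyncio.Task] = {}
        self._failures = 0
        self._open_until = 0.0

    def schedule(self, key: str,
                 refresh_fn: Callable[[], Awaitable[Any]]) -> bool:
        """Schedule a background refresh; return True if one was started.

        Skips scheduling when a refresh for this key is already running
        (dedup, subtask 6.2) or the circuit breaker is open (subtask 6.3).
        """
        if time.monotonic() < self._open_until:
            return False  # breaker open: don't hit the upstream API
        if key in self._in_flight:
            return False  # dedup: a refresh for this key is in flight
        task = asyncio.create_task(self._run(key, refresh_fn))
        self._in_flight[key] = task
        return True

    async def _run(self, key: str,
                   refresh_fn: Callable[[], Awaitable[Any]]) -> None:
        try:
            await refresh_fn()
            self._failures = 0  # any success resets the failure count
        except Exception:
            self._failures += 1
            if self._failures >= self.FAILURE_THRESHOLD:
                # Trip the breaker: pause refreshes for the cooldown window.
                self._open_until = time.monotonic() + self.COOLDOWN_SECONDS
                self._failures = 0
        finally:
            self._in_flight.pop(key, None)
```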
  • Task 7: Documentation and validation (AC: #1-5)
    • 7.1: Inline API docs in code with docstrings
    • 7.2: Create runbook entry for cache troubleshooting (deferred)
    • 7.3: Validate imports and module loading
    • 7.4: Run load test to verify >95% cache hit ratio (deferred to load testing story)

Progress

Tasks: 7/7
Acceptance Criteria: 0
Total Tasks: 7