Concepts
Semantic caching vs. key-value caching
┌───────────────────────────────────────────────┐
│ Traditional cache (Redis GET/SET)             │
│                                               │
│ "What is X?"       → hash("What is X?")       │
│ "Tell me about X." → hash("Tell me ...")      │
│ ─ DIFFERENT KEYS → cache miss                 │
└───────────────────────────────────────────────┘
┌───────────────────────────────────────────────┐
│ cogcache                                      │
│                                               │
│ "What is X?"       → [0.12, -0.4, ...] vec    │
│ "Tell me about X." → [0.14, -0.3, ...] vec    │
│ ─ cosine(vec1, vec2) = 0.94 ≥ 0.92 → HIT      │
└───────────────────────────────────────────────┘
cogcache stores each query as a dense vector (embedding) alongside its answer. Each new query is embedded and compared by cosine similarity against the stored entries. If the best match meets or exceeds the threshold (0.92 in the example above), the cached answer is returned; otherwise the lookup is a miss.
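The lookup described above can be sketched in a few lines of plain Python. This is a minimal illustration, not cogcache's implementation: `cosine`, `best_match`, the in-memory `entries` list, and the toy 3-dimensional vectors are all assumptions made for the example, and the hit test uses ≥ to match the diagram.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query_vec, entries, threshold=0.92):
    """Return (answer, similarity) for the closest entry, or (None, similarity) on a miss.

    `entries` is a hypothetical list of (vector, answer) pairs.
    """
    best_answer, best_sim = None, -1.0
    for vec, answer in entries:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_answer, best_sim = answer, sim
    if best_sim >= threshold:
        return best_answer, best_sim
    return None, best_sim

# Toy vectors standing in for real embeddings.
entries = [([0.12, -0.40, 0.90], "X is ..."), ([0.90, 0.10, -0.20], "Y is ...")]
hit, hit_sim = best_match([0.14, -0.30, 0.92], entries)    # close to the first entry
miss, miss_sim = best_match([-0.50, 0.80, 0.10], entries)  # close to neither entry
```

A linear scan is fine for a sketch; a real store would typically use a vector index instead.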
The request lifecycle
sequenceDiagram
participant U as User
participant C as cogcache
participant E as Embedder
participant S as Store
participant J as Judge (opt)
participant L as LLM
U->>C: query("What is X?")
C->>E: embed("What is X?")
E-->>C: [0.12, -0.4, ...]
C->>S: find_match(vec, route, intent)
alt Cache HIT
S-->>C: entry (sim ≥ threshold, same route and intent)
C-->>U: cached answer
else Cache MISS
C->>L: llm_fn("What is X?")
L-->>C: answer
opt Judge enabled
C->>J: score(query, answer)
J-->>C: quality (0-1)
end
C->>S: store(entry)
C-->>U: answer
end
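Read as code, the diagram is a short control flow. The sketch below is a hypothetical rendering with every collaborator injected as a plain callable; `handle_query`, its parameter names, and the entry dictionary layout are assumptions chosen to mirror the diagram's messages, not cogcache's API.

```python
def handle_query(query, llm_fn, embed, find_match, store, judge=None,
                 route="default", intent="default"):
    """Follow the diagram: embed, look up, and on a miss call the LLM and store."""
    vec = embed(query)
    entry = find_match(vec, route, intent)
    if entry is not None:                                # cache HIT
        return entry["answer"]
    answer = llm_fn(query)                               # cache MISS
    quality = judge(query, answer) if judge else None    # optional Judge step
    store({"vec": vec, "route": route, "intent": intent,
           "answer": answer, "quality": quality})
    return answer

# Stub collaborators so the flow can be exercised without any services.
saved, llm_calls = [], []

def fake_embed(q):
    # Stand-in embedder: NOT a real embedding, just something comparable.
    return (float(len(q)),)

def fake_llm(q):
    llm_calls.append(q)
    return "answer to " + q

def exact_match(vec, route, intent):
    for e in saved:
        if (e["vec"], e["route"], e["intent"]) == (vec, route, intent):
            return e
    return None

first = handle_query("What is X?", fake_llm, fake_embed, exact_match, saved.append)
second = handle_query("What is X?", fake_llm, fake_embed, exact_match, saved.append)
```

The second call is answered from `saved`, so the stub LLM runs only once.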
Route and intent isolation
A single cogcache instance can serve multiple logical caches, partitioned
by route and intent:
cache.query("How do I login?", llm, route="user_help", intent="auth")   # first call: answer cached under user_help/auth
cache.query("How do I login?", llm, route="admin_help", intent="auth")  # MISS: different route
cache.query("How do I login?", llm, route="user_help", intent="docs")   # MISS: different intent
This prevents leaking admin-targeted answers to regular users, or mixing chitchat answers into technical contexts.
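One way to picture the isolation is a store keyed by the `(route, intent)` pair, so a lookup never scans entries from another partition. The `PartitionedStore` class below is a hypothetical sketch, and it matches on exact query text rather than embeddings to keep the example short.

```python
from collections import defaultdict

class PartitionedStore:
    """Toy store: one independent bucket per (route, intent) pair."""

    def __init__(self):
        self._buckets = defaultdict(dict)

    def put(self, route, intent, query, answer):
        self._buckets[(route, intent)][query] = answer

    def get(self, route, intent, query):
        # Only the caller's own partition is consulted.
        return self._buckets.get((route, intent), {}).get(query)

store = PartitionedStore()
store.put("user_help", "auth", "How do I login?", "Use the sign-in page.")

same = store.get("user_help", "auth", "How do I login?")
other_route = store.get("admin_help", "auth", "How do I login?")
other_intent = store.get("user_help", "docs", "How do I login?")
```

Only the first lookup finds the entry; the other two return `None` because they address different partitions.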
"Write strict, hit lenient" Judge policy
When an LLM Judge is enabled, cogcache scores answer quality at two points, each with its own strictness:
| Event | Threshold | What happens if score is below |
|---|---|---|
| Write | write_min_quality (e.g. 0.80) | Reject the write OR cache with short low_quality_ttl |
| Hit (async, optional) | hit_min_quality (e.g. 0.60) | Log a warning. The request still returns the cached answer. |
The asymmetry is intentional. A write happens once per unique query, so a strict gate keeps a bad answer from being served to many future users. Hits happen many times, so the gate is lenient and asynchronous: it avoids adding latency to every request, and any issue is surfaced through metrics for offline review.
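The two gates can be sketched as a pair of small functions. The names `write_min_quality`, `hit_min_quality`, and `low_quality_ttl` come from the table; the function names, the `default_ttl` and TTL values, and the choice to cache low-scoring writes briefly (rather than reject them) are assumptions for illustration.

```python
import logging

log = logging.getLogger("judge_policy_sketch")

def write_decision(score, write_min_quality=0.80,
                   default_ttl=86_400, low_quality_ttl=300):
    """Strict gate on writes: return the TTL (seconds) to cache with."""
    if score >= write_min_quality:
        return default_ttl
    # Below the bar: keep the entry only briefly. Rejecting the write
    # outright is the other option the table describes.
    return low_quality_ttl

def check_hit(score, cached_answer, hit_min_quality=0.60):
    """Lenient gate on hits: warn, but always return the cached answer."""
    if score < hit_min_quality:
        log.warning("cached answer scored %.2f (< %.2f)", score, hit_min_quality)
    return cached_answer

good_ttl = write_decision(0.91)           # passes the strict write gate
weak_ttl = write_decision(0.72)           # fails it, cached with the short TTL
served = check_hit(0.45, "cached text")   # warns, still serves the answer
```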
Fail-open by design
cogcache treats itself as a best-effort optimization, never a hard dependency:
| Failure mode | Behavior |
|---|---|
| Redis disconnects | Cache lookup returns miss, LLM still called |
| Embedding API rate-limited | Falls back to local hash embedder |
| Judge LLM returns 500 | Score defaults to 1.0, write proceeds |
| Sink (Prometheus) crashes | Event recorded, sink isolated |
Your /ask endpoint never returns a 500 because the cache had an issue.
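A fail-open wrapper might look like the following. This is a sketch of the pattern, not cogcache's code: `fail_open_query`, `lookup`, and `save` are hypothetical names, and catching a broad `Exception` is a deliberate simplification.

```python
import logging

log = logging.getLogger("fail_open_sketch")

def fail_open_query(query, llm_fn, lookup, save):
    """Treat any cache error as a miss so the caller still gets an answer."""
    try:
        cached = lookup(query)
    except Exception:
        log.warning("cache lookup failed; treating as miss", exc_info=True)
        cached = None
    if cached is not None:
        return cached
    answer = llm_fn(query)
    try:
        save(query, answer)
    except Exception:
        log.warning("cache write failed; answer returned uncached", exc_info=True)
    return answer

# Simulate a store that is completely unreachable.
def broken_lookup(query):
    raise ConnectionError("store unreachable")

def broken_save(query, answer):
    raise ConnectionError("store unreachable")

result = fail_open_query("What is X?", lambda q: "fresh answer",
                         broken_lookup, broken_save)
```

Both cache operations fail here, yet the caller still receives the LLM's answer; the failures show up only in the log.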
Threshold trade-off
threshold = 0.50 threshold = 0.92
┌──────────────┐ ┌──────────────┐
│ Hit rate ↑↑↑ │ │ Hit rate ↓ │
│ FP rate ↑↑ │ │ FP rate ~0 │
│ Needs Judge │ │ Safe │
└──────────────┘ └──────────────┘
aggressive conservative
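The trade-off can be measured on a labeled sample. The sketch below assumes a hypothetical list of `(similarity, same_meaning)` pairs, where `same_meaning` is a human label saying whether the cached answer would actually have been correct. The sample data is made up, and `sweep` is not part of cogcache.

```python
def sweep(pairs, thresholds):
    """For each threshold, return (hit_rate, fp_rate) over labeled pairs.

    fp_rate is the share of all queries that would be served a cached
    answer whose meaning does not match.
    """
    results = {}
    for t in thresholds:
        hits = [same for sim, same in pairs if sim >= t]
        false_pos = sum(1 for same in hits if not same)
        results[t] = (len(hits) / len(pairs), false_pos / len(pairs))
    return results

# Made-up sample: (best-match similarity, does the cached answer really fit?)
pairs = [(0.97, True), (0.93, True), (0.88, True), (0.75, False),
         (0.70, True), (0.62, False), (0.55, False), (0.40, False)]
results = sweep(pairs, [0.50, 0.92])
```

On this toy sample the 0.50 threshold hits far more often but serves several wrong answers, while 0.92 hits rarely and serves none, which is the pattern the diagram summarizes.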
See Tuning for picking the right threshold on your dataset.