Concepts

Semantic caching vs. key-value caching

┌───────────────────────────────────────────────┐
│  Traditional cache (Redis GET/SET)            │
│                                                │
│  "What is X?"        → hash("What is X?")     │
│  "Tell me about X."  → hash("Tell me ...")    │
│  ─ DIFFERENT KEYS → cache miss                 │
└───────────────────────────────────────────────┘

┌───────────────────────────────────────────────┐
│  cogcache                                      │
│                                                │
│  "What is X?"        → [0.12, -0.4, ...] vec  │
│  "Tell me about X."  → [0.14, -0.3, ...] vec  │
│  ─ cosine(vec1, vec2) = 0.94 ≥ 0.92 → HIT     │
└───────────────────────────────────────────────┘

cogcache stores each query as a dense vector (an embedding). A new query is embedded and compared by cosine similarity against the existing entries. If the best match meets or exceeds the threshold (0.92 in the example above), the cached answer is returned; otherwise the query is a miss.
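
The lookup described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not cogcache's actual implementation: `cosine` and `find_best_match` are hypothetical helper names, and a real store would use a vector index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def find_best_match(query_vec, entries, threshold=0.92):
    """Return (answer, similarity) for the closest entry, or None on a miss.

    `entries` is a list of (vector, answer) pairs. This layout is an
    assumption made for the sketch, not cogcache's storage format.
    """
    best = None
    for vec, answer in entries:
        sim = cosine(query_vec, vec)
        if best is None or sim > best[1]:
            best = (answer, sim)
    # A hit requires the best similarity to meet or exceed the threshold.
    if best is not None and best[1] >= threshold:
        return best
    return None
```

With two stored entries at `[1.0, 0.0]` and `[0.0, 1.0]`, a query vector of `[0.9, 0.1]` lands close enough to the first entry to count as a hit, while `[1.0, 1.0]` sits halfway between both and misses.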

The request lifecycle

sequenceDiagram
    participant U as User
    participant C as cogcache
    participant E as Embedder
    participant S as Store
    participant J as Judge (opt)
    participant L as LLM

    U->>C: query("What is X?")
    C->>E: embed("What is X?")
    E-->>C: [0.12, -0.4, ...]
    C->>S: find_match(vec, route, intent)
    alt Cache HIT
        S-->>C: entry (sim ≥ threshold, same route)
        C-->>U: cached answer
    else Cache MISS
        C->>L: llm_fn("What is X?")
        L-->>C: answer
        opt Judge enabled
            C->>J: score(query, answer)
            J-->>C: quality (0-1)
        end
        C->>S: store(entry)
        C-->>U: answer
    end
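
The same flow, written as a minimal Python sketch. Everything here is an assumption made for illustration: `Entry`, `ListStore`, and the free function `query` are hypothetical stand-ins rather than cogcache's classes, the embedder and judge are plain callables passed in by the caller, and route and intent filtering is left out to keep the sketch short.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entry:
    vec: List[float]
    answer: str
    quality: Optional[float] = None  # filled in only when a judge is supplied

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

@dataclass
class ListStore:
    """Toy in-memory store: linear scan over all entries."""
    threshold: float = 0.92
    entries: List[Entry] = field(default_factory=list)

    def find_match(self, vec):
        best = max(self.entries, key=lambda e: cosine(vec, e.vec), default=None)
        if best is not None and cosine(vec, best.vec) >= self.threshold:
            return best
        return None

    def store(self, entry):
        self.entries.append(entry)

def query(text, llm_fn, embed, store, judge=None):
    vec = embed(text)                      # C ->> E: embed
    hit = store.find_match(vec)            # C ->> S: find_match
    if hit is not None:
        return hit.answer                  # cache HIT: the LLM is not called
    answer = llm_fn(text)                  # cache MISS: call the LLM
    quality = judge(text, answer) if judge is not None else None
    store.store(Entry(vec, answer, quality))
    return answer
```

Passing the embedder, store, and judge as arguments keeps each step of the diagram visible as one line of `query`, and makes it easy to swap in fakes when testing.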

Route and intent isolation

A single cogcache instance can serve multiple logical caches, partitioned by route and intent:

cache.query("How do I login?", llm, route="user_help",    intent="auth")
cache.query("How do I login?", llm, route="admin_help",   intent="auth")  # MISS — different route
cache.query("How do I login?", llm, route="user_help",    intent="docs")  # MISS — different intent

This prevents leaking admin-targeted answers to regular users, or mixing chitchat answers into technical contexts.
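
One way to picture the isolation is a store whose entries live in separate buckets keyed by `(route, intent)`, so a lookup never sees another partition's entries. The sketch below is an assumption about the idea, not cogcache's internals: `PartitionedStore` is a hypothetical class, and exact text equality stands in for the similarity search so that the partitioning logic stays in focus.

```python
from collections import defaultdict

class PartitionedStore:
    """Toy store that keeps one independent bucket per (route, intent) pair."""

    def __init__(self):
        # Maps (route, intent) -> {query text: answer}.
        self._partitions = defaultdict(dict)

    def add(self, route, intent, query, answer):
        self._partitions[(route, intent)][query] = answer

    def lookup(self, route, intent, query):
        # Only the caller's own partition is searched. A real cache would
        # run the embedding similarity search here instead of a dict lookup.
        # .get() on the outer dict avoids creating empty buckets on a miss.
        return self._partitions.get((route, intent), {}).get(query)
```

After `add("user_help", "auth", "How do I login?", ...)`, the same question asked with `route="admin_help"` or `intent="docs"` returns `None`, mirroring the two misses in the example above.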

"Write strict, hit lenient" Judge policy

When an LLM Judge is enabled, cogcache checks answer quality at two points, each with its own strictness:

| Event | Threshold | What happens if the score is below it |
| --- | --- | --- |
| Write | `write_min_quality` (e.g. 0.80) | Reject the write, or cache it with a short `low_quality_ttl` |
| Hit (async, optional) | `hit_min_quality` (e.g. 0.60) | Log a warning. The request still returns the cached answer. |

The asymmetry is intentional. A write happens once per new query, so a strict gate keeps a bad answer from being served to many future users. Hits happen many times, so a lenient, asynchronous check avoids adding latency to each request, and any problem is surfaced through metrics for offline review.
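
The two gates can be sketched as a small policy object. The field names `write_min_quality`, `hit_min_quality`, and `low_quality_ttl` come from the table above; the class itself, the `default_ttl` field, the TTL values, and the convention that `low_quality_ttl=None` means "reject the write" are assumptions made for this illustration.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("judge_policy_sketch")

@dataclass
class JudgePolicy:
    write_min_quality: float = 0.80
    hit_min_quality: float = 0.60
    low_quality_ttl: Optional[int] = 300   # seconds; None = reject low-quality writes
    default_ttl: int = 86400               # hypothetical TTL for good answers

    def on_write(self, score: float) -> Optional[int]:
        """Strict gate: return the TTL to cache with, or None to reject the write."""
        if score >= self.write_min_quality:
            return self.default_ttl
        return self.low_quality_ttl

    def on_hit(self, score: float) -> bool:
        """Lenient gate: only logs. The cached answer is served either way."""
        ok = score >= self.hit_min_quality
        if not ok:
            log.warning("cached answer scored %.2f, below hit_min_quality", score)
        return ok
```

Note that `on_hit` has no way to block a response: its return value is meant for metrics only, which is what keeps the hit path lenient.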

Fail-open by design

cogcache treats itself as a best-effort optimization, never a hard dependency:

| Failure mode | Behavior |
| --- | --- |
| Redis disconnects | Cache lookup returns a miss; the LLM is still called |
| Embedding API rate-limited | Falls back to the local hash embedder |
| Judge LLM returns 500 | Score defaults to 1.0; the write proceeds |
| Sink (Prometheus) crashes | Event is recorded; the sink is isolated |

A cache problem never causes your /ask endpoint to return a 500.
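
A common way to get this behavior is to wrap each cache component so that an exception turns into a safe default instead of propagating. The sketch below covers two of the failure modes listed above (store lookup and Judge scoring). The `fail_open` decorator and the `safe_lookup`, `safe_score`, and `ask` helpers are hypothetical names, not part of cogcache's API.

```python
import functools
import logging

log = logging.getLogger("fail_open_sketch")

def fail_open(fallback):
    """Decorator: on any exception, log it and return `fallback` instead of raising."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                log.warning("%s failed (%r); failing open", fn.__name__, exc)
                return fallback
        return wrapper
    return decorator

@fail_open(fallback=None)      # a broken store behaves like a cache miss
def safe_lookup(store, vec):
    return store.find_match(vec)

@fail_open(fallback=1.0)       # a broken judge does not block the write
def safe_score(judge, query, answer):
    return judge(query, answer)

def ask(text, llm_fn, embed, store, judge):
    hit = safe_lookup(store, embed(text))
    if hit is not None:
        return hit
    answer = llm_fn(text)      # the LLM is still called when the cache is down
    safe_score(judge, text, answer)
    return answer
```

The sketch does not wrap `embed`; an embedder fallback would follow the same pattern, with the local embedder called in the `except` branch.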

Threshold trade-off

threshold = 0.50              threshold = 0.92
┌──────────────┐              ┌──────────────┐
│ Hit rate ↑↑↑ │              │ Hit rate ↓   │
│ FP rate ↑↑   │              │ FP rate ~0   │
│ Needs Judge  │              │ Safe         │
└──────────────┘              └──────────────┘
   aggressive                    conservative
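
To see the trade-off on your own data, you can sweep candidate thresholds over query pairs that have been labeled by hand. The sketch below is an illustration under stated assumptions: `sweep` is a hypothetical helper, each pair is `(similarity, same_meaning)`, and "FP rate" is taken to mean the share of hits whose cached answer belonged to a different question, since the diagram does not define it.

```python
def sweep(pairs, thresholds):
    """Return (threshold, hit_rate, fp_rate) for each candidate threshold.

    pairs: list of (similarity, same_meaning) tuples, where same_meaning is
    True when the cached answer would actually be correct for the new query.
    """
    rows = []
    for t in thresholds:
        # Labels of every pair that would count as a cache hit at threshold t.
        hits = [same for sim, same in pairs if sim >= t]
        hit_rate = len(hits) / len(pairs) if pairs else 0.0
        # Share of hits that would have served a wrong answer.
        fp_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, fp_rate))
    return rows

# Made-up labeled pairs, for illustration only (not measurements).
toy_pairs = [
    (0.95, True), (0.93, True), (0.80, False),
    (0.60, True), (0.55, False), (0.30, False),
]

for t, hit_rate, fp_rate in sweep(toy_pairs, [0.50, 0.92]):
    print(f"threshold={t:.2f}  hit_rate={hit_rate:.2f}  fp_rate={fp_rate:.2f}")
```

On this toy input the lower threshold produces more hits but some of them are wrong, while the higher threshold produces fewer hits and no wrong ones, which is the pattern the diagram summarizes.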

See Tuning for picking the right threshold on your dataset.