
Concepts

Semantic caching vs. key-value caching

┌───────────────────────────────────────────────┐
│  Traditional cache (Redis GET/SET)            │
│                                                │
│  "What is X?"        → hash("What is X?")     │
│  "Tell me about X."  → hash("Tell me ...")    │
│  ─ DIFFERENT KEYS → cache miss                 │
└───────────────────────────────────────────────┘

┌───────────────────────────────────────────────┐
│  cogcache                                      │
│                                                │
│  "What is X?"        → [0.12, -0.4, ...] vec  │
│  "Tell me about X."  → [0.14, -0.3, ...] vec  │
│  ─ cosine(vec1, vec2) = 0.94 ≥ 0.92 → HIT     │
└───────────────────────────────────────────────┘

cogcache stores each query as a dense vector (embedding). Each new query is embedded and compared by cosine similarity to the existing entries. If the best match meets or exceeds the threshold (0.92 in the example above), the cached answer is returned; otherwise the query goes to the LLM.
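
The comparison described above can be sketched in a few lines of plain Python. This is an illustration only, not cogcache's actual implementation: the `cosine` and `best_match` helpers and the entry layout (`vec`, `answer`) are hypothetical names chosen for the example, and the two-dimensional vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query_vec, entries, threshold=0.92):
    """Return (answer, similarity) for the closest entry, or None on a miss."""
    best = max(entries, key=lambda e: cosine(query_vec, e["vec"]), default=None)
    if best is None:
        return None                      # empty cache
    sim = cosine(query_vec, best["vec"])
    return (best["answer"], sim) if sim >= threshold else None

# Toy data: one cached entry with a hand-written 2-d "embedding".
entries = [{"vec": [1.0, 0.0], "answer": "X is ..."}]
print(best_match([0.9, 0.1], entries))   # nearly the same direction: hit
print(best_match([0.0, 1.0], entries))   # orthogonal vector: None (miss)
```

A linear scan like this is only practical for small caches; a real store would presumably use a vector index for the nearest-neighbor step.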

The request lifecycle

sequenceDiagram
    participant U as User
    participant C as cogcache
    participant E as Embedder
    participant S as Store
    participant J as Judge (opt)
    participant L as LLM

    U->>C: query("What is X?")
    C->>E: embed("What is X?")
    E-->>C: [0.12, -0.4, ...]
    C->>S: find_match(vec, route, intent)
    alt Cache HIT
        S-->>C: entry (sim ≥ threshold, same route)
        C-->>U: cached answer
    else Cache MISS
        C->>L: llm_fn("What is X?")
        L-->>C: answer
        opt Judge enabled
            C->>J: score(query, answer)
            J-->>C: quality (0-1)
        end
        C->>S: store(entry)
        C-->>U: answer
    end
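
The same flow can be written as one function. This is a rough sketch under stated assumptions, not cogcache's source: the `embed`, `find_match`, `store` and `judge` parameters are hypothetical callables standing in for the Embedder, Store and Judge participants in the diagram, and the toy helpers below exist only to exercise it.

```python
def query(text, llm_fn, *, embed, find_match, store, judge=None,
          route="default", intent="default"):
    """Follow the diagram: embed, look up, and fall back to the LLM on a miss."""
    vec = embed(text)
    entry = find_match(vec, route, intent)
    if entry is not None:
        return entry["answer"]                      # cache HIT
    answer = llm_fn(text)                           # cache MISS
    quality = judge(text, answer) if judge else None
    store({"vec": vec, "route": route, "intent": intent,
           "answer": answer, "quality": quality})
    return answer

# Toy stand-ins so the sketch runs end to end (not real components).
saved, llm_calls = [], []

def toy_embed(text):
    return [float(len(text))]                       # placeholder, not a real embedding

def toy_find(vec, route, intent):
    return next((e for e in saved
                 if (e["vec"], e["route"], e["intent"]) == (vec, route, intent)), None)

def toy_llm(text):
    llm_calls.append(text)
    return "answer to " + text

deps = dict(embed=toy_embed, find_match=toy_find, store=saved.append)
first = query("What is X?", toy_llm, **deps)        # miss: calls the LLM and stores
second = query("What is X?", toy_llm, **deps)       # hit: served from the store
```

After the two calls, `llm_calls` holds a single entry, which is the point of the cache: the second request never reaches the LLM.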

Route and intent isolation

A single cogcache instance can serve multiple logical caches, partitioned by route and intent:

cache.query("How do I login?", llm, route="user_help",    intent="auth")
cache.query("How do I login?", llm, route="admin_help",   intent="auth")  # MISS — different route
cache.query("How do I login?", llm, route="user_help",    intent="docs")  # MISS — different intent

This prevents leaking admin-targeted answers to regular users, or mixing chitchat answers into technical contexts.
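
One way to picture the isolation is a store keyed by the `(route, intent)` pair, so a lookup only ever sees entries from its own partition. The `PartitionedStore` class below is a hypothetical illustration of that idea, not cogcache's internal data structure.

```python
from collections import defaultdict

class PartitionedStore:
    """Keeps cached entries in separate buckets per (route, intent)."""

    def __init__(self):
        self._partitions = defaultdict(list)   # (route, intent) -> [(query, answer)]

    def add(self, route, intent, query, answer):
        self._partitions[(route, intent)].append((query, answer))

    def candidates(self, route, intent):
        # .get() avoids creating an empty bucket for partitions never written to.
        return list(self._partitions.get((route, intent), []))

store = PartitionedStore()
store.add("user_help", "auth", "How do I login?", "Use the sign-in page.")

print(store.candidates("user_help", "auth"))    # one entry
print(store.candidates("admin_help", "auth"))   # [] because the route differs
print(store.candidates("user_help", "docs"))    # [] because the intent differs
```

Similarity matching would then run only over the candidates returned for the caller's partition.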

"Write strict, hit lenient" Judge policy

When an LLM Judge is enabled, cogcache checks answer quality at two points, each with its own threshold:

| Event | Threshold | What happens if the score is below it |
|---|---|---|
| Write | write_min_quality (e.g. 0.80) | Reject the write, or cache with a short low_quality_ttl |
| Hit (async, optional) | hit_min_quality (e.g. 0.60) | Log a warning. The request still returns the cached answer. |

The asymmetry is intentional. A write happens once per unique query, so a strict gate keeps a bad answer from being served to many future users. Hits happen many times, so a lenient, asynchronous check adds no latency to the request, and any issue is surfaced through metrics for offline review.
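
The two gates might look roughly like this. The sketch borrows the names `write_min_quality`, `hit_min_quality` and `low_quality_ttl` from the policy above, but the function names, the default `ttl` of one day, and the use of `None` to mean "reject" are assumptions made for illustration.

```python
import logging

log = logging.getLogger("judge_policy")

def on_write(score, write_min_quality=0.80, ttl=86400, low_quality_ttl=None):
    """Strict gate: return the TTL (seconds) to cache with, or None to reject."""
    if score >= write_min_quality:
        return ttl
    # Below the bar: either reject (None) or keep the entry only briefly.
    return low_quality_ttl

def on_hit(score, hit_min_quality=0.60):
    """Lenient gate: always serve the cached answer, only log when below the bar."""
    if score < hit_min_quality:
        log.warning("cached answer scored %.2f, below %.2f", score, hit_min_quality)
    return True

print(on_write(0.90))                       # 86400: cached normally
print(on_write(0.70))                       # None: write rejected
print(on_write(0.70, low_quality_ttl=300))  # 300: cached, but expires quickly
print(on_hit(0.30))                         # True: still served, warning logged
```

In this sketch the hit-side check never blocks a response, which matches the "log a warning" behavior described above.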

Fail-open by design

cogcache treats itself as a best-effort optimization, never a hard dependency:

| Failure mode | Behavior |
|---|---|
| Redis disconnects | Cache lookup returns miss, LLM still called |
| Embedding API rate-limited | Falls back to local hash embedder |
| Judge LLM returns 500 | Score defaults to 1.0, write proceeds |
| Sink (Prometheus) crashes | Event recorded, sink isolated |

A cache failure never causes your /ask endpoint to return a 500.
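
The fail-open pattern can be illustrated with a small wrapper that converts any exception from a cache component into a safe default. The `fail_open` helper and the two deliberately broken functions are hypothetical; they only mirror two of the failure modes listed above (a store failure becomes a miss, a Judge failure becomes a score of 1.0).

```python
import logging

log = logging.getLogger("fail_open")

def fail_open(fn, fallback):
    """Wrap fn so that any exception yields `fallback` instead of propagating."""
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            name = getattr(fn, "__name__", "component")
            log.exception("%s failed; continuing without it", name)
            return fallback
    return wrapper

def broken_lookup(vec):
    raise ConnectionError("store unreachable")      # simulated store outage

def broken_judge(query, answer):
    raise RuntimeError("judge returned 500")        # simulated Judge failure

safe_lookup = fail_open(broken_lookup, fallback=None)  # failure is treated as a miss
safe_judge = fail_open(broken_judge, fallback=1.0)     # failure lets the write proceed

print(safe_lookup([0.12, -0.4]))    # None, so the caller goes on to call the LLM
print(safe_judge("q", "a"))         # 1.0
```

Defaulting the Judge score to 1.0 favors availability over strictness: during a Judge outage, answers are cached without a quality check.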

Threshold trade-off

threshold = 0.50              threshold = 0.92
┌──────────────┐              ┌──────────────┐
│ Hit rate ↑↑↑ │              │ Hit rate ↓   │
│ FP rate ↑↑   │              │ FP rate ~0   │
│ Needs Judge  │              │ Safe         │
└──────────────┘              └──────────────┘
   aggressive                    conservative
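
The trade-off can be measured on a labeled sample. The sketch below assumes you have pairs of (similarity, whether the two queries really ask the same thing); the `sweep` function, the definition of FP rate as the share of hits that are wrong, and the five data points are all made up for illustration.

```python
def sweep(pairs, thresholds):
    """pairs: (similarity, is_same_question) tuples from a labeled sample.

    Returns (threshold, hit_rate, fp_rate) rows, where fp_rate is the
    fraction of hits whose cached answer belongs to a different question.
    """
    if not pairs:
        return []
    rows = []
    for t in thresholds:
        hits = [same for sim, same in pairs if sim >= t]
        hit_rate = len(hits) / len(pairs)
        fp_rate = hits.count(False) / len(hits) if hits else 0.0
        rows.append((t, hit_rate, fp_rate))
    return rows

# Invented labels: three true paraphrases, two unrelated look-alikes.
pairs = [(0.97, True), (0.94, True), (0.88, True), (0.75, False), (0.55, False)]

for t, hit_rate, fp_rate in sweep(pairs, [0.50, 0.92]):
    print(f"threshold={t:.2f}  hit_rate={hit_rate:.2f}  fp_rate={fp_rate:.2f}")
```

On this toy sample the 0.50 threshold hits on every pair, two of them wrongly, while 0.92 hits on only two pairs and both are correct.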

See Tuning for picking the right threshold on your dataset.