cogcache
Semantic LLM answer cache: reuse answers across paraphrased queries to cut latency and token cost.
from cogcache import CogniCache
cache = CogniCache(similarity_threshold=0.92)
answer = cache.query("What is gradient descent?", llm_fn=my_llm)
answer = cache.query("Explain gradient descent.", llm_fn=my_llm) # cache HIT
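The two calls above can be pictured as a get-or-compute lookup keyed by embedding similarity instead of by exact string. The following is a minimal sketch of that idea, not cogcache's implementation: `TinySemanticCache`, `TOY_EMBEDDINGS`, and `fake_llm` are hypothetical names, and the hand-picked vectors stand in for a real embedding model.

```python
import numpy as np


class TinySemanticCache:
    """Illustrative only: a linear-scan cache keyed by cosine similarity."""

    def __init__(self, embed_fn, similarity_threshold=0.92):
        self.embed_fn = embed_fn
        self.similarity_threshold = similarity_threshold
        self.entries = []  # list of (unit-length embedding, answer)

    def query(self, text, llm_fn):
        vec = np.asarray(self.embed_fn(text), dtype=float)
        vec = vec / np.linalg.norm(vec)  # unit length, so dot product == cosine

        # Find the most similar stored query.
        best_score, best_answer = -1.0, None
        for stored_vec, answer in self.entries:
            score = float(stored_vec @ vec)
            if score > best_score:
                best_score, best_answer = score, answer

        if best_score >= self.similarity_threshold:
            return best_answer  # HIT: no LLM call

        answer = llm_fn(text)  # MISS: call the LLM and remember the answer
        self.entries.append((vec, answer))
        return answer


# Hand-picked toy vectors (assumption): paraphrases point in similar directions.
TOY_EMBEDDINGS = {
    "What is gradient descent?": [0.90, 0.40, 0.10],
    "Explain gradient descent.": [0.88, 0.45, 0.12],
    "What is a hash map?": [0.10, 0.20, 0.95],
}

llm_calls = []


def fake_llm(prompt):
    """Stand-in for a real LLM call; records each prompt it receives."""
    llm_calls.append(prompt)
    return f"answer to: {prompt}"


cache = TinySemanticCache(embed_fn=TOY_EMBEDDINGS.__getitem__)
first = cache.query("What is gradient descent?", llm_fn=fake_llm)
second = cache.query("Explain gradient descent.", llm_fn=fake_llm)  # paraphrase
third = cache.query("What is a hash map?", llm_fn=fake_llm)  # unrelated
```

In this sketch the paraphrase reuses the first answer, while the unrelated question triggers a second LLM call. A linear scan is fine for a handful of entries; a production store would typically use a vector index instead.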
Why?
Traditional caches (Redis, Memcached) match exact keys. Two semantically identical questions phrased differently miss the cache entirely:
| Query | Traditional cache | cogcache |
|---|---|---|
| "What is X?" (first) | MISS → LLM | MISS → LLM |
| "What is X?" (again) | HIT | HIT |
| "Tell me what X is." | MISS → LLM | HIT (similarity 0.94) |
| "X 是什么?" | MISS → LLM | HIT (if embedding model is multilingual) |
A typical paraphrased query that misses the cache costs a full LLM call: about 6 seconds and ~300 tokens. A cache hit takes <1 ms and 0 tokens. At scale, that is a 99%+ cost reduction on the redundant tail of your traffic.
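To make the arithmetic concrete, here is a back-of-the-envelope sketch using the figures quoted above (6 s and ~300 tokens per LLM call, under 1 ms per hit). The request volume and the 40% hit rate are purely hypothetical inputs, and `estimate_savings` is an illustrative helper, not part of cogcache. Embedding and lookup overhead on misses is ignored.

```python
LLM_LATENCY_S = 6.0    # from the text: typical LLM call latency
LLM_TOKENS = 300       # from the text: approximate tokens per LLM call
HIT_LATENCY_S = 0.001  # from the text: upper bound for a cache hit


def estimate_savings(total_requests, hit_rate):
    """Compare an always-call-the-LLM baseline against a cache with hit_rate."""
    hits = round(total_requests * hit_rate)
    misses = total_requests - hits

    baseline_tokens = total_requests * LLM_TOKENS
    cached_tokens = misses * LLM_TOKENS  # hits consume 0 tokens

    baseline_latency = total_requests * LLM_LATENCY_S
    cached_latency = misses * LLM_LATENCY_S + hits * HIT_LATENCY_S

    return {
        "tokens_saved": baseline_tokens - cached_tokens,
        "latency_saved_s": baseline_latency - cached_latency,
        # Fraction of latency removed for each request that hits the cache.
        "per_hit_latency_reduction": 1 - HIT_LATENCY_S / LLM_LATENCY_S,
    }


# Hypothetical workload: 10,000 requests, 40% of them redundant paraphrases.
savings = estimate_savings(total_requests=10_000, hit_rate=0.4)
print(savings)
```

Under these assumptions, every request that hits avoids all of its tokens and more than 99% of its latency, which is where the "99%+" figure for the redundant portion of traffic comes from. The overall saving scales with the hit rate.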
Features
- Semantic matching
  Cosine similarity over embeddings, with a configurable threshold (default 0.92) and per-route isolation.
- Pluggable storage
  MemoryStore for dev, RedisStore (Redis Stack 7+ with HNSW vector search) for production. Both share the same CacheStore ABC.
- LLM-as-Judge quality gate
  "Write strict, hit lenient": an LLM scores answer quality at write time to block bad answers from polluting the cache. On a hit, it only issues an asynchronous warning.
- Observable
  Built-in MetricsCollector with p50/p95/p99 latency, hit rate, and token savings. Optional PrometheusSink for monitoring.
- Fail-open
  Redis disconnect, Judge crash, embedding failure: none of these break your request path. The LLM call continues and the cache is silently bypassed.
- Lightweight
  ~35 KB wheel. One required dependency (numpy). Everything else is an opt-in extra.
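The pluggable-storage and fail-open ideas from the list above can be sketched together. This is a minimal illustration under stated assumptions, not cogcache's code: `SketchStore`, `DictStore`, `BrokenStore`, and `fail_open_query` are hypothetical names, the real CacheStore interface may look different, and exact-key lookup is used here only to keep the example short.

```python
import logging
from abc import ABC, abstractmethod

logger = logging.getLogger("cache_sketch")


class SketchStore(ABC):
    """Hypothetical storage interface; cogcache's CacheStore ABC may differ."""

    @abstractmethod
    def lookup(self, query):
        """Return a cached answer, or None on a miss."""

    @abstractmethod
    def save(self, query, answer):
        """Persist an answer for later lookups."""


class DictStore(SketchStore):
    """In-memory backend, comparable in spirit to a dev-only store."""

    def __init__(self):
        self._data = {}

    def lookup(self, query):
        return self._data.get(query)

    def save(self, query, answer):
        self._data[query] = answer


class BrokenStore(SketchStore):
    """Simulates an unreachable backend, e.g. a dropped connection."""

    def lookup(self, query):
        raise ConnectionError("backend unreachable")

    def save(self, query, answer):
        raise ConnectionError("backend unreachable")


def fail_open_query(store, query, llm_fn):
    """Serve from cache when possible; never let a cache error fail the request."""
    try:
        cached = store.lookup(query)
    except Exception:
        logger.warning("cache lookup failed; bypassing cache", exc_info=True)
        cached = None
    if cached is not None:
        return cached

    answer = llm_fn(query)  # the request path always reaches the LLM on a miss
    try:
        store.save(query, answer)
    except Exception:
        logger.warning("cache write failed; answer not cached", exc_info=True)
    return answer


calls = []


def fake_llm(prompt):
    """Stand-in for a real LLM call; records each prompt it receives."""
    calls.append(prompt)
    return f"answer to: {prompt}"


healthy = DictStore()
a1 = fail_open_query(healthy, "What is X?", fake_llm)   # miss, then cached
a2 = fail_open_query(healthy, "What is X?", fake_llm)   # served from cache
a3 = fail_open_query(BrokenStore(), "What is X?", fake_llm)  # cache bypassed
```

Because both backends implement the same interface, `fail_open_query` does not need to know which one it was given. With the broken store, the caller still gets an answer and only a warning is logged.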
Install
pip install cogcache                  # core
pip install "cogcache[redis]"         # + Redis Stack backend
pip install "cogcache[prometheus]"    # + Prometheus metrics
pip install "cogcache[openai-judge]"  # + LLM-as-Judge
pip install "cogcache[all]"           # everything (quotes keep shells such as zsh from globbing the brackets)
Production deployment
Looking for a full reference deployment with FastAPI, admin dashboard, JWT auth, audit logging, and Docker Compose?
→ See cogcache-playground