cogcache¶

Semantic LLM answer cache — reuse paraphrased queries, cut latency and token cost.

from cogcache import CogniCache

cache = CogniCache(similarity_threshold=0.92)
answer = cache.query("What is gradient descent?", llm_fn=my_llm)
answer = cache.query("Explain gradient descent.", llm_fn=my_llm)  # cache HIT

Why?¶

Traditional caches (Redis, Memcached) match exact keys. Two semantically identical questions phrased differently miss the cache entirely:

Query	Traditional cache	cogcache
`"What is X?"` (first)	MISS → LLM	MISS → LLM
`"What is X?"` (again)	HIT	HIT
`"Tell me what X is."`	MISS → LLM	HIT (similarity 0.94)
`"X 是什么?"`	MISS → LLM	HIT (if embedding model is multilingual)

A typical paraphrased query takes 6 seconds and ~300 tokens. A cache hit takes <1 ms and 0 tokens. At scale that's 99 %+ cost reduction on the redundant tail of your traffic.

Features¶

Semantic matching

Cosine similarity over embeddings, with configurable threshold (default 0.92) and per-route isolation.
Pluggable storage

MemoryStore for dev, RedisStore (Redis Stack 7+ with HNSW vector search) for production. Both share the same CacheStore ABC.
LLM-as-Judge quality gate

"Write strict, hit lenient": LLM scores answer quality at write time to block bad answers from polluting the cache. Async warning on hit.
Observable

Built-in MetricsCollector with p50/p95/p99 latency, hit rate, token savings. Optional PrometheusSink for monitoring.
Fail-open

Redis disconnect, Judge crash, embedding failure — none of these break your request path. The LLM call continues; the cache silently bypasses.
Lightweight

~35 KB wheel. One required dependency (numpy). Everything else is an opt-in extra.

Install¶

pip install cogcache                   # core
pip install cogcache[redis]            # + Redis Stack backend
pip install cogcache[prometheus]       # + Prometheus metrics
pip install cogcache[openai-judge]     # + LLM-as-Judge
pip install cogcache[all]              # everything

Get started → See the API →

Production deployment¶

Looking for a full reference deployment with FastAPI, admin dashboard, JWT auth, audit logging, and Docker Compose?

→ See cogcache-playground