cogcache

Semantic LLM answer cache: reuse cached answers for paraphrased queries to cut latency and token cost.

from cogcache import CogniCache

cache = CogniCache(similarity_threshold=0.92)
answer = cache.query("What is gradient descent?", llm_fn=my_llm)
answer = cache.query("Explain gradient descent.", llm_fn=my_llm)  # cache HIT

Why?

Traditional caches (Redis, Memcached) match exact keys. Two semantically identical questions phrased differently miss the cache entirely:

Query                  | Traditional cache | cogcache
---------------------- | ----------------- | -----------------------------------------
"What is X?" (first)   | MISS → LLM        | MISS → LLM
"What is X?" (again)   | HIT               | HIT
"Tell me what X is."   | MISS → LLM        | HIT (similarity 0.94)
"X 是什么?"             | MISS → LLM        | HIT (if embedding model is multilingual)
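The difference between the two columns can be sketched in a few lines of numpy. This is an illustrative toy, not cogcache's implementation: `TinySemanticCache` and the three hand-written "embeddings" are hypothetical stand-ins, and a real setup would get its vectors from an embedding model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class TinySemanticCache:
    """Hypothetical toy cache: linear scan over stored embeddings."""

    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def store(self, embedding, answer):
        self.entries.append((np.asarray(embedding, dtype=float), answer))

    def lookup(self, embedding):
        """Return (answer, score) on a hit, (None, best_score) on a miss."""
        best_score, best_answer = -1.0, None
        for stored, answer in self.entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_score >= self.threshold:
            return best_answer, best_score
        return None, best_score


# Made-up 3-d vectors standing in for real embeddings (assumption).
what_is_x = np.array([0.90, 0.40, 0.10])   # "What is X?"
tell_me_x = np.array([0.85, 0.45, 0.15])   # "Tell me what X is."
unrelated = np.array([0.10, 0.20, 0.95])   # some other question

cache = TinySemanticCache(threshold=0.92)
cache.store(what_is_x, "X is ...")

print(cache.lookup(tell_me_x))  # paraphrase: similarity above threshold, hit
print(cache.lookup(unrelated))  # different topic: below threshold, miss
```

An exact-key cache would treat all three strings as distinct keys; the similarity threshold is what lets the paraphrase reuse the stored answer.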

A paraphrased query that misses the cache goes to the LLM, which typically takes about 6 seconds and ~300 tokens. A cache hit takes <1 ms and 0 tokens. At scale, that is a 99%+ cost reduction on the redundant tail of your traffic.
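To see how those per-request figures translate into averages, here is a small back-of-the-envelope helper. The 6 s, ~300 token, and 1 ms numbers come from the paragraph above; the 40% hit rate is a made-up example, and `expected_cost` is a hypothetical name, not part of cogcache.

```python
def expected_cost(hit_rate, miss_latency_s=6.0, miss_tokens=300,
                  hit_latency_s=0.001, hit_tokens=0):
    """Average latency (seconds) and tokens per request at a given hit rate."""
    miss_rate = 1.0 - hit_rate
    latency = hit_rate * hit_latency_s + miss_rate * miss_latency_s
    tokens = hit_rate * hit_tokens + miss_rate * miss_tokens
    return latency, tokens


# Saving on a single request that hits instead of calling the LLM:
# 1 ms versus 6 s of latency, and 0 versus ~300 tokens.
per_hit_latency_saving = 1.0 - 0.001 / 6.0
print(f"latency saved per hit: {per_hit_latency_saving:.2%}")

# Hypothetical traffic mix where 40% of queries are cache hits.
latency, tokens = expected_cost(hit_rate=0.4)
print(f"average latency: {latency:.3f} s, average tokens: {tokens:.0f}")
```

The 99%+ figure applies to the requests that actually hit; the overall saving scales with the hit rate of your traffic.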


Features

  • Semantic matching


    Cosine similarity over embeddings, with configurable threshold (default 0.92) and per-route isolation.

  • Pluggable storage


    MemoryStore for dev, RedisStore (Redis Stack 7+ with HNSW vector search) for production. Both share the same CacheStore ABC.

  • LLM-as-Judge quality gate


    "Write strict, hit lenient": an LLM scores answer quality at write time to block bad answers from polluting the cache. On a hit, the check runs asynchronously and only raises a warning.

  • Observable


    Built-in MetricsCollector with p50/p95/p99 latency, hit rate, token savings. Optional PrometheusSink for monitoring.

  • Fail-open


    Redis disconnects, Judge crashes, embedding failures: none of these break your request path. The LLM call continues and the cache is silently bypassed.

  • Lightweight


    ~35 KB wheel. One required dependency (numpy). Everything else is an opt-in extra.
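The fail-open and "write strict" behaviours described above can be pictured with a minimal sketch. It is one possible shape under stated assumptions, not cogcache's actual code: `query_fail_open`, `cache_get`, `cache_put`, `judge`, and `min_score` are hypothetical names, the lookup is by exact key for brevity, and skipping the write when the judge crashes is an assumed policy.

```python
import logging

logger = logging.getLogger("cache_sketch")


def query_fail_open(prompt, llm_fn, cache_get, cache_put,
                    judge=None, min_score=0.7):
    """Answer `prompt`, using the cache when it works and the LLM otherwise."""
    # Read path: any cache failure is logged and treated as a miss.
    try:
        cached = cache_get(prompt)
    except Exception:
        logger.warning("cache read failed; bypassing cache", exc_info=True)
        cached = None
    if cached is not None:
        return cached

    # LLM errors are deliberately not swallowed; only cache errors are.
    answer = llm_fn(prompt)

    # Write path ("write strict"): store only answers the judge accepts.
    # A judge or store failure skips the write but never loses the answer.
    try:
        if judge is None or judge(prompt, answer) >= min_score:
            cache_put(prompt, answer)
    except Exception:
        logger.warning("cache write skipped", exc_info=True)
    return answer


# Usage with a plain dict as a stand-in store and a fake LLM.
store = {}
fake_llm = lambda p: f"answer to {p}"


def broken_get(prompt):
    raise ConnectionError("store unreachable")


print(query_fail_open("q1", fake_llm, broken_get, store.__setitem__))
print(query_fail_open("q1", fake_llm, store.get, store.__setitem__))
```

The first call survives the broken store and still returns the LLM answer; the second is served from the dict.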


Install

pip install cogcache                   # core
pip install cogcache[redis]            # + Redis Stack backend
pip install cogcache[prometheus]       # + Prometheus metrics
pip install cogcache[openai-judge]     # + LLM-as-Judge
pip install cogcache[all]              # everything

Get started → See the API →


Production deployment

Looking for a full reference deployment with FastAPI, admin dashboard, JWT auth, audit logging, and Docker Compose?

→ See cogcache-playground