Quickstart
Install
cogcache works on Python 3.10+. The only required dependency is numpy.
Your first cache hit
quickstart.py
import time
from cogcache import CogniCache
# 1. Create a cache instance
cache = CogniCache(similarity_threshold=0.92)
# 2. Wrap any (str) -> str LLM call
def call_llm(query: str) -> str:
    # Replace with your real LLM call
    time.sleep(4)  # simulate latency
    return f"Answer to: {query}"
# 3. First call → LLM is invoked
result = cache.query_with_meta("What is gradient descent?", call_llm)
print(f"Cache: {result.cache_hit} Entry: {result.entry.text!r}")
# → Cache: False Entry: 'What is gradient descent?'
# 4. Same question, paraphrased → cache HIT
result = cache.query_with_meta("Explain gradient descent.", call_llm)
print(f"Cache: {result.cache_hit} Sim: {result.similarity:.3f}")
# → Cache: True Sim: 0.943
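To make the hit/miss decision above more concrete, here is a minimal sketch of threshold-based lookup over stored vectors. It is not cogcache's implementation: cosine_similarity, lookup, and the (vector, answer) entry layout are hypothetical names chosen for illustration, and a real backend would use an index rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def lookup(query_vec, entries, threshold):
    """Scan (vector, answer) pairs and return (answer, similarity).

    The answer is None when the best match falls below the threshold,
    which is the case where the caller would invoke the LLM.
    """
    best_answer, best_sim = None, -1.0
    for vec, answer in entries:
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_answer, best_sim = answer, sim
    if best_sim >= threshold:
        return best_answer, best_sim
    return None, best_sim
```

With a threshold of 0.92, a query vector that points almost the same way as a stored one counts as a hit, while a vector halfway between two stored entries (similarity about 0.71) counts as a miss.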
As a decorator
cache = CogniCache(similarity_threshold=0.90)
@cache.cached()
def ask_llm(query: str) -> str:
    # your_llm_client is a placeholder for your own client object
    return your_llm_client.chat(query)
ask_llm("What is X?") # LLM call
ask_llm("Tell me X.") # cache HIT
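For readers curious how a decorator form can sit on top of a query(text, fn) style method, the sketch below shows one common pattern. ToyCache is a hypothetical stand-in that matches on normalized text only (no embeddings, no similarity threshold), so it illustrates the wrapping, not cogcache's matching logic.

```python
import functools

class ToyCache:
    """Exact-match stand-in for a semantic cache (illustration only)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def query(self, text, fn):
        # Normalize case and whitespace so trivially different inputs match.
        key = " ".join(text.lower().split())
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = fn(text)  # cache miss: call the wrapped function
        self._store[key] = answer
        return answer

    def cached(self):
        """Return a decorator that routes calls through query()."""
        def decorator(fn):
            @functools.wraps(fn)  # keep the wrapped function's name and docstring
            def wrapper(text):
                return self.query(text, fn)
            return wrapper
        return decorator
```

The decorator is a thin layer: every call to the wrapped function becomes a query() call with the original function as the fallback.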
With OpenAI / DashScope
from openai import OpenAI
from cogcache import CogniCache
client = OpenAI(
    api_key="sk-...",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
cache = CogniCache(similarity_threshold=0.90)
def llm(query: str) -> str:
    r = client.chat.completions.create(
        model="qwen-turbo",
        messages=[{"role": "user", "content": query}],
    )
    return r.choices[0].message.content
answer = cache.query("What is machine learning?", llm)
With Redis backend (persistent across restarts)
cache = CogniCache(
    redis_url="redis://localhost:6379/0",
    similarity_threshold=0.92,
    vector_dim=1536,  # match your embedding model
)
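Because vector_dim has to match the length of the vectors your embedding model produces, a quick check before creating the cache can catch a mismatch early. The helper below is a hypothetical sketch, not part of cogcache; it assumes only that the embedding function maps a string to a list of floats, and fake_embed is a stand-in for a real model.

```python
def check_embedding_dim(embed_fn, vector_dim, probe="dimension probe"):
    """Embed a short probe string and verify its length equals vector_dim."""
    vec = embed_fn(probe)
    if len(vec) != vector_dim:
        raise ValueError(
            f"embed_fn returned {len(vec)} dimensions, "
            f"expected vector_dim={vector_dim}"
        )
    return vector_dim

def fake_embed(text):
    """Stand-in embedder that always returns a 1536-dimensional vector."""
    return [0.0] * 1536

# Passes silently when the dimensions agree; raises ValueError otherwise.
check_embedding_dim(fake_embed, 1536)
```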
Redis Stack required
The Redis backend uses RediSearch HNSW indexes for fast vector search, so plain Redis won't work. The easiest setup is to run Redis Stack in Docker.
Custom embedding function
By default cogcache uses a small built-in hash-bucket embedder (fast, low quality). For production you'll want a real embedding model:
from openai import OpenAI
from cogcache import CogniCache
client = OpenAI()
def embed(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding
cache = CogniCache(
    similarity_threshold=0.85,
    vector_dim=1536,
    embed_fn=embed,
)
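To give a feel for why a hash-bucket embedder is fast but low quality, here is a rough sketch of the general technique. This is an assumption about what such an embedder may look like, not cogcache's built-in code; hash_bucket_embed and its dim parameter are hypothetical. Each token is hashed into one of dim buckets, and the resulting count vector is L2-normalized.

```python
import hashlib
import math
import re

def hash_bucket_embed(text, dim=64):
    """Map text to a unit-length bag-of-words vector via token hashing."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Leave the all-zero vector as-is when the text has no tokens.
    return [x / norm for x in vec] if norm else vec
```

Word order and meaning are ignored, so "gradient descent" and "descent gradient" get identical vectors while true paraphrases with different wording may score poorly. The function has the same (str) -> list of floats shape as the embed example above.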
Next steps
- Concepts — How semantic caching works under the hood
- Tuning — Pick the right threshold and embedding model
- Observability — Prometheus + JSON metrics
- Benchmarks — Run real-LLM benchmarks on your queries