
Quickstart

Install

pip install cogcache
pip install "cogcache[redis]"
pip install "cogcache[all]"

cogcache works on Python 3.10+. The only required dependency is numpy.

Your first cache hit

quickstart.py
from cogcache import CogniCache

# 1. Create a cache instance
cache = CogniCache(similarity_threshold=0.92)

# 2. Wrap any (str) -> str LLM call
def call_llm(query: str) -> str:
    # Replace with your real LLM call
    import time
    time.sleep(4)  # simulate latency
    return f"Answer to: {query}"

# 3. First call → LLM is invoked
result = cache.query_with_meta("What is gradient descent?", call_llm)
print(f"Cache: {result.cache_hit}  Entry: {result.entry.text!r}")
# → Cache: False  Entry: 'What is gradient descent?'

# 4. Same question, paraphrased → cache HIT
result = cache.query_with_meta("Explain gradient descent.", call_llm)
print(f"Cache: {result.cache_hit}  Sim: {result.similarity:.3f}")
# → Cache: True   Sim: 0.943
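
To make the hit/miss decision above concrete, here is a minimal, self-contained sketch of a semantic cache lookup. It is not cogcache's implementation: `ToySemanticCache`, `embed`, and `cosine` are hypothetical names, and the bag-of-words "embedding" is a stand-in chosen only so the example runs without a model. Because the toy embedding is so crude, the threshold here (0.5) is not comparable to the 0.92 used with a real embedder.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding': lowercase word counts (illustrative only)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Hypothetical stand-in that mimics the hit/miss logic, not cogcache itself."""

    def __init__(self, similarity_threshold: float) -> None:
        self.similarity_threshold = similarity_threshold
        self.entries: list[tuple[Counter, str]] = []

    def query(self, text: str, llm: Callable[[str], str]) -> tuple[str, bool, float]:
        """Return (answer, cache_hit, best_similarity)."""
        vec = embed(text)
        best_sim, best_answer = 0.0, None
        # Linear scan for the most similar stored query.
        for stored_vec, answer in self.entries:
            sim = cosine(vec, stored_vec)
            if sim > best_sim:
                best_sim, best_answer = sim, answer
        if best_answer is not None and best_sim >= self.similarity_threshold:
            return best_answer, True, best_sim  # hit: the LLM is not called
        answer = llm(text)  # miss: call the LLM and store the result
        self.entries.append((vec, answer))
        return answer, False, best_sim


calls: list[str] = []


def fake_llm(query: str) -> str:
    calls.append(query)  # record each real invocation
    return f"Answer to: {query}"


toy = ToySemanticCache(similarity_threshold=0.5)
first = toy.query("What is gradient descent?", fake_llm)
second = toy.query("Explain gradient descent.", fake_llm)
print(first[1], second[1], len(calls))
```

The second query shares two of its three words with the first, so its similarity clears the toy threshold and `fake_llm` is invoked only once. A real cache replaces the linear scan with a vector index and the word counts with dense embeddings.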

As a decorator

from cogcache import CogniCache

cache = CogniCache(similarity_threshold=0.90)

@cache.cached()
def ask_llm(query: str) -> str:
    # your_llm_client is a placeholder for your own LLM client
    return your_llm_client.chat(query)

ask_llm("What is X?")      # LLM call
ask_llm("Tell me X.")       # cache HIT

With OpenAI / DashScope

from openai import OpenAI
from cogcache import CogniCache

client = OpenAI(
    api_key="sk-...",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
cache = CogniCache(similarity_threshold=0.90)

def llm(query: str) -> str:
    r = client.chat.completions.create(
        model="qwen-turbo",
        messages=[{"role": "user", "content": query}],
    )
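    # message.content is Optional[str] in the OpenAI SDK; fall back to "" so the wrapper stays (str) -> str
    return r.choices[0].message.content or ""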

answer = cache.query("What is machine learning?", llm)

With Redis backend (persistent across restarts)

from cogcache import CogniCache

cache = CogniCache(
    redis_url="redis://localhost:6379/0",
    similarity_threshold=0.92,
    vector_dim=1536,                   # match your embedding model
)

Redis Stack required

The Redis backend uses RediSearch HNSW indexes for fast vector search. Plain Redis won't work. The easiest setup is Docker:

docker run -d -p 6379:6379 redis/redis-stack:7.4.0-v0
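
Before pointing cogcache at a server, it can help to confirm that the RediSearch module is actually loaded. The sketch below is one way to do that with redis-py, under stated assumptions: `has_search_module` and `check_redis_stack` are hypothetical helper names, not part of cogcache, and the check assumes the module reports itself under the name `search` in the `MODULE LIST` reply.

```python
from typing import Any, Iterable, Mapping


def _text(value: Any) -> str:
    """Normalize redis-py reply values, which may be bytes or str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def has_search_module(modules: Iterable[Mapping[Any, Any]]) -> bool:
    """Return True if a MODULE LIST reply contains a module named 'search'."""
    for module in modules:
        fields = {_text(k): _text(v) for k, v in module.items()}
        if fields.get("name", "").lower() == "search":
            return True
    return False


def check_redis_stack(redis_url: str) -> bool:
    """Connect to redis_url and report whether RediSearch appears to be loaded."""
    import redis  # redis-py, imported lazily so has_search_module has no dependencies

    client = redis.Redis.from_url(redis_url)
    return has_search_module(client.module_list())


# Example (requires a running server):
# print(check_redis_stack("redis://localhost:6379/0"))
```

If the check returns False against a plain Redis server, switching to the Redis Stack image is the likely fix.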

Custom embedding function

By default, cogcache uses a small built-in hash-bucket embedder, which is fast but gives low-quality similarity scores. For production, you'll want a real embedding model:

from openai import OpenAI
from cogcache import CogniCache

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

def embed(text: str) -> list[float]:
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return r.data[0].embedding

cache = CogniCache(
    similarity_threshold=0.85,
    vector_dim=1536,
    embed_fn=embed,
)
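
For intuition about what a hash-bucket embedder does, and why its quality is limited, here is a minimal sketch of the general technique. It is an assumption about the approach, not cogcache's built-in code: `hash_bucket_embed` and its `dim` default are hypothetical. Each token is hashed into one of `dim` buckets, the buckets are counted, and the vector is L2-normalized. Only texts that share literal tokens end up similar, so paraphrases using different words score poorly.

```python
import hashlib
import math
import re


def hash_bucket_embed(text: str, dim: int = 256) -> list[float]:
    """Hypothetical feature-hashing embedder: token counts in hashed buckets."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1.0
    # L2-normalize so a dot product between two vectors equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


v = hash_bucket_embed("What is gradient descent?", dim=64)
print(len(v))
```

Since it has the same `(str) -> list[float]` shape as the `embed` function above, a function like this could serve as an offline stand-in for `embed_fn` in tests, provided `vector_dim` is set to the same `dim`.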

Next steps

  • Concepts — How semantic caching works under the hood
  • Tuning — Pick the right threshold and embedding model
  • Observability — Prometheus + JSON metrics
  • Benchmarks — Run real-LLM benchmarks on your queries