CogniCache Tuning Guide

similarity_threshold

The single most impactful knob. It controls the cosine-similarity cut-off above which a cached answer is returned instead of calling the LLM.

Threshold	Typical behaviour
0.50–0.65	Aggressive. High hit rate, elevated false-positive risk. Suitable only with an LLM Judge enabled.
0.70–0.80	Balanced. Recommended starting point for most production workloads.
0.85–0.92	Conservative (default: 0.92). Near-zero false positives. Safe without a judge.
0.95+	Very strict. Only exact or near-exact paraphrases hit. Low ROI unless query distribution is very narrow.
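The cut-off described above amounts to a single comparison between the query embedding and a cached embedding. The following is a minimal sketch of that decision, not CogniCache's actual code: the function names and the toy 3-dimensional vectors are made up for illustration, and whether the library compares with `>=` or `>` is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_cache_hit(query_vec, cached_vec, similarity_threshold=0.92):
    # Assumption: a hit requires similarity >= threshold (the library may use >).
    return cosine_similarity(query_vec, cached_vec) >= similarity_threshold

# Toy vectors, made up for illustration only.
cached = [1.0, 0.0, 0.0]
close = [0.9, 0.1, 0.0]   # similarity to `cached` is about 0.994
loose = [0.7, 0.7, 0.0]   # similarity to `cached` is about 0.707

print(is_cache_hit(close, cached, 0.92))  # True
print(is_cache_hit(loose, cached, 0.92))  # False
print(is_cache_hit(loose, cached, 0.70))  # True
```

The last two lines show the trade-off in the table: the same loosely related pair is rejected at the conservative default and accepted once the threshold drops to 0.70.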

How to pick:

Run the threshold sweep on a representative sample of your query logs:

python -m cogcache.bench.run_bench \
    --real-llm \
    --dataset your_queries.jsonl \
    --sweep 0.70,0.75,0.80,0.85,0.90,0.92,0.95 \
    --output bench_reports/sweep.md

Then find the highest threshold at which Hit Rate still meets your target (at least 10–20% for in-domain queries).
Check that FP Rate is ≤ 5%. If it is not, raise the threshold or enable the LLM Judge.
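The selection rule above can be sketched as a small helper. This assumes you have copied the sweep results into `(threshold, hit_rate, fp_rate)` tuples by hand; the function name and the numbers below are hypothetical, and the sketch does not parse the bench report itself.

```python
def pick_threshold(sweep_rows, target_hit_rate=0.10, max_fp_rate=0.05):
    """Return (threshold, needs_judge), or None if no threshold meets the target.

    sweep_rows: iterable of (threshold, hit_rate, fp_rate), rates as fractions.
    """
    meeting_target = [row for row in sweep_rows if row[1] >= target_hit_rate]
    if not meeting_target:
        return None
    # Highest threshold that still delivers the required hit rate.
    threshold, _hit_rate, fp_rate = max(meeting_target, key=lambda row: row[0])
    # If false positives are still too high here, the judge is the remaining lever.
    return threshold, fp_rate > max_fp_rate

# Hypothetical sweep results, not real benchmark output.
rows = [
    (0.70, 0.41, 0.12),
    (0.75, 0.33, 0.08),
    (0.80, 0.24, 0.04),
    (0.85, 0.15, 0.01),
    (0.90, 0.07, 0.00),
]
print(pick_threshold(rows))                        # (0.85, False)
print(pick_threshold(rows, target_hit_rate=0.30))  # (0.75, True)
```

In the second call the hit-rate target rules out every threshold above 0.75, so raising the threshold is no longer an option and the helper flags the LLM Judge instead.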

Embedding model

The embedding model determines how well semantically similar queries cluster together.

Model	Dim	Quality	Latency	Cost
`local_embedding` (built-in)	512	Low (hash-bucket n-gram)	~0 ms	Free
`text-embedding-v3` (DashScope)	1536	High	~50 ms	¥0.0007 / 1K tokens
`text-embedding-3-small` (OpenAI)	1536	High	~50 ms	$0.02 / 1M tokens
BGE-Small-zh (local SBERT)	512	Medium-High (Chinese)	~10 ms	Free

With a better embedding model you can raise the threshold and still achieve good hit rates. The offline bench uses local_embedding; real-LLM benchmarks should always use an API embedding.
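To see why a hash-bucket n-gram embedding is rated "Low" in the table, consider the sketch below. It is an illustrative stand-in written for this guide, not the built-in `local_embedding` implementation: the character-trigram choice, the md5 bucketing, and the function names are all assumptions.

```python
import hashlib
import math

def hashed_ngram_embedding(text, dim=512, n=3):
    """Character n-gram counts hashed into `dim` buckets, then L2-normalised."""
    vec = [0.0] * dim
    normalised = text.lower()
    for i in range(len(normalised) - n + 1):
        gram = normalised[i:i + n]
        # md5 gives a stable bucket across runs; Python's built-in hash() does not.
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    # Inputs are unit length (or all zeros), so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Same wording with one extra character versus same intent with different wording.
same_words = cosine(hashed_ngram_embedding("where is my order"),
                    hashed_ngram_embedding("where is my order?"))
same_meaning = cosine(hashed_ngram_embedding("where is my order"),
                      hashed_ngram_embedding("track my package"))
print(round(same_words, 2), round(same_meaning, 2))
```

An embedding of this kind rewards shared characters rather than shared meaning, so the near-identical string scores high while the paraphrase scores much lower. That is the gap a semantic model such as the API embeddings in the table is meant to close.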

LLM Judge

Enable the judge when you operate below a threshold of 0.85, or when answer staleness matters.

from cogcache import CogniCache
from cogcache.judge import LLMJudgeOpenAI

# your_client is a placeholder for an OpenAI-compatible client that you create beforehand.
cache = CogniCache(
    similarity_threshold=0.75,
    enable_judge=True,
    judge=LLMJudgeOpenAI(client=your_client, model="qwen-turbo"),
    write_min_quality=0.80,   # reject writes below this score
    judge_on_hit=True,        # async quality check on every hit
    hit_min_quality=0.60,     # log warning if hit scores below this
    low_quality_ttl=1800,     # cache low-quality answers for 30 min only
)

write_min_quality: the gate for writing to the cache. Set it to 0.80–0.90. An answer that scores below this is still served, but it is not cached (or is cached only for low_quality_ttl seconds).

hit_min_quality: the async warning threshold. It does not block the response. Set it to 0.50–0.70.
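The two quality gates can be read as the following decision logic. This is a sketch of the behaviour described above, not the library's code: the function names are hypothetical, and treating an unset `low_quality_ttl` as "do not cache" is an assumption.

```python
def cache_ttl_for_answer(quality, write_min_quality=0.80, ttl=-1, low_quality_ttl=None):
    """Return the TTL to store the answer with, or None to skip caching.

    The answer is served to the caller in every case; this only decides the write.
    """
    if quality >= write_min_quality:
        return ttl               # normal write with the normal expiry
    if low_quality_ttl is not None:
        return low_quality_ttl   # short-lived write for a low-scoring answer
    return None                  # below the gate and no short TTL: do not cache

def should_warn_on_hit(quality, hit_min_quality=0.60):
    # Warning only; the cached answer has already been returned by this point.
    return quality < hit_min_quality

print(cache_ttl_for_answer(0.85))                        # -1
print(cache_ttl_for_answer(0.70, low_quality_ttl=1800))  # 1800
print(cache_ttl_for_answer(0.70))                        # None
print(should_warn_on_hit(0.55))                          # True
```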

TTL

cache = CogniCache(ttl=3600)   # expire all entries after 1 hour

Set a TTL when:

- Answers reference time-sensitive data (prices, inventory, news)
- You want to cap memory/Redis growth without manual eviction

A value of -1 (the default) means entries never expire.
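The expiry rule, including the -1 special case, fits in a few lines. This is a minimal sketch: `is_expired` and its `stored_at` argument are hypothetical names, and whether the library expires an entry exactly at the TTL boundary is an assumption.

```python
import time

def is_expired(stored_at, ttl, now=None):
    """True once an entry written at `stored_at` (epoch seconds) has outlived `ttl`."""
    if ttl < 0:  # -1 means "never expire"
        return False
    now = time.time() if now is None else now
    return now - stored_at >= ttl

print(is_expired(stored_at=0, ttl=3600, now=3599))    # False
print(is_expired(stored_at=0, ttl=3600, now=3600))    # True
print(is_expired(stored_at=0, ttl=-1, now=10**9))     # False
```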

max_cache_size

MemoryStore evicts the oldest entry when the cache is full.

Use case	Recommended size
Single-tenant dev	100–500
Multi-tenant API	5 000–50 000 (or use RedisStore)
RedisStore	Unlimited (governed by Redis `maxmemory`)
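Oldest-entry eviction, as described for MemoryStore, can be pictured with a toy store like the one below. It is not the MemoryStore implementation: `BoundedStore` is a hypothetical class, and reading "oldest" as oldest by insertion (rather than least recently used) is an assumption.

```python
from collections import OrderedDict

class BoundedStore:
    """Toy in-memory store that drops the oldest-inserted key when full."""

    def __init__(self, max_cache_size):
        self.max_cache_size = max_cache_size
        self._entries = OrderedDict()

    def put(self, key, value):
        if key in self._entries:
            self._entries[key] = value         # overwrite in place, keep position
            return
        if len(self._entries) >= self.max_cache_size:
            self._entries.popitem(last=False)  # evict the oldest insertion
        self._entries[key] = value

    def get(self, key):
        return self._entries.get(key)

store = BoundedStore(max_cache_size=2)
store.put("q1", "a1")
store.put("q2", "a2")
store.put("q3", "a3")  # the store is full, so "q1" is evicted
print(store.get("q1"), store.get("q3"))  # None a3
```

With a size that is too small for the workload, frequently asked queries are pushed out before they can be reused, which is why the multi-tenant row recommends a much larger size or RedisStore.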

Running the real-LLM bench (DashScope / Aliyun Bailian)

export OPENAI_API_KEY=sk-your-dashscope-key
export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

python -m cogcache.bench.run_bench \
    --real-llm \
    --llm-model qwen-turbo \
    --embed-model text-embedding-v3 \
    --dataset cogcache/bench/datasets/ecommerce.jsonl \
    --threshold 0.85 \
    --sweep 0.70,0.75,0.80,0.85,0.90,0.92,0.95 \
    --max-queries 30 \
    --output bench_reports/v0.2_real_llm.md

--max-queries 30 limits API spend during exploration. Remove the flag for a full run.

Prometheus / observability

import os
os.environ["COGNICACHE_PROMETHEUS_ENABLED"] = "true"

# Then GET /metrics/prom  →  Prometheus text format
# GET /metrics/json       →  JSON snapshot (hit rate, percentiles, token savings)

Key metrics to watch:

Metric	Alert if
`cogcache_queries_total{cache_hit="false"}` rate	rising fast (cache eviction?)
`cogcache_query_latency_seconds{cache_hit="true"} p99`	> 50 ms (embedding bottleneck)
`cogcache_tokens_total{kind="saved"}` / `kind="used"` ratio	< 10% (threshold too strict?)
`cogcache_quality_score`	drops below 0.70 (judge or LLM degradation)
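As an example of checking one of these conditions outside Prometheus, the sketch below computes the saved/used token ratio from scraped text. The sample text is hypothetical, and the regular expression assumes `kind` is the only label on `cogcache_tokens_total`; real output may carry extra labels and would need a proper parser.

```python
import re

SAMPLE_LINE = re.compile(r'^cogcache_tokens_total\{kind="(\w+)"\}\s+([0-9.eE+-]+)$')

def saved_to_used_ratio(prom_text):
    """Return saved/used from cogcache_tokens_total samples, or None if unavailable."""
    totals = {}
    for line in prom_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match:
            totals[match.group(1)] = float(match.group(2))
    if not totals.get("used"):
        return None  # no "used" sample, or zero tokens used so far
    return totals.get("saved", 0.0) / totals["used"]

# Hypothetical scrape output, not captured from a real instance.
sample = """\
# TYPE cogcache_tokens_total counter
cogcache_tokens_total{kind="saved"} 1200
cogcache_tokens_total{kind="used"} 20000
"""
ratio = saved_to_used_ratio(sample)
print(ratio, ratio < 0.10)  # 0.06 True
```

A ratio below 0.10 matches the alert condition in the table and suggests revisiting similarity_threshold.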