# CogniCache Tuning Guide
## similarity_threshold
The single most impactful knob. It controls the cosine-similarity cut-off above which a cached answer is returned instead of calling the LLM.
| Threshold | Typical behaviour |
|---|---|
| 0.50–0.65 | Aggressive. High hit rate, elevated false-positive risk. Suitable only with an LLM Judge enabled. |
| 0.70–0.80 | Balanced. Recommended starting point for most production workloads. |
| 0.85–0.92 | Conservative (default: 0.92). Near-zero false positives. Safe without a judge. |
| 0.95+ | Very strict. Only exact or near-exact paraphrases hit. Low ROI unless query distribution is very narrow. |
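To make the cut-off concrete, here is a minimal sketch of a threshold-gated lookup. It is not CogniCache's internal code: the `lookup` function, the `(embedding, answer)` entry layout, and the inclusive `>=` comparison are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def lookup(query_vec, entries, similarity_threshold=0.92):
    """Return the best-matching cached answer if it clears the threshold, else None.

    `entries` is a list of (embedding, answer) pairs (hypothetical layout).
    """
    best_score, best_answer = -1.0, None
    for vec, answer in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    # Inclusive comparison is an assumption of this sketch.
    return best_answer if best_score >= similarity_threshold else None

entries = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
print(lookup([0.9, 0.1], entries, similarity_threshold=0.92))  # answer A
print(lookup([0.7, 0.7], entries, similarity_threshold=0.92))  # None
```

A `None` result corresponds to a cache miss, where the LLM would be called instead.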
How to pick:
- Run the threshold sweep on a representative sample of your query logs (the bench command later in this guide takes a `--sweep` list of thresholds).
- Find the highest threshold where `Hit Rate` still meets your target (at least 10–20% for in-domain queries).
- Check that `FP Rate` is ≤ 5%. If not, raise the threshold or enable the LLM Judge.
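The selection rule above can be sketched as a small helper. The `pick_threshold` function, the row layout, and the sweep numbers below are made up for illustration; they are not output from a real bench run.

```python
def pick_threshold(sweep, min_hit_rate=0.10, max_fp_rate=0.05):
    """Pick the highest threshold whose hit rate meets the target.

    `sweep` is a list of dicts with keys threshold, hit_rate, fp_rate
    (hypothetical layout). Returns (threshold, needs_judge); threshold is
    None when no row meets the hit-rate target.
    """
    meets_target = [row for row in sweep if row["hit_rate"] >= min_hit_rate]
    if not meets_target:
        return None, False
    best = max(meets_target, key=lambda row: row["threshold"])
    # FP rate above the cap: enable the LLM Judge or accept a lower hit rate.
    return best["threshold"], best["fp_rate"] > max_fp_rate

# Illustrative numbers only.
sweep = [
    {"threshold": 0.70, "hit_rate": 0.35, "fp_rate": 0.09},
    {"threshold": 0.80, "hit_rate": 0.22, "fp_rate": 0.04},
    {"threshold": 0.90, "hit_rate": 0.12, "fp_rate": 0.01},
    {"threshold": 0.95, "hit_rate": 0.04, "fp_rate": 0.00},
]
print(pick_threshold(sweep))                     # (0.9, False)
print(pick_threshold(sweep, min_hit_rate=0.30))  # (0.7, True)
```

In the second call only 0.70 meets a 30% hit-rate target, and its false-positive rate exceeds 5%, so the sketch flags that the judge would be needed.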
## Embedding model
The embedding model determines how well semantically similar queries cluster together.
| Model | Dim | Quality | Latency | Cost |
|---|---|---|---|---|
| `local_embedding` (built-in) | 512 | Low (hash-bucket n-gram) | ~0 ms | Free |
| `text-embedding-v3` (DashScope) | 1536 | High | ~50 ms | ¥0.0007 / 1K tokens |
| `text-embedding-3-small` (OpenAI) | 1536 | High | ~50 ms | $0.02 / 1M tokens |
| BGE-Small-zh (local SBERT) | 512 | Medium-High (Chinese) | ~10 ms | Free |
With a better embedding model you can raise the threshold and still achieve good hit rates.
The offline bench uses local_embedding; real-LLM benchmarks should always use an API embedding.
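To see why a hash-bucket n-gram embedding rates "Low", the sketch below builds one from scratch. It is an illustration of the general technique, not the actual `local_embedding` implementation; the function name, the trigram size, and the use of `zlib.crc32` are assumptions.

```python
import math
import zlib

def ngram_hash_embedding(text, dim=512, n=3):
    """Count character n-grams into `dim` hash buckets, then L2-normalise."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

a = ngram_hash_embedding("what is your refund policy")
b = ngram_hash_embedding("what is your refund policy?")
c = ngram_hash_embedding("how do I get my money back")
print(round(cosine(a, b), 2))  # high: nearly identical surface form
print(round(cosine(a, c), 2))  # low, although the intent is the same
```

Such a vector only captures shared character sequences, so paraphrases with different wording score poorly. That is the gap a semantic embedding model closes.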
## LLM Judge
Enable when you operate below threshold 0.85, or when answer staleness matters.
```python
from cogcache import CogniCache
from cogcache.judge import LLMJudgeOpenAI

# your_client is a placeholder for a client object you create yourself.
cache = CogniCache(
    similarity_threshold=0.75,
    enable_judge=True,
    judge=LLMJudgeOpenAI(client=your_client, model="qwen-turbo"),
    write_min_quality=0.80,   # reject writes below this score
    judge_on_hit=True,        # async quality check on every hit
    hit_min_quality=0.60,     # log warning if hit scores below this
    low_quality_ttl=1800,     # cache low-quality answers for 30 min only
)
```
- `write_min_quality`: gate for writing to the cache. Set it to 0.80–0.90. An answer that scores below this is still served but not cached (or cached only for `low_quality_ttl` seconds).
- `hit_min_quality`: asynchronous warning threshold. It does not block the response. Set it to 0.50–0.70.
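The two gates can be pictured with the following sketch. It is one reading of the behaviour described above, not the library's code: `write_decision`, `hit_needs_warning`, and the `default_ttl` argument are hypothetical names.

```python
def write_decision(quality, write_min_quality=0.80, low_quality_ttl=None, default_ttl=-1):
    """Return the TTL (seconds) to cache an answer with, or None to skip caching.

    Assumption of this sketch: a low-scoring answer is cached briefly only
    when `low_quality_ttl` is configured; otherwise it is not cached at all.
    """
    if quality >= write_min_quality:
        return default_ttl
    if low_quality_ttl is not None:
        return low_quality_ttl
    return None

def hit_needs_warning(quality, hit_min_quality=0.60):
    """True when a served hit should be logged as low quality (never blocks)."""
    return quality < hit_min_quality

print(write_decision(0.91))                        # -1 (cache with the normal TTL)
print(write_decision(0.55))                        # None (serve, do not cache)
print(write_decision(0.55, low_quality_ttl=1800))  # 1800 (cache for 30 min)
print(hit_needs_warning(0.55))                     # True
```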
## TTL
Set TTL when:
- Answers reference time-sensitive data (prices, inventory, news)
- You want to cap memory/Redis growth without manual eviction

`-1` (default) means entries never expire.
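As a minimal sketch of the `-1` convention, an expiry check might look like the following. The helper name and the timestamp-based layout are assumptions for illustration, not the library's API.

```python
import time

NEVER_EXPIRES = -1  # matches the documented default

def is_expired(created_at, ttl, now=None):
    """Return True when an entry written at `created_at` has outlived `ttl` seconds."""
    if ttl == NEVER_EXPIRES:
        return False
    now = time.time() if now is None else now
    return now - created_at >= ttl

print(is_expired(created_at=100, ttl=-1, now=10**9))  # False
print(is_expired(created_at=100, ttl=60, now=159))    # False
print(is_expired(created_at=100, ttl=60, now=160))    # True
```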
## max_cache_size
MemoryStore evicts the oldest entry when the cache is full.
| Use case | Recommended size |
|---|---|
| Single-tenant dev | 100–500 |
| Multi-tenant API | 5 000–50 000 (or use RedisStore) |
| RedisStore | Unlimited (governed by Redis maxmemory) |
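Oldest-entry eviction can be sketched with an `OrderedDict`. This is an illustrative stand-in, not the real `MemoryStore`; the `BoundedStore` class and its methods are hypothetical, and it treats "oldest" as oldest by insertion.

```python
from collections import OrderedDict

class BoundedStore:
    """Toy key-value store that drops the oldest insertion when full."""

    def __init__(self, max_cache_size):
        if max_cache_size < 1:
            raise ValueError("max_cache_size must be at least 1")
        self.max_cache_size = max_cache_size
        self._entries = OrderedDict()

    def put(self, key, value):
        if key in self._entries:
            # Overwrite in place; the entry keeps its original age.
            self._entries[key] = value
            return
        if len(self._entries) >= self.max_cache_size:
            self._entries.popitem(last=False)  # evict the oldest insertion
        self._entries[key] = value

    def get(self, key):
        return self._entries.get(key)

    def keys(self):
        return list(self._entries)

store = BoundedStore(max_cache_size=2)
store.put("q1", "a1")
store.put("q2", "a2")
store.put("q3", "a3")  # store is full, so q1 is evicted
print(store.keys())    # ['q2', 'q3']
```

Note that reads do not refresh an entry's age here, which is what distinguishes oldest-first eviction from LRU.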
## Running the real-LLM bench (DashScope / Aliyun Bailian)
```bash
export OPENAI_API_KEY=sk-your-dashscope-key
export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
python -m cogcache.bench.run_bench \
  --real-llm \
  --llm-model qwen-turbo \
  --embed-model text-embedding-v3 \
  --dataset cogcache/bench/datasets/ecommerce.jsonl \
  --threshold 0.85 \
  --sweep 0.70,0.75,0.80,0.85,0.90,0.92,0.95 \
  --max-queries 30 \
  --output bench_reports/v0.2_real_llm.md
```
`--max-queries 30` limits API spend during exploration. Remove the flag for a full run.
## Prometheus / observability
```python
import os

os.environ["COGNICACHE_PROMETHEUS_ENABLED"] = "true"
# Then GET /metrics/prom → Prometheus text format
#      GET /metrics/json → JSON snapshot (hit rate, percentiles, token savings)
```
Key metrics to watch:
| Metric | Alert if |
|---|---|
| `cogcache_queries_total{cache_hit="false"}` rate | rising fast (cache eviction?) |
| `cogcache_query_latency_seconds{cache_hit="true"}` p99 | > 50 ms (embedding bottleneck) |
| `cogcache_tokens_total{kind="saved"}` / `kind="used"` ratio | < 10% (threshold too strict?) |
| `cogcache_quality_score` | drops below 0.70 (judge or LLM degradation) |
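As a rough sketch of the token-savings check, the snippet below computes the saved/used ratio from Prometheus text output. The sample payload is made up, and the parser assumes `kind` is the only label on `cogcache_tokens_total`, which may not hold for the real endpoint.

```python
import re

# Hypothetical excerpt of a /metrics/prom response; values are illustrative.
SAMPLE = (
    "# TYPE cogcache_tokens_total counter\n"
    'cogcache_tokens_total{kind="saved"} 1200\n'
    'cogcache_tokens_total{kind="used"} 18000\n'
)

_LINE = re.compile(r'^cogcache_tokens_total\{kind="(\w+)"\}\s+([0-9.eE+-]+)$')

def token_savings_ratio(prom_text):
    """Return saved/used tokens, or None when no used tokens are reported."""
    totals = {}
    for line in prom_text.splitlines():
        match = _LINE.match(line.strip())
        if match:
            totals[match.group(1)] = float(match.group(2))
    used = totals.get("used", 0.0)
    return totals.get("saved", 0.0) / used if used else None

ratio = token_savings_ratio(SAMPLE)
print(f"{ratio:.1%}", "ALERT" if ratio < 0.10 else "ok")  # 6.7% ALERT
```

In production this comparison would more naturally live in a Prometheus alerting rule; the Python version only illustrates the arithmetic behind the 10% line in the table.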