CogniCache Tuning Guide

similarity_threshold

The single most impactful knob. It controls the cosine-similarity cut-off above which a cached answer is returned instead of calling the LLM.

| Threshold | Typical behaviour |
|---|---|
| 0.50–0.65 | Aggressive. High hit rate, elevated false-positive risk. Suitable only with an LLM Judge enabled. |
| 0.70–0.80 | Balanced. Recommended starting point for most production workloads. |
| 0.85–0.92 | Conservative (default: 0.92). Near-zero false positives. Safe without a judge. |
| 0.95+ | Very strict. Only exact or near-exact paraphrases hit. Low ROI unless query distribution is very narrow. |
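
To make the cut-off concrete, here is a minimal sketch of the decision the threshold controls. The helper names `cosine_similarity` and `should_serve_from_cache` are hypothetical and not part of CogniCache, and treating a score exactly equal to the threshold as a hit is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_serve_from_cache(query_vec, cached_vec, similarity_threshold=0.92):
    """Hypothetical helper: True means return the cached answer, False means call the LLM.

    Assumption: a score equal to the threshold counts as a hit.
    """
    return cosine_similarity(query_vec, cached_vec) >= similarity_threshold


# The same pair of vectors can be a miss at 0.92 and a hit at 0.70.
query, cached = [1.0, 0.0], [1.0, 1.0]   # cosine similarity is about 0.707
strict = should_serve_from_cache(query, cached, similarity_threshold=0.92)
loose = should_serve_from_cache(query, cached, similarity_threshold=0.70)
```

Lowering the threshold turns more near-matches into hits, which is why the aggressive range above is paired with a judge.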

How to pick:

  1. Run the threshold sweep on a representative sample of your query logs:
    python -m cogcache.bench.run_bench \
        --real-llm \
        --dataset your_queries.jsonl \
        --sweep 0.70,0.75,0.80,0.85,0.90,0.92,0.95 \
        --output bench_reports/sweep.md
    
  2. Find the highest threshold where Hit Rate still meets your target (typically ≥ 10–20% for in-domain queries).
  3. Check that FP Rate is ≤ 5%. If it is not, raise the threshold or enable the LLM Judge.
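
Steps 2 and 3 can be sketched as a small selection routine. This assumes you have copied the sweep results into a list of dicts by hand; the row format, the `pick_threshold` name, and the sample numbers are all illustrative placeholders, not output from the bench.

```python
def pick_threshold(rows, min_hit_rate=0.10, max_fp_rate=0.05):
    """Return (threshold, needs_judge) for the highest threshold meeting the hit-rate target.

    rows: list of dicts with keys "threshold", "hit_rate", "fp_rate" (fractions, not percent).
    needs_judge is True when the chosen row's FP rate exceeds max_fp_rate.
    Returns (None, False) if no row meets the hit-rate target.
    """
    meeting_target = [r for r in rows if r["hit_rate"] >= min_hit_rate]
    if not meeting_target:
        return None, False
    best = max(meeting_target, key=lambda r: r["threshold"])
    return best["threshold"], best["fp_rate"] > max_fp_rate


# Placeholder numbers for illustration only; substitute your own sweep results.
sample_rows = [
    {"threshold": 0.70, "hit_rate": 0.40, "fp_rate": 0.12},
    {"threshold": 0.80, "hit_rate": 0.25, "fp_rate": 0.06},
    {"threshold": 0.85, "hit_rate": 0.15, "fp_rate": 0.02},
    {"threshold": 0.92, "hit_rate": 0.06, "fp_rate": 0.00},
]

choice = pick_threshold(sample_rows, min_hit_rate=0.10)
stricter_target = pick_threshold(sample_rows, min_hit_rate=0.20)
```

With these placeholder rows, a 10% target selects 0.85 with no judge needed, while a 20% target selects 0.80 and flags that the FP rate is above 5%.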

Embedding model

The embedding model determines how well semantically similar queries cluster together.

| Model | Dim | Quality | Latency | Cost |
|---|---|---|---|---|
| local_embedding (built-in) | 512 | Low (hash-bucket n-gram) | ~0 ms | Free |
| text-embedding-v3 (DashScope) | 1536 | High | ~50 ms | ¥0.0007 / 1K tokens |
| text-embedding-3-small (OpenAI) | 1536 | High | ~50 ms | $0.02 / 1M tokens |
| BGE-Small-zh (local SBERT) | 512 | Medium-High (Chinese) | ~10 ms | Free |

With a better embedding model you can raise the threshold and still achieve good hit rates. The offline bench uses local_embedding; real-LLM benchmarks should always use an API embedding.
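
To illustrate why a hash-bucket n-gram embedding is fast but low quality, here is a rough sketch of the general technique. It is not the actual `local_embedding` implementation; the function name, the use of character trigrams, and the choice of SHA-256 for bucketing are all assumptions.

```python
import hashlib
import math


def hash_ngram_embedding(text, dim=512, n=3):
    """Illustrative hash-bucket embedding: count character n-grams into `dim` buckets.

    Assumption: character trigrams and SHA-256 bucketing; the built-in model may differ.
    Returns an L2-normalised list of floats (all zeros for empty text).
    """
    text = text.lower()
    if len(text) >= n:
        grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    else:
        grams = [text] if text else []
    vec = [0.0] * dim
    for gram in grams:
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def dot(a, b):
    """Dot product; equals cosine similarity for L2-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


base = hash_ngram_embedding("refund policy")
near_copy = hash_ngram_embedding("refund policy?")
unrelated = hash_ngram_embedding("shipping times")
```

An embedding like this scores strings by shared character sequences, so near-identical wording matches well while a true paraphrase with different words does not. That limitation is what the API models in the table address.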


LLM Judge

Enable the LLM Judge when you run with a similarity_threshold below 0.85, or when answer staleness matters.

from cogcache import CogniCache
from cogcache.judge import LLMJudgeOpenAI

# your_client is a placeholder for an LLM client you have already configured.
cache = CogniCache(
    similarity_threshold=0.75,
    enable_judge=True,
    judge=LLMJudgeOpenAI(client=your_client, model="qwen-turbo"),
    write_min_quality=0.80,   # reject writes below this score
    judge_on_hit=True,        # async quality check on every hit
    hit_min_quality=0.60,     # log warning if hit scores below this
    low_quality_ttl=1800,     # cache low-quality answers for 30 min only
)

write_min_quality: gate for writing to the cache. Set it to 0.80–0.90. An answer that scores below the gate is still served but is not cached (or is cached only for low_quality_ttl seconds).

hit_min_quality: asynchronous warning threshold. It does not block the response. Set it to 0.50–0.70.
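
The two gates can be read as the following decision logic. This is a sketch under stated assumptions, not CogniCache's code: the function names are hypothetical, and the rule that a positive `low_quality_ttl` selects short-lived caching over skipping the write is a guess at how the two documented outcomes are chosen.

```python
def write_ttl_for(quality, write_min_quality=0.80, low_quality_ttl=1800, default_ttl=-1):
    """Hypothetical helper: return the TTL to cache an answer with, or None to skip caching.

    Assumption: a positive low_quality_ttl means low-scoring answers are cached briefly;
    otherwise they are served without being cached.
    """
    if quality >= write_min_quality:
        return default_ttl
    if low_quality_ttl is not None and low_quality_ttl > 0:
        return low_quality_ttl
    return None


def hit_needs_warning(quality, hit_min_quality=0.60):
    """Hypothetical helper: True when a served hit should be logged as low quality.

    The response is never blocked; this only decides whether to log a warning.
    """
    return quality < hit_min_quality


good_write = write_ttl_for(0.90)                        # cached with the normal TTL
weak_write = write_ttl_for(0.70)                        # cached for 30 minutes only
skipped_write = write_ttl_for(0.70, low_quality_ttl=0)  # served, not cached
```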


TTL

cache = CogniCache(ttl=3600)   # expire all entries after 1 hour

Set TTL when:

- Answers reference time-sensitive data (prices, inventory, news)
- You want to cap memory/Redis growth without manual eviction

-1 (default) means entries never expire.
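
The expiry rule amounts to a single comparison. The sketch below uses a hypothetical `is_expired` helper; only `-1` is documented as "never expire", so treating every negative value the same way is an assumption.

```python
import time


def is_expired(created_at, ttl, now=None):
    """Return True if an entry written at `created_at` (epoch seconds) has outlived `ttl`.

    Assumption: any negative ttl is treated like the documented -1 (never expire).
    """
    if ttl < 0:
        return False
    current = time.time() if now is None else now
    return current - created_at >= ttl


# An entry written at t=100 with ttl=3600 is still valid at t=3699 and expired at t=3700.
still_valid = is_expired(100, 3600, now=3699)
expired = is_expired(100, 3600, now=3700)
never = is_expired(100, -1, now=10**9)
```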


max_cache_size

MemoryStore evicts the oldest entry when the cache is full.

| Use case | Recommended size |
|---|---|
| Single-tenant dev | 100–500 |
| Multi-tenant API | 5 000–50 000 (or use RedisStore) |
| RedisStore | Unlimited (governed by Redis maxmemory) |
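
For intuition about what "evicts the oldest entry" means for sizing, here is a minimal stand-in for that policy. `OldestFirstStore` is a hypothetical class, not the real MemoryStore, and it assumes "oldest" means earliest inserted (FIFO) rather than least recently used.

```python
from collections import OrderedDict


class OldestFirstStore:
    """Illustrative bounded store that drops the earliest-inserted entry when full."""

    def __init__(self, max_cache_size):
        if max_cache_size < 1:
            raise ValueError("max_cache_size must be at least 1")
        self.max_cache_size = max_cache_size
        self._entries = OrderedDict()

    def set(self, key, value):
        if key in self._entries:
            # Overwrite in place; the entry keeps its original insertion position.
            self._entries[key] = value
            return
        if len(self._entries) >= self.max_cache_size:
            # popitem(last=False) removes the first (oldest) inserted item.
            self._entries.popitem(last=False)
        self._entries[key] = value

    def get(self, key, default=None):
        return self._entries.get(key, default)

    def __len__(self):
        return len(self._entries)


store = OldestFirstStore(max_cache_size=2)
store.set("q1", "a1")
store.set("q2", "a2")
store.set("q3", "a3")   # the store is full, so "q1" is evicted
```

Under this policy a cache that is too small for the working set churns constantly, which shows up as a falling hit rate.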

Running the real-LLM bench (DashScope / Aliyun Bailian)

export OPENAI_API_KEY=sk-your-dashscope-key
export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

python -m cogcache.bench.run_bench \
    --real-llm \
    --llm-model qwen-turbo \
    --embed-model text-embedding-v3 \
    --dataset cogcache/bench/datasets/ecommerce.jsonl \
    --threshold 0.85 \
    --sweep 0.70,0.75,0.80,0.85,0.90,0.92,0.95 \
    --max-queries 30 \
    --output bench_reports/v0.2_real_llm.md

--max-queries 30 limits API spend during exploration. Remove the flag for a full run.


Prometheus / observability

import os
os.environ["COGNICACHE_PROMETHEUS_ENABLED"] = "true"

# Then GET /metrics/prom  →  Prometheus text format
# GET /metrics/json       →  JSON snapshot (hit rate, percentiles, token savings)

Key metrics to watch:

| Metric | Alert if |
|---|---|
| cogcache_queries_total{cache_hit="false"} | rate rising fast (cache eviction?) |
| cogcache_query_latency_seconds{cache_hit="true"} | p99 > 50 ms (embedding bottleneck) |
| cogcache_tokens_total{kind="saved"} / cogcache_tokens_total{kind="used"} | ratio < 10% (threshold too strict?) |
| cogcache_quality_score | drops below 0.70 (judge or LLM degradation) |
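
The numeric alert conditions above can be expressed as a small check over values you have already scraped. The `check_alerts` function and its argument names are hypothetical; it does not read the /metrics endpoints, and it leaves out the "rate rising fast" condition because that has no fixed threshold.

```python
def check_alerts(hit_latency_p99_s, tokens_saved, tokens_used, quality_score):
    """Return a list of alert messages for the fixed-threshold conditions.

    hit_latency_p99_s: p99 latency of cache hits, in seconds.
    tokens_saved / tokens_used: counter values for kind="saved" and kind="used".
    quality_score: latest judge quality score (0.0 to 1.0).
    """
    alerts = []
    if hit_latency_p99_s > 0.050:
        alerts.append("cache-hit p99 latency above 50 ms (embedding bottleneck?)")
    if tokens_used > 0 and tokens_saved / tokens_used < 0.10:
        alerts.append("saved/used token ratio below 10% (threshold too strict?)")
    if quality_score < 0.70:
        alerts.append("quality score below 0.70 (judge or LLM degradation?)")
    return alerts


healthy = check_alerts(hit_latency_p99_s=0.020, tokens_saved=500, tokens_used=1000, quality_score=0.90)
degraded = check_alerts(hit_latency_p99_s=0.080, tokens_saved=50, tokens_used=1000, quality_score=0.60)
```

In production these conditions would more naturally live in Prometheus alerting rules; the function only spells out the thresholds.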