Benchmarks

cogcache ships with a self-contained benchmark suite that reports the proposal §5.2 headline metrics: hit rate, token savings, latency gain, and false-positive rate.

Offline mode (no API needed)

Uses a deterministic fake LLM and the built-in hash embedder. Good for CI sanity checks; not representative of production quality.

python -m cogcache.bench.run_bench \
    --dataset cogcache/bench/datasets/ecommerce.jsonl \
    --threshold 0.50 \
    --output bench_reports/offline.md
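
To illustrate why offline scores are not representative, here is a minimal sketch of a deterministic hash-based embedder. It is an illustration only: the name `hash_embed`, the 64-dimension default, and the token-bucketing scheme are assumptions, not cogcache's actual built-in embedder.

```python
import hashlib
import math


def hash_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical bag-of-words hash embedder (not cogcache's real one)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # A stable digest keeps the vector identical across runs and machines.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # Return a unit-length vector, or all zeros for empty input.
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


same = cosine(hash_embed("where is my package"), hash_embed("where is my package"))
paraphrase = cosine(hash_embed("where is my package"), hash_embed("track my parcel"))
```

Identical text scores 1.0, while a paraphrase scores only on the tokens it happens to share. An embedder of this kind measures lexical overlap rather than meaning, which may be why the offline example uses a much lower threshold than the real-LLM example.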

Real-LLM mode

Uses DashScope / OpenAI for both chat completions AND embeddings. This is the only mode that produces meaningful absolute numbers.

export OPENAI_API_KEY=sk-...
export OPENAI_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1

python -m cogcache.bench.run_bench \
    --real-llm \
    --llm-model qwen-turbo \
    --embed-model text-embedding-v3 \
    --dataset cogcache/bench/datasets/ecommerce.jsonl \
    --threshold 0.85 \
    --sweep 0.70,0.75,0.80,0.85,0.90,0.92,0.95 \
    --max-queries 30 \
    --output bench_reports/v0.2_real_llm.md

The --max-queries 30 cap keeps API spend under $1 during exploration. Drop it for full-corpus runs.
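
The `--sweep` flag evaluates several thresholds in one run. How the bench implements this is not described here; the sketch below only shows the general idea on precomputed similarity scores. The function `sweep` and the sample scores are hypothetical, and the sketch ignores the fact that a real cache's contents change as the threshold changes.

```python
def sweep(candidates, thresholds):
    """Tabulate hit rate and FP rate at each threshold.

    candidates: one (best_similarity, same_cluster) pair per query, describing
    the nearest cached entry. Illustrative input, not the bench's data model.
    """
    rows = []
    for t in thresholds:
        # A query is a hit when its best match clears the threshold.
        hits = [same for sim, same in candidates if sim >= t]
        false_pos = sum(1 for same in hits if not same)
        rows.append({
            "threshold": t,
            "hit_rate": len(hits) / len(candidates),
            "fp_rate": false_pos / len(hits) if hits else 0.0,
        })
    return rows


# Made-up scores, chosen only to show the trade-off.
candidates = [(0.97, True), (0.91, True), (0.88, False), (0.74, True), (0.60, False)]
rows = sweep(candidates, [0.70, 0.85, 0.95])
```

On this toy input, raising the threshold from 0.70 to 0.95 lowers the hit rate from 80% to 20% and removes the single false positive, which is the trade-off the sweep section of the report is meant to expose.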

What the report contains

| Section                    | What you see                                                                                  |
|----------------------------|-----------------------------------------------------------------------------------------------|
| §5.2 Headline metrics      | Hit rate, token savings, P50 hit / P50 miss latency, per-hit gain, FP rate, judge accuracy    |
| §5.3 Per-experiment latency | Hit and miss P50 / P99 per experiment                                                        |
| §5.4 Threshold sweep       | Hit rate, FP rate, judge accuracy at each threshold                                           |
| §6 Targets vs. observed    | Proposal compliance checklist (hit rate ≥ 20%, gain ≥ 60%, FP ≤ 5%)                           |
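
The three targets in the last row can be expressed as a small check. This is a sketch, not the bench's own code: the `Headline` dataclass and `check_targets` are hypothetical names, and the `fp_rate` value in the example is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Headline:
    hit_rate: float      # fraction of queries served from cache, 0..1
    per_hit_gain: float  # (P50_miss - P50_hit) / P50_miss, 0..1
    fp_rate: float       # cross-cluster hits / total hits, 0..1


def check_targets(m: Headline) -> dict[str, bool]:
    """Evaluate the three proposal targets from the compliance checklist."""
    return {
        "hit rate >= 20%": m.hit_rate >= 0.20,
        "per-hit gain >= 60%": m.per_hit_gain >= 0.60,
        "FP rate <= 5%": m.fp_rate <= 0.05,
    }


# Illustrative values only; fp_rate here is a placeholder, not a measured result.
result = check_targets(Headline(hit_rate=0.14, per_hit_gain=0.996, fp_rate=0.0))
```

With these inputs the hit-rate target fails while the other two pass, so a report can meet the latency goal and still miss the checklist overall.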

Bring your own dataset

The dataset format is one JSON object per line:

{"query": "How do I get a refund?",   "cluster": "refund"}
{"query": "Can I cancel my order?",   "cluster": "cancel"}
{"query": "Where is my package?",     "cluster": "shipping"}

query: the user's question.
cluster: the ground-truth semantic group, used to compute the false-positive rate. A cache hit served from a different cluster counts as a false positive (FP).
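
Before running the bench on your own file, it can help to validate it. The loader below is a minimal sketch based only on the two fields described above; `load_dataset` is a hypothetical helper, not part of cogcache.

```python
import json
from collections import Counter


def load_dataset(lines):
    """Parse JSONL records and require non-empty string 'query' and 'cluster' fields."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        if not isinstance(obj, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        for field in ("query", "cluster"):
            value = obj.get(field)
            if not isinstance(value, str) or not value:
                raise ValueError(f"line {lineno}: missing or empty {field!r}")
        records.append(obj)
    return records


sample = [
    '{"query": "How do I get a refund?", "cluster": "refund"}',
    '{"query": "Can I cancel my order?", "cluster": "cancel"}',
    '{"query": "Where is my package?", "cluster": "shipping"}',
]
records = load_dataset(sample)
# Cluster sizes make singleton or misspelled labels easy to spot.
cluster_sizes = Counter(r["cluster"] for r in records)
```

To check a file on disk, pass an open file handle, for example `load_dataset(open("my_dataset.jsonl", encoding="utf-8"))`, where the filename is a placeholder. A label that appears only once may indicate a typo in the cluster name.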

The bench shuffles the dataset according to --seed, partitions repeated queries within the same cluster, and reports the false-positive rate as cross-cluster hits ÷ total hits.
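
The false-positive formula can be written out directly. This is a sketch under assumptions: `false_positive_rate` is a hypothetical helper, and returning 0.0 when there are no hits is a choice made here, not documented bench behaviour.

```python
def false_positive_rate(hits):
    """Compute cross-cluster hits divided by total hits.

    hits: iterable of (query_cluster, cached_cluster) pairs, one per cache hit.
    """
    hits = list(hits)
    if not hits:
        return 0.0  # assumption: no hits means nothing to count as a false positive
    cross_cluster = sum(1 for query_c, cached_c in hits if query_c != cached_c)
    return cross_cluster / len(hits)


# Four hits, one of which returned a "refund" answer for a "shipping" query.
hits = [
    ("refund", "refund"),
    ("cancel", "cancel"),
    ("shipping", "refund"),
    ("refund", "refund"),
]
fp = false_positive_rate(hits)  # 1 / 4 = 0.25
```

Note that the denominator is the number of hits, not the number of queries, so a run with few hits can show a high FP rate from a single bad match.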

Running CogniCache + Judge in the bench

The bench wires up a built-in ClusterAwareStubJudge that uses the cluster labels to deterministically score answer quality. This lets you verify the Judge plumbing without paying for real LLM judge calls.

from cogcache import CogniCache
from cogcache.bench.run_bench import ClusterAwareStubJudge

cache = CogniCache(
    similarity_threshold=0.50,
    enable_judge=True,
    judge=ClusterAwareStubJudge({"query1": "cluster_A"}),  # ... one entry per dataset query
    write_min_quality=0.80,
    low_quality_ttl=0,   # reject low-quality writes outright
)

Reading the headline metrics

| Experiment    | Hit Rate | Token Savings | P50 Hit | P50 Miss | Per-Hit Gain |
|---------------|---------:|--------------:|--------:|---------:|-------------:|
| Baseline      |     0.0% |             — |   0.0ms |  500.1ms |            — |
| Cache Only    |    14.0% |         14.0% |   1.9ms |  501.3ms |        +100% |
| Cache + Judge |    14.0% |         14.0% |   1.8ms |  501.3ms |        +100% |

Per-Hit Gain: (P50_miss - P50_hit) / P50_miss. The proposal target is ≥ 60%.
Token Savings: tokens saved by the cache ÷ baseline total tokens. In offline mode this equals the hit rate because every answer is counted at a fixed token cost.
FP Rate: the share of hits that returned an answer from a different cluster. It should be ≤ 5% in production; if it is higher, raise the similarity threshold.
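
As a worked example, the definitions above can be applied to the Cache Only row of the table. The helper names, the per-answer token count, and the query counts are assumptions chosen to reproduce a 14% hit rate.

```python
def per_hit_gain(p50_hit_ms: float, p50_miss_ms: float) -> float:
    """Fraction of miss latency avoided on a hit: (P50_miss - P50_hit) / P50_miss."""
    return (p50_miss_ms - p50_hit_ms) / p50_miss_ms


def token_savings(tokens_saved: float, baseline_tokens: float) -> float:
    """Tokens saved by the cache divided by baseline total tokens."""
    return tokens_saved / baseline_tokens


# Cache Only row: P50 hit 1.9 ms, P50 miss 501.3 ms.
gain = per_hit_gain(1.9, 501.3)  # about 0.996, shown as +100% after rounding

# Offline mode counts a fixed number of tokens per answer, so savings equals hit rate.
TOKENS_PER_ANSWER = 100  # hypothetical constant
queries, cache_hits = 50, 7  # hypothetical counts giving a 14% hit rate
savings = token_savings(cache_hits * TOKENS_PER_ANSWER, queries * TOKENS_PER_ANSWER)
```

The fixed per-answer token count cancels out of the ratio, which is why the Hit Rate and Token Savings columns match in the offline table.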

Continuous benchmarking

For regression testing, integrate the bench into CI:

- name: Run offline bench
  run: |
    pip install "cogcache[bench]"
    python -m cogcache.bench.run_bench \
        --threshold 0.50 \
        --output /tmp/bench.md
    grep -q "Hit Rate ≥ 20%" /tmp/bench.md   # fail if regression
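
The grep step only matches a literal string. A stricter gate could parse the numeric hit rate out of the report instead. The sketch below assumes the report contains a headline table laid out like the one under "Reading the headline metrics"; `hit_rate_for` is a hypothetical helper, not part of cogcache.

```python
import re


def hit_rate_for(report_md: str, experiment: str) -> float:
    """Return the Hit Rate column, as a fraction, for one experiment row."""
    for line in report_md.splitlines():
        # Split a markdown table row into trimmed cells.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[0] == experiment:
            match = re.fullmatch(r"([0-9.]+)%", cells[1])
            if match:
                return float(match.group(1)) / 100.0
    raise ValueError(f"no Hit Rate found for {experiment!r}")


# Sample rows in the assumed report layout.
SAMPLE = """\
| Experiment    | Hit Rate | Token Savings | P50 Hit | P50 Miss | Per-Hit Gain |
|---------------|---------:|--------------:|--------:|---------:|-------------:|
| Cache Only    |    14.0% |         14.0% |   1.9ms |  501.3ms |        +100% |
| Cache + Judge |    14.0% |         14.0% |   1.8ms |  501.3ms |        +100% |
"""

rate = hit_rate_for(SAMPLE, "Cache Only")
passes = rate >= 0.20  # proposal target: hit rate of at least 20%
```

In a CI step, the script could read `/tmp/bench.md` and call `sys.exit(0 if passes else 1)` so the job fails when the hit rate drops below the target.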