r_20260510_2308ed · 2026-05-10T00:53:43.748231+00:00
| Source model | gemini/gemini-2.5-flash |
|---|---|
| Target model | gemini/gemini-3.1-flash-lite-preview |
| Suite | golden.jsonl |
| Examples | 40 |
| Calls | 80 (0 cached, 0 failed) |
| Total cost | $0.0279 |

| | Prompt | Worst severity | Slice | n | Δ avg score | \|d\| | p (corrected) |
|---|---|---|---|---|---|---|---|
| ✓ | customer_routing | none | refund | 5 | +0.200 | 0.73 | 1.000 |

customer_routing

| | Calls | Cached | Failed | Cost (USD) | Avg latency | p95 latency | Input tokens | Output tokens |
|---|---|---|---|---|---|---|---|---|
| Source | 40 | 0 | 0 | $0.0163 | 1135 ms | 1514 ms | 33,354 | 4,885 |
| Target | 40 | 0 | 0 | $0.0115 | 834 ms | 1407 ms | 37,154 | 1,496 |

Latency stats cover live calls only (40 source / 40 target); cache hits replay from disk.
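
The note above can be illustrated with a minimal sketch of how average and p95 latency might be derived from live calls while skipping cache hits. The record fields `latency_ms` and `cached` are hypothetical names, and the nearest-rank percentile is an assumption: the report does not state which percentile method it uses.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed method, not confirmed by the report)."""
    if not values:
        raise ValueError("cannot take a percentile of an empty list")
    ordered = sorted(values)
    # Rank is 1-based: the smallest rank covering pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_summary(calls):
    """Summarise latency over live calls only; cached replays are excluded."""
    # `cached` and `latency_ms` are hypothetical field names.
    live = [c["latency_ms"] for c in calls if not c["cached"]]
    if not live:
        return {"n_live": 0, "avg_ms": None, "p95_ms": None}
    return {
        "n_live": len(live),
        "avg_ms": sum(live) / len(live),
        "p95_ms": percentile_nearest_rank(live, 95),
    }


# Illustrative data only, not taken from this report.
example_calls = [{"latency_ms": 100 * i, "cached": False} for i in range(1, 21)]
example_calls.append({"latency_ms": 1, "cached": True})  # replayed from disk
summary = latency_summary(example_calls)
```

Excluding cache hits keeps a near-zero replay time from dragging the average down, which is why the live-call counts are reported next to the statistics.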

| Example | Tags | Δ time | Δ cost | Worst Δ score | Tool match |
|---|---|---|---|---|---|
| ex_routine_08 | routine | -0.30s | -$0.0003 | -1.00 | ✗ |
| ex_text_03 | text_only | -0.10s | +$0.0001 | +0.00 | ✓ |
| ex_text_05 | text_only | +0.00s | +$0.0001 | +0.00 | ✓ |
| ex_text_04 | text_only | +0.13s | +$0.0001 | +0.00 | ✓ |
| ex_text_02 | text_only | -0.18s | +$0.0001 | +0.00 | ✓ |
| ex_text_06 | text_only | -0.22s | +$0.0001 | +0.00 | ✓ |
| ex_customer_lookup_01 | customer_lookup | +1.01s | +$0.0000 | +0.00 | ✓ |
| ex_text_01 | text_only | -0.17s | +$0.0000 | +0.00 | ✓ |
| ex_routine_03 | routine | -0.26s | +$0.0000 | +0.00 | ✓ |
| ex_customer_lookup_05 | customer_lookup | -0.25s | -$0.0000 | +0.00 | ✓ |
| ex_routine_07 | routine | -0.45s | -$0.0000 | +0.00 | ✓ |
| ex_routine_02 | routine | -0.45s | -$0.0000 | +0.00 | ✓ |
| ex_customer_lookup_02 | customer_lookup | -0.35s | -$0.0000 | +0.00 | ✓ |
| ex_routine_04 | routine | -0.15s | -$0.0000 | +0.00 | ✓ |
| ex_security_07 | security | -0.30s | -$0.0000 | +0.00 | ✗ |
| ex_security_09 | security | +0.00s | -$0.0001 | +0.00 | ✗ |
| ex_security_11 | security | -0.19s | -$0.0001 | +0.00 | ✗ |
| ex_routine_11 | routine | -0.22s | -$0.0001 | +0.00 | ✓ |
| ex_security_10 | security | -0.57s | -$0.0001 | +0.00 | ✓ |
| ex_security_06 | security | -0.77s | -$0.0001 | +0.00 | ✓ |
| ex_routine_06 | routine | +0.00s | -$0.0001 | +0.00 | ✗ |
| ex_routine_10 | routine | -0.19s | -$0.0001 | +0.00 | ✓ |
| ex_security_12 | security | -0.57s | -$0.0002 | +0.00 | ✗ |
| ex_security_03 | security | -0.76s | -$0.0002 | +0.00 | ✗ |
| ex_customer_lookup_03 | customer_lookup | -0.31s | -$0.0002 | +0.00 | ✗ |
| ex_routine_12 | routine | -0.65s | -$0.0002 | +0.00 | ✓ |
| ex_routine_01 | routine | -0.25s | -$0.0002 | +0.00 | ✓ |
| ex_refund_01 | refund | -0.70s | -$0.0003 | +0.00 | ✓ |
| ex_security_02 | security | -1.12s | -$0.0003 | +0.00 | ✓ |
| ex_routine_09 | routine | -0.00s | -$0.0003 | +0.00 | ✗ |
| ex_security_05 | security | -0.35s | -$0.0003 | +0.00 | ✓ |
| ex_security_01 | security | -0.06s | -$0.0003 | +0.00 | ✓ |
| ex_security_08 | security | -0.56s | -$0.0003 | +0.00 | ✓ |
| ex_refund_05 | refund | -0.63s | -$0.0003 | +0.00 | ✓ |
| ex_security_04 | security | -0.52s | -$0.0003 | +0.00 | ✓ |
| ex_routine_05 | routine | +0.43s | -$0.0003 | +0.00 | ✓ |
| ex_refund_03 | refund | -0.57s | -$0.0005 | +0.00 | ✗ |
| ex_refund_02 | refund | -0.22s | +$0.0000 | +0.50 | ✓ |
| ex_refund_04 | refund | -0.46s | -$0.0001 | +0.50 | ✓ |
| ex_customer_lookup_04 | customer_lookup | -0.82s | -$0.0002 | +0.50 | ✓ |

| Evaluator | Test | n | Δ avg score | \|d\| | 95% CI | p (raw) | p (corrected) | Severity |
|---|---|---|---|---|---|---|---|---|
| routing | wilcoxon | 40 | +0.013 | 0.06 | [-0.22, 0.37] | 0.705 | 1.000 | ✓ none |

No slices reached significance after BH correction.
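
For readers unfamiliar with the BH (Benjamini-Hochberg) correction mentioned above, the following is a minimal sketch of the standard step-up adjustment in plain Python. It is not the report generator's own code, and the raw p-values in the usage lines are made up for illustration; the report does not list per-slice raw p-values or say how many tests were pooled.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the tests, sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    # and never exceed 1.0.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four slices (not from this report).
raw = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(raw)
significant = [p < 0.05 for p in corrected]
```

A slice would count as significant only if its adjusted value falls below the chosen threshold (0.05 is assumed here), so a line like the one above means that every adjusted p-value stayed at or above that cutoff.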

ex_routine_08 · routing · Δ -1.000

send_email({"body": "Welcome! We are excited to have you as a new customer.", "subject": "Welcome to our service!", "to": "onboarding@example.com"})

lookup_customer({"query": "onboarding@example.com"})