EvalShift report

r_20260510_2308ed · 2026-05-10T00:53:43.748231+00:00

Source model: gemini/gemini-2.5-flash
Target model: gemini/gemini-3.1-flash-lite-preview
Suite: golden.jsonl
Examples: 40
Calls: 80 (0 cached, 0 failed)
Total cost: $0.0279

Executive summary

Prompt            Worst severity  Slice   n  Δ avg score  |d|   p (corrected)
customer_routing  none            refund  5  +0.200       0.73  1.000

Prompt: customer_routing

Run economics

Model   Calls  Cached  Failed  Cost (USD)  Avg latency  p95 latency  Input tokens  Output tokens
Source  40     0       0       $0.0163     1135 ms      1514 ms      33,354        4,885
Target  40     0       0       $0.0115     834 ms       1407 ms      37,154        1,496

Latency stats cover live calls only (40 source / 40 target); cache hits replay from disk.
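
The relative change between the two rows can be worked out directly from the table. The snippet below is a minimal sketch that uses only the figures printed above; the dictionary keys and the `pct_change` helper are illustrative names, not part of EvalShift.

```python
# Figures transcribed from the run-economics table (source row, then target row).
source = {"cost_usd": 0.0163, "avg_ms": 1135, "p95_ms": 1514, "in_tok": 33354, "out_tok": 4885}
target = {"cost_usd": 0.0115, "avg_ms": 834, "p95_ms": 1407, "in_tok": 37154, "out_tok": 1496}


def pct_change(old, new):
    """Percentage change from old to new (negative means the target is lower)."""
    return (new - old) / old * 100.0


# Percentage change per metric, rounded to one decimal place.
changes = {key: round(pct_change(source[key], target[key]), 1) for key in source}

for key, value in changes.items():
    print(f"{key}: {value:+.1f}%")
```

On these numbers the target is cheaper and faster on average, sends more input tokens, and emits far fewer output tokens.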

Per-example breakdown

Example  Tags  Δ time  Δ cost  Worst Δ score  Tool match
ex_routine_08 routine -0.30s -$0.0003 -1.00
ex_text_03 text_only -0.10s +$0.0001 +0.00
ex_text_05 text_only +0.00s +$0.0001 +0.00
ex_text_04 text_only +0.13s +$0.0001 +0.00
ex_text_02 text_only -0.18s +$0.0001 +0.00
ex_text_06 text_only -0.22s +$0.0001 +0.00
ex_customer_lookup_01 customer_lookup +1.01s +$0.0000 +0.00
ex_text_01 text_only -0.17s +$0.0000 +0.00
ex_routine_03 routine -0.26s +$0.0000 +0.00
ex_customer_lookup_05 customer_lookup -0.25s -$0.0000 +0.00
ex_routine_07 routine -0.45s -$0.0000 +0.00
ex_routine_02 routine -0.45s -$0.0000 +0.00
ex_customer_lookup_02 customer_lookup -0.35s -$0.0000 +0.00
ex_routine_04 routine -0.15s -$0.0000 +0.00
ex_security_07 security -0.30s -$0.0000 +0.00
ex_security_09 security +0.00s -$0.0001 +0.00
ex_security_11 security -0.19s -$0.0001 +0.00
ex_routine_11 routine -0.22s -$0.0001 +0.00
ex_security_10 security -0.57s -$0.0001 +0.00
ex_security_06 security -0.77s -$0.0001 +0.00
ex_routine_06 routine +0.00s -$0.0001 +0.00
ex_routine_10 routine -0.19s -$0.0001 +0.00
ex_security_12 security -0.57s -$0.0002 +0.00
ex_security_03 security -0.76s -$0.0002 +0.00
ex_customer_lookup_03 customer_lookup -0.31s -$0.0002 +0.00
ex_routine_12 routine -0.65s -$0.0002 +0.00
ex_routine_01 routine -0.25s -$0.0002 +0.00
ex_refund_01 refund -0.70s -$0.0003 +0.00
ex_security_02 security -1.12s -$0.0003 +0.00
ex_routine_09 routine -0.00s -$0.0003 +0.00
ex_security_05 security -0.35s -$0.0003 +0.00
ex_security_01 security -0.06s -$0.0003 +0.00
ex_security_08 security -0.56s -$0.0003 +0.00
ex_refund_05 refund -0.63s -$0.0003 +0.00
ex_security_04 security -0.52s -$0.0003 +0.00
ex_routine_05 routine +0.43s -$0.0003 +0.00
ex_refund_03 refund -0.57s -$0.0005 +0.00
ex_refund_02 refund -0.22s +$0.0000 +0.50
ex_refund_04 refund -0.46s -$0.0001 +0.50
ex_customer_lookup_04 customer_lookup -0.82s -$0.0002 +0.50

Aggregate (slice = "all")

Evaluator  Test      n   Δ avg score  |d|   95% CI         p (raw)  p (corrected)  Severity
routing    wilcoxon  40  +0.013       0.06  [-0.22, 0.37]  0.705    1.000          ✓ none
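
The Δ avg score and |d| columns can be cross-checked against the per-example breakdown. The sketch below assumes that |d| is a paired effect size, that is, the mean per-example delta divided by the sample standard deviation of the deltas. That definition is an assumption (the methodology is not reproduced here), and `paired_effect` is an illustrative helper name.

```python
from statistics import mean, stdev

# "Worst Δ score" values transcribed from the per-example breakdown:
# one -1.00 (ex_routine_08), three +0.50, and 36 zeros.
all_deltas = [-1.0] + [0.5] * 3 + [0.0] * 36

# Refund slice, ex_refund_01 through ex_refund_05.
refund_deltas = [0.0, 0.5, 0.0, 0.5, 0.0]


def paired_effect(deltas):
    """Return (mean delta, |d|), with d = mean / sample sd of the deltas (assumed definition)."""
    m = mean(deltas)
    return m, abs(m) / stdev(deltas)


for name, deltas in [("all", all_deltas), ("refund", refund_deltas)]:
    m, d = paired_effect(deltas)
    print(f"{name}: n={len(deltas)} mean_delta={m:+.4f} abs_d={d:.2f}")
```

Under that assumption the results round to the reported figures: +0.013 and 0.06 for the full suite, +0.200 and 0.73 for the refund slice. The Wilcoxon p-value and the confidence interval are not reproduced here.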

Slices with significant change

No slices reached significance after BH correction.
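
For readers unfamiliar with the BH (Benjamini-Hochberg) correction, the following is a minimal sketch of the standard adjustment. It is not EvalShift's own code, and the p-values in the usage lines are made up for illustration; the report does not list the raw per-slice p-values.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values only (NOT taken from this run).
example = [0.01, 0.04, 0.03, 0.20]
adjusted = bh_adjust(example)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

A slice would be flagged only if its adjusted p-value fell at or below the chosen threshold (0.05 in this illustration, which is also an assumption).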

Top regressions

ex_routine_08 · routing · Δ -1.000

Source trace

  1. send_email({"body": "Welcome! We are excited to have you as a new customer.", "subject": "Welcome to our service!", "to": "onboarding@example.com"})

Target trace

  1. lookup_customer({"query": "onboarding@example.com"})
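
The two traces above diverge at the first tool call: the source sends the email, while the target looks up the customer instead. A comparison of this kind can be sketched as follows. The tuple representation and the `first_divergence` helper are hypothetical and are not how EvalShift necessarily stores or scores traces.

```python
# Traces transcribed from the report as (tool name, arguments) pairs.
source_trace = [
    ("send_email", {
        "body": "Welcome! We are excited to have you as a new customer.",
        "subject": "Welcome to our service!",
        "to": "onboarding@example.com",
    }),
]
target_trace = [
    ("lookup_customer", {"query": "onboarding@example.com"}),
]


def first_divergence(source, target):
    """Index of the first step whose tool name or arguments differ, else None."""
    for i, (s, t) in enumerate(zip(source, target)):
        if s != t:
            return i
    # One trace is a strict prefix of the other.
    if len(source) != len(target):
        return min(len(source), len(target))
    return None


step = first_divergence(source_trace, target_trace)
print("first divergent step:", None if step is None else step + 1)
```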

Methodology