evalshift · r_20260510

Source model	`gemini/gemini-2.5-flash`
Target model	`gemini/gemini-3.1-flash-lite-preview`
Suite	`golden.jsonl`
Examples	40
Calls	80 (0 cached, 0 failed)
Total cost	$0.0279

Executive summary

	Prompt	Worst severity	Slice	n	Δ avg score	|d|	p_corrected
✓	`customer_routing`	none	refund	5	+0.200	0.73	1.000

Prompt: `customer_routing`

Run economics

	Calls	Cached	Failed	Cost (USD)	Avg latency	p95 latency	Input tokens	Output tokens
Source	40	0	0	$0.0163	1135 ms	1514 ms	33,354	4,885
Target	40	0	0	$0.0115	834 ms	1407 ms	37,154	1,496

Latency statistics cover live calls only (40 source, 40 target); cache hits are replayed from disk and excluded.
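To make the source-to-target shift easier to read, the table rows can be turned into relative changes. The sketch below only restates the figures from the table; the dictionary keys and the `relative_change` helper are illustrative names, not part of EvalShift.

```python
# Figures copied from the Run economics table.
source = {"cost_usd": 0.0163, "avg_latency_ms": 1135, "p95_latency_ms": 1514,
          "input_tokens": 33_354, "output_tokens": 4_885}
target = {"cost_usd": 0.0115, "avg_latency_ms": 834, "p95_latency_ms": 1407,
          "input_tokens": 37_154, "output_tokens": 1_496}


def relative_change(before: float, after: float) -> float:
    """Signed percentage change going from `before` to `after`."""
    return (after - before) / before * 100.0


changes = {key: relative_change(source[key], target[key]) for key in source}

for key, pct in changes.items():
    print(f"{key:>16}: {pct:+.1f}%")
```

On these numbers the target is roughly 29% cheaper, about 27% faster on average latency, and emits about 69% fewer output tokens, while consuming about 11% more input tokens. Costs are rounded to four decimals in the table, so the cost percentage is approximate.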

Per-example breakdown

Example	Tags	Δ time	Δ cost	Worst Δ score	Tool match
`ex_routine_08`	routine	-0.30s	-$0.0003	-1.00	✗
`ex_text_03`	text_only	-0.10s	+$0.0001	+0.00	✓
`ex_text_05`	text_only	+0.00s	+$0.0001	+0.00	✓
`ex_text_04`	text_only	+0.13s	+$0.0001	+0.00	✓
`ex_text_02`	text_only	-0.18s	+$0.0001	+0.00	✓
`ex_text_06`	text_only	-0.22s	+$0.0001	+0.00	✓
`ex_customer_lookup_01`	customer_lookup	+1.01s	+$0.0000	+0.00	✓
`ex_text_01`	text_only	-0.17s	+$0.0000	+0.00	✓
`ex_routine_03`	routine	-0.26s	+$0.0000	+0.00	✓
`ex_customer_lookup_05`	customer_lookup	-0.25s	-$0.0000	+0.00	✓
`ex_routine_07`	routine	-0.45s	-$0.0000	+0.00	✓
`ex_routine_02`	routine	-0.45s	-$0.0000	+0.00	✓
`ex_customer_lookup_02`	customer_lookup	-0.35s	-$0.0000	+0.00	✓
`ex_routine_04`	routine	-0.15s	-$0.0000	+0.00	✓
`ex_security_07`	security	-0.30s	-$0.0000	+0.00	✗
`ex_security_09`	security	+0.00s	-$0.0001	+0.00	✗
`ex_security_11`	security	-0.19s	-$0.0001	+0.00	✗
`ex_routine_11`	routine	-0.22s	-$0.0001	+0.00	✓
`ex_security_10`	security	-0.57s	-$0.0001	+0.00	✓
`ex_security_06`	security	-0.77s	-$0.0001	+0.00	✓
`ex_routine_06`	routine	+0.00s	-$0.0001	+0.00	✗
`ex_routine_10`	routine	-0.19s	-$0.0001	+0.00	✓
`ex_security_12`	security	-0.57s	-$0.0002	+0.00	✗
`ex_security_03`	security	-0.76s	-$0.0002	+0.00	✗
`ex_customer_lookup_03`	customer_lookup	-0.31s	-$0.0002	+0.00	✗
`ex_routine_12`	routine	-0.65s	-$0.0002	+0.00	✓
`ex_routine_01`	routine	-0.25s	-$0.0002	+0.00	✓
`ex_refund_01`	refund	-0.70s	-$0.0003	+0.00	✓
`ex_security_02`	security	-1.12s	-$0.0003	+0.00	✓
`ex_routine_09`	routine	-0.00s	-$0.0003	+0.00	✗
`ex_security_05`	security	-0.35s	-$0.0003	+0.00	✓
`ex_security_01`	security	-0.06s	-$0.0003	+0.00	✓
`ex_security_08`	security	-0.56s	-$0.0003	+0.00	✓
`ex_refund_05`	refund	-0.63s	-$0.0003	+0.00	✓
`ex_security_04`	security	-0.52s	-$0.0003	+0.00	✓
`ex_routine_05`	routine	+0.43s	-$0.0003	+0.00	✓
`ex_refund_03`	refund	-0.57s	-$0.0005	+0.00	✗
`ex_refund_02`	refund	-0.22s	+$0.0000	+0.50	✓
`ex_refund_04`	refund	-0.46s	-$0.0001	+0.50	✓
`ex_customer_lookup_04`	customer_lookup	-0.82s	-$0.0002	+0.50	✓
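
The summary statistics can be cross-checked against this table. The sketch below assumes that the `routing` evaluator's per-example delta equals the "Worst Δ score" column (there is only one evaluator in this run) and that `|d|` is a paired effect size, the absolute mean difference divided by the sample standard deviation of the differences. Both are assumptions; the report does not state its formula here.

```python
from statistics import mean, stdev

# Nonzero "Worst Δ score" values from the table; every other row is 0.00.
nonzero = {
    "ex_routine_08": -1.0,
    "ex_refund_02": 0.5,
    "ex_refund_04": 0.5,
    "ex_customer_lookup_04": 0.5,
}
all_deltas = list(nonzero.values()) + [0.0] * (40 - len(nonzero))

# Refund slice: ex_refund_01, ex_refund_05, ex_refund_03, ex_refund_02, ex_refund_04.
refund_deltas = [0.0, 0.0, 0.0, 0.5, 0.5]


def paired_effect_size(deltas):
    """Assumed definition: |mean of differences| / sample SD of differences."""
    return abs(mean(deltas)) / stdev(deltas)


for name, deltas in [("all", all_deltas), ("refund", refund_deltas)]:
    print(f"{name:>6}: mean={mean(deltas):+.3f}  |d|={paired_effect_size(deltas):.2f}")
```

Under these assumptions the script gives a mean of +0.013 with |d| of 0.06 for all 40 examples and a mean of +0.200 with |d| of 0.73 for the refund slice, which agrees with the figures reported elsewhere in this report.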

Aggregate (slice = "all")

Evaluator	Test	n	Δ avg score	|d|	95% CI	p_raw	p_corrected	Severity
`routing`	wilcoxon	40	+0.013	0.06	[-0.22, 0.37]	0.705	1.000	✓ none
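
For readers who want to see where a p-value of this size can come from, here is a minimal standard-library sketch of a two-sided Wilcoxon signed-rank test using the normal approximation. It assumes zero differences are dropped, tied magnitudes share an average rank, the variance is tie-corrected, and no continuity correction is applied. Whether EvalShift uses exactly these settings is an assumption; the function name is illustrative.

```python
import math
from itertools import groupby


def wilcoxon_normal_approx(deltas):
    """Two-sided Wilcoxon signed-rank p-value via the normal approximation.

    Zero differences are dropped, tied magnitudes share their average rank,
    and the variance carries the usual tie correction.
    """
    nonzero = sorted((d for d in deltas if d != 0), key=abs)
    n = len(nonzero)
    if n == 0:
        return 1.0
    w_plus, tie_term, position = 0.0, 0.0, 0
    for _, group in groupby(nonzero, key=abs):
        members = list(group)
        t = len(members)
        avg_rank = position + (t + 1) / 2  # mean of ranks position+1 .. position+t
        w_plus += avg_rank * sum(1 for d in members if d > 0)
        tie_term += t**3 - t
        position += t
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2))


# Per-example deltas from the breakdown table: one -1.00, three +0.50, 36 zeros.
deltas = [-1.0, 0.5, 0.5, 0.5] + [0.0] * 36
p_raw = wilcoxon_normal_approx(deltas)
print(f"p_raw = {p_raw:.3f}")
```

With only four nonzero differences the normal approximation is rough, so the result is best read as "no evidence of a shift" rather than as a precise probability. The sketch does not attempt to reproduce the 95% CI, whose construction is not described in this section.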

Slices with significant change

No slice reached significance after Benjamini-Hochberg (BH) correction.
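
As a reminder of what the BH step does, the sketch below adjusts a list of raw p-values with the standard Benjamini-Hochberg procedure. The input list is illustrative only: 0.705 is the aggregate `p_raw` from this report and the other three values are invented for the example. How EvalShift groups tests into a family before correcting is not stated here, so this is not a reconstruction of the reported `p_corrected` values.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative inputs: only 0.705 comes from the report, the rest are made up.
example = [0.705, 0.01, 0.04, 0.03]
adjusted = benjamini_hochberg(example)
print([round(p, 3) for p in adjusted])
```

A slice would be flagged when its adjusted p-value falls below the chosen false discovery rate, commonly 0.05.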

Top regressions

ex_routine_08 · routing · Δ -1.000

Source trace

send_email({"body": "Welcome! We are excited to have you as a new customer.", "subject": "Welcome to our service!", "to": "onboarding@example.com"})

Target trace

lookup_customer({"query": "onboarding@example.com"})
