Debate Result · fake

Disputatio Fake E2E Fixture 20260601T201438Z

Deterministic fixture debate for M4 fake jury validation.

Debate: 2f30e730-c722-43fd-bb58-4b123c6ee704

completed

Consensus Winner: Unclear (3/3 valid judges)

Model Agreement: 100.0% (Divergence: Medium)

Closest To Consensus: model_a_budget. Closest to consensus among schema-valid judgements. Latency: 210 ms.

Cost: $0.000000 (3805 tokens)

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

Consensus Score: 7.03 (median of valid judges)

Disagreement: 1.50 (lower means more aligned)
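As a quick check of the two figures above, the following sketch recomputes them from the three total scores listed under Jury Verdicts. The median rule is stated on the page; treating disagreement as the max-minus-min range is an assumption that happens to match the reported 1.50, and the function name `consensus_summary` is hypothetical.

```python
from statistics import median

# Total scores as reported in the Jury Verdicts table of this run.
totals = {
    "model_a_budget": 7.03,
    "model_b_reasoning": 6.33,
    "model_c_contrast": 7.83,
}


def consensus_summary(scores):
    """Return (consensus score, disagreement, judge closest to consensus).

    Assumption: disagreement is the range (max - min) of the valid totals.
    """
    values = list(scores.values())
    consensus = median(values)
    disagreement = round(max(values) - min(values), 2)
    # The closest judge is the one whose total is nearest to the median.
    closest = min(scores, key=lambda name: abs(scores[name] - consensus))
    return consensus, disagreement, closest


consensus, disagreement, closest = consensus_summary(totals)
print(f"{consensus:.2f} {disagreement:.2f} {closest}")
```

With an odd number of judges the median is always one judge's own score, which is why model_a_budget sits exactly on the consensus value here.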

Evaluated Source

Transparent demo/source context for this run.

Source Type: fixture

Purpose: Manual debate source.

Input Scope: internal_only. Full available manual transcript.

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.

Schema reliability 3/3

Only schema-valid judgements are used for consensus and score aggregation.

JSON reliability 3/3

Counts valid JSON responses before stricter judgement-schema checks.
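The two reliability counts describe a two-stage filter: a response must first parse as JSON, and only then is it checked against the stricter judgement schema. A minimal sketch of that idea follows. The field names in `REQUIRED_FIELDS` and the sample responses are hypothetical, since the actual judgement schema is not shown on this page.

```python
import json

# Hypothetical minimal schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "winner": str,
    "total_score": (int, float),
    "confidence": (int, float),
}


def reliability(raw_responses):
    """Return (number of JSON-valid responses, list of schema-valid judgements)."""
    json_valid = 0
    schema_valid = []
    for raw in raw_responses:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # fails stage 1: not valid JSON
        json_valid += 1
        # Stage 2: must be an object with every required field of the right type.
        if isinstance(parsed, dict) and all(
            isinstance(parsed.get(key), types)
            for key, types in REQUIRED_FIELDS.items()
        ):
            schema_valid.append(parsed)
    return json_valid, schema_valid


# Illustrative inputs only; these are not responses from this run.
samples = [
    '{"winner": "Unclear", "total_score": 7.03, "confidence": 0.91}',
    '{"winner": "Unclear"}',  # valid JSON, but missing required fields
    "not json at all",
]
json_ok, judgements = reliability(samples)
print(f"JSON reliability {json_ok}/{len(samples)}")
print(f"Schema reliability {len(judgements)}/{len(samples)}")
```

Only the entries in `judgements` would feed consensus and score aggregation, which mirrors the rule stated above.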

Provider disagreement Medium

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread Medium

Score spread 1.50 across valid model judgements.

Rhetorical manipulation spread Medium

Score spread 1.50 across valid model judgements.
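The two spread figures can be reproduced from the Score Dimensions table, assuming "spread" means the highest score minus the lowest score for that dimension across valid judges. That reading is an assumption; it matches the reported 1.50 for both dimensions. The thresholds that map a spread to the "Medium" label are not stated on this page, so the sketch does not attempt them.

```python
# Per-dimension scores copied from the Score Dimensions table of this run.
dimension_scores = {
    "evidence": {
        "model_a_budget": 6.70,
        "model_b_reasoning": 6.00,
        "model_c_contrast": 7.50,
    },
    "rhetorical_manipulation": {
        "model_a_budget": 6.60,
        "model_b_reasoning": 5.90,
        "model_c_contrast": 7.40,
    },
}


def spread(scores_by_model):
    """Range of one dimension's scores across judges (assumed definition)."""
    values = scores_by_model.values()
    return round(max(values) - min(values), 2)


spreads = {name: spread(scores) for name, scores in dimension_scores.items()}
for name, value in spreads.items():
    print(f"{name}: {value:.2f}")
```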

Position / identity bias not tested

MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
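One way such a research run could be prepared is sketched below: the same debate is presented twice with the speaking order swapped and with neutral labels, and a mapping back to the original sides is kept so verdicts can be compared afterwards. The function name, the "Participant 1/2" labels, and the output shape are all hypothetical and are not part of the product described here.

```python
def order_swapped_variants(side_a, side_b):
    """Return two presentations of one debate, original order and swapped.

    Each variant replaces the Speaker A/B labels with neutral ones and
    records which original side each neutral label refers to.
    """
    orderings = (
        (("A", side_a), ("B", side_b)),  # original order
        (("B", side_b), ("A", side_a)),  # swapped order
    )
    variants = []
    for first, second in orderings:
        variants.append(
            {
                "text": f"Participant 1: {first[1]}\nParticipant 2: {second[1]}",
                "label_map": {"Participant 1": first[0], "Participant 2": second[0]},
            }
        )
    return variants


variants = order_swapped_variants("Opening case for the motion.", "Opening case against.")
for variant in variants:
    print(variant["label_map"])
```

If a judge's verdict follows "Participant 1" in both variants rather than the same original side, that would hint at position sensitivity, though a single pair of runs would not establish it.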

Jury Verdicts

Clear product view first; raw diagnostics remain below.

LLM Judge	Verdict	Score	Confidence	Why
model_a_budget completed	Unclear	7.03	0.9100	Fake Model A produced a deterministic fake judgement for deployment validation.
model_b_reasoning completed	Unclear	6.33	0.9100	Fake Model B produced a deterministic fake judgement for deployment validation.
model_c_contrast completed	Unclear	7.83	0.9100	Fake Model C produced a deterministic fake judgement for deployment validation.

Model Cards

Operational details for trust and debugging.

model_a_budget

completed

Total score: 7.03

Winner: Unclear
JSON: True
Schema: True
Latency: 210 ms
Tokens: 1240
Cost: $0
Provider: fake
Finish: n/a

Fake Model A produced a deterministic fake judgement for deployment validation.

model_b_reasoning

completed

Total score: 6.33

Winner: Unclear
JSON: True
Schema: True
Latency: 310 ms
Tokens: 1310
Cost: $0
Provider: fake
Finish: n/a

Fake Model B produced a deterministic fake judgement for deployment validation.

model_c_contrast

completed

Total score: 7.83

Winner: Unclear
JSON: True
Schema: True
Latency: 260 ms
Tokens: 1255
Cost: $0
Provider: fake
Finish: n/a

Fake Model C produced a deterministic fake judgement for deployment validation.
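The run-level cost and token figures in the summary can be cross-checked against the three model cards. The sketch below sums the per-model values; the dictionary keys are hypothetical names chosen for illustration, and the six-decimal cost format is an assumption about how the summary is rendered.

```python
# Operational values copied from the three model cards above.
cards = [
    {"model": "model_a_budget", "latency_ms": 210, "tokens": 1240, "cost_usd": 0.0},
    {"model": "model_b_reasoning", "latency_ms": 310, "tokens": 1310, "cost_usd": 0.0},
    {"model": "model_c_contrast", "latency_ms": 260, "tokens": 1255, "cost_usd": 0.0},
]

total_tokens = sum(card["tokens"] for card in cards)
total_cost = sum(card["cost_usd"] for card in cards)
slowest = max(cards, key=lambda card: card["latency_ms"])

print(f"Cost: ${total_cost:.6f} ({total_tokens} tokens)")
print(f"Slowest judge: {slowest['model']} at {slowest['latency_ms']} ms")
```

The token sum comes to 3805, which agrees with the total shown in the run summary.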

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model	Dimension	Score	Confidence	Reason
model_a_budget	clarity	7.40	0.9000	Deterministic fake score for clarity.
model_a_budget	context_fidelity	7.20	0.9000	Deterministic fake score for context_fidelity.
model_a_budget	counterarguments	7.10	0.9000	Deterministic fake score for counterarguments.
model_a_budget	evidence	6.70	0.9000	Deterministic fake score for evidence.
model_a_budget	factual_grounding	6.80	0.9000	Deterministic fake score for factual_grounding.
model_a_budget	fairness	7.00	0.9000	Deterministic fake score for fairness.
model_a_budget	logic	7.20	0.9000	Deterministic fake score for logic.
model_a_budget	relevance	7.30	0.9000	Deterministic fake score for relevance.
model_a_budget	rhetorical_manipulation	6.60	0.9000	Deterministic fake score for rhetorical_manipulation.
model_b_reasoning	clarity	6.70	0.9000	Deterministic fake score for clarity.
model_b_reasoning	context_fidelity	6.50	0.9000	Deterministic fake score for context_fidelity.
model_b_reasoning	counterarguments	6.40	0.9000	Deterministic fake score for counterarguments.
model_b_reasoning	evidence	6.00	0.9000	Deterministic fake score for evidence.
model_b_reasoning	factual_grounding	6.10	0.9000	Deterministic fake score for factual_grounding.
model_b_reasoning	fairness	6.30	0.9000	Deterministic fake score for fairness.
model_b_reasoning	logic	6.50	0.9000	Deterministic fake score for logic.
model_b_reasoning	relevance	6.60	0.9000	Deterministic fake score for relevance.
model_b_reasoning	rhetorical_manipulation	5.90	0.9000	Deterministic fake score for rhetorical_manipulation.
model_c_contrast	clarity	8.20	0.9000	Deterministic fake score for clarity.
model_c_contrast	context_fidelity	8.00	0.9000	Deterministic fake score for context_fidelity.
model_c_contrast	counterarguments	7.90	0.9000	Deterministic fake score for counterarguments.
model_c_contrast	evidence	7.50	0.9000	Deterministic fake score for evidence.
model_c_contrast	factual_grounding	7.60	0.9000	Deterministic fake score for factual_grounding.
model_c_contrast	fairness	7.80	0.9000	Deterministic fake score for fairness.
model_c_contrast	logic	8.00	0.9000	Deterministic fake score for logic.
model_c_contrast	relevance	8.10	0.9000	Deterministic fake score for relevance.
model_c_contrast	rhetorical_manipulation	7.40	0.9000	Deterministic fake score for rhetorical_manipulation.
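The total scores shown in the model cards are consistent with an unweighted mean of the nine dimension scores, rounded to two decimals. Whether the product actually computes totals this way, or applies rubric weights that happen to coincide for this fixture, is not stated here, so the sketch below is a consistency check rather than a description of the implementation.

```python
# Dimension scores from the table above, in table order:
# clarity, context_fidelity, counterarguments, evidence, factual_grounding,
# fairness, logic, relevance, rhetorical_manipulation.
dimensions = {
    "model_a_budget": [7.40, 7.20, 7.10, 6.70, 6.80, 7.00, 7.20, 7.30, 6.60],
    "model_b_reasoning": [6.70, 6.50, 6.40, 6.00, 6.10, 6.30, 6.50, 6.60, 5.90],
    "model_c_contrast": [8.20, 8.00, 7.90, 7.50, 7.60, 7.80, 8.00, 8.10, 7.40],
}

# Totals as reported in the model cards.
reported_totals = {
    "model_a_budget": 7.03,
    "model_b_reasoning": 6.33,
    "model_c_contrast": 7.83,
}

# Assumption: total = unweighted mean of the dimension scores.
mean_totals = {
    model: round(sum(scores) / len(scores), 2)
    for model, scores in dimensions.items()
}

for model, value in mean_totals.items():
    status = "matches" if value == reported_totals[model] else "differs from"
    print(f"{model}: mean {value:.2f} {status} reported {reported_totals[model]:.2f}")
```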

Artifacts

Kind	Path	Size	Checksum
report_md	`reports/fake_jury_2f30e730-c722-43fd-bb58-4b123c6ee704.md`	669	`eddd5bff7a5963ef`
report_html	`reports/fake_jury_2f30e730-c722-43fd-bb58-4b123c6ee704.html`	439	`8676be3ed1c0171e`

Manifest

Software: 0.4.0
Prompt Hash: 5358e5ba89ce6b5e2b
Rubric Hash: c9144dd5d4c5fcd823
Input Hash: 1e566c63812e138299