Debate Result · fake

Disputatio Fake E2E Fixture 20260601T201438Z

Deterministic fixture debate for M4 fake jury validation.

Debate: 2f30e730-c722-43fd-bb58-4b123c6ee704

completed
Consensus Winner: Unclear (3/3 valid judges)
Model Agreement: 100.0% (Divergence: Medium)
Closest To Consensus: model_a_budget. Closest to consensus among schema-valid judgements. Latency: 210 ms.
Cost: $0.000000 (3805 tokens)

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

Consensus Score: 7.03 (median of valid judges)
Disagreement: 1.50 (lower means more aligned)
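
The headline numbers can be reproduced from the three total scores listed in this report. The sketch below takes the median as the consensus score (as the page states) and assumes that Disagreement is the range (max minus min) and that "closest to consensus" means the smallest absolute distance to the median. Both assumptions match the values shown here, but the page does not confirm them.

```python
from statistics import median

# Total scores of the three schema-valid judges, copied from this report.
scores = {
    "model_a_budget": 7.03,
    "model_b_reasoning": 6.33,
    "model_c_contrast": 7.83,
}

# Consensus score: median of the valid judges (stated on the page).
consensus_score = median(scores.values())

# Assumption: disagreement is the max-min range of the total scores.
disagreement = round(max(scores.values()) - min(scores.values()), 2)

# Assumption: the closest judge minimizes |score - consensus score|.
closest = min(scores, key=lambda name: abs(scores[name] - consensus_score))

print(consensus_score, disagreement, closest)
```

With an odd number of judges the median is always one judge's own score, which is why model_a_budget sits exactly on the consensus value in this run.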

Evaluated Source

Transparent demo/source context for this run.
Source Type: fixture
Purpose: Manual debate source.
Input Scope: internal_only (Full available manual transcript.)

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.
Schema reliability 3/3

Only schema-valid judgements are used for consensus and score aggregation.

JSON reliability 3/3

Counts valid JSON responses before stricter judgement-schema checks.
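
The two reliability counters describe a two-stage check: a response must first parse as JSON, and only then is it tested against the stricter judgement schema. A minimal sketch of that staging follows. The field names in `REQUIRED_FIELDS` and the function `check_response` are hypothetical, chosen for illustration only; the real judgement schema is not shown on this page.

```python
import json

# Hypothetical minimal schema: field name -> accepted Python types.
REQUIRED_FIELDS = {
    "winner": (str,),
    "score": (int, float),
    "confidence": (int, float),
}


def check_response(raw):
    """Return (json_valid, schema_valid) for one raw judge response."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, False  # fails stage 1, so stage 2 is never reached
    if not isinstance(payload, dict):
        return True, False  # valid JSON, but not a judgement object
    schema_ok = all(
        isinstance(payload.get(field), types)
        and not isinstance(payload.get(field), bool)  # bool is a subclass of int
        for field, types in REQUIRED_FIELDS.items()
    )
    return True, schema_ok


responses = [
    '{"winner": "Unclear", "score": 7.03, "confidence": 0.91}',
    '{"winner": "Unclear"}',
    "not json at all",
]
results = [check_response(r) for r in responses]
json_valid = sum(1 for ok, _ in results if ok)
schema_valid = sum(1 for _, ok in results if ok)
print(f"JSON reliability {json_valid}/{len(results)}")
print(f"Schema reliability {schema_valid}/{len(results)}")
```

Only the responses counted under schema reliability would feed consensus and score aggregation; the others remain visible for debugging.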

Provider disagreement Medium

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread Medium

Score spread 1.50 across valid model judgements.

Rhetorical manipulation spread Medium

Score spread 1.50 across valid model judgements.
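
The two spread figures above agree with a per-dimension max-minus-min range over the dimension scores listed further down this report. The sketch below assumes that definition; the helper name `spread_by_dimension` is hypothetical, and the thresholds behind the "Medium" label are not shown on the page, so they are not modelled.

```python
from collections import defaultdict

# (model, dimension, score) rows copied from the Score Dimensions table.
rows = [
    ("model_a_budget", "evidence", 6.70),
    ("model_b_reasoning", "evidence", 6.00),
    ("model_c_contrast", "evidence", 7.50),
    ("model_a_budget", "rhetorical_manipulation", 6.60),
    ("model_b_reasoning", "rhetorical_manipulation", 5.90),
    ("model_c_contrast", "rhetorical_manipulation", 7.40),
]


def spread_by_dimension(rows):
    """Group scores by dimension and return the max-min range of each group."""
    by_dimension = defaultdict(list)
    for _model, dimension, score in rows:
        by_dimension[dimension].append(score)
    return {
        dimension: round(max(values) - min(values), 2)
        for dimension, values in by_dimension.items()
    }


spreads = spread_by_dimension(rows)
print(spreads)
```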

Position / identity bias not tested

MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
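
One way such a research run could be set up is sketched below: the same two sides are shown twice under neutral labels, with their positions exchanged in the second run. A judge that picks the same slot both times has followed position rather than content. The function names and the "Speaker 1/2" labels are hypothetical and not part of the product.

```python
def order_swapped_variants(first_text, second_text):
    """Return two anonymized presentations of one debate with positions exchanged."""
    original = [("Speaker 1", first_text), ("Speaker 2", second_text)]
    swapped = [("Speaker 1", second_text), ("Speaker 2", first_text)]
    return original, swapped


def follows_position(verdict_original, verdict_swapped):
    """True when the judge chose the same slot in both runs.

    Because the content behind each slot is exchanged between runs, picking
    the same slot twice means the verdict tracked position, not content.
    """
    slots = {"Speaker 1", "Speaker 2"}
    return verdict_original in slots and verdict_original == verdict_swapped


original, swapped = order_swapped_variants("Side A's case.", "Side B's case.")
print(original[0], swapped[0])
print(follows_position("Speaker 1", "Speaker 1"))  # same slot, different content
print(follows_position("Speaker 1", "Speaker 2"))  # same content both times
```

A verdict of "Unclear" in either run is treated as no evidence of position-following in this sketch.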

Jury Verdicts

Clear product view first; raw diagnostics remain below.
LLM Judge                      Verdict  Score  Confidence  Why
model_a_budget (completed)     Unclear  7.03   0.9100      Fake Model A produced a deterministic fake judgement for deployment validation.
model_b_reasoning (completed)  Unclear  6.33   0.9100      Fake Model B produced a deterministic fake judgement for deployment validation.
model_c_contrast (completed)   Unclear  7.83   0.9100      Fake Model C produced a deterministic fake judgement for deployment validation.

Model Cards

Operational details for trust and debugging.

model_a_budget

completed
Total score: 7.03
Winner: Unclear
JSON: True
Schema: True
Latency: 210 ms
Tokens: 1240
Cost: $0
Provider: fake
Finish: n/a

Fake Model A produced a deterministic fake judgement for deployment validation.

model_b_reasoning

completed
Total score: 6.33
Winner: Unclear
JSON: True
Schema: True
Latency: 310 ms
Tokens: 1310
Cost: $0
Provider: fake
Finish: n/a

Fake Model B produced a deterministic fake judgement for deployment validation.

model_c_contrast

completed
Total score: 7.83
Winner: Unclear
JSON: True
Schema: True
Latency: 260 ms
Tokens: 1255
Cost: $0
Provider: fake
Finish: n/a

Fake Model C produced a deterministic fake judgement for deployment validation.
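
The per-model operational figures add up to the run-level summary at the top of this report: the three token counts sum to the 3805 tokens shown next to Cost. A short sketch of that roll-up follows; the dictionary layout and key names are assumptions made for illustration.

```python
# Operational figures copied from the three model cards.
cards = {
    "model_a_budget": {"latency_ms": 210, "tokens": 1240, "cost_usd": 0.0},
    "model_b_reasoning": {"latency_ms": 310, "tokens": 1310, "cost_usd": 0.0},
    "model_c_contrast": {"latency_ms": 260, "tokens": 1255, "cost_usd": 0.0},
}

total_tokens = sum(card["tokens"] for card in cards.values())
total_cost = sum(card["cost_usd"] for card in cards.values())
slowest = max(cards, key=lambda name: cards[name]["latency_ms"])

print(f"{total_tokens} tokens, ${total_cost:.6f}, slowest judge: {slowest}")
```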

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model              Dimension                Score  Confidence  Reason
model_a_budget     clarity                  7.40   0.9000      Deterministic fake score for clarity.
model_a_budget     context_fidelity         7.20   0.9000      Deterministic fake score for context_fidelity.
model_a_budget     counterarguments         7.10   0.9000      Deterministic fake score for counterarguments.
model_a_budget     evidence                 6.70   0.9000      Deterministic fake score for evidence.
model_a_budget     factual_grounding        6.80   0.9000      Deterministic fake score for factual_grounding.
model_a_budget     fairness                 7.00   0.9000      Deterministic fake score for fairness.
model_a_budget     logic                    7.20   0.9000      Deterministic fake score for logic.
model_a_budget     relevance                7.30   0.9000      Deterministic fake score for relevance.
model_a_budget     rhetorical_manipulation  6.60   0.9000      Deterministic fake score for rhetorical_manipulation.
model_b_reasoning  clarity                  6.70   0.9000      Deterministic fake score for clarity.
model_b_reasoning  context_fidelity         6.50   0.9000      Deterministic fake score for context_fidelity.
model_b_reasoning  counterarguments         6.40   0.9000      Deterministic fake score for counterarguments.
model_b_reasoning  evidence                 6.00   0.9000      Deterministic fake score for evidence.
model_b_reasoning  factual_grounding        6.10   0.9000      Deterministic fake score for factual_grounding.
model_b_reasoning  fairness                 6.30   0.9000      Deterministic fake score for fairness.
model_b_reasoning  logic                    6.50   0.9000      Deterministic fake score for logic.
model_b_reasoning  relevance                6.60   0.9000      Deterministic fake score for relevance.
model_b_reasoning  rhetorical_manipulation  5.90   0.9000      Deterministic fake score for rhetorical_manipulation.
model_c_contrast   clarity                  8.20   0.9000      Deterministic fake score for clarity.
model_c_contrast   context_fidelity         8.00   0.9000      Deterministic fake score for context_fidelity.
model_c_contrast   counterarguments         7.90   0.9000      Deterministic fake score for counterarguments.
model_c_contrast   evidence                 7.50   0.9000      Deterministic fake score for evidence.
model_c_contrast   factual_grounding        7.60   0.9000      Deterministic fake score for factual_grounding.
model_c_contrast   fairness                 7.80   0.9000      Deterministic fake score for fairness.
model_c_contrast   logic                    8.00   0.9000      Deterministic fake score for logic.
model_c_contrast   relevance                8.10   0.9000      Deterministic fake score for relevance.
model_c_contrast   rhetorical_manipulation  7.40   0.9000      Deterministic fake score for rhetorical_manipulation.
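
Each model's total score in the model cards matches the unweighted mean of its nine dimension scores, rounded to two decimals. The check below reproduces that; whether the application actually aggregates this way, or applies rubric weights that happen to be equal, is an assumption the page does not confirm.

```python
from statistics import fmean

# Dimension scores per model, in the order listed in the table:
# clarity, context_fidelity, counterarguments, evidence, factual_grounding,
# fairness, logic, relevance, rhetorical_manipulation.
dimension_scores = {
    "model_a_budget": [7.40, 7.20, 7.10, 6.70, 6.80, 7.00, 7.20, 7.30, 6.60],
    "model_b_reasoning": [6.70, 6.50, 6.40, 6.00, 6.10, 6.30, 6.50, 6.60, 5.90],
    "model_c_contrast": [8.20, 8.00, 7.90, 7.50, 7.60, 7.80, 8.00, 8.10, 7.40],
}

# Assumption: total score = unweighted mean of the nine dimensions.
totals = {
    model: round(fmean(values), 2)
    for model, values in dimension_scores.items()
}

print(totals)
```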

Artifacts

Kind         Path                                                          Size + Checksum
report_md    reports/fake_jury_2f30e730-c722-43fd-bb58-4b123c6ee704.md    669eddd5bff7a5963ef
report_html  reports/fake_jury_2f30e730-c722-43fd-bb58-4b123c6ee704.html  4398676be3ed1c0171e

Manifest

Software: 0.4.0
Prompt Hash: 5358e5ba89ce6b5e2b
Rubric Hash: c9144dd5d4c5fcd823
Input Hash: 1e566c63812e138299