Debate Result · real

Disputatio Fake E2E Fixture 20260601T201940Z

Deterministic fixture debate for M4 fake jury validation.

Debate ID: 03e948ef-bc38-48cc-b6d3-f2f12e2f4c40

completed

Consensus Winner: Unclear (3/3 valid judges)

Model Agreement: 100.0% (Divergence: Low)
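The page does not state how the agreement figure is computed. A minimal sketch, assuming it is the share of valid judges whose verdict matches the most common verdict, reproduces the 100.0% shown here. The function name `agreement` is hypothetical.

```python
from collections import Counter

# Verdicts copied from the Jury Verdicts table on this page.
verdicts = {
    "google/gemini-2.5-flash-lite": "Unclear",
    "meta-llama/llama-3.1-8b-instruct": "Unclear",
    "mistralai/mistral-small-3.2-24b-instruct": "Unclear",
}

def agreement(verdicts_by_model):
    """Return (most common verdict, percent of judges that chose it).

    Assumption: agreement is the share of judges matching the modal verdict;
    the real formula is not documented on this page.
    """
    counts = Counter(verdicts_by_model.values())
    winner, votes = counts.most_common(1)[0]
    return winner, 100.0 * votes / len(verdicts_by_model)

winner, pct = agreement(verdicts)
print(f"{winner}: {pct:.1f}%")
```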

Closest To Consensus: mistralai/mistral-small-3.2-24b-instruct. Closest to consensus among schema-valid judgements. Latency: 5109 ms.

Cost: $0.000235 (1730 tokens)
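The run totals appear to be plain sums of the per-judge figures in the Model Cards. The sketch below checks that arithmetic; it uses `Decimal` to avoid floating-point rounding on very small dollar amounts, and the `usage` mapping is an illustrative structure, not the application's data model.

```python
from decimal import Decimal

# (tokens, cost in USD) per judge, copied from the Model Cards on this page.
usage = {
    "google/gemini-2.5-flash-lite": (609, Decimal("0.000145")),
    "meta-llama/llama-3.1-8b-instruct": (553, Decimal("0.000018")),
    "mistralai/mistral-small-3.2-24b-instruct": (568, Decimal("0.000072")),
}

# Sum tokens and cost across all judges.
total_tokens = sum(tokens for tokens, _ in usage.values())
total_cost = sum((cost for _, cost in usage.values()), Decimal("0"))

print(f"${total_cost} ({total_tokens} tokens)")
```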

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

Consensus Score: 7.50 (median of valid judges)
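The consensus score is described as the median of the valid judges' total scores. A short sketch of that calculation follows. It also picks the judge closest to the median, assuming that "closest to consensus" means the smallest absolute distance on total score, which the page does not confirm.

```python
from statistics import median

# Total scores copied from the Jury Verdicts table on this page.
total_scores = {
    "google/gemini-2.5-flash-lite": 7.0,
    "meta-llama/llama-3.1-8b-instruct": 8.0,
    "mistralai/mistral-small-3.2-24b-instruct": 7.5,
}

# Consensus score: median of the schema-valid judges' totals.
consensus = median(total_scores.values())

# Assumption: the closest judge minimises |total score - consensus|.
closest = min(total_scores, key=lambda m: abs(total_scores[m] - consensus))

print(f"consensus={consensus:.2f} closest={closest}")
```

With the three scores above, this yields a consensus of 7.50 and selects mistralai/mistral-small-3.2-24b-instruct, matching the values reported on this page.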

Disagreement: 0.78 (lower means more aligned)

Evaluated Source

Transparent demo/source context for this run.

Source Type: fixture

Purpose: Manual debate source.

Input Scope: internal_only. Full available manual transcript.

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.

Schema reliability: 3/3

Only schema-valid judgements are used for consensus and score aggregation.

JSON reliability: 3/3

Counts valid JSON responses before stricter judgement-schema checks.
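The two reliability counts describe a two-stage check: a response must first parse as JSON, and only then is it tested against the stricter judgement schema. The sketch below illustrates that split. The `REQUIRED` field list is hypothetical, inferred from the keys visible in the response previews on this page; the actual schema is not shown.

```python
import json

# Hypothetical required fields and types, inferred from the response previews.
REQUIRED = {
    "summary": str,
    "total_score": (int, float),
    "confidence": (int, float),
    "dimensions": dict,
}

def check_response(raw):
    """Return (json_ok, schema_ok) for one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Not parseable at all: fails both stages.
        return False, False
    if not isinstance(data, dict):
        # Valid JSON, but not an object, so it cannot match the schema.
        return True, False
    schema_ok = all(isinstance(data.get(key), kind) for key, kind in REQUIRED.items())
    return True, schema_ok

responses = [
    '{"summary": "ok", "total_score": 7.5, "confidence": 0.8, "dimensions": {}}',
    '{"summary": "missing the other fields"}',
    '{"summary": "truncated',
]
results = [check_response(r) for r in responses]
json_valid = sum(1 for json_ok, _ in results if json_ok)
schema_valid = sum(1 for _, schema_ok in results if schema_ok)
print(f"JSON {json_valid}/{len(results)}, schema {schema_valid}/{len(results)}")
```

Only responses that pass the second stage would feed consensus and score aggregation, which is why the schema count can never exceed the JSON count.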

Provider disagreement: Low

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread: Low

Score spread 0.00 across valid model judgements.

Rhetorical manipulation spread: Medium

Score spread 2.00 across valid model judgements.
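The two spread values are consistent with a simple range (maximum minus minimum) over the valid judges' scores for each dimension. The sketch below checks this against the Score Dimensions table; treating spread as a range is an inference from the numbers, and the thresholds behind the Low and Medium labels are not stated on this page, so they are left out.

```python
# Per-dimension scores copied from the Score Dimensions table, in the order
# gemini-2.5-flash-lite, llama-3.1-8b-instruct, mistral-small-3.2-24b-instruct.
dimension_scores = {
    "evidence": [6.0, 6.0, 6.0],
    "rhetorical_manipulation": [2.0, 4.0, 2.0],
}

def spread(scores):
    """Range of scores across judges (assumed definition of 'spread')."""
    return max(scores) - min(scores)

spreads = {name: spread(scores) for name, scores in dimension_scores.items()}
for name, value in spreads.items():
    print(f"{name}: {value:.2f}")
```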

Position / identity bias: not tested

This MVP run uses the original speaker order and visible Speaker A/B labels, so position and identity bias were not measured. Testing them requires anonymized, order-swapped research runs later.

Jury Verdicts

Clear product view first; raw diagnostics remain below.

LLM Judge	Status	Verdict	Score	Confidence	Why
google/gemini-2.5-flash-lite	completed	Unclear	7.00	0.9000	Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.
meta-llama/llama-3.1-8b-instruct	completed	Unclear	8.00	0.9000	Speaker A argues for speed, citing cost and trust implications, while Speaker B counters with long-term risk and cost of failure.
mistralai/mistral-small-3.2-24b-instruct	completed	Unclear	7.50	0.8000	Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.

Model Cards

Operational details for trust and debugging.

google/gemini-2.5-flash-lite

completed

7.00 total score

Winner: Unclear
JSON: True
Schema: True
Latency: 1502 ms
Tokens: 609
Cost: $0.000145
Provider: Google
Finish: stop

Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.

Response preview

{
  "summary": "Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.",
  "total_score": 7,
  "confidence": 0.9,
  "dimensions": {
    "logic": 8,
    "evidence": 6,
    "counterarguments": 7,
    "clarity": 8,
    "relevance": 9,
    "fairness": 8,

meta-llama/llama-3.1-8b-instruct

completed

8.00 total score

Winner: Unclear
JSON: True
Schema: True
Latency: 9424 ms
Tokens: 553
Cost: $0.000018
Provider: DeepInfra
Finish: stop

Speaker A argues for speed, citing cost and trust implications, while Speaker B counters with long-term risk and cost of failure.

Response preview

{
  "summary": "Speaker A argues for speed, citing cost and trust implications, while Speaker B counters with long-term risk and cost of failure.",
  "total_score": 8,
  "confidence": 0.9,
  "dimensions": {
    "logic": 8,
    "evidence": 6,
    "counterarguments": 7,
    "clarity": 9,
    "relevance": 8,
    "fairness": 8,
    "factual_grounding": 7,
    "rhetorical_manipulation": 4,
    "context_fidelity": 8
  },
  "reasons": {
    "logic": "Both speakers provide coherent arguments, but Speake

mistralai/mistral-small-3.2-24b-instruct

completed

7.50 total score

Winner: Unclear
JSON: True
Schema: True
Latency: 5109 ms
Tokens: 568
Cost: $0.000072
Provider: DeepInfra
Finish: stop

Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.

Response preview

{
  "summary": "Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.",
  "total_score": 7.5,
  "confidence": 0.8,
  "dimensions": {
    "logic": 8,
    "evidence": 6,
    "counterarguments": 7,
    "clarity": 8,
    "relevance": 9,
    "fairness": 8,
    "factual_grounding": 7,
    "rhetorical_manipulation": 2,
    "context_fidelity": 8
  },

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model	Dimension	Score	Confidence	Reason
google/gemini-2.5-flash-lite	clarity	8.00	0.9000	Score for clarity.
google/gemini-2.5-flash-lite	context_fidelity	9.00	0.9000	Score for context_fidelity.
google/gemini-2.5-flash-lite	counterarguments	7.00	0.9000	Score for counterarguments.
google/gemini-2.5-flash-lite	evidence	6.00	0.9000	Score for evidence.
google/gemini-2.5-flash-lite	factual_grounding	5.00	0.9000	Score for factual_grounding.
google/gemini-2.5-flash-lite	fairness	8.00	0.9000	Score for fairness.
google/gemini-2.5-flash-lite	logic	8.00	0.9000	The arguments present a clear dialectic on speed vs. thoroughness, with Speaker A's refinement adding nuance.
google/gemini-2.5-flash-lite	relevance	9.00	0.9000	All segments directly address the core tension between speed and review in decision-making.
google/gemini-2.5-flash-lite	rhetorical_manipulation	2.00	0.9000	Minimal rhetorical devices used; arguments are direct and focused on the issue at hand.
meta-llama/llama-3.1-8b-instruct	clarity	9.00	0.9000	Score for clarity.
meta-llama/llama-3.1-8b-instruct	context_fidelity	8.00	0.9000	Score for context_fidelity.
meta-llama/llama-3.1-8b-instruct	counterarguments	7.00	0.9000	Speaker B effectively counters Speaker A's point, but could have been more explicit.
meta-llama/llama-3.1-8b-instruct	evidence	6.00	0.9000	Both speakers rely on general principles, but lack specific data.
meta-llama/llama-3.1-8b-instruct	factual_grounding	7.00	0.9000	Score for factual_grounding.
meta-llama/llama-3.1-8b-instruct	fairness	8.00	0.9000	Score for fairness.
meta-llama/llama-3.1-8b-instruct	logic	8.00	0.9000	Both speakers provide coherent arguments, but Speaker B's counterargument is more nuanced.
meta-llama/llama-3.1-8b-instruct	relevance	8.00	0.9000	Score for relevance.
meta-llama/llama-3.1-8b-instruct	rhetorical_manipulation	4.00	0.9000	Neither speaker resorts to fallacious reasoning, but Speaker A's tone is slightly more persuasive.
mistralai/mistral-small-3.2-24b-instruct	clarity	8.00	0.8000	Score for clarity.
mistralai/mistral-small-3.2-24b-instruct	context_fidelity	8.00	0.8000	Score for context_fidelity.
mistralai/mistral-small-3.2-24b-instruct	counterarguments	7.00	0.8000	Speaker B effectively counters Speaker A's initial argument, and Speaker A responds with a reasonable clarification.
mistralai/mistral-small-3.2-24b-instruct	evidence	6.00	0.8000	Both speakers present reasonable points but lack specific evidence to support their claims.
mistralai/mistral-small-3.2-24b-instruct	factual_grounding	7.00	0.8000	Score for factual_grounding.
mistralai/mistral-small-3.2-24b-instruct	fairness	8.00	0.8000	Score for fairness.
mistralai/mistral-small-3.2-24b-instruct	logic	8.00	0.8000	Score for logic.
mistralai/mistral-small-3.2-24b-instruct	relevance	9.00	0.8000	Score for relevance.
mistralai/mistral-small-3.2-24b-instruct	rhetorical_manipulation	2.00	0.8000	No instances of rhetorical manipulation detected.

Artifacts

Kind	Path	Size	Checksum
model_raw_response	`jury/raw_openrouter_03e948ef-bc38-48cc-b6d3-f2f12e2f4c40_mistralai_mistral-small-3.2-24b-instruct.json`	2275	`3467ca12f9321ae7`
model_raw_response	`jury/raw_openrouter_03e948ef-bc38-48cc-b6d3-f2f12e2f4c40_google_gemini-2.5-flash-lite.json`	2372	`63cefce9fbc4e013`
model_raw_response	`jury/raw_openrouter_03e948ef-bc38-48cc-b6d3-f2f12e2f4c40_meta-llama_llama-3.1-8b-instruct.json`	2283	`32244581419876a6`

Manifest

Software: 0.5.6
Prompt Hash: a4819bddab63bfc1a6
Rubric Hash: c9144dd5d4c5fcd823
Input Hash: b0f68a4170b289ea2d