Debate Result · real

Disputatio Fake E2E Fixture 20260601T201940Z

Deterministic fixture debate for M4 fake jury validation.

Debate ID: 5a39edb6-e53b-4a64-be0b-55b00dad12ed

partially_completed
Consensus Winner: Unclear (2/3 valid judges)
Model Agreement: 100.0% (Divergence: Low)
Closest To Consensus: google/gemini-2.5-flash-lite (closest among schema-valid judgements; latency 1304 ms)
Cost: $0.000268 (1309 tokens)

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

Consensus Score: 7.25 (median of valid judges)
Disagreement: 0.44 (lower means more aligned)
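
The consensus score is described as the median of the valid judges' total scores. A minimal sketch of that aggregation follows, assuming each judgement is a dict with the hypothetical keys `schema_valid` and `total_score` (these names are illustrative, not taken from the application). The formula behind the disagreement value is not stated on this page, so it is not reproduced here.

```python
from statistics import median

def consensus_score(judgements):
    """Median of total scores over schema-valid judgements, or None if there are none."""
    valid_scores = [j["total_score"] for j in judgements if j["schema_valid"]]
    return median(valid_scores) if valid_scores else None

# Values copied from this run; the dict layout itself is an assumption.
judgements = [
    {"model": "google/gemini-2.5-flash-lite", "schema_valid": True, "total_score": 7.0},
    {"model": "mistralai/mistral-small-3.2-24b-instruct", "schema_valid": True, "total_score": 7.5},
    {"model": "qwen/qwen3-next-80b-a3b-instruct:free", "schema_valid": False, "total_score": None},
]

print(consensus_score(judgements))  # 7.25
```

With only two valid judges the median is the midpoint of their scores, which matches the 7.25 reported here.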

Evaluated Source

Transparent demo/source context for this run.
Source Type: fixture
Purpose: Manual debate source.
Input Scope: internal_only (full available manual transcript)

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.
Schema reliability 2/3

Only schema-valid judgements are used for consensus and score aggregation.

JSON reliability 2/3

Counts valid JSON responses before stricter judgement-schema checks.
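
The two reliability counters describe a two-stage check: first whether the response parses as JSON at all, then whether it satisfies the stricter judgement schema. The sketch below illustrates that ordering under stated assumptions: the required keys are inferred from the response previews on this page, and the function name `check_response` is hypothetical. The application's real schema may require more fields.

```python
import json

# Assumed from the response previews on this page; the real schema may be stricter.
REQUIRED_KEYS = {"summary", "total_score", "confidence", "dimensions"}

def check_response(raw_text):
    """Return (json_ok, schema_ok) for one raw model response."""
    try:
        payload = json.loads(raw_text)
    except (TypeError, ValueError):
        # Not parseable as JSON, so it cannot pass the schema check either.
        return False, False
    if not isinstance(payload, dict):
        return True, False
    score = payload.get("total_score")
    schema_ok = (
        REQUIRED_KEYS <= payload.keys()
        and isinstance(score, (int, float))
        and not isinstance(score, bool)
        and isinstance(payload["dimensions"], dict)
    )
    return True, schema_ok
```

A response that is valid JSON but lacks required fields counts toward JSON reliability without counting toward schema reliability, which is why the two counters are reported separately.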

Provider disagreement Low

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread Low

Score spread 0.00 across valid model judgements.

Rhetorical manipulation spread Low

Score spread 1.00 across valid model judgements.
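
The page does not define "score spread". One reading consistent with both reported values is the range (maximum minus minimum) of a dimension's score across valid judgements. The sketch below assumes that definition; `score_spread` is a hypothetical helper.

```python
def score_spread(scores):
    """Range of a dimension's scores across valid judgements (assumed definition)."""
    return max(scores) - min(scores) if scores else 0.0

# Dimension scores from the two schema-valid judges in this run.
evidence = [6.0, 6.0]
rhetorical_manipulation = [2.0, 1.0]

print(f"{score_spread(evidence):.2f}")                 # 0.00
print(f"{score_spread(rhetorical_manipulation):.2f}")  # 1.00
```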

Position / identity bias not tested

MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.

Jury Verdicts

Clear product view first; raw diagnostics remain below.
LLM Judge | Status | Verdict | Score | Confidence | Why
google/gemini-2.5-flash-lite | completed | Unclear | 7.00 | 0.9000 | Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.
mistralai/mistral-small-3.2-24b-instruct | completed | Unclear | 7.50 | 0.8000 | Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.
qwen/qwen3-next-80b-a3b-instruct:free | rate_limited | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON.

Model Cards

Operational details for trust and debugging.

google/gemini-2.5-flash-lite

completed
Total score: 7.00
Winner: Unclear
JSON: True
Schema: True
Latency: 1304 ms
Tokens: 609
Cost: $0.000145
Provider: Google
Finish: stop

Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.


Response preview
{
  "summary": "Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.",
  "total_score": 7,
  "confidence": 0.9,
  "dimensions": {
    "logic": 8,
    "evidence": 6,
    "counterarguments": 7,
    "clarity": 8,
    "relevance": 9,
    "fairness": 8,

mistralai/mistral-small-3.2-24b-instruct

completed
Total score: 7.50
Winner: Unclear
JSON: True
Schema: True
Latency: 27084 ms
Tokens: 700
Cost: $0.000123
Provider: Venice
Finish: stop

Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.


Response preview
{
  "summary": "Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.",
  "total_score": 7.5,
  "confidence": 0.8,
  "dimensions": {
    "logic": 8,
    "evidence": 6,
    "counterarguments": 7,
    "clarity": 8,
    "relevance": 9,
    "fairness": 8,
    "factual_grounding": 7,
    "rhetorical_manipulation": 1,
    "context_fidelity": 8
  },
 

qwen/qwen3-next-80b-a3b-instruct:free

rate_limited
Total score: n/a
Winner: n/a
JSON: False
Schema: False
Latency: n/a
Tokens: 0
Cost: $0
Provider: Venice
Finish: 429

Model response could not be parsed as valid judgement JSON.

Error: rate_limited · Retry-After: 30s


Error diagnostic
OpenRouter HTTP 429: Provider returned error
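
The diagnostic above shows an HTTP 429 with a Retry-After of 30 seconds. A minimal sketch of honoring that hint before retrying follows. It is an illustration, not the application's actual retry logic: `send` is a hypothetical transport callable returning `(status_code, headers, body)`, and only the seconds form of Retry-After is handled (an HTTP-date value falls back to the default wait).

```python
import time

def parse_retry_after(value, default=30.0):
    """Parse a Retry-After value given in seconds, e.g. '30' or '30s'."""
    try:
        return max(0.0, float(str(value).strip().rstrip("s")))
    except ValueError:
        # Missing or non-numeric value (e.g. an HTTP-date): use the default wait.
        return default

def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-429 status or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts:
            sleep(parse_retry_after(headers.get("Retry-After")))
    # Still rate limited after the last attempt; report the final response.
    return status, body
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. A judge that is still rate limited after the final attempt would remain excluded from consensus, as in this run.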

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model | Dimension | Score | Confidence | Reason
google/gemini-2.5-flash-lite | clarity | 8.00 | 0.9000 | Score for clarity.
google/gemini-2.5-flash-lite | context_fidelity | 9.00 | 0.9000 | Score for context_fidelity.
google/gemini-2.5-flash-lite | counterarguments | 7.00 | 0.9000 | Score for counterarguments.
google/gemini-2.5-flash-lite | evidence | 6.00 | 0.9000 | Score for evidence.
google/gemini-2.5-flash-lite | factual_grounding | 5.00 | 0.9000 | Score for factual_grounding.
google/gemini-2.5-flash-lite | fairness | 8.00 | 0.9000 | Score for fairness.
google/gemini-2.5-flash-lite | logic | 8.00 | 0.9000 | The arguments present a clear dialectic on speed vs. thoroughness, with Speaker A's refinement adding nuance.
google/gemini-2.5-flash-lite | relevance | 9.00 | 0.9000 | All segments directly address the core tension between speed and review in decision-making.
google/gemini-2.5-flash-lite | rhetorical_manipulation | 2.00 | 0.9000 | Minimal rhetorical devices used; arguments are direct and focused on the issue at hand.
mistralai/mistral-small-3.2-24b-instruct | clarity | 8.00 | 0.8000 | Both speakers express their points clearly and concisely.
mistralai/mistral-small-3.2-24b-instruct | context_fidelity | 8.00 | 0.8000 | The arguments stay within the context of the debate topic and do not introduce irrelevant points.
mistralai/mistral-small-3.2-24b-instruct | counterarguments | 7.00 | 0.8000 | Speaker B effectively counters Speaker A's initial argument, and Speaker A responds with a relevant clarification.
mistralai/mistral-small-3.2-24b-instruct | evidence | 6.00 | 0.8000 | While both speakers present reasonable points, they do not provide specific evidence to support their claims.
mistralai/mistral-small-3.2-24b-instruct | factual_grounding | 7.00 | 0.8000 | The arguments are generally well-grounded but lack specific factual support.
mistralai/mistral-small-3.2-24b-instruct | fairness | 8.00 | 0.8000 | Both speakers acknowledge the other's point and engage with it constructively.
mistralai/mistral-small-3.2-24b-instruct | logic | 8.00 | 0.8000 | Both speakers present logical arguments with clear premises and conclusions.
mistralai/mistral-small-3.2-24b-instruct | relevance | 9.00 | 0.8000 | All points made are directly relevant to the topic of speed versus thoroughness in decision-making.
mistralai/mistral-small-3.2-24b-instruct | rhetorical_manipulation | 1.00 | 0.8000 | No signs of rhetorical manipulation are present in the debate.

Artifacts

Kind | Path | Size | Checksum
model_raw_response | jury/raw_openrouter_5a39edb6-e53b-4a64-be0b-55b00dad12ed_mistralai_mistral-small-3.2-24b-instruct.json | | 2958f98c279069a43a30
model_raw_response | jury/raw_openrouter_5a39edb6-e53b-4a64-be0b-55b00dad12ed_google_gemini-2.5-flash-lite.json | | 23725d13bfe6b1828dd9
model_raw_response | jury/raw_openrouter_5a39edb6-e53b-4a64-be0b-55b00dad12ed_qwen_qwen3-next-80b-a3b-instruct_free.json | | 833a006cab8297c04d3

Manifest

Software
0.5.5
Prompt Hash
a4819bddab63bfc1a6
Rubric Hash
c9144dd5d4c5fcd823
Input Hash
b0f68a4170b289ea2d