Only schema-valid judgements are used for consensus and score aggregation.
Debate Result · real
Disputatio Fake E2E Fixture 20260601T201940Z
Deterministic fixture debate for M4 fake jury validation.
Back to debate · 5a39edb6-e53b-4a64-be0b-55b00dad12ed
Why this result?
AI-assisted comparative judgement, not objective truth. Run a real jury to generate a winner rationale.
Evaluated Source
Transparent demo/source context for this run. LLM judgements are comparative signals, not objective truth.
Bias & Reliability Signals
Exploratory MVP signals, not causal bias proof. Counts valid JSON responses before stricter judgement-schema checks.
Derived from score divergence. This is not a causal provider-bias claim.
Score spread 0.00 across valid model judgements.
Score spread 1.00 across valid model judgements.
MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
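The note above defers anonymized, order-swapped runs to later research. One way such a run could be prepared is sketched below: neutral labels replace the visible Speaker A/B names, and a second variant shows the same turns with the two labels exchanged, so a judge's preference for a label can be separated from its preference for an argument. The `Turn` type, the `Participant 1`/`Participant 2` labels, and both function names are hypothetical. The sketch swaps labels only and does not reorder turns.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Turn:
    speaker: str  # original label, e.g. "A" or "B"
    text: str


def relabel(turns, mapping):
    """Return copies of the turns with speaker labels replaced via mapping."""
    return [replace(turn, speaker=mapping[turn.speaker]) for turn in turns]


def label_swapped_variants(turns):
    """Build two anonymized transcripts whose neutral labels are exchanged."""
    speakers = sorted({turn.speaker for turn in turns})
    if len(speakers) != 2:
        raise ValueError("expected exactly two speakers")
    first, second = speakers
    original = relabel(turns, {first: "Participant 1", second: "Participant 2"})
    swapped = relabel(turns, {first: "Participant 2", second: "Participant 1"})
    return original, swapped
```

Both variants would go to the same judge model. A verdict that follows the label rather than the underlying speaker would then be a label or position effect worth flagging.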
Jury Verdicts
Clear product view first; raw diagnostics remain below.

| LLM Judge | Verdict | Score | Confidence | Why |
|---|---|---|---|---|
| google/gemini-2.5-flash-lite completed | Unclear | 7.00 | 0.9000 | Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process. |
| mistralai/mistral-small-3.2-24b-instruct completed | Unclear | 7.50 | 0.8000 | Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk. |
| qwen/qwen3-next-80b-a3b-instruct:free rate_limited | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON. |
Model Cards
Operational details for trust and debugging.
google/gemini-2.5-flash-lite
completed
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 1304 ms
- Tokens: 609
- Cost: $0.000145
- Provider:
- Finish: stop
Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.
Response preview
{
"summary": "Speaker A argues for speed in decision-making when evidence is sufficient, citing cost and trust. Speaker B counters that rushing with weak evidence leads to greater long-term risk and cost. Speaker A then refines their point, advocating for review proportional to risk rather than an unlimited process.",
"total_score": 7,
"confidence": 0.9,
"dimensions": {
"logic": 8,
"evidence": 6,
"counterarguments": 7,
"clarity": 8,
"relevance": 9,
"fairness": 8,
mistralai/mistral-small-3.2-24b-instruct
completed
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 27084 ms
- Tokens: 700
- Cost: $0.000123
- Provider: Venice
- Finish: stop
Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.
Response preview
{
"summary": "Speaker A argues for faster decision-making when evidence is sufficient, while Speaker B cautions against weak review. Speaker A then clarifies their stance, advocating for proportional review based on risk.",
"total_score": 7.5,
"confidence": 0.8,
"dimensions": {
"logic": 8,
"evidence": 6,
"counterarguments": 7,
"clarity": 8,
"relevance": 9,
"fairness": 8,
"factual_grounding": 7,
"rhetorical_manipulation": 1,
"context_fidelity": 8
},
qwen/qwen3-next-80b-a3b-instruct:free
rate_limited
- Winner: n/a
- JSON: False
- Schema: False
- Latency: n/a
- Tokens: 0
- Cost: $0
- Provider: Venice
- Finish: 429
Model response could not be parsed as valid judgement JSON.
Error: rate_limited · Retry-After: 30s
Error diagnostic
OpenRouter HTTP 429: Provider returned error
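The rate-limited card above reports HTTP 429 together with `Retry-After: 30s`. A minimal sketch of turning that header into a wait time follows. In HTTP the header value is either a number of seconds or an HTTP date. The trailing `s` shown on this page is assumed to be display formatting, and the function name `retry_after_seconds` and its 30-second fallback are assumptions, not part of the application.

```python
from __future__ import annotations

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(
    value: str | None,
    default: float = 30.0,
    now: datetime | None = None,
) -> float:
    """Convert a Retry-After value into a non-negative delay in seconds."""
    if not value:
        return default
    text = value.strip()
    # Tolerate a display suffix such as "30s" (assumed formatting, not standard).
    candidate = text[:-1] if text.endswith("s") else text
    if candidate.isdigit():
        return float(candidate)
    try:
        when = parsedate_to_datetime(text)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (when - current).total_seconds())
```

A caller could sleep for the returned delay before retrying the judge request, ideally with a cap on the number of attempts so a persistently rate-limited free-tier model does not stall the whole jury run.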
Score Dimensions
Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.
| Model | Dimension | Score | Confidence | Reason |
|---|---|---|---|---|
| google/gemini-2.5-flash-lite | clarity | 8.00 | 0.9000 | Score for clarity. |
| google/gemini-2.5-flash-lite | context_fidelity | 9.00 | 0.9000 | Score for context_fidelity. |
| google/gemini-2.5-flash-lite | counterarguments | 7.00 | 0.9000 | Score for counterarguments. |
| google/gemini-2.5-flash-lite | evidence | 6.00 | 0.9000 | Score for evidence. |
| google/gemini-2.5-flash-lite | factual_grounding | 5.00 | 0.9000 | Score for factual_grounding. |
| google/gemini-2.5-flash-lite | fairness | 8.00 | 0.9000 | Score for fairness. |
| google/gemini-2.5-flash-lite | logic | 8.00 | 0.9000 | The arguments present a clear dialectic on speed vs. thoroughness, with Speaker A's refinement adding nuance. |
| google/gemini-2.5-flash-lite | relevance | 9.00 | 0.9000 | All segments directly address the core tension between speed and review in decision-making. |
| google/gemini-2.5-flash-lite | rhetorical_manipulation | 2.00 | 0.9000 | Minimal rhetorical devices used; arguments are direct and focused on the issue at hand. |
| mistralai/mistral-small-3.2-24b-instruct | clarity | 8.00 | 0.8000 | Both speakers express their points clearly and concisely. |
| mistralai/mistral-small-3.2-24b-instruct | context_fidelity | 8.00 | 0.8000 | The arguments stay within the context of the debate topic and do not introduce irrelevant points. |
| mistralai/mistral-small-3.2-24b-instruct | counterarguments | 7.00 | 0.8000 | Speaker B effectively counters Speaker A's initial argument, and Speaker A responds with a relevant clarification. |
| mistralai/mistral-small-3.2-24b-instruct | evidence | 6.00 | 0.8000 | While both speakers present reasonable points, they do not provide specific evidence to support their claims. |
| mistralai/mistral-small-3.2-24b-instruct | factual_grounding | 7.00 | 0.8000 | The arguments are generally well-grounded but lack specific factual support. |
| mistralai/mistral-small-3.2-24b-instruct | fairness | 8.00 | 0.8000 | Both speakers acknowledge the other's point and engage with it constructively. |
| mistralai/mistral-small-3.2-24b-instruct | logic | 8.00 | 0.8000 | Both speakers present logical arguments with clear premises and conclusions. |
| mistralai/mistral-small-3.2-24b-instruct | relevance | 9.00 | 0.8000 | All points made are directly relevant to the topic of speed versus thoroughness in decision-making. |
| mistralai/mistral-small-3.2-24b-instruct | rhetorical_manipulation | 1.00 | 0.8000 | No signs of rhetorical manipulation are present in the debate. |
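The rule stated for this table (only schema-valid judgements feed the scores, while invalid responses stay visible elsewhere) can be sketched in two stages: parse the response as JSON, then check it against the judgement schema. This is a minimal sketch, not the application's implementation. The required keys are taken from the response previews on this page and the real schema may demand more. The function names and the use of a plain mean for aggregation are assumptions.

```python
from __future__ import annotations

import json
from statistics import mean

# Keys seen in the response previews on this page; assumed, not the full schema.
REQUIRED_KEYS = {"summary", "total_score", "confidence", "dimensions"}


def classify_response(raw: str) -> tuple[bool, bool, dict | None]:
    """Return (json_valid, schema_valid, parsed) for one raw model response."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False, False, None
    if not isinstance(parsed, dict):
        return True, False, None
    schema_ok = (
        REQUIRED_KEYS <= parsed.keys()
        and isinstance(parsed["total_score"], (int, float))
        and isinstance(parsed["confidence"], (int, float))
        and isinstance(parsed["dimensions"], dict)
    )
    return True, schema_ok, parsed if schema_ok else None


def mean_total_score(raw_responses: list[str]) -> float | None:
    """Average total_score over schema-valid judgements only (assumed aggregation)."""
    scores = []
    for raw in raw_responses:
        _, schema_valid, parsed = classify_response(raw)
        if schema_valid:
            scores.append(parsed["total_score"])
    return mean(scores) if scores else None
```

Keeping the JSON flag separate from the schema flag mirrors the JSON and Schema fields on the model cards: a response can parse cleanly and still be excluded from aggregation.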
Artifacts
| Kind | Path | Size | Checksum |
|---|---|---|---|
| model_raw_response | jury/raw_openrouter_5a39edb6-e53b-4a64-be0b-55b00dad12ed_mistralai_mistral-small-3.2-24b-instruct.json | 2958 | f98c279069a43a30 |
| model_raw_response | jury/raw_openrouter_5a39edb6-e53b-4a64-be0b-55b00dad12ed_google_gemini-2.5-flash-lite.json | 2372 | 5d13bfe6b1828dd9 |
| model_raw_response | jury/raw_openrouter_5a39edb6-e53b-4a64-be0b-55b00dad12ed_qwen_qwen3-next-80b-a3b-instruct_free.json | 833 | a006cab8297c04d3 |
Manifest
- Software: 0.5.5
- Prompt Hash: a4819bddab63bfc1a6
- Rubric Hash: c9144dd5d4c5fcd823
- Input Hash: b0f68a4170b289ea2d