Only schema-valid judgements are used for consensus and score aggregation.
Debate Result · real
Disputatio Fake E2E Fixture 20260601T201940Z
Deterministic fixture debate for M4 fake jury validation.
Back to debate · 7d15ed5d-d83e-4788-8ebf-5ac8cffc3dc9
Why this result?
AI-assisted comparative judgement, not objective truth. Run a real jury to generate a winner rationale.
No aggregation result yet. At least two schema-valid judgements are needed.
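The rule above (aggregate only schema-valid judgements, and only when at least two exist) can be sketched as follows. This is a minimal illustration, not the application's actual code: the `Judgement` class, the `aggregate` function, the mean-based score and the judge names are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional

MIN_VALID_JUDGEMENTS = 2  # threshold stated on the page


@dataclass
class Judgement:
    """Hypothetical record for one judge's response."""
    model: str
    schema_valid: bool
    winner: Optional[str] = None
    total_score: Optional[float] = None


def aggregate(judgements: list[Judgement]) -> Optional[dict]:
    """Aggregate schema-valid judgements; return None when there are too few."""
    valid = [j for j in judgements if j.schema_valid and j.total_score is not None]
    if len(valid) < MIN_VALID_JUDGEMENTS:
        return None  # corresponds to "No aggregation result yet"
    return {
        "valid_count": len(valid),
        # Assumption: a plain mean; the real aggregation rule is not shown.
        "mean_score": mean(j.total_score for j in valid),
        "winner_votes": dict(Counter(j.winner for j in valid)),
    }


jury = [
    Judgement("judge-a", True, "A", 8.0),
    Judgement("judge-b", True, "A", 7.0),
    Judgement("judge-c", False),  # invalid response, excluded
]
print(aggregate(jury))      # two valid judgements, so an aggregate is returned
print(aggregate(jury[1:]))  # only one valid judgement, so None
```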
Evaluated Source
Transparent demo/source context for this run. LLM judgements are comparative signals, not objective truth.
Bias & Reliability Signals
Exploratory MVP signals, not causal bias proof. Counts valid JSON responses before stricter judgement-schema checks.
Derived from score divergence. This is not a causal provider-bias claim.
Score spread 1.00 across valid model judgements.
Score spread 2.00 across valid model judgements.
MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
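The page does not state how the score-spread signals are computed. One reading that matches the numbers in this run is max minus min: 1.00 for the two total scores (8 and 7), and 2.00 for the largest per-dimension gap. The sketch below assumes that reading; `spread` and `max_dimension_spread` are hypothetical helper names.

```python
def spread(values) -> float:
    """Max minus min over the given scores; 0.0 with fewer than two values."""
    values = list(values)
    return float(max(values) - min(values)) if len(values) >= 2 else 0.0


def max_dimension_spread(dims_by_model: dict[str, dict[str, float]]) -> float:
    """Largest per-dimension spread across models, over shared dimensions."""
    if not dims_by_model:
        return 0.0
    shared = set.intersection(*(set(d) for d in dims_by_model.values()))
    return max(
        (spread(d[name] for d in dims_by_model.values()) for name in shared),
        default=0.0,
    )


# Dimension scores copied from the two schema-valid judgements in this run.
dims = {
    "meta-llama/llama-3.1-8b-instruct": {
        "logic": 8, "evidence": 6, "counterarguments": 7, "clarity": 9,
        "relevance": 8, "fairness": 9, "factual_grounding": 7,
        "rhetorical_manipulation": 4, "context_fidelity": 8,
    },
    "mistralai/mistral-small-24b-instruct-2501": {
        "logic": 7, "evidence": 5, "counterarguments": 6, "clarity": 8,
        "relevance": 8, "fairness": 7, "factual_grounding": 6,
        "rhetorical_manipulation": 2, "context_fidelity": 7,
    },
}

print(f"{spread([8, 7]):.2f}")               # 1.00 (total scores)
print(f"{max_dimension_spread(dims):.2f}")   # 2.00 (fairness, rhetorical_manipulation)
```

A spread is a descriptive divergence measure only; as the page notes, it does not by itself support a causal claim about provider bias.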
Jury Verdicts
Clear product view first; raw diagnostics remain below.
| LLM Judge | Verdict | Score | Confidence | Why |
|---|---|---|---|---|
| meta-llama/llama-3.1-8b-instruct · completed | Unclear | 8.00 | 0.9000 | Speaker A argues for speed in infrastructure projects, citing cost and public trust implications, while Speaker B counters that weak review can lead to long-term risks and higher costs. |
| mistralai/mistral-small-24b-instruct-2501 · completed | Unclear | 7.00 | 0.9000 | Speaker A argues for faster decision-making to control costs and maintain public trust, while Speaker B cautions against rushing to avoid long-term risks from poor evidence. Speaker A clarifies that review should be proportional to risk, not an endless process. |
| qwen/qwen3-next-80b-a3b-instruct:free · rate_limited | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON. |
Model Cards
Operational details for trust and debugging.
meta-llama/llama-3.1-8b-instruct · completed
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 10772 ms
- Tokens: 699
- Cost: $0.000025
- Provider: Novita
- Finish: stop
Speaker A argues for speed in infrastructure projects, citing cost and public trust implications, while Speaker B counters that weak review can lead to long-term risks and higher costs.
Response preview
{
"summary": "Speaker A argues for speed in infrastructure projects, citing cost and public trust implications, while Speaker B counters that weak review can lead to long-term risks and higher costs.",
"total_score": 8,
"confidence": 0.9,
"dimensions": {
"logic": 8,
"evidence": 6,
"counterarguments": 7,
"clarity": 9,
"relevance": 8,
"fairness": 9,
"factual_grounding": 7,
"rhetorical_manipulation": 4,
"context_fidelity": 8
},
"reasons": {
"logic
mistralai/mistral-small-24b-instruct-2501 · completed
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 2915 ms
- Tokens: 560
- Cost: $0.000035
- Provider: DeepInfra
- Finish: stop
Speaker A argues for faster decision-making to control costs and maintain public trust, while Speaker B cautions against rushing to avoid long-term risks from poor evidence. Speaker A clarifies that review should be proportional to risk, not an endless process.
Response preview
{
"summary": "Speaker A argues for faster decision-making to control costs and maintain public trust, while Speaker B cautions against rushing to avoid long-term risks from poor evidence. Speaker A clarifies that review should be proportional to risk, not an endless process.",
"total_score": 7,
"confidence": 0.9,
"dimensions": {
"logic": 7,
"evidence": 5,
"counterarguments": 6,
"clarity": 8,
"relevance": 8,
"fairness": 7,
"factual_grounding": 6,
"rhetorica
qwen/qwen3-next-80b-a3b-instruct:free · rate_limited
- Winner: n/a
- JSON: False
- Schema: False
- Latency: n/a
- Tokens: 0
- Cost: $0
- Provider: Venice
- Finish: 429
Model response could not be parsed as valid judgement JSON.
Error: rate_limited · Retry-After: 11s
Error diagnostic
OpenRouter HTTP 429: Provider returned error
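The rate-limited card above reports an HTTP 429 with `Retry-After: 11s`. A client could honour that hint before retrying, roughly as sketched here. Everything in the sketch is an assumption for illustration: `send` is a hypothetical callable returning `(status, headers, body)`, and no real OpenRouter client API is used.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]


def retry_after_seconds(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Read Retry-After as a number of seconds; fall back to `default`."""
    raw = headers.get("Retry-After")  # plain dict lookup, so case-sensitive here
    if raw is None:
        return default
    try:
        return max(0.0, float(raw))
    except ValueError:
        return default  # the HTTP-date form is not handled in this sketch


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send`, waiting out 429 responses up to `max_attempts` times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            return status, headers, body
        sleep(retry_after_seconds(headers))
    raise AssertionError("unreachable")


# Usage with a fake transport: first call is rate limited, second succeeds.
replies = iter([(429, {"Retry-After": "11"}, ""), (200, {}, "{}")])
waits: list[float] = []
status, _, _ = call_with_retry(lambda: next(replies), sleep=waits.append)
print(status, waits)
```

Injecting `sleep` keeps the sketch testable without actually waiting.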
Score Dimensions
Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.
| Model | Dimension | Score | Confidence | Reason |
|---|---|---|---|---|
| meta-llama/llama-3.1-8b-instruct | clarity | 9.00 | 0.9000 | Both speakers communicate their ideas clearly, but Speaker A's language is slightly more concise. |
| meta-llama/llama-3.1-8b-instruct | context_fidelity | 8.00 | 0.9000 | Both speakers stay within the debate's context, but Speaker A could have provided more context for their argument. |
| meta-llama/llama-3.1-8b-instruct | counterarguments | 7.00 | 0.9000 | Speaker B effectively counters Speaker A's argument, but could have done so more explicitly. |
| meta-llama/llama-3.1-8b-instruct | evidence | 6.00 | 0.9000 | Both speakers rely on general principles, but lack specific data to support their claims. |
| meta-llama/llama-3.1-8b-instruct | factual_grounding | 7.00 | 0.9000 | Both speakers rely on general knowledge, but lack specific facts to support their claims. |
| meta-llama/llama-3.1-8b-instruct | fairness | 9.00 | 0.9000 | Both speakers present balanced views, but Speaker A could have acknowledged potential risks of speed. |
| meta-llama/llama-3.1-8b-instruct | logic | 8.00 | 0.9000 | Both speakers present coherent arguments, but Speaker B's counterargument is more nuanced. |
| meta-llama/llama-3.1-8b-instruct | relevance | 8.00 | 0.9000 | Both speakers stay on topic, but Speaker A could have addressed Speaker B's concerns more directly. |
| meta-llama/llama-3.1-8b-instruct | rhetorical_manipulation | 4.00 | 0.9000 | Neither speaker engages in overt rhetorical manipulation. |
| mistralai/mistral-small-24b-instruct-2501 | clarity | 8.00 | 0.9000 | Score for clarity. |
| mistralai/mistral-small-24b-instruct-2501 | context_fidelity | 7.00 | 0.9000 | Score for context_fidelity. |
| mistralai/mistral-small-24b-instruct-2501 | counterarguments | 6.00 | 0.9000 | Speaker B effectively counters Speaker A's argument, but Speaker A also addresses Speaker B's concerns. |
| mistralai/mistral-small-24b-instruct-2501 | evidence | 5.00 | 0.9000 | Both speakers present reasonable points but lack specific data or examples to support their claims. |
| mistralai/mistral-small-24b-instruct-2501 | factual_grounding | 6.00 | 0.9000 | Score for factual_grounding. |
| mistralai/mistral-small-24b-instruct-2501 | fairness | 7.00 | 0.9000 | Score for fairness. |
| mistralai/mistral-small-24b-instruct-2501 | logic | 7.00 | 0.9000 | Score for logic. |
| mistralai/mistral-small-24b-instruct-2501 | relevance | 8.00 | 0.9000 | Score for relevance. |
| mistralai/mistral-small-24b-instruct-2501 | rhetorical_manipulation | 2.00 | 0.9000 | Score for rhetorical_manipulation. |
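The page distinguishes two checks: whether a response is valid JSON, and whether it also satisfies the stricter judgement schema. The two stages can be sketched as below. The required fields are inferred from the visible part of the response previews (`summary`, `total_score`, `confidence`, `dimensions`); the real schema is not shown in full, so treat this field list as an assumption.

```python
import json

# Dimension names as they appear in the Score Dimensions table.
REQUIRED_DIMENSIONS = {
    "logic", "evidence", "counterarguments", "clarity", "relevance",
    "fairness", "factual_grounding", "rhetorical_manipulation",
    "context_fidelity",
}


def is_number(value) -> bool:
    """True for int/float, excluding bool (which is an int subclass)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_response(raw: str) -> tuple[bool, bool]:
    """Return (json_valid, schema_valid) for one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, False
    if not isinstance(data, dict):
        return True, False
    dims = data.get("dimensions")
    schema_valid = (
        isinstance(data.get("summary"), str)
        and is_number(data.get("total_score"))
        and is_number(data.get("confidence"))
        and 0 <= data["confidence"] <= 1
        and isinstance(dims, dict)
        and REQUIRED_DIMENSIONS.issubset(dims)
        and all(is_number(dims[name]) for name in REQUIRED_DIMENSIONS)
    )
    return True, schema_valid


good = json.dumps({
    "summary": "Both sides argue coherently.",
    "total_score": 7,
    "confidence": 0.9,
    "dimensions": {name: 7 for name in REQUIRED_DIMENSIONS},
})
print(check_response(good))                # (True, True)
print(check_response('{"summary": "x"}'))  # (True, False)
print(check_response("rate limited"))      # (False, False)
```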
Artifacts
| Kind | Path | Size | Checksum |
|---|---|---|---|
| model_raw_response | jury/raw_openrouter_7d15ed5d-d83e-4788-8ebf-5ac8cffc3dc9_mistralai_mistral-small-24b-instruct-2501.json | 2137 | 0ee2bf2d7f98874e |
| model_raw_response | jury/raw_openrouter_7d15ed5d-d83e-4788-8ebf-5ac8cffc3dc9_meta-llama_llama-3.1-8b-instruct.json | 2878 | b1281e32dff144ed |
| model_raw_response | jury/raw_openrouter_7d15ed5d-d83e-4788-8ebf-5ac8cffc3dc9_qwen_qwen3-next-80b-a3b-instruct_free.json | 833 | 77d90b6c0d0ee4de |
Manifest
- Software: 0.5.4
- Prompt Hash: a4819bddab63bfc1a6
- Rubric Hash: c9144dd5d4c5fcd823
- Input Hash: 6644a1c7859bc24992