Only schema-valid judgements are used for consensus and score aggregation.
Debate Result · real
Disputatio Fake E2E Fixture 20260601T201940Z
Deterministic fixture debate for M4 fake jury validation.
Back to debate · fa5514b3-e0ef-480f-bde0-bf7347e25b39
Why this result?
AI-assisted comparative judgement, not objective truth.
Run a real jury to generate a winner rationale.
No aggregation result yet. At least two schema-valid judgements are needed.
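The gating rule stated above (aggregate only schema-valid judgements, and only when at least two exist) can be sketched as follows. This is a minimal illustration, not the application's actual code: the function name `aggregate`, the `schema_valid` flag, and the use of a plain mean are assumptions. The field names `total_score` and `confidence` follow the response preview shown further down this page.

```python
from statistics import mean
from typing import Optional

MIN_VALID_JUDGEMENTS = 2  # threshold stated on this page


def aggregate(judgements: list[dict]) -> Optional[dict]:
    """Aggregate schema-valid judgements; return None when fewer than two exist.

    The unweighted mean is an assumption made for illustration only.
    """
    valid = [j for j in judgements if j.get("schema_valid")]
    if len(valid) < MIN_VALID_JUDGEMENTS:
        return None
    return {
        "n_valid": len(valid),
        "mean_score": mean(j["total_score"] for j in valid),
        "mean_confidence": mean(j["confidence"] for j in valid),
    }


# The three judges of this run: two failed, one produced a schema-valid judgement.
run = [
    {"model": "meta-llama/llama-3.3-70b-instruct:free", "schema_valid": False},
    {"model": "openai/gpt-oss-120b:free", "schema_valid": False},
    {"model": "z-ai/glm-4.5-air:free", "schema_valid": True,
     "total_score": 8, "confidence": 0.85},
]
result = aggregate(run)  # None: only one schema-valid judgement is available
```

With a single valid judgement the sketch returns `None`, which corresponds to the "No aggregation result yet" state shown here.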
Evaluated Source
Transparent demo/source context for this run.
LLM judgements are comparative signals, not objective truth.
Bias & Reliability Signals
Exploratory MVP signals, not causal bias proof.
Counts valid JSON responses before stricter judgement-schema checks.
Derived from score divergence. This is not a causal provider-bias claim.
Need at least two schema-valid judgements.
MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
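The signals above separate two checks: whether a response parses as JSON at all, and whether it then passes the stricter judgement schema. A minimal sketch of that two-stage classification, together with one possible score-divergence measure, is shown below. The required keys are taken from the response preview on this page. The exact schema, the function names, and the max-minus-min definition of divergence are assumptions.

```python
import json
from typing import Optional

# Assumed required keys, based on the response preview on this page.
REQUIRED_KEYS = {"summary", "total_score", "confidence", "dimensions"}


def classify(raw: str) -> tuple[bool, bool]:
    """Return (json_valid, schema_valid) for one raw model response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return (False, False)
    if not isinstance(data, dict):
        return (True, False)
    schema_ok = (
        REQUIRED_KEYS.issubset(data)
        and isinstance(data["total_score"], (int, float))
        and isinstance(data["confidence"], (int, float))
        and 0.0 <= data["confidence"] <= 1.0
        and isinstance(data["dimensions"], dict)
    )
    return (True, schema_ok)


def score_divergence(scores: list[float]) -> Optional[float]:
    """Spread between highest and lowest score; None with fewer than two scores."""
    if len(scores) < 2:
        return None
    return max(scores) - min(scores)
```

A response can therefore count toward the JSON-validity signal and still be excluded from aggregation if it fails the schema stage.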
Jury Verdicts
Clear product view first; raw diagnostics remain below.

| LLM Judge | Verdict | Score | Confidence | Why |
|---|---|---|---|---|
| meta-llama/llama-3.3-70b-instruct:free (failed) | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON. |
| openai/gpt-oss-120b:free (failed) | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON. |
| z-ai/glm-4.5-air:free (completed) | Unclear | 8.00 | 0.8500 | A concise exchange on the pace of infrastructure projects. Speaker A argues for speed to save costs and maintain trust, while Speaker B counters that inadequate review risks costly failures. Speaker A concludes by advocating for proportional review rather than indefinite delays. |
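Two of the three judges failed in this run. The error diagnostic for meta-llama/llama-3.3-70b-instruct:free reports an HTTP 429 whose body carries `retry_after_seconds` under `error.metadata`, while the openai/gpt-oss-120b:free failure is an HTTP 404 caused by a privacy/guardrail setting. A minimal sketch of how a client might decide whether and how long to wait before retrying is shown below. The function name, the one-second fallback, and the assumption that a `Retry-After` response header is present are illustrative and not taken from the application.

```python
import json
from typing import Mapping, Optional


def retry_delay_seconds(status: int, headers: Mapping[str, str],
                        body: str) -> Optional[float]:
    """Return seconds to wait before a retry, or None if a retry will not help.

    Only HTTP 429 is treated as retryable in this sketch. A 404 such as
    "No endpoints available" needs a configuration change instead.
    """
    if status != 429:
        return None
    value = headers.get("Retry-After")
    if value is not None:
        try:
            return float(value)
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled here
    try:
        # Field path as seen in the 429 diagnostic on this page.
        return float(json.loads(body)["error"]["metadata"]["retry_after_seconds"])
    except (ValueError, KeyError, TypeError):
        return 1.0  # assumed fallback delay


body_429 = '{"error": {"code": 429, "metadata": {"retry_after_seconds": 17}}}'
delay = retry_delay_seconds(429, {}, body_429)
```

A caller would sleep for the returned delay and resend the request, ideally with a cap on the number of attempts.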
Model Cards
Operational details for trust and debugging.
meta-llama/llama-3.3-70b-instruct:free
failed
- Winner: n/a
- JSON: False
- Schema: False
- Latency: n/a
- Tokens: 0
- Cost: $0
- Provider: n/a
- Finish: n/a
Model response could not be parsed as valid judgement JSON.
Error: RuntimeError
Error diagnostic
OpenRouter HTTP 429: {"error": {"message": "Provider returned error", "code": 429, "metadata": {"raw": "meta-llama/llama-3.3-70b-instruct:free is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations", "provider_name": "Venice", "is_byok": false, "retry_after_seconds": 17, "retry_after_seconds_raw": 16.007, "headers": {"Retry-After": "17"}}}, "user_id": "user_3Bl6LBShLIGou4GzxDLzm73U616"}
openai/gpt-oss-120b:free
failed
- Winner: n/a
- JSON: False
- Schema: False
- Latency: n/a
- Tokens: 0
- Cost: $0
- Provider: n/a
- Finish: n/a
Model response could not be parsed as valid judgement JSON.
Error: RuntimeError
Error diagnostic
OpenRouter HTTP 404: {"error": {"message": "No endpoints available matching your guardrail restrictions and data policy. Configure: https://openrouter.ai/settings/privacy", "code": 404}}
z-ai/glm-4.5-air:free
completed
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 10257 ms
- Tokens: 1115
- Cost: $0
- Provider: Z.AI
- Finish: stop
A concise exchange on the pace of infrastructure projects. Speaker A argues for speed to save costs and maintain trust, while Speaker B counters that inadequate review risks costly failures. Speaker A concludes by advocating for proportional review rather than indefinite delays.
Response preview
{
  "summary": "A concise exchange on the pace of infrastructure projects. Speaker A argues for speed to save costs and maintain trust, while Speaker B counters that inadequate review risks costly failures. Speaker A concludes by advocating for proportional review rather than indefinite delays.",
  "total_score": 8,
  "confidence": 0.85,
  "dimensions": {
    "logic": 8,
    "evidence": 4,
    "counterarguments": 8,
    "clarity": 9,
    "relevance": 9,
    "fairness": 9,
    "factual_grounding"
Score Dimensions
Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.
| Model | Dimension | Score | Confidence | Reason |
|---|---|---|---|---|
| z-ai/glm-4.5-air:free | clarity | 9.00 | 0.8500 | Score for clarity. |
| z-ai/glm-4.5-air:free | context_fidelity | 8.00 | 0.8500 | Score for context_fidelity. |
| z-ai/glm-4.5-air:free | counterarguments | 8.00 | 0.8500 | Score for counterarguments. |
| z-ai/glm-4.5-air:free | evidence | 4.00 | 0.8500 | The debate lacks specific empirical data, statistics, or concrete examples, relying entirely on general assertions about infrastructure and decision-making. |
| z-ai/glm-4.5-air:free | factual_grounding | 5.00 | 0.8500 | Score for factual_grounding. |
| z-ai/glm-4.5-air:free | fairness | 9.00 | 0.8500 | Score for fairness. |
| z-ai/glm-4.5-air:free | logic | 8.00 | 0.8500 | Both speakers present coherent and logical arguments. Speaker A's final segment effectively synthesizes the discussion by introducing the concept of proportional risk. |
| z-ai/glm-4.5-air:free | relevance | 9.00 | 0.8500 | Score for relevance. |
| z-ai/glm-4.5-air:free | rhetorical_manipulation | 9.00 | 0.8500 | Score for rhetorical_manipulation. |
Artifacts
| Kind | Path | Size | Checksum |
|---|---|---|---|
| model_raw_response | jury/raw_openrouter_fa5514b3-e0ef-480f-bde0-bf7347e25b39_z-ai_glm-4.5-air_free.json | 6448 | 70700ac0d820912e |
| model_raw_response | jury/raw_openrouter_fa5514b3-e0ef-480f-bde0-bf7347e25b39_openai_gpt-oss-120b_free.json | 290 | 268225afc7d0ed65 |
| model_raw_response | jury/raw_openrouter_fa5514b3-e0ef-480f-bde0-bf7347e25b39_meta-llama_llama-3.3-70b-instruct_free.json | 634 | 49e230862ac63c92 |
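Each artifact row records a size and a 16-character hexadecimal checksum, which makes it possible to check a downloaded file against the table. The page does not state which hash algorithm produces the checksum, so the sketch below takes the algorithm as a parameter and assumes a truncated SHA-256 digest purely for illustration. The function names are hypothetical, and sizes are assumed to be in bytes.

```python
import hashlib


def short_checksum(data: bytes, algorithm: str = "sha256", length: int = 16) -> str:
    """Return the first `length` hex characters of the digest of `data`.

    SHA-256 is an assumed default; the real algorithm is not stated on this page.
    """
    return hashlib.new(algorithm, data).hexdigest()[:length]


def verify_artifact(data: bytes, expected_size: int, expected_checksum: str,
                    algorithm: str = "sha256") -> bool:
    """Check both the byte size and the truncated checksum of an artifact."""
    actual = short_checksum(data, algorithm, len(expected_checksum))
    return len(data) == expected_size and actual == expected_checksum.lower()
```

If verification fails for a file that is otherwise intact, the most likely explanation under these assumptions is a different hash algorithm rather than corruption.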
Manifest
- Software: 0.5.0
- Prompt Hash: 02c4e31fc6dc138683
- Rubric Hash: c9144dd5d4c5fcd823
- Input Hash: 6644a1c7859bc24992