Debate Result · real

Disputatio Fake E2E Fixture 20260601T201940Z

Deterministic fixture debate for M4 fake jury validation.

Back to debate · fa5514b3-e0ef-480f-bde0-bf7347e25b39

failed

Consensus Winner: Unclear · 1/3 valid judges

Model Agreement: 100.0% · Divergence: n/a

Closest To Consensus: z-ai/glm-4.5-air:free · Closest to consensus among schema-valid judgements. Latency: 10257 ms.

Cost: $0.000000 · 1115 tokens

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

No aggregation result yet. At least two schema-valid judgements are needed.
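The two-judgement gate described above can be sketched as follows. This is a minimal illustration, not the application's actual aggregation code: the `schema_valid` and `winner` field names, the majority vote, and the tie-to-"Unclear" rule are all assumptions made for the example.

```python
from collections import Counter

MIN_VALID = 2  # the page states at least two schema-valid judgements are needed


def consensus_winner(judgements):
    """Return a consensus winner, or None when aggregation is not possible.

    Each judgement is a dict with hypothetical keys 'schema_valid' and 'winner'.
    """
    valid = [j for j in judgements if j.get("schema_valid")]
    if len(valid) < MIN_VALID:
        return None  # mirrors "No aggregation result yet"
    counts = Counter(j["winner"] for j in valid).most_common()
    # Assumed tie rule: a tie between the top two winners is reported as "Unclear".
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "Unclear"
    return counts[0][0]
```

With one valid judgement out of three, as in this run, the sketch returns `None` rather than a winner.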

Evaluated Source

Transparent demo/source context for this run.

Source Type: fixture (fixture)

Purpose: Manual debate source.

Input Scope: internal_only · Full available manual transcript.

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.

Schema reliability 1/3

Only schema-valid judgements are used for consensus and score aggregation.

JSON reliability 1/3

Counts valid JSON responses before stricter judgement-schema checks.
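The two reliability counters can be read as two successive checks: first whether a response parses as JSON at all, then whether the parsed object satisfies the judgement schema. The sketch below is one way to compute both ratios. The required keys are taken from the fields visible in the response preview on this page; the real schema is not shown here, so treat `REQUIRED_KEYS` and the range check on `confidence` as assumptions.

```python
import json

# Assumed subset of the judgement schema, based on the visible response preview.
REQUIRED_KEYS = {"summary", "total_score", "confidence", "dimensions"}


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def classify_response(raw):
    """Return (json_ok, schema_ok) for one raw model response (str or None)."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):  # None, or text that is not valid JSON
        return (False, False)
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return (True, False)
    schema_ok = (
        _is_number(data["total_score"])
        and _is_number(data["confidence"])
        and 0.0 <= data["confidence"] <= 1.0
        and isinstance(data["dimensions"], dict)
    )
    return (True, schema_ok)


def reliability(raw_responses):
    """Summarise JSON and schema reliability as 'valid/total' strings."""
    flags = [classify_response(r) for r in raw_responses]
    total = len(flags)
    return {
        "json": f"{sum(j for j, _ in flags)}/{total}",
        "schema": f"{sum(s for _, s in flags)}/{total}",
    }
```

A response that fails the first check can never pass the second, which is why the schema count can equal but never exceed the JSON count.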

Provider disagreement n/a

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread n/a

Need at least two schema-valid judgements.

Rhetorical manipulation spread n/a

Need at least two schema-valid judgements.
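The spread signals above stay at n/a until two schema-valid judgements exist. A minimal sketch of such a signal follows, assuming "spread" means the difference between the highest and lowest score for one dimension; the page does not define the formula, so that definition and the field names are hypothetical.

```python
def score_spread(judgements, dimension):
    """Return max - min of one dimension across schema-valid judgements.

    Returns None (shown as n/a) when fewer than two usable scores exist.
    The 'schema_valid' and 'dimensions' keys are hypothetical field names.
    """
    scores = [
        j["dimensions"][dimension]
        for j in judgements
        if j.get("schema_valid") and dimension in j.get("dimensions", {})
    ]
    if len(scores) < 2:
        return None
    return max(scores) - min(scores)
```

A non-zero spread under this definition only says that judges scored differently; as the page notes, it is not evidence of provider bias on its own.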

Position / identity bias not tested

MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
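One possible shape for such a control run is a label swap: the same transcript is judged again with the Speaker A/B labels exchanged, and the two verdicts are compared after mapping the swapped verdict back. The helper names below are hypothetical, and this sketch covers only label swapping, not full anonymization or reordering of turns.

```python
FLIP = {"A": "B", "B": "A"}


def swap_labels(turns):
    """Exchange speaker labels on a list of (speaker, text) turns."""
    return [(FLIP[speaker], text) for speaker, text in turns]


def unswap_winner(winner):
    """Map a verdict from the swapped run back to the original labels."""
    return FLIP.get(winner, winner)  # 'Unclear' and similar pass through


def label_consistent(winner_original, winner_swapped):
    """True when both runs favour the same underlying speaker."""
    return winner_original == unswap_winner(winner_swapped)
```

If a judge picks "A" in both runs, `label_consistent("A", "A")` is `False`, which would suggest the verdict follows the label rather than the argument.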

Jury Verdicts

Clear product view first; raw diagnostics remain below.

LLM Judge	Verdict	Score	Confidence	Why
meta-llama/llama-3.3-70b-instruct:free failed	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.
openai/gpt-oss-120b:free failed	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.
z-ai/glm-4.5-air:free completed	Unclear	8.00	0.8500	A concise exchange on the pace of infrastructure projects. Speaker A argues for speed to save costs and maintain trust, while Speaker B counters that inadequate review risks costly failures. Speaker A concludes by advocating for proportional review rather than indefinite delays.

Model Cards

Operational details for trust and debugging.

meta-llama/llama-3.3-70b-instruct:free

failed

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: n/a ms
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic

OpenRouter HTTP 429: {"error": {"message": "Provider returned error", "code": 429, "metadata": {"raw": "meta-llama/llama-3.3-70b-instruct:free is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations", "provider_name": "Venice", "is_byok": false, "retry_after_seconds": 17, "retry_after_seconds_raw": 16.007, "headers": {"Retry-After": "17"}}}, "user_id": "user_3Bl6LBShLIGou4GzxDLzm73U616"}
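The 429 body above carries a retry hint in two places: `retry_after_seconds` in the metadata and a `Retry-After` header value. A caller could use either to wait before retrying instead of recording the judge as failed. The sketch below reads only fields that appear in this diagnostic; `send` is a hypothetical callable returning `(status, body_dict)`, and the default delay and attempt limit are arbitrary assumptions.

```python
import time


def retry_delay(error_body, default=5.0):
    """Extract a wait time in seconds from an error body shaped like the one above."""
    meta = error_body.get("error", {}).get("metadata", {})
    if "retry_after_seconds" in meta:
        return float(meta["retry_after_seconds"])
    header = meta.get("headers", {}).get("Retry-After")
    return float(header) if header is not None else default


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() and retry on HTTP 429 only, waiting as the body suggests.

    send is a hypothetical zero-argument callable returning (status, body_dict).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # Any non-429 status (success, 404, ...) is returned immediately.
        if status != 429 or attempt == max_attempts:
            return status, body
        sleep(retry_delay(body))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting. Errors such as the 404 seen for another judge on this page are returned immediately, since waiting would not fix a configuration problem.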

openai/gpt-oss-120b:free

failed

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: n/a ms
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic

OpenRouter HTTP 404: {"error": {"message": "No endpoints available matching your guardrail restrictions and data policy. Configure: https://openrouter.ai/settings/privacy", "code": 404}}

z-ai/glm-4.5-air:free

completed

Total score: 8.00

Winner: Unclear
JSON: True
Schema: True
Latency: 10257 ms
Tokens: 1115
Cost: $0
Provider: Z.AI
Finish: stop

A concise exchange on the pace of infrastructure projects. Speaker A argues for speed to save costs and maintain trust, while Speaker B counters that inadequate review risks costly failures. Speaker A concludes by advocating for proportional review rather than indefinite delays.

Response preview

{
  "summary": "A concise exchange on the pace of infrastructure projects. Speaker A argues for speed to save costs and maintain trust, while Speaker B counters that inadequate review risks costly failures. Speaker A concludes by advocating for proportional review rather than indefinite delays.",
  "total_score": 8,
  "confidence": 0.85,
  "dimensions": {
    "logic": 8,
    "evidence": 4,
    "counterarguments": 8,
    "clarity": 9,
    "relevance": 9,
    "fairness": 9,
    "factual_grounding"

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model	Dimension	Score	Confidence	Reason
z-ai/glm-4.5-air:free	clarity	9.00	0.8500	Score for clarity.
z-ai/glm-4.5-air:free	context_fidelity	8.00	0.8500	Score for context_fidelity.
z-ai/glm-4.5-air:free	counterarguments	8.00	0.8500	Score for counterarguments.
z-ai/glm-4.5-air:free	evidence	4.00	0.8500	The debate lacks specific empirical data, statistics, or concrete examples, relying entirely on general assertions about infrastructure and decision-making.
z-ai/glm-4.5-air:free	factual_grounding	5.00	0.8500	Score for factual_grounding.
z-ai/glm-4.5-air:free	fairness	9.00	0.8500	Score for fairness.
z-ai/glm-4.5-air:free	logic	8.00	0.8500	Both speakers present coherent and logical arguments. Speaker A's final segment effectively synthesizes the discussion by introducing the concept of proportional risk.
z-ai/glm-4.5-air:free	relevance	9.00	0.8500	Score for relevance.
z-ai/glm-4.5-air:free	rhetorical_manipulation	9.00	0.8500	Score for rhetorical_manipulation.

Artifacts

Kind	Path	Size	Checksum
model_raw_response	`jury/raw_openrouter_fa5514b3-e0ef-480f-bde0-bf7347e25b39_z-ai_glm-4.5-air_free.json`	6448	`70700ac0d820912e`
model_raw_response	`jury/raw_openrouter_fa5514b3-e0ef-480f-bde0-bf7347e25b39_openai_gpt-oss-120b_free.json`	290	`268225afc7d0ed65`
model_raw_response	`jury/raw_openrouter_fa5514b3-e0ef-480f-bde0-bf7347e25b39_meta-llama_llama-3.3-70b-instruct_free.json`	634	`49e230862ac63c92`

Manifest

Software: 0.5.0
Prompt Hash: 02c4e31fc6dc138683
Rubric Hash: c9144dd5d4c5fcd823
Input Hash: 6644a1c7859bc24992