Debate Result · real

Disputatio Fake E2E Fixture 20260601T201940Z

Deterministic fixture debate for M4 fake jury validation.

Debate ID: 7cbaefe7-c406-4170-80ef-fb974f41bb1d

failed

Consensus Winner: Unclear (0/3 valid judges)

Model Agreement: n/a (Divergence: n/a)

Closest To Consensus: n/a. No schema-valid model judgement yet.

Cost: $0.000000 (2184 tokens)

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

No aggregation result yet. At least two schema-valid judgements are needed.
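The two-judgement minimum stated above can be sketched as a small gate in front of aggregation. This is only an illustration: the dictionary keys `schema_valid` and `winner`, the function name, and the simple majority rule are assumptions, not the tool's actual aggregation logic.

```python
from collections import Counter

# From the page text: at least two schema-valid judgements are needed.
MIN_VALID_JUDGEMENTS = 2


def consensus_winner(judgements):
    """Return (winner, valid_count) for a list of judgement dicts.

    Each dict uses hypothetical keys: 'schema_valid' (bool) and 'winner' (str).
    The majority vote below is an assumed rule, used only for illustration.
    """
    valid = [j for j in judgements if j.get("schema_valid")]
    if len(valid) < MIN_VALID_JUDGEMENTS:
        # Too few usable judgements: no aggregation result.
        return "Unclear", len(valid)

    counts = Counter(j["winner"] for j in valid)
    (top, top_n), *rest = counts.most_common()
    if rest and rest[0][1] == top_n:
        # A tie between the top two candidates is also reported as unclear.
        return "Unclear", len(valid)
    return top, len(valid)
```

For this run all three judges are invalid, so a gate like this would report `Unclear` with 0 valid judgements, matching the summary shown above.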

Evaluated Source

Transparent demo/source context for this run.

Source Type: fixture

Purpose: Manual debate source.

Input Scope: internal_only. Full available manual transcript.

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.

Schema reliability: 0/3

Only schema-valid judgements are used for consensus and score aggregation.

JSON reliability: 0/3

Counts valid JSON responses before stricter judgement-schema checks.

Provider disagreement: n/a

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread: n/a

Need at least two schema-valid judgements.

Rhetorical manipulation spread: n/a

Need at least two schema-valid judgements.

Position / identity bias: not tested

MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
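The distinction between JSON reliability and schema reliability described in this section can be illustrated with a two-stage check: parse first, then apply the stricter shape test. This is a rough sketch under assumptions. The required keys are taken from the fields visible in the response preview on this page, and the real judgement schema may require more or different fields.

```python
import json

# Assumed from the response preview; the actual schema is not shown on this page.
REQUIRED_KEYS = {"summary", "total_score", "confidence", "dimensions"}


def classify_response(raw):
    """Stage 1: is it valid JSON? Stage 2: does it have the assumed required keys?"""
    try:
        data = json.loads(raw or "")
    except json.JSONDecodeError:
        return {"json_valid": False, "schema_valid": False}
    schema_ok = isinstance(data, dict) and REQUIRED_KEYS.issubset(data)
    return {"json_valid": True, "schema_valid": schema_ok}


def reliability_counts(raw_responses):
    """Return ('json_ok/total', 'schema_ok/total') as display strings."""
    results = [classify_response(r) for r in raw_responses]
    total = len(results)
    json_ok = sum(r["json_valid"] for r in results)
    schema_ok = sum(r["schema_valid"] for r in results)
    return f"{json_ok}/{total}", f"{schema_ok}/{total}"
```

With two empty bodies (the failed HTTP calls) and one truncated JSON string, both counters come out as `0/3`, which is what this run reports.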

Jury Verdicts

Clear product view first; raw diagnostics remain below.

LLM Judge	Verdict	Score	Confidence	Why
meta-llama/llama-3.3-70b-instruct:free failed	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.
openai/gpt-oss-120b:free failed	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.
z-ai/glm-4.5-air:free invalid_schema	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.

Model Cards

Operational details for trust and debugging.

meta-llama/llama-3.3-70b-instruct:free

failed

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: n/a ms
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic

OpenRouter HTTP 429: {"error": {"message": "Provider returned error", "code": 429, "metadata": {"raw": "meta-llama/llama-3.3-70b-instruct:free is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations", "provider_name": "Venice", "is_byok": false, "retry_after_seconds": 18, "retry_after_seconds_raw": 17.199, "headers": {"Retry-After": "18"}}}, "user_id": "user_3Bl6LBShLIGou4GzxDLzm73U616"}
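The 429 diagnostic above carries its own retry hint (`retry_after_seconds` and a `Retry-After` header inside `metadata`). A caller could read that hint before retrying, roughly as sketched below. The function name and the 30-second default are hypothetical, and the field layout is taken only from the payload shown in this diagnostic, not from any documented contract. The input is the JSON portion after the `OpenRouter HTTP 429:` prefix.

```python
import json


def retry_delay_seconds(error_body, default=30.0):
    """Extract a retry delay from an error payload shaped like the one above.

    Falls back to `default` (an arbitrary illustrative value) when no hint is found.
    """
    try:
        payload = json.loads(error_body)
    except (TypeError, json.JSONDecodeError):
        return default
    if not isinstance(payload, dict):
        return default

    error = payload.get("error")
    metadata = error.get("metadata") if isinstance(error, dict) else None
    if not isinstance(metadata, dict):
        return default

    # Prefer the numeric field; fall back to the header string.
    value = metadata.get("retry_after_seconds")
    if value is None:
        headers = metadata.get("headers")
        value = headers.get("Retry-After") if isinstance(headers, dict) else None

    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        return default
```

For the payload above this returns `18.0`. A caller would then sleep for that long before a single retry; the 404 payload from the second model has no such hint, and its message points to a settings issue rather than a transient limit.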

openai/gpt-oss-120b:free

failed

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: n/a ms
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic

OpenRouter HTTP 404: {"error": {"message": "No endpoints available matching your guardrail restrictions and data policy. Configure: https://openrouter.ai/settings/privacy", "code": 404}}

z-ai/glm-4.5-air:free

invalid_schema

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: 33052 ms
Tokens: 2184
Cost: $0
Provider: Z.AI
Finish: length

Model response could not be parsed as valid judgement JSON.

Error: invalid_schema

Response preview

{
  "summary": "The debate presents a balanced discussion on the tension between speed and thoroughness in decision-making processes. Speaker A advocates for faster decisions when evidence is sufficient, while Speaker B emphasizes the risks of inadequate review. Both sides make reasonable points, with Speaker A particularly effective in arguing for proportionality in review processes.",
  "total_score": 7.1,
  "confidence": 0.85,
  "dimensions": {
    "logic": 8,
    "evidence": 3,
    "countera
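The preview above stops mid-key, and the model card reports `Finish: length`, which in OpenAI-style chat APIs conventionally means the output hit its token limit. A diagnostic step could separate that case from other parse failures, as in the sketch below. The function name and the returned labels are hypothetical and are not the status values this tool uses.

```python
import json


def diagnose_failure(raw_text, finish_reason):
    """Label why a response failed to parse, using illustrative labels.

    'truncated_output' means JSON parsing failed and the finish reason was
    'length', so the likely cause is a token limit rather than bad formatting.
    """
    try:
        json.loads(raw_text)
    except json.JSONDecodeError:
        if finish_reason == "length":
            return "truncated_output"
        return "invalid_json"
    return "parsed"
```

Applied to this run's third model, the result would be `truncated_output`, which suggests raising the output token limit or shortening the requested judgement before rerunning.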

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model	Dimension	Score	Confidence	Reason
No dimension scores.

Artifacts

Kind	Path	Size	Checksum
model_raw_response	`jury/raw_openrouter_7cbaefe7-c406-4170-80ef-fb974f41bb1d_z-ai_glm-4.5-air_free.json`	12228	`a31950ded96a61ab`
model_raw_response	`jury/raw_openrouter_7cbaefe7-c406-4170-80ef-fb974f41bb1d_openai_gpt-oss-120b_free.json`	290	`268225afc7d0ed65`
model_raw_response	`jury/raw_openrouter_7cbaefe7-c406-4170-80ef-fb974f41bb1d_meta-llama_llama-3.3-70b-instruct_free.json`	634	`203f03d46b54e07b`

Manifest

Software: 0.5.2
Prompt Hash: 782b115e9241a33b15
Rubric Hash: c9144dd5d4c5fcd823
Input Hash: 6644a1c7859bc24992