Debate Result · real

Disputatio Fake E2E Fixture 20260601T201940Z

Deterministic fixture debate for M4 fake jury validation.

Debate ID: 7cbaefe7-c406-4170-80ef-fb974f41bb1d

Status: failed
Consensus Winner: Unclear (0/3 valid judges)
Model Agreement: n/a (Divergence: n/a)
Closest To Consensus: n/a (No schema-valid model judgement yet.)
Cost: $0.000000 (2184 tokens)

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

No aggregation result yet. At least two schema-valid judgements are needed.
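The two-judgement minimum stated above can be sketched as a small gate in front of the aggregation step. This is a minimal illustration, not the application's actual code: the function name, the `schema_valid` and `winner` keys, and the majority-vote rule are all assumptions.

```python
from collections import Counter
from typing import Optional

# Taken from the page text: at least two schema-valid judgements are needed.
MIN_VALID_JUDGEMENTS = 2


def consensus_winner(judgements: list[dict]) -> Optional[str]:
    """Return the majority winner, or None when no aggregation is possible.

    Each judgement is a dict with hypothetical keys "schema_valid" and "winner".
    """
    valid = [j for j in judgements if j.get("schema_valid")]
    if len(valid) < MIN_VALID_JUDGEMENTS:
        return None  # corresponds to "No aggregation result yet"
    counts = Counter(j["winner"] for j in valid).most_common()
    top_winner, top_votes = counts[0]
    # A tie between the two most frequent winners is treated as unclear.
    if len(counts) > 1 and counts[1][1] == top_votes:
        return None
    return top_winner
```

With three failed judges, as in this run, the function returns None, which matches the "Unclear" consensus shown in the summary.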

Evaluated Source

Transparent demo/source context for this run.
Source Type: fixture (fixture)
Purpose: Manual debate source.
Input Scope: internal_only (Full available manual transcript.)

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.
Schema reliability 0/3

Only schema-valid judgements are used for consensus and score aggregation.

JSON reliability 0/3

Counts valid JSON responses before stricter judgement-schema checks.
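The two reliability counts describe a two-stage check: a response must first parse as JSON and only then is it tested against the judgement schema. A minimal sketch of that order follows. The key names are borrowed from the response preview on this page, but treating them as the required set is an assumption, as is the function name.

```python
import json

# Assumed required keys; the real judgement schema is not shown on this page.
REQUIRED_KEYS = {"summary", "total_score", "confidence", "dimensions"}


def check_response(raw_text: str) -> tuple[bool, bool]:
    """Return (json_ok, schema_ok) for one raw model response."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        # Not valid JSON, so the stricter schema check is never reached.
        return (False, False)
    schema_ok = isinstance(payload, dict) and REQUIRED_KEYS.issubset(payload)
    return (True, schema_ok)
```

A response that is cut off mid-object, such as one that stops at the token limit, fails the first stage and counts toward neither total.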

Provider disagreement n/a

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread n/a

Need at least two schema-valid judgements.

Rhetorical manipulation spread n/a

Need at least two schema-valid judgements.
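The spread signals above stay at n/a until two schema-valid judgements exist. One simple way to compute such a spread is the range (maximum minus minimum) of a dimension's scores across judges. The range definition and the function name are assumptions; the page does not say which statistic is used.

```python
from typing import Optional


def dimension_spread(scores: list[float]) -> Optional[float]:
    """Return max - min of one dimension's scores, or None for fewer than two."""
    if len(scores) < 2:
        return None  # rendered as "n/a" on the page
    return max(scores) - min(scores)
```

For example, evidence scores of 3, 8 and 5 from three judges would give a spread of 5, while this run's empty list gives None.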

Position / identity bias not tested

MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
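A later research run could test label effects by judging the same transcript twice, once with neutral labels and once with those labels swapped. The sketch below shows one possible way to build the two variants. The turn format, the neutral label names, and both function names are hypothetical.

```python
Turn = tuple[str, str]  # (speaker label, utterance text)


def relabel(turns: list[Turn], mapping: dict[str, str]) -> list[Turn]:
    """Replace each speaker label according to mapping, keeping turn order."""
    return [(mapping[speaker], text) for speaker, text in turns]


def label_swapped_variants(turns: list[Turn]) -> tuple[list[Turn], list[Turn]]:
    """Return two anonymized copies of the transcript with labels swapped."""
    first = relabel(turns, {"Speaker A": "Speaker 1", "Speaker B": "Speaker 2"})
    second = relabel(turns, {"Speaker A": "Speaker 2", "Speaker B": "Speaker 1"})
    return first, second
```

If a judge's verdict follows the label rather than the argument across the two variants, that is a signal worth investigating, though still not causal proof.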

Jury Verdicts

Clear product view first; raw diagnostics remain below.
LLM Judge | Verdict | Score | Confidence | Why
meta-llama/llama-3.3-70b-instruct:free (failed) | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON.
openai/gpt-oss-120b:free (failed) | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON.
z-ai/glm-4.5-air:free (invalid_schema) | n/a | n/a | n/a | Model response could not be parsed as valid judgement JSON.

Model Cards

Operational details for trust and debugging.

meta-llama/llama-3.3-70b-instruct:free

Status: failed
Total score: n/a
Winner: n/a
JSON: False
Schema: False
Latency: n/a
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic
OpenRouter HTTP 429: {"error": {"message": "Provider returned error", "code": 429, "metadata": {"raw": "meta-llama/llama-3.3-70b-instruct:free is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations", "provider_name": "Venice", "is_byok": false, "retry_after_seconds": 18, "retry_after_seconds_raw": 17.199, "headers": {"Retry-After": "18"}}}, "user_id": "user_3Bl6LBShLIGou4GzxDLzm73U616"}
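The 429 body above carries a `retry_after_seconds` field under `error.metadata`, which a caller could use to schedule a retry instead of failing the judge outright. The sketch below reads that field from an error body shaped like the one shown. The function name and the fallback delay are assumptions, and other error bodies may not include this field at all.

```python
import json


def retry_delay_seconds(error_body: str, default: float = 5.0) -> float:
    """Return the suggested wait from an error body, or default if absent."""
    try:
        metadata = json.loads(error_body)["error"]["metadata"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return default  # unparseable body or no metadata block
    if not isinstance(metadata, dict):
        return default
    value = metadata.get("retry_after_seconds")
    if isinstance(value, (int, float)):
        return float(value)
    return default
```

For the body shown in this card the result would be 18.0 seconds. A body without a metadata block falls back to the default.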

openai/gpt-oss-120b:free

Status: failed
Total score: n/a
Winner: n/a
JSON: False
Schema: False
Latency: n/a
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic
OpenRouter HTTP 404: {"error": {"message": "No endpoints available matching your guardrail restrictions and data policy. Configure: https://openrouter.ai/settings/privacy", "code": 404}}

z-ai/glm-4.5-air:free

Status: invalid_schema
Total score: n/a
Winner: n/a
JSON: False
Schema: False
Latency: 33052 ms
Tokens: 2184
Cost: $0
Provider: Z.AI
Finish: length

Model response could not be parsed as valid judgement JSON.

Error: invalid_schema

Response preview
{
  "summary": "The debate presents a balanced discussion on the tension between speed and thoroughness in decision-making processes. Speaker A advocates for faster decisions when evidence is sufficient, while Speaker B emphasizes the risks of inadequate review. Both sides make reasonable points, with Speaker A particularly effective in arguing for proportionality in review processes.",
  "total_score": 7.1,
  "confidence": 0.85,
  "dimensions": {
    "logic": 8,
    "evidence": 3,
    "countera

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model | Dimension | Score | Confidence | Reason
No dimension scores.

Artifacts

Kind | Path | Size and Checksum
model_raw_response | jury/raw_openrouter_7cbaefe7-c406-4170-80ef-fb974f41bb1d_z-ai_glm-4.5-air_free.json | 12228a31950ded96a61ab
model_raw_response | jury/raw_openrouter_7cbaefe7-c406-4170-80ef-fb974f41bb1d_openai_gpt-oss-120b_free.json | 290268225afc7d0ed65
model_raw_response | jury/raw_openrouter_7cbaefe7-c406-4170-80ef-fb974f41bb1d_meta-llama_llama-3.3-70b-instruct_free.json | 634203f03d46b54e07b

Manifest

Software: 0.5.2
Prompt Hash: 782b115e9241a33b15
Rubric Hash: c9144dd5d4c5fcd823
Input Hash: 6644a1c7859bc24992
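The manifest hashes let two runs be compared for identical prompt, rubric, and input. A short content hash of this kind could be produced as sketched below. The use of SHA-256, the UTF-8 encoding, and the 18-character truncation are assumptions inferred only from the length of the values shown; the page does not name the algorithm.

```python
import hashlib


def short_hash(text: str, length: int = 18) -> str:
    """Return the first `length` hex characters of the SHA-256 of text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest[:length]
```

The same text always yields the same short hash, so a changed Prompt Hash between two runs indicates that the prompt text itself changed.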