Debate Result · real

Disputatio Fake E2E Fixture 20260601T201940Z

Deterministic fixture debate for M4 fake jury validation.

Debate bf57f112-9712-4d95-8325-9155e93fa052

failed

Consensus Winner: Unclear (0/3 valid judges)

Model Agreement: n/a (Divergence: n/a)

Closest To Consensus: n/a. No schema-valid model judgement yet.

Cost: $0.000000 (2184 tokens)

Why this result?

AI-assisted comparative judgement, not objective truth.

Run a real jury to generate a winner rationale.

No aggregation result yet. At least two schema-valid judgements are needed.

Evaluated Source

Transparent demo/source context for this run.

Source Type: fixture

Purpose: Manual debate source.

Input Scope: internal_only. Full available manual transcript.

LLM judgements are comparative signals, not objective truth.

Bias & Reliability Signals

Exploratory MVP signals, not causal bias proof.

Schema reliability 0/3

Only schema-valid judgements are used for consensus and score aggregation.
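The gating rule above (only schema-valid judgements count, and at least two are needed) can be sketched as follows. This is a minimal illustration, not the application's actual aggregation code: the `Judgement` class, the `consensus_winner` function, and the simple majority vote are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# From the page: "At least two schema-valid judgements are needed."
MIN_VALID_JUDGEMENTS = 2


@dataclass
class Judgement:
    """Hypothetical record for one judge's result."""
    model: str
    schema_valid: bool
    winner: Optional[str] = None  # e.g. "A" or "B"; None when the run failed


def consensus_winner(judgements: list[Judgement]) -> str:
    """Return the majority winner, or "Unclear" when aggregation is not possible."""
    # Failed and schema-invalid judgements are ignored entirely.
    valid = [j for j in judgements if j.schema_valid and j.winner]
    if len(valid) < MIN_VALID_JUDGEMENTS:
        return "Unclear"
    counts = Counter(j.winner for j in valid).most_common()
    # A tie between the top two candidates is also reported as "Unclear"
    # (an assumption; the real tie-breaking rule is not shown on this page).
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "Unclear"
    return counts[0][0]
```

With three failed judges, as in this run, the sketch returns "Unclear", matching the 0/3 result shown above.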

JSON reliability 0/3

Counts valid JSON responses before stricter judgement-schema checks.
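The difference between the two reliability counters can be illustrated with a two-stage check: parse first, then validate. This is a sketch under stated assumptions; the required keys in `REQUIRED_KEYS` are hypothetical, since the real judgement schema is not shown on this page.

```python
import json

# Hypothetical required fields; the actual judgement schema is not shown here.
REQUIRED_KEYS = {"summary", "winner", "scores"}


def classify_response(raw: str) -> tuple[bool, bool]:
    """Return (json_ok, schema_ok) for one raw model response."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Not even valid JSON, so it cannot be schema-valid either.
        return False, False
    schema_ok = isinstance(payload, dict) and REQUIRED_KEYS.issubset(payload.keys())
    return True, schema_ok


def reliability(responses: list[str]) -> tuple[str, str]:
    """Return (json_reliability, schema_reliability) as "k/n" strings."""
    results = [classify_response(r) for r in responses]
    n = len(responses)
    json_count = sum(1 for json_ok, _ in results if json_ok)
    schema_count = sum(1 for _, schema_ok in results if schema_ok)
    return f"{json_count}/{n}", f"{schema_count}/{n}"
```

A response that parses as JSON but lacks a required field raises the JSON count without raising the schema count, which is why the two signals are reported separately.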

Provider disagreement n/a

Derived from score divergence. This is not a causal provider-bias claim.

Evidence-score spread n/a

Need at least two schema-valid judgements.

Rhetorical manipulation spread n/a

Need at least two schema-valid judgements.

Position / identity bias not tested

MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.

Jury Verdicts

Clear product view first; raw diagnostics remain below.

LLM Judge	Status	Verdict	Score	Confidence	Why
meta-llama/llama-3.3-70b-instruct:free	failed	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.
openai/gpt-oss-120b:free	failed	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.
z-ai/glm-4.5-air:free	invalid_schema	n/a	n/a	n/a	Model response could not be parsed as valid judgement JSON.

Model Cards

Operational details for trust and debugging.

meta-llama/llama-3.3-70b-instruct:free

failed

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: n/a ms
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic

OpenRouter HTTP 429: {"error": {"message": "Provider returned error", "code": 429, "metadata": {"raw": "meta-llama/llama-3.3-70b-instruct:free is temporarily rate-limited upstream. Please retry shortly, or add your own key to accumulate your rate limits: https://openrouter.ai/settings/integrations", "provider_name": "Venice", "is_byok": false, "retry_after_seconds": 17, "retry_after_seconds_raw": 16.229, "headers": {"Retry-After": "17"}}}, "user_id": "user_3Bl6LBShLIGou4GzxDLzm73U616"}
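The 429 diagnostic above carries a retry hint in `retry_after_seconds` and in the `Retry-After` header. A caller could extract it before rescheduling the judge, roughly as sketched below. The function name `retry_delay_seconds` and the assumption that the diagnostic always has the form `OpenRouter HTTP <status>: <json>` are illustrative, inferred only from the strings shown on this page.

```python
import json
import re
from typing import Optional


def retry_delay_seconds(diagnostic: str) -> Optional[float]:
    """Extract a retry delay from an 'OpenRouter HTTP 429: {...}' diagnostic.

    Returns None when the status is not 429 or no usable hint is present.
    """
    match = re.match(r"OpenRouter HTTP (\d+): (.*)", diagnostic, re.DOTALL)
    if not match or match.group(1) != "429":
        return None
    try:
        body = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    if not isinstance(body, dict):
        return None
    metadata = body.get("error", {}).get("metadata", {})
    # Prefer the explicit numeric field, then fall back to the header value.
    candidates = [
        metadata.get("retry_after_seconds"),
        metadata.get("headers", {}).get("Retry-After"),
    ]
    for value in candidates:
        if value is None:
            continue
        try:
            return float(value)
        except (TypeError, ValueError):
            # Retry-After may also be an HTTP date; this sketch skips that form.
            continue
    return None
```

For the diagnostic shown above this yields 17.0 seconds. A 404 diagnostic, like the one on the next model card, yields `None`, since retrying will not help until the privacy settings change.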

openai/gpt-oss-120b:free

failed

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: n/a ms
Tokens: 0
Cost: $0
Provider: n/a
Finish: n/a

Model response could not be parsed as valid judgement JSON.

Error: RuntimeError

Error diagnostic

OpenRouter HTTP 404: {"error": {"message": "No endpoints available matching your guardrail restrictions and data policy. Configure: https://openrouter.ai/settings/privacy", "code": 404}}

z-ai/glm-4.5-air:free

invalid_schema

Total score: n/a

Winner: n/a
JSON: False
Schema: False
Latency: 69146 ms
Tokens: 2184
Cost: $0
Provider: Z.AI
Finish: length

Model response could not be parsed as valid judgement JSON.

Error: invalid_schema

Response preview

{
  "summary": "The debate centers on the balance between speed and thoroughness in decision-making processes, particularly for infrastructure projects. Speaker A argues for faster decisions when evidence is sufficient, emphasizing that delays increase costs and reduce public trust. Speaker B counters that weak review creates long-term risks and that failed projects due to poor evidence are more costly than slower, well-tested decisions. Speaker A clarifies that they agree evidence matters but a
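This card reports `Finish: length` together with `JSON: False`, which suggests the model ran out of output tokens before closing the JSON object. A diagnostic helper could separate that case from other parse failures, as in the sketch below. The function name `diagnose_response` and the labels `"ok"`, `"truncated"`, and `"malformed"` are hypothetical and are not statuses used by the application.

```python
import json


def diagnose_response(raw_text: str, finish_reason: str) -> str:
    """Classify a raw model response as "ok", "truncated", or "malformed"."""
    try:
        json.loads(raw_text)
        return "ok"
    except json.JSONDecodeError:
        # A parse failure combined with finish_reason == "length" most likely
        # means the output was cut off at the token limit.
        if finish_reason == "length":
            return "truncated"
        return "malformed"
```

A "truncated" result points toward raising the output token limit or shortening the requested judgement, while "malformed" points toward the prompt or the model's formatting.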

Score Dimensions

Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.

Model	Dimension	Score	Confidence	Reason
No dimension scores.

Artifacts

Kind	Path	Size	Checksum
model_raw_response	`jury/raw_openrouter_bf57f112-9712-4d95-8325-9155e93fa052_z-ai_glm-4.5-air_free.json`	24109	`7fcc611b0b5dc564`
model_raw_response	`jury/raw_openrouter_bf57f112-9712-4d95-8325-9155e93fa052_openai_gpt-oss-120b_free.json`	290	`268225afc7d0ed65`
model_raw_response	`jury/raw_openrouter_bf57f112-9712-4d95-8325-9155e93fa052_meta-llama_llama-3.3-70b-instruct_free.json`	634	`7889c556b07dced8`

Manifest

Software: 0.5.3
Prompt Hash: 782b115e9241a33b15
Rubric Hash: c9144dd5d4c5fcd823
Input Hash: 6644a1c7859bc24992