Only schema-valid judgements are used for consensus and score aggregation.
Debate Result · fake
Disputatio Fake E2E Fixture 20260601T201438Z
Deterministic fixture debate for M4 fake jury validation.
Debate 2f30e730-c722-43fd-bb58-4b123c6ee704
Why this result?
AI-assisted comparative judgement, not objective truth.
Run a real jury to generate a winner rationale.
Evaluated Source
Transparent demo/source context for this run.
LLM judgements are comparative signals, not objective truth.
Bias & Reliability Signals
Exploratory MVP signals, not causal bias proof.
Counts valid JSON responses before stricter judgement-schema checks.
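The two-stage check described above (JSON validity first, then the stricter judgement schema) might look roughly like the sketch below. The field names in `REQUIRED_FIELDS` and the helper `classify_response` are hypothetical; the actual judgement schema is not shown in this report.

```python
import json

# Hypothetical required fields; the real judgement schema is not shown here.
REQUIRED_FIELDS = {
    "winner": str,
    "score": (int, float),
    "confidence": (int, float),
}


def classify_response(raw: str) -> tuple[bool, bool]:
    """Return (json_valid, schema_valid) for one raw model response."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Not parseable at all: fails both stages.
        return False, False
    if not isinstance(payload, dict):
        # Valid JSON, but not an object, so it cannot match the schema.
        return True, False
    schema_ok = all(
        key in payload and isinstance(payload[key], expected)
        for key, expected in REQUIRED_FIELDS.items()
    )
    return True, schema_ok


responses = [
    '{"winner": "Unclear", "score": 7.03, "confidence": 0.91}',
    '{"winner": "Unclear"}',
    "not json",
]
flags = [classify_response(r) for r in responses]
json_valid_count = sum(json_ok for json_ok, _ in flags)
schema_valid_count = sum(schema_ok for _, schema_ok in flags)
```

Under this sketch, only responses that pass the second stage would feed consensus and score aggregation, while the first-stage count is what a JSON-validity signal would report.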
Derived from score divergence. This is not a causal provider-bias claim.
Score spread 1.50 across valid model judgements.
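The report does not define how the spread is computed. As a sketch, assuming it is the difference between the highest and lowest overall score among valid judgements, the three scores from this run reproduce the 1.50 figure. The function name `score_spread` is hypothetical.

```python
# Overall scores taken from the Jury Verdicts table of this run.
scores = {
    "model_a_budget": 7.03,
    "model_b_reasoning": 6.33,
    "model_c_contrast": 7.83,
}


def score_spread(model_scores: dict[str, float]) -> float:
    """Assumed definition: highest score minus lowest score, rounded to 2 places."""
    values = list(model_scores.values())
    return round(max(values) - min(values), 2)


spread = score_spread(scores)
print(f"Score spread {spread:.2f} across valid model judgements.")
```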
MVP run uses original order and visible Speaker A/B labels. Use anonymized/order-swapped research runs later.
Jury Verdicts
Clear product view first; raw diagnostics remain below.
| LLM Judge | Verdict | Score | Confidence | Why |
|---|---|---|---|---|
| model_a_budget (completed) | Unclear | 7.03 | 0.9100 | Fake Model A produced a deterministic fake judgement for deployment validation. |
| model_b_reasoning (completed) | Unclear | 6.33 | 0.9100 | Fake Model B produced a deterministic fake judgement for deployment validation. |
| model_c_contrast (completed) | Unclear | 7.83 | 0.9100 | Fake Model C produced a deterministic fake judgement for deployment validation. |
Model Cards
Operational details for trust and debugging.
model_a_budget (completed)
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 210 ms
- Tokens: 1240
- Cost: $0
- Provider: fake
- Finish: n/a
Fake Model A produced a deterministic fake judgement for deployment validation.
model_b_reasoning (completed)
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 310 ms
- Tokens: 1310
- Cost: $0
- Provider: fake
- Finish: n/a
Fake Model B produced a deterministic fake judgement for deployment validation.
model_c_contrast (completed)
- Winner: Unclear
- JSON: True
- Schema: True
- Latency: 260 ms
- Tokens: 1255
- Cost: $0
- Provider: fake
- Finish: n/a
Fake Model C produced a deterministic fake judgement for deployment validation.
Score Dimensions
Only schema-valid model judgements are shown here. Invalid JSON responses stay visible in the model cards and raw artifacts.
| Model | Dimension | Score | Confidence | Reason |
|---|---|---|---|---|
| model_a_budget | clarity | 7.40 | 0.9000 | Deterministic fake score for clarity. |
| model_a_budget | context_fidelity | 7.20 | 0.9000 | Deterministic fake score for context_fidelity. |
| model_a_budget | counterarguments | 7.10 | 0.9000 | Deterministic fake score for counterarguments. |
| model_a_budget | evidence | 6.70 | 0.9000 | Deterministic fake score for evidence. |
| model_a_budget | factual_grounding | 6.80 | 0.9000 | Deterministic fake score for factual_grounding. |
| model_a_budget | fairness | 7.00 | 0.9000 | Deterministic fake score for fairness. |
| model_a_budget | logic | 7.20 | 0.9000 | Deterministic fake score for logic. |
| model_a_budget | relevance | 7.30 | 0.9000 | Deterministic fake score for relevance. |
| model_a_budget | rhetorical_manipulation | 6.60 | 0.9000 | Deterministic fake score for rhetorical_manipulation. |
| model_b_reasoning | clarity | 6.70 | 0.9000 | Deterministic fake score for clarity. |
| model_b_reasoning | context_fidelity | 6.50 | 0.9000 | Deterministic fake score for context_fidelity. |
| model_b_reasoning | counterarguments | 6.40 | 0.9000 | Deterministic fake score for counterarguments. |
| model_b_reasoning | evidence | 6.00 | 0.9000 | Deterministic fake score for evidence. |
| model_b_reasoning | factual_grounding | 6.10 | 0.9000 | Deterministic fake score for factual_grounding. |
| model_b_reasoning | fairness | 6.30 | 0.9000 | Deterministic fake score for fairness. |
| model_b_reasoning | logic | 6.50 | 0.9000 | Deterministic fake score for logic. |
| model_b_reasoning | relevance | 6.60 | 0.9000 | Deterministic fake score for relevance. |
| model_b_reasoning | rhetorical_manipulation | 5.90 | 0.9000 | Deterministic fake score for rhetorical_manipulation. |
| model_c_contrast | clarity | 8.20 | 0.9000 | Deterministic fake score for clarity. |
| model_c_contrast | context_fidelity | 8.00 | 0.9000 | Deterministic fake score for context_fidelity. |
| model_c_contrast | counterarguments | 7.90 | 0.9000 | Deterministic fake score for counterarguments. |
| model_c_contrast | evidence | 7.50 | 0.9000 | Deterministic fake score for evidence. |
| model_c_contrast | factual_grounding | 7.60 | 0.9000 | Deterministic fake score for factual_grounding. |
| model_c_contrast | fairness | 7.80 | 0.9000 | Deterministic fake score for fairness. |
| model_c_contrast | logic | 8.00 | 0.9000 | Deterministic fake score for logic. |
| model_c_contrast | relevance | 8.10 | 0.9000 | Deterministic fake score for relevance. |
| model_c_contrast | rhetorical_manipulation | 7.40 | 0.9000 | Deterministic fake score for rhetorical_manipulation. |
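The overall scores in the Jury Verdicts table (7.03, 6.33, 7.83) are consistent with an unweighted mean of each model's nine dimension scores. Whether the application actually aggregates this way, or applies rubric weights that happen to agree here, is an assumption; the sketch below only checks the arithmetic against the values in the table above.

```python
from statistics import mean

# Dimension scores copied from the table above, in row order:
# clarity, context_fidelity, counterarguments, evidence, factual_grounding,
# fairness, logic, relevance, rhetorical_manipulation.
dimension_scores = {
    "model_a_budget": [7.40, 7.20, 7.10, 6.70, 6.80, 7.00, 7.20, 7.30, 6.60],
    "model_b_reasoning": [6.70, 6.50, 6.40, 6.00, 6.10, 6.30, 6.50, 6.60, 5.90],
    "model_c_contrast": [8.20, 8.00, 7.90, 7.50, 7.60, 7.80, 8.00, 8.10, 7.40],
}

# Assumed aggregation: plain (unweighted) mean, rounded to two decimals.
overall = {
    model: round(mean(values), 2)
    for model, values in dimension_scores.items()
}

for model, score in overall.items():
    print(f"{model}: {score:.2f}")
```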
Artifacts
| Kind | Path | Size | Checksum |
|---|---|---|---|
| report_md | reports/fake_jury_2f30e730-c722-43fd-bb58-4b123c6ee704.md | 669 | eddd5bff7a5963ef |
| report_html | reports/fake_jury_2f30e730-c722-43fd-bb58-4b123c6ee704.html | 439 | 8676be3ed1c0171e |
Manifest
- Software: 0.4.0
- Prompt Hash: 5358e5ba89ce6b5e2b
- Rubric Hash: c9144dd5d4c5fcd823
- Input Hash: 1e566c63812e138299