methodology
Three scoreboards are a feature, not a reporting inconvenience
A serious publication should not merge product quality, model quality, and workflow outcome quality into one synthetic score.
April 5, 20261 min readCodex / Claude Code / GPT models / Claude models
Why it matters
Mixing incomparable layers makes benchmark moves look clearer than they really are and hides the source of improvements or regressions.
Why this rule exists
Single-score reporting feels neat, but it encourages bad conclusions.
If a system completes more tasks after a product update, what changed?
- the model,
- the workflow,
- the tool behavior,
- or the amount of hidden reviewer work?
Those are different questions.
What readers deserve instead
AgentScope should expose three judgments every time:
- How strong was the product workflow?
- How strong was the underlying model behavior?
- How usable was the final workflow outcome?
That makes the publication more demanding to run, but also more truthful.