methodology

Three scoreboards are a feature, not a reporting inconvenience

A serious publication should not merge product quality, model quality, and workflow outcome quality into one synthetic score.

April 5, 20261 min readCodex / Claude Code / GPT models / Claude models

Why it matters

Mixing incomparable layers makes benchmark moves look clearer than they really are and hides the source of improvements or regressions.

Why this rule exists

Single-score reporting feels neat, but it encourages bad conclusions.

If a system completes more tasks after a product update, what changed?

Those are different questions.

AgentScope should expose three judgments every time:

That makes the publication more demanding to run, but also more truthful.