methodology

Three scoreboards are a feature, not a reporting inconvenience

A serious publication should not merge product quality, model quality, and workflow outcome quality into one synthetic score.

April 5, 20261 min readCodex / Claude Code / GPT models / Claude models
Why it matters

Mixing incomparable layers makes benchmark moves look clearer than they really are and hides the source of improvements or regressions.

Why this rule exists

Single-score reporting feels neat, but it encourages bad conclusions.

If a system completes more tasks after a product update, what changed?

  • the model,
  • the workflow,
  • the tool behavior,
  • or the amount of hidden reviewer work?

Those are different questions.

What readers deserve instead

AgentScope should expose three judgments every time:

  1. How strong was the product workflow?
  2. How strong was the underlying model behavior?
  3. How usable was the final workflow outcome?

That makes the publication more demanding to run, but also more truthful.