Search the archive
Find notes, labs, failures, systems, and themes from one place.
A public GitHub comparison in which the same competitive-intelligence app prompts produced separate Claude Code and Codex implementations.
A public benchmark folder of Node.js todo CLI implementations generated by Claude Code and Codex from the same prompt.
Product-level workflow changes can alter real-world usefulness even when the underlying model appears mostly unchanged.
A seeded lab report that demonstrates how AgentScope should document repository onboarding tasks, evidence trails, and reviewer burden.
A seeded bug-fix report focused on whether an agent can isolate a defect, keep edits narrow, and avoid collateral damage.
Long-context positioning is useful only if the system maintains structure, scope control, and reviewer trust on actual engineering work.
A seeded review-quality lab that focuses on hidden regressions, weak assumptions, and whether the agent can challenge a plausible-looking diff.
A seeded refactor report that evaluates whether a system can improve structure while preserving behavior, boundaries, and local conventions.
A seeded design-and-implementation lab for judging whether a coding agent can translate a product brief into intentional interface choices.
A seeded weekly pulse brief showing how AgentScope clusters discussion into praise, complaints, confusion, and momentum without pretending to automate certainty.
A seeded operational lab that evaluates whether the agent can recover after a failed command, revise its plan, and stay useful without hiding uncertainty.
A serious publication should not merge product quality, model quality, and workflow outcome quality into one synthetic score.
A common failure mode where the agent reads just enough of a repository to sound credible, then expands the patch beyond what the evidence supports.
A review can sound sharp, cover style issues, and still miss the one behavior-changing problem that actually matters.
The core rules AgentScope uses to keep product comparisons, model comparisons, and workflow outcome judgments grounded in evidence instead of hype.