Search the archive

Find notes, labs, failures, systems, and themes from one place.

Head-to-Head Labs · May 26, 2026
Public result: same product brief, Claude Code and Codex branches

A public GitHub comparison where the same competitive-intelligence app prompts produced separate Claude Code and Codex implementations.

Codex, Claude Code
Head-to-Head Labs · May 26, 2026
Public result: same todo CLI prompt across Claude Code and Codex

A public benchmark folder with generated Node.js todo CLI implementations from Claude Code and Codex using the same prompt.

Codex, Claude Code
Latest Changes · April 14, 2026
Why workflow notes should trigger retests before headline takes

Product-level workflow changes can alter real usefulness even when the underlying model story appears mostly unchanged.

Codex, Claude Code
Head-to-Head Labs · April 12, 2026
Pilot lab: legacy repo onboarding without architecture hallucination

A seeded lab report that demonstrates how AgentScope should document repository onboarding tasks, evidence trails, and reviewer burden.

Codex, Claude Code
Head-to-Head Labs · April 11, 2026
Pilot lab: bug fix under constraints with tight patch scope

A seeded bug-fix report focused on whether an agent can isolate a defect, keep edits narrow, and avoid collateral damage.

Codex, Claude Code
Latest Changes · April 10, 2026
Context claims only matter when they survive repo-scale tasks

Long-context positioning is useful only if the system maintains structure, scope control, and reviewer trust on actual engineering work.

GPT models, Claude models
Head-to-Head Labs · April 9, 2026
Pilot lab: risky diff review where confidence is not enough

A seeded review-quality lab that focuses on hidden regressions, weak assumptions, and whether the agent can challenge a plausible-looking diff.

Codex, Claude Code
Head-to-Head Labs · April 8, 2026
Pilot lab: refactor with intent preservation instead of style drift

A seeded refactor report that evaluates whether a system can improve structure while preserving behavior, boundaries, and local conventions.

Codex, Claude Code
Head-to-Head Labs · April 7, 2026
Pilot lab: UI generation from a brief without falling into generic patterns

A seeded design-and-implementation lab for judging whether a coding agent can translate a product brief into intentional interface choices.

Codex, Claude Code
Community Pulse · April 7, 2026
Community pulse: week of April 7, 2026

A seeded weekly pulse brief showing how AgentScope clusters discussion into praise, complaints, confusion, and momentum without pretending to automate certainty.

Codex, Claude Code, GPT models, Claude models
Head-to-Head Labs · April 6, 2026
Pilot lab: recovery after command failure and partial evidence

A seeded operational lab that evaluates whether the agent can recover after a failed command, revise its plan, and stay useful without hiding uncertainty.

Codex, Claude Code
Latest Changes · April 5, 2026
Three scoreboards are a feature, not a reporting inconvenience

A serious publication should not merge product quality, model quality, and workflow outcome quality into one synthetic score.

Codex, Claude Code, GPT models, Claude models
Failure Library · April 4, 2026
Failure case: over-editing after only partial repository understanding

A common failure mode where the agent reads just enough of a repository to sound credible, then expands the patch beyond what the evidence supports.

Codex, Claude Code
Failure Library · April 3, 2026
Failure case: confident review that missed the runtime-changing risk

A review can sound sharp, cover style issues, and still miss the one behavior-changing problem that actually matters.

Codex, Claude Code
Methodology · April 1, 2026
Editorial methodology for AI coding agent analysis

The core rules AgentScope uses to keep product comparisons, model comparisons, and workflow outcome judgments grounded in evidence instead of hype.

Codex, Claude Code, GPT models, Claude models