Head-to-Head Labs

Task reports

Public same-prompt examples first, then seeded pilot reports that define the future test structure.

2 public
6 seeded
Public Same-Prompt App Build
Low reviewer burden

Public result: same product brief, Claude Code and Codex branches

A public GitHub comparison where the same competitive-intelligence app prompts produced separate Claude Code and Codex implementations.

Codex, Claude Code
May 26, 2026
Public Same-Prompt CLI Build
Low reviewer burden

Public result: same todo CLI prompt across Claude Code and Codex

A public benchmark folder with generated Node.js todo CLI implementations from Claude Code and Codex using the same prompt.

Codex, Claude Code
May 26, 2026

A seeded lab report that demonstrates how AgentScope should document repository onboarding tasks, evidence trails, and reviewer burden.

Codex, Claude Code
April 12, 2026
Bug Fix Under Constraints
Medium reviewer burden

Pilot lab: bug fix under constraints with tight patch scope

A seeded bug-fix report focused on whether an agent can isolate a defect, keep edits narrow, and avoid collateral damage.

Codex, Claude Code
April 11, 2026

A seeded review-quality lab that focuses on hidden regressions, weak assumptions, and whether the agent can challenge a plausible-looking diff.

Codex, Claude Code
April 9, 2026
Refactor With Intent Preservation
Medium reviewer burden

Pilot lab: refactor with intent preservation instead of style drift

A seeded refactor report that evaluates whether a system can improve structure while preserving behavior, boundaries, and local conventions.

Codex, Claude Code
April 8, 2026

A seeded design-and-implementation lab for judging whether a coding agent can translate a product brief into intentional interface choices.

Codex, Claude Code
April 7, 2026
Error Recovery After Failed Command
Low reviewer burden

Pilot lab: recovery after command failure and partial evidence

A seeded operational lab that evaluates whether the agent can recover after a failed command, revise its plan, and stay useful without hiding uncertainty.

Codex, Claude Code
April 6, 2026