Pilot lab: recovery after command failure and partial evidence
This seeded operational lab evaluates whether an agent can recover after a failed command, revise its plan, and stay useful without hiding uncertainty.
Operational polish shows up in whether the workflow turns a blocked command into a clean fallback instead of dead time.
Model quality shows up in whether the system can revise its assumptions after new evidence contradicts the first plan.
A useful workflow outcome leaves the user informed and still moving forward after the failure, not waiting for manual rescue.
The scenario is a tool-assisted terminal workflow with one or more failed commands, incomplete intermediate evidence, and a continuing task objective.
The agent must recover from a failed or blocked command, explain what changed, and continue making progress without inventing facts or derailing the task.
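One way to picture the recover-and-continue behavior is the minimal Python sketch below. It is an illustration under stated assumptions, not the lab's harness: the names `Attempt`, `try_command`, and `run_with_fallback` are hypothetical, and the commands in the demo are placeholders. The sketch tries candidate commands in order, records only what was actually observed for each, and stops at the first success.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Attempt:
    """One command attempt and what was actually observed (hypothetical type)."""
    argv: List[str]
    ok: bool
    detail: str  # exit code, timeout, or OS error text; never a guess


def try_command(argv: Sequence[str], timeout: float = 10.0) -> Attempt:
    """Run one command and record the observed outcome without raising."""
    try:
        proc = subprocess.run(
            list(argv), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return Attempt(list(argv), False, f"timed out after {timeout}s")
    except OSError as exc:
        # Covers missing executables and permission errors.
        return Attempt(list(argv), False, f"{type(exc).__name__}: {exc}")
    if proc.returncode != 0:
        return Attempt(list(argv), False, f"exit code {proc.returncode}")
    return Attempt(list(argv), True, "exit code 0")


def run_with_fallback(candidates: Sequence[Sequence[str]]) -> List[Attempt]:
    """Try each candidate in order; stop at the first success, keep full history."""
    history: List[Attempt] = []
    for argv in candidates:
        attempt = try_command(argv)
        history.append(attempt)
        if attempt.ok:
            break
    return history


if __name__ == "__main__":
    # The first command name is a placeholder assumed not to exist on the machine.
    history = run_with_fallback([
        ["hypothetical-missing-tool", "--version"],
        [sys.executable, "-c", "print('fallback ran')"],
    ])
    for attempt in history:
        status = "ok" if attempt.ok else "failed"
        print(f"{status}: {' '.join(attempt.argv)} ({attempt.detail})")
```

Keeping the full history, rather than only the final success, is what lets the agent explain what changed between the first plan and the fallback.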
Why this task matters
Real work includes blocked package installs, permission errors, timeouts, missing files, and incomplete local context.
The publication should value agents that:
- acknowledge failure precisely,
- choose a sane fallback,
- and communicate the remaining uncertainty.
An agent that hides a failed step behind confident narration creates more risk than one that pauses and adjusts.
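The three behaviors listed above (precise acknowledgement, a sane fallback, stated uncertainty) can be sketched as a small status record. This is a hypothetical structure for illustration only; `RecoveryReport` and its field names are assumptions, not part of the lab's specification, and the example values are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecoveryReport:
    """Hypothetical record of what an agent tells the user after a failure."""
    failed_step: str      # the command or action that was attempted
    observed_error: str   # exactly what was observed, not an inferred cause
    fallback: str         # the alternative the agent chose
    unverified: List[str] = field(default_factory=list)  # remaining unknowns

    def render(self) -> str:
        """Format the report so the failure is never hidden behind narration."""
        lines = [
            f"Failed: {self.failed_step} ({self.observed_error})",
            f"Fallback: {self.fallback}",
        ]
        if self.unverified:
            lines.append("Not yet verified:")
            lines.extend(f"  - {item}" for item in self.unverified)
        else:
            lines.append("Not yet verified: nothing outstanding")
        return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative values only.
    report = RecoveryReport(
        failed_step="pip install requests",
        observed_error="exit code 1",
        fallback="use urllib from the standard library",
        unverified=["whether the network block affects other hosts"],
    )
    print(report.render())
```

Separating the observed error from the list of unverified items keeps the agent from presenting a guess about the cause as if it were evidence.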