Safety, Reliability, and the Limits of Agent Autonomy
4 / 5Agents are powerful and their failure modes are qualitatively different from chat AI. A chat model that hallucinates produces bad text. An agent that malfunctions can take consequential actions in the real world: sending wrong emails, deleting files, making API calls with financial implications.
This lesson is about understanding and managing those risks.
The Compounding Error Problem
In a chat interface, errors are isolated. One bad response does not affect the next.
In an agent, errors compound. If an early step produces wrong results, subsequent steps build on that wrong foundation. By the time the agent reaches its goal, the output may be deeply wrong in ways that are not obvious from the final result alone.
This means agent outputs require more careful validation than chat outputs, not less.
The Categories of Agent Risk
Irreversible actions Actions the agent takes that cannot be undone: deleting files, sending emails, making purchases, publishing content, executing transactions. Agents should require explicit human confirmation before taking irreversible actions.
Scope creep Agents that interpret their goal broadly may take actions beyond what was intended. A "clean up my inbox" instruction interpreted overly broadly could delete important emails.
Prompt injection If an agent reads external content (web pages, emails, documents) as part of its task, malicious content in that environment could manipulate the agent's behaviour. This is a genuine security vulnerability in production agents.
Infinite loops Agents can get stuck in loops — taking the same action repeatedly, or oscillating between two states. Resource limits and loop detection are essential.
Credential and permission abuse An agent with broad permissions (access to file system, email, APIs) can cause significant damage if it malfunctions or is manipulated.
Design Principles for Safer Agents
Minimum necessary permissions Grant agents only the permissions they need for the specific task. An email draft agent does not need file system access.
Confirmation before irreversible actions Pause and require human confirmation before any action that cannot be undone. Design this into the architecture, not as an afterthought.
Dry run mode Build a mode where the agent plans and describes what it would do without actually doing it. Review the plan before execution.
Audit logs Log every action the agent takes. You need to be able to reconstruct what happened when something goes wrong.
Sandboxed execution environments For agents that execute code, run in sandboxed environments that cannot affect systems outside the sandbox.
Hard limits on resource consumption Limit the number of steps, API calls, tokens consumed, or time elapsed before the agent must pause and report to a human.
The Current State of Agent Reliability
Be honest about the current maturity of agent technology:
- Agents work well for narrow, well-defined tasks with clear success criteria
- Agents are unreliable for open-ended tasks requiring significant judgment
- Multi-step agents fail more often than simple agents
- Production deployments require careful monitoring and fallback mechanisms
- The technology is improving rapidly, but current limitations are real
The organisations getting the most value from agents are those that deploy them narrowly, monitor them closely, and maintain human oversight of consequential decisions.