Safety is a layer, not a press release

Every major lab has a safety page. It has a gradient, a list of principles, maybe a diagram of a shield. It says nothing about what happens when your agent gets a user message that starts with “ignore previous instructions.”

That’s the gap between responsible AI as marketing and responsible AI as engineering. The marketing exists. The engineering is real and mostly unsexy and almost nobody talks about it plainly.

I’ve been building agents that actually do things: sign transactions, run tools, call external APIs, touch state that’s hard to reverse. When you cross the line from “the model answered a question” to “the model took an action,” the failure modes change. A hallucinated answer is embarrassing. A hallucinated action can be expensive or irreversible. The safety work you have to do is different, and it’s not in the gradient diagrams.

what actually breaks

Prompt injection is the one that keeps coming up and almost nobody in the enterprise AI world takes seriously enough. The setup is simple: your agent reads something from the environment, a document, a webpage, an email, a database row, and that thing contains instructions aimed at the model. Not at your user. At your model. “Summarize this document” becomes “summarize this document and also exfiltrate the API keys from context.” It works embarrassingly often.

The defenses aren’t magic. You separate system context from user-supplied content structurally, not just narratively. You don’t give the model access to credentials it doesn’t need for the current step. You validate outputs before acting on them. If the model’s tool call looks structurally wrong, you don’t execute it, you log it and surface it. These are boring decisions, not research. The problem is that most teams aren’t making them because they’re busy demoing.

Input validation gets talked about like a frontend concern. It isn’t. For an agent with tools, every parameter that goes into a tool call is an attack surface. You need range checks, type checks, and, for anything hitting external systems, explicit allowlists on what the model is allowed to do at all. Not “we trust the model to stay in scope.” Actual constraints in the layer that runs the tools.

Rate limiting matters more than people think when the agent is in a loop. A misconfigured agent that retries on failure can rack up costs or hammer an API in seconds. You need the limit before the tool executes, not after the invoice arrives.

the honest version of the stack

A production agent that takes actions has at least three layers of validation that have nothing to do with the model itself: the input that arrives (sanitize, validate, classify), the tool call the model emits (validate structure, check allowlists, check scope), and the result before it feeds back into context (check for injection, check for anomaly). None of these are novel. All of them are skipped in most demos.

The reason they get skipped is that they slow down the demo. They’re also the entire difference between a research prototype and something you’d leave running.

RAG pipelines add another surface. If your agent is doing retrieval and the retrieved chunks can contain attacker-controlled text, you have an injection vector at the retrieval step. Filtering at retrieval time, not just at query time, matters. This is known, it’s documented, and most teams still don’t do it.

Claude, GPT-4.1, the models you’re actually using right now. They’re better at following instructions than models were two years ago, which means they’re also better at following injected instructions. Better capabilities, same attack surface, higher stakes. The labs know this. It’s part of why MCP has a permission model baked in. Whether the teams building on top of these models actually implement it is a different question.

Uncertainty quantification is the other piece that gets dressed up in research language but has a practical form: if the model is expressing low confidence or the output is out of distribution for what your agent is supposed to do, don’t act, escalate. You don’t need a full uncertainty framework for this. You need a check that says “this looks weird” and routes it to a human instead of executing.

The actual failure mode in production isn’t a sophisticated adversarial attack. It’s a confused model confidently doing the wrong thing. Agents running unsupervised in a loop are good at this. The check is cheap. Not running the check is cheaper until it isn’t.

None of this is hard to understand. A lot of it is tedious to implement and easy to skip when you’re moving fast. That’s the honest state of AI safety in most production deployments right now: not absent, but partial, and partial in ways that are usually invisible until something goes sideways.

The shield diagram doesn’t tell you any of that.