Building Human-in-the-Loop Approval for AI Agents Yourself: The Hidden Costs
The hard part of human-in-the-loop for AI agents is not pausing the agent — it is everything around the pause. Here is what you are actually signing up to build and maintain, and the point where buying becomes the rational call.
In short
Building your own human-in-the-loop approval for AI agents is cheap to prototype and expensive to operate. A framework primitive like LangGraph’s interrupt() handles the pause in an afternoon. Production then asks for durable resume, cross-channel routing, timeouts and fallbacks, idempotency, a declarative policy engine, and a tamper-evident audit trail — the layer you build once, get audited on, and maintain forever. Build it if you have the senior-engineer time and no opportunity cost. Otherwise, buy.
Key takeaways
- The pause is the easy 10%. The durable, multi-channel, policy-driven, auditable control plane around it is the other 90%.
- The audit trail ships last and weakest in almost every DIY build — and it is exactly what a security review or an EU AI Act audit asks for.
- DIY is rational when you need total customization and your senior-engineer time has no opportunity cost. For most teams shipping product, that time does have an opportunity cost.
- Buying is not “less control” — a good control plane is declarative policy plus open SDKs, so you keep the control and shed the maintenance.
The pause is the easy part
Every human-in-the-loop demo looks the same and looks easy: the agent reaches a risky step, you suspend it, you ping a human, you resume on their reply. Frameworks now ship this pause as a primitive: LangGraph has interrupt(), and the Vercel AI SDK and the OpenAI and Claude agent SDKs have their own approval hooks. You can wire a believable proof-of-concept before lunch.
That demo is a trap, because the pause is maybe 10% of the work. The other 90% is everything that has to be true for the pause to survive contact with production traffic, a real on-call rotation, and an auditor. This is an honest accounting of that 90% — from someone who builds it for a living.
```python
from langgraph.types import interrupt


class ToolRejected(Exception):
    """Raised when the approver does not approve the tool call."""


def approve_node(state):
    # The graph suspends here and waits for a resume value.
    decision = interrupt({"tool": "issue_refund", "args": state["args"]})
    if decision != "approve":
        raise ToolRejected()
    return state

# ...and, somewhere else, you POST a Slack message, catch the reply,
# and resume the graph with Command(resume="approve"). Looks done. It isn't.
```
What production actually demands
Here is the work that the demo hides. None of it is exotic; all of it is load-bearing, and skipping any one piece is how the 3am incident happens.
1. Durable state and resume. The pause can outlive the process. Your agent runs on serverless or restarts on deploy; the human replies twenty minutes later. You need the paused call, its arguments, and its context persisted somewhere durable, keyed so the right reply resumes the right run — not held in memory that a cold start erases.
2. Cross-channel routing and the Slack plumbing. “Post to Slack” is a project, not a line. OAuth install and token storage, token refresh, interactivity endpoints, signature verification, backing off 429s, handling the message that was never answered, mapping a Slack user back to an authorized approver. Then the same logic again for a web inbox, and again when someone asks for email.
3. Timeouts, fallbacks, and a fail posture. What happens when no one approves in time? You need a per-call timeout and a deliberate decision: fail closed (the tool never runs — safer for high-risk actions) or fail open (run anyway and reconcile later). Most DIY builds have no answer here until the first stuck pipeline forces one.
4. Idempotency and race conditions. A resumed run must not execute the tool twice. Two approvers must not both decide. A retry must not re-charge the card. This is the class of bug that does not show up in the demo and does show up in the incident review.
5. A real policy layer, not an if/else. You do not want a human approving every $3 refund. You want the safe majority auto-approved and only the rest routed to a person. That means declarative, versioned rules: once policy logic is buried in application code, “which rule, which version, decided this?” becomes a question you cannot answer after the fact.
6. Edited-arguments execution and provenance. Approvers do not only say yes or no — they edit. “Approve, but cap it at €500.” Your gate must execute the edited arguments, never the originals, and record which is which. And for every action that ran without a human, you need to record exactly why: a policy, a tool flag, a standing grant, or a named person.
7. A tamper-evident audit trail. This is the one that ships last and weakest. A plain INSERT into a logs table is not evidence — a row you can UPDATE proves nothing to an auditor. Doing this right means append-only enforcement at the database, hash-chaining each event, and a way to verify the chain. Most teams discover this gap during their first security review, which is the worst time to discover it.
8. Multi-tenant isolation, approver ergonomics, and observability. Row-level isolation so one customer never sees another’s approvals. Delegation so authority survives a vacation. An off-shift posture so the 3am DMs stop. And a way to see whether your approvers — not your agents — are the bottleneck, per rule.
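To make items 1, 3, 4, and 6 concrete, here is a minimal sketch of a durable approval store. It assumes SQLite as a stand-in for whatever database you actually run, and every name in it (the `approvals` table, `request_approval`, `decide`, `claim_for_execution`) is hypothetical rather than taken from any framework. The paused call is persisted under its own id, the first decision on a live request wins, an expired request can never be approved (fail closed), and the approved or edited arguments can be claimed for execution only once.

```python
import json
import sqlite3
import time
import uuid

# Hypothetical schema: one row per paused tool call.
SCHEMA = """
CREATE TABLE IF NOT EXISTS approvals (
    id          TEXT PRIMARY KEY,
    tool        TEXT NOT NULL,
    args        TEXT NOT NULL,      -- original arguments, as JSON
    final_args  TEXT,               -- arguments to execute (possibly edited)
    status      TEXT NOT NULL DEFAULT 'pending',
    decided_by  TEXT,
    expires_at  REAL NOT NULL,
    executed    INTEGER NOT NULL DEFAULT 0
)
"""


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(SCHEMA)
    return db


def request_approval(db, tool, args, timeout_s, now=None):
    """Persist the paused call so a restart or cold start cannot lose it."""
    now = time.time() if now is None else now
    approval_id = str(uuid.uuid4())
    db.execute(
        "INSERT INTO approvals (id, tool, args, expires_at) VALUES (?, ?, ?, ?)",
        (approval_id, tool, json.dumps(args), now + timeout_s),
    )
    db.commit()
    return approval_id


def decide(db, approval_id, approver, decision, edited_args=None, now=None):
    """Record a decision. Returns True only for the first decision on a
    request that is still pending and has not timed out."""
    if decision not in ("approved", "rejected"):
        raise ValueError("decision must be 'approved' or 'rejected'")
    now = time.time() if now is None else now
    row = db.execute(
        "SELECT args FROM approvals WHERE id = ?", (approval_id,)
    ).fetchone()
    if row is None:
        return False
    # Keep the originals in `args`; store what should actually run separately.
    final_args = json.dumps(edited_args) if edited_args is not None else row[0]
    cur = db.execute(
        "UPDATE approvals SET status = ?, decided_by = ?, final_args = ? "
        "WHERE id = ? AND status = 'pending' AND expires_at > ?",
        (decision, approver, final_args, approval_id, now),
    )
    db.commit()
    return cur.rowcount == 1


def claim_for_execution(db, approval_id):
    """Return the arguments to run, at most once. Returns None for anything
    not approved (rejected, still pending, timed out) or already executed."""
    cur = db.execute(
        "UPDATE approvals SET executed = 1 "
        "WHERE id = ? AND status = 'approved' AND executed = 0",
        (approval_id,),
    )
    db.commit()
    if cur.rowcount != 1:
        return None
    row = db.execute(
        "SELECT final_args FROM approvals WHERE id = ?", (approval_id,)
    ).fetchone()
    return json.loads(row[0])
```

The conditional `UPDATE ... WHERE status = 'pending'` is what lets the database, rather than application code, arbitrate between two approvers or a retried resume. A fail-open posture would need an explicit extra path; this sketch deliberately has none.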
DIY vs a control plane, honestly
The fair comparison is not “a webhook” vs “a vendor.” It is the full layer you would have to build and own, vs a control plane that ships and maintains it for you.
| Dimension | Build it yourself | Control plane (e.g. Pliuz) |
|---|---|---|
| Pause the agent | interrupt() / framework hook — easy | One decorator wraps the call — easy |
| Durable resume | You design persistence + keying | Held server-side; long-poll resume |
| Routing | Build Slack OAuth, retries, web inbox separately | Slack + web inbox + REST, one model |
| Policy | if/else in code, unversioned | Declarative JSONLogic, versioned, shadow mode |
| Timeouts / fail posture | Roll your own, usually after an incident | Per-call timeout; fail-closed by default |
| Audit trail | Mutable logs table; weak under audit | Append-only, SHA-256 hash-chained, verifiable |
| Maintenance | Yours forever (Slack, frameworks, regs) | Absorbed by the vendor |
| Customization | Total — every line is yours | Declarative policy + open-source SDKs |
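The policy row is worth illustrating, because the difference between "if/else in code" and "declarative, versioned" is mostly about what you can answer afterwards. The sketch below is a hand-rolled, deliberately tiny rule evaluator, not JSONLogic and not any vendor's engine; `Rule`, `Verdict`, `evaluate`, and the sample policy are all hypothetical names. The point it shows is that every verdict carries the rule name and policy version that produced it, and that an unmatched call defaults to a human.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Comparison operators a rule may reference by name.
OPS = {
    "lte": lambda a, b: a <= b,
    "gt": lambda a, b: a > b,
    "eq": lambda a, b: a == b,
}


@dataclass(frozen=True)
class Rule:
    name: str
    tool: str
    field: str
    op: str        # one of the keys in OPS
    value: Any
    outcome: str   # "auto_approve" or "require_human"


@dataclass(frozen=True)
class Verdict:
    outcome: str
    rule: Optional[str]      # which rule decided, or None if nothing matched
    policy_version: str      # which version of the policy was in force


def evaluate(policy: dict, tool: str, args: dict) -> Verdict:
    """Return the verdict of the first matching rule, in order."""
    for rule in policy["rules"]:
        if rule.tool != tool or rule.field not in args:
            continue
        if OPS[rule.op](args[rule.field], rule.value):
            return Verdict(rule.outcome, rule.name, policy["version"])
    # Nothing matched: route to a human rather than guess.
    return Verdict("require_human", None, policy["version"])


# Illustrative policy; thresholds and names are made up for the example.
POLICY = {
    "version": "v3",
    "rules": [
        Rule("small-refund", "issue_refund", "amount_cents", "lte", 5000, "auto_approve"),
        Rule("large-refund", "issue_refund", "amount_cents", "gt", 5000, "require_human"),
    ],
}
```

Because the rules are data, the policy can be stored, diffed, and versioned independently of the agent code, and the `Verdict` can be written straight into the audit record.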
When building it yourself is the right call
This is not a one-sided pitch. Building in-house is rational when a few things are true at once: you need customization a declarative policy engine cannot express, your team genuinely has the senior-engineer time to absorb the audit-chain design and the regulatory mapping, and that time has no opportunity cost — there is no higher-value feature it displaces. If your approval logic is your product, own it.
For most teams shipping an AI product, those conditions do not hold. The approval layer is undifferentiated heavy lifting: necessary, hard to get right, and worth zero to your customers as a thing you built rather than bought. That is the textbook case for buying.
What buying actually looks like
The fear with “buy” is losing control. A good control plane inverts that: the control lives in declarative policy you own and open-source SDKs you can fork, while the plumbing and the audit chain are maintained for you. With Pliuz, the gate is one decorator, framework-agnostic, with no LLM in the critical path:
```python
from pliuz import gated


# Assumes `stripe` is an already-configured Stripe client.
@gated(policy="finance-approvals", redact=["recipient.iban"], timeout_s=300)
def issue_refund(customer_id: str, amount_cents: int):
    return stripe.refunds.create(...)

# Pauses, evaluates your policy, routes to Slack or the web inbox, runs the
# approved (or edited) args, and chains the decision into a verifiable audit log.
```
The decorator works the same whether your agent runs on LangGraph, CrewAI, the Vercel AI SDK, the OpenAI Agents SDK, the Claude Agent SDK, or a plain HTTP runner. Your policy decides what auto-approves; everything else routes to a human who can approve, edit, or reject; and every decision lands in an append-only, hash-chained audit trail you can verify yourself. That is the 90% you were about to build, already built and maintained.
The bottom line
Pausing an AI agent is a solved, one-line problem. Governing what happens during and after the pause — across channels, under load, with a record you can defend — is the real engineering, and it never stops being maintained. Build it if approval is your product and you have the time to spare. If you would rather spend that time on the product your customers actually pay for, that is exactly the work a human-in-the-loop control plane exists to take off your plate.
Frequently asked questions
Can I just use LangGraph interrupt() for human-in-the-loop?
interrupt() handles the pause — it suspends the graph and waits for a resume value. It does not give you the rest of a production approval layer: durable state across restarts and serverless cold-starts, cross-channel routing to Slack with retries and token refresh, approval timeouts and a defined fail-open or fail-closed behavior, idempotency so a resumed run does not double-execute the tool, a declarative policy engine to auto-approve the safe majority, and a tamper-evident audit trail. interrupt() is the right primitive if you are wiring everything else yourself; it is not, by itself, a control plane.
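As one small example of the Slack work that sits outside interrupt(), the endpoint that receives an approver's button click has to confirm the request really came from Slack. The sketch below follows Slack's documented v0 signing scheme (HMAC-SHA256 over `v0:<timestamp>:<raw body>`, compared against the `X-Slack-Signature` header, with a five-minute replay window). The function name and parameters are hypothetical, and how you obtain the raw body and headers depends on your web framework.

```python
import hashlib
import hmac
import time

MAX_SKEW_S = 60 * 5  # reject requests older than five minutes (replay guard)


def verify_slack_signature(signing_secret, timestamp, body, signature, now=None):
    """Return True if `signature` matches Slack's v0 scheme for this request.

    signing_secret: the app's signing secret (str)
    timestamp:      value of the X-Slack-Request-Timestamp header (str)
    body:           the raw, unparsed request body (bytes)
    signature:      value of the X-Slack-Signature header (str)
    """
    now = time.time() if now is None else now
    try:
        ts = int(timestamp)
    except (TypeError, ValueError):
        return False
    if abs(now - ts) > MAX_SKEW_S:
        return False
    base = b"v0:" + timestamp.encode("utf-8") + b":" + body
    digest = hmac.new(signing_secret.encode("utf-8"), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Verification only proves the request came from Slack. Mapping the Slack user in the payload to someone authorized to approve that particular action is a separate check.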
How long does it take to build a human approval gate for AI agents?
A working demo — pause the agent, post to Slack, resume on reply — is an afternoon. A version you would put in front of production traffic is several senior-engineer weeks, and then it is never finished: durable resume, idempotency, timeouts and fallbacks, a real policy layer, multi-tenant isolation, approver ergonomics, and a defensible audit trail each add scope, and Slack API changes, new agent frameworks, and regulatory updates turn it into permanent maintenance.
What is the difference between an approval gate and an audit trail?
An approval gate is a runtime control: it pauses an agent before a sensitive action and requires a human (or a policy) to approve, edit, or reject it, so the action only runs once authorized. An audit trail is the after-the-fact record of what happened — who approved what, and when. A gate blocks before execution; an audit trail proves after execution. You need both, and building both well is harder than building either alone.
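What makes an audit trail able to "prove" anything is that edits are detectable. A minimal sketch of hash-chaining, using an in-memory list where a real system would use an append-only table: each entry stores the SHA-256 of the previous entry's hash plus its own canonical JSON payload, so changing any earlier record breaks every hash after it. The function names and event fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash, payload):
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain, payload):
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, payload),
    }
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Return the index of the first entry that fails verification,
    or None if the whole chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(chain):
        recomputed = _digest(prev_hash, entry["payload"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != recomputed:
            return i
        prev_hash = entry["hash"]
    return None
```

This covers only the chaining and verification step. Append-only enforcement at the database, and publishing or escrowing the latest hash so the whole chain cannot be silently regenerated, are separate pieces this sketch does not attempt.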
Is human-in-the-loop for AI agents the same as RLHF or data labeling?
No. The older, larger meaning of human-in-the-loop is data labeling and model training — humans annotate data or review model predictions to improve a model (this includes RLHF). Human-in-the-loop for AI agents is different: a person authorizes an agent action at runtime, before it executes a real-world tool call such as a payment or a deletion. This article is about the runtime-approval meaning, not the labeling one.
Does the EU AI Act require human oversight of AI agents?
For high-risk AI systems, yes. Article 14 of the EU AI Act (Regulation (EU) 2024/1689) requires that high-risk AI systems be designed so they can be effectively overseen by humans, including the ability to intervene or interrupt the system. Article 12 requires automatic record-keeping (logs), and Article 26 sets obligations for deployers. Whether your agent is high-risk depends on its use case; a runtime approval gate plus a tamper-evident log is a direct, demonstrable way to meet the oversight and record-keeping requirements.