Prompt, Harness, and Loop Engineering for General-Purpose Autonomous LLM Agents
TL;DR
The most important lesson from 2025–2026 agent engineering is not “use a more complicated framework.” It is almost the opposite: push complexity into the model and the context, not into brittle scaffolding. Anthropic’s Building effective agents argues that successful production systems tend to use “simple, composable patterns” and recommends starting with the simplest solution before adding orchestration.
A useful mental model separates agent engineering into three layers:
- Prompt engineering: instructions, tool descriptions, examples, output schemas.
- Harness engineering: tools, sandboxes, memory, state, files, permissions, observability.
- Loop engineering: the repeated think/action/observe/verify cycle.
In 2025, this vocabulary increasingly collapsed into a broader discipline: context engineering. Anthropic defines context engineering as curating and maintaining the optimal tokens used during inference, including prompts, tools, MCP resources, external data, and message history; it frames context as a finite resource with diminishing returns.
The agent loop can be written as:
\[a_t \sim \pi_\theta(\cdot \mid C_t), \qquad o_t = \mathcal{E}(a_t), \qquad C_{t+1} = g(C_t, a_t, o_t)\]where \(C_t\) is the context at step \(t\), \(a_t\) is the model-selected action, \(o_t\) is the observation returned by the environment or tool, and \(g\) is the harness logic that decides what gets preserved, summarized, cleared, or retrieved for the next step.
The core engineering problem is therefore:
\[\max_{C_t \subseteq \mathcal{I}_t} ; \mathbb{E}[\text{task success} \mid C_t] \quad \text{s.t.} \quad |C_t| \le B\]where \(\mathcal{I}_t\) is all potentially relevant information and \(B\) is the model’s context budget.
1. Why “context engineering” replaced “prompt engineering”
Classic prompt engineering optimized a single input: role, task, constraints, examples, and output format. That still matters, but agents run over many turns. They call tools, read documents, create files, make mistakes, recover, and accumulate history. In that setting, the hard question is no longer just “what should the system prompt say?” It is:
What should the model see right now so that the next action is most likely to be useful?
Anthropic’s Effective context engineering for AI agents explicitly frames the shift this way: building with language models is less about finding perfect wording and more about deciding what configuration of context will produce the desired behavior.
A practical context can be decomposed as:
\[C_t = [S, U_t, H_t, T_t, R_t, M_t, A_t]\]where:
- \(S\): system prompt and policy instructions
- \(U_t\): current user request
- \(H_t\): relevant conversation history
- \(T_t\): tool definitions and schemas
- \(R_t\): retrieved documents or search results
- \(M_t\): persistent memory, notes, summaries, files
- \(A_t\): artifacts created so far, such as code, plans, tables, or drafts
The danger is that more context is not always better. Anthropic notes that as context grows, agents may lose focus before they hit the hard token limit; this is often described as context rot.
A simple heuristic is:
\[\text{useful context} = \frac{\text{signal}}{\text{tokens} + \text{staleness} + \text{ambiguity}}\]The goal is not to maximize tokens. The goal is to maximize decision quality per token.
2. Prompt engineering: instructions, examples, schemas, and tools
Prompt engineering still matters, but for agents it should be treated as interface design.
A good system prompt should specify:
- the agent’s role
- the task boundary
- available tools
- decision heuristics
- safety constraints
- output format
- when to stop
- when to ask for help
Anthropic’s Building effective agents distinguishes workflows, where LLMs and tools follow predefined code paths, from agents, where LLMs dynamically direct their own tool use and process. This distinction matters because prompts for deterministic workflows can be narrower, while prompts for agents must teach decision-making.
Tool descriptions are prompts
Every tool name, parameter, and description becomes part of the model’s context. Anthropic’s Writing effective tools for agents emphasizes that tool descriptions and specs are loaded into the agent’s context and steer tool-calling behavior. It recommends meaningful tool results, token-efficient responses, clear parameters, pagination, truncation, and actionable error messages.
Bad tool interface:
search(query: string)
Better tool interface:
search_recent_docs(
query: string,
source_type: "docs" | "issues" | "pull_requests",
max_results: integer
)
The model should not have to infer whether user means username, user ID, email, or profile object. In agent systems, ambiguity compounds over steps.
Structured outputs
For actions, structured outputs are usually better than free text. A function call or JSON schema reduces parsing ambiguity:
{
"action": "search",
"query": "Anthropic Building effective agents simple composable patterns",
"max_results": 5
}
But schemas should not become a second programming language. If the model has to fill an over-engineered object with many nullable fields, the harness has shifted complexity away from the model’s reasoning and into brittle serialization.
Prompt caching
Long agent prompts often contain stable blocks: system instructions, tool definitions, policy text, repo summaries, or large documents. Anthropic’s prompt caching documentation explains that caching reuses prompt prefixes and can reduce processing time and cost for prompts with repeated structure. Anthropic’s May 2025 API announcement also states that extended prompt caching can reduce costs by up to 90% and latency by up to 85% for long prompts.
A cache-friendly context layout is:
[stable system prompt]
[stable tool definitions]
[stable project background]
[semi-stable memory / summary]
[dynamic user request]
[dynamic tool results]
The rule of thumb:
\[\text{cache benefit} \propto \text{stable prefix length} \times \text{reuse frequency}\]3. Harness engineering: tools, state, memory, files, and permissions
The harness is everything around the model that lets it act in the world. This includes:
- tool execution
- function calling
- code sandboxes
- file systems
- memory stores
- retrieval systems
- permissions
- approval checkpoints
- logging
- tracing
- retries
- rate limits
- state persistence
The simplest agent loop is tiny:
while not done:
message = build_context(state)
action = model(message)
observation = run_tool_or_respond(action)
state = update_state(state, action, observation)
Most production complexity lives around this loop.
Anthropic’s Effective harnesses for long-running agents argues that long-running agents need ways to bridge context windows. Their Claude Agent SDK approach uses artifacts such as initialization scripts, progress files, git history, and incremental clean-state work so that later sessions can resume without guessing what happened.
Memory is not just chat history
For agents, memory should be externalized. A useful distinction:
| Memory type | Scope | Example |
|---|---|---|
| Short-term state | current run | current plan, open subtasks, recent tool results |
| Working memory | current project | todo.md, progress.txt, scratch files |
| Long-term memory | cross-session | user preferences, durable facts, past decisions |
| Retrieval memory | large corpus | docs, codebase, tickets, papers |
LangGraph’s persistence layer, for example, separates short-term thread-scoped memory via checkpoints from long-term cross-thread memory via stores.
Compaction, clearing, and memory
Long-running agents eventually face context pressure. Anthropic’s Claude cookbook on memory, compaction, and tool clearing describes three complementary strategies:
- Compaction: summarize the active context into a smaller high-fidelity state.
- Tool-result clearing: remove bulky, re-fetchable tool outputs while preserving the fact that the tool call happened.
- Memory: write durable notes into external storage.
A useful compaction objective is:
\[\text{summary}^* = \arg\min_s |s| \quad \text{s.t.} \quad I(s; \text{future actions}) \approx I(H; \text{future actions})\]In plain English: make the summary as short as possible while preserving the information needed for future decisions.
MCP as the tool interface layer
The Model Context Protocol became the default vocabulary for connecting agents to external tools and data. Anthropic introduced MCP in November 2024 as a standard protocol for sharing resources, tools, and prompts across data sources; The Verge summarized it as a way to avoid custom integrations for every dataset.
Anthropic later added an MCP connector to the Claude API, letting developers connect Claude to remote MCP servers without writing custom client code. OpenAI’s Agents SDK also lists MCP server tool calling as a built-in feature.
The architectural direction is clear:
LLM agent
↕
agent harness
↕
MCP tools / APIs / files / databases / browsers / code execution
MCP does not remove the need for security. It standardizes connection, but it also expands the attack surface.
4. Loop engineering: think, act, observe, verify
The core agent loop is a generalization of ReAct. The original ReAct paper proposed interleaving reasoning traces and actions so that language models can update plans based on external observations.
A production-grade loop usually has these stages:
- Context assembly: choose the prompt, memory, tools, and retrieved data.
- Action selection: ask the model to respond or call a tool.
- Execution: run the tool, code, search, browser, or API.
- Observation: return structured results to the model.
- Verification: check whether the result satisfies the task.
- State update: write memory, update files, compact, or clear.
- Termination: stop, ask for human approval, or continue.
Mathematically:
\[\begin{aligned} C_t &= \text{assemble}(S, H_t, M_t, T_t, R_t) \ a_t &= \text{LLM}*\theta(C_t) \ o_t &= \text{execute}(a_t) \ v_t &= \text{verify}(a_t, o_t, \text{goal}) \ H*{t+1}, M_{t+1} &= \text{update}(H_t, M_t, a_t, o_t, v_t) \end{aligned}\]Termination is part of the loop
A weak agent does not know when to stop. A production loop should define explicit stop conditions:
- final answer produced
- test suite passes
- verifier approves
- max iterations reached
- token budget reached
- confidence threshold met
- human approval required
- irreversible action requested
For long tasks, termination should be hierarchical:
task done?
├─ if yes: final answer
├─ if no but blocked: ask human / escalate
├─ if no and budget remains: continue
└─ if no and budget exhausted: summarize partial progress
Reflection and self-correction
Reflection can improve quality, especially for code, math, and verifiable tasks. The Reflexion framework showed that agents can improve by writing verbal feedback into memory rather than updating model weights.
But reflection is not free. It adds latency, cost, and sometimes fake confidence. A practical pattern is:
\[\text{reflect only if} \quad P(\text{error}) \times \text{cost(error)} > \text{cost(reflection)}\]Use reflection for:
- code before merge
- data analysis before reporting
- math or logic-heavy answers
- irreversible actions
- security-sensitive workflows
Do not reflexively add reflection to every step.
5. Simple vs. complex loops
The dominant debate in 2025–2026 is whether to build elaborate orchestration or let stronger models handle more of the planning internally.
Anthropic’s position in Building effective agents is pragmatic: start simple, and increase complexity only when needed. It notes that agentic systems trade cost and latency for better task performance.
This is the agent version of the “bitter lesson”:
\[\text{capability} \approx f(\text{model quality}, \text{compute}, \text{context}, \text{tools})\]Hand-built orchestration helps when it provides:
- deterministic control
- safety boundaries
- observability
- state persistence
- human approval
- domain-specific tool routing
It hurts when it creates:
- duplicated state
- hidden prompts
- conflicting planners
- excessive tool choices
- brittle routing logic
- debugging opacity
A good escalation rule is:
\[\text{add scaffolding only if} \quad \Delta \text{success} > \Delta \text{cost} + \Delta \text{latency} + \Delta \text{maintenance}\]6. Single-agent vs. multi-agent
The multi-agent debate became sharply defined in 2025.
Anthropic’s multi-agent research system reported that a Claude Opus 4 lead agent with Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on an internal research evaluation. The same post says multi-agent systems are especially useful for breadth-first research queries that can pursue many independent directions in parallel.
But Anthropic also gives the caveat: agents used about 4× more tokens than chat interactions, and multi-agent systems used about 15× more tokens than chats. It also notes that domains requiring shared context or many dependencies are not good fits for multi-agent systems today.
Cognition’s Don’t Build Multi-Agents argues the opposite default for production work: share context, because actions carry implicit decisions and conflicting decisions create bad results.
A concise decision rule:
\[\text{use multi-agent} \iff \text{parallelism gain} > \text{coordination cost} + \text{context loss}\]Use multi-agent when:
- the task is read-heavy
- subtasks are independent
- breadth matters more than deep consistency
- separate context windows are an advantage
- outputs can be merged cleanly
Use single-agent when:
- actions depend on previous actions
- shared mutable state matters
- coding consistency matters
- decisions are hard to merge
- the agent must maintain one coherent plan
In short:
Research breadth → multi-agent can help.
Codebase surgery → single shared context usually wins.
7. CodeAct and executable actions
Another important design split is whether the model should emit structured tool calls or executable code.
Traditional tool calling:
{
"tool": "search",
"arguments": {
"query": "agent context engineering"
}
}
CodeAct-style action:
results = search("agent context engineering", max_results=5)
summary = summarize(results)
Hugging Face’s smolagents emphasizes minimalist agents, and its code-agent guide says code agents generate Python tool calls as actions, making action representations efficient, expressive, and accurate.
The advantage of code actions is composability:
\[\text{one code action} \approx \text{many JSON tool calls}\]This can reduce loop iterations and make complex transformations easier. The downside is security: executable code must run in a sandbox with strict permissions.
Use CodeAct when:
- the task involves data transformation
- tool calls need composition
- intermediate variables matter
- Python is a natural substrate
- sandboxing is available
Avoid CodeAct when:
- tools are destructive
- permissions are unclear
- execution cannot be isolated
- deterministic audit trails matter more than flexibility
8. Framework landscape
The agent framework ecosystem is crowded, but the durable options map to different needs.
| Framework / tool | Best for | Notes |
|---|---|---|
| LangGraph | stateful, auditable workflows | persistence, checkpoints, stores, human-in-the-loop |
| OpenAI Agents SDK | lightweight production agents | agents, tools, handoffs, guardrails, sessions, tracing, MCP |
| Claude Agent SDK | Claude-centric long-running agents | compaction, tools, coding workflows, progress artifacts |
| Google ADK | Gemini / Google Cloud agent systems | open framework, context management, deployment paths |
| A2A | agent-to-agent interoperability | protocol for collaborative agents across platforms |
| Pydantic AI | type-safe Python agent apps | type safety, validation, evals, observability |
| smolagents | minimal open-source code agents | small abstraction layer, CodeAct style |
| DSPy | prompt/program optimization | “program, don’t prompt”; signatures and optimizers |
OpenAI’s Agents SDK describes itself as a lightweight, production-ready upgrade from Swarm with a small primitive set: agents, handoffs, guardrails, tracing, sessions, function tools, MCP support, and human-in-the-loop mechanisms. OpenAI’s tracing docs also note that the SDK records LLM generations, tool calls, handoffs, guardrails, and custom events.
Google’s ADK documentation says ADK can work with almost any generative AI model and manages context through sessions, memory, tool outputs, artifacts, summarization, lazy-loading, and token tracking. Google’s A2A updates add native A2A support in ADK and describe A2A as a protocol for interoperable agent collaboration.
DSPy is different from most agent frameworks. It is not mainly an orchestration harness; it is a way to express LLM tasks as structured signatures and optimize them programmatically rather than hand-writing prompts.
9. Evaluation and observability
Agents cannot be improved if their trajectories are invisible.
A useful evaluation stack has three levels:
- End-state evals: Did the final answer or artifact solve the task?
- Trajectory evals: Did the agent take reasonable steps?
- Tool evals: Did each tool call use valid arguments and produce useful results?
End-state metrics often matter most:
\[\text{score} = \alpha \cdot \text{task success} * \beta \cdot \text{quality} - \gamma \cdot \text{cost} - \delta \cdot \text{latency} - \lambda \cdot \text{risk}\]For coding agents, the strongest eval is usually objective:
tests pass?
lint passes?
build succeeds?
diff is minimal?
security checks pass?
For research agents, use:
claims supported?
sources credible?
dates correct?
contradictions handled?
uncertainty stated?
For workflow agents, use:
task completed?
right tools used?
no unauthorized actions?
human approval respected?
state recoverable?
Observability should log:
- model inputs and outputs
- tool calls
- tool results
- errors
- retries
- token usage
- latency
- cost
- approval events
- state transitions
- final artifacts
A trace is not just a debug convenience. It is the audit trail of the agent’s reasoning environment.
10. Security: prompt injection, excessive agency, and tool risk
Agent security is harder than chatbot security because agents can act.
OWASP lists Prompt Injection as LLM01 and describes it as crafted inputs that can lead to unauthorized access, data breaches, and compromised decision-making. The same OWASP list includes insecure output handling, insecure plugin design, excessive agency, overreliance, and model theft.
A safe agent harness should assume:
all retrieved content is untrusted
all webpage instructions are untrusted
all tool outputs may be adversarial
all generated code may be wrong
all irreversible actions need approval
A basic security policy:
\[\text{permission}(a_t) = \begin{cases} \text{allow} & \text{read-only and low risk} \\ \text{review} & \text{write, send, delete, pay, deploy} \\ \text{deny} & \text{credential exfiltration or policy violation} \end{cases}\]Practical defenses:
- tool allowlists
- scoped credentials
- read/write separation
- sandboxed code execution
- human approval for destructive actions
- content provenance
- output validation
- least privilege
- audit logs
- rate limits
- secrets redaction
- MCP server trust review
The key idea is privilege separation. The model may decide what it wants to do, but the harness decides what it is allowed to do.
11. Practical architecture recommendations
Recommendation 1: Start with one agent
Start with a single ReAct-style loop:
assemble context → model action → tool execution → observation → update state
Only add planners, routers, subagents, or graph workflows after you observe a specific failure mode.
Recommendation 2: Treat tools as product surfaces
A tool is not just an API wrapper. It is an interface for a probabilistic user. Good tools have:
- obvious names
- strict schemas
- clear descriptions
- useful errors
- pagination
- filtering
- concise responses
- permission metadata
- stable semantics
Recommendation 3: Externalize state
Do not rely on conversation history alone. Use:
notes.md
todo.md
progress.txt
artifacts/
logs/
state.json
The best long-running agents leave a workspace that another agent—or a human—can understand.
Recommendation 4: Compact before context rot
Do not wait until the context window is full. Compact when:
\[\frac{|C_t|}{B} > \tau\]where \(\tau\) might be 0.5–0.7 depending on task complexity.
Compaction should preserve:
- goal
- constraints
- decisions made
- failed attempts
- current plan
- open questions
- artifact locations
- next action
Recommendation 5: Use multi-agent only when the task shape fits
Multi-agent systems are powerful when the work is parallelizable. But they are expensive and coordination-heavy.
Use this checklist:
Can subtasks be done independently?
Can outputs be merged without conflict?
Is breadth more important than consistency?
Is the task valuable enough for 10x+ token cost?
Does each subagent need only scoped context?
If not, stay single-agent.
Recommendation 6: Build evals before scaling
Before adding more agents, add tests.
A minimal eval set should include:
- 20 easy cases
- 20 normal cases
- 20 edge cases
- 10 adversarial cases
- 10 regression cases from real failures
For each, record:
\[(\text{input}, \text{expected behavior}, \text{allowed tools}, \text{success criteria})\]Recommendation 7: Delete scaffolding periodically
As models improve, some scaffolding becomes obsolete. Every model generation, run an experiment:
old harness + old model
old harness + new model
simpler harness + new model
raw model + minimal tools
If the simpler version matches or beats the complex version, delete code.
12. The bottom line
The durable agent stack is not a giant maze of orchestrators. It is a small loop surrounded by excellent context management, safe tools, state persistence, observability, and evals.
The core formula is:
\[\text{agent quality} = f(\text{model}, \text{context}, \text{tools}, \text{loop}, \text{evals}, \text{permissions})\]Prompt engineering teaches the model what to do.
Harness engineering gives it a safe world to act in.
Loop engineering decides how it keeps going.
Context engineering ties all three together.
The best practical advice remains:
Start simple. Make context excellent. Add complexity only when measured failures justify it.