Agentic AI: Architectures, Taxonomies, and Evaluation
Arunkumar V, Gangadharan G.R., Rajkumar Buyya · 2026 · arXiv:2601.12560
LLMs as cognitive controllers: perception, memory, planning, action, tool use, and collaboration. The full agent stack built on top of frozen weights.
What this covers
Wrap an LLM in a loop with tools, memory, and feedback, and it stops being a text generator. It becomes a cognitive controller: perceive observations, update memory, plan next steps, select actions. The weights stay frozen. All adaptation happens through the context window, external memory, or verbal self-critique.
The control loop
Four named functions run each cycle:
- Φ (Perceive): ground multimodal input. Text, screenshots, DOM, coordinates, audio, video, point clouds.
- μ (Memory update): write observations to persistent state. Retrieval, structured storage, summarization, pruning.
- Ψ (Plan): reason about what to do next. Chain, tree, or hierarchical decomposition.
- π (Act): select and execute an action. API call, code execution, tool invocation, motor command.
The cycle repeats. Each action produces an observation that feeds the next perception. Reflection can interrupt the cycle to revise the plan.
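The four functions compose into a loop. A minimal sketch, assuming a toy environment where observations and actions are strings; `ToyAgent` and `run_episode` are illustrative names, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class ToyAgent:
    memory: list = field(default_factory=list)

    def perceive(self, raw):        # Φ: ground raw input
        return raw.strip().lower()

    def update_memory(self, obs):   # μ: write to persistent state
        self.memory.append(obs)

    def plan(self):                 # Ψ: decide what to do next
        return "stop" if "goal" in self.memory[-1] else "explore"

    def act(self, plan):            # π: execute the selected action
        return plan

def run_episode(agent, observations):
    actions = []
    for raw in observations:
        obs = agent.perceive(raw)       # each action's result feeds
        agent.update_memory(obs)        # the next perception
        actions.append(agent.act(agent.plan()))
        if actions[-1] == "stop":
            break
    return actions
```

The weights never change; all state lives in `memory`.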
Core components
| Component | What it does | Parts bin cell |
|---|---|---|
| Perception | Multimodal grounding: text, vision, DOM, coordinates. Evolving from text-only to screenshots to video to 3D. | Perceive |
| Memory | Persistent state across episodes. Retrieval, structured storage, summarization, decay, pruning. | Cache + Remember |
| Planning | Reasoning topologies: linear chains (ReAct), branching trees (ToT), hierarchical decomposition, inference-time budgets. | Attend |
| Action + Tools | Execution: API calls → code-as-action → agent-computer interfaces → computer-use → embodied VLA. | Remember (output) |
| Reflection | Self-critique without weight updates. Store natural-language lessons, condition future attempts. | Consolidate (verbal) |
| Collaboration | Multi-agent coordination: chain, star, mesh topologies. Role-playing, debate, verification. | Attend (distributed) |
Planning topologies
| Topology | Mechanism | Example |
|---|---|---|
| Linear chain | Interleave reasoning and action. One step at a time. | ReAct |
| Branching tree | Treat thoughts as search nodes. Explore alternatives, backtrack. | Tree of Thoughts |
| Hierarchical | Decompose goal into subgoals. Each subgoal gets its own plan. | HuggingGPT, TaskWeaver |
| Internal search | Inference-time compute budgets. Search happens inside the model. | o1, o3 |
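The branching-tree row can be sketched as best-first search over thought paths, in the spirit of Tree of Thoughts. This assumes `expand` proposes candidate next thoughts and `value` scores partial paths; both are toy stand-ins for what would be LLM calls in the real system:

```python
import heapq

def tot_search(root, expand, value, is_goal, max_nodes=100):
    # Max-heap via negated scores; each entry is (neg_score, path).
    frontier = [(-value([root]), [root])]
    visited = 0
    while frontier and visited < max_nodes:
        neg_score, path = heapq.heappop(frontier)
        visited += 1
        if is_goal(path):
            return path
        for thought in expand(path):        # branch: alternative thoughts
            new_path = path + [thought]
            heapq.heappush(frontier, (-value(new_path), new_path))
    return None  # budget exhausted without an acceptable path
```

Backtracking falls out of the frontier: a low-value branch simply stops being popped.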
Action space evolution
| Paradigm | What the agent can do | Constraint level |
|---|---|---|
| API-based | Call predefined functions with typed arguments | Most constrained |
| Code-as-action | Generate and execute arbitrary code | Less constrained |
| Agent-computer interface | Curated shell: file system, terminal, browser | Moderate (deliberately re-constrained after arbitrary code) |
| Computer-use | Mouse, keyboard, screenshots: the raw desktop | Minimal |
| Embodied VLA | Continuous motor primitives from vision-language-action models | Unconstrained |
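The most constrained paradigm, API-based tool calls, can be sketched as a JSON dispatcher over a typed registry. The tool names and registry here are illustrative; the point is that anything outside the registry is rejected rather than executed:

```python
import json

TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}

def dispatch(call_json):
    # The agent emits JSON like {"tool": "add", "args": {"a": 2, "b": 3}}.
    call = json.loads(call_json)
    name, args = call["tool"], call["args"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")  # constrained action space
    return TOOLS[name](**args)
```

Each step down the table trades away exactly this kind of rejection check.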
Memory architectures
| System | Mechanism | Parts bin cell |
|---|---|---|
| Generative Agents | Natural language stream with reflection and summarization | Remember × sequence |
| MemoryBank | Hierarchical clusters with exponential decay | Cache × tree |
| ChatDB | Symbolic SQL tables for structured state | Cache × graph |
| MemGPT | Paged long-term memory with explicit controller-driven retrieval | Attend × sequence |
| MemInsight | Convert episodic traces into semantic insights via compression | Consolidate × sequence |
| MemAgent | Learn what to discard: policy-driven pruning | Filter × sequence |
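MemoryBank's exponential decay can be sketched as a retention score of the form `strength * exp(-age / tau)`, with entries pruned once they fall below a threshold. The time constant `tau` and the threshold are illustrative parameters, not the paper's values:

```python
import math

def retention(strength, age, tau=10.0):
    # Decayed memory strength: recent or strongly-written entries survive.
    return strength * math.exp(-age / tau)

def prune(entries, now, threshold=0.5, tau=10.0):
    # Keep entries whose decayed score still clears the threshold.
    return [e for e in entries
            if retention(e["strength"], now - e["t"], tau) >= threshold]
```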
Multi-agent topologies
| Topology | Pattern | Example |
|---|---|---|
| Chain | Sequential waterfall: each agent passes deliverables to the next | MetaGPT, ChatDev |
| Star | Hub-and-spoke: coordinator delegates to specialized workers | AutoGen, Swarm |
| Mesh | Decentralized: agents communicate dynamically, debate, simulate | CAMEL, Generative Agents |
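The star topology can be sketched as a coordinator that routes subtasks to specialized workers and collects results. Worker names and the routing rule are illustrative; in systems like AutoGen the routing decision is itself an LLM call:

```python
WORKERS = {
    # Toy workers; `eval` with empty builtins is for illustration only,
    # not a real sandbox.
    "math": lambda task: str(eval(task, {"__builtins__": {}})),
    "echo": lambda task: task,
}

def coordinate(subtasks):
    # Hub: delegate each (kind, task) pair to the matching spoke.
    results = []
    for kind, task in subtasks:
        worker = WORKERS.get(kind, WORKERS["echo"])  # fallback spoke
        results.append(worker(task))
    return results
```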
Reflection and feedback
The frozen-weights constraint means agents cannot learn by updating parameters. Instead, they learn verbally:
| Framework | What it does |
|---|---|
| ReAct | Interleave reasoning traces with actions. Each observation feeds the next thought. Linear, no backtracking. |
| Reflexion | Store natural-language critiques of failures. Condition future attempts on these lessons. Verbal reinforcement. |
| Tree of Thoughts | Explore alternative reasoning paths. Evaluate and backtrack. Global search over thought space. |
| MAKER | Hierarchical verification: verifier agents challenge worker outputs. Near-zero error on million-step chains. |
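Reflexion-style verbal reinforcement can be sketched as a lesson store whose contents condition the next attempt instead of updating weights. The `attempt` and `critique_fn` callables stand in for LLM-driven episodes and self-critique:

```python
class VerbalMemory:
    def __init__(self):
        self.lessons = []

    def record_failure(self, critique):
        self.lessons.append(critique)

    def context(self):
        # Lessons are prepended to the prompt; no parameters change.
        return "\n".join(f"Lesson: {l}" for l in self.lessons)

def run_with_reflection(attempt, critique_fn, memory, max_tries=3):
    for _ in range(max_tries):
        ok, trace = attempt(memory.context())
        if ok:
            return True
        memory.record_failure(critique_fn(trace))  # learn verbally
    return False
```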
Evaluation: the CLASS framework
The paper argues that single success-rate metrics mask critical reliability issues. Their proposed replacement:
| Dimension | What to measure |
|---|---|
| Cost | Token spend, API calls, compute budget per task |
| Latency | Time to first action, end-to-end task completion |
| Accuracy | Task success, failure severity distribution (benign vs. catastrophic) |
| Security | Prompt injection resistance, trust boundaries, audit logging |
| Stability | Run-to-run variance, infinite loop detection, error propagation |
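A CLASS-style report can be sketched as aggregation over repeated runs of the same task; stability, notably, requires multiple runs to measure at all. The field names and the choice of variance as the stability proxy are illustrative:

```python
from statistics import mean, pstdev

def class_report(runs):
    # One run = one dict of raw measurements for a single task attempt.
    successes = [1.0 if r["success"] else 0.0 for r in runs]
    return {
        "cost": mean(r["tokens"] for r in runs),
        "latency": mean(r["seconds"] for r in runs),
        "accuracy": mean(successes),
        "security": all(r["no_injection"] for r in runs),
        "stability": pstdev(successes),  # run-to-run variance
    }
```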
The memory problem
Every memory system in this survey reinvents what Soar built architecturally.
MemGPT's paged memory is Soar's working memory + semantic memory, split into tiers. The controller pages facts in and out of the context window the way Soar's retrieval pulls from SMEM into WM. Soar's retrieval uses activation (recency + frequency + spreading from context). MemGPT uses the LLM itself to decide what to page in. One is a mechanism. The other is a hope.
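The mechanism side can be sketched concretely: base-level activation sums decayed traces, so both recency and frequency raise a memory's retrieval score. The decay exponent `d` follows the classic base-level form; the values here are illustrative:

```python
import math

def base_level_activation(access_times, now, d=0.5):
    # Each past access contributes (now - t)^-d: recent accesses decay
    # less, and more accesses mean more terms in the sum.
    return math.log(sum((now - t) ** -d for t in access_times))

def retrieve(memories, now, k=1):
    # Deterministic: pull the k highest-activation entries into context.
    scored = sorted(memories.items(),
                    key=lambda kv: base_level_activation(kv[1], now),
                    reverse=True)
    return [name for name, _ in scored[:k]]
```

MemGPT replaces this scoring function with an LLM judgment call.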
And the consolidation pipeline that Soar lacks, the one the prescription calls for? MemInsight already builds it. Read episodes, detect regularities, write compressed knowledge. MemInsight uses an LLM summarizer; the prescription uses temporal graph coarsening. Same problem either way: episodes accumulate, and you need to extract what generalizes.
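The shared skeleton of that pipeline can be sketched in a few lines. Here "regularity" is just a repeated (state, action) pair with minimum support; real systems substitute LLM summarization or graph coarsening for the counting step:

```python
from collections import Counter

def consolidate(episodes, min_support=2):
    # Scan episodic traces, keep only the steps that recur: compressed
    # semantic knowledge extracted from raw experience.
    counts = Counter(step for ep in episodes for step in ep)
    return {step: n for step, n in counts.items() if n >= min_support}
```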
Then there's forgetting. MemAgent learns a pruning policy over memory entries, which is what Minsky called censors. Minsky's version suppresses the thought preceding a bad action. Soar's truth maintenance auto-retracts structures whose justification no longer holds. Three approaches to the same problem at increasing levels of architectural integration.
Reasoning is search
ReAct is Soar's decision cycle, linearized. Observe, think, act, observe. Soar runs the same loop with parallel rule firing, staged preferences, and the impasse mechanism for recursive subgoaling. ReAct has none of that. The minimal viable agent loop: one thought, one action, no backtracking.
What about backtracking? Tree of Thoughts is MCTS for reasoning. Treat intermediate thoughts as search nodes. Evaluate. Backtrack. Explore alternatives. Soar's impasse mechanism does this natively: when the decision procedure can't select an operator, a substate opens and the same cycle runs recursively. ToT reimplements this in prompt space because the transformer has no native backtracking.
And learning from failure? Reflexion is verbal chunking. Soar's chunking backtraces a dependency chain and writes a production rule. Reflexion writes a natural-language critique and appends it to the prompt. Both compile deliberation into reusable knowledge. One produces executable rules. The other produces text that the LLM might or might not attend to. The gap between "compiles to code" and "appends to prompt" is the gap between architectural and verbal Consolidate.
Hierarchical verification (MAKER) takes A-brain / B-brain to scale. Worker agents act. Verifier agents challenge. Near-zero error on million-step chains. Minsky's reflection hierarchy, distributed across multiple agents. Also how RLHF works: one model generates, another judges. Minsky named the pattern forty years before anyone built it.