A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang et al. · 2023→2026 ·
arXiv:2303.18223
A 144-page living survey of the LLM stack: pre-training, alignment, utilization, evaluation. Updated through 19 revisions. The closest thing to a textbook for the field as it moves.
Why this survey
The transformer paper gives you the architecture. This survey gives you everything built on top: how models are trained, aligned, prompted, evaluated, compressed, and deployed. Four pillars, each a stage in the lifecycle of an LLM.
The four pillars
| Pillar | What it covers | Section |
|---|---|---|
| Pre-training | Data collection, architecture choices (encoder-only vs. decoder-only vs. encoder-decoder), training objectives, scaling laws, distributed training | §4 |
| Post-training | Instruction tuning (SFT), alignment tuning (RLHF and alternatives), parameter-efficient adaptation (LoRA, adapters) | §5 |
| Utilization | Prompt engineering, in-context learning, chain-of-thought reasoning, planning and agentic frameworks | §6 |
| Evaluation | What LLMs can do (generation, knowledge, reasoning, tool use, alignment), how to measure it, empirical findings | §7 |
Section map
The survey is large. These are the sections worth reading in full, versus those to skim for references.
| § | Topic | Read or skim |
|---|---|---|
| 2 | Overview & GPT evolution | Read: traces the key transitions from GPT-1 through GPT-4 |
| 3 | Resources | Skim: model checkpoints, corpora, datasets, libraries |
| 4 | Pre-training | Read: architecture choices, scaling laws, training infrastructure |
| 5 | Post-training | Read: instruction tuning, RLHF, DPO, LoRA |
| 6 | Utilization | Read: prompting, ICL, CoT, planning |
| 7 | Capacity & evaluation | Skim: benchmarks, empirical results |
| 8 | Applications | Skim: domain-specific applications |
| 9 | Advanced topics | Read: long context, agents, compression, RAG, hallucination |
Three eras
The survey traces an evolution through three stages:
- Statistical language models. N-gram counts, smoothing, Markov assumptions. The insight: predict the next word from a fixed window of preceding words.
- Neural language models. Word embeddings, RNNs, LSTMs. The insight: learn distributed representations that capture semantic similarity.
- Pre-trained / Large language models. Transformers trained on internet-scale corpora. The insight: scale the architecture and the data together, and capabilities emerge that were not explicitly trained.
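The statistical-era insight fits in a few lines. A minimal bigram model (illustrative, with add-one smoothing; `train_bigram` and `prob` are names invented here, not from the survey):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count word-pair frequencies from whitespace-tokenized sentences."""
    counts = defaultdict(Counter)
    vocab = set()
    for sent in corpus:
        toks = sent.split()
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    return counts, vocab

def prob(counts, vocab, prev, word):
    """P(word | prev) with add-one (Laplace) smoothing over the vocabulary."""
    c = counts[prev]
    return (c[word] + 1) / (sum(c.values()) + len(vocab))
```

Everything the model knows is a count table; the Markov assumption is the fixed window of one preceding word.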
Within the LLM era, the survey distinguishes three phases that map cleanly to the cognitive architecture lens:
| Phase | LLM term | Architecture role |
|---|---|---|
| Pre-training | Next-token prediction on massive corpora | Consolidate: offline weight updates |
| Post-training | SFT, RLHF, DPO: align behavior to intent | Consolidate: policy refinement |
| Inference | Prompting, ICL, CoT, tool use, agents | Forward pass only: Perceive through Remember, no Consolidate |
The architectural gap is visible: during inference, the model runs a forward pass with frozen weights. There is no learning. In-context learning looks like learning but is attention over the prompt. Nothing persists beyond the context window.
The Consolidate stack
Nearly everything interesting in this survey is a Consolidate operation. The LLM lifecycle is a stack of increasingly refined backward passes.
Scaling laws are Consolidate's planning function: predict how much loss a given budget will buy before spending it. No other architecture has this. Soar has no way to predict how many chunks a given amount of experience will produce.
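The planning function is literally a formula. A sketch of a Chinchilla-style loss predictor, L(N, D) = E + A/N^α + B/D^β; the coefficient values below are illustrative stand-ins, not fitted values from the survey:

```python
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predict pre-training loss from model size N and data size D.

    E is irreducible loss; the two power-law terms are the penalties
    for finite parameters and finite data, respectively.
    """
    return E + A / n_params**alpha + B / n_tokens**beta
```

Sweep this over candidate (N, D) pairs under a fixed compute budget and you can pick the allocation before spending a single GPU-hour.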
RLHF is Minsky's A-brain / B-brain made differentiable. The policy model acts. The reward model judges. PPO optimizes the policy against the judgment. DPO collapses the two into one step: skip the reward model, optimize directly on preference pairs. Soar's RL does something similar with numeric preferences, but the reward comes from the task, not from a human rater.
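The DPO collapse is visible in the loss itself: one sigmoid over a log-probability margin, no reward model in sight. A sketch for a single preference pair (log-probabilities are assumed to come from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where margin compares how much more
    the policy prefers the chosen response than the reference model does."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; pushing probability toward the chosen response drives it down. The B-brain's judgment is baked into the preference pairs rather than into a separate model.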
LoRA is the cheapest backward pass. Freeze the base model. Train two small matrices whose product approximates the weight update. Merge at inference. Consolidate with a compression constraint: the update must fit in rank r. Soar's chunking has a different constraint. The update must be a single production rule derived from a dependency trace.
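The rank-r constraint and the free merge are both one line each. A NumPy sketch of the mechanics (shapes and init are illustrative; in practice only A and B receive gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size, adapter rank (illustrative)
W = rng.normal(size=(d, d))          # frozen base weight: never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection; zero init makes
                                     # the adapter a no-op before training

def lora_forward(x):
    # Base path plus low-rank update B @ A (rank <= r by construction).
    return x @ W.T + x @ (B @ A).T

B = rng.normal(size=(d, r))          # stand-in for trained values
W_merged = W + B @ A                 # merge for inference: one matmul, no extra latency
```

The update can never exceed rank r, whatever the data says; that is the compression constraint in matrix form.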
Here's what catches people: in-context learning and chain-of-thought look like Consolidate but aren't. They're Attend. Attention over the prompt. Nothing persists. Nothing compiles. Agent frameworks tried to fix this with Reflexion (store verbal critiques) and MemInsight (compress episodes into insights). Consolidate workarounds in text space, because the weights are frozen.
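The workaround pattern is simple enough to sketch. A toy Reflexion-style loop (illustrative, not the paper's implementation; `attempt_fn` and `evaluate_fn` are stand-ins for an LLM call and a task checker):

```python
def reflexion_loop(task, attempt_fn, evaluate_fn, max_trials=3):
    """Retry a task, carrying verbal critiques forward between attempts.

    The critique list is the only thing that persists across trials:
    a Consolidate workaround in text space, since weights are frozen.
    """
    memory = []
    answer = None
    for _ in range(max_trials):
        answer = attempt_fn(task, memory)      # memory is injected into the prompt
        ok, critique = evaluate_fn(task, answer)
        if ok:
            return answer, memory
        memory.append(critique)                # store the verbal self-critique
    return answer, memory
```

Clear `memory` and the agent is back to square one, which is exactly the frozen-weights point.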
Perceive, not Remember
RAG retrieves documents and prepends them to the prompt, injecting external information at the input. That's Perceive. Soar's semantic memory does the same: a queryable store that feeds the forward pass. Soar retrieves by activation (recency + frequency + spreading); RAG retrieves by embedding similarity. Both are read operations on external memory.
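The read operation is a similarity search plus string concatenation. A toy sketch of embedding-based retrieval (vectors stand in for a real embedding model; function names are invented here):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose embeddings have highest cosine
    similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = D @ q
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def build_prompt(question, retrieved):
    """Prepend retrieved passages to the question: injection at the input."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Swap the cosine score for activation (recency + frequency + spreading) and the same skeleton describes Soar's semantic memory retrieval: both are reads that feed the forward pass.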
The frozen forward pass
The forward pass is read-only. Soar's chunking compiles deliberation into rules on every impasse. Reflexion simulates this in text. Neither updates the model.
The context window is volatile. Soar stores every decision cycle. Generative Agents bolt on a natural-language episode stream. LLMs forget everything when the context clears.
Pre-training and RLHF are scheduled by engineers, not triggered by the agent's own experience. Soar's chunking fires automatically on impasse resolution. The diagnosis is the same: sealed Consolidate, frozen Remember.