A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang et al. · 2023→2026 ·
arXiv:2303.18223
A 144-page living survey of the LLM stack: pre-training, alignment, utilization, evaluation. Updated through 19 revisions. The closest thing to a textbook for the field as it moves.
Why this survey
The transformer paper gives you the architecture. This survey gives you everything built on top: how models are trained, aligned, prompted, evaluated, compressed, and deployed. Four pillars, each a stage in the lifecycle of an LLM.
The four pillars
| Pillar | What it covers | Section |
|---|---|---|
| Pre-training | Data collection, architecture choices (encoder-only vs. decoder-only vs. encoder-decoder), training objectives, scaling laws, distributed training | §4 |
| Post-training | Instruction tuning (SFT), alignment tuning (RLHF and alternatives), parameter-efficient adaptation (LoRA, adapters) | §5 |
| Utilization | Prompt engineering, in-context learning, chain-of-thought reasoning, planning and agentic frameworks | §6 |
| Evaluation | What LLMs can do (generation, knowledge, reasoning, tool use, alignment), how to measure it, empirical findings | §7 |
Section map
The survey is large. These are the sections worth reading in full, versus those to skim for references.
| § | Topic | Read or skim |
|---|---|---|
| 2 | Overview & GPT evolution | Read: traces the key transitions from GPT-1 through GPT-4 |
| 3 | Resources | Skim: model checkpoints, corpora, datasets, libraries |
| 4 | Pre-training | Read: architecture choices, scaling laws, training infrastructure |
| 5 | Post-training | Read: instruction tuning, RLHF, DPO, LoRA |
| 6 | Utilization | Read: prompting, ICL, CoT, planning |
| 7 | Capacity & evaluation | Skim: benchmarks, empirical results |
| 8 | Applications | Skim: domain-specific applications |
| 9 | Advanced topics | Read: long context, agents, compression, RAG, hallucination |
Three eras
The survey traces an evolution through three stages:
- Statistical language models. N-gram counts, smoothing, Markov assumptions. The insight: predict the next word from a fixed window of preceding words.
- Neural language models. Word embeddings, RNNs, LSTMs. The insight: learn distributed representations that capture semantic similarity.
- Pre-trained / Large language models. Transformers trained on internet-scale corpora. The insight: scale the architecture and the data together, and capabilities emerge that were not explicitly trained.
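The statistical-era insight fits in a few lines. A minimal bigram model (illustrative, with add-one smoothing; `train_bigram` and `prob` are names invented here, not from the survey):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count word-pair frequencies from whitespace-tokenized sentences."""
    counts = defaultdict(Counter)
    vocab = set()
    for sent in corpus:
        toks = sent.split()
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            counts[a][b] += 1
    return counts, vocab

def prob(counts, vocab, prev, word):
    """P(word | prev) with add-one (Laplace) smoothing over the vocabulary."""
    c = counts[prev]
    return (c[word] + 1) / (sum(c.values()) + len(vocab))
```

Everything the model knows is a count table; the Markov assumption is the fixed window of one preceding word.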
Within the LLM era, the survey distinguishes three phases that map cleanly to the cognitive architecture lens:
| Phase | LLM term | Architecture role |
|---|---|---|
| Pre-training | Next-token prediction on massive corpora | Consolidate: offline weight updates |
| Post-training | SFT, RLHF, DPO: align behavior to intent | Consolidate: policy refinement |
| Inference | Prompting, ICL, CoT, tool use, agents | Forward pass only: Perceive through Remember, no Consolidate |
The architectural gap is visible: during inference, the model runs a forward pass with frozen weights. There is no learning. In-context learning looks like learning but is attention over the prompt. Nothing persists beyond the context window.
The Consolidate stack
Nearly everything interesting in this survey is a Consolidate operation. The LLM lifecycle is a stack of increasingly refined backward passes.
Scaling laws are Consolidate's planning function: predict how much loss a given budget will buy before spending it. No other architecture has this. Soar has no way to predict how many chunks a given amount of experience will produce.
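The planning function is literally a formula. A sketch of a Chinchilla-style loss predictor, L(N, D) = E + A/N^α + B/D^β; the coefficient values below are illustrative stand-ins, not fitted values from the survey:

```python
def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.69, A: float = 406.4, B: float = 410.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predict pre-training loss from model size N and data size D.

    E is irreducible loss; the two power-law terms are the penalties
    for finite parameters and finite data, respectively.
    """
    return E + A / n_params**alpha + B / n_tokens**beta
```

Sweep this over candidate (N, D) pairs under a fixed compute budget and you can pick the allocation before spending a single GPU-hour.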
RLHF is Minsky's A-brain / B-brain made differentiable. The policy model acts. The reward model judges. PPO optimizes the policy against the judgment. DPO collapses the two into one step: skip the reward model, optimize directly on preference pairs. Soar's RL does something similar with numeric preferences, but the reward comes from the task, not from a human rater.
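The DPO collapse is visible in the loss itself: one sigmoid over a log-probability margin, no reward model in sight. A sketch for a single preference pair (log-probabilities are assumed to come from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where margin compares how much more
    the policy prefers the chosen response than the reference model does."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))
```

When the policy matches the reference, the margin is zero and the loss is log 2; pushing probability toward the chosen response drives it down. The B-brain's judgment is baked into the preference pairs rather than into a separate model.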
LoRA is the cheapest backward pass. Freeze the base model. Train two small matrices whose product approximates the weight update. Merge at inference. Consolidate with a compression constraint: the update must fit in rank r. Soar's chunking has a different constraint. The update must be a single production rule derived from a dependency trace.
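The rank-r constraint and the free merge are both one line each. A NumPy sketch of the mechanics (shapes and init are illustrative; in practice only A and B receive gradients):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size, adapter rank (illustrative)
W = rng.normal(size=(d, d))          # frozen base weight: never updated
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection; zero init makes
                                     # the adapter a no-op before training

def lora_forward(x):
    # Base path plus low-rank update B @ A (rank <= r by construction).
    return x @ W.T + x @ (B @ A).T

B = rng.normal(size=(d, r))          # stand-in for trained values
W_merged = W + B @ A                 # merge for inference: one matmul, no extra latency
```

The update can never exceed rank r, whatever the data says; that is the compression constraint in matrix form.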
Here's what catches people: in-context learning and chain-of-thought look like Consolidate but aren't. They're Attend. Attention over the prompt. Nothing persists. Nothing compiles. Agent frameworks tried to fix this with Reflexion (store verbal critiques) and MemInsight (compress episodes into insights). Consolidate workarounds in text space, because the weights are frozen.
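The workaround pattern is simple enough to sketch. A toy Reflexion-style loop (illustrative, not the paper's implementation; `attempt_fn` and `evaluate_fn` are stand-ins for an LLM call and a task checker):

```python
def reflexion_loop(task, attempt_fn, evaluate_fn, max_trials=3):
    """Retry a task, carrying verbal critiques forward between attempts.

    The critique list is the only thing that persists across trials:
    a Consolidate workaround in text space, since weights are frozen.
    """
    memory = []
    answer = None
    for _ in range(max_trials):
        answer = attempt_fn(task, memory)      # memory is injected into the prompt
        ok, critique = evaluate_fn(task, answer)
        if ok:
            return answer, memory
        memory.append(critique)                # store the verbal self-critique
    return answer, memory
```

Clear `memory` and the agent is back to square one, which is exactly the frozen-weights point.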
Perceive, not Remember
RAG retrieves documents and prepends them to the prompt, injecting external information at the input. That's Perceive. Soar's semantic memory does the same: a queryable store that feeds the forward pass. Soar retrieves by activation (recency + frequency + spreading); RAG retrieves by embedding similarity. Both are read operations on external memory.
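The read operation is a similarity search plus string concatenation. A toy sketch of embedding-based retrieval (vectors stand in for a real embedding model; function names are invented here):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Return the k documents whose embeddings have highest cosine
    similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = D @ q
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def build_prompt(question, retrieved):
    """Prepend retrieved passages to the question: injection at the input."""
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Swap the cosine score for activation (recency + frequency + spreading) and the same skeleton describes Soar's semantic memory retrieval: both are reads that feed the forward pass.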
The frozen forward pass
The forward pass is read-only. Soar's chunking compiles deliberation into rules on every impasse. Reflexion simulates this in text. Neither updates the model.
The context window is volatile. Soar stores every decision cycle. Generative Agents bolt on a natural-language episode stream. LLMs forget everything when the context clears.
Pre-training and RLHF are scheduled by engineers, not triggered by the agent's own experience. Soar's chunking fires automatically on impasse resolution. The diagnosis is the same: sealed Consolidate, frozen Remember.