
A Survey of Large Language Models

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang et al. · 2023→2026 · arXiv:2303.18223

A 144-page living survey of the LLM stack: pre-training, alignment, utilization, evaluation. Updated through 19 revisions. The closest thing to a textbook for the field as it moves.

[Figure: the LLM lifecycle. Pre-training (next-token prediction on internet-scale data) and post-training (SFT, RLHF, DPO aligning behavior to intent) are the two Consolidate phases; utilization (prompting, ICL, CoT, agents) is a frozen forward pass; evaluation (benchmarks) is measurement. No learning happens at inference.]

Why this survey

The transformer paper gives you the architecture. This survey gives you everything built on top: how models are trained, aligned, prompted, evaluated, compressed, and deployed. Four pillars, each a stage in the lifecycle of an LLM.

The four pillars

Pillar | What it covers | Section
Pre-training | Data collection, architecture choices (encoder-only vs. decoder-only vs. encoder-decoder), training objectives, scaling laws, distributed training | §4
Post-training | Instruction tuning (SFT), alignment tuning (RLHF and alternatives), parameter-efficient adaptation (LoRA, adapters) | §5
Utilization | Prompt engineering, in-context learning, chain-of-thought reasoning, planning and agentic frameworks | §6
Evaluation | What LLMs can do (generation, knowledge, reasoning, tool use, alignment), how to measure it, empirical findings | §7
Section map

The survey is large. These are the sections worth reading in full, versus those to skim for references.

§ | Topic | Read or skim
2 | Overview & GPT evolution | Read: traces the key transitions from GPT-1 through GPT-4
3 | Resources | Skim: model checkpoints, corpora, datasets, libraries
4 | Pre-training | Read: architecture choices, scaling laws, training infrastructure
5 | Post-training | Read: instruction tuning, RLHF, DPO, LoRA
6 | Utilization | Read: prompting, ICL, CoT, planning
7 | Capacity & evaluation | Skim: benchmarks, empirical results
8 | Applications | Skim: domain-specific applications
9 | Advanced topics | Read: long context, agents, compression, RAG, hallucination

Three eras

The survey traces an evolution through three stages:

  1. Statistical language models. N-gram counts, smoothing, Markov assumptions. The insight: predict the next word from a fixed window of preceding words.
  2. Neural language models. Word embeddings, RNNs, LSTMs. The insight: learn distributed representations that capture semantic similarity.
  3. Pre-trained / Large language models. Transformers trained on internet-scale corpora. The insight: scale the architecture and the data together, and capabilities emerge that were not explicitly trained.
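The first era's insight fits in a few lines. A toy sketch (not from the survey): a bigram model that predicts the next word from counts over a one-word window, with no smoothing.

```python
from collections import Counter, defaultdict

# Toy bigram model: the statistical-LM era in miniature. The next word is
# predicted from counts over a fixed one-word window (no smoothing here).
def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Most frequent continuation, or None for an unseen context."""
    if not counts[word]:
        return None
    return counts[word].most_common(1)[0][0]

tokens = "the cat sat on the mat the cat ran".split()
print(predict_next(train_bigram(tokens), "the"))  # "cat" (follows "the" twice)
```

Everything after this era is, in a sense, a better way to estimate the same conditional distribution.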

Within the LLM era, the survey distinguishes three phases that map cleanly to the cognitive architecture lens:

Phase | LLM term | Architecture role
Pre-training | Next-token prediction on massive corpora | Consolidate: offline weight updates
Post-training | SFT, RLHF, DPO: align behavior to intent | Consolidate: policy refinement
Inference | Prompting, ICL, CoT, tool use, agents | Forward pass only: Perceive through Remember, no Consolidate

The architectural gap is visible: during inference, the model runs a forward pass with frozen weights. There is no learning. In-context learning looks like learning but is attention over the prompt. Nothing persists beyond the context window.

The Consolidate stack

Nearly everything interesting in this survey is a Consolidate operation. The LLM lifecycle is a stack of increasingly refined backward passes.

Scaling laws are Consolidate's planning function: predict how much loss a given budget will buy before spending it. No other architecture has this. Soar has no way to predict how many chunks a given amount of experience will produce.
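The planning function is literal arithmetic. A sketch using the Chinchilla functional form L(N, D) = E + A/N^α + B/D^β; the constants are the published Hoffmann et al. (2022) fits and should be read as illustrative, not exact.

```python
# Chinchilla-style scaling law: L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the Hoffmann et al. (2022) fits; illustrative only.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predict pre-training loss before spending the compute."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Same rough budget, two allocations: here, more data beats more parameters.
print(predicted_loss(70e9, 1.4e12))    # 70B params, 1.4T tokens
print(predicted_loss(280e9, 0.35e12))  # 4x the params, 1/4 the tokens
```

The irreducible term E is the floor no budget can buy past; the decision is how to split spend between N and D.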

RLHF is Minsky's A-brain / B-brain made differentiable. The policy model acts. The reward model judges. PPO optimizes the policy against the judgment. DPO collapses the two into one step: skip the reward model, optimize directly on preference pairs. Soar's RL does something similar with numeric preferences, but the reward comes from the task, not from a human rater.
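The DPO collapse is visible in the loss itself. A minimal per-pair sketch of the standard DPO objective, assuming sequence-level log-probabilities are already computed; the four log-prob arguments are hypothetical inputs, not a real training API.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (sequence-level log-probs).

    margin > 0 means the policy prefers the chosen response more strongly
    than the frozen reference model does; the loss shrinks as margin grows.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already separates the pair better than the reference: small loss.
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))  # ≈ 0.313
```

No reward model appears anywhere: the preference pair plus a frozen reference is the entire judgment apparatus.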

LoRA is the cheapest backward pass. Freeze the base model. Train two small matrices whose product approximates the weight update. Merge at inference. Consolidate with a compression constraint: the update must fit in rank r. Soar's chunking has a different constraint. The update must be a single production rule derived from a dependency trace.
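The rank constraint and the free merge can both be shown in a few lines of NumPy. A sketch of the standard LoRA formulation (the usual α/r scaling factor is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                           # hidden size, LoRA rank (r << d)

W = rng.standard_normal((d, d))         # frozen base weight: never updated
A = rng.standard_normal((d, r)) * 0.01  # trainable down-projection
B = np.zeros((r, d))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Only A and B would receive gradients; the base path stays frozen.
    return x @ W + (x @ A) @ B

# Merge for inference: fold the rank-r update into the base weight,
# so serving costs nothing extra.
W_merged = W + A @ B
x = rng.standard_normal(d)
assert np.allclose(lora_forward(x), x @ W_merged)
print(A.size + B.size, "trainable vs", W.size, "frozen parameters")
```

Zero-initializing B makes the update an exact no-op at the start of tuning, so training begins from the base model's behavior.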

Here's what catches people: in-context learning and chain-of-thought look like Consolidate but aren't. They're Attend. Attention over the prompt. Nothing persists. Nothing compiles. Agent frameworks tried to fix this with Reflexion (store verbal critiques) and MemInsight (compress episodes into insights). Consolidate workarounds in text space, because the weights are frozen.
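The workaround pattern is simple enough to sketch. A minimal Reflexion-style loop, assuming hypothetical `run_attempt` (an LLM call) and `critique` (an evaluator) callables; neither name comes from the paper's code.

```python
# Sketch of a Reflexion-style loop. All the "learning" lives in the
# `memory` list of text, because the model's weights are frozen.
def reflexion_loop(task, run_attempt, critique, max_trials=3):
    memory = []                              # verbal critiques, not weights
    result = None
    for _ in range(max_trials):
        prompt = "\n".join(memory + [task])  # lessons are prepended as text
        result, success = run_attempt(prompt)
        if success:
            break
        memory.append(critique(result))      # persist the lesson in text space
    return result
```

The "update" is a string appended to a list; clear the list and the agent is back to zero.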

Perceive, not Remember

RAG retrieves documents and prepends them to the prompt, injecting external information at the input. That's Perceive. Soar's semantic memory does the same: a queryable store that feeds the forward pass. Soar retrieves by activation (recency + frequency + spreading); RAG retrieves by embedding similarity. Both are read operations on external memory.
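The read operation reduces to a similarity search plus string concatenation. A sketch with pre-computed embedding vectors (how the vectors are produced is out of scope here):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, doc_texts, k=2):
    """Embedding-similarity retrieval: a read on external memory."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity per document
    top = np.argsort(scores)[::-1][:k]    # k best matches
    return [doc_texts[i] for i in top]

def build_prompt(question, retrieved):
    # The injection point: retrieved text enters at the input (Perceive);
    # the model's weights are never touched.
    return "\n".join(retrieved) + "\n\nQuestion: " + question
```

Swap the cosine score for Soar's activation function and the structure is the same: rank the store, read the top items into the forward pass.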

The frozen forward pass

The forward pass is read-only. Soar's chunking compiles deliberation into rules on every impasse. Reflexion simulates this in text. Neither updates the model.

The context window is volatile. Soar stores every decision cycle. Generative Agents bolt on a natural-language episode stream. LLMs forget everything when the context clears.

Pre-training and RLHF are scheduled by engineers, not triggered by the agent's own experience. Soar's chunking fires automatically on impasse resolution. The diagnosis is the same: sealed Consolidate, frozen Remember.

Neighbors