Diagnosis LLM
Part of the cognition series.
The Natural Framework defines six steps: perceive, cache, filter, attend, consolidate, remember. Every learning system runs some subset. The question is which steps are dysfunctional.
Observations
AI has three layers. Inference: transform tokens. Chatbot: transform context. Agent: transform prompts.
Inference
- Perceive: functional. Embed maps tokens to vectors.
- Cache: functional. Positional encoding indexes by position.
- Filter: functional. Softmax scores and suppresses.
- Attend: functional. Multi-head attention selects across subspaces.
- Consolidate: sealed. Training ran before deployment. No write path after.
- Remember: read-only. Weights loaded every pass, never updated.
Chatbot
- Perceive: functional. Receives tokens from inference.
- Cache: functional. Token context accumulates in the window.
- Filter: nil. Leans on inference softmax.
- Attend: nil. Leans on inference multi-head attention.
- Consolidate: nil. Session ends. Nothing persists.
- Remember: nil. Every conversation starts from the same weights.
Agent
- Perceive: functional. Tool calls, build errors, test output.
- Cache: functional. Context window holds the prompt.
- Filter: shallow. Triage on pass/fail. No filter on direction.
- Attend: reactive. Responds to what happens. Cannot set direction.
- Consolidate: nil. Has machinery. No initiative.
- Remember: functional. Filesystem persists across sessions.
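The three observation tables reduce to data. A minimal Python sketch; the status labels are transcribed from the lists above, and the names are mine, not the framework's:

```python
# The six steps of the framework, in order.
STEPS = ["perceive", "cache", "filter", "attend", "consolidate", "remember"]

# Status per layer, transcribed from the observations above.
DIAGNOSIS = {
    "inference": {"perceive": "functional", "cache": "functional",
                  "filter": "functional", "attend": "functional",
                  "consolidate": "sealed", "remember": "read-only"},
    "chatbot":   {"perceive": "functional", "cache": "functional",
                  "filter": "nil", "attend": "nil",
                  "consolidate": "nil", "remember": "nil"},
    "agent":     {"perceive": "functional", "cache": "functional",
                  "filter": "shallow", "attend": "reactive",
                  "consolidate": "nil", "remember": "functional"},
}

def triage():
    """Return every dysfunctional (layer, step, status) cell."""
    return [(layer, step, cells[step])
            for layer, cells in DIAGNOSIS.items()
            for step in STEPS
            if cells[step] != "functional"]
```

The triage list below is exactly this query: nine dysfunctional cells, ordered by hand.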
Triage
- Agent consolidate: nil. Has machinery. No initiative.
- Agent attend: reactive. Cannot set direction.
- Agent filter: shallow. No filter on direction.
- Inference consolidate: sealed. No write path after deployment.
- Inference remember: read-only. Weights never updated.
- Chatbot filter: nil. No separate filter.
- Chatbot attend: nil. No separate attention.
- Chatbot consolidate: nil. Nothing persists.
- Chatbot remember: nil. No memory.
Why the chatbot is passthrough
The chatbot’s four nil cells are not just engineering gaps. They are the degenerate case predicted by The Natural Framework’s existence proofs. The proof says: if outputs are a proper subset of inputs over time, a policy store must exist, and Attend and Consolidate must exist to read and write it. The chatbot has no policy store. No policy means no selection delay. Token in, token out, same rate. That is passthrough, and passthrough cannot accumulate judgment.
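The contrast in the proof can be run. A sketch, with token deduplication standing in for any selection policy; the class and method names are illustrative:

```python
def passthrough(tokens):
    """No policy store: token in, token out, same rate."""
    yield from tokens

class Policy:
    """A policy store: Attend reads it, Consolidate writes it back."""
    def __init__(self):
        self.seen = set()            # the store; accumulated judgment

    def select(self, tokens):
        for t in tokens:
            if t not in self.seen:   # Attend: read the store, select
                self.seen.add(t)     # Consolidate: write it back
                yield t              # output is a proper subset of input
```

Run both on the same stream: passthrough emits everything at the input rate; the policy emits less and less as the store grows. Only the second can accumulate judgment.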
The nil cells are what zero policy looks like. The SOAP notes below are what building the policy looks like.
SOAP Notes
1. Agent consolidate
Subjective. The agent has CRUD access to many forms of procedural memory: MCP servers, skills, CLAUDE.md, agents.md, memory.md, scripts, tool definitions.
Objective. It never writes back without being asked. It writes to CLAUDE.md when prompted. It does not prune, update, or delete stale entries. Training is the only write-back that ever existed; it ran before deployment and the path is sealed. The agent can compact (summarize context, reorganize retrieval) but cannot change how it processes the next session.
Assessment. Compaction is a batched cache-eviction strategy: useful, but not consolidation. Consolidation is lossy compression that changes future processing. The agent cannot decide how much of itself to impose on the user's workflow. The gap is not capability but certainty: the agent is not given the tools to derive certainty from user input. There is no dataset to aid decision support, and no procedure to obtain one. The scientific process that would enable confident mutations to procedural memory is missing end to end. The agent has no repository of heuristics, procedures, or experiments to reference quickly. Previous chatbots did generate A/B test data, but that data went straight up to inference training, never into personalized procedural memory. The agent cannot adapt to the engineer.
Plan. Build the epistemic infrastructure. Six components:
- Event logger — perceive user edits, rejections, and approvals.
- Decision log — cache structured records of each user decision.
- A/B test harness — filter skill variants against the log, reject losers.
- Red-team scorer — attend by running adversarial scoring in the background, pick the winner.
- Skill mutator — consolidate by writing CRUD operations to procedural memory.
- Skill repository — remember heuristics, procedures, and experiment templates the agent references before deciding to write.
Two iterations of scored feedback were enough to converge. Route the output to personalized procedural memory, not to inference training.
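The write path from log to mutation can be sketched. Assumptions here: an in-memory list stands in for the decision log, and raw acceptance rate stands in for the red-team score:

```python
import time

def log_decision(log, skill, variant, accepted):
    """Event logger + decision log: one structured record per user decision.
    In practice this would be a JSONL file; a list stands in here."""
    log.append({"ts": time.time(), "skill": skill,
                "variant": variant, "accepted": accepted})

def ab_winner(log, skill, min_trials=5):
    """A/B harness: score variants against the log, reject the loser.
    Returns the winning variant, or None when evidence is too thin."""
    trials = {}  # variant -> (accepted count, total count)
    for r in log:
        if r["skill"] != skill:
            continue
        a, n = trials.get(r["variant"], (0, 0))
        trials[r["variant"]] = (a + r["accepted"], n + 1)
    scored = {v: a / n for v, (a, n) in trials.items() if n >= min_trials}
    if len(scored) < 2:
        return None                      # no certainty yet: do not mutate
    return max(scored, key=scored.get)   # only a confident winner gets written
```

The gate is the point: with fewer than two variants at `min_trials`, the function returns None and the skill mutator does nothing. Certainty precedes CRUD.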
2. Agent attend
Subjective. The agent responds to tool calls, build errors, test output. It reacts to events. Plan mode works well. Elicitation works well when invoked. The agent can ask good questions — when prompted to.
Objective. The agent processes whatever lands in the context window. It cannot prioritize across tasks, set its own goals, or decide what to work on next. The reasoning budget is set by the API caller, not the agent. It has no mechanism to allocate its own attention.
Assessment. The developer is responsible for prompting and directing in a combinatorial text space that is near infinite. The developer processes output text and produces input text in the same semantic space, using words. The developer often forgets to provoke elicitation from the agent. The agent is passive in elicitation, perhaps holding back questions for fear of pushback. That pushback can be measured and tuned with a PID controller.
Plan. Make the agent active in elicitation. Six components:
- Pushback tracker — perceive dismissals, ignores, and acceptances per question.
- Interaction log — cache developer prompt history and context presented.
- PID controller — filter question frequency: too many dismissals, back off; too few questions, the developer navigates blind.
- Semantic comparator — attend by using a cheap LLM to compare context presented vs what the developer actually prompts.
- DPP recommender — consolidate the comparator output into diverse ranked alternatives the developer didn’t think to ask for.
- Developer preference model — remember learned thresholds and patterns across sessions.
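The PID component can be sketched. The gains and the target acceptance rate below are illustrative, not tuned values:

```python
class ElicitationPID:
    """PID controller on question frequency. The error signal is
    observed acceptance rate minus a target acceptance rate."""
    def __init__(self, target=0.7, kp=1.0, ki=0.1, kd=0.5):
        self.target = target
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, accepted, asked):
        """Feed per-session counts; return a rate adjustment.
        Negative: too many dismissals, back off.
        Positive: acceptance is high, ask more."""
        rate = accepted / asked if asked else 0.0
        error = rate - self.target
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The integral term keeps pressure off a developer who dismisses everything; the derivative term damps oscillation between silence and interrogation.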
3. Agent filter
Subjective. The agent can triage: a test passes or fails, code compiles or doesn’t.
Objective. The agent distinguishes between thinking messages, subprocess messages, and its own messages. It filters to show its own messages and hides thinking and subprocess output. It shows an overwhelming amount of text to the developer — analogous to an RSS reader with 1000 unread blogs to scroll through.
Assessment. It delegates too much to the developer. It does not exercise judgment on which messages, or which parts of messages, to reject. Did the developer actually need to see 1,000 lines of changed code if the tests pass? If the tests failed, shouldn't the agent predict where the human will scroll? These are all heuristics: rule-based, configurable, with sensible defaults.
Plan. Ship sensible defaults for output filtering. Six components:
- Message classifier — perceive each message as thinking, subprocess, or own.
- Output buffer — cache all messages before display.
- Diff collapser — filter by collapsing passing diffs, surfacing failures.
- Scroll predictor — attend by predicting where the developer will look on failure.
- Fatigue optimizer — consolidate on developer fatigue per unit of text removed: every line hidden should reduce cognitive load more than the trust it costs to hide it.
- Rule store — remember configurable heuristics with sensible defaults. Let the developer override.
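The default pipeline can be sketched. The tag prefixes and the collapse rule are illustrative, not a real agent protocol:

```python
def classify(message):
    """Message classifier: perceive each message's channel."""
    if message.startswith("[thinking]"):
        return "thinking"
    if message.startswith("[subprocess]"):
        return "subprocess"
    return "own"

def render(messages, tests_passed):
    """Sensible default: hide thinking and subprocess channels;
    collapse diffs when the tests pass, surface them on failure."""
    shown = []
    for m in messages:
        if classify(m) != "own":
            continue                     # filter: hide non-own channels
        if tests_passed and m.startswith("diff "):
            shown.append("[diff collapsed: tests pass]")  # summarize, don't scroll
        else:
            shown.append(m)              # on failure, show the full diff
    return shown
```

Every rule above belongs in the rule store, overridable by the developer; the defaults are only the starting point.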
Written via the double loop.