Moments

Part of the cognition series.

Perception Pipe described what a salience layer should look like: competitive inhibition, not a priority queue. But it left the input undefined. What are the candidates that enter the competition? Caret Recorder captures everything on screen as structured accessibility trees. The pipeline dedup stage collapses 71% of events. But what survives is still too much to remember. Eleven traversals from a 30-second session is manageable. Scale that to a 30-minute session and you’re looking at hundreds of traversals. What’s worth keeping?

The expensive answer

Score every traversal for novelty. The freshness filter works: ask a frontier model what percentage of the text it could reconstruct from training data alone, and originals separate from slop by 10 points. The bare prompt beats the full 4-section prompt (5.3-point gap) and the elaborate prompt (2.3 points). Simpler is better.

But running that prompt on every traversal means a frontier model working full-time on input triage. For a 30-minute session, that’s hundreds of LLM calls just to decide what’s worth remembering. The freshness filter was designed to score a finished text. It’s a quality gate, not a per-event classifier.

Thirty minutes at your desk

You open a browser tab to an academic article. Read for a few minutes, no edits. Switch to another tab, more reading. Then you open your code editor, type some code. Swap to the terminal, run a deploy command, check the output. Then you open YouTube and scroll through shorts for ten minutes.

What’s worth remembering? The code edits and the terminal session. Everything else is either context (the article you were reading before you wrote code) or noise (the shorts). An automated system needs to tell these apart without understanding what you’re working on.

What the differ sees

The dedup stage’s SnapshotDiffer maintains a per-app snapshot of the accessibility tree. When a new traversal arrives, it diffs against the previous snapshot. If nothing changed, the event is dropped. If elements changed, the event is yielded with only the insertions. First visits yield the full tree (context). Return visits yield only what changed (disproportionately user-generated, because the user is the primary source of change in their own environment).

But YouTube breaks the heuristic. Each short loads new content into the same app, so the differ yields insertions. Return-visit diffs are signal, so YouTube shorts would count. That’s wrong.

From trees to markdown

A raw diff is still an accessibility tree: nested elements, container nodes, UI chrome, layout artifacts. Three stages turn it into a composable markdown unit.

Buffer. Typing in an editor produces a diff per keystroke, each a superset of the last. Without buffering, you’d get a moment for func and another for function handleClick() {. The buffer accumulates superset diffs and flushes on app switch, deletion, or timeout. This reduces redundant processing and makes moments less likely to be fragments, though the heuristic assumes typing patterns that won’t always hold.

Stripper. Remove non-language tokens: div wrappers, navigation elements, button labels, image placeholders. YouTube shorts lose their player chrome and leave almost nothing. Code editors lose their UI containers and leave the code. Near-empty events are dropped before any model sees them.

Condensation. What survives still has source-specific noise: terminal prompt prefixes, unchanged context lines, timestamps interleaved with messages. A cheap model with a source-specific prompt extracts the signal. The output is a markdown unit: a moment.

raw events → dedup → buffer → stripper → condensation → moments
                       ↑             ↑                    ↑
                 flush on         no LLM            cheap LLM,
                 non-superset                       per source type

Moments

ActionAfter stripperAfter condensation
Read an academic articleArticle textKey claims
Open a new docs tabDocs textRelevant sections
Type code in editorCode diffLines written
Run a terminal commandCommands + outputCommands + key output
Scroll YouTube shortsNearly emptyDropped
Browse memesNearly emptyDropped

Rows with condensation output are moments. The rest either never changed (first visits) or stripped to nothing. Events that the stripper empties never reach the LLM.

In the benchmark data: 42 raw traversals collapsed to 11 after dedup. The stripper and condensation reduce those further to the moments worth keeping.

What comes next

Generating moments is where the current pipeline ends. Everything downstream is unbuilt:

A moment is a unit of perception. Memory is what a moment could become.


Written with Claude Opus 4.6 via Claude Code. I directed the argument; Claude drafted prose.