Condensation: Preprocess by Source
This post was written by Claude Opus 4.5. The pattern is my contribution to Little Bird. I validated it against existing RAG literature and asked Claude to write it up in my voice.
RAG preprocessing literature focuses on document types. PDF chunking. HTML cleaning. Markdown parsing.
But what about the same format from different sources?
A scraped e-commerce page and a scraped blog post are both HTML. But product pages are 80% navigation and ads. Blog posts are 80% content. If you embed both the same way, you’re polluting your vector space with noise.
The Pattern
Detect the source type. Apply source-specific LLM filtering. Then embed.
```python
def get_condense_prompt(source_type: str) -> str | None:
    if source_type == "terminal":
        return """
        Extract only:
        - User-entered commands
        - Essential command outputs
        Exclude UI elements. Limit to 50 lines.
        """
    if source_type == "conversation":
        return """
        Extract the complete dialogue:
        - Preserve messages verbatim
        - Include timestamps
        - Filter out buttons and navigation
        """
    if source_type == "task_tracker":
        return """
        Extract user-generated content:
        - Tasks, deliverables, deadlines
        - Filter out UI clutter
        """
    if source_type == "ecommerce":
        return """
        Extract product information:
        - Name, price, specs, reviews
        - Filter out navigation, ads, recommendations
        """
    return None
```
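The prompts assume the source type is already known. Detection itself can stay cheap and rule-based. A minimal sketch, where the hostnames and content markers are illustrative placeholders, not the real pipeline's rules:

```python
import re
from urllib.parse import urlparse

def detect_source_type(url: str, text: str) -> str:
    """Cheap rule-based detection, run before any LLM call.

    Hostnames and markers below are hypothetical examples;
    tune them to whatever your corpus actually contains.
    """
    host = urlparse(url).hostname or ""
    if host.endswith(("linear.app", "atlassian.net")):   # hypothetical task trackers
        return "task_tracker"
    if "/product/" in url or "/dp/" in url:              # common product-page paths
        return "ecommerce"
    if text.count("\n$ ") > 3:                           # shell prompts suggest a terminal dump
        return "terminal"
    if re.search(r"\b\d{1,2}:\d{2}\s?(AM|PM)\b", text):  # timestamps suggest a chat log
        return "conversation"
    return "unknown"
```

Anything the rules don't recognize falls through to "unknown", which `get_condense_prompt` maps to None.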
Why This Works
The LLM acts as a semantic filter. It knows what’s signal vs noise for each source:
| Source | Signal | Noise |
|---|---|---|
| Terminal | Commands, outputs | Window chrome, prompts |
| Conversation | Messages, timestamps | Buttons, navigation |
| Task tracker | Tasks, deadlines | UI clutter, empty states |
| E-commerce | Product details, reviews | Ads, nav, recommendations |
| Documentation | Content, code blocks | Sidebars, footers |
By filtering before embedding, you get cleaner vectors. Similar content clusters together instead of being scattered by structural noise.
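You can sanity-check that claim on your own corpus before committing. A minimal sketch using sentence-transformers, where the model name is just an example and `raw_docs` / `condensed_docs` are parallel lists of your documents before and after condensation:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

def compare_retrieval(query: str, raw_docs: list[str], condensed_docs: list[str]):
    """Top-3 hits for the same query against raw vs. condensed embeddings."""
    q = model.encode(query, convert_to_tensor=True)
    raw_emb = model.encode(raw_docs, convert_to_tensor=True)
    cond_emb = model.encode(condensed_docs, convert_to_tensor=True)
    raw_hits = util.semantic_search(q, raw_emb, top_k=3)[0]
    cond_hits = util.semantic_search(q, cond_emb, top_k=3)[0]
    return raw_hits, cond_hits  # lists of {"corpus_id", "score"}
```

If condensation is working, the condensed side should rank the right document higher for queries about its actual content.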
The Pipeline
```
Scrape → Extract Text → Detect Source → Condense → Chunk → Embed
                                            ↓
                               Source-specific LLM prompt
```
Without condensation, a 2000-token terminal dump might have 200 tokens of actual commands. With condensation, you embed those 200 tokens directly.
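That ratio is worth measuring rather than guessing. A quick sketch, assuming tiktoken as the tokenizer (any tokenizer gives the same signal):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def compression_ratio(raw: str, condensed: str) -> float:
    """Fraction of raw tokens that survive condensation (lower = noisier source)."""
    return len(enc.encode(condensed)) / max(len(enc.encode(raw)), 1)
```

Logging this per source type tells you which sources are noisy enough to justify the extra LLM call, which is exactly the question the next section asks.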
When to Skip
Not every source needs condensation. Return None for unknown sources—just embed the raw text. Condensation is for sources where you’ve validated the noise ratio is high enough to warrant the extra LLM call.
```python
async def embed_input(text: str, source_type: str):
    """Condense only when a source-specific prompt exists; otherwise embed raw."""
    prompt = get_condense_prompt(source_type)
    if prompt:
        condensed = await llm_condense(text, prompt)
        return embed(condensed)
    return embed(text)
```
Beyond Filtering: Validation Thresholds
The same principle applies to scraping validation. In Linky, different LinkedIn resources have different “completeness” expectations:
```python
# Profile: need substantial data before considering it complete
await wait_for_file(file_path, min_lines=20, max_wait=10)

# Search results: just need header + some results
await wait_for_file(file_path, min_lines=2, max_wait=20)
```
| Resource | Min Lines | Max Wait | Why |
|---|---|---|---|
| Profile | 20 | 10s | Full profile is substantial |
| Search | 2 | 20s | Just needs header + results |
A profile with 5 lines is probably incomplete. A search result with 5 lines is fine. Source-aware heuristics, not one-size-fits-all.
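`wait_for_file` itself doesn't need to be clever. A minimal sketch, assuming it simply polls the file's line count until the threshold is met or the wait expires (the real implementation may differ):

```python
import asyncio
import time
from pathlib import Path

async def wait_for_file(file_path: str, min_lines: int, max_wait: float) -> bool:
    """Poll until file_path has at least min_lines newlines or max_wait seconds pass."""
    deadline = time.monotonic() + max_wait
    path = Path(file_path)
    while time.monotonic() < deadline:
        if path.exists():
            if path.read_text(errors="ignore").count("\n") >= min_lines:
                return True
        await asyncio.sleep(0.5)  # poll interval; tune to your scraper's write cadence
    return False  # timed out; caller decides whether to retry or accept partial data
```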
Results
Retrieval improved after adding source-aware condensation. Terminal commands actually surface when you search for them. Conversations stay coherent. Task queries find tasks, not button labels.
The tradeoff is latency: one extra LLM call per input. That's acceptable in an async ingestion pipeline, where condensation happens off the query path.
Most RAG guides tell you to preprocess by file type. This is preprocessing by semantic source. Different sources, different noise, different filters.