The problem

A long-running agent accumulates messages over time: user prompts, model responses, reasoning traces, tool calls, tool outputs. The model’s context window is finite — from a hundred thousand tokens on the smaller end to a million or more on the largest current frontier models. Once the conversation no longer fits, continuing without intervention triggers a hard error or silent truncation by the provider.

A compaction algorithm shrinks the conversation in place. The interesting strategies share a single shape: pick a moment on the conversation’s timeline (the split point), collapse everything older into a single summary message, place the summary at the split point, and keep everything newer as-is.

The conversation as a timeline

A conversation is a sequence ordered in time. Older messages on the left, newer ones on the right. The model reads the array left-to-right on every turn — the leftmost item is the oldest thing it remembers, the rightmost is the most recent thing it has to respond to. New user turns extend the right edge.

When the conversation outgrows the window, a strategy answers one question: which moment on the timeline is the split point? The three strategies below differ entirely in how they pick it.
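
To make the shape concrete, here is a minimal sketch in TypeScript. The `Msg` type, `compactAt`, and the `summarize` callback are hypothetical stand-ins for the article's concepts, not any particular host's API; each strategy below only has to supply the split index.

```typescript
// Hypothetical shapes; not any particular host's API.
type Role = "user" | "assistant" | "tool_call" | "tool_output";

interface Msg {
  role: Role;
  text: string;
  tokens: number; // per-message estimate; see "Measurement" below
}

// The shared shape: everything older than the split index collapses into
// one summary message; the host stores the summary plus a tail pointer.
async function compactAt(
  timeline: Msg[],
  splitIndex: number,
  summarize: (older: Msg[]) => Promise<Msg>, // the summarization LLM call
): Promise<{ summary: Msg; tailStart: number }> {
  const summary = await summarize(timeline.slice(0, splitIndex));
  return { summary, tailStart: splitIndex };
}
```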

Strategy 1: Preserve user messages

Split-point rule: walk the timeline newest-first, keeping every user message and dropping every agent-generated item, until a max-token cap (~20K) on the kept messages is reached. Kept user messages are re-ordered chronologically; the summary sits at the split point, just after the latest preserved user message.

[Figure: Strategy 1: Preserve user messages]
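
A minimal sketch of that walk, reusing the hypothetical `Msg` type from the earlier sketch (the function name and the 20K default are illustrative):

```typescript
// Walk newest-first: agent items drop, user messages keep until the cap.
function preserveUserMessages(timeline: Msg[], capTokens = 20_000): Msg[] {
  const kept: Msg[] = [];
  let used = 0;
  for (let i = timeline.length - 1; i >= 0; i--) {
    const m = timeline[i];
    if (m.role !== "user") continue; // agent-generated items go to the summary
    if (used + m.tokens > capTokens) break; // cap hit: older user msgs drop too
    kept.push(m);
    used += m.tokens;
  }
  return kept.reverse(); // re-order chronologically; summary goes after the last one
}
```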

Strategy 2: Preserve a recent token fraction

Split-point rule: walk backward from the present, accumulating tokens, until you reach p × total_tokens (observed: p = 0.3); snap to the next user-message boundary so the preserved tail starts on a user turn. Everything earlier than that snapped point collapses; everything in the recent fraction stays as-is regardless of role.

[Figure: Strategy 2: Preserve a recent token fraction]
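
A sketch of the walk-back, again with the hypothetical `Msg` type. The snap direction is an assumption here: the rule as stated is satisfied by snapping either way, and this version snaps forward (newer), slightly shrinking the tail below p.

```typescript
// Find the tail-start index for the recent fraction (observed p = 0.3).
function recentFractionSplit(timeline: Msg[], p = 0.3): number {
  const total = timeline.reduce((sum, m) => sum + m.tokens, 0);
  let acc = 0;
  let i = timeline.length - 1;
  for (; i >= 0; i--) {
    acc += timeline[i].tokens;
    if (acc >= p * total) break; // collected the recent fraction
  }
  if (i < 0) i = 0; // whole conversation fits inside the fraction
  // Snap to a user-message boundary so the tail starts on a user turn.
  while (i < timeline.length && timeline[i].role !== "user") i++;
  return i; // everything before this index collapses into the summary
}
```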

Strategy 3: Preserve recent turns

Split-point rule: a “turn” is one user message plus everything generated in response (assistant text, tool calls, tool outputs) up to the next user message. Keep the last N complete turns (default N=2); collapse everything older into the summary. A per-turn token budget caps each preserved turn (observed defaults: 25% of usable context, clamped to [2K, 8K]).

The summary sits at the split point — chronologically immediately after the last preserved turn.

[Figure: Strategy 3: Preserve recent turns]
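
A sketch of both rules, with `Msg` as above; `lastTurnsSplit` and `perTurnBudget` are illustrative names:

```typescript
// Keep the last N complete turns; a turn starts at each user message.
function lastTurnsSplit(timeline: Msg[], n = 2): number {
  const turnStarts: number[] = [];
  timeline.forEach((m, i) => {
    if (m.role === "user") turnStarts.push(i);
  });
  if (turnStarts.length <= n) return 0; // too few turns; nothing collapses
  // First user message of the (M − N + 1)ᵗʰ turn, where M = total turns.
  return turnStarts[turnStarts.length - n];
}

// Observed per-turn budget: 25% of usable context, clamped to [2K, 8K].
const perTurnBudget = (usableContext: number): number =>
  Math.min(8_000, Math.max(2_000, Math.floor(0.25 * usableContext)));
```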

Comparison

| Strategy | Split-point rule | What survives the cut |
|---|---|---|
| Preserve user messages | every user message is on the “keep” side; agent items are on the “drop” side | every user message (within a ~20K token budget) |
| Preserve recent token fraction | walk back until p × total_tokens collected, snap to user-msg boundary | everything in the recent fraction, regardless of role |
| Preserve recent turns | the user-msg boundary that starts the (M − N + 1)ᵗʰ turn | the last N complete turns (within per-turn budget) |

When to choose which

The choice depends on two properties of the workload:

  • Role asymmetry. Are user prompts the durable record (with agent output reproducible from them), or do both sides feed later turns equally?
  • Recency. Do the last few exchanges drive the next decision, or does the cumulative trail of decisions matter more?

Strong role asymmetry points toward preserving user messages; strong recency points toward preserving a recent fraction or the last N turns.

A common pre-trim step

All three strategies do something before the LLM call to keep the input itself small. The reason is the same — the model that’s going to write the summary is also bound by a context window, and feeding it 100% of an overflowing conversation just moves the overflow into the summarization call. Concrete pre-trim moves seen in practice (one is sketched after the list):

  • Iteratively dropping the oldest items and retrying when the summarization call itself overflows.
  • Walking newest-first and dropping agent-generated items until the input fits, leaving user messages intact.
  • Truncating individual tool outputs to their last few dozen lines, with the original spilled to a temp file.
  • Hard-capping the total token budget for tool-response content at some fraction of the model’s window (e.g., 50K tokens).
  • Marking older tool outputs as “compacted” with a timestamp so they render as [Old tool result content cleared] to the LLM but stay queryable in the host’s storage.
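
As an illustration, here is a minimal sketch of the third move in the list, with `Msg` as above. The temp-file spill is elided, and the chars/4 re-estimate anticipates the Measurement section below.

```typescript
// Truncate each tool output to its last `keepLines` lines before the
// summarization call; originals would be spilled to disk by the host.
function truncateToolOutputs(input: Msg[], keepLines = 40): Msg[] {
  return input.map((m) => {
    if (m.role !== "tool_output") return m;
    const lines = m.text.split("\n");
    if (lines.length <= keepLines) return m;
    const text =
      `[truncated: first ${lines.length - keepLines} lines spilled to disk]\n` +
      lines.slice(-keepLines).join("\n");
    return { ...m, text, tokens: Math.ceil(text.length / 4) }; // re-estimate
  });
}
```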

The output of pre-trim is what actually goes to the summarization call. The strategy’s preservation rule applies to the original timeline (which messages survive as-is into the post-compaction state), not to what the summarizer sees.

Triggers

Triggers fall into three kinds:

  • Budget trigger. Fires when a measurement crosses a threshold (e.g. total_tokens >= limit). Common forms: a fixed fraction of the window (50% is widely seen); a fixed-size headroom buffer (20K of slack for windows over 200K, 20% for smaller ones); a per-model auto-compact limit. Reactive overflow — running compaction after a context-window-exceeded error — is a budget trigger that fired late.
  • Structural trigger. Fires on a change to the shape or contents of the conversation: “a new message arrived,” “a tool produced output,” “compaction just completed,” “the user invoked compaction.” No measurement — the fact that it happened is the trigger. Used for eager work that keeps pressure from building (e.g. distilling a fresh tool output so it’s already small by the time a budget trigger fires) and for explicit user-invoked compaction.
  • Temporal trigger. Fires on a clock — “N milliseconds have elapsed since the last fire.” No measurement, no event, just time. Used for periodic maintenance with no natural event hook (cache GC, idle cleanup, polling-style work).

| Trigger kind | Question | Example |
|---|---|---|
| Budget | “Have we exceeded a limit?” | `total >= 80_000` |
| Structural | “Did this thing happen?” | `on(new_message)` |
| Temporal | “Has time elapsed?” | `every(5 minutes)` |

Triggers compose. An inline auto-compact in practice is budget AND structural: total >= auto_compact_limit && (model_needs_follow_up || has_pending_input). The structural predicate guards against compacting at the wrong moment — e.g. just as a turn was about to end naturally.
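
A sketch of that composed predicate; the field names are illustrative, not any real host's API:

```typescript
interface TriggerCtx {
  totalTokens: number;
  autoCompactLimit: number;
  modelNeedsFollowUp: boolean; // e.g. a tool call still awaits its result
  hasPendingInput: boolean; // e.g. a queued user message
}

function shouldAutoCompact(ctx: TriggerCtx): boolean {
  const budget = ctx.totalTokens >= ctx.autoCompactLimit; // budget trigger
  const structural = ctx.modelNeedsFollowUp || ctx.hasPendingInput; // structural guard
  return budget && structural; // never compact as a turn is about to end naturally
}
```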

Most non-trivial systems mix all three: a budget trigger to react to pressure, structural triggers to do work eagerly so pressure builds slower, and temporal triggers for housekeeping that nothing else hooks.

Measurement

Triggers depend on token counts, and token counts depend on estimates.

A crude back-of-envelope formula: tokens ≈ characters / 4.
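
In code, with `Msg` as above; this is adequate for trigger thresholds, far too coarse for anything like billing:

```typescript
// The chars/4 heuristic as a token estimator.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

const conversationTokens = (timeline: Msg[]): number =>
  timeline.reduce((sum, m) => sum + estimateTokens(m.text), 0);
```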

The shared shape

Every strategy reduces to the same two artifacts the host stores: a generated summary message, and a tail-start pointer marking where the preserved messages begin on the original timeline. On the next turn, a history loader splices them: `[...messagesFromPointerOnwards, summary]`.

| Strategy | Tail-start pointer |
|---|---|
| Preserve user messages | first kept user message |
| Preserve recent token fraction | first message of the recent fraction |
| Preserve recent turns | first user message of the (M − N + 1)ᵗʰ turn |
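
A sketch of the stored record and the loader splice, mirroring the order above; `CompactionRecord` and `loadHistory` are hypothetical names:

```typescript
interface CompactionRecord {
  summary: Msg; // the generated summary message
  tailStart: number; // tail-start pointer into the original timeline
}

// Splice on the next turn: preserved tail first, then the summary.
const loadHistory = (timeline: Msg[], rec: CompactionRecord): Msg[] => [
  ...timeline.slice(rec.tailStart),
  rec.summary,
];
```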

Swapping strategies is a question of which pointer value gets written, not of storage shape — the host ships one replacement mechanism and applies it uniformly.

Terms

| Term | Definition |
|---|---|
| compaction | Shrinking the conversation in place when it no longer fits the context window: collapse older content into a summary, preserve recent content as-is. |
| split point | A position on the conversation timeline. Everything older than it collapses into the summary; everything newer is preserved as-is. Strategies differ only in how they pick this position. |
| tail | The contiguous suffix of messages preserved as-is — from the split point to the present. |
| summary message | A single LLM-generated message condensing everything older than the split point. The host stores it alongside the tail. |
| tail-start pointer | The stored marker (a message ID or array index) identifying where the tail begins on the original timeline. One value gets written per compaction. |
| turn | One user message plus every agent-generated item produced in response, up to the next user message. |
| agent-generated item | Any conversation item not authored by the user: assistant text, tool calls, tool outputs, reasoning traces. |
| trigger | The condition that initiates compaction. Three kinds: budget (a measurement crosses a threshold), structural (an event happened), temporal (time elapsed). |
| pre-trim | Token-reduction applied to the input of the summarization call itself so the summarizer doesn’t overflow. Distinct from the strategy’s preservation rule, which describes the post-compaction timeline. |
| host | The runtime that owns the conversation state and runs compaction (the agent process or session manager). |