# Bartłomiej Krupa: Agentic Engineering · LLM Optimization · AI Cost Reduction

> Bartłomiej Krupa engineers the harness around LLMs: agentic systems built for reliability and low cost. Deep dives, atomic notes, and a curated radar on LLM optimization, AI cost reduction, harness engineering, and agentic infrastructure.

Site: https://bartlomiejkrupa.dev
Author: Bartłomiej Krupa

## Deep Dives

### Pick the right Claude tier for agent workflows

URL: https://bartlomiejkrupa.dev/articles/claude-model-tier-comparison
Published: 2026-06-30
Tags: claude, opus, sonnet, haiku, pricing, agents, model-selection

> Drafted by AI from sourced research, reviewed by a human before publish. Model IDs, prices, and limits are cited to Anthropic's model documentation; benchmark figures are attributed to Anthropic's launch announcements with their model versions named.

Anthropic's current stack is three tiers, and the choice is an economics problem, not a benchmark one. **Opus 4.8** (`claude-opus-4-8`, $5 / $25 per million input/output tokens) is the orchestrator for long-horizon, high-autonomy work. **Sonnet 4.6** (`claude-sonnet-4-6`, $3 / $15) is the default daily driver. **Haiku 4.5** (`claude-haiku-4-5`, $1 / $5) is the worker you fan out across parallel sub-agents at roughly a third of Sonnet's token cost. Match the tier to the job and you stop paying Opus rates for work Haiku finishes faster.

## Definitions

**Opus 4.8**: Anthropic's most capable Opus-tier model, with the highest autonomy, the best long-horizon agentic execution, a 1M-token context, and up to 128K output tokens. Adaptive thinking only; effort defaults to `high`.

**Sonnet 4.6**: the best balance of speed and intelligence. Same 1M-token context, 64K max output, adaptive thinking. The default model on Claude's Free and Pro tiers.

**Haiku 4.5**: the fast, cheap tier for high-volume and latency-sensitive work. 200K-token context, 64K max output. Built to run as a sub-agent under a more capable orchestrator.
## The tiers at a glance

| | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 |
| --- | --- | --- | --- |
| API ID | `claude-opus-4-8` | `claude-sonnet-4-6` | `claude-haiku-4-5` |
| Input / output ($/M tokens) | $5 / $25 | $3 / $15 | $1 / $5 |
| Context window | 1M | 1M | 200K |
| Max output | 128K | 64K | 64K |
| Thinking | Adaptive (effort `low`–`max`, default `high`) | Adaptive (effort `low`–`max`) | No effort parameter |
| Fast mode | Yes (≈2.5× output speed, premium pricing) | No | No |
| Best fit | Long-horizon agents, codebase-scale migrations, orchestration | Daily coding, document Q&A, design-heavy frontend | Parallel sub-agents, real-time chat, high-volume ops |

Prices, context windows, and output caps are from Anthropic's model documentation. The effort parameter is Opus- and Sonnet-tier only: it returns an error on Haiku 4.5, so Haiku is configured through the older thinking controls, not `effort`.

## Pick by job, not by benchmark

Benchmark deltas between tiers are real, but they are not the decision. The decision is which role the model plays in your workflow and what that role costs per token. Three roles:

| Role in the workflow | Tier | Why |
| --- | --- | --- |
| Orchestrator: plans, delegates, holds the long thread | Opus 4.8 | Highest autonomy and long-horizon coherence; Claude Code's dynamic workflows fan out hundreds of parallel sub-agents under it |
| Default worker: most coding and knowledge tasks, solo | Sonnet 4.6 | Closes most of the gap to Opus on everyday work at 60% of the input cost |
| Parallel worker: fast, scoped, many at once | Haiku 4.5 | ≈⅓ of Sonnet's token cost and up to 4–5× faster than Sonnet 4.5, so you can run many in parallel without the bill exploding |

The expensive mistake is running every step at Opus rates. The other expensive mistake is forcing a deep, long-horizon migration through a tier that loses the thread halfway. Tier selection is the lever; the sections below are when to pull it.
Encoded as a config your scripts and sub-agent launchers can read:

```bash
ORCHESTRATOR=claude-opus-4-8   # long-horizon, high autonomy
DEFAULT=claude-sonnet-4-6      # daily driver
WORKER=claude-haiku-4-5        # parallel sub-agents
```

## When to reach for Opus 4.8

Reach for Opus 4.8 when the task is long-horizon, hard to fully specify up front, and expensive to get wrong: a multi-day refactor, a codebase-scale migration, or an orchestrator that delegates to a fleet of workers and has to keep the plan straight across hundreds of tool calls.

The capability case is concrete. On Anthropic's launch numbers, Opus 4.8 is roughly **4× less likely than Opus 4.7 to leave a code flaw unremarked**, and scores **84% on Online-Mind2Web** for browser and computer use, the kind of self-verifying, multi-step work where a cheaper tier compounds small errors into a wrong result.

Two operational notes:

- **Give it the whole spec up front and run at high effort.** Opus 4.8's long-horizon strength comes from reasoning more at each step. A complete first turn plus `effort: "high"` or `"xhigh"` produces more efficient *and* more accurate runs than drip-feeding context.
- **Fast mode is an Opus-tier lever.** When latency matters, Opus 4.8 runs the same model at roughly 2.5× output speed at premium pricing, a knob the cheaper tiers don't have.

## When Sonnet 4.6 is the right default

Sonnet 4.6 is the model you should reach for first and move *off* only with a reason. It carries the same 1M-token context as Opus, supports adaptive thinking, and at $3 / $15 costs 40% less per token than Opus, on both input and output, for the bulk of coding and document work.

At its launch, Claude Code testers preferred Sonnet 4.6 to the previous Sonnet generation in about **70% of head-to-heads**, and to Opus 4.5, the prior flagship tier, in **59%**. That is a genuine same-tier upgrade, not a marketing delta.

Stay on Sonnet 4.6 for interactive coding, document Q&A, and design-heavy frontend work.
Escalate to Opus 4.8 only when the task is genuinely long-horizon or autonomy-critical, and remember that a bigger window is not what makes the higher tier worth it. The discipline that actually keeps agents reliable is [context engineering](/articles/context-engineering-beats-a-bigger-window), and it applies at every tier.

## When Haiku 4.5 wins

Haiku 4.5 is the sub-agent economy. At $1 / $5 it is roughly a third of Sonnet's per-token cost, and on Anthropic's numbers it lands **73.3% on SWE-bench Verified** while matching Sonnet 4-class coding and computer-use quality at up to **4–5× the speed of Sonnet 4.5**. That combination (cheap, fast, capable enough) is exactly what a parallel worker needs.

Use it for real-time chat, high-volume classification and extraction, and any architecture that fans many scoped tasks out at once. The one constraint to design around: Haiku's context window is 200K, not 1M, so keep each worker's job small and self-contained rather than handing it a sprawling history.

## The orchestrator–worker pattern

The tiers compose. The pattern that gets the most out of all three is an Opus 4.8 orchestrator that delegates scoped sub-tasks to Haiku 4.5 workers, with Sonnet 4.6 as the solo default when no orchestration is needed.

Symptom: a single big model reads a 30K-token API reference or a pile of source files, and that clutter stays in context for the rest of the run, dragging quality down. Cause: heavy reads done in the main loop persist on every subsequent turn. Solution: push the read into a sub-agent that runs in its own window and returns only the conclusion. Spawn those workers on Haiku and you get the isolation *and* the cheapest possible token rate for the grunt work. See [subagent context isolation](/garden/subagent-context-isolation).
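A rough sketch of the arithmetic behind the orchestrator and worker split, using the per-million-token prices from the tier table. The token volumes and the worker count are invented for illustration, not measurements, and `cost_usd` and `fan_out_cost` are hypothetical helper names. With these assumed volumes the orchestrator costs $1.50 either way, while the worker line is $2.40 on Haiku and $7.20 on Sonnet.

```python
# USD per million tokens (input, output), copied from the tier table in this article.
PRICES = {
    "claude-opus-4-8": (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (1.00, 5.00),
}

# Assumed workload: invented numbers, chosen only to show the shape of the bill.
ORCHESTRATOR_TOKENS = (200_000, 20_000)  # (input, output) for the whole run
WORKER_TOKENS = (50_000, 2_000)          # (input, output) per sub-agent
N_WORKERS = 40


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call (or one run) at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def fan_out_cost(orchestrator: str, worker: str) -> tuple[float, float]:
    """Return (orchestrator cost, total worker cost) for the assumed workload."""
    orchestrator_cost = cost_usd(orchestrator, *ORCHESTRATOR_TOKENS)
    workers_cost = N_WORKERS * cost_usd(worker, *WORKER_TOKENS)
    return orchestrator_cost, workers_cost


# Same Opus orchestrator, three different worker tiers.
for worker in ("claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-8"):
    orch, workers = fan_out_cost("claude-opus-4-8", worker)
    print(f"{worker:<18} orchestrator ${orch:.2f}  workers ${workers:.2f}  total ${orch + workers:.2f}")
```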
This is why the per-token spread matters more than the benchmark spread: in a fan-out architecture, the worker tier runs far more tokens than the orchestrator, so a worker at a third of the price cuts the dominant line of the bill to a third.

## Bottom line

There is no single best Claude model; there is a best model per role. Default to Sonnet 4.6, escalate to Opus 4.8 for long-horizon and high-autonomy work, and delegate the parallel grunt work to Haiku 4.5. Pick by the job and the token bill, not the leaderboard, and the stack pays for itself.

---

### Context engineering beats a bigger context window

URL: https://bartlomiejkrupa.dev/articles/context-engineering-beats-a-bigger-window
Published: 2026-06-30
Tags: claude-code, context-window, context-engineering, tokens, subagents

> Drafted by AI from sourced research, reviewed by a human before publish. Every figure below is cited to Anthropic's own documentation.

Claude Code now runs up to a 1M-token context window on flagship models, but a bigger window does not fix a cluttered one. Every turn resends the entire conversation, and answer quality degrades before the hard limit is ever reached. The lever that matters is context engineering, not window size.

## Definition

**Context window**: the maximum number of tokens a model can consider at once, covering the system prompt, the full conversation history, tool results, and the model's own replies.

**Context engineering**: deliberately curating what enters that window each turn, instead of trusting a larger window to absorb the clutter.
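The replay cost mentioned in the introduction can be made concrete with a small model. This is a minimal sketch under stated assumptions: the turn sizes are invented, prompt caching and compaction are ignored, and `cumulative_input_tokens` is a hypothetical helper, not part of any Anthropic SDK.

```python
def cumulative_input_tokens(base_tokens: int, turn_tokens: list[int]) -> int:
    """Total input tokens sent over a session when each turn resends the full history.

    base_tokens: tokens present before the first turn (system prompt, CLAUDE.md).
    turn_tokens: tokens appended to the history on each turn (prompt, replies,
                 tool output, file reads).
    """
    total = 0
    context = base_tokens
    for added in turn_tokens:
        context += added  # whatever this turn adds stays in the history
        total += context  # and the whole history is sent again as input
    return total


# Ten turns of 1,000 tokens each on top of a 2,000-token base.
plain = cumulative_input_tokens(2_000, [1_000] * 10)

# Same session, except turn 2 also pulls a 30,000-token document into the window.
with_read = cumulative_input_tokens(2_000, [1_000, 31_000] + [1_000] * 8)

print(plain)              # 75000
print(with_read)          # 345000
print(with_read - plain)  # 270000: the 30K read is resent on each of the 9 turns it is present for
```

In this toy session the one document read accounts for more input tokens than the rest of the conversation combined, which is the behavior the sections below are designed to avoid.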
## The window got bigger; the failure mode didn't

Anthropic's current paid-plan limits (verified against Anthropic support, June 2026):

| Surface | Models | Context limit |
| --- | --- | --- |
| Claude chat (paid) | Opus 4.8 / 4.7 / 4.6, Sonnet 4.6 | 500K |
| Claude chat (paid) | other models | 200K (~500 pages) |
| Claude Code | Opus 4.8 / 4.7 / 4.6 | 1M |
| Claude Code | Sonnet 4.6 | 1M |

On Pro, the 1M window for Opus and Sonnet 4.6 in Claude Code requires enabling usage credits (usage-based Enterprise plans are exempt).

A 1M window sounds like it ends the problem. It doesn't, for two reasons.

## Cause: every turn replays, and the middle gets lost

**Replay.** An agent turn is not just your latest prompt. The model re-reads the entire state every time: prior messages, its own previous responses, tool outputs, file reads, and any docs it fetched. A single large file read or doc fetch keeps costing tokens for the rest of the session.

**Lost in the middle.** Recall is strongest at the very start and very end of the window and weakest in between. So a fact buried mid-context can be ignored while the token count is still far under the cap: quality drops before the hard limit, not at it.

The result: throughput and accuracy fall as the window fills with low-signal content, regardless of how high the ceiling is.

## Solution: engineer the context

Symptom: the agent forgets earlier instructions, repeats work, or references files it was never shown. Cause: a bloated, low-density window. Solution, in four moves:

- **Keep `CLAUDE.md` lean.** It loads at session start and costs tokens every turn. Put only universally applicable facts in it; pass task-specific detail in the prompt. See [keep CLAUDE.md to universal instructions](/garden/claude-md-universal-only).
- **`/compact` proactively.** Summarize the conversation before it sprawls, and append an instruction telling it what to preserve (for example, the auth flow you're mid-change on).
- **`/clear` between tasks.** When you switch to unrelated work, reset to zero tokens instead of dragging the old context along.
- **Plan outside, paste the result in.** Do exploratory back-and-forth somewhere cheap, then inject only the final plan, not the dead ends that produced it.

The two reset commands, as typed at the Claude Code prompt (`/compact` takes an optional instruction for what to preserve; `/clear` resets to zero tokens between unrelated tasks):

```bash
/compact focus on the auth flow I'm mid-change on
/clear
```

## Push heavy reads into a subagent

The highest-leverage move for large reads is delegation. A subagent runs in its own context window and returns only a summary to the main agent.

Reading an MCP server's API reference or a pile of source files can cost tens of thousands of tokens. In one demo run, an API planner reading Stripe's docs through the Context7 MCP server consumed ~54K tokens (illustrative of the scale, not a benchmark). Done in the main window, that clutter persists for the rest of the session; done in a subagent, the parent receives just the conclusion.

Reach for it when a task needs a large doc or many files read but the main loop only needs the answer. A subagent is also where you drop to a cheaper, faster tier (Haiku 4.5 reads the pile while Opus or Sonnet keeps the plan), so the context isolation and the token savings compound. See [subagent context isolation](/garden/subagent-context-isolation) and [pick the right Claude tier](/articles/claude-model-tier-comparison).

## Bottom line

The window will keep growing. The discipline that makes agents reliable, [context engineering](/garden/context-engineering), does not change with it. Treat the window as a budget to spend deliberately, not a bucket to fill.

---

### Why agents ignore your CLAUDE.md

URL: https://bartlomiejkrupa.dev/articles/why-agents-ignore-your-claude-md
Published: 2026-06-30
Tags: claude-code, claude-md, context-engineering, agents-md, progressive-disclosure

> Drafted by AI from sourced research, reviewed by a human before publish.
Official guidance is cited to Anthropic; third-party claims are attributed to their source.

If your agent skips instructions you put in `CLAUDE.md`, the file is probably too long and too specific. The fix is not louder emphasis; it is a shorter, more universal file, with everything else moved behind progressive disclosure.

## Definition

**`CLAUDE.md`**: a project file (named `AGENTS.md` in some other tools) that Claude Code loads at the start of every session as standing instructions. It is the agent's onboarding surface: agents start each session with no memory of your codebase.

## What it costs

`CLAUDE.md` is not free background knowledge. It loads at session start and stays in context on every turn, so each line competes for attention with your actual task. Length is a tax, paid continuously.

A working target: HumanLayer keeps theirs under 60 lines, and ~300 lines is the common ceiling practitioners cite. Past that, expect the agent to start discounting the file.

## Why it gets ignored

Two failure modes, both made worse by length:

- **Relevance filtering.** When much of the file reads as irrelevant to the task at hand, the agent is more likely to discount the whole thing, including the parts that did matter. Practitioners writing about Claude Code (HumanLayer, Builder.io) consistently report that non-universal, task-specific content raises the ignore rate.
- **Instruction budget.** Every instruction you add dilutes the others. Practitioners peg reliable adherence at roughly 150–200 instructions for frontier thinking models, and Claude Code's own system prompt already spends about 50 before your file loads, so padding `CLAUDE.md` with situational rules burns a finite attention budget on low-value lines.

The through-line: task-specific content in a file that loads for *every* task is the problem.
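The instruction-budget arithmetic above can be turned into a rough self-check. This is only a sketch: counting bullet and numbered lines is a crude stand-in for "instructions" (an assumption, not how any model measures them), the 150–200 and ~50 figures are the practitioner estimates quoted above, and all function names are hypothetical.

```python
import re

RELIABLE_RANGE = (150, 200)        # practitioner estimate quoted above
SYSTEM_PROMPT_INSTRUCTIONS = 50    # approximate figure quoted above

# Assumption: one bullet or numbered list item counts as one instruction.
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+\.)\s+\S")


def count_instructions(claude_md: str) -> int:
    """Count list items in the file as a proxy for instructions."""
    return sum(1 for line in claude_md.splitlines() if LIST_ITEM.match(line))


def remaining_budget() -> tuple[int, int]:
    """Budget left for CLAUDE.md after the system prompt takes its share."""
    low, high = RELIABLE_RANGE
    return low - SYSTEM_PROMPT_INSTRUCTIONS, high - SYSTEM_PROMPT_INSTRUCTIONS


def report(claude_md: str) -> str:
    used = count_instructions(claude_md)
    low, high = remaining_budget()
    if used <= low:
        status = "within budget"
    elif used <= high:
        status = "borderline"
    else:
        status = "over budget"
    return f"{used} instruction-like lines; remaining budget roughly {low}-{high}: {status}"


sample = """# CLAUDE.md
## Commands
- Run the test suite before every commit
- Build with the project's build script
## Conventions
- Follow the repo's import ordering
"""
print(report(sample))
```

Read the result as a prompt to prune, not as a measurement: a file far under the count can still be ignored if most of it is task-specific.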
## What to include (and exclude)

Anthropic's best-practices guidance is to put in only what the agent can't infer:

| Include | Exclude |
| --- | --- |
| Non-guessable shell commands | Anything obvious from the code or types |
| The project's test runner | Restating existing docs |
| Code style that differs from defaults | One-off, task-specific steps |
| Repo etiquette and architecture decisions | Long examples better kept in a linked doc |
| Environment quirks and gotchas | Style rules a formatter can enforce |

A useful test: if a competent new contributor could figure it out from the repo in a minute, it doesn't belong in `CLAUDE.md`. See [keep CLAUDE.md to universal instructions](/garden/claude-md-universal-only).

A lean file that survives the relevance filter looks like this:

```markdown
# CLAUDE.md

## Commands
- Test:
- Build:

## Conventions
-

## Architecture
- @docs/architecture.md  # pulled in only when referenced
```

## Move the rest behind progressive disclosure

Everything that isn't universal still has a home, just not one that loads every turn:

- **`@path/to/file` imports**: pull in a doc only where referenced.
- **Skills**: package a domain workflow the agent loads when the task calls for it.
- **Child-directory `CLAUDE.md`**: scope instructions to the subtree they apply to, loaded on demand.
- **Separate reference docs**: keep the big architecture map or PRD out of the always-on file.

This keeps the always-loaded surface small and high-signal, while the detail stays reachable exactly when it's needed.

## Don't use the model as a linter

Style and formatting rules are cheaper and more reliable enforced by tooling (formatters, a Stop hook, or a slash command) than by instructions the agent has to remember every turn. Reserve `CLAUDE.md` for what only prose can convey.

## Bottom line

A `CLAUDE.md` the agent actually follows is short, universal, and ruthlessly pruned.
Treat it as the one always-loaded surface and push everything situational behind progressive disclosure, the same [context-engineering](/garden/context-engineering) discipline that [governs the rest of the window](/articles/context-engineering-beats-a-bigger-window).

---

## Garden: Atomic Notes

### Keep CLAUDE.md to universal instructions

URL: https://bartlomiejkrupa.dev/garden/claude-md-universal-only
Category: rule | Updated: 2026-06-30
Summary: Put only universally applicable instructions in CLAUDE.md; task-specific content raises the odds an agent treats the whole file as noise and ignores it.

Put only universally applicable instructions in `CLAUDE.md`. Task-specific or situational content raises the chance the agent treats the whole file as noise and ignores it.

## Include

- Non-guessable shell commands and the project's test runner.
- Code style that differs from language defaults.
- Repository etiquette, architecture decisions, and environment gotchas.

## Exclude

- Anything the code, types, or existing docs already make clear.
- One-off, task-specific steps; pass those in the prompt or via progressive disclosure (`@imports`, skills, child-directory files) instead.

## Why

`CLAUDE.md` loads at the start of every session and consumes context on every turn. A short, universal file stays high-signal; a bloated one dilutes the instructions that matter.

---

### Context engineering

URL: https://bartlomiejkrupa.dev/garden/context-engineering
Category: concept | Updated: 2026-06-30
Summary: Context engineering is curating what enters the model's context window each turn, rather than relying on a bigger window to absorb the clutter.

Context engineering is the practice of curating what enters a model's context window each turn, instead of relying on a larger window to absorb clutter. Input quality drives output quality; a bigger window does not.
## Why it matters

Every turn in an agent loop resends the full state: prior messages, model responses, tool results, file reads, and fetched docs, not just the latest prompt. Tokens accumulate, and answer quality degrades before the hard limit (see [lost in the middle](/garden/lost-in-the-middle)).

## The moves

- **Lean instructions**: keep `CLAUDE.md` to universal facts only (see [keep CLAUDE.md to universal instructions](/garden/claude-md-universal-only)).
- **`/compact`**: summarize the conversation; append an instruction for what to preserve.
- **`/clear`**: reset to zero tokens when switching to unrelated work.
- **Plan outside, paste in**: do exploratory back-and-forth elsewhere; inject only the final plan.
- **Subagents**: push heavy reads into isolated context and return a summary (see [subagent context isolation](/garden/subagent-context-isolation)).

---

### Lost in the middle

URL: https://bartlomiejkrupa.dev/garden/lost-in-the-middle
Category: concept | Updated: 2026-06-30
Summary: Language models use information at the start and end of a long context far more reliably than the middle; recall drops well before the token limit.

"Lost in the middle" is the observed failure of language models to reliably use information placed in the middle of a long context. Recall is strongest for content at the very start and very end of the window and weakest in between.

## Consequence

A larger window does not guarantee a fact buried mid-context is used. Quality can drop while the token count is still well under the hard limit, which is why [context engineering](/garden/context-engineering) matters more than raw window size.

## What to do

- Put the most important instructions and data near the top or bottom of the prompt.
- Trim irrelevant middle content rather than trusting the model to ignore it.
- Prefer ranked, deduplicated retrieval over dumping whole files.
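One way the three moves above might look in code, as a minimal sketch: instructions go first, the question is restated last, and only the top-ranked unique snippets fill the middle. The relevance scores are assumed to come from whatever retriever you already use, and `dedupe_ranked` and `assemble_prompt` are hypothetical names, not a library API.

```python
def dedupe_ranked(snippets: list[tuple[float, str]], keep: int) -> list[str]:
    """Return up to `keep` unique snippets, highest score first."""
    seen = set()
    ranked = []
    for _score, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        if len(ranked) >= keep:
            break
        key = " ".join(text.split()).lower()  # ignore case and spacing differences
        if key in seen:
            continue
        seen.add(key)
        ranked.append(text)
    return ranked


def assemble_prompt(instructions: str, snippets: list[tuple[float, str]],
                    question: str, keep: int = 3) -> str:
    """Instructions at the top, trimmed context in the middle, question at the bottom."""
    context = dedupe_ranked(snippets, keep)
    return "\n\n".join([instructions, *context, question])


# Invented example data; the scores stand in for a retriever's relevance ranking.
snippets = [
    (0.2, "Changelog for v1."),
    (0.9, "Auth tokens expire after 15 minutes."),
    (0.9, "auth tokens  expire after 15 minutes."),  # near-duplicate of the line above
    (0.7, "Refresh endpoint is /token/refresh."),
]
prompt = assemble_prompt(
    "Answer using only the context provided.",
    snippets,
    "When do auth tokens expire?",
    keep=2,
)
print(prompt)
```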
---

### Subagent context isolation

URL: https://bartlomiejkrupa.dev/garden/subagent-context-isolation
Category: concept | Updated: 2026-06-30
Summary: A subagent runs in its own context window and returns only a summary, keeping heavy intermediate tokens out of the parent conversation.

A subagent runs in its own separate context window and returns only a summary to the main agent. The heavy intermediate tokens (large file reads, MCP documentation, search output) never enter the parent conversation.

## Why use it

Reading external docs (for example, an MCP server's API reference) can consume tens of thousands of tokens. Done in the main window, that clutter persists for the rest of the session and crowds out the task (see [lost in the middle](/garden/lost-in-the-middle)). Done in a subagent, the parent receives only the distilled result.

## When to reach for it

- A task needs a large doc or many files read, but the main loop only needs the conclusion.
- You are surveying or researching an area before editing: fan the lookup out, keep the answer.
- Output would otherwise be verbose and stay irrelevant after the immediate question is answered.

This is one of the core moves of [context engineering](/garden/context-engineering).

---