Claude Code

Stop Burning Your Tokens: How I Learned to Love Context Management in Claude Code

Hamza Musa

04 May 2026 — 5 min read

I was reviewing a pull request with Claude Code last week. It was a routine task, or so I thought. I sent two, maybe three prompts. Then I glanced at the usage meter and froze. I was already at 60% of my usage limit.

I am on the Max plan. I stared at the screen and thought, "What percent?!"

It felt like getting cheated. I signed up for 20x capacity, but it felt like I was getting 5x. If this sounds familiar, you are not alone. That was me, too. But after digging into how these models actually work under the hood, I fixed it. Here is what I learned, and how you can stop wasting money and start getting more done.

The Hidden Cost of "Memory"

The first thing you need to understand is that Claude does not have persistent memory between turns. Did you know that? Every single time you send a message, the entire conversation gets resent from scratch. Your prompts, Claude’s responses, and every single tool result. All of it. Every time.

It is not reading a log file. It is literally re-processing the full history on every turn. That is how the API underneath works: each request is stateless, so the model only knows what is inside that request.

So, when we talk about token count, we are not talking about "what you just typed." We are talking about the running total of everything that has ever been said or read in the session, resent repeatedly.
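To see why the numbers climb so fast, here is a minimal Python sketch of the resend model described above. The per-turn token counts are hypothetical. The sketch ignores prompt caching, which can make a repeated prefix cheaper to process, so it counts how many input tokens each request carries, not what you are billed.

```python
def input_tokens_per_request(turns: list[tuple[int, int]]) -> list[int]:
    """Input tokens sent on each request when the full history is resent.

    `turns` is a list of (prompt_tokens, response_tokens) pairs, one per exchange.
    Prompt tokens include anything injected with the message, such as tool results.
    """
    history = 0
    per_request = []
    for prompt, response in turns:
        per_request.append(history + prompt)  # everything so far plus the new prompt
        history += prompt + response           # both now become part of the history
    return per_request


# Hypothetical session: a 20,000-token diff, one follow-up, then a short "thank you"
print(input_tokens_per_request([(20_000, 1_000), (500, 800), (20, 50)]))
```

With those numbers the requests carry 20,000, 21,500, and 22,320 input tokens: the 20-token pleasantry at the end still sends over 22,000 tokens.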

Here is why that PR review blew through my limit. When I asked Claude to review the PR, it called a tool to fetch the diff. That diff, which was thousands of lines long, got injected into the messages as a tool result. Then, to understand the context, Claude called twenty more tools to read individual files. Each file read injected that entire file’s content into the message history.

When I argued with it or asked for a tweak, all of that data got resent again. Plus my new message. Plus Claude’s new response. By the time I wanted to send a cute "Thank you, you are awesome" message, I had burned thousands of tokens for a pleasantry. I finished the PR, tried to switch to a dev task in the same session, and hit the limit. My reset wasn’t until 4 PM.

So, what should you do? Here are ten practical tips to take control.

1. Master CLAUDE.md

Most people skip this, but it is the single highest-leverage thing you can do before writing a single line of code. Simply run /init to generate a CLAUDE.md file in your project root. Claude reads this automatically at the start of every session.

Think of it as your standing instructions. These are the things you would otherwise explain from scratch every single time, like your tech stack, coding standards, or architectural constraints.

However, be careful with the cost model. CLAUDE.md is loaded at the start of every session and stays in the context window for the entire session. It is not lazy-loaded. A 2,000-token CLAUDE.md occupies 2,000 tokens of context whether you send two messages or two hundred, and because the full context is resent on every turn, those 2,000 tokens ride along with every single request. If your file is bloated, you are reducing your effective working context. Keep it lean, around 300 to 600 tokens. I treat mine like an eslint.config.js file: it sets invariant rules, not task-specific details.
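If you want a quick check on whether your CLAUDE.md has crept past that budget, this Python sketch estimates its size. It assumes roughly four characters per token for English text, which is a common heuristic and not Claude's actual tokenizer, so treat the result as a ballpark figure.

```python
from pathlib import Path

# Assumption: ~4 characters per token is a rule of thumb for English text.
# It is NOT Claude's tokenizer, so the result is only a rough estimate.
CHARS_PER_TOKEN = 4
BUDGET = 600  # upper end of the 300-600 token target suggested above


def estimate_tokens(text: str) -> int:
    """Estimate tokens from character count, rounding up so non-empty text is never 0."""
    return -(-len(text) // CHARS_PER_TOKEN)


def check_claude_md(path: str = "CLAUDE.md", budget: int = BUDGET) -> tuple[int, bool]:
    """Return (estimated tokens, whether the file fits the budget)."""
    tokens = estimate_tokens(Path(path).read_text(encoding="utf-8"))
    return tokens, tokens <= budget


# Run from the project root; does nothing if there is no CLAUDE.md yet
if __name__ == "__main__" and Path("CLAUDE.md").exists():
    tokens, ok = check_claude_md()
    print(f"CLAUDE.md is roughly {tokens} tokens ({'within' if ok else 'over'} budget)")
```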

2. Use /context to See the Receipts

You can run /context, and Claude will send you a breakdown of every item occupying its context window. This includes open files, attached documents, tool definitions, and conversation turns. It shows you the token counts per element and your cumulative usage versus the window ceiling.

I use this as a memory profiler. It helps me spot files that got pulled into context but are no longer needed. It also shows me when a conversation thread has grown too long, so I can decide between compacting the session or starting fresh.

3. Compact Your Session Proactively

The practical fix is simpler than it sounds. When a session runs long, summarize what matters and carry just that forward. To do this, run /compact.

This command summarizes the entire conversation into a structured representation, capturing decisions made, code written, open questions, and current task state. It then continues from that summary as the new baseline.

Note that this is lossy by design. It preserves architectural decisions and current state but discards intermediate reasoning chains and raw tool outputs. A common mistake is using /compact reactively, only after Claude starts forgetting things. Do not do that. Run /compact when you finish a distinct phase. A healthy session produces a better summary than a degraded one. If you are done with a task entirely, just use /clear to wipe the context completely.
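Here is the trade-off behind compacting at a phase boundary, as a rough Python sketch. It assumes the compaction call reads the whole history once and writes the summary, and it weights input and output tokens equally, which billing does not. The token counts in the example are hypothetical.

```python
def compaction_savings(history_tokens: int, summary_tokens: int, remaining_turns: int) -> int:
    """Rough net token savings from compacting now instead of carrying the full history.

    Assumptions: the compaction call reads the full history once and emits the summary,
    and input and output tokens are weighted equally.
    """
    saved_per_turn = history_tokens - summary_tokens   # no longer resent on each later turn
    compaction_cost = history_tokens + summary_tokens  # one-off cost of producing the summary
    return saved_per_turn * remaining_turns - compaction_cost


print(compaction_savings(80_000, 3_000, 10))  # ten more turns in the next phase
print(compaction_savings(80_000, 3_000, 1))   # only one more message before you stop
```

With an 80,000-token history and a 3,000-token summary, ten more turns save 687,000 tokens, while a single remaining message leaves you 6,000 tokens worse off. Compacting right before you stop costs more than it saves, which is why /clear is the better move once a task is done.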

4. Stop Repeating Yourself with /commands

Define named aliases for multi-step instruction sequences as custom slash commands. When invoked, Claude runs the full saved sequence instead of re-parsing your intent from a freshly worded request.

Natural language prompts are probabilistic; the same request, phrased slightly differently, can produce different behavior each run. A command pins the instructions down, so every run starts from exactly the same prompt. I use them for running tests, fixing type errors, linting in sequence, or generating components with my exact folder conventions. It saves tokens because you skip the back-and-forth of explaining what you mean every time.
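As a sketch of what a command looks like: Claude Code can load project commands saved as Markdown files under .claude/commands/, where the file name becomes the command name and $ARGUMENTS stands in for whatever you type after it. Check that layout against the docs for your version. The Python below scaffolds one such file; the command name `check` and its wording are hypothetical.

```python
from pathlib import Path

# Assumption: project commands live in .claude/commands/<name>.md and
# $ARGUMENTS is replaced with the text typed after the command name.
# The "check" command below is a hypothetical example.
COMMANDS = {
    "check": (
        "Run the test suite, then fix any type errors, then run the linter.\n"
        "Stop and report if any step fails. Extra focus: $ARGUMENTS\n"
    ),
}


def write_commands(root: str = ".") -> list[Path]:
    """Create one Markdown file per command under root and return the paths written."""
    target = Path(root) / ".claude" / "commands"
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, body in COMMANDS.items():
        path = target / f"{name}.md"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```

After running it in your project root, typing `/check focus on the auth module` would send the saved instructions with your extra text filled in.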

5. Turn Off Reasoning Mode When You Do Not Need It

When extended thinking is on, Claude runs an internal reasoning pass before it gives you a response. It works through the problem, considers approaches, and weighs trade-offs. Those thinking tokens count toward your usage, and you pay for them even when the problem is tiny.

If you just want to rename a variable, you do not need this. Turn it off. Save the heavy lifting for complex architectural problems.

6. Use /btw to Avoid Interrupting the Flow

Sometimes you have a side question or a cool idea while Claude is working. Do not interrupt the main task. Use /btw to open a parallel inference channel.

This runs against Claude’s current session knowledge, but the response is never injected into the main conversation history. The main task continues uninterrupted, keeping your token usage for the primary goal clean.

7. Choose the Right Model Intentionally

Most people open Claude Code, leave it on the default, and never think about it again. On the Max plan, the default is often Opus. That is overkill for most tasks.

Here is my rule of thumb: Use Opus only for planning hard problems. Use Haiku for simple questions or quick lookups. Use Sonnet for day-to-day feature implementation, refactoring, and code reviews. Sonnet is the sweet spot for most coding tasks. Use /model to swap between them deliberately.
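If it helps to make the rule of thumb explicit, here it is as a lookup table in Python. The task labels are made up for illustration, and the values assume the short aliases opus, sonnet, and haiku that /model accepts; check /model in your install for the exact names.

```python
# The rule of thumb above as a table. Task labels are hypothetical;
# values are assumed /model aliases.
RULE_OF_THUMB = {
    "plan_hard_problem": "opus",
    "quick_question": "haiku",
    "lookup": "haiku",
    "feature": "sonnet",
    "refactor": "sonnet",
    "code_review": "sonnet",
}


def pick_model(task: str) -> str:
    """Return the model alias for a task, falling back to Sonnet as the everyday default."""
    return RULE_OF_THUMB.get(task, "sonnet")


print(f"/model {pick_model('code_review')}")
```

Defaulting unknown tasks to Sonnet mirrors the advice that it is the sweet spot; reaching for Opus becomes a deliberate choice rather than the starting point.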

8. Stop Pasting Everything

Stop using the "Copy for LLM" buttons that paste entire files or logs into the chat. That content immediately becomes dead weight in the conversation history, traveling with every subsequent message.

Instead, use Claude Code’s @file reference system. Split reusable information into standalone .md or .yaml files and load them on demand with @filename.md. The file enters your context only when a task actually needs it, rather than sitting in your context window from the very first message.
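Here is one way to do the split, as a Python sketch: it breaks a large Markdown notes file into one file per `## ` heading so each piece can be referenced on its own with @. The heading-per-file convention and output names are assumptions. Text before the first heading is skipped, and two headings with the same name would overwrite each other.

```python
import re
from pathlib import Path


def split_by_heading(source: str, out_dir: str) -> list[Path]:
    """Write each '## ' section of a Markdown file to its own file for @-referencing."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    sections: dict[str, list[str]] = {}
    current = None
    for line in Path(source).read_text(encoding="utf-8").splitlines(keepends=True):
        if line.startswith("## "):
            # Slugify the heading, e.g. "## API Conventions" -> "api-conventions"
            current = re.sub(r"[^a-z0-9]+", "-", line[3:].strip().lower()).strip("-")
            sections[current] = [line]  # a repeated heading replaces the earlier section
        elif current is not None:
            sections[current].append(line)  # lines before the first heading are ignored
    written = []
    for slug, lines in sections.items():
        path = out / f"{slug}.md"
        path.write_text("".join(lines), encoding="utf-8")
        written.append(path)
    return written
```

Run on a hypothetical notes.md with `## API Conventions` and `## Testing` sections and an output folder named ctx, it produces ctx/api-conventions.md and ctx/testing.md, which you can then pull in with @ctx/api-conventions.md only when a task needs it.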

9. Write Specific Prompts, Not Vague Requests

Stop being lazy. Do not just type "Please fix this. Make no mistakes."

How you phrase a prompt directly affects how many tokens come back. Vague prompts invite verbose responses as Claude tries to figure out what you mean. If you have a hint about what you need to do, include it. Write prompts like: "Fix the [BUG] in @file that causes [Unexpected Outcome] instead of [Expected Outcome]." Precision saves tokens.
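If you write a lot of bug-fix prompts, the template above is easy to keep as a tiny helper. This Python sketch just fills in the four slots; the function name and the example values are hypothetical.

```python
def bug_prompt(bug: str, file: str, unexpected: str, expected: str) -> str:
    """Fill the bug-fix prompt template from the tip above."""
    # Accept paths with or without a leading @ so the reference is never doubled
    path = file.lstrip("@")
    return f"Fix the {bug} in @{path} that causes {unexpected} instead of {expected}."


print(bug_prompt("off-by-one error", "src/pager.ts", "the last row to be dropped", "every row rendering"))
```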

10. Audit Your MCP Servers

MCP (Model Context Protocol) servers are powerful, but they are not Pokemon. You do not have to catch them all.

Every connected MCP server loads its full tool definitions and schema into your context window at the start of every session, whether you use it or not. These definitions are not small. Stack a few together, and you can consume thousands of tokens just on setup. Remove the servers you do not need, or use "mcp funnels" to manage them.
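If your project's servers live in a .mcp.json file, pruning them takes a few lines of Python. This sketch assumes the usual layout with a top-level mcpServers object keyed by server name. It only touches that one file, so servers added at user or local scope are unaffected; the claude mcp list and claude mcp remove commands are the supported way to manage those. The server names in any example are hypothetical.

```python
import json
from pathlib import Path


def prune_mcp_servers(config_path: str, keep: set[str]) -> list[str]:
    """Drop every server not in `keep` from a project .mcp.json and return the removed names.

    Assumes the layout {"mcpServers": {"<name>": {...}, ...}}; other keys are preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    servers = config.get("mcpServers", {})
    removed = sorted(name for name in servers if name not in keep)
    config["mcpServers"] = {name: spec for name, spec in servers.items() if name in keep}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return removed
```

Returning the removed names gives you a record of what you cut, so you can add a server back if a task turns out to need it.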

A Final Note on the Interface

If you are using the desktop app to code, consider switching to the terminal version. The terminal shows you the context and token usage on every task. What you can see, you can manage.

I hope this saves you some tokens and some frustration. The key is to remember that context is a finite resource. Treat it with respect, keep it clean, and you will get far more value out of your AI assistant.