A System for Working with AI Agents
AI agents are all the rage these days. They won’t replace developers anytime soon, but they are changing how we build software. Over the past few months, most of my code has been generated with Codex, and I’ve shipped it to production across backends, iOS apps, and MCP servers. In this post, I’ll crystallize the system I use to get consistent results from agents—mostly Codex 5.3.
Every piece of advice you read about prompting, tools, and agent workflows ultimately comes down to managing one of two things: context engineering (what the agent knows right now) and runtime engineering (what the agent can do).
Your output is roughly effort × context × runtime. Internalize that, and you already have a useful mental model. Everything else is tactics.
Context engineering
An agent can draw on broad general knowledge—think of it as a compressed library of public code and human knowledge—but everything specific to you and your task must fit into its working memory, also called the context window. At the start of a session, the model knows nothing about your codebase or goals, so early information matters a lot. As it works, it searches files, looks for patterns, fills in missing details, and gradually builds a model of what’s going on. Once it has enough signal, it proposes an approach and implements it. When satisfied with the results, it summarizes what it did and waits for its pat on the back.
Throughout the session, the instructions it receives, the files it reads, the commands it runs, and everything it generates must all fit within its working memory. When that memory fills up, problems arise: the model starts forgetting things to make space for new information, it hallucinates, a random instruction becomes its goal, and, basically, you lose control over the outcome.
With context, the game is about progressive disclosure of information. Knowing what your agent starts with (its AGENTS.md) and your goal, the idea is to set up an information system that will let the model stumble on the right information at the right time. When done correctly, it feels magical: the model just gets you and knows what you mean. You don’t need long, elaborate prompts. Instead, you state a task: “Sketch out an implementation of this new feature.” And boom, you have it.
Build your information system
So what should go in AGENTS.md? Start with this: what does a contributor need to know before they can do meaningful work? That’s your baseline. For me it comes down to five things: how to run the code, how to verify its quality, whether there’s a linter to respect, how the project is structured and why, and what the user actually cares about. Depending on the project size, I either store all of it in AGENTS.md or split it into multiple focused files.
For example, in my Python projects, when an agent wakes up, it knows that it must run the linter after editing any *.py file (a good way of catching small mistakes), that it should always use uv run --active to run any Python code (so it uses the active pre-installed virtual environment), that it should read what I call the canonical documents before making any edits (to get familiar with the domain and norms), and finally, that it must read certain documents in certain situations (for example, it must read the migration guide before editing the model). These redirections keep the AGENTS.md lean. The more you stuff in there, the more you end up fighting your own instructions through prompting.
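As a sketch, the relevant portion of such an AGENTS.md might look like the following (the file names, linter command, and doc paths are illustrative, not my actual setup):

```markdown
## Quality gates
- After editing any `*.py` file, run `ruff check <file>`.
- Always run Python through `uv run --active`; never call `python` or `pip` directly.

## Required reading
- Read `docs/ARCHITECTURE.md` and `docs/testing.md` before making any edits.
- Before editing the data model, read `docs/migrations.md` and follow its runbook.
```

The last section is where the redirections live: the agent only loads the migration guide when the task actually touches the model.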
When I start a project, I size the documentation to match the project. If it’s small enough, I keep things simple: one canonical ARCHITECTURE.md that covers the major topics I want an agent (or a teammate) to understand before touching code. As the project grows into that “mid-sized” zone, that single file starts to get overloaded, so I usually split it into three focused documents: one for architecture, one for testing, and one for behaviors. That way the architecture doc stays structural, the testing doc stays operational, and the behaviors doc stays about what the system is supposed to do.
Once a project gets genuinely complex (multiple modules, lots of moving pieces, lots of local context), centralized documentation stops scaling. At that point I switch to a more distributed model: each module gets its own README.md that acts as the canonical reference for that area. The most useful thing in those module docs is a code map: a curated list of the important files and directories inside and outside of the module, plus a short explanation of what each one is responsible for and how they connect. I also reference the test files or folders that should be run when changes are made to the module. That code map ends up doing most of the “orientation” work that a single global document can’t do anymore.
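A code map doesn’t need to be fancy. A hypothetical module README (all paths and responsibilities invented for illustration) might contain:

```markdown
## Code map
- `api/routes.py` — HTTP endpoints for this module; thin wrappers over `services/`.
- `services/billing.py` — core billing logic; the only file that talks to the payment provider.
- `models/invoice.py` — data models; edit only via the migration runbook.
- `../shared/auth.py` — (outside the module) auth helpers these routes depend on.

## Tests to run
- `tests/billing/` — run with `uv run --active pytest tests/billing`.
```

One line per important file, what it owns, and how it connects; that is most of the orientation work.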
The shift in project size changes how I use AGENTS.md, too. In simpler projects, it’s reasonable to be strict: I’ll tell the agent to lint, build, and run tests after every edit, and it usually does—tight feedback loops make the agent extremely reliable. But that approach falls apart as soon as test runs start taking more than a few minutes. I can stretch it for a while with speed tricks (--testmon in Python so only impacted tests run, and bun in TypeScript for faster execution), but there’s a point where even those optimizations stop keeping iteration fast.
When I hit that point, I lean less on “always do everything” instructions and more on selectively pulling in context. In the truly complex setup, I’ll even remove the blanket instruction to read all the docs from AGENTS.md and instead invoke documentation intentionally as part of the task: “Read this module’s README and sketch the implementation for this new behavior.” The bigger the repo gets and the slower feedback becomes, the more the workflow shifts from automatic verification after every edit toward targeted context and deliberate navigation guided by the module canonicals.
Having this documentation improves code quality during the task and makes cleanup simpler afterward. When the model is working, it’s mostly focused on reaching its goal. Even though Codex is famous for being an avid rule follower, it still cuts corners when stressed for time or context. After sending the agent on a quest to implement a new feature or investigate a bug, I typically queue cleanup tasks such as “Ensure your changes are aligned with canonical docs” or “Ensure your tests are aligned with testing.md.” This is often enough to turn a 70% solution into a 95% one. That simple interaction used to take days, and sometimes weeks, when working with junior engineers.
The second way I document user-facing behaviors is through behavior-centric test scenarios that read as clearly and naturally as possible. This matters more than it might seem: code is probably 90% of what an agent sees during a session, which means good code is self-documenting—and good tests are the highest-signal documentation of all. By reading them, the agent understands what the system is supposed to do from the user’s perspective. While working, it can quickly spot where code and behavior have drifted apart and fix whichever is wrong. As the developer, you get the same benefit: when you and the agent are both reasoning from the same behavioral tests, you’re speaking the same language.
In practice, I try to make testing.md the single place that explains this approach end-to-end: how the harness works, naming rules, which helper classes to use, what each test folder owns, and what “good coverage” looks like. The rule I’m strictest about is no mocking or patching inside test suites. When a test mocks something, reading it means holding two things in your head at once—what the code does and what the mock pretends it does. Strip that out, and a test becomes a behavior statement: given this situation, expect this output. That’s something both you and the agent can judge in one pass. Fakes, stubs, and harness setup belong in the harness—invisible by the time you’re reading the test.
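Here is a minimal sketch of the no-mocking rule in plain pytest-style Python (all names are hypothetical): the fake lives in the harness, so each test reads as a one-pass behavior statement, with no mock behavior to hold in your head.

```python
from dataclasses import dataclass, field

# --- Harness: fakes and setup live here, invisible from the tests ---

@dataclass
class FakeEmailGateway:
    """In-memory stand-in for the real email service."""
    sent: list = field(default_factory=list)

    def send(self, to: str, subject: str) -> None:
        self.sent.append((to, subject))

@dataclass
class SignupService:
    emails: FakeEmailGateway

    def register(self, email: str) -> bool:
        if "@" not in email:
            return False
        self.emails.send(email, "Welcome!")
        return True

def make_service() -> SignupService:
    """Harness factory: wires the service to its fakes."""
    return SignupService(emails=FakeEmailGateway())

# --- Tests: given this situation, expect this output ---

def test_valid_signup_sends_welcome_email():
    service = make_service()
    assert service.register("ada@example.com") is True
    assert service.emails.sent == [("ada@example.com", "Welcome!")]

def test_invalid_address_is_rejected_and_sends_nothing():
    service = make_service()
    assert service.register("not-an-email") is False
    assert service.emails.sent == []
```

Both the agent and a human can judge each test in a single read, which is exactly the property that makes tests work as documentation.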
Control injected instructions
All of that raises a question I’ve only partially addressed so far: when a session starts, what does the agent actually know? I initially simplified this by saying “it only knows its AGENTS.md,” but in reality there are multiple hooks for injecting knowledge right at the beginning of a session—and the real trick is not overdoing it.
Think of the agent as having two layers of identity: who it is by default (its home), and what it knows about the project it’s working on (the project root). Both are loaded at session start (AGENTS.md, MCP servers, skill descriptions) before it reads a single line of your code.
That split is more useful than it first appears. Because the home is project-agnostic, you can specialize it: I run different Codex homes for different kinds of work—an iOS agent with simulator etiquette baked in so multiple agents can share simulators without stepping on each other, a Python agent pre-loaded with Swagger and SQLAlchemy MCPs, a Hugo agent that knows my template conventions, and a copyeditor that knows my voice. Each one wakes up already shaped for its domain. The project root then layers on top: what’s specific to this repo, this stack, this task.
And it’s not only Codex. Other platforms, such as Cursor, let you define user instructions, team instructions, and conditional instructions that are only loaded when certain files are touched. The common theme across all of these approaches is that they inject knowledge at the very beginning of a session, when the agent is fresh and the context window is empty. That’s valuable because the model is basically guaranteed to “know” that information for the rest of the session, at least until the context fills up.
After that point, things get less reliable. Your messages are still injected into the session, but anything else—like asking the agent to read a file—depends on the agent’s judgment. Sometimes it will skim a few lines, think it has the gist, and then proceed based on partial understanding. That’s why front-loading the right guidance (docs + behavioral tests + carefully chosen session-start instructions) tends to work so well: it reduces how often you’re betting the outcome on a hurried skim.
Tune into your agent’s language
As you work with a model, you learn its “language” over time. By reading its summaries and thoughts, you start noticing the specific words and terms it uses to talk to itself and to describe events, actions, and structures. Once you pick those up, you can communicate in a more compressed way—using shorter prompts that still carry a lot of intent. That’s a big part of why I like using Codex and why I’ve stuck with it since GPT-5. When you get used to that condensed communication style, it can be frustrating to switch to another model, even when it’s similarly capable.
For instance, if I want to compress a document without losing information, I don’t need to spell out everything I want: “preserve meaning, remove repetition, reorganize scattered points into coherent sections, keep it lossless”. I can just say something like, “Tighten this doc” and Codex understands the full intent behind those words.
This connects to another important idea: how you phrase instructions matters. I tend to use a mixture of step-by-step instructions (runbook style), directive rules (“must lint after editing any *.py file”), and informative context (facts like “migration scripts are timeless,” expected output formats, and so on). For best effect, these types of instructions should not intermingle; they should be cleanly separated into sections.
You get better compliance from step-by-step instructions, but directive rules apply to broader situations. I tend to keep step-by-step instructions in process-specific documents to be called on when needed, and use directive rules in my canonical documents.
For example, if you tell an agent, “A migration file must be in this folder, must follow this naming format, must include a date, must have the following prefix, and must stay up to date with new migrations as other branches get merged,” you’re giving constraints—but you’re not guaranteeing it will apply them correctly. The agent might think for a bit, then one-shot a file, hallucinate the date, and force you to add more rules to deal with that too. If instead you give a runbook-style sequence—“1) Use this command to get the current branch. 2) Search this folder with this regex for an existing file. 3) If you find a file, update it; otherwise, run this migration generation command with these flags and tag it with the current branch name following this format. 4) Ensure the file references the last migration chain and has the last sequence number following this pattern.”—you’ll usually get more correct and consistent results. The model is better at executing a procedure than perfectly complying with a list of constraints.
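Dropped into a process-specific doc, that sequence might read like this (every command, path, and naming convention here is hypothetical):

```markdown
## Runbook: add a migration
1. Get the current branch name: `git branch --show-current`.
2. Search for an existing migration for this branch: `rg "<branch>" migrations/`.
3. If a file exists, update it in place.
4. Otherwise, generate one and tag it with the branch name:
   `make new-migration NAME=<branch>`.
5. Ensure the new file references the previous migration in the chain and
   carries the next sequence number.
```

Each step is a concrete action with a concrete command, which is what the model executes well.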
So the rule of thumb is: use rules when you want to define a general system or reusable approach. But when the task is specific and high-precision, you’ll get the best outcomes from step-by-step instructions in a runbook format.
Wrapping up
To summarize, the first 10% of an agent’s context has the greatest impact on its work session. To maximize its value, you must design a system that enables the agent to access the right information quickly and reliably. The more attuned you become to your agent’s language, capabilities, and information hierarchy, the more effectively you can convey intent with minimal effort—and the more predictable the results will be.
The context system we just built determines the agent’s ceiling. The runtime determines whether it ever gets there.
Runtime engineering
The most visible part of the runtime is what I call the agent host: Cursor, Claude Code, and codex-cli are popular examples. They wrap the model in an agent loop (prompting + planning), expose capabilities like file edits, shell access, and MCP servers, and enforce constraints through a sandbox + policy layer (approval gates, allow/deny lists, etc.).
Guardrails
After the chat interface, the second interaction you get with these hosts is the sandbox “mode”. You typically choose between at least two modes: “yolo” (full access to your machine) and “safe” (asks for approval before taking actions). The former frees your agent but could lead to catastrophic fallout (stolen secrets, deleted hard drives, malicious attacks, etc.), while the latter demands your constant attention, hence shackling your agent and limiting what you can delegate end-to-end.
People tend to pick the go-fast-and-break-things “yolo” mode. But there’s usually a third option; let’s call it the “responsible” mode. In codex-cli, it’s called --full-auto, while for Cursor, it’s called sandbox with an allowlist. In practice it means: the agent gets full control inside the project folder but needs approval for anything outside it, including internet access, hidden files, or deletions. On top of that, you can configure which specific commands are auto-approved or auto-denied to match your workflow.
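In codex-cli, the “responsible” setup maps to a few settings in the config file. The sketch below reflects my reading of the config keys at the time of writing; treat the key names and values as assumptions and check the docs for your installed version:

```toml
# ~/.codex/config.toml — a "responsible"-mode sketch
approval_policy = "on-request"     # ask before anything outside the sandbox
sandbox_mode = "workspace-write"   # full control inside the project folder

[sandbox_workspace_write]
network_access = false             # internet access still needs approval
```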
It took some time for the “responsible” mode to mature, but these days it’s very powerful: the defaults are already functional enough that it may soon become the new “safe” option, and the per-command auto-approve and auto-deny lists let you tune it further.
For me, the game is about giving the agent as much freedom as I responsibly can, to get as close as possible to the “yolo” experience, while keeping it in the non-destructive realm and avoiding being wiped out by a black swan event. Even with the best models, about 1 in 1,000 sessions still surprises me with some action like deleting a test file as its way of fixing a failing test.
Since I started working with agents, my guardrails have stayed the same. One agent per repo clone. A per-project devcontainer whenever possible (Python and TypeScript projects get one automatically, iOS requires host access). The agent gets read-only access to git and can’t touch its state in any way: no branching, stashing, or pushing. And it must be able to complete 99% of its work without stopping to ask me anything.
What’s changed is how I enforce those constraints. I used to rely on OS-level restrictions, such as a squid proxy to block network access. Now I use codex-cli’s rule system, which lets me set constraints at the agent home level or scoped to a specific project. Cursor’s sandbox has improved too, but it only gives you an allowlist, it lives in user space, and it doesn’t sync between machines—so for now, codex-cli wins on flexibility.
As long as the agent gets fast feedback—whether its commands are auto-approved or auto-denied—it will bounce back and keep working autonomously. It only stops when you need to manually approve a command.
Tightening the loop
There’s usually interplay between rules and instructions. For example, for iOS development, I want the agent to be able to build the project and run tests on a simulator, while keeping its side effects contained in its working directory. I also want it to be unable to build on a real device (and maybe wipe my personal data). So I instruct the iOS agent to use very specific flags with its xcodebuild command that only impact its project folder and can only run on an iPhone simulator. Then I add a rule to deny any call to xcodebuild that doesn’t start with those exact flags. If the model makes a mistake (which happens about 1 out of 10 times with Codex 5.3), it will think about it for a second, find the instruction on how it should build, and use the right flags.
This interplay keeps evolving as the model changes. In the switch from Codex 5.2 to 5.3, I noticed an increase in manual validation because the new model started chaining more terminal commands together, with heavy use of shell redirection that broke my rules. I had to add an instruction: “Do not use shell redirection (>, >>, 2>, 2>&1). Use tee instead,” so the command stays splittable and policy-matched. Then I added an auto-accept rule for tee. And voila, back to agent freedom.
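A minimal illustration of the swap (the command and log name are arbitrary): redirection fuses the output sink into one opaque command, while tee keeps each pipeline segment a separate, matchable command.

```shell
# Instead of shell redirection, which a command-matching policy
# can't split apart:
#   some_command > out.log 2>&1

# use tee, so each pipeline segment stays individually matchable:
echo "build ok" 2>&1 | tee out.log
```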
Let me illustrate this interplay with one last example: getting the Cursor agent to use the right Python virtualenv.
I’m a huge fan of devcontainers; they’re a productivity unlock for teams and an important part of my toolkit. Usually, my devcontainer is pre-configured with a working runtime. In Python, one of my initial struggles using Cursor was getting the agent to reliably use the pre-installed environment via a virtualenv.
With no instructions, the agent’s hit rate was about 1/10. When it failed, it would start installing packages, realize it couldn’t touch the system Python, panic, and end up creating its own virtualenv, requesting access to the internet, or doing something weird.
My first approach was to add these instructions to the AGENTS.md:
```markdown
## Environment
- Use **one persistent shell** for the entire session.
- On session start only: If `VIRTUAL_ENV` is set: **activate it** (`source "$VIRTUAL_ENV/bin/activate"`), use **`python`** and **`uv pip`** from the venv for all runs/installs, prefer `#!/usr/bin/env python`, and **never** call system `python`/`pip`.
```
The hit rate was 6/10.
I persisted:
```markdown
## Environment
- Use **one persistent shell** for the entire session.
- On session start only: If `VIRTUAL_ENV` is set: **activate it** (`source "$VIRTUAL_ENV/bin/activate"`), use **`python`** and **`uv pip`** from the venv for all runs/installs, prefer `#!/usr/bin/env python`, and **never** call system `python`/`pip`.
- If you find that a **`python` requirement is missing**, ensure that you have activated `VIRTUAL_ENV`.

... unrelated instructions

**P.S. Don't forget to activate `VIRTUAL_ENV`**
```
The hit rate was 9/10. It’s an improvement, but you can see the agent struggle at the beginning of a session: a monologue and various attempts before it finally finds its footing.
Although dissatisfied, I kept this solution for a few months, until I stumbled on the uv run command (which runs code in a uv-managed virtualenv) and, more importantly, its --active flag, which runs your Python code in the currently active virtualenv. I finally had my solution!
My new instructions became the following:
```markdown
- **Always use `uv run --active` to interact with Python**. Never use `python`, `pip`, `pytest`, `black`, `ruff`, etc. directly.
```
I added a rule to auto-accept uv run --active. That’s it: 10/10 hit rate.
The best part about this solution is that the model doesn’t have to probe the environment to know what to do. When it thinks “I want to use Python,” it also thinks “I should use uv run --active instead.” No more guesswork.
Give your agent the right tools
There’s a reason developers use an IDE and not a text editor to write code: autocomplete, syntax highlighting, and refactoring tools. All these features save time and mental labor. They let you flow better. Once you use them, you can’t go back.
Your agent will make do with what it has. It’s MacGyver-like in its adaptability as it tries to reach its goal. But it will finish work faster if it has access to the right tool for the job. Good tools allow the model to understand situations without crowding its context window, and to save on cognitive labor by performing complex operations with simpler plans.
While the agent is working, notice which shell commands it reaches for first when dealing with a specific problem. For example, Codex will reach for ripgrep first when searching for files, and only falls back on something else if it can’t find it. That tells you Codex was trained to use ripgrep, is quite familiar with it, and assumes it’s available. When you see that, ensure ripgrep is part of the runtime, so next time you save a couple of round trips: the agent won’t need to probe the environment before deciding which tool to use.
I also noticed that Codex reaches for Python often to write inline scripts (for example: python -c "import sys; print(sys.executable)"), so I make sure it can access Python safely in all the environments it runs in (mostly using uv run).
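Both observations point to the same runtime fix: bake the tools the model assumes into the environment. A hypothetical devcontainer sketch (the base image and versions are illustrative):

```dockerfile
# Devcontainer image: preinstall the tools the model reaches for.
FROM mcr.microsoft.com/devcontainers/python:3.12

# ripgrep: Codex's first choice for searching files.
RUN apt-get update && apt-get install -y --no-install-recommends ripgrep \
    && rm -rf /var/lib/apt/lists/*

# uv: gives the agent one consistent, safe way to run Python.
RUN pip install --no-cache-dir uv
```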
Beyond shell commands, MCPs offer another way to equip your agent with power tools. In essence, an MCP wraps a set of commands in a documentation structure that makes them easier for an agent to understand and use. It’s a mixture of capabilities, description, and AX (agent experience).
MCPs are powerful but should be used responsibly. My general guideline: stick to official MCPs from recognizable providers (ideally cloud-based), or build your own using fastmcp.
When I build MCPs, I observe whether and how the agent uses them. Then I ask for agent feedback and iterate. A good MCP description gets the agent to reach for the tool naturally when it’s needed. And a good design allows for composition with a small number of commands and self-documenting errors.
For example, when I noticed that agents struggle with our OpenAPI files, which can run to 12K lines of YAML, I could have split the files, but that would interfere with the human processes around them. Instead, I built a small MCP server that progressively discloses the information. With this tool, the model can list the endpoints and ask for specific details about a schema. It can quickly get a broad picture, then drill into the details as needed without overloading its context window or missing important information. This in turn translates to better results on tasks where the agent adds new features or generates UI from specs.
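To make the progressive-disclosure idea concrete, here is a stdlib-only sketch of that kind of tool surface. The toy spec, function names, and return shapes are all illustrative; a real server would wrap equivalents of these as fastmcp tools over a parsed OpenAPI document.

```python
# Toy stand-in for a parsed 12K-line OpenAPI spec.
SPEC = {
    "paths": {
        "/users": {"get": {"summary": "List users"},
                   "post": {"summary": "Create a user"}},
        "/users/{id}": {"get": {"summary": "Fetch one user"}},
    },
    "components": {"schemas": {
        "User": {"type": "object",
                 "properties": {"id": {"type": "integer"},
                                "name": {"type": "string"}}},
    }},
}

def list_endpoints(spec: dict) -> list[str]:
    """Broad picture first: one line per operation, no schema details."""
    return [f"{method.upper()} {path}: {op.get('summary', '')}"
            for path, ops in spec["paths"].items()
            for method, op in ops.items()]

def describe_schema(spec: dict, name: str) -> dict:
    """Drill-down second: full detail for a single named schema."""
    return spec["components"]["schemas"][name]
```

The agent calls the cheap overview tool first, then fetches detail for exactly the schema it needs, so its context window only ever holds the relevant slice of the spec.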
Wrapping up
To summarize, your goal is to give your agent a solid runtime, then to dial down the guardrails to the highest autonomy you can responsibly allow. Ideally, the agent can complete ~99% of routine work without asking for approval. Getting there is mostly about tightening the loop between rules and instructions. Use instructions to shape behavior, and use rules to enforce the few constraints that really matter. Then iterate when model behavior shifts.
Conclusion
Agents won’t replace developers—I still believe that. But they are replacing a bunch of the work that used to be “developer time”: the grindy parts of implementation, the tedious refactors, the first-pass scaffolding, the yak-shaving investigation. What’s left for us is the leverage.
That leverage comes from two places:
- Context engineering: designing what the agent knows, when it learns it, and how reliably it can retrieve it without blowing its context window.
- Runtime engineering: designing what the agent can do, how safely it can do it, and how quickly it can recover when it hits friction.
If you only do one of those, you’ll plateau. Great docs with a brittle runtime mean the agent constantly asks for help. A great runtime with sloppy context means it moves fast in the wrong direction. When both are dialed in, delegation starts feeling real: you hand the agent a task, it navigates, implements, verifies, and comes back with something you can actually ship.
The meta-skill here isn’t “prompting”, it’s systems design. You’re building an environment where the default outcome is good: canonical docs that are discoverable, behavior-oriented tests that explain failures, and guardrails that are strict about the few things that matter while staying out of the way for everything else. Then you iterate—because the model will change, the repo will change, and your workflow will change. That’s what being a software engineer means now.
Two books worth reading for better system design and faster feedback loops: