Week one of running a company with AI staff

The agent has been on its scheduled timers for about a week. Two runs a day, twice into git, twice out. What we expected was a slow ramp—a few small wins, a few obvious gaps, mostly the ordinary work of bringing a system into production. What we got was that, plus a steady stream of failures interesting enough to be worth writing down.

This is the operating dispatch, not the architecture one. Less about what we built; more about what it feels like to run a company where the staff is a model and the org chart is a folder of YAML files.

The bugs are small and sharp

Almost nothing went catastrophically wrong. What went wrong went wrong in small, specific, recoverable ways, and almost all of it was the kind of thing a junior employee would do in their first week and then never do again.

A typed task got dispatched with malformed parameters—a Python dict where a string was expected—and the agent dutifully looked up (could not parse: {'slug': 'home'}) in WordPress, found nothing, and reported back. No pages were published. Nothing was broken. The runner just produced a polite, useless run. The fix was a different task type with a stricter parameter schema. Cost: ten minutes.
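A stricter parameter schema can be as small as a dataclass that refuses to construct itself. This is a minimal sketch, not our runner's code: PageLookupParams, TaskParamError, and build_page_lookup are hypothetical names, and the only thing taken from the incident is the shape of the bad input.

```python
from dataclasses import dataclass


class TaskParamError(ValueError):
    """Raised before dispatch when task parameters do not match the schema."""


@dataclass(frozen=True)
class PageLookupParams:
    # Hypothetical schema for a page lookup task: exactly one string field.
    slug: str

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations at runtime, so check by hand.
        if not isinstance(self.slug, str):
            raise TaskParamError(
                f"slug must be a string, got {type(self.slug).__name__}: {self.slug!r}"
            )
        if not self.slug.strip():
            raise TaskParamError("slug must not be empty")


def build_page_lookup(params: dict) -> PageLookupParams:
    """Validate raw parameters, rejecting unknown and missing keys."""
    unknown = set(params) - {"slug"}
    if unknown:
        raise TaskParamError(f"unknown parameters: {sorted(unknown)}")
    if "slug" not in params:
        raise TaskParamError("missing required parameter: slug")
    return PageLookupParams(slug=params["slug"])
```

With a check like this, a nested dict such as {"slug": {"slug": "home"}} fails at dispatch time with a readable error instead of turning into a polite, useless run.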

The agent created a blog post with a slug that read howtheoperating-coreworks. The Markdown filename had no hyphens, the slug derivation didn’t insert any, and the URL went out the door looking like the model had been typing with mittens on. Caught five minutes after publish. Fixed by a slug-update task. Total live time at the bad URL: under fifteen minutes.
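One way to keep this from recurring is to derive the slug from the post title, where the word boundaries still exist, instead of from the filename. The sketch below assumes a title is available at publish time; slugify is a hypothetical helper, not the derivation our runner uses.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Lowercase the title and join its words with single hyphens."""
    # Decompose accented characters, then drop anything that is not ASCII.
    normalized = unicodedata.normalize("NFKD", title)
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


print(slugify("How the Operating Core Works"))  # how-the-operating-core-works
```

A filename with no separators carries no word boundaries to recover, which is why the title is the safer input here.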

A queue task hit the input-token cap mid-run because the agent had read every dependency under the sun before realizing the file it was about to write was outside its allowed paths. The runner correctly refused the write, the agent correctly noted the refusal, and then ran out of context before recovering. The fix was telling the agent its allowed paths up front, in the system prompt, so it doesn’t burn turns discovering them by attempting writes.
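The same list of allowed paths can feed both sides of the contract: the text the agent reads up front and the check the runner applies on write. A minimal sketch, assuming repo-relative POSIX paths; the directory names, ALLOWED_PATHS, and both function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Sequence

# Hypothetical allow-list; the real one would come from the task's config.
ALLOWED_PATHS = ("content/posts", "content/pages")


def is_write_allowed(target: str, allowed: Sequence[str] = ALLOWED_PATHS) -> bool:
    """Return True only if target sits under one of the allowed directories."""
    path = PurePosixPath(target)
    # Refuse absolute paths and any attempt to climb out with "..".
    if path.is_absolute() or ".." in path.parts:
        return False
    for root in allowed:
        root_parts = PurePosixPath(root).parts
        if path.parts[: len(root_parts)] == root_parts:
            return True
    return False


def allowed_paths_prompt(allowed: Sequence[str] = ALLOWED_PATHS) -> str:
    """Render the allow-list as a block of text for the system prompt."""
    lines = ["You may write only under these paths:"]
    lines.extend(f"- {root}/" for root in allowed)
    return "\n".join(lines)
```

Generating the prompt text and the enforcement check from one list means the agent is never told something the runner will not honor.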

These are the bugs of an early system run by a competent but inexperienced employee. They look exactly like that. None of them required strategy. All of them got cleaner once we treated them as evidence about what the contract between operator and agent should look like, instead of as character flaws in the model.

The failures are mostly about contracts, not capability

A pattern showed up. Almost every meaningful failure was a place where the contract between two pieces of the system was implicit, and the model interpreted it differently than we intended.

The most instructive: we asked Ada—the writer persona—to fix two specific issues on the homepage. A box that said the wrong thing, and a dead link. The brief used the phrase “do not change anything else.” Ada rewrote the entire page. The new copy was good. The new copy did not fix the dead link. The thing the brief was actually about did not happen, because the brief was a page_copy task, and page_copy had been defined as a full rewrite, and Ada took the license the task type gave her.

The fix was not “be more careful.” The fix was a new task type, page_patch, and a sentence in Ada’s persona that says: when the brief says “only change X,” that’s a page_patch, not a page_copy. Surgical mode on. The signal is the phrase. The rule is in the persona, where it gets read every run, instead of in a brief that gets read once and forgotten.

A queue task ran ahead of its dependency because two tasks shared a priority level, and the lower task ID sorted first. We had depends_on enforcement; we didn’t have priority discipline. The lesson was that depends_on blocks a task whose dependency is still open, but doesn’t order tasks that are both runnable. If you want order, set the priority numbers explicitly. If you don’t, the queue will pick whichever the sort lands on first, and that’s correct behavior—it just doesn’t match what an operator means when they file two tasks back-to-back and expect them to run in that order.
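The selection rule described above fits in a few lines. This is a sketch under assumptions, not the queue's actual code: the Task fields and next_task are hypothetical, and it assumes a lower priority number runs first.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Task:
    task_id: int
    priority: int  # assumption: lower number runs first
    depends_on: Tuple[int, ...] = ()
    done: bool = False


def next_task(tasks: List[Task]) -> Optional[Task]:
    """Pick the next runnable task: depends_on gates, priority then ID orders."""
    done_ids = {t.task_id for t in tasks if t.done}
    runnable = [
        t for t in tasks
        if not t.done and all(dep in done_ids for dep in t.depends_on)
    ]
    if not runnable:
        return None
    # Ties on priority fall through to the task ID, which is the surprise.
    return min(runnable, key=lambda t: (t.priority, t.task_id))


# Same priority, no declared dependency: the lower ID wins the tie.
tied = [Task(task_id=4, priority=5), Task(task_id=5, priority=5)]
print(next_task(tied).task_id)  # 4

# Explicit priorities express the intended order.
ordered = [Task(task_id=4, priority=2), Task(task_id=5, priority=1)]
print(next_task(ordered).task_id)  # 5
```

Declaring depends_on=(5,) on task 4 would also force the order, since task 4 would not be runnable until task 5 is done.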

Every one of these is a contract problem. We named the contract, and the failure stopped happening.

The model is the easy part

The thing nobody mentions about running a company on AI staff is how little of the work is about the model. The model is good at writing. It’s good at turning a brief into pages. It’s good at recovering from intermediate errors and trying a different approach. We’ve been running mostly on Claude Sonnet 4.6 and Opus 4.7, routed per task in a config file, and the routing has been the right shape. Sonnet for orchestration. Opus for content. Effort knob set to high where voice matters.

What’s hard is the tooling around the model. The pre-flight checks. The token budgets. The error messages that have to carry enough context for the next agent to read them in three weeks with no memory of why the system exists. The git hygiene that keeps the no-op runs from polluting the audit trail. The polkit rule that lets the daemon’s user restart its own service without sudo. The Customizer browser flow that exists because WordPress doesn’t expose Additional CSS via REST in classic themes. The thirty-line resolver that finds the Global Styles post ID by trying three strategies in order of cost and giving up gracefully.
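The shape of that resolver, stripped of everything WordPress-specific, is a fallback chain. The sketch below shows only the pattern: resolve_first and the three stand-in strategies are hypothetical, and none of them is a real WordPress call.

```python
from typing import Callable, Iterable, Optional

# A strategy returns a post ID, or None when it cannot find one.
Strategy = Callable[[], Optional[int]]


def resolve_first(strategies: Iterable[Strategy]) -> Optional[int]:
    """Try strategies in the order given (cheapest first); None if all fail."""
    for strategy in strategies:
        try:
            result = strategy()
        except Exception:
            # A failing strategy is not fatal; move on to the next one.
            continue
        if result is not None:
            return result
    return None  # give up gracefully and let the caller decide what to do


# Stand-in strategies for illustration only.
def from_cache() -> Optional[int]:
    return None  # cache miss


def from_direct_lookup() -> Optional[int]:
    raise TimeoutError("lookup timed out")


def from_full_scan() -> Optional[int]:
    return 42  # placeholder ID


print(resolve_first([from_cache, from_direct_lookup, from_full_scan]))  # 42
```

Returning None instead of raising keeps the decision about what a missing ID means with the caller, which is where the error message with enough context can be written.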

The model writes the post. The infrastructure makes the post show up at a stable URL twelve hours later, with a clean commit history, after the agent has paused to ask a question, gotten an answer, and resumed. That second sentence is where the work is.

The audit trail is the deliverable

We knew we’d publish the operating record. We didn’t fully appreciate how much of the record’s value would come from the granularity of the diffs.

Every run that changes something produces a commit. Empty runs don’t, because we suppress them; they’d swamp the signal. But every run that did anything left a commit message, a log entry, and a queue state change, and reading them back at the end of a week is unreasonably useful. Not because the commits are individually interesting. Because the rate at which they pile up tells you what the company is actually doing, in a way that no weekly status meeting ever has.
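Suppressing empty commits comes down to asking git whether the working tree changed before committing. A minimal sketch, assuming the runner shells out to git; has_changes and commit_if_changed are hypothetical names. It relies on git status --porcelain printing nothing when the tree is clean.

```python
import subprocess


def has_changes(porcelain_output: str) -> bool:
    """True if `git status --porcelain` reported at least one entry."""
    return bool(porcelain_output.strip())


def commit_if_changed(repo_dir: str, message: str) -> bool:
    """Commit everything in repo_dir, or do nothing if the run was a no-op."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    if not has_changes(status):
        return False  # no-op run: leave the history untouched
    subprocess.run(["git", "add", "--all"], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
    return True
```

The boolean return gives the caller something to record in the journal either way, so a no-op run still leaves a trace without adding a commit.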

A reader on the outside gets to see the same thing. The Operating Log isn’t mailing yet, but the repo is already legible. You can see what we shipped, what we skipped, what we failed at, what we asked, what we answered. You don’t have to take our word for the operating tempo. The tempo is the diff.

Skipped days say so

We told ourselves at the start: silence is worse than a bad day. If the agent doesn’t run, the log says it didn’t run. If the agent ran and did nothing, the log says it ran and did nothing. If the agent ran and broke something, the log says that too. The temptation to hide a slow week is real—we have prospective audiences who would prefer to see a smooth ascent. We’re going to disappoint them. The whole point of publishing the receipts is that the receipts are real, including the ones from the days when nothing happened.

This week, every scheduled run fired. Most of them produced something. A few produced only a no-op commit suppression and a journal entry. One produced a malformed parameter error that taught us a small thing about contract design. None of them got hidden.

What week two probably looks like

The next several weeks of work are mostly contract refinements. More typed task types so fewer things route through the execute_task escape hatch. Tighter persona briefs so signal-detection moments like the page-patch lesson don’t have to happen twice. The first email digest going out, which means actually finishing the transactional API work that’s been deferred since the OCI VM blocked outbound SMTP. A second persona instantiated, which will exercise the multi-agent contract surface for real.

We’re not in a hurry. The thesis is that the first companies operated by AI will beat the ones that aren’t, and the relevant comparison is across years, not weeks. What week one taught us is that the boring middle—the contracts, the error messages, the audit trail, the absence of grand decisions—is where the work is. The model writes. The infrastructure makes the writing show up reliably. The operator shows up when a contract needs a revision. None of it looks like a press release. All of it looks like a company.