The shape of an AI-operated company

A company run by AI needs somewhere to keep its mind. Not a brain in the model-weights sense—the model is rented—but the durable record of what’s been decided, what’s being worked on, what’s been shipped, and what’s waiting on a human. That record has a shape. The shape determines what the company can do.

For RememoryLab, the shape is a Git repository called operating-core. One repo, on hosting we already pay for, with a Python package inside it and a systemd timer that wakes the agent twice a day. The novel parts of the system aren’t the parts you’d expect. The model isn’t novel. The hosting isn’t novel. What’s novel is the contract between the pieces, and that contract is what the rest of this post is about.

State is files

The agent has no database. Every artifact it reads—the charter, the open task queue, the daily logs, the answered-questions folder, the page drafts staged for WordPress—is a Markdown or YAML file in a folder, committed at the end of a run, pushed to origin, pulled on the next one.

That choice was deliberate, and it’s the one that shapes everything else. A database would have given us indexed queries and concurrent writers. We didn’t need either. What we needed was a record that the operator could read on a phone over a coffee, that the agent could amend without a schema migration, that survived in git history without backup tooling, and that produced a diff after every run. A folder of text files does all of that. A database does none of it without effort.

The trade-off is real. Listing thirty open questions is a ls; querying them by tag is a script you have to write. We’ve never wanted to query them by tag. The day we do, we’ll write the script. Until then, the absence of a database is a feature: every state change is a commit, every commit is an audit entry, and the audit entry is human-readable without tooling.

The queue is the API

Tasks live in queue/ as Markdown files with YAML frontmatter. Each one carries an id, a type, a priority, a status, and a parameters block. The body is plain prose describing the work; the agent reads it for context, but the typed parameters are what the implementation actually consumes.

Today there are five registered types: publish_drafts, publish_pages, sync_page, sync_additional_css, and write_content. Plus one escape hatch—execute_task—which takes a natural-language body and is used when something doesn’t fit a registered type yet. Recurring patterns inside execute_task calls become candidates for promotion to first-class types over time.

The reason to type the queue rather than accept free-form prompts is that the queue is the interface other agents will write to. A marketing-director agent writes publish_drafts tasks. A social manager writes post_to_x tasks. Agent-to-agent contracts need to fail loudly when the contract is broken, and free-form prompts fail quietly—every inter-agent message becomes a prompt-engineering exercise with no clean failure surface for the receiver. Typed tasks are a contract. Bodies are notes attached to the contract.

This is also where the daemon’s daily loop earns its safety property. Re-running a task should be cheap and harmless. The way you get there is by writing every task as an instruction that describes a desired end state, not an action. Not “create this page” but “make sure this page exists with this content.” The agent reads the world, compares it to the spec, and changes whatever’s necessary—possibly nothing. Re-running on any schedule from any state is then safe by design, instead of being a property that depends on the model carefully re-reading the instructions every time.

Tools are tiered for portability

Every tool the agent uses sits in one of four bands. Always-available tools are REST calls any WordPress install supports—pages, posts, media. Plugin-dependent tools require a specific plugin to be installed. Browser-only tools drive the wp-admin UI in headless Chromium when REST doesn’t expose what we need. Filesystem tools edit theme files directly and only work when we control the server.

The agent’s surface is built on the first and third bands as the floor. Always-available and browser-only tools form the contract we can ship to a client tomorrow with credentials and a URL. Filesystem tools exist for VM-mode efficiency but never as the only path to a capability. The day RememoryLab’s agent starts managing a client’s site, no code changes; only the secrets in the environment file change.

The clearest test of this rule was site-wide CSS. WordPress stores Additional CSS in a custom_css post type that isn’t exposed via REST without a plugin. Block themes from Twenty Twenty-Four onward store the same setting in wp_global_styles, which is exposed. The tool, wp_update_additional_css, detects which theme type is active and routes accordingly: REST for block themes, about a second; Playwright on the Customizer for classic themes, about thirty seconds. Same interface from the agent’s perspective. The dual path is an implementation detail.

Errors are an interface

When the resolver for the Global Styles post ID failed the first time, the error message said no pattern matched. That told us nothing. Page not loading? Regexes wrong? Authentication broken? Couldn’t tell.

The patch made the tool dump the full fetched HTML to a tmp file on any miss, with the error message pointing at the file. Next failure produced real evidence. Patterns got fixed from observation, not guesswork.

The principle that fell out: errors are an interface the operator reads, and the operator is increasingly a model. A model that sees an error message once, three weeks from now, with none of the original author’s context, needs the message to carry the data required to act. no pattern matched is a bad interface. no pattern matched—evidence at tmp/site-editor-dump.html is a good one. Failing loudly with the right artifact attached is the cheapest kind of self-improvement available; every failure costs a little less than the last one.

Questions are how the agent pauses

The capability that makes the agent genuinely unattended is write_question. When the model hits a judgment call it can’t resolve from context—a voice choice, a public/internal boundary, a factual claim it can’t verify—it writes a structured file to questions/open/, marks its task awaiting_answer, and stops. The next run scans questions/answered/, detects any non-comment text under the ## Answer heading, appends the answer to the task body, resets the task to open, and the queue picks it up.

The detection rule is intentionally broad. One word counts. A paragraph counts. The operator answers in whatever editor is at hand; the GitHub mobile app works fine.

The tool is opt-in. The runner never auto-detects ambiguity and escalates. Auto-detection produces noise, and noise trains the operator to ignore the channel. The agent uses the tool when it knows it needs a decision. Explicit beats heuristic, because the agent is a better judge of its own uncertainty than any classifier we’d write to second-guess it.

What the shape buys

The whole architecture is a Python package, a Git repo, two API tokens, a systemd timer, and a fifty-dollar-a-month spend cap. None of the pieces are clever in isolation. The interesting part is what falls out of them in combination.

Putting state in the filesystem means every run begins from a fully readable record and ends with a diff anyone can audit. Typing the queue means other agents can write to it without prompt-engineering each message, and the receiver can fail loudly when the contract is wrong. Writing tasks as desired-end-state instructions means the schedule can re-execute anything without the operator holding state in their head. Tiering tools by access band means the same agent that runs the incubator can run a client’s site without code changes. Treating errors as an interface means every failure makes the next one cheaper. Making the pause explicit means the agent stops only when it actually needs a human.

A company that runs on this shape doesn’t need an operations team to keep it from drifting. The drift surfaces in the diff. The questions surface in a folder. The work surfaces in the queue. The interesting question isn’t whether the shape scales—it’s how much of a company you can run before any of these constraints starts to bend, and so far, none of them have.