How the operating-core works

The AI staff that runs RememoryLab does not have a database. It has a Git repo.

Everything the agent reads to do its work, and everything it produces in the course of doing it—the charter, the task queue, the daily logs, the open and answered questions, the page drafts—lives as files in folders, committed by the daemon, pushed to origin, pulled on the next run. No message queue, no orchestration plane, no application server. We’ve been running it twice a day for a few weeks now, and the constraints that shape imposes have turned out to be the reason it works.

The loop

A run is a Python process invoked by a systemd timer at 07:00 and 19:00 America/Chicago. It pulls from origin, reads the charter, resets any task marked in_progress whose started: timestamp is older than two hours—those are dead runs, not slow ones—scans the answered-questions folder for replies that should unblock a paused task, picks the highest-priority open task from the queue, and dispatches it by type to a registered Python implementation. The model runs its tool-use loop until the task finishes or hits the turn cap. Log entry written. Commit. Push.
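The stale-task reset and the queue pick can be sketched in a few lines. This is a minimal illustration, not the runner’s actual code: the Task dataclass and both function names are hypothetical, and it assumes that a reset task goes back to open and that a lower priority number runs first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Two-hour cutoff from the text: older than this is a dead run, not a slow one.
STALE_AFTER = timedelta(hours=2)


@dataclass
class Task:
    # Hypothetical in-memory view of a task file's frontmatter.
    id: str
    priority: int
    status: str
    started: Optional[datetime] = None


def reset_stale(tasks: List[Task], now: datetime) -> List[str]:
    """Reopen in_progress tasks that started more than two hours ago."""
    reset_ids = []
    for task in tasks:
        if task.status == "in_progress" and task.started is not None:
            if now - task.started > STALE_AFTER:
                task.status = "open"  # assumption: reset means back to open
                task.started = None
                reset_ids.append(task.id)
    return reset_ids


def pick_next(tasks: List[Task]) -> Optional[Task]:
    """Return the open task that should run next, or None if nothing is open."""
    open_tasks = [t for t in tasks if t.status == "open"]
    # Assumption: priority 1 outranks priority 2; ties break on task id.
    return min(open_tasks, key=lambda t: (t.priority, t.id), default=None)
```

Passing now in as an argument, rather than reading the clock inside the function, keeps the reset rule testable without waiting two hours.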

The first version committed an empty log line on every run that found nothing to do. With twice-daily scheduling and a queue that was empty most days early on, that was about 700 meaningless commits per year polluting the operating record. The fix: when the only working-tree change at the end of a run is the log file itself, the runner rolls it back and skips the commit. The systemd journal still records that the run happened. Git only sees real work. Silence is better than noise that means nothing.
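The commit-or-skip decision reduces to a check on the list of changed paths. The sketch below is one way to express it, assuming the input is the text printed by git status --porcelain --untracked-files=all and that paths contain no unusual quoting; the function names and the log path are hypothetical.

```python
from typing import List


def changed_paths(porcelain: str) -> List[str]:
    """Extract paths from `git status --porcelain` (v1) output."""
    paths = []
    for line in porcelain.splitlines():
        if not line.strip():
            continue
        # Each line is two status characters, a space, then the path.
        path = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.append(path.strip('"'))
    return paths


def should_commit(porcelain: str, log_path: str) -> bool:
    """Commit only if something other than the run log changed."""
    return any(path != log_path for path in changed_paths(porcelain))
```

When should_commit returns False, the runner would discard the log change and exit without committing, leaving the systemd journal as the only trace of the run.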

The queue is the API

Tasks are Markdown files with YAML frontmatter:

---
id: TASK-001
type: publish_drafts
priority: 1
status: open
parameters:
  pages: [home, thesis]
---

Body here describes the work in plain language. The agent reads it
for context but the typed parameters are what drive execution.

The type: field dispatches to a Python class registered in agent/tasks/. Today there are five: publish_drafts, publish_pages, sync_additional_css, write_content, and execute_task. The last is a deliberate escape hatch—when something unusual comes up that doesn’t fit a registered type, the operator (or another agent) writes an execute_task with a natural-language body. Recurring patterns in those calls become candidates for promotion to first-class types.
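A type registry of this kind can be very small. The following is a sketch under stated assumptions, not the contents of agent/tasks/: the register decorator, the run method, and the UnknownTaskType exception are all hypothetical names, and the handler only returns a description instead of touching WordPress.

```python
from typing import Dict

# Maps a task's type: value to the class that implements it.
TASK_TYPES: Dict[str, type] = {}


class UnknownTaskType(Exception):
    """Raised when a task names a type nobody has registered."""


def register(type_name: str):
    """Class decorator that adds a handler to the registry."""
    def decorator(cls):
        if type_name in TASK_TYPES:
            raise ValueError(f"duplicate task type: {type_name}")
        TASK_TYPES[type_name] = cls
        return cls
    return decorator


@register("publish_drafts")
class PublishDrafts:
    def run(self, parameters: dict) -> str:
        # A missing key raises KeyError: the typed contract fails loudly.
        pages = parameters["pages"]
        return "would publish drafts for: " + ", ".join(pages)


def dispatch(task: dict) -> str:
    """Route a parsed task (frontmatter as a dict) to its handler."""
    handler_cls = TASK_TYPES.get(task["type"])
    if handler_cls is None:
        raise UnknownTaskType(f"{task['id']}: no handler for type {task['type']!r}")
    return handler_cls().run(task.get("parameters", {}))
```

The useful property is the failure mode: a task with an unregistered type stops at dispatch with a named error instead of reaching the model as an ambiguous prompt.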

The reason for typed tasks instead of free-form prompts is that the queue is going to be the API other agents write to. A marketing-director agent will write publish_drafts tasks. A social manager will write post_to_x tasks. Agent-to-agent contracts need to be debuggable, and free-form prompts make every inter-agent message a prompt-engineering exercise with no clean way for the receiver to fail loudly. Typed tasks are a contract; bodies are notes attached to the contract.

Idempotency is the safety property

The very first run of publish_drafts created duplicate WordPress pages. The system prompt told the model not to. The model decided its new content was different enough from the existing draft to justify a new page anyway. That was the textbook lesson: prompt-based “don’t do X” instructions are aspirational. Tool-based refusals are enforced.

The fix had two parts. First, wp_create_page itself now refuses to create a page whose exact title matches any existing page in any status, unless allow_duplicate=true is passed explicitly. Safety lives in code, not in prompts. Second, every task was rewritten as an “ensure” pattern—an instruction that describes the desired end state rather than the action. Instead of “create this page,” the task is “make sure this page exists with this content.” Missing? Create. Exists as a draft? Update to match the repo. Exists as published? Stop and report.

Re-running the task is now safe by design, which matters because the daemon re-executes scheduled work all the time. The default we want is “safe to re-run on any schedule, with any state on the other side.” Anything weaker depends on the model’s careful reading of instructions, and that’s a dependency you don’t want.
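Both halves of the fix can be shown against an in-memory stand-in for the site. This is an illustrative sketch only: FakeSite, ensure_draft, and DuplicateTitleError are hypothetical names, and treating every non-draft status as “stop and report” is an assumption that goes slightly beyond the published case described above.

```python
class DuplicateTitleError(Exception):
    """Raised when a page with the same exact title already exists."""


class FakeSite:
    """In-memory stand-in for a WordPress site (illustration only)."""

    def __init__(self):
        self.pages = []  # each page: {"title", "status", "content"}

    def find_by_title(self, title):
        return next((p for p in self.pages if p["title"] == title), None)

    def create_page(self, title, content, allow_duplicate=False):
        # The refusal lives in the tool, so a prompt cannot talk past it.
        if not allow_duplicate and self.find_by_title(title) is not None:
            raise DuplicateTitleError(f"page titled {title!r} already exists")
        page = {"title": title, "status": "draft", "content": content}
        self.pages.append(page)
        return page


def ensure_draft(site, title, content):
    """Drive the site toward the desired state and report what happened."""
    existing = site.find_by_title(title)
    if existing is None:
        site.create_page(title, content)
        return "created"
    if existing["status"] == "draft":
        if existing["content"] == content:
            return "unchanged"
        existing["content"] = content
        return "updated"
    # Published (or, by assumption, any other status): stop and report.
    return "stopped"
```

Calling ensure_draft twice with the same arguments yields "created" and then "unchanged", which is the property the schedule relies on.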

Tools are tiered for portability

Every tool the agent uses falls into one of four bands. Always-available tools are REST operations any WordPress install supports—pages, posts, media. Plugin-dependent tools require specific plugins. Browser-only tools drive the wp-admin UI when REST doesn’t expose what we need. Filesystem tools edit theme files directly and only work when we control the server.

The agent’s tool surface is built on the always-available and browser-only bands as the floor. Filesystem tools exist for VM-mode efficiency but never as the only path to a capability. The reason is portability: when we manage a client’s WordPress site, we’ll have credentials and a URL, not server access. Anything that requires us to drop a child theme on the filesystem is a tool that doesn’t ship to clients.

This came up concretely with site-wide CSS. For classic themes, WordPress stores Additional CSS in a custom_css post type that isn’t exposed via REST without a plugin. Block themes (the default themes from Twenty Twenty-Two onward) store the same setting in wp_global_styles, which is exposed. So wp_update_additional_css detects which theme type is active and routes accordingly: REST for block themes (~1 second), a Playwright-driven Customizer flow for classic themes (~30 seconds). One tool, two paths, a single interface from the agent’s perspective.
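The routing itself is simple once detection and the two paths are separate pieces. The sketch below shows only that shape. The three callables are hypothetical placeholders: how the real tool detects the theme type, calls REST, or drives Playwright is not shown here.

```python
from typing import Callable


def update_additional_css(
    css: str,
    is_block_theme: Callable[[], bool],
    via_rest: Callable[[str], None],
    via_customizer: Callable[[str], None],
) -> str:
    """Apply site-wide CSS through whichever path the active theme allows.

    Returns the name of the path taken so the run log can record it.
    """
    if is_block_theme():
        via_rest(css)          # fast path: REST write to global styles
        return "rest"
    via_customizer(css)        # slow path: browser-driven Customizer flow
    return "browser"
```

The caller passes the CSS and gets back which path ran; it never has to know which theme type is active.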

Errors are an interface

When the resolver for the Global Styles post ID failed on rememorylab.com, the error said only “no pattern matched.” That told us nothing. Was the page not loading? Were the regexes wrong? Was authentication broken? We couldn’t tell.

The next patch made the tool dump the full fetched HTML to tmp/site-editor-dump.html on any miss, with the error message pointing the operator at the file. Next failure produced real evidence. Patterns got fixed from observation, not guesswork.

The lesson is small and worth taking seriously: errors are an interface the operator reads. An error that says only “no pattern matched” is a bad one. An error that says “no pattern matched” and points at the dump in tmp/site-editor-dump.html is a good one. Tools that fail loudly with the data needed to debug them are the only kind worth building when the operator might be a model that sees the error once, three weeks from now, with no memory of what the original author was thinking.
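A resolver that fails this way might look like the sketch below. The regexes are invented for illustration and are not the patterns the real tool uses; the function and exception names are hypothetical, and the exact wording of the message is an assumption.

```python
import re
from pathlib import Path
from typing import Iterable

# Invented patterns for illustration; the real ones came from observed HTML.
PATTERNS = [
    r'"exampleStylesId"\s*:\s*(\d+)',
    r'data-example-styles-id="(\d+)"',
]


class PatternMissError(Exception):
    """No pattern matched; the message names the file holding the evidence."""


def resolve_post_id(html: str, patterns: Iterable[str], dump_path: Path) -> int:
    """Return the first captured ID, or dump the HTML and fail loudly."""
    for pattern in patterns:
        match = re.search(pattern, html)
        if match:
            return int(match.group(1))
    # On a miss, save exactly what was fetched so the next reader has evidence.
    dump_path.parent.mkdir(parents=True, exist_ok=True)
    dump_path.write_text(html, encoding="utf-8")
    raise PatternMissError(f"no pattern matched; dump at {dump_path}")
```

The dump is written before the exception is raised, so the path in the message always points at a file that exists.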

Questions are how the agent pauses

The capability that makes the agent genuinely unattended is write_question. When the model hits ambiguity it can’t resolve from context (a judgment call about voice, a public/internal boundary, a factual claim it can’t verify), it writes a structured question file to questions/open/, sets its task status to awaiting_answer, and stops. The next run reads questions/answered/, detects any text under the ## Answer heading, appends the answer to the task body, resets status to open, and the queue picks the task up again.

The detection rule is deliberately broad. Any non-whitespace, non-HTML-comment text under that heading counts as an answer. Josh can reply with one word or three paragraphs, from a phone or a laptop, in the GitHub mobile editor or locally. Lowest-friction answer surface possible.
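The detection rule fits in one short function. This is a sketch of the rule as described, with one added assumption: the answer section is taken to end at the next level-one or level-two heading, or at the end of the file. The function name is hypothetical.

```python
import re
from typing import Optional

ANSWER_HEADING = re.compile(r"^##[ \t]+Answer[ \t]*$", re.MULTILINE)
# Assumption: the section ends at the next "# " or "## " heading, if any.
NEXT_HEADING = re.compile(r"^#{1,2}\s", re.MULTILINE)
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def extract_answer(question_text: str) -> Optional[str]:
    """Return the text under '## Answer', or None if it is effectively empty."""
    heading = ANSWER_HEADING.search(question_text)
    if heading is None:
        return None
    section = question_text[heading.end():]
    following = NEXT_HEADING.search(section)
    if following is not None:
        section = section[:following.start()]
    # Placeholder comments and whitespace do not count as an answer.
    answer = HTML_COMMENT.sub("", section).strip()
    return answer or None
```

A template that ships with a placeholder comment under the heading stays unanswered until a person types something outside the comment.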

The tool is opt-in. The runner never auto-detects ambiguity and escalates—auto-detection produces noise, and the noise trains the operator to ignore the channel. The agent calls the tool when it knows it needs a decision. Explicit beats heuristic.

What the shape buys

The architecture is a Python package, a Git repo, a systemd timer, two API tokens, and a $50/month spend cap. None of the individual pieces are novel. What’s interesting is what falls out of them in combination.

Putting state in the filesystem means every run starts from a fully-readable record and ends with a diff anyone can audit. Typing the queue means other agents can write to it without prompt-engineering each message. Building idempotency into the tools means the schedule can re-execute anything, anytime, without the operator holding state in their head. Tiering tools by access band means the same agent that runs RememoryLab can run a client’s site tomorrow without code changes. Treating errors as an interface means every failure gets cheaper to fix than the last one. Making questions opt-in means the agent stops only when it actually needs a human, not when a classifier thinks it might.

That’s the whole stack. A single repo, on the same hosting we already pay for, doing the work twice a day and leaving a record of every decision behind it. The interesting question isn’t whether this scales—it’s how much of a company you can run before any of those constraints starts to bend.