Building metacognition for AI agents

Most AI agents fail in a strangely familiar way. They do impressive work right up until they don’t. They use the wrong tool because it came to mind first. They state guesses like facts. They forget what they were doing halfway through a long task. They ask the user things they should infer, and infer things they absolutely should have asked about.

The full implementation is available on GitHub: github.com/dgroch/metacognition

That isn’t just a model-quality problem. It’s a metacognition problem.

By metacognition I don’t mean mystical self-awareness. I mean the set of disciplines a system uses to monitor its own capabilities, uncertainty, attention, context, and judgment while it works. Human beings do this unevenly. Most agents barely do it at all.

I built a metacognition suite for OpenClaw that treats this as architecture rather than aspiration. The repo breaks the problem into ten installable reasoning skills and two always-on hooks. I’m sharing it publicly to help others configure their agents to be more intelligent. The skills live as markdown protocols the agent loads on demand. The hooks live as TypeScript handlers that run in the platform layer. Both write to disk so the system can survive compaction, session boundaries, and ordinary operational drift.

That design choice matters. If metacognition only exists as a nice sentence in a system prompt, it disappears the moment the model gets busy. If it exists as named protocols, files, triggers, checkpoints, and recovery paths, you can start to build agents that do more than perform confidence.

The trigger conditions look like this in practice. Each skill adds a single conditional pointer to the agent’s AGENTS.md:

  • Self-Model: When unsure what you can do, when planning multi-step work, or when a tool/skill fails unexpectedly → read skills/self-model/SKILL.md and build or consult your capability inventory.
  • Env-Model: When encountering unexpected failures, adapting output to a channel, or planning environment-dependent work → read skills/env-model/SKILL.md and build or consult your environmental model.

Short, conditional, pointer-based. The agent reads the SKILL.md only when the trigger fires. Everything else stays out of context.

The repo’s README is clear about the target failures. LLM agents act on instinct: they pattern-match from training data, reach for familiar tools, state uncertain things with false confidence, and lose track of context in long sessions. The suite answers those problems with structure. The architecture diagram is simple but revealing: a thin AGENTS.md trigger layer on top, skill markdown loaded on demand, hooks for always-on intervention, and persistent files like capabilities.md, environment.md, .active-plan.md, and .context-checkpoint.md underneath. That is not decoration. That is a cognitive stack.

The ten skills cluster into three jobs.

First, the agent needs an accurate model of itself and its situation. Self-Model and Env-Model handle that. One asks, what can I actually do right now? The other asks, what world am I operating in? Those are different questions, and agents routinely blur them.

Second, the agent needs a disciplined way to turn goals into action. Task Decomposition, Task Composition, Resource Selection, Epistemic Calibration, and Knowing When to Ask all live here. Together they constrain the moment where a language model is most likely to bluff: the jump from “I understand the request” to “therefore I know how to execute it.”

Third, the agent needs resilience under pressure. Failure Recovery, Context Management, and Attention Awareness are less glamorous and more important. Most real systems do not break on first principles. They break because a tool failed twice and the model kept retrying, or because a long session polluted attention with irrelevant residue, or because compaction removed the assumptions that made the current plan legible.

That is where the hooks come in. context-guard listens for session:compact:before, session:compact:after, and agent:bootstrap. Before compaction, it writes a .context-checkpoint.md file with plan state, recent failures, file changes, and next steps. On bootstrap, it injects a bounded recovery block and renames the checkpoint to .context-checkpoint.recovered.md. The hook declares its lifecycle registration in frontmatter:

events:
  - agent:bootstrap
  - session:compact:before
  - session:compact:after

Some forms of judgment should not rely on the agent remembering to be wise at exactly the right time.

attention-filter makes the same point from another angle. It runs on agent:bootstrap and message:preprocessed, looks at cheap signals like .active-plan.md, message keywords, urgency patterns, and heartbeat prompts, then injects lightweight focus cues. No LLM call. No semantic ceremony. Just infrastructure deciding that attention is too important to leave entirely to whatever the model happens to notice.

events:
  - agent:bootstrap
  - message:preprocessed

The urgency detection is a single regex: /\b(urgent|asap|now|immediately|quickly|emergency|critical|broken|down|outage)\b/i. Fast, cheap, and enough.

I like that the suite is explicit about cost. The AGENTS trigger section is about 340 tokens per turn. Individual skills load only when relevant. Hook injections are small. This is sober engineering. The claim is not that metacognition is free. The claim is that it is cheaper than letting agents drift.

There’s a stronger thesis underneath the implementation. The difference between a task performer and a more trustworthy cognitive system is not just better output generation. It is whether the system can represent and manage the conditions under which its outputs should be trusted. Capability inventory. Environmental awareness. Plan structure. Confidence discipline. Attention control. Recovery after error. Preservation of state across memory loss. That is the beginning of judgment.

This series walks through each part of the suite in that spirit. Not as a feature tour, and not as a set of productivity tricks, but as a practical architecture for software that can reason about its own reasoning.

Here’s the map:

Series intro, if you need the short version: most agents today can do impressive work, but they are weak at monitoring themselves while they do it. This metacognition suite is my attempt to turn that missing layer into reusable, installable software that anyone running OpenClaw can benefit from.

If current agents feel intelligent until the moment they do something weirdly stupid, this is the part of the stack worth staring at.


Next: Self-Model