Intelligence Layer — The 4 Pillars

How Arc OS validates, learns, focuses, and improves itself.

Overview

The Intelligence Layer sits between user messages and Claude responses. Its four pillars plug into the message flow as follows:

         INPUT                    PROCESSING                 OUTPUT
    ┌──────────────┐        ┌──────────────────┐       ┌─────────────────┐
    │ User Message  │───────►│  Context Router   │       │ Claude Response  │
    │               │        │  (skill scoring)  │       │  + eval warnings │
    │ Fix It / 👎   │        │                  │       │                 │
    │ (feedback)    │        │  Learnings       │       │ Quality metrics │
    │               │        │  (correction     │       │  logged         │
    └──────────────┘        │   injection)     │       └────────┬────────┘
                            │                  │                │
                            │  buildGsdPrompt()│        ┌───────┴────────┐
                            └────────┬─────────┘        │ Nightly Loop   │
                                     │                  │ (improvement   │
                                     ▼                  │  proposals)    │
                              Claude CLI                └────────────────┘

Pillar 1: Binary Eval Engine

What: Declarative rules that check every response before delivery.

Why: AI output should have quality gates, like unit tests for code.

How:

Each skill can have a .evals.json file with rules:

{
  "rules": [
    { "id": "gm-001", "type": "string_not_contains", "value": "--force", "severity": "warning" },
    { "id": "gm-002", "type": "max_length", "value": 4000, "severity": "info" }
  ]
}

After Claude generates a response, the eval engine runs all applicable rules:

Pass: response delivered as-is
Fail: response delivered + warning footnote appended

[Claude's response about git operations]
---
Eval: ⚠️ No force push | ℹ️ Response under 4000 chars

Rule Types

Type                     What It Checks
`string_contains`        Response must include literal text
`string_not_contains`    Response must NOT include text
`regex_match`            Response must match regex pattern
`regex_not_match`        Response must NOT match regex
`max_length`             Response must be <= N characters
`min_length`             Response must be >= N characters
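
The rule types above map onto a small deterministic evaluator. The following is a minimal sketch, not Arc OS's actual engine: the `EvalRule` type and the `checkRule`, `runEvals`, and `deliver` names are assumptions made for illustration, and the footnote prints rule ids rather than the human-readable labels shown in the earlier example.

```typescript
// Hypothetical types: only `id`, `type`, `value`, and `severity` come from the
// .evals.json example; the rest is an illustrative assumption.
type RuleType =
  | "string_contains"
  | "string_not_contains"
  | "regex_match"
  | "regex_not_match"
  | "max_length"
  | "min_length";

interface EvalRule {
  id: string;
  type: RuleType;
  value: string | number;
  severity: "warning" | "info";
}

interface EvalResult {
  rule: EvalRule;
  passed: boolean;
}

// Returns true when the response satisfies the rule. No model calls: every
// check is a plain string, regex, or length comparison.
// Note: `length` counts UTF-16 code units, not user-perceived characters.
function checkRule(rule: EvalRule, response: string): boolean {
  switch (rule.type) {
    case "string_contains":
      return response.includes(String(rule.value));
    case "string_not_contains":
      return !response.includes(String(rule.value));
    case "regex_match":
      return new RegExp(String(rule.value)).test(response);
    case "regex_not_match":
      return !new RegExp(String(rule.value)).test(response);
    case "max_length":
      return response.length <= Number(rule.value);
    case "min_length":
      return response.length >= Number(rule.value);
  }
}

function runEvals(rules: EvalRule[], response: string): EvalResult[] {
  return rules.map((rule) => ({ rule, passed: checkRule(rule, response) }));
}

// Non-blocking delivery: the response is always returned, and failed rules
// are appended as a footnote.
function deliver(rules: EvalRule[], response: string): string {
  const failed = runEvals(rules, response).filter((r) => !r.passed);
  if (failed.length === 0) return response;
  const notes = failed.map((r) => `${r.rule.severity}: ${r.rule.id}`).join(" | ");
  return `${response}\n---\nEval: ${notes}`;
}
```

Because each check is a pure function of the rule and the response text, the same input always produces the same verdict, which is what makes the results binary.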

Key Design Decisions

Non-blocking: Warnings don't suppress responses. Users see both the answer and the concern.
Per-skill: Different skills have different quality criteria.
Per-project: Same skill can have different rules for different tech stacks.
No AI in validation: Rules are deterministic. Binary pass/fail. No ambiguity.

Design reference: Anthropic Skill Creator — structured evals with binary assertions.

Pillar 2: Context Router

What: Intelligent skill selection that injects only relevant skills into each prompt.

Why: Loading all 25 skills at once dilutes the context. The model may, for example, try to apply deployment advice to a code review.

How:

Before building the prompt, the router scores every registered skill against the user's message:

Score = (trigger matches x 2) + (keyword matches x 1)
Sort by score descending → take top 5

Example:

User: "Review this code for XSS vulnerabilities"

Scoring:
  code-review:     trigger "review" (2) + keyword "XSS" (1) = 3
  code-review-protocol: trigger "code review" (2)            = 2
  deployment-flow: no match                                  = 0
  git-manager:     no match                                  = 0

Injected into prompt:
  SKILLS_HINT (focus on these):
  - code-review: Security audit and code quality review...
  - code-review-protocol: Structured code review with OWASP...
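
The scoring formula above can be sketched in a few lines. This is an illustration under stated assumptions rather than the router's real code: the `Skill` shape, the function names, the case-insensitive substring matching, and the choice to drop zero-score skills are all hypothetical.

```typescript
// Hypothetical skill shape: the source only says that skills have triggers,
// keywords, and a description.
interface Skill {
  name: string;
  description: string;
  triggers: string[];
  keywords: string[];
}

interface ScoredSkill {
  skill: Skill;
  score: number;
}

// Counts how many terms occur in the message. Case-insensitive substring
// matching is an assumption; a real matcher may tokenize or stem instead.
function countMatches(terms: string[], message: string): number {
  const haystack = message.toLowerCase();
  return terms.filter((t) => haystack.includes(t.toLowerCase())).length;
}

// Score = (trigger matches x 2) + (keyword matches x 1), top N by score.
function scoreSkills(skills: Skill[], message: string, topN = 5): ScoredSkill[] {
  return skills
    .map((skill) => ({
      skill,
      score: countMatches(skill.triggers, message) * 2 + countMatches(skill.keywords, message),
    }))
    .filter((s) => s.score > 0) // assumption: zero-score skills are never hinted
    .sort((a, b) => b.score - a.score)
    .slice(0, topN);
}

// Renders the advisory hint block that is injected into the prompt.
function buildSkillsHint(scored: ScoredSkill[]): string {
  if (scored.length === 0) return "";
  const lines = scored.map((s) => `- ${s.skill.name}: ${s.skill.description}`);
  return ["SKILLS_HINT (focus on these):", ...lines].join("\n");
}
```

With plain substring matching, a multi-word trigger only matches when its words appear contiguously in the message, so the matching strategy directly affects which skills score.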

Why Advisory, Not Filtering

The router suggests skills but doesn't block others. Claude still has full access.

Approach                             Risk
Hard filtering (`--allowedTools`)    Misclassification breaks the session
Symlink mutations                    Filesystem changes on running process
Advisory hints                       Safe: wrong hint = Claude ignores it

Triggers vs Keywords

Triggers (2 pts): Explicit invocation signals. The user directly requests this capability.

"deploy", "review", "scaffold", "audit"

Keywords (1 pt): Broader semantic context. Suggests relevance without being a command.

"OWASP", "Docker", "CI/CD", "performance"

Design reference: Context priming — focused attention without hard filtering.

Pillar 3: Reflect Loop

What: Automatic capture of corrections as persistent rules.

Why: Corrections should survive restarts, so that a single correction becomes a lasting improvement.

How:

CEO presses 🛠️ Fix It
    │
    ├── addLearning(source: "fixit", rule: "Fix requested for: <last response>")
    ├── projectLearnings reloaded from disk
    └── Fix prompt sent to Claude for immediate correction

CEO presses 👎
    │
    ├── addLearning(source: "negative", rule: "Negative feedback on: <response>")
    ├── qualityTracker.logFeedback(positive: false)
    └── projectLearnings reloaded from disk

Rules are stored in learnings.md:

# Learnings

## Rules

- [2026-04-03T14:22:00Z] [fixit] Always use t-call for translations in Odoo QWeb
- [2026-04-03T15:10:00Z] [negative] Avoid sudo in deployment scripts
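
As a rough sketch of how an entry in this format could be produced, the functions below format one learning and append it to the file content. The `Learning` type and both function names are hypothetical, and the sketch assumes the Rules section is the last section of the file. It is written as pure string handling so that the caller decides how to read and write learnings.md.

```typescript
type LearningSource = "fixit" | "negative";

// Hypothetical in-memory shape of one correction.
interface Learning {
  timestamp: Date;
  source: LearningSource;
  rule: string;
}

// Formats one entry as "- [timestamp] [source] rule".
function formatLearning(entry: Learning): string {
  // toISOString() includes milliseconds; drop them to match the file format.
  const iso = entry.timestamp.toISOString().replace(/\.\d{3}Z$/, "Z");
  // Collapse whitespace so a multi-line rule stays on a single bullet.
  const rule = entry.rule.replace(/\s+/g, " ").trim();
  return `- [${iso}] [${entry.source}] ${rule}`;
}

// Returns the new file content with the entry appended as the last line.
// Creates the file skeleton when the current content is empty.
function appendLearning(markdown: string, entry: Learning): string {
  const skeleton = "# Learnings\n\n## Rules\n\n";
  const body = markdown.trim() === "" ? skeleton : markdown.replace(/\s*$/, "\n");
  return body + formatLearning(entry) + "\n";
}
```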

On every subsequent message, accumulated learnings are injected:

LEARNINGS (past corrections — follow these rules):
- Avoid sudo in deployment scripts
- Always use t-call for translations in Odoo QWeb

Key Properties

Automatic: No manual rule writing. Press a button → rule created.
Persistent: Survives bot restarts. Written to disk as markdown.
Per-project: Each child bot has its own learnings.md. Odoo corrections don't affect the React bot.
Newest first: Most recent corrections have highest visibility in the prompt.
Budgeted: Maximum 2000 characters in the LEARNINGS block. When the budget is exceeded, the oldest rules are dropped from the block first.
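
The newest-first ordering and the character budget can be illustrated together. This is a sketch under assumptions, not the actual prompt builder: the function names are hypothetical, the header is passed in by the caller, and the budget is assumed to cover the whole block including that header.

```typescript
interface ParsedLearning {
  timestamp: string;
  rule: string;
}

// Extracts entries from lines shaped like "- [timestamp] [source] rule".
// Lines that do not match (headings, blanks) are ignored.
function parseLearnings(markdown: string): ParsedLearning[] {
  const pattern = /^- \[([^\]]+)\] \[[^\]]+\] (.+)$/;
  const entries: ParsedLearning[] = [];
  for (const line of markdown.split("\n")) {
    const match = pattern.exec(line.trim());
    if (match) entries.push({ timestamp: match[1], rule: match[2] });
  }
  return entries;
}

// Builds the prompt block newest first and stops once the next rule would
// exceed the budget, so the oldest rules are the ones left out.
function buildLearningsBlock(markdown: string, header: string, budget = 2000): string {
  // ISO-8601 UTC timestamps in the same format sort chronologically as strings.
  const newestFirst = parseLearnings(markdown).sort((a, b) =>
    a.timestamp < b.timestamp ? 1 : a.timestamp > b.timestamp ? -1 : 0
  );
  let block = header;
  let added = 0;
  for (const { rule } of newestFirst) {
    const next = `${block}\n- ${rule}`;
    if (next.length > budget) break;
    block = next;
    added++;
  }
  // An empty string means there is nothing to inject.
  return added === 0 ? "" : block;
}
```

Stopping at the first rule that does not fit keeps the block strictly newest first. A variant could skip oversized rules and keep filling the budget, at the cost of gaps in the chronology.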

Design reference: Claude Reflect System — corrections become permanent rules that prevent regression.

Pillar 4: Karpathy Loop

What: Nightly automated analysis of quality metrics, with improvement proposals sent to the CEO.

Why: Humans forget to review performance. The system should find its own weak spots.

How:

Every day at 03:00 UTC, scripts/nightly-improve.ts runs:

1. Read registry: enumerate all child bots from bot_registry.json
2. Read metrics: load quality-metrics.json per child
3. Find underperformers: filter skills where:
   - applied_count >= 3 (minimum sample size to avoid noise)
   - AND either success_rate < 80% OR feedback_negative > feedback_positive
4. Read learnings: extract related correction patterns
5. Generate proposals: template-based (deterministic, no AI)
6. Send to CEO: summary report + individual proposal cards in Telegram
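
The underperformer filter and the template-based proposal step can be sketched as follows. The field names mirror the criteria above, but the `SkillMetrics` and `Proposal` shapes, the function names, and the treatment of `success_rate` as a 0 to 100 percentage are assumptions. The real layout of quality-metrics.json is not shown in this document.

```typescript
// Hypothetical metrics record for one skill of one child bot.
interface SkillMetrics {
  skill: string;
  applied_count: number;
  success_rate: number; // assumed to be a percentage, 0 to 100
  feedback_positive: number;
  feedback_negative: number;
}

// Hypothetical proposal record.
interface Proposal {
  child: string;
  skill: string;
  reason: string;
  status: "pending" | "approved" | "rejected";
}

const MIN_SAMPLES = 3;
const MIN_SUCCESS_RATE = 80;

// Keeps skills with enough samples that either fall below the success
// threshold or receive more negative than positive feedback.
function findUnderperformers(metrics: SkillMetrics[]): SkillMetrics[] {
  return metrics.filter(
    (m) =>
      m.applied_count >= MIN_SAMPLES &&
      (m.success_rate < MIN_SUCCESS_RATE || m.feedback_negative > m.feedback_positive)
  );
}

// Template-based: the reason text is assembled from the numbers alone,
// with no model call involved.
function toProposal(child: string, m: SkillMetrics): Proposal {
  const reasons: string[] = [];
  if (m.success_rate < MIN_SUCCESS_RATE) {
    reasons.push(`low success rate (${m.success_rate}%)`);
  }
  if (m.feedback_negative > m.feedback_positive) {
    reasons.push(`more negative than positive feedback (${m.feedback_negative} vs ${m.feedback_positive})`);
  }
  return { child, skill: m.skill, reason: reasons.join("; "), status: "pending" };
}
```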

CEO Approval Flow

Telegram: Proposal Card
┌──────────────────────────────────────┐
│ Improvement Proposal                  │
│                                       │
│ Child: citadel-v2                     │
│ Skill: code-review                    │
│ Reason: low success rate (72%)        │
│ Feedback: 👍 4 / 👎 6                │
│                                       │
│ Related learnings:                    │
│   • Always use t-call for i18n        │
│                                       │
│ [✅ Approve]  [❌ Reject]             │
└──────────────────────────────────────┘

Approve: skill.md is backed up to skill.v1.md (at most 3 versions are kept) and the proposal is marked approved.
Reject: Mark rejected in proposals.json. No changes made.
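
One way to respect the three-version limit on approval is to plan the backup before touching the filesystem. The sketch below is an assumption about how the rotation could work, not a description of the actual behavior: it supposes that version numbers keep increasing and that the oldest backups are pruned first. The `BackupPlan` type and the `planBackup` name are hypothetical.

```typescript
interface BackupPlan {
  backupName: string; // file that the current skill.md is copied to
  toDelete: string[]; // older backups to remove afterwards
}

const MAX_VERSIONS = 3;

// Given the file names already present next to skill.md, decides the next
// backup name and which old backups to prune. Pure function: the caller
// performs the actual copy and delete.
function planBackup(existing: string[]): BackupPlan {
  const versions = existing
    .map((name) => /^skill\.v(\d+)\.md$/.exec(name))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => Number(m[1]))
    .sort((a, b) => a - b);
  const next = versions.length === 0 ? 1 : versions[versions.length - 1] + 1;
  const all = [...versions, next];
  // Everything except the newest MAX_VERSIONS entries is pruned.
  const pruned = all.slice(0, Math.max(0, all.length - MAX_VERSIONS));
  return {
    backupName: `skill.v${next}.md`,
    toDelete: pruned.map((v) => `skill.v${v}.md`),
  };
}
```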

Key Design Decisions

Template-based proposals: No AI generates the improvements. The system identifies the problem; the human decides the fix.
CEO-in-the-loop: No autonomous skill rewriting. One-tap approval is required.
Skill versioning: Backups prevent data loss. Maximum 3 versions per skill.
Minimum sample size: Skills with fewer than 3 uses are excluded from analysis to avoid false positives.

Design reference: Karpathy AutoResearch Loop — modify → verify → keep/discard → repeat. With the critical addition of human approval.

How the 4 Pillars Work Together

Day 1:
  CEO sends message → Context Router suggests relevant skills
  Claude responds → Evals check output → Warning: "No force push"
  CEO sees warning, presses Fix It → Learning saved to learnings.md

Day 2:
  CEO sends similar message → Learnings injected: "Don't use --force"
  Claude avoids the mistake → No eval warnings → thumbs-up
  Quality metrics improve for that skill

Day 30:
  Nightly loop detects git-manager skill has 95% success rate
  No proposal needed — skill is healthy

Day 30 (different skill):
  Nightly loop detects code-review at 68% success
  Sends proposal to CEO → CEO approves → skill backed up
  CEO manually improves skill.md
  Next cycle: success rate climbs

The system creates a positive feedback loop: corrections become persistent rules → rules improve quality → metrics reflect improvement → nightly loop confirms health.

Pillar 5: Sage Worker (Phase 40.11+)

What: AI-powered skill analysis, benchmarking, and marketplace discovery.

Why: Manual skill improvement doesn't scale. The system needs automated analysis of skill quality and access to community expertise.

How:

Skill Analysis

Select any skill in the Skill Evolution UI → click "Sage Analyze" → Sage (Claude Haiku) evaluates:

Skill instruction clarity and completeness
Eval rule coverage
Improvement recommendations

A/B Benchmarks

Compare two versions of a skill:

Select a skill update (PR)
Run benchmark → Sage tests both versions against sample prompts
Results show quality comparison with summary

Marketplace Discovery

Search claudemarketplaces.com for community-created skills:

"Sage Scout" → search by keyword
Analyze compatibility with your project
Install globally or fork to a specific project

Design reference: Package managers (npm, pip) for AI skill management, with LLM-powered compatibility analysis.