Intelligence Layer — The 4 Pillars
How Arc OS validates, learns, focuses, and improves itself.
Overview
The Intelligence Layer sits between user messages and Claude responses. Its pillars plug into the message flow as follows:
     INPUT                   PROCESSING                   OUTPUT
┌──────────────┐        ┌──────────────────┐        ┌─────────────────┐
│ User Message │───────►│ Context Router   │        │ Claude Response │
│              │        │ (skill scoring)  │        │ + eval warnings │
│ Fix It / 👎  │        │                  │        │                 │
│ (feedback)   │        │ Learnings        │        │ Quality metrics │
│              │        │ (correction      │        │ logged          │
└──────────────┘        │ injection)       │        └────────┬────────┘
                        │                  │                 │
                        │ buildGsdPrompt() │         ┌───────┴────────┐
                        └────────┬─────────┘         │ Nightly Loop   │
                                 │                   │ (improvement   │
                                 ▼                   │ proposals)     │
                            Claude CLI               └────────────────┘
Pillar 1: Binary Eval Engine
What: Declarative rules that check every response before delivery.
Why: AI output should have quality gates, like unit tests for code.
How:
Each skill can have a .evals.json file with rules:
{
"rules": [
{ "id": "gm-001", "type": "string_not_contains", "value": "--force", "severity": "warning" },
{ "id": "gm-002", "type": "max_length", "value": 4000, "severity": "info" }
]
}
After Claude generates a response, the eval engine runs all applicable rules:
- Pass: response delivered as-is
- Fail: response delivered + warning footnote appended
[Claude's response about git operations]
---
Eval: ⚠️ No force push | ℹ️ Response under 4000 chars
Rule Types
| Type | What It Checks |
|---|---|
| string_contains | Response must include literal text |
| string_not_contains | Response must NOT include text |
| regex_match | Response must match regex pattern |
| regex_not_match | Response must NOT match regex |
| max_length | Response must be <= N characters |
| min_length | Response must be >= N characters |
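The six rule types map naturally onto plain string operations. The following is a minimal sketch of how such an engine could be written, not the actual Arc OS implementation: the rule fields (`id`, `type`, `value`, `severity`) come from the `.evals.json` example above, while `checkRule`, `runEvals`, `withEvalFootnote`, and the footnote wording are hypothetical names and choices made for illustration.

```typescript
// Field names mirror the .evals.json example; all function names and the
// result shape are assumptions made for this sketch.
type Severity = "info" | "warning";

type EvalRule =
  | { id: string; type: "string_contains" | "string_not_contains"; value: string; severity: Severity }
  | { id: string; type: "regex_match" | "regex_not_match"; value: string; severity: Severity }
  | { id: string; type: "max_length" | "min_length"; value: number; severity: Severity };

interface EvalResult {
  id: string;
  severity: Severity;
  passed: boolean;
}

// Deterministic check of one rule: no model call, only string operations.
// Note: `length` counts UTF-16 code units, and an invalid regex pattern
// would throw here; a real engine would probably validate rules on load.
function checkRule(rule: EvalRule, response: string): boolean {
  switch (rule.type) {
    case "string_contains":
      return response.includes(rule.value);
    case "string_not_contains":
      return !response.includes(rule.value);
    case "regex_match":
      return new RegExp(rule.value).test(response);
    case "regex_not_match":
      return !new RegExp(rule.value).test(response);
    case "max_length":
      return response.length <= rule.value;
    case "min_length":
      return response.length >= rule.value;
  }
}

// Run every applicable rule and record a binary pass/fail per rule.
function runEvals(rules: EvalRule[], response: string): EvalResult[] {
  return rules.map((rule) => ({
    id: rule.id,
    severity: rule.severity,
    passed: checkRule(rule, response),
  }));
}

// Non-blocking delivery: the response is always returned; failed rules
// only append a footnote (the footnote format here is an assumption).
function withEvalFootnote(response: string, results: EvalResult[]): string {
  const failed = results.filter((r) => !r.passed);
  if (failed.length === 0) return response;
  const notes = failed.map((r) => `${r.severity}: ${r.id}`).join(" | ");
  return `${response}\n---\nEval: ${notes}`;
}
```

Because every check is a pure function of the rule and the response text, the same response always produces the same result, which is what makes the pass/fail outcome unambiguous.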
Key Design Decisions
- Non-blocking: Warnings don't suppress responses. Users see both the answer and the concern.
- Per-skill: Different skills have different quality criteria.
- Per-project: Same skill can have different rules for different tech stacks.
- No AI in validation: Rules are deterministic. Binary pass/fail. No ambiguity.
Design reference: Anthropic Skill Creator — structured evals with binary assertions.
Pillar 2: Context Router
What: Intelligent skill selection that injects only relevant skills into each prompt.
Why: Loading all 25 skills at once dilutes the context. The model may, for example, try to apply deployment advice to a code review.
How:
Before building the prompt, the router scores every registered skill against the user's message:
Score = (trigger matches x 2) + (keyword matches x 1)
Sort by score descending → take top 5
Example:
User: "Review this code for XSS vulnerabilities"
Scoring:
code-review: trigger "review" (2) + keyword "XSS" (1) = 3
code-review-protocol: trigger "code review" (2) = 2
deployment-flow: no match = 0
git-manager: no match = 0
Injected into prompt:
SKILLS_HINT (focus on these):
- code-review: Security audit and code quality review...
- code-review-protocol: Structured code review with OWASP...
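The scoring rule can be sketched in a few lines. The source does not say how a match is detected, so this sketch assumes a term matches when all of its words appear in the message, case-insensitively (one reading that is consistent with the example above, where "code review" matches "Review this code"). The `Skill` shape, the function names, and the decision to drop zero-score skills are likewise assumptions.

```typescript
// Hypothetical skill shape: only "triggers" and "keywords" are named in the text.
interface Skill {
  name: string;
  triggers: string[];
  keywords: string[];
}

const TRIGGER_WEIGHT = 2;
const KEYWORD_WEIGHT = 1;

// Lowercase, split on whitespace, and strip surrounding punctuation.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((w) => w.replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, ""))
    .filter((w) => w.length > 0);
}

// Assumed matching rule: every word of the term occurs somewhere in the message.
function termMatches(messageWords: Set<string>, term: string): boolean {
  const words = tokenize(term);
  return words.length > 0 && words.every((w) => messageWords.has(w));
}

// Score = (trigger matches x 2) + (keyword matches x 1)
function scoreSkill(messageWords: Set<string>, skill: Skill): number {
  const triggerHits = skill.triggers.filter((t) => termMatches(messageWords, t)).length;
  const keywordHits = skill.keywords.filter((k) => termMatches(messageWords, k)).length;
  return triggerHits * TRIGGER_WEIGHT + keywordHits * KEYWORD_WEIGHT;
}

// Sort by score descending and keep the top `limit` skills.
// Dropping zero-score skills is an assumption; the source only says "take top 5".
function routeSkills(
  message: string,
  skills: Skill[],
  limit = 5,
): { name: string; score: number }[] {
  const messageWords = new Set(tokenize(message));
  return skills
    .map((s) => ({ name: s.name, score: scoreSkill(messageWords, s) }))
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Under these assumptions, a registry where code-review has trigger "review" and keyword "XSS", and code-review-protocol has trigger "code review", yields scores 3 and 2 for the example message, matching the walkthrough above.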
Why Advisory, Not Filtering
The router suggests skills but doesn't block others. Claude still has full access.
| Approach | Risk |
|---|---|
| Hard filtering (--allowedTools) | Misclassification breaks the session |
| Symlink mutations | Filesystem changes on running process |
| Advisory hints | Safe: wrong hint = Claude ignores it |
Triggers vs Keywords
Triggers (2 pts): Explicit invocation signals. The user directly requests this capability.
- "deploy", "review", "scaffold", "audit"
Keywords (1 pt): Broader semantic context. Suggests relevance without being a command.
- "OWASP", "Docker", "CI/CD", "performance"
Design reference: Context priming — focused attention without hard filtering.
Pillar 3: Reflect Loop
What: Automatic capture of corrections as persistent rules.
Why: AI corrections should survive restarts. One correction = permanent improvement.
How:
CEO presses 🛠️ Fix It
│
├── addLearning(source: "fixit", rule: "Fix requested for: <last response>")
├── projectLearnings reloaded from disk
└── Fix prompt sent to Claude for immediate correction
CEO presses 👎
│
├── addLearning(source: "negative", rule: "Negative feedback on: <response>")
├── qualityTracker.logFeedback(positive: false)
└── projectLearnings reloaded from disk
Rules are stored in learnings.md:
# Learnings
## Rules
- [2026-04-03T14:22:00Z] [fixit] Always use t-call for translations in Odoo QWeb
- [2026-04-03T15:10:00Z] [negative] Avoid sudo in deployment scripts
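The entry format shown above (`- [timestamp] [source] rule`) is simple enough to write and read back with a single pattern. The sketch below assumes exactly that line shape and the two sources that appear in this document; `Learning`, `formatLearning`, and `parseLearnings` are hypothetical names, not the real `addLearning` internals.

```typescript
// The two feedback sources named in this document.
type LearningSource = "fixit" | "negative";

interface Learning {
  timestamp: string; // ISO 8601 string, as in the example entries
  source: LearningSource;
  rule: string;
}

// Render one entry in the line format used by the learnings.md example.
function formatLearning(l: Learning): string {
  return `- [${l.timestamp}] [${l.source}] ${l.rule}`;
}

// Matches "- [timestamp] [source] rule text".
const ENTRY_PATTERN = /^- \[([^\]]+)\] \[(fixit|negative)\] (.+)$/;

// Parse a learnings.md body; headings and unrecognized lines are skipped.
function parseLearnings(markdown: string): Learning[] {
  const entries: Learning[] = [];
  for (const line of markdown.split("\n")) {
    const m = ENTRY_PATTERN.exec(line.trim());
    if (m) {
      entries.push({ timestamp: m[1], source: m[2] as LearningSource, rule: m[3] });
    }
  }
  return entries;
}
```

Keeping the file as line-oriented markdown means it stays hand-editable: removing a bad rule is a matter of deleting one line.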
On every subsequent message, accumulated learnings are injected:
LEARNINGS (past corrections — follow these rules):
- Avoid sudo in deployment scripts
- Always use t-call for translations in Odoo QWeb
Key Properties
- Automatic: No manual rule writing. Press a button → rule created.
- Persistent: Survives bot restarts. Written to disk as markdown.
- Per-project: Each child bot has its own learnings.md. Odoo corrections don't affect React bot.
- Newest first: Most recent corrections have highest visibility in the prompt.
- Budgeted: Maximum 2000 characters in the LEARNINGS block. Oldest rules drop off.
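The "newest first" and "budgeted" properties can be combined in one pass over the stored rules. This is a sketch under stated assumptions: rules arrive in file order (oldest first), the 2000-character budget covers the whole block including its header line, and a rule that does not fit ends the walk so that everything older is dropped with it. `buildLearningsBlock` is a hypothetical name.

```typescript
// Build the LEARNINGS prompt block from rules stored oldest-first.
// The header text is supplied by the caller.
function buildLearningsBlock(
  header: string,
  rulesOldestFirst: string[],
  maxChars = 2000,
): string {
  const lines: string[] = [header];
  let used = header.length;

  // Walk from newest to oldest so recent corrections claim the budget first.
  for (let i = rulesOldestFirst.length - 1; i >= 0; i--) {
    const line = `- ${rulesOldestFirst[i]}`;
    const cost = line.length + 1; // +1 for the newline joining it to the block
    if (used + cost > maxChars) break; // this rule and all older ones drop off
    lines.push(line);
    used += cost;
  }

  // With no rules (or none that fit), inject nothing rather than a bare header.
  return lines.length > 1 ? lines.join("\n") : "";
}
```

Stopping at the first rule that does not fit, instead of skipping it and trying older ones, keeps the block a strict "most recent N" window, which seems closest to "oldest rules drop off".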
Design reference: Claude Reflect System — corrections become permanent rules that prevent regression.
Pillar 4: Karpathy Loop
What: Nightly automated analysis of quality metrics with improvement proposals sent to CEO.
Why: Humans forget to review performance. The system should find its own weak spots.
How:
Every day at 03:00 UTC, scripts/nightly-improve.ts runs:
- Read registry: enumerate all child bots from bot_registry.json
- Read metrics: load quality-metrics.json per child
- Find underperformers: filter skills where:
  - applied_count >= 3 (minimum sample size to avoid noise)
  - AND either success_rate < 80% OR feedback_negative > feedback_positive
- Read learnings: extract related correction patterns
- Generate proposals: template-based (deterministic, no AI)
- Send to CEO: summary report + individual proposal cards in Telegram
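The underperformer filter and the template-based reasons can be sketched as one pure function. The metric field names come from the list above; the layout of `quality-metrics.json`, the assumption that `success_rate` is stored as a percentage from 0 to 100, and the names `SkillMetrics`, `Proposal`, and `findUnderperformers` are all hypothetical.

```typescript
// Assumed per-skill record; the real quality-metrics.json layout is not shown in the source.
interface SkillMetrics {
  skill: string;
  applied_count: number;
  success_rate: number; // assumed to be a percentage, 0-100
  feedback_positive: number;
  feedback_negative: number;
}

interface Proposal {
  skill: string;
  reasons: string[];
}

const MIN_APPLIED = 3; // minimum sample size to avoid noise
const SUCCESS_THRESHOLD = 80; // percent

// Flag skills with enough samples AND (low success rate OR net-negative feedback).
// Reasons are filled from fixed templates, so the output is deterministic.
function findUnderperformers(metrics: SkillMetrics[]): Proposal[] {
  const proposals: Proposal[] = [];
  for (const m of metrics) {
    if (m.applied_count < MIN_APPLIED) continue;

    const reasons: string[] = [];
    if (m.success_rate < SUCCESS_THRESHOLD) {
      reasons.push(`low success rate (${Math.round(m.success_rate)}%)`);
    }
    if (m.feedback_negative > m.feedback_positive) {
      reasons.push(
        `negative feedback outweighs positive (${m.feedback_negative} vs ${m.feedback_positive})`,
      );
    }
    if (reasons.length > 0) proposals.push({ skill: m.skill, reasons });
  }
  return proposals;
}
```

A skill with two uses and two thumbs-down is skipped entirely by the sample-size guard, which is the false-positive case the minimum is meant to prevent.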
CEO Approval Flow
Telegram: Proposal Card
┌──────────────────────────────────────┐
│ Improvement Proposal                 │
│                                      │
│ Child: citadel-v2                    │
│ Skill: code-review                   │
│ Reason: low success rate (72%)       │
│ Feedback: 👍 4 / 👎 6                │
│                                      │
│ Related learnings:                   │
│ • Always use t-call for i18n         │
│                                      │
│ [✅ Approve]  [❌ Reject]            │
└──────────────────────────────────────┘
- Approve: Back up skill.md → skill.v1.md (max 3 versions). Mark approved.
- Reject: Mark rejected in proposals.json. No changes made.
Key Design Decisions
- Template-based proposals: No AI generates the improvements. The system identifies the problem; the human decides the fix.
- CEO-in-the-loop: No autonomous skill rewriting. One tap approval required.
- Skill versioning: Backups prevent data loss. Maximum 3 versions per skill.
- Minimum sample size: Skills with fewer than 3 uses are excluded from analysis (avoid false positives).
Design reference: Karpathy AutoResearch Loop — modify → verify → keep/discard → repeat. With the critical addition of human approval.
How the 4 Pillars Work Together
Day 1:
CEO sends message → Context Router suggests relevant skills
Claude responds → Evals check output → Warning: "No force push"
CEO sees warning, presses Fix It → Learning saved to learnings.md
Day 2:
CEO sends similar message → Learnings injected: "Don't use --force"
Claude avoids the mistake → No eval warnings → thumbs-up
Quality metrics improve for that skill
Day 30:
Nightly loop detects git-manager skill has 95% success rate
No proposal needed — skill is healthy
Day 30 (different skill):
Nightly loop detects code-review at 68% success
Sends proposal to CEO → CEO approves → skill backed up
CEO manually improves skill.md
Next cycle: success rate climbs
The system creates a positive feedback loop: corrections become persistent rules → rules improve quality → metrics reflect improvement → nightly loop confirms health.
Pillar 5: Sage Worker (Phase 40.11+)
What: AI-powered skill analysis, benchmarking, and marketplace discovery.
Why: Manual skill improvement doesn't scale. The system needs automated analysis of skill quality and access to community expertise.
How:
Skill Analysis
Select any skill in the Skill Evolution UI → click "Sage Analyze" → Sage (Claude Haiku) evaluates:
- Skill instruction clarity and completeness
- Eval rule coverage
- Improvement recommendations
A/B Benchmarks
Compare two versions of a skill:
- Select a skill update (PR)
- Run benchmark → Sage tests both versions against sample prompts
- Results show quality comparison with summary
Marketplace Discovery
Search claudemarketplaces.com for community-created skills:
- "Sage Scout" → search by keyword
- Analyze compatibility with your project
- Install globally or fork to a specific project
Design reference: Package managers (npm, pip) for AI skill management, with LLM-powered compatibility analysis.