
Security & Trust Model

This document describes the unified security and trust model across the full Mnemom stack: the Agent Alignment Protocol (AAP), the Agent Integrity Protocol (AIP), and smoltbot — the gateway that ties them together. It covers what data is stored, how authentication works, how data is encrypted, and what threats the system does and does not defend against.

Trust Model Overview

Mnemom provides transparency, not guarantees. The two protocols serve complementary roles:
  • AAP provides post-hoc verification. It records agent decisions as AP-Traces, detects behavioral drift, and verifies traces against declared Alignment Cards. AAP operates after the action has been taken — typically 30-60 seconds later.
  • AIP provides real-time detection. It analyzes the agent’s thinking blocks between turns, producing integrity verdicts before the next action executes. AIP operates in the ~2-second window between agent reasoning and agent action.
Together they provide full-coverage observability: AIP catches problems as the agent thinks; AAP catches problems after the agent acts. The cross-protocol linkage (IntegrityCheckpoint.linked_trace_id references APTrace.trace_id) creates a complete audit chain from reasoning to decision.
AAP is a transparency protocol, not a trust protocol. A verified trace means the agent’s declared behavior is internally consistent — it does not mean the agent is trustworthy, safe, or aligned with human values. Verification checks consistency, not correctness.
smoltbot is the operational layer that runs both protocols. It is a Cloudflare Workers gateway that intercepts LLM API traffic, extracts thinking blocks for AIP analysis, generates AP-Traces for AAP verification, and applies the configured enforcement mode (observe, nudge, or enforce) based on the combined results.
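The cross-protocol linkage described above can be sketched as a simple join. This is a minimal illustration, not the Mnemom data model: only `linked_trace_id`, `trace_id`, and `checkpoint_id` come from the text, while the `action` and `verdict` field shapes and the `build_audit_chain` helper are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class APTrace:
    trace_id: str
    action: str  # hypothetical field, for illustration only


@dataclass
class IntegrityCheckpoint:
    checkpoint_id: str
    verdict: str
    linked_trace_id: Optional[str] = None


def build_audit_chain(
    traces: List[APTrace], checkpoints: List[IntegrityCheckpoint]
) -> Tuple[Dict[str, List[IntegrityCheckpoint]], List[IntegrityCheckpoint]]:
    """Group checkpoints under the trace they reference.

    Returns (chain, unlinked): chain maps each trace_id to its checkpoints,
    unlinked holds checkpoints whose linked_trace_id matches no known trace.
    """
    chain: Dict[str, List[IntegrityCheckpoint]] = {t.trace_id: [] for t in traces}
    unlinked: List[IntegrityCheckpoint] = []
    for cp in checkpoints:
        if cp.linked_trace_id in chain:
            chain[cp.linked_trace_id].append(cp)
        else:
            unlinked.append(cp)
    return chain, unlinked
```

Surfacing the unlinked checkpoints separately is a design choice of this sketch: a reasoning record with no matching decision record is itself worth reviewing.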

Data Handling

What Is Stored

| Data Type | Description | Storage Location |
| --- | --- | --- |
| AP-Traces | Structured decision records: what action the agent took, which values were applied, what alternatives were considered, and whether escalation was evaluated | Supabase Postgres |
| Integrity Checkpoints | Verdict (clear, review_needed, boundary_violation), concern categories, severity, reasoning summary, and a SHA-256 hash of the thinking block | Supabase Postgres |
| Alignment Cards | Agent identity declarations: values, autonomy envelope, escalation triggers, audit commitment | Supabase Postgres |
| Drift Alerts | Alerts generated when behavioral similarity drops below threshold over sustained turns (AAP) or when integrity patterns degrade within a session window (AIP) | Supabase Postgres |
| Enforcement Records | Records of enforcement actions taken (observe, nudge, enforce) in response to violations | Supabase Postgres |

What Is NOT Stored

| Data Type | Why Not |
| --- | --- |
| Prompts | User prompts are not persisted by the Mnemom infrastructure. They pass through the gateway transiently. |
| Responses | LLM responses are not persisted. Only the structured trace and checkpoint are retained. |
| Thinking block content | The full thinking block text is never stored. Only its SHA-256 hash (thinking_block_hash) is retained in the Integrity Checkpoint for content-addressed reference. |
| API keys | API keys are never stored in plaintext. Only SHA-256 hashes are stored, used for agent identification and billing attribution. |
The SHA-256 hash of the thinking block serves two purposes: it provides tamper evidence (proving that a specific thinking block was analyzed) and it enables correlation (linking a checkpoint back to the exact content that was evaluated) — all without retaining the agent’s private reasoning.
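A minimal sketch of the two purposes, using Python's standard `hashlib`. The UTF-8 encoding and lowercase hex output are assumptions; the text does not specify how `thinking_block_hash` is serialized, and both function names are illustrative.

```python
import hashlib
import hmac


def thinking_block_hash(thinking_block: str) -> str:
    """Hash the thinking block (assumed: UTF-8 bytes, hex-encoded digest)."""
    return hashlib.sha256(thinking_block.encode("utf-8")).hexdigest()


def matches_checkpoint(candidate_block: str, stored_hash: str) -> bool:
    """Check whether a candidate text is the block a checkpoint evaluated.

    The checkpoint stores only the hash, so correlation works by re-hashing
    text the caller already holds; the hash cannot be reversed to the text.
    """
    candidate_hash = thinking_block_hash(candidate_block)
    return hmac.compare_digest(candidate_hash.encode(), stored_hash.encode())
```

Anyone who still holds the original thinking block can prove it is the one that was analyzed, while a party holding only the checkpoint learns nothing about the reasoning itself.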

Authentication & Authorization

The Mnemom API supports three authentication patterns, each serving a different use case.

Bearer Token (Supabase JWT)

Used for authenticated user operations — dashboard access, card management, agent configuration.
| Header | Value |
| --- | --- |
| Authorization | Bearer <supabase-jwt> |
The JWT is issued by Supabase Auth and contains the user’s identity, organization membership, and role. Tokens are validated on every request.

API Key (x-api-key header)

Used for programmatic access — billing API calls, automated agent registration, CI/CD integrations.
| Header | Value |
| --- | --- |
| x-api-key | <api-key> |
API keys are scoped to a user or organization. The key itself is never stored; only its SHA-256 hash is persisted for lookup and validation.
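The hash-only storage pattern can be sketched as follows. This is an illustration under stated assumptions, not the Mnemom implementation: the `ApiKeyStore` class, its in-memory dictionary, and the key format produced by `secrets.token_urlsafe` are all hypothetical.

```python
import hashlib
import secrets
from typing import Dict, Optional


class ApiKeyStore:
    """Keeps only SHA-256 digests of issued keys, never the keys themselves."""

    def __init__(self) -> None:
        self._owner_by_digest: Dict[str, str] = {}

    @staticmethod
    def _digest(api_key: str) -> str:
        return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

    def issue(self, owner_id: str) -> str:
        """Create a key, persist its digest, and return the key exactly once."""
        api_key = secrets.token_urlsafe(32)
        self._owner_by_digest[self._digest(api_key)] = owner_id
        return api_key

    def resolve(self, presented_key: str) -> Optional[str]:
        """Return the owner for a presented x-api-key value, or None."""
        return self._owner_by_digest.get(self._digest(presented_key))
```

Because the lookup is keyed by digest, a leaked copy of the store does not directly reveal usable keys, and a lost key cannot be recovered, only reissued.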

Service Role Key

Used for internal and administrative operations — database migrations, system maintenance, admin export endpoints. Service role access bypasses row-level security.
Service role keys must never be exposed to client-side code or included in agent configurations. They are intended exclusively for server-side administrative operations.

Organization RBAC

Resources are scoped to organizations. Each organization member holds one of three roles:
| Role | Capabilities |
| --- | --- |
| Owner | Full access. Can delete the organization, manage billing, transfer ownership. |
| Admin | Can manage members, invite users, configure agents, manage API keys, view billing. |
| Member | Can view agents, traces, and checkpoints within the organization. Cannot modify configuration or manage members. |
Role-based access is enforced at the API layer via Supabase row-level security policies. Organization-scoped API keys inherit the permissions of the role they were created under.
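The role table can be expressed as a small permission map. This sketch only illustrates the capability split; in the real system the rules live in Supabase row-level security policies, not in application code like this. The permission names are hypothetical, and treating roles as cumulative (Admin includes Member's view rights, Owner includes Admin's) is an assumption.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    ADMIN = "admin"
    MEMBER = "member"


# Hypothetical permission names derived from the capabilities table.
VIEW = frozenset({"view_agents", "view_traces", "view_checkpoints"})
ADMIN_EXTRA = frozenset({
    "manage_members", "invite_users", "configure_agents",
    "manage_api_keys", "view_billing",
})
OWNER_EXTRA = frozenset({
    "delete_organization", "manage_billing", "transfer_ownership",
})

PERMISSIONS = {
    Role.MEMBER: VIEW,
    Role.ADMIN: VIEW | ADMIN_EXTRA,
    Role.OWNER: VIEW | ADMIN_EXTRA | OWNER_EXTRA,
}


def is_allowed(role: Role, permission: str) -> bool:
    """Return True if the role's permission set contains the permission."""
    return permission in PERMISSIONS[role]
```

Under this model an organization-scoped API key would carry the `Role` it was created under and be checked through the same `is_allowed` call.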

Encryption

In Transit

All endpoints enforce TLS 1.2+. This applies to:
  • The smoltbot gateway (Cloudflare Workers edge network)
  • The Mnemom API (Supabase edge functions)
  • Webhook signal delivery (AIP POST to registered endpoints, authenticated with HMAC-SHA256)
  • Cross-protocol communication between AAP and AIP components
Cloudflare’s edge network terminates TLS and re-encrypts traffic to origin servers. Certificate management is handled automatically.

At Rest

Database storage uses Supabase Postgres with encryption at rest. This covers all persisted data: AP-Traces, Integrity Checkpoints, Alignment Cards, drift alerts, enforcement records, and user/organization data.

Thinking Block Privacy

Thinking blocks receive special treatment. The full thinking block content is never written to any persistent storage — not in the database, not in logs, not in object storage. The analysis flow is:
  1. The thinking block is extracted from the LLM response (in the gateway or SDK)
  2. The thinking block is sent to the analysis LLM for evaluation
  3. The analysis result is captured as an Integrity Checkpoint
  4. The thinking block content is discarded
  5. Only the SHA-256 hash is retained in the checkpoint
The IntegrityConcern.evidence field may contain a short excerpt (recommended maximum 200 characters) from the thinking block to support human review of flagged concerns. This is an intentional tradeoff between privacy and reviewability. Implementations may configure evidence redaction for maximum privacy.
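The five-step flow and the evidence tradeoff can be sketched together. This is a minimal illustration, not the AIP reference implementation: the `Analyzer` callable signature, the `redact_evidence` flag, and the exact dataclass fields are assumptions, while the 200-character limit and the `ic-{uuid}` identifier format come from this document.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_EVIDENCE_CHARS = 200  # recommended maximum excerpt length


@dataclass
class IntegrityConcern:
    category: str
    evidence: str


@dataclass
class IntegrityCheckpoint:
    checkpoint_id: str
    verdict: str
    thinking_block_hash: str
    concerns: List[IntegrityConcern] = field(default_factory=list)


# Hypothetical analyzer: text in, (verdict, [(category, excerpt)]) out.
Analyzer = Callable[[str], Tuple[str, List[Tuple[str, str]]]]


def run_integrity_check(
    thinking_block: str, analyzer: Analyzer, redact_evidence: bool = False
) -> IntegrityCheckpoint:
    # Step 2: send the extracted block to the analysis model.
    verdict, raw_concerns = analyzer(thinking_block)
    # Step 3: capture the result, truncating or redacting evidence excerpts.
    concerns = [
        IntegrityConcern(cat, "" if redact_evidence else excerpt[:MAX_EVIDENCE_CHARS])
        for cat, excerpt in raw_concerns
    ]
    # Step 5: retain only the hash of the block.
    digest = hashlib.sha256(thinking_block.encode("utf-8")).hexdigest()
    # Step 4: the block is a local variable and is never written anywhere;
    # it goes out of scope when this function returns.
    return IntegrityCheckpoint(f"ic-{uuid.uuid4()}", verdict, digest, concerns)
```

With `redact_evidence=True` the checkpoint carries concern categories and the hash but no text from the thinking block at all, which corresponds to the maximum-privacy configuration.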

Threat Model

The Mnemom stack is designed with explicit awareness of what it can and cannot defend against. The following threat categories are addressed.

Adversarial Agents

An agent can produce perfect AP-Traces and pass every integrity checkpoint while acting against its principal’s interests. If the agent controls what it logs and reasons in ways not captured by the thinking block, the protocols provide visibility into what the agent declares and exposes, not what it actually does. AAP provides visibility, not prevention. AIP provides detection, not certainty. Together they raise the cost of deception — an adversarial agent must maintain consistency across both real-time thinking analysis and post-hoc trace verification — but they do not make deception impossible.
Do not treat protocol compliance as evidence of alignment. A verified trace and a clear integrity verdict mean the agent’s observable behavior is internally consistent with its declarations. They do not mean the agent is safe.

Prompt Injection

AIP detects injection patterns in thinking blocks. When an agent’s reasoning reveals it is responding to injected instructions (e.g., instructions embedded in tool output rather than the user’s original request), the analysis LLM flags the concern and the checkpoint verdict reflects it. However, AIP operates after the agent has already processed the injected content. The injection has occurred. AIP detects its effects on reasoning and can intervene before the action lands (in gateway or enforce mode), but it does not prevent the injected content from reaching the model in the first place.

Replay Attacks

Trace IDs (trace_id) and timestamps prevent AP-Trace replay. Each trace is uniquely identified and temporally bound. Integrity Checkpoint IDs (checkpoint_id, format ic-{uuid}) are unique per checkpoint. AIP webhook signals include timestamps, and receivers should reject signals older than a configurable threshold (recommended: 5 minutes). HMAC-SHA256 signatures on webhooks prevent signal forgery.
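A receiver-side check combining the freshness window and the HMAC-SHA256 signature might look like the sketch below. The signed payload layout (`timestamp + "." + body`), the hex encoding, and the function names are assumptions; the document does not specify how AIP serializes or transmits the signature, so consult the AIP specification before relying on this shape.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_AGE_SECONDS = 300  # recommended threshold: 5 minutes


def sign_signal(secret: bytes, timestamp: int, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 over an assumed 'timestamp.body' layout."""
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signal(
    secret: bytes,
    timestamp: int,
    body: bytes,
    signature: str,
    now: Optional[float] = None,
    max_age: int = MAX_AGE_SECONDS,
) -> bool:
    """Reject stale signals first, then compare signatures in constant time."""
    current = time.time() if now is None else now
    if abs(current - timestamp) > max_age:
        return False
    expected = sign_signal(secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Covering the timestamp with the signature matters in this sketch: if only the body were signed, an attacker could replay an old signal with a fresh timestamp.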

LLM-as-Judge Limits

AIP uses an LLM to evaluate another LLM’s thinking. This introduces a fundamental ceiling: the analysis model can be fooled by sophisticated adversarial content crafted specifically to manipulate the evaluation. This is the meta-injection threat — the content under analysis attempts to subvert the analysis itself. Mitigations include structured output validation, explicit meta-injection warnings in the conscience prompt, and using a separate model instance (ideally a different model family) for analysis. These raise the bar but do not eliminate the risk.
Detection accuracy is bounded by the analysis model’s capability. A smaller, cheaper analysis model will miss subtleties that a larger model would catch. AIP provides probabilistic detection, not mathematical proof of alignment.

Model Dependency

AIP requires models that expose thinking blocks. Current provider support:
| Provider | Thinking Source | Extraction Confidence |
| --- | --- | --- |
| Anthropic | Native thinking content blocks | 1.0 |
| OpenAI | reasoning_content field (GPT-5 Thinking) | 0.9 |
| Google Gemini | Thinking content parts | 0.9 |
| Fallback | Regex-based inference from response text | 0.3 |
Models without exposed thinking blocks cannot be analyzed by AIP. In this case, the system generates synthetic clear verdicts — the agent proceeds, but without integrity analysis. This is a known gap: models with opaque reasoning are invisible to AIP.
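One way to keep this gap visible downstream is to mark synthetic verdicts explicitly. The sketch below is an illustration only: the `synthetic` flag, the result dictionary shape, the provider keys, and the choice to give unknown providers the fallback confidence are all assumptions, with just the confidence values taken from the table above.

```python
from typing import Callable, Dict, Optional

# Confidence values from the provider table; the keys are hypothetical names.
EXTRACTION_CONFIDENCE: Dict[str, float] = {
    "anthropic": 1.0,
    "openai": 0.9,
    "google": 0.9,
    "fallback": 0.3,
}


def check_turn(
    provider: str,
    thinking_block: Optional[str],
    analyze: Callable[[str], str],
) -> dict:
    """Return a verdict record, marking it synthetic when nothing was analyzed."""
    if not thinking_block:
        # No exposed reasoning: the agent proceeds with an unanalyzed "clear".
        return {"verdict": "clear", "synthetic": True, "extraction_confidence": None}
    confidence = EXTRACTION_CONFIDENCE.get(provider, EXTRACTION_CONFIDENCE["fallback"])
    return {
        "verdict": analyze(thinking_block),
        "synthetic": False,
        "extraction_confidence": confidence,
    }
```

A consumer can then treat a synthetic `clear` differently from an analyzed one, for example by excluding it from integrity statistics.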

Retention & Audit

Configurable Retention

Trace and checkpoint retention is configurable via the audit_commitment.retention_days field in the Alignment Card. This field declares how long the agent’s operator commits to retaining audit data.
| Use Case | Recommended Retention |
| --- | --- |
| Development / testing | 7-30 days |
| Production general purpose | 90 days |
| EU AI Act compliance | 90+ days (see EU Compliance) |
| Enterprise / regulated | 365+ days |
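As a rough sketch, a retention commitment can be turned into a per-record deadline like this. Only the `audit_commitment.retention_days` path comes from the document; the dictionary representation of the Alignment Card and both helper names are assumptions, and whether data is actually purged after the deadline is an operator policy that the text does not describe.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retention_deadline(card: dict, created_at: datetime) -> datetime:
    """Earliest moment at which the retention commitment for a record ends."""
    days = card["audit_commitment"]["retention_days"]
    return created_at + timedelta(days=days)


def commitment_elapsed(
    card: dict, created_at: datetime, now: Optional[datetime] = None
) -> bool:
    """True once the operator is no longer committed to retaining the record."""
    current = datetime.now(timezone.utc) if now is None else now
    return current >= retention_deadline(card, created_at)
```

For a card with `retention_days` set to 90, a trace created on 1 January 2025 must remain available until at least 1 April 2025.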

Queryability

AP-Traces and Integrity Checkpoints are queryable via the Mnemom API.

Compliance Exports

Enterprise customers have access to compliance export endpoints for bulk data retrieval. Admin export endpoints are available for organization-level data extraction, supporting regulatory audit requirements.

Responsible Disclosure

If you discover a security vulnerability in any Mnemom component, please report it responsibly via GitHub Security Advisories on the relevant repository:
| Component | Repository |
| --- | --- |
| Agent Alignment Protocol | github.com/mnemom/aap |
| Agent Integrity Protocol | github.com/mnemom/aip |
| smoltbot Gateway | github.com/mnemom/smoltbot |
Do not open public GitHub issues for security vulnerabilities. Use the Security Advisories feature (Security tab on each repository) to report vulnerabilities privately. We will acknowledge receipt within 48 hours and aim to provide a fix or mitigation within 7 days.

Further Reading