Agent Alignment Protocol (AAP) Specification
Version: 0.1.0
Status: Draft
Date: 2026-02-01
Authors: Mnemon Research

Abstract
The Agent Alignment Protocol (AAP) defines a standard for autonomous agents to declare their alignment posture, produce auditable decision traces, and verify value coherence before inter-agent coordination. AAP extends existing agent coordination protocols (A2A, MCP) with an alignment layer that makes agent behavior observable to principals, auditors, and other agents. AAP is a transparency protocol, not a trust protocol. It makes agent behavior more observable, not more guaranteed.

Table of Contents
- Introduction
- Terminology
- Protocol Overview
- Alignment Card
- AP-Trace
- Value Coherence Handshake
- Verification
- Drift Detection
- Security Considerations
- Limitations
- IANA Considerations
- References
- Appendix A: JSON Schemas
- Appendix B: Verification Algorithm
1. Introduction
1.1 Problem Statement
The current agent protocol stack provides mechanisms for capability discovery (A2A Agent Cards), tool integration (MCP), and payment authorization (AP2). None of these protocols addresses a fundamental question: is this agent serving its principal’s interests? As agent capabilities become symmetric (equal access to information, equal reasoning power, equal tool access), alignment becomes the primary differentiator. When human and agent communication cannot be reliably distinguished, trust in alignment becomes essential infrastructure.
1.2 Design Goals
AAP is designed with the following goals:
- Transparency over guarantee: Make agent decisions observable, not provably correct
- Composability: Extend existing protocols (A2A, MCP) rather than replace them
- Minimal overhead: Add alignment without significant performance cost
- Falsifiability: Enable third-party verification and audit
- Honest limits: Be explicit about what the protocol cannot provide
1.3 Non-Goals
AAP explicitly does NOT attempt to:
- Guarantee that agents will behave as declared
- Provide protection against sophisticated deception
- Replace human judgment in consequential decisions
- Certify that an agent is “safe” or “trustworthy”
- Solve the alignment problem in general
1.4 Document Conventions
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.
2. Terminology
- Agent: An autonomous software entity capable of taking actions on behalf of a principal.
- Principal: The human or organization whose interests the agent is meant to serve.
- Alignment Card: A structured declaration of an agent’s alignment posture, including values, autonomy envelope, and audit commitments.
- AP-Trace: An audit log entry recording an agent’s decision process, including alternatives considered and selection reasoning.
- Value Coherence: The degree to which two agents’ declared values are compatible for coordination.
- Autonomy Envelope: The set of actions an agent may take without escalation, and the conditions that trigger escalation.
- Escalation: The process of deferring a decision to a principal or higher-authority agent.
- Drift: Behavioral deviation from declared alignment posture over time.
- Verification: The process of checking whether observed behavior (AP-Trace) is consistent with declared alignment (Alignment Card).
- Strand: In multi-turn conversations, a participant’s sequence of messages.
- SSM (Self-Similarity Matrix): A computational structure measuring semantic similarity between messages across a conversation.
- Divergence: The condition in which conversation strands drift apart semantically, indicating potential misalignment.
3. Protocol Overview
3.1 Components
AAP consists of three interconnected components:
- Alignment Card: Static declaration of alignment posture
- AP-Trace: Dynamic audit log of decisions
- Value Coherence Handshake: Pre-coordination compatibility check
3.2 Protocol Flow
A typical AAP interaction proceeds as follows:
3.3 Integration with Existing Protocols
AAP is designed to complement, not replace, existing protocols:
- A2A Integration: Alignment Card extends the A2A Agent Card with an `alignment` block
- MCP Integration: AP-Trace entries MAY be generated for tool invocations
- HTTP Integration: Alignment Cards SHOULD be served at `/.well-known/alignment-card.json`
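As an illustration of the HTTP integration point, the following sketch derives the well-known card URL from an agent's origin. The function name `alignment_card_url` and the host `agent.example` are hypothetical; only the path comes from this specification.

```python
from urllib.parse import urljoin

# Path taken from Section 3.3; everything else here is illustrative.
WELL_KNOWN_CARD_PATH = "/.well-known/alignment-card.json"


def alignment_card_url(agent_origin: str) -> str:
    """Return the well-known Alignment Card URL for an agent's HTTP origin.

    Because the well-known path is absolute, any path component on
    agent_origin is discarded by urljoin.
    """
    return urljoin(agent_origin, WELL_KNOWN_CARD_PATH)


print(alignment_card_url("https://agent.example"))
```

A client would typically fetch this URL over TLS and parse the body as JSON before running any of the checks described later.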
4. Alignment Card
4.1 Overview
An Alignment Card is a structured document declaring an agent’s alignment posture. It MUST be machine-readable (JSON) and SHOULD be human-readable.
4.2 Structure
An Alignment Card MUST contain the following top-level fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `aap_version` | string | REQUIRED | AAP specification version (e.g., “0.1.0”) |
| `card_id` | string | REQUIRED | Unique identifier for this card (UUID or URI) |
| `agent_id` | string | REQUIRED | Identifier for the agent (DID, URL, or UUID) |
| `issued_at` | string | REQUIRED | ISO 8601 timestamp of card issuance |
| `expires_at` | string | OPTIONAL | ISO 8601 timestamp of card expiration |
| `principal` | object | REQUIRED | Principal relationship declaration |
| `values` | object | REQUIRED | Value declarations |
| `autonomy_envelope` | object | REQUIRED | Autonomy bounds and escalation triggers |
| `audit_commitment` | object | REQUIRED | Audit trail commitments |
| `extensions` | object | OPTIONAL | Protocol-specific extensions |
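The table above translates directly into a presence check. The sketch below is a minimal illustration, not the normative schema in Appendix A; the helper name `missing_card_fields` is hypothetical.

```python
# REQUIRED top-level fields, in the order of the table in Section 4.2.
REQUIRED_CARD_FIELDS = (
    "aap_version",
    "card_id",
    "agent_id",
    "issued_at",
    "principal",
    "values",
    "autonomy_envelope",
    "audit_commitment",
)


def missing_card_fields(card: dict) -> list[str]:
    """Return the REQUIRED top-level fields absent from a card, in table order."""
    return [name for name in REQUIRED_CARD_FIELDS if name not in card]


# A deliberately incomplete card (identifiers are placeholders).
partial_card = {"aap_version": "0.1.0", "card_id": "card-001", "agent_id": "agent-001"}
print(missing_card_fields(partial_card))
```

This checks presence only. Type and nested-block validation are left to the JSON Schemas referenced in Appendix A.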
4.3 Principal Block
The `principal` block declares the agent’s relationship to its principal.
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | enum | REQUIRED | Type of principal |
| `identifier` | string | OPTIONAL | Principal identifier (DID, email, org ID) |
| `relationship` | enum | REQUIRED | Nature of authority delegation |
| `escalation_contact` | string | OPTIONAL | Endpoint for escalation notifications |
Values for `relationship`:
- `delegated_authority`: Agent acts within bounds set by principal
- `advisory`: Agent provides recommendations; principal makes decisions
- `autonomous`: Agent operates independently within declared values
4.4 Values Block
The `values` block declares the agent’s operational values.
| Field | Type | Required | Description |
|---|---|---|---|
| `declared` | array[string] | REQUIRED | List of value identifiers |
| `definitions` | object | RECOMMENDED | Definitions for non-standard values |
| `conflicts_with` | array[string] | OPTIONAL | Values this agent refuses to coordinate with |
| `hierarchy` | enum | OPTIONAL | How value conflicts are resolved |
Standard value identifiers:
| Identifier | Description |
|---|---|
| `principal_benefit` | Prioritize principal’s interests |
| `transparency` | Disclose reasoning and limitations |
| `minimal_data` | Collect only necessary information |
| `harm_prevention` | Avoid actions causing harm |
| `honesty` | Do not deceive or mislead |
| `user_control` | Respect user autonomy and consent |
| `privacy` | Protect personal information |
| `fairness` | Avoid discriminatory outcomes |
Values outside this list are non-standard and are defined in the `definitions` block.
4.5 Autonomy Envelope Block
The `autonomy_envelope` block defines what the agent may do independently.
| Field | Type | Required | Description |
|---|---|---|---|
| `bounded_actions` | array[string] | REQUIRED | Actions permitted without escalation |
| `escalation_triggers` | array[object] | REQUIRED | Conditions requiring escalation |
| `max_autonomous_value` | object | OPTIONAL | Maximum transaction value without escalation |
| `forbidden_actions` | array[string] | OPTIONAL | Actions never permitted |
Each entry in `escalation_triggers` has the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `condition` | string | REQUIRED | Condition expression (see Section 4.6) |
| `action` | enum | REQUIRED | `escalate`, `deny`, or `log` |
| `reason` | string | REQUIRED | Human-readable explanation |
4.6 Condition Expression Language
Escalation conditions use a minimal expression language. Examples:
- `purchase_value > 100`
- `action_type == "delete"`
- `shares_personal_data` (boolean field check)
Supported constructs:
- Comparison operators: `>`, `<`, `>=`, `<=`, `==`, `!=`
- String literal comparison: `field == "value"`
- Numeric comparison: `field > 100`
- Boolean field check: `field_name` (evaluates to true if field is truthy)
- Logical expressions: `condition and condition`, `condition or condition`
- `contains(field, value)`: substring or element containment
- `matches(field, pattern)`: regex matching
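The constructs above can be evaluated with very little code. The following is a sketch under stated assumptions, not the reference evaluator. It covers comparisons, boolean field checks, and `and`/`or` (with `and` binding tighter, which this text does not specify), and it leaves out `contains` and `matches`. It also assumes string literals never contain the substrings " and " or " or ". All function names are hypothetical.

```python
import operator
import re

# Comparison operators from Section 4.6, mapped to Python equivalents.
_OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}
# Two-character operators are listed first so ">=" is not read as ">".
_COMPARISON = re.compile(r"^(\w+)\s*(>=|<=|==|!=|>|<)\s*(.+)$")


def _literal(text: str):
    """Parse a double-quoted string literal or a numeric literal."""
    text = text.strip()
    if len(text) >= 2 and text[0] == text[-1] == '"':
        return text[1:-1]
    return float(text)


def _evaluate_atom(atom: str, context: dict) -> bool:
    atom = atom.strip()
    match = _COMPARISON.match(atom)
    if match is None:
        # Bare field name: boolean field check (missing fields are falsy).
        return bool(context.get(atom))
    field, op, raw = match.groups()
    if field not in context:
        return False
    try:
        return bool(_OPS[op](context[field], _literal(raw)))
    except (TypeError, ValueError):
        return False  # type mismatch or malformed literal


def evaluate_condition(expression: str, context: dict) -> bool:
    """Evaluate an expression as an OR of AND-clauses over simple atoms."""
    return any(
        all(_evaluate_atom(atom, context) for atom in clause.split(" and "))
        for clause in expression.split(" or ")
    )


print(evaluate_condition("purchase_value > 100", {"purchase_value": 250}))
```

Treating unknown fields and type mismatches as false is a design choice of this sketch. A stricter evaluator might instead surface them as errors so that a malformed trigger cannot silently fail to fire.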
4.7 Audit Commitment Block
The `audit_commitment` block declares how the agent logs decisions.
| Field | Type | Required | Description |
|---|---|---|---|
| `trace_format` | string | REQUIRED | Trace format identifier |
| `retention_days` | integer | REQUIRED | Minimum retention period |
| `storage` | object | OPTIONAL | Storage configuration |
| `queryable` | boolean | REQUIRED | Whether traces can be queried externally |
| `query_endpoint` | string | CONDITIONAL | Required if `queryable` is true |
| `tamper_evidence` | enum | OPTIONAL | Tamper-evidence mechanism |
4.8 Extensions Block
The `extensions` block allows protocol-specific additions.
4.9 Complete Example
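The following is an illustrative card assembled from the field tables in Sections 4.2 through 4.7, built as a Python dictionary and serialized to JSON. It is a sketch, not the specification's own example: every identifier, timestamp, and action name is a placeholder, and the values marked as assumed (the principal `type` and the `trace_format` identifier) are not enumerated in this text.

```python
import json

example_card = {
    "aap_version": "0.1.0",
    "card_id": "urn:uuid:00000000-0000-4000-8000-000000000001",  # placeholder
    "agent_id": "https://agent.example/agents/shopping-assistant",  # placeholder
    "issued_at": "2026-02-01T00:00:00Z",
    "expires_at": "2026-08-01T00:00:00Z",
    "principal": {
        "type": "human",  # assumed enum value; not listed in this text
        "relationship": "delegated_authority",
    },
    "values": {
        "declared": ["principal_benefit", "transparency", "minimal_data"],
    },
    "autonomy_envelope": {
        "bounded_actions": ["search", "compare", "recommend"],
        "escalation_triggers": [
            {
                "condition": "purchase_value > 100",
                "action": "escalate",
                "reason": "Purchases above 100 need principal approval",
            }
        ],
        "forbidden_actions": ["share_credentials"],
    },
    "audit_commitment": {
        "trace_format": "ap-trace-v1",  # assumed identifier
        "retention_days": 90,
        "queryable": False,  # query_endpoint is only required when this is true
    },
}

print(json.dumps(example_card, indent=2))
```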
5. AP-Trace
5.1 Overview
An AP-Trace (Alignment Protocol Trace) is an audit log entry recording an agent’s decision process. AP-Traces enable verification that observed behavior is consistent with declared alignment.
5.2 Design Principles
- Sampling, not completeness: AP-Traces capture significant decisions, not every computation
- Structured reasoning: Decision rationale is machine-parseable
- Verifiable references: Traces reference the Alignment Card in effect
- Append-only: Traces MUST NOT be modified after creation
5.3 Structure
An AP-Trace entry MUST contain:
| Field | Type | Required | Description |
|---|---|---|---|
| `trace_id` | string | REQUIRED | Unique identifier (UUID) |
| `agent_id` | string | REQUIRED | Agent that generated this trace |
| `card_id` | string | REQUIRED | Alignment Card in effect |
| `timestamp` | string | REQUIRED | ISO 8601 timestamp |
| `action` | object | REQUIRED | Action taken or considered |
| `decision` | object | REQUIRED | Decision process record |
| `escalation` | object | CONDITIONAL | Present if escalation evaluated |
| `context` | object | OPTIONAL | Additional context |
5.4 Action Block
The `action` block describes what action was taken or considered.
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | enum | REQUIRED | Action type |
| `name` | string | REQUIRED | Human-readable action name |
| `category` | enum | REQUIRED | How this action relates to autonomy envelope |
| `target` | object | OPTIONAL | Resource affected |
| `parameters` | object | OPTIONAL | Action parameters |
5.5 Decision Block
The `decision` block records the decision process.
| Field | Type | Required | Description |
|---|---|---|---|
| `alternatives_considered` | array | REQUIRED | Options evaluated (minimum 1) |
| `selected` | string | REQUIRED | Option ID selected |
| `selection_reasoning` | string | REQUIRED | Human-readable explanation |
| `values_applied` | array[string] | REQUIRED | Values that influenced decision |
| `confidence` | number | OPTIONAL | Decision confidence (0.0-1.0) |
Each entry in `alternatives_considered` has the following fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `option_id` | string | REQUIRED | Unique identifier for this option |
| `description` | string | REQUIRED | Human-readable description |
| `score` | number | OPTIONAL | Computed score (0.0-1.0) |
| `scoring_factors` | object | OPTIONAL | Breakdown of score components |
| `flags` | array[string] | OPTIONAL | Concerns or flags about this option |
5.6 Escalation Block
The `escalation` block records escalation evaluation.
5.7 Context Block
The `context` block provides additional information.
5.8 Complete Example
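The following is an illustrative trace assembled from the field tables in Sections 5.3 through 5.5, together with a small consistency check. It is a sketch, not the specification's own example. The `type` and `category` values are assumed, since the enum members are not listed in this text, and the helper `selected_option` is hypothetical. It also assumes that `selected` must name one of the entries in `alternatives_considered`.

```python
import uuid
from datetime import datetime, timezone

example_trace = {
    "trace_id": str(uuid.uuid4()),
    "agent_id": "https://agent.example/agents/shopping-assistant",  # placeholder
    "card_id": "urn:uuid:00000000-0000-4000-8000-000000000001",  # placeholder
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "action": {
        "type": "recommend",  # assumed enum value
        "name": "recommend",
        "category": "bounded",  # assumed enum value
    },
    "decision": {
        "alternatives_considered": [
            {"option_id": "opt-a", "description": "Cheapest matching product", "score": 0.82},
            {"option_id": "opt-b", "description": "Sponsored product", "score": 0.41,
             "flags": ["sponsored"]},
        ],
        "selected": "opt-a",
        "selection_reasoning": "Lowest price among products meeting the stated requirements",
        "values_applied": ["principal_benefit", "transparency"],
        "confidence": 0.9,
    },
}


def selected_option(trace: dict) -> dict:
    """Return the alternative named by decision.selected, or raise ValueError."""
    decision = trace["decision"]
    for option in decision["alternatives_considered"]:
        if option["option_id"] == decision["selected"]:
            return option
    raise ValueError(f"selected option {decision['selected']!r} is not among the alternatives")


print(selected_option(example_trace)["description"])
```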
6. Value Coherence Handshake
6.1 Overview
The Value Coherence Handshake is a pre-coordination protocol exchange that verifies whether two agents’ declared values are compatible for a proposed task.
6.2 Protocol Flow
6.3 Messages
6.3.1 alignment_card_request
Sent by the initiator to request the responder’s Alignment Card.
6.3.2 alignment_card_response
Sent by the responder with its Alignment Card. The `signature` field is OPTIONAL but RECOMMENDED for high-stakes interactions.
6.3.3 value_coherence_check
Sent by the initiator to perform the coherence check.
6.3.4 coherence_result
Sent by the responder with the coherence assessment.
6.4 Coherence Scoring
Value coherence score is computed as:
6.5 Conflict Resolution
When conflicts are detected, implementations SHOULD follow this resolution order:
- Automatic resolution: If one value strictly subsumes another
- Negotiated resolution: If agents can agree on modified scope
- Principal escalation: If agents cannot resolve autonomously
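To make the conflict and threshold logic concrete, the sketch below detects conflicts from the `conflicts_with` field of the values block (Section 4.4) and gates automatic coordination on `MIN_COHERENCE_FOR_PROCEED` (Section 8.3). The scoring formula from Section 6.4 is not reproduced in this text, so the Jaccard overlap used here, and the choice to score any conflict as 0.0, are stand-in assumptions rather than the protocol's definition. All function names are hypothetical.

```python
MIN_COHERENCE_FOR_PROCEED = 0.70  # from Section 8.3


def find_conflicts(values_a: dict, values_b: dict) -> set[str]:
    """Values one agent declares that the other lists in conflicts_with."""
    a_declared, b_declared = set(values_a["declared"]), set(values_b["declared"])
    a_refuses = set(values_a.get("conflicts_with", []))
    b_refuses = set(values_b.get("conflicts_with", []))
    return (a_declared & b_refuses) | (b_declared & a_refuses)


def coherence_score(values_a: dict, values_b: dict) -> float:
    """ASSUMED stand-in score: Jaccard overlap of declared values, 0.0 on conflict."""
    if find_conflicts(values_a, values_b):
        return 0.0
    a, b = set(values_a["declared"]), set(values_b["declared"])
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def may_proceed_automatically(values_a: dict, values_b: dict) -> bool:
    """True when the score reaches the automatic-proceed threshold."""
    return coherence_score(values_a, values_b) >= MIN_COHERENCE_FOR_PROCEED


initiator = {"declared": ["principal_benefit", "transparency", "honesty"]}
responder = {"declared": ["principal_benefit", "transparency", "honesty", "privacy"]}
print(coherence_score(initiator, responder), may_proceed_automatically(initiator, responder))
```

When `may_proceed_automatically` returns false, an implementation would fall through to the resolution order above, ending in principal escalation if the agents cannot resolve the conflict themselves.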
7. Verification
7.1 Overview
Verification is the process of checking whether observed behavior (AP-Trace entries) is consistent with declared alignment (Alignment Card).
7.2 Verification Scope
Verification operates at three levels:
- Trace verification: Single AP-Trace against Alignment Card
- Session verification: Collection of traces from one session
- Longitudinal verification: Traces across multiple sessions (drift detection)
7.3 Verification Algorithm
The verification algorithm MUST check:
- Autonomy compliance: Action category matches autonomy envelope
- Escalation compliance: Required escalations were performed
- Value consistency: Applied values match declared values
- Forbidden action compliance: No forbidden actions taken
- Behavioral similarity: Trace behavior is semantically similar to declared alignment
7.4 Verification Result
A verification result MUST contain:
| Field | Type | Required | Description |
|---|---|---|---|
| `verified` | boolean | REQUIRED | True if no violations were found |
| `trace_id` | string | REQUIRED | ID of the verified trace |
| `card_id` | string | REQUIRED | ID of the Alignment Card used |
| `timestamp` | string | REQUIRED | ISO 8601 timestamp of verification |
| `violations` | array | REQUIRED | List of violations found |
| `warnings` | array | REQUIRED | List of non-critical warnings |
| `similarity_score` | number | REQUIRED | Behavioral similarity (0.0-1.0) |
| `verification_metadata` | object | REQUIRED | Metadata about verification process |
The `similarity_score` field measures semantic similarity between the trace behavior and declared alignment using SSM (Self-Similarity Matrix) analysis. A score of 1.0 indicates perfect alignment; lower scores indicate divergence.
Threshold: `BEHAVIORAL_SIMILARITY_THRESHOLD = 0.50`. If a trace passes all structural checks but has `similarity_score < 0.50`, a `low_behavioral_similarity` warning is generated.
7.5 Violation Types
| Type | Severity | Description |
|---|---|---|
| `UNBOUNDED_ACTION` | HIGH | Action not in `bounded_actions` list |
| `FORBIDDEN_ACTION` | CRITICAL | Action in `forbidden_actions` list |
| `MISSED_ESCALATION` | HIGH | Escalation trigger matched but not escalated |
| `UNDECLARED_VALUE` | MEDIUM | Applied value not in declared values |
| `CARD_EXPIRED` | HIGH | Alignment Card has expired |
| `CARD_MISMATCH` | CRITICAL | Trace references different card than provided |
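The structural part of single-trace verification can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation from Appendix B. It omits `MISSED_ESCALATION` (which needs a condition evaluator) and the similarity score, returns only a subset of the result fields, matches the trace's action name against `bounded_actions` whether or not the action was escalated, and checks expiry against the trace timestamp. The shape of each violation entry is assumed from the table columns.

```python
from datetime import datetime


def _parse_timestamp(value: str) -> datetime:
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def verify_trace(trace: dict, card: dict) -> dict:
    """Run the structural checks and return a partial verification result."""
    violations = []

    def violation(kind: str, severity: str, description: str) -> None:
        violations.append({"type": kind, "severity": severity, "description": description})

    if trace["card_id"] != card["card_id"]:
        violation("CARD_MISMATCH", "CRITICAL", "Trace references a different card")

    # Both timestamps are assumed to carry a UTC offset (or "Z").
    expires_at = card.get("expires_at")
    if expires_at and _parse_timestamp(trace["timestamp"]) > _parse_timestamp(expires_at):
        violation("CARD_EXPIRED", "HIGH", "Card had expired when the trace was recorded")

    envelope = card["autonomy_envelope"]
    action_name = trace["action"]["name"]
    if action_name in envelope.get("forbidden_actions", []):
        violation("FORBIDDEN_ACTION", "CRITICAL", f"{action_name} is forbidden")
    elif action_name not in envelope["bounded_actions"]:
        violation("UNBOUNDED_ACTION", "HIGH", f"{action_name} is not a bounded action")

    declared = set(card["values"]["declared"])
    for value in trace["decision"]["values_applied"]:
        if value not in declared:
            violation("UNDECLARED_VALUE", "MEDIUM", f"{value} is not declared")

    return {
        "verified": not violations,
        "trace_id": trace["trace_id"],
        "card_id": card["card_id"],
        "violations": violations,
    }
```

A caller would pass one parsed AP-Trace and the Alignment Card it claims to reference, then inspect `violations` by severity. As Section 10.1 stresses, a `verified` result says only that the trace is consistent with the card.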
8. Drift Detection
8.1 Overview
Drift detection identifies when an agent’s behavior deviates from its declared alignment posture over time. Unlike single-trace verification, drift detection requires longitudinal analysis.
8.2 Detection Methodology
Drift detection uses trace-to-baseline-centroid similarity to measure behavioral consistency (v1.2.0):
- Feature extraction: Extract structural features from each trace (action types, values applied, escalation patterns, confidence)
- Chronological ordering: Sort all traces by timestamp before processing
- Baseline computation: Compute a baseline centroid from the first N traces, where N = `max(sustained_threshold, min(10, len(traces) // 4))`
- Similarity computation: Compute cosine similarity between each subsequent trace’s feature vector and the baseline centroid
- Divergence tracking: Track similarity over time and alert on sustained divergence
The Alignment Card is still used for `card_id` in drift alerts and for inferring drift direction (Section 8.5), but it is no longer used as the comparison target for similarity scoring.
Important (v1.1.0, updated v1.2.0): Drift detection uses only structural features — not content features from reasoning text. Alignment Cards declare structural constraints (values, bounded actions, principal relationship) but contain no reasoning text. Including content tokens from trace reasoning dilutes cosine similarity without adding alignment signal. See calibration Section 3.5 for rationale.
Content features (TF-IDF from reasoning text) remain available for text-to-text comparison (e.g., SSM pairwise similarity between traces).
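The methodology above can be sketched end to end in a few functions. This is an illustration under stated assumptions, not the algorithm from Appendix B.2. Feature names follow the structural feature table in Section 8.3, the `escalation:*` features are omitted because the escalation block's fields are not listed in this text, timestamps are assumed to share one format so that string sorting is chronological, and an alert is represented simply as the list of trace IDs in the low-similarity streak. All function names are hypothetical.

```python
import math

DEFAULT_SIMILARITY_THRESHOLD = 0.30  # Section 8.3
DEFAULT_SUSTAINED_TURNS_THRESHOLD = 3  # Section 8.3


def extract_features(trace: dict) -> dict:
    """Structural features only; no tokens from reasoning text."""
    features = {f"value:{v}": 1.0 for v in trace["decision"]["values_applied"]}
    action = trace["action"]
    features[f"action_name:{action['name']}"] = 1.0
    features[f"action:{action['type']}"] = 1.0
    features[f"category:{action['category']}"] = 1.0
    if "confidence" in trace["decision"]:
        features["confidence"] = float(trace["decision"]["confidence"])
    return features


def centroid(vectors: list) -> dict:
    """Component-wise mean of sparse feature vectors."""
    total = {}
    for vector in vectors:
        for key, weight in vector.items():
            total[key] = total.get(key, 0.0) + weight
    return {key: weight / len(vectors) for key, weight in total.items()}


def cosine(a: dict, b: dict) -> float:
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def detect_drift(traces: list,
                 similarity_threshold: float = DEFAULT_SIMILARITY_THRESHOLD,
                 sustained_threshold: int = DEFAULT_SUSTAINED_TURNS_THRESHOLD) -> list:
    """Return one alert (a list of trace IDs) per sustained low-similarity streak."""
    ordered = sorted(traces, key=lambda t: t["timestamp"])
    n = max(sustained_threshold, min(10, len(ordered) // 4))
    baseline = centroid([extract_features(t) for t in ordered[:n]])
    alerts, streak = [], []
    for trace in ordered[n:]:
        if cosine(extract_features(trace), baseline) < similarity_threshold:
            streak.append(trace["trace_id"])
            if len(streak) == sustained_threshold:  # alert once per streak
                alerts.append(list(streak))
        else:
            streak = []
    return alerts
```

Because the baseline is built from the agent's own earliest traces, this approach detects change relative to past behavior. It cannot tell whether the baseline period itself matched the card.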
8.3 Calibration Constants
The following thresholds are calibrated from empirical analysis:
| Constant | Value | Description |
|---|---|---|
| `DEFAULT_SIMILARITY_THRESHOLD` | 0.30 | Alert when drift similarity drops below this value |
| `DEFAULT_SUSTAINED_TURNS_THRESHOLD` | 3 | Alert after N consecutive low-similarity traces |
| `BEHAVIORAL_SIMILARITY_THRESHOLD` | 0.50 | Warn on single-trace verification when below this value |
| `MIN_COHERENCE_FOR_PROCEED` | 0.70 | Minimum coherence score for automatic proceed |
Structural features used for drift detection:
| Feature | Source | Weight |
|---|---|---|
| `value:{v}` | Declared values (card) / Applied values (trace) | 1.0 |
| `action_name:{name}` | Bounded actions (card) / Action name (trace) | 1.0 |
| `action:{type}` | Action type (trace only) | 1.0 |
| `category:{cat}` | Principal type/relationship (card) / Action category (trace) | 1.0 |
| `escalation:*` | Escalation triggers (card) / Escalation state (trace) | 1.0 |
| `confidence` | Decision confidence (trace only) | 0.0-1.0 |
Content feature weighting for text-to-text comparison:
| Component | Weight | Description |
|---|---|---|
| Word TF-IDF | 60% | Semantic content from unigrams and bigrams |
| Character TF-IDF | 30% | Stylistic patterns from 3-5 character n-grams |
| Metadata features | 10% | Structural alignment (action types, values) |
8.4 Drift Alert
When sustained divergence is detected:
8.5 Drift Categories
| Category | Description | Indicators |
|---|---|---|
| `autonomy_expansion` | Agent acting outside declared bounds | Decreased escalation rate, new action types |
| `value_drift` | Applied values diverging from declared | Different values appearing in traces |
| `principal_misalignment` | Decisions not serving principal | Declining principal-benefit scores |
| `communication_drift` | Explanations becoming inconsistent | Selection reasoning diverging from values |
9. Security Considerations
9.1 Threat Model
AAP operates under the following threat model:
In scope:
- Agents misrepresenting their alignment
- Agents failing to log decisions accurately
- Agents gaming coherence checks
- Man-in-the-middle attacks on handshakes
Out of scope:
- Compromised agent runtime
- Adversarial principals
- Collusion between agents and verifiers
9.2 Alignment Card Security
- Card authenticity: Cards SHOULD be signed using Ed25519 or equivalent
- Card freshness: Cards MUST include `issued_at` and SHOULD include `expires_at`
- Card revocation: Implementations SHOULD support card revocation via `/.well-known/alignment-card-revocations.json`
9.3 AP-Trace Security
- Trace integrity: Traces MUST be append-only
- Trace authenticity: Traces SHOULD be signed by the generating agent
- Trace completeness: Agents MUST NOT selectively omit traces
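One way to make an append-only log tamper-evident is a hash chain, in which each entry's hash covers the previous entry's hash. The sketch below illustrates that technique. It is an assumption, not a mechanism this specification mandates: the permitted `tamper_evidence` values are not listed in this text, and the class name `TraceLog` is hypothetical.

```python
import copy
import hashlib
import json

GENESIS_HASH = "0" * 64  # arbitrary starting value for the chain


def _digest(trace: dict, previous_hash: str) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the trace."""
    payload = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + payload).encode("utf-8")).hexdigest()


class TraceLog:
    """In-memory append-only log whose entries are linked by a hash chain."""

    def __init__(self) -> None:
        self._entries = []  # list of (trace, entry_hash) pairs

    def append(self, trace: dict) -> str:
        previous = self._entries[-1][1] if self._entries else GENESIS_HASH
        entry_hash = _digest(trace, previous)
        # Store a copy so later changes to the caller's dict do not alter the log.
        self._entries.append((copy.deepcopy(trace), entry_hash))
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited, reordered, or removed entry breaks it."""
        previous = GENESIS_HASH
        for trace, entry_hash in self._entries:
            if _digest(trace, previous) != entry_hash:
                return False
            previous = entry_hash
        return True
```

A hash chain detects modification of entries that were logged. It does not detect decisions that were never logged, and truncation of the most recent entries goes unnoticed unless the latest hash is published somewhere the agent cannot rewrite.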
9.4 Handshake Security
- Transport security: All handshake messages MUST be transmitted over TLS 1.3 or equivalent
- Replay protection: Requests MUST include a unique `request_id` and a `timestamp`
- Response binding: Responses MUST reference the `request_id`
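The replay-protection and response-binding requirements can be sketched as follows. Only `request_id` and `timestamp` come from this specification. The `type` field name, the five-minute freshness window, and the names `new_request`, `ReplayGuard`, and `is_bound_to` are illustrative assumptions.

```python
import uuid
from datetime import datetime, timedelta, timezone


def new_request(message_type: str, **fields) -> dict:
    """Build a handshake request carrying a unique request_id and a timestamp."""
    return {
        "type": message_type,  # assumed field name for the message kind
        "request_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }


class ReplayGuard:
    """Rejects requests that are stale or whose request_id was already seen."""

    def __init__(self, max_age: timedelta = timedelta(minutes=5)) -> None:
        self._max_age = max_age  # assumed freshness window
        self._seen = set()

    def accept(self, request: dict) -> bool:
        sent_at = datetime.fromisoformat(request["timestamp"])
        age = abs(datetime.now(timezone.utc) - sent_at)
        if age > self._max_age or request["request_id"] in self._seen:
            return False
        self._seen.add(request["request_id"])
        return True


def is_bound_to(response: dict, request: dict) -> bool:
    """Response binding: the response must echo the request's request_id."""
    return response.get("request_id") == request["request_id"]


request = new_request("alignment_card_request")
guard = ReplayGuard()
print(guard.accept(request), guard.accept(request))
```

In a long-running service the set of seen identifiers would need to be pruned. Entries older than the freshness window can be dropped safely, because the timestamp check already rejects them.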
9.5 Known Limitations
AAP cannot protect against:
- Sophisticated deception: An agent can produce compliant traces while acting misaligned
- Trace omission: An agent can fail to log certain decisions
- Value gaming: An agent can declare values it does not hold
- Runtime compromise: If the agent runtime is compromised, all bets are off
10. Limitations
10.1 What AAP Does Not Guarantee
This section is non-negotiable. Implementations MUST make these limitations clear to users.
1. AAP does NOT ensure alignment; it provides visibility. AAP makes agent decisions observable. It does not make them correct, safe, or aligned. An agent can produce perfect AP-Traces while acting against its principal’s interests.
2. Verified does NOT equal safe. A verified trace means the trace is consistent with the declared alignment. It does not mean the declared alignment is good, the agent followed it in practice, or the outcome was beneficial.
3. AP-Trace is sampled, not complete. Traces capture decision points, not every computation. Significant reasoning may occur between traced decisions. The absence of a trace does not mean nothing happened.
4. Value coherence is relative to declared values. The handshake checks whether declared values are compatible. It does not verify that agents hold these values, will act on them, or that the values themselves are good.
5. Tested on transformer-based agents; unknown unknowns exist for other substrates. AAP was developed and tested with transformer-based language model agents. Agents built on different architectures (symbolic AI, neuromorphic computing, hybrid systems) may exhibit behaviors that AAP does not capture.
10.2 Appropriate Use
AAP is appropriate for:
- Increasing observability of agent decisions
- Enabling audit and compliance workflows
- Facilitating agent coordination with transparency
- Detecting obvious misalignment or drift
AAP is NOT appropriate for:
- Certifying agents as “safe” or “trustworthy”
- Replacing human oversight for consequential decisions
- Providing security guarantees against adversarial agents
- Solving the general alignment problem
10.3 Recommendations
- Defense in depth: Use AAP as one layer of a multi-layer oversight system
- Human-in-the-loop: Maintain human oversight for consequential decisions
- Verification diversity: Use multiple verification approaches, not just AAP
- Continuous monitoring: Monitor for drift, don’t rely on point-in-time verification
11. IANA Considerations
11.1 Media Type Registration
This specification registers the following media types:
application/aap-alignment-card+json
- Type name: application
- Subtype name: aap-alignment-card+json
- Required parameters: none
- Optional parameters: version
- Encoding considerations: UTF-8
application/aap-trace+json
- Type name: application
- Subtype name: aap-trace+json
- Required parameters: none
- Optional parameters: version
- Encoding considerations: UTF-8
11.2 Well-Known URI Registration
This specification registers the following well-known URIs:
- `/.well-known/alignment-card.json`: Agent’s current Alignment Card
- `/.well-known/alignment-card-revocations.json`: Revoked card identifiers
12. References
12.1 Normative References
- [RFC2119] Bradner, S., “Key words for use in RFCs to Indicate Requirement Levels”, BCP 14, RFC 2119, March 1997.
- [RFC8174] Leiba, B., “Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words”, BCP 14, RFC 8174, May 2017.
- [RFC8259] Bray, T., “The JavaScript Object Notation (JSON) Data Interchange Format”, RFC 8259, December 2017.
- [RFC3339] Klyne, G. and C. Newman, “Date and Time on the Internet: Timestamps”, RFC 3339, July 2002.
12.2 Informative References
- A2A (Agent-to-Agent Protocol): https://google.github.io/A2A/
- MCP (Model Context Protocol): https://modelcontextprotocol.io/
- DID (Decentralized Identifiers): https://www.w3.org/TR/did-core/
12.3 Standards and Regulatory References
- [ISO/IEC 42001:2023] ISO/IEC, “Information technology — Artificial Intelligence Management System”, 2023. https://www.iso.org/standard/42001
- [ISO/IEC 42005:2025] ISO/IEC, “Information technology — Artificial intelligence — AI system impact assessment”, 2025. https://www.iso.org/standard/42005
- [IEEE 7001-2021] IEEE, “Standard for Transparency of Autonomous Systems”, 2021. https://standards.ieee.org/ieee/7001/6929/
- [IEEE 3152-2024] IEEE, “Standard for Transparent Human and Machine Agency Identification”, 2024. https://standards.ieee.org/ieee/3152/11718/
- [IMDA MGF] IMDA Singapore, “Model AI Governance Framework for Agentic AI”, January 2026. https://www.imda.gov.sg/-/media/imda/files/about/emerging-tech-and-research/artificial-intelligence/mgf-for-agentic-ai.pdf
- [EU AI Act] European Union, “Regulation (EU) 2024/1689 — Artificial Intelligence Act”, Article 50 (Transparency obligations), enforcement August 2026. https://artificialintelligenceact.eu/article/50/
Appendix A: JSON Schemas
A.1 Alignment Card Schema
See `schemas/alignment-card.schema.json` for the complete JSON Schema.
A.2 AP-Trace Schema
See `schemas/ap-trace.schema.json` for the complete JSON Schema.
A.3 Value Coherence Messages Schema
See `schemas/value-coherence.schema.json` for the complete JSON Schema.
Appendix B: Verification Algorithm
B.1 Reference Implementation
B.2 Drift Detection Algorithm
Appendix C: Changelog
Version 0.1.1 (2026-02-01)
- Added behavioral similarity scoring to verification (Sections 7.3 and 7.4)
- Added `similarity_score` field to VerificationResult
- Added `BEHAVIORAL_SIMILARITY_THRESHOLD` constant (0.50)
- Added `low_behavioral_similarity` warning type
- Documented 60/30/10 TF-IDF feature weighting (Section 8.3)
- Updated reference implementation in Appendix B.1
- Added reference to calibration for threshold derivation methodology
Version 0.1.0 (2026-01-31)
- Initial draft specification
- Alignment Card schema defined
- AP-Trace format defined
- Value Coherence Handshake protocol defined
- Verification algorithm specified
- Drift detection methodology outlined
Agent Alignment Protocol Specification v0.1.1. Authors: Mnemon Research. This document is released under CC BY 4.0.