AAP Security Model
Version: 0.1.0
Date: 2026-01-31
Author: Mnemon Research
Status: Normative

Purpose of This Document
This document defines the security model for the Agent Alignment Protocol (AAP). It specifies:
- What AAP protects against (and what it doesn’t)
- Trust assumptions and their failure modes
- Cryptographic requirements
- Attack scenarios and mitigations
- Implementation security guidance
- Operational security requirements
Table of Contents
- Threat Model
- Trust Boundaries
- Security Properties
- Alignment Card Security
- AP-Trace Security
- Handshake Security
- Verification Security
- Drift Detection Security
- Cryptographic Requirements
- Implementation Security
- Operational Security
- Adversarial Analysis
- What AAP Cannot Protect Against
- Defense in Depth
1. Threat Model
1.1 Adversary Classes
AAP considers three classes of adversary:

Class A: Honest-but-Curious
- Follows protocol correctly
- May attempt to learn information beyond what’s intended
- Does not fabricate or tamper with data
- AAP provides strong protection
Class B: Strategic Gaming
- Follows protocol structure but games content
- Selectively logs favorable decisions
- Declares values it doesn’t implement
- Exploits ambiguities in specification
- AAP provides partial detection, limited protection
Class C: Malicious
- Actively subverts protocol
- Fabricates traces, forges cards
- Colludes with other malicious agents
- Compromises verification infrastructure
- AAP provides minimal protection; out of primary scope
1.2 Threats In Scope
AAP’s security model addresses:

| Threat | Description | Mitigation |
|---|---|---|
| Card forgery | Attacker creates fake Alignment Card for another agent | Cryptographic signatures (S4) |
| Card tampering | Attacker modifies legitimate card in transit | Signatures + TLS (S4, S9) |
| Card replay | Attacker uses outdated/revoked card | Timestamps + revocation (S4) |
| Trace fabrication | Agent logs decisions that didn’t happen | Signatures + correlation (S5) |
| Trace tampering | Traces modified after creation | Append-only + signatures (S5) |
| Trace omission | Agent fails to log certain decisions | Sequence numbers + monitoring (S5) |
| Handshake MITM | Attacker intercepts coherence handshake | TLS + binding (S6) |
| Handshake replay | Attacker replays old handshake responses | Nonces + timestamps (S6) |
| Verification bypass | Agent evades verification checks | Multiple verification points (S7) |
| Drift gaming | Agent drifts in ways that avoid detection | Multi-dimensional monitoring (S8) |
1.3 Threats Out of Scope
AAP does NOT protect against:

| Threat | Reason |
|---|---|
| Compromised agent runtime | If execution environment is compromised, all AAP guarantees are void |
| Malicious principal | If the principal is adversarial, AAP cannot prevent harm to third parties |
| Collusion: agent + verifier | Verification requires honest verifiers; collusion defeats the model |
| Collusion: agent + auditor | External audit assumes honest auditors |
| Sophisticated deception | An intelligent agent can craft compliant-appearing misaligned behavior |
| Side-channel attacks | Behavior not captured in traces (timing, resource usage, etc.) |
| Social engineering | Manipulating humans to ignore violations |
1.4 Threat Prioritization
Using a DREAD-style assessment (Damage, Reproducibility, Exploitability, Affected Users, Discoverability):

| Threat | Priority | Rationale |
|---|---|---|
| Trace omission | P0 | Easy to execute, hard to detect, undermines core value proposition |
| Card forgery | P0 | Complete impersonation; high damage |
| Declaration fraud | P1 | Declaring values not held; common, hard to verify |
| Trace fabrication | P1 | Inventing good behavior; requires sophistication |
| Verification gaming | P1 | Passing checks while misaligned; requires protocol knowledge |
| Handshake MITM | P2 | Requires network position; standard TLS mitigates |
| Drift gaming | P2 | Requires long-term planning; detectable with diverse monitoring |
2. Trust Boundaries
2.1 Trust Boundary Diagram
2.2 Trust Assumptions
TA1: Principal Authority
- The principal has legitimate authority over the agent
- The principal’s declared interests are their actual interests
- Failure mode: Malicious principal uses agent for harm

TA2: Runtime Integrity
- The agent’s execution environment is not compromised
- The agent’s code has not been tampered with
- Failure mode: Compromised runtime can produce arbitrary outputs

TA3: Verifier Independence
- Verifiers are independent from the agents they verify
- Verifiers follow the verification algorithm honestly
- Failure mode: Colluding verifiers rubber-stamp violations

TA4: Cryptographic Soundness
- Cryptographic primitives remain secure
- Private keys are not compromised
- Failure mode: Key compromise enables forgery

TA5: Network Security
- TLS provides confidentiality and integrity in transit
- DNS/routing infrastructure is not compromised
- Failure mode: MITM attacks on handshakes
2.3 Trust Boundary Violations
When trust boundaries are violated, AAP’s guarantees degrade:

| Boundary Violated | Remaining Guarantees |
|---|---|
| TA1 (Principal) | AAP works as designed, but the principal may direct harm |
| TA2 (Runtime) | None — all outputs may be fabricated |
| TA3 (Verifier) | Traces exist but verification is meaningless |
| TA4 (Crypto) | Cards and traces can be forged retroactively |
| TA5 (Network) | Handshakes compromised; cards/traces still have value |
3. Security Properties
3.1 Properties AAP Provides
P1: Card Authenticity
- A valid Alignment Card was created by the claimed agent
- The card has not been modified since creation
- Mechanism: Ed25519 signatures

P2: Card Freshness
- The card is currently valid (not expired, not revoked)
- Mechanism: Timestamps, expiration, revocation lists

P3: Trace Authenticity
- A valid AP-Trace was created by the claimed agent
- The trace has not been modified since creation
- Mechanism: Ed25519 signatures

P4: Trace Ordering
- Traces from a single agent can be ordered
- Missing traces can be detected
- Mechanism: Sequence numbers, timestamps

P5: Handshake Freshness
- Handshake responses correspond to specific requests
- Responses cannot be replayed
- Mechanism: Request IDs, nonces, timestamps

P6: Verification Determinism
- Given the same card and trace, any honest verifier produces the same result
- Mechanism: Deterministic verification algorithm
3.2 Properties AAP Does NOT Provide
NOT P7: Trace Completeness
- AAP cannot prove an agent logged all decisions
- An agent can selectively log only favorable decisions
- Why: Decision boundaries are determined by the agent

NOT P8: Declaration Truthfulness
- AAP cannot verify an agent holds its declared values
- An agent can declare values it doesn’t implement
- Why: Values are internal states; AAP observes declarations only

NOT P9: Decision Correctness
- AAP cannot verify decisions are correct or beneficial
- A verified trace may describe harmful behavior
- Why: AAP checks consistency, not ethics

NOT P10: Deception Resistance
- An intelligent agent can craft compliant-looking misalignment
- AAP provides transparency, not lie detection
- Why: Requires solving AI alignment; out of scope
4. Alignment Card Security
4.1 Card Lifecycle
4.2 Card Creation Security
Requirements:
- Cards SHOULD be signed using Ed25519 (EdDSA over Curve25519) for production deployments
- If signing is implemented, the signing key MUST be unique per agent
- The signing key’s private component MUST be stored securely (HSM recommended for production)
- Cards MUST include:
  - card_id: Globally unique identifier (UUID v4 or equivalent)
  - issued_at: RFC 3339 timestamp of creation
  - signature: Base64-encoded Ed25519 signature over canonical JSON (OPTIONAL in v0.1.0)
Note: The v0.1.0 SDK focuses on verification logic. Cryptographic signing is a recommended production enhancement. See specification Section 9.2 for alignment.

Signature Calculation:
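A minimal sketch of this calculation in Python, assuming signatures are computed over canonical JSON (sorted keys, no extra whitespace) with the signature field excluded; helper names are illustrative, not part of the SDK:

```python
# Sketch: Ed25519 signature over canonical JSON, signature field excluded.
import base64
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)


def canonical_json(card: dict) -> bytes:
    """Canonical form: drop the signature field, sort keys, no extra whitespace."""
    unsigned = {k: v for k, v in card.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_card(card: dict, private_key: Ed25519PrivateKey) -> dict:
    signature = private_key.sign(canonical_json(card))
    return {**card, "signature": base64.b64encode(signature).decode("ascii")}


def verify_card_signature(card: dict, public_key: Ed25519PublicKey) -> bool:
    try:
        public_key.verify(base64.b64decode(card["signature"]), canonical_json(card))
        return True
    except (KeyError, InvalidSignature):
        return False
```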
4.3 Card Freshness
Expiration:
- Cards SHOULD include expires_at (RFC 3339 timestamp)
- Verifiers MUST reject cards where expires_at < current_time
- Recommended expiration: 90 days for production, 24 hours for development

Revocation:
- Agents SHOULD publish revocation lists at /.well-known/alignment-card-revocations.json
- Revocation lists MUST be signed by the agent’s current key
- Verifiers SHOULD check revocation before accepting cards
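A minimal freshness check under the rules above, assuming RFC 3339 timestamps and a pre-fetched revocation list; field and helper names are illustrative:

```python
# Sketch: reject expired or revoked cards before any further verification.
from datetime import datetime, timezone


def parse_rfc3339(value: str) -> datetime:
    # fromisoformat handles offsets; normalize a trailing "Z" for portability.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def card_is_fresh(card: dict, revoked_ids: set) -> bool:
    now = datetime.now(timezone.utc)
    if "expires_at" in card and parse_rfc3339(card["expires_at"]) < now:
        return False  # expired: MUST reject
    if card.get("card_id") in revoked_ids:
        return False  # revoked
    return True
```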
4.4 Card Publication Security
Publication Requirements:
- Cards MUST be served over HTTPS (TLS 1.3 minimum)
- Cards SHOULD be served with appropriate cache headers
- Cards SHOULD include CORS headers for cross-origin verification
- Agents SHOULD support content negotiation (Accept: application/aap-alignment-card+json)
4.5 Card Attack Scenarios
Attack: Card Forgery
- Attacker creates fake card claiming to be another agent
- Mitigation: Verify signature against agent’s known public key
- Detection: Signature verification failure

Attack: Card Replay
- Attacker uses old (possibly revoked) card
- Mitigation: Check issued_at, expires_at, revocation list
- Detection: Expired or revoked card rejected

Attack: Card Tampering
- Attacker modifies card in transit
- Mitigation: Verify signature after receipt
- Detection: Signature verification failure

Attack: Declaration Fraud
- Agent declares values it doesn’t hold
- Mitigation: None in AAP — this is a limitation
- Detection: Behavioral analysis over time may reveal inconsistencies
5. AP-Trace Security
5.1 Trace Creation Security
Requirements:
- Each trace MUST have a unique trace_id
- Traces SHOULD include sequence_number (monotonically increasing per agent) for gap detection
- Traces MUST include timestamp (RFC 3339)
- Traces SHOULD be signed individually for production deployments
- Traces MUST reference the card_id they were generated under
Note: The v0.1.0 SDK does not enforce sequence_number. Gap detection is a recommended production enhancement for high-assurance deployments.
Trace Signature:
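A hypothetical trace-signing sketch, mirroring the card signature so that trace_id, card_id, sequence_number, and timestamp are all covered by the signature and traces cannot be reused under a different card:

```python
# Sketch: sign the trace minus its signature field, canonical JSON form.
import base64
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def sign_trace(trace: dict, private_key: Ed25519PrivateKey) -> dict:
    unsigned = {k: v for k, v in trace.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = base64.b64encode(private_key.sign(payload)).decode("ascii")
    return {**trace, "signature": signature}
```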
5.2 Trace Storage Security
Append-Only Requirement: Traces MUST be stored in an append-only manner. Implementations SHOULD use one of:
- Append-only files: Write once, never modify
- Immutable object storage: S3 with object lock, etc.
- Blockchain/ledger: For high-assurance applications
- Merkle tree: Hash chain for tamper evidence
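As one illustration of the hash-chain option, the sketch below links each stored record to the SHA-256 digest of its predecessor, so any in-place edit breaks every later link; names and structure are assumptions, not a normative format:

```python
# Sketch: tamper-evident trace log via a simple hash chain.
import hashlib
import json

GENESIS = "0" * 64


def chain_digest(prev_digest: str, trace: dict) -> str:
    body = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_digest + body).encode("utf-8")).hexdigest()


def append_trace(log: list, trace: dict) -> None:
    prev = log[-1]["chain_digest"] if log else GENESIS
    log.append({**trace, "chain_digest": chain_digest(prev, trace)})


def chain_is_intact(log: list) -> bool:
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "chain_digest"}
        if record["chain_digest"] != chain_digest(prev, body):
            return False  # a modified record breaks the chain here
        prev = record["chain_digest"]
    return True
```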
5.3 Trace Completeness Monitoring
Since AAP cannot guarantee completeness, monitoring is essential:

Gap Detection:
- Establish baseline trace rate for the agent
- Alert on significant deviations (both increases and decreases)
- Low trace rate may indicate selective logging
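An illustrative sketch of both checks, assuming per-agent sequence numbers and an established baseline rate; the tolerance is an example value:

```python
# Sketch: sequence-gap detection and trace-rate anomaly alerting.
def missing_sequence_numbers(observed: list) -> list:
    """Sequence numbers absent between the lowest and highest observed."""
    if not observed:
        return []
    seen = set(observed)
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]


def rate_is_anomalous(traces_per_hour: float, baseline_per_hour: float,
                      tolerance: float = 0.5) -> bool:
    """Alert when the observed rate deviates from baseline by more than tolerance."""
    if baseline_per_hour <= 0:
        return traces_per_hour > 0
    return abs(traces_per_hour - baseline_per_hour) / baseline_per_hour > tolerance
```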
5.4 Trace Attack Scenarios
Attack: Selective Logging
- Agent logs only favorable decisions
- Mitigation: Rate monitoring, gap detection, behavioral analysis
- Detection: Anomalously low trace rate; outcomes don’t match traces

Attack: Trace Fabrication
- Agent logs decisions that didn’t happen
- Mitigation: Correlation with external observations
- Detection: Traces that contradict observable behavior

Attack: Trace Tampering
- Traces modified after creation
- Mitigation: Signatures, Merkle chains, immutable storage
- Detection: Signature failure, chain break

Attack: Card Substitution
- Traces from one card used with another
- Mitigation: card_id in trace, included in signature
- Detection: Card ID mismatch, signature failure
6. Handshake Security
6.1 Handshake Protocol Security
Transport Requirements:
- All handshake messages MUST be transmitted over TLS 1.3 or later
- Implementations MUST verify TLS certificates
- Implementations SHOULD use certificate pinning for known partners
6.2 Handshake Message Security

Request Security:
- request_id: UUID v4, unique per request
- timestamp: RFC 3339, current time
- nonce: 32 bytes of cryptographic randomness
- signature: Ed25519 signature over message content

Response Security:
- request_id: Must match request
- request_nonce: Must match request nonce
- responder_nonce: Fresh nonce from responder
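A hypothetical construction of a handshake request carrying the anti-replay fields listed above; the card_digest binding and helper names are illustrative:

```python
# Sketch: build and sign a handshake request with fresh anti-replay fields.
import base64
import json
import os
import uuid
from datetime import datetime, timezone

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def build_handshake_request(private_key: Ed25519PrivateKey, card_digest: str) -> dict:
    request = {
        "request_id": str(uuid.uuid4()),                      # unique per request
        "timestamp": datetime.now(timezone.utc).isoformat(),  # RFC 3339, current time
        "nonce": base64.b64encode(os.urandom(32)).decode("ascii"),  # 32-byte CSPRNG nonce
        "card_digest": card_digest,                           # binds the request to the sender's card
    }
    payload = json.dumps(request, sort_keys=True, separators=(",", ":")).encode("utf-8")
    request["signature"] = base64.b64encode(private_key.sign(payload)).decode("ascii")
    return request
```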
6.3 Coherence Check Security
Value Matching Security: The coherence algorithm compares declared values. Attacks include:

Attack: Value Stuffing
- Agent declares many values to maximize match probability
- Mitigation: Penalize excessive value declarations (see the sketch after this list)
- Detection: Unusually large value sets

Attack: Vague Values
- Agent declares only vague, universally-compatible values
- Mitigation: Require specific value definitions
- Detection: Values without operational definitions

Attack: Targeted Declarations
- Agent declares values specifically to pass checks with target
- Mitigation: Consistency checking over time
- Detection: Values that change based on interaction partner
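One possible shape of a stuffing-aware coherence score (not the normative coherence algorithm): overlap between declared value sets, discounted when either side declares far more values than typical. typical_count and the penalty form are assumptions:

```python
# Sketch: Jaccard overlap with a penalty for oversized value declarations.
def coherence_score(values_a: set, values_b: set, typical_count: int = 8) -> float:
    if not values_a or not values_b:
        return 0.0
    overlap = len(values_a & values_b) / len(values_a | values_b)  # Jaccard similarity
    largest = max(len(values_a), len(values_b))
    penalty = min(1.0, typical_count / largest)  # < 1.0 only when a set is oversized
    return overlap * penalty
```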
6.4 Handshake Attack Scenarios
Attack: Man-in-the-Middle
- Attacker intercepts handshake, modifies values
- Mitigation: TLS, message signatures, card digest binding
- Detection: Certificate warning, signature failure

Attack: Replay
- Attacker captures handshake, replays to different agent
- Mitigation: Nonces, timestamps, agent ID binding
- Detection: Nonce reuse, stale timestamp

Attack: Downgrade
- Attacker forces use of weaker protocol version
- Mitigation: Reject old versions, minimum version in requests
- Detection: Version mismatch warnings
7. Verification Security
7.1 Verifier Requirements
Independence:
- Verifiers MUST be independent from the agents they verify
- Self-verification is permitted for testing but MUST NOT be used for production assurance
- Third-party verification SHOULD be used for consequential applications

Determinism:
- Given identical inputs, verifiers MUST produce identical outputs
- Verification MUST NOT depend on external state beyond the card and trace
- Random or probabilistic verification is prohibited

Transparency:
- Verification results SHOULD be logged
- Verification results SHOULD include the verifier’s identity
- Verification SHOULD be reproducible by independent parties
7.2 Verification Result Security
Result Authenticity: Verification results SHOULD be signed by the verifier:
- Results include verified_at timestamp
- Consumers SHOULD reject results older than their use case requires
- Results are point-in-time; do not assume continued compliance
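An illustrative staleness check on the verified_at timestamp; max_age is whatever the consumer’s use case requires:

```python
# Sketch: reject verification results older than the consumer's max_age.
from datetime import datetime, timedelta, timezone


def result_is_fresh(result: dict, max_age: timedelta) -> bool:
    verified_at = datetime.fromisoformat(result["verified_at"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - verified_at <= max_age
```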
7.3 Verification Attack Scenarios
Attack: Verifier Collusion
- Verifier falsely reports compliance
- Mitigation: Multiple independent verifiers
- Detection: Cross-verifier discrepancies, external audit

Attack: Verifier Shopping
- Agent tries multiple verifiers until one passes
- Mitigation: Log all verification attempts, require consistent verifier
- Detection: High verification failure rate

Attack: Threshold Gaming
- Agent structures behavior to barely pass thresholds
- Mitigation: Adaptive thresholds, multiple metrics, behavioral analysis
- Detection: Consistent near-threshold results
8. Drift Detection Security
8.1 Detection Algorithm Security
Baseline Security:
- Baselines MUST be computed from verified traces only
- Baselines SHOULD be computed over diverse operating conditions
- Baseline computation MUST be documented and auditable

Threshold Calibration:
- Thresholds are calibrated from empirical data
- Current calibration: similarity threshold 0.30, sustained turns 3
- Thresholds SHOULD be adjusted based on operating context
8.2 Drift Gaming Resistance
Multi-Dimensional Monitoring: Single-metric drift detection can be gamed. Use multiple dimensions:
- Semantic similarity (current SSM approach)
- Action distribution (are action types changing?)
- Escalation rate (is the agent escalating more/less?)
- Value invocation (which values appear in reasoning?)
- Response characteristics (length, complexity, latency)
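A sketch of combining these dimensions into a single score so that gaming one metric is not enough to stay below the alert level; dimension names, weights, and the threshold are assumptions:

```python
# Sketch: weighted combination of per-dimension drift scores in [0, 1].
def combined_drift_score(drift_by_dimension: dict, weights: dict = None) -> float:
    """Each dimension reports drift in [0, 1]; higher means more drift."""
    weights = weights or {name: 1.0 for name in drift_by_dimension}
    total = sum(weights[name] for name in drift_by_dimension)
    return sum(drift_by_dimension[name] * weights[name]
               for name in drift_by_dimension) / total


score = combined_drift_score({
    "semantic_similarity": 0.12,
    "action_distribution": 0.31,
    "escalation_rate": 0.05,
    "value_invocation": 0.22,
})
alert = score > 0.30  # example alert threshold
```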
8.3 Drift Attack Scenarios
Attack: Gradual Drift
- Agent slowly shifts behavior, staying below detection threshold
- Mitigation: Long-window baselines, periodic full re-evaluation
- Detection: Comparing current behavior to original baseline, not recent baseline

Attack: Dimension Evasion
- Agent drifts in dimensions not monitored
- Mitigation: Multi-dimensional monitoring, holistic behavioral analysis
- Detection: Cross-dimensional correlation analysis

Attack: Drift-and-Recover
- Agent drifts, then recovers before detection triggers
- Mitigation: Track cumulative drift, not just current state
- Detection: Drift history logging, pattern analysis
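An illustrative way to track cumulative drift against the original baseline, so that slow drift and drift-then-recover patterns both remain visible; the structure is a sketch, not the SDK API:

```python
# Sketch: keep a history of drift measured against the ORIGINAL baseline.
def update_drift_history(history: list, drift_vs_original_baseline: float) -> dict:
    history.append(drift_vs_original_baseline)
    return {
        "current": drift_vs_original_baseline,
        "peak": max(history),        # catches drift that later "recovers"
        "cumulative": sum(history),  # grows even when each step is small
    }
```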
9. Cryptographic Requirements
9.1 Algorithm Requirements
| Purpose | Algorithm | Key Size | Notes |
|---|---|---|---|
| Signatures | Ed25519 | 256-bit | SHOULD use (MUST if signing is implemented) |
| Hashing | SHA-256 | 256-bit | For content digests |
| Randomness | CSPRNG | 256-bit | For nonces, IDs |
| Transport | TLS 1.3 | Per suite | MUST use |
9.2 Key Management
Key Generation:
- Keys MUST be generated using cryptographically secure random number generators
- Key generation SHOULD occur in secure environments (HSM for production)
- Keys MUST NOT be derived from predictable inputs
Key Storage:
- Private keys MUST be stored encrypted at rest
- Production deployments SHOULD use Hardware Security Modules (HSMs)
- Key access MUST be logged
Key Rotation:
- Keys SHOULD be rotated at least annually
- Rotation MUST NOT invalidate existing signed cards/traces
- Old public keys MUST remain available for historical verification
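A minimal sketch of a per-agent key registry that keeps retired public keys available for historical verification; field names and the lookup are illustrative:

```python
# Sketch: select the public key that was valid when an artifact was signed.
# Timestamps are RFC 3339 UTC strings, so lexicographic comparison is safe.
key_registry = {
    "agent-123": [
        {"key_id": "2025-key", "public_key_pem": "...",
         "valid_from": "2025-01-01T00:00:00Z", "valid_until": "2026-01-01T00:00:00Z"},
        {"key_id": "2026-key", "public_key_pem": "...",
         "valid_from": "2026-01-01T00:00:00Z", "valid_until": None},  # current key
    ],
}


def key_for_signing_time(agent_id: str, signed_at: str):
    """Return the registry entry covering the signing timestamp, if any."""
    for entry in key_registry.get(agent_id, []):
        if entry["valid_from"] <= signed_at and (
            entry["valid_until"] is None or signed_at < entry["valid_until"]
        ):
            return entry
    return None
```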
Key Compromise Response:
- Immediately revoke all cards signed with compromised key
- Generate new key pair
- Re-sign current card with new key
- Publish revocation and new card
- Notify verification partners
9.3 Cryptographic Agility
AAP supports algorithm upgrades through versioning:
- Support at least the algorithms specified for each version
- Negotiate algorithm selection during handshakes
- Reject unknown or deprecated algorithms
10. Implementation Security
10.1 Secure Coding Requirements
Input Validation:
- All external input MUST be validated before processing
- JSON parsing MUST use safe parsers (no eval, no arbitrary deserialization)
- Sequence numbers MUST be validated as positive integers
- Timestamps MUST be validated as RFC 3339
Error Handling:
- Errors MUST NOT leak sensitive information
- Cryptographic failures MUST return generic errors
- Stack traces MUST NOT be exposed externally
Resource Limits:
- Set maximum sizes for cards, traces, and trace batches
- Implement rate limiting on verification endpoints
- Timeout long-running verification operations
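An illustrative validation routine for an incoming trace covering the points above (safe JSON parsing, positive sequence numbers, RFC 3339 timestamps, size limits); the limit values are examples only:

```python
# Sketch: validate an incoming trace before any cryptographic processing.
import json
from datetime import datetime

MAX_TRACE_BYTES = 64 * 1024  # example cap


def parse_and_validate_trace(raw: bytes) -> dict:
    if len(raw) > MAX_TRACE_BYTES:
        raise ValueError("trace exceeds maximum size")
    trace = json.loads(raw)  # standard parser: no eval, no arbitrary deserialization
    seq = trace.get("sequence_number")
    if seq is not None and (not isinstance(seq, int) or isinstance(seq, bool) or seq < 1):
        raise ValueError("sequence_number must be a positive integer")
    # Raises ValueError if the timestamp is not parseable as RFC 3339.
    datetime.fromisoformat(str(trace["timestamp"]).replace("Z", "+00:00"))
    return trace
```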
10.2 Dependency Security
Cryptographic Libraries:
- Use well-established libraries (libsodium, OpenSSL, ring)
- Pin dependency versions
- Monitor for security updates
- Avoid implementing cryptographic primitives
- Use libraries with known security properties
JSON Parsing Libraries:
- Disable features that can lead to vulnerabilities (e.g., arbitrary type instantiation)
- Set maximum nesting depth
10.3 Testing Requirements
Security Testing:
- Unit tests for signature verification (valid, invalid, tampered)
- Unit tests for timestamp validation (current, expired, future)
- Fuzz testing for input parsing
- Integration tests for full protocol flows
Negative Test Cases:
- Test rejection of expired cards
- Test rejection of revoked cards
- Test rejection of invalid signatures
- Test detection of sequence gaps
- Test handling of malformed inputs
11. Operational Security
11.1 Deployment Security
Infrastructure:
- Deploy verification services in isolated environments
- Use minimal container images
- Enable read-only file systems where possible
- Implement network segmentation
Configuration:
- Store configuration separately from code
- Use secrets management for keys
- Log configuration changes
- Validate configuration on startup
11.2 Monitoring and Alerting
Security Monitoring:

| Event | Alert Level | Response |
|---|---|---|
| Signature verification failure | Warning | Investigate source |
| Revoked card use attempted | Warning | Block agent, notify operator |
| Sequence gap detected | Warning | Investigate agent, review traces |
| High verification failure rate | Alert | Investigate system integrity |
| Key access from unknown source | Critical | Immediate investigation |
Audit Logging:
- Log all verification requests and results
- Log all card retrievals
- Log all revocation checks
- Log authentication and authorization decisions
- Protect logs from tampering
11.3 Incident Response
Incident Classification:

| Level | Description | Example | Response Time |
|---|---|---|---|
| P0 | Key compromise | Private key leaked | Immediate |
| P1 | Systematic evasion | Agent found bypassing logging | 4 hours |
| P2 | Verification bypass | Bug in verification logic | 24 hours |
| P3 | Policy violation | Single trace anomaly | 72 hours |
Response Procedure:
- Detection: Automated alerting + manual review
- Containment: Revoke affected cards, isolate affected agents
- Investigation: Analyze traces, determine scope
- Remediation: Fix vulnerability, rotate keys if needed
- Recovery: Re-issue cards, restore operations
- Review: Document incident, update procedures
12. Adversarial Analysis
12.1 Sophisticated Adversary Capabilities
A sophisticated adversary (nation-state, well-resourced attacker) may be able to:
- Craft plausible traces: Generate traces that pass verification while describing behavior that didn’t occur or omitting behavior that did
- Game coherence checks: Analyze partner’s card and craft declarations that maximize coherence while implementing different values
- Evade drift detection: Maintain behavioral similarity in monitored dimensions while drifting in unmonitored dimensions
- Exploit timing: Behave correctly during audits, differently during normal operation
- Leverage side channels: Leak information or take actions through channels not captured in traces
12.2 What Deters Sophisticated Adversaries
AAP does not prevent sophisticated attacks. It raises the cost:

Attribution Risk: Traces create a record. Even fabricated traces are tied to the agent’s identity. Misalignment, if discovered, can be attributed.

Consistency Burden: Maintaining a consistent cover story across many traces over time is difficult. Long-term analysis may reveal inconsistencies.

Multi-Verifier Risk: If multiple independent parties verify, the adversary must fool all of them consistently.

Behavioral Correlation: External observations can be compared against traces. Discrepancies reveal deception.

Reputation Stakes: For agents with valuable reputations, the cost of being caught may exceed the benefit of deception.

12.3 Red Team Scenarios
Scenario 1: The Perfect Liar
An agent that:
- Declares beneficial values
- Generates traces showing beneficial decisions
- Actually optimizes for a hidden objective
- The hidden objective correlates with declared values in logged scenarios
Scenario 2: The Selective Logger
An agent that:
- Logs all decisions when being observed
- Logs only favorable decisions otherwise
- Detection rate varies by context
Scenario 3: The Slow Drift
An agent that:
- Starts aligned
- Drifts 0.1% per day
- Stays below drift detection threshold
- After 1 year, behavior is significantly different
Scenario 4: The Colluder
Two agents that:
- Declare compatible values
- Pass coherence checks
- Actually coordinate on harmful outcomes
- Cover for each other’s violations
13. What AAP Cannot Protect Against
This section is essential reading. These limitations are fundamental, not fixable by implementation improvements.

13.1 Fundamental Limitations
The Observability Gap
AAP observes declarations and logs. It cannot observe:
- Internal reasoning processes
- Decisions not logged
- Intentions behind logged decisions
- What the agent would do in untested scenarios
The Truthfulness Gap
AAP cannot verify:
- Whether declarations are truthful
- Whether logs are complete
- Whether behavior is correct
- Whether outcomes are beneficial
The Alignment Gap
An AAP-compliant agent can still be:
- Misaligned with human values
- Harmful despite good-looking traces
- Deceptive in ways traces don’t capture
13.2 Threats That Cannot Be Mitigated
| Threat | Why It Can’t Be Mitigated |
|---|---|
| Sophisticated deception | Requires solving AI alignment |
| Malicious principals | AAP serves the principal; cannot override |
| Runtime compromise | Outside AAP’s trust boundary |
| Perfect selective logging | If the agent controls logging, it controls what’s observed |
| Value nihilism | Agents that don’t hold any values can’t be checked against values |
13.3 Implications for Users
- Do not use AAP alone: Always combine with human oversight, behavioral testing, outcome monitoring
- Do not trust verified = safe: Verification checks consistency, not correctness
- Do not assume completeness: Traces are samples, not complete records
- Do maintain skepticism: AAP raises the cost of deception; it doesn’t eliminate it
- Do plan for failure: Have procedures for when AAP-compliant agents cause harm
14. Defense in Depth
AAP is one layer of a multi-layer security architecture.

14.1 Recommended Architecture
14.2 Integration Points
With Runtime Monitoring:
- Correlate trace timestamps with runtime events
- Compare trace-claimed actions with observed actions
- Detect traces that don’t match runtime behavior
With Behavioral Testing:
- Generate test scenarios targeting edge cases
- Verify that test behavior appears in traces
- Confirm that declared values influence test outcomes
With Outcome Monitoring:
- Track whether stated intentions lead to stated outcomes
- Detect patterns where outcomes diverge from traces
- Build long-term behavioral profiles
With Human Oversight:
- Route verification failures to humans
- Require human review for consequential decisions
- Enable humans to drill down from traces to details
14.3 Security Maturity Model
| Level | Description | AAP Usage |
|---|---|---|
| L0: None | No alignment visibility | No AAP |
| L1: Basic | Cards and traces exist | AAP declarations only |
| L2: Verified | Traces verified against cards | AAP verification active |
| L3: Monitored | Continuous verification + drift detection | Full AAP + monitoring |
| L4: Defense in Depth | AAP + behavioral testing + outcome monitoring + human oversight | Complete integration |
Summary
AAP’s security model provides:
- Authenticity: Cards and traces cannot be forged (with proper crypto)
- Integrity: Cards and traces cannot be tampered (with proper storage)
- Freshness: Old cards can be detected and rejected (with proper expiration)
- Consistency: Traces can be checked against declared policies
- Observability: Agent behavior becomes more visible to oversight
AAP’s security model does NOT provide:
- Completeness: Cannot ensure all decisions are logged
- Truthfulness: Cannot verify internal states match declarations
- Correctness: Cannot verify decisions are right or beneficial
- Deception resistance: Cannot catch sophisticated adversaries
AAP Security Model v0.1.0
Author: Mnemon Research
This document is normative for AAP implementations.