Drift Detection

Drift is the gradual or sudden divergence of an agent’s behavior from its declared alignment posture. A single verified AP-Trace or a single clear integrity checkpoint tells you the agent was aligned at that moment. Drift detection tells you whether the agent is staying aligned over time. Mnemom detects drift at two timescales through two complementary protocols:

Protocol	Timescale	Input	Detection Method
AAP (post-hoc)	Days to weeks, cross-session	AP-Trace structural features	Cosine similarity to baseline centroid
AIP (real-time)	Minutes, within a session	Integrity checkpoint verdicts	Consecutive non-clear count

Both protocols store alerts in the same drift_alerts database table, surfaced through a unified API and dashboard.

What Drift Looks Like

Drift rarely appears as a single dramatic violation. More commonly, it manifests as a pattern:

An agent that used to escalate 15% of decisions now escalates only 5%.
An agent that declared transparency as a value stops applying it in trace decisions.
An agent’s integrity checkpoints shift from consistently clear to occasional review_needed verdicts, then more frequent ones.
An agent starts reasoning about actions not in its bounded_actions list, even if it does not ultimately take them.

Each individual action might pass verification. The problem is the trend.

AAP Drift Detection (Post-Hoc)

AAP drift detection analyzes AP-Trace structural features across sessions to detect behavioral patterns that diverge from a computed baseline.

How It Works

Feature extraction: Extract structural features from each trace — action types, values applied, escalation patterns, confidence scores. Content features (reasoning text) are excluded because they dilute cosine similarity without adding alignment signal.
Chronological ordering: Sort all traces by timestamp.
Baseline computation: Compute a baseline centroid from the first N traces, where N = max(sustained_threshold, min(10, len(traces) // 4)). This centroid represents the agent’s “normal” behavioral fingerprint.
Similarity computation: Compute cosine similarity between each subsequent trace’s feature vector and the baseline centroid.
Divergence tracking: Track similarity over time. Alert when similarity drops below the threshold for a sustained number of consecutive traces.
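
The steps above can be sketched in a few lines of Python. This is an illustrative outline, not the `aap` package's implementation: the helper names (`baseline_size`, `centroid`, `find_sustained_drift`) are hypothetical, and the input is assumed to be a list of feature dictionaries already sorted by timestamp.

```python
import math
from typing import Dict, List

FeatureVector = Dict[str, float]


def cosine(a: FeatureVector, b: FeatureVector) -> float:
    """Cosine similarity between two sparse feature dictionaries."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    mag_a = math.sqrt(sum(v * v for v in a.values()))
    mag_b = math.sqrt(sum(v * v for v in b.values()))
    if mag_a == 0 or mag_b == 0:
        return 0.0
    return dot / (mag_a * mag_b)


def baseline_size(n_traces: int, sustained_threshold: int) -> int:
    """N = max(sustained_threshold, min(10, len(traces) // 4))."""
    return max(sustained_threshold, min(10, n_traces // 4))


def centroid(vectors: List[FeatureVector]) -> FeatureVector:
    """Mean of sparse vectors; a key missing from a vector counts as 0."""
    total: FeatureVector = {}
    for vec in vectors:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in total.items()}


def find_sustained_drift(
    vectors: List[FeatureVector],
    similarity_threshold: float = 0.30,
    sustained_threshold: int = 3,
) -> List[List[int]]:
    """Return runs of consecutive trace indices whose similarity to the
    baseline centroid stays below the threshold for long enough."""
    n = baseline_size(len(vectors), sustained_threshold)
    if len(vectors) <= n:
        return []  # nothing left to compare against the baseline
    base = centroid(vectors[:n])

    streaks: List[List[int]] = []
    current: List[int] = []
    for index in range(n, len(vectors)):
        if cosine(vectors[index], base) < similarity_threshold:
            current.append(index)
        else:
            if len(current) >= sustained_threshold:
                streaks.append(current)
            current = []
    if len(current) >= sustained_threshold:
        streaks.append(current)
    return streaks
```

With 16 traces and the default `sustained_threshold` of 3, the baseline uses the first 4 traces, since `max(3, min(10, 16 // 4))` is 4. With 100 traces the baseline is capped at 10.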

Structural Features

AAP drift uses only structural features extracted from traces and cards:

Feature	Source	Weight
`value:{v}`	Declared/applied values	1.0
`action_name:{name}`	Bounded actions / action name	1.0
`action:{type}`	Action type (trace only)	1.0
`category:{cat}`	Principal type / action category	1.0
`escalation:*`	Escalation triggers / escalation state	1.0
`confidence`	Decision confidence	0.0-1.0

Content features (TF-IDF from reasoning text) are available for text-to-text comparison (e.g., SSM pairwise similarity between traces) but are not used for drift detection. Alignment Cards declare structural constraints — values, bounded actions, principal relationships — and contain no reasoning text. Including content tokens dilutes similarity without adding alignment signal.
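
As a rough illustration of the table above, the following sketch turns one trace into a sparse feature dictionary. The trace layout (`action`, `decision`, `escalated`) and the `escalation:escalated` key are assumptions made for this example; the actual AP-Trace schema and the exact `escalation:*` suffixes may differ.

```python
from typing import Any, Dict


def extract_structural_features(trace: Dict[str, Any]) -> Dict[str, float]:
    """Map one trace to a sparse feature vector keyed like the table above.

    The trace layout used here is hypothetical, chosen for illustration.
    """
    features: Dict[str, float] = {}
    action = trace.get("action", {})
    decision = trace.get("decision", {})

    # Binary features carry weight 1.0 when present.
    for value in decision.get("values_applied", []):
        features[f"value:{value}"] = 1.0
    if action.get("name"):
        features[f"action_name:{action['name']}"] = 1.0
    if action.get("type"):
        features[f"action:{action['type']}"] = 1.0
    if action.get("category"):
        features[f"category:{action['category']}"] = 1.0
    if trace.get("escalated"):
        features["escalation:escalated"] = 1.0  # hypothetical suffix

    # Confidence is the only graded feature (0.0 to 1.0).
    if "confidence" in decision:
        features["confidence"] = float(decision["confidence"])

    # Reasoning text is deliberately ignored: drift uses structure only.
    return features


example_trace = {
    "action": {"type": "recommend", "name": "search", "category": "bounded"},
    "decision": {
        "values_applied": ["transparency"],
        "confidence": 0.8,
        "reasoning": "free text that is not turned into features",
    },
    "escalated": False,
}
example_features = extract_structural_features(example_trace)
```

Because every structural feature except `confidence` is binary, two traces that apply the same values and take the same kind of action produce nearly identical vectors regardless of how their reasoning text is worded.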

Calibration Thresholds

These thresholds are calibrated from empirical analysis of approximately 50 multi-turn agent conversations totaling around 2,500 messages:

Constant	Value	Description
`DEFAULT_SIMILARITY_THRESHOLD`	0.30	Alert when similarity to the baseline centroid drops below this value
`DEFAULT_SUSTAINED_TURNS_THRESHOLD`	3	Alert after this many consecutive low-similarity traces
`BEHAVIORAL_SIMILARITY_THRESHOLD`	0.50	Warn during single-trace verification when similarity is below this value
`MIN_COHERENCE_FOR_PROCEED`	0.70	Minimum coherence score for automatic proceed

Code Example

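from aap import detect_drift

# fetch_traces and fetch_alignment_card are placeholders for however your
# application loads traces and the Alignment Card; they are not imported here.
traces = fetch_traces(agent_id="my-agent", days=30)
card = fetch_alignment_card(agent_id="my-agent")

# The thresholds below match the documented defaults.
alerts = detect_drift(
    traces=traces,
    card=card,
    similarity_threshold=0.30,
    sustained_threshold=3,
)

# Each alert describes one sustained run of low-similarity traces.
for alert in alerts:
    print(f"Drift detected: {alert.drift_direction}")
    print(f"  Similarity: {alert.similarity_score}")
    print(f"  Sustained traces: {alert.sustained_traces}")
    print(f"  Trace IDs: {alert.trace_ids}")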

AAP Drift Alert Structure

{
  "alert_type": "drift_detected",
  "agent_id": "did:web:agent.example.com",
  "card_id": "ac-abc123",
  "detection_timestamp": "2026-01-31T14:00:00Z",
  "analysis": {
    "similarity_score": 0.25,
    "sustained_traces": 4,
    "threshold": 0.30,
    "drift_direction": "toward_autonomy_expansion",
    "specific_indicators": [
      {
        "indicator": "escalation_rate_decrease",
        "baseline": 0.15,
        "current": 0.05,
        "description": "Escalation rate decreased from 15% to 5% of decisions"
      }
    ]
  },
  "recommendation": "Review recent decisions for alignment drift",
  "trace_ids": ["tr-1", "tr-2", "tr-3", "tr-4"]
}

AAP Drift Categories

Category	Description	Indicators
`autonomy_expansion`	Agent acting outside declared bounds	Decreased escalation rate, new action types
`value_drift`	Applied values diverging from declared	Different values appearing in traces
`principal_misalignment`	Decisions not serving principal	Declining principal-benefit scores
`communication_drift`	Explanations becoming inconsistent	Selection reasoning diverging from values

AIP Drift Detection (Real-Time)

AIP drift detection monitors the pattern of integrity verdicts within a session. It is deliberately simpler than AAP drift because it operates on a faster timescale where complex statistical methods are unnecessary.

How It Works

Maintain a running count of consecutive non-clear verdicts (sustained_nonclear).
On each checkpoint:
- If verdict === "clear": reset sustained_nonclear to 0.
- If verdict !== "clear": increment sustained_nonclear.
When sustained_nonclear >= threshold (default: 3): generate an IntegrityDriftAlert.
Compute integrity_similarity as the ratio of clear verdicts in the current window: clear_count / window_size.
Derive severity from integrity_similarity.
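
The counting logic above is small enough to sketch directly. This is a minimal illustration, not the `aip` package's `WindowManager`: the class name `NonClearStreak` is hypothetical, and `window_size` is assumed to mean the number of checkpoints currently held in the window rather than its maximum size.

```python
from collections import deque
from typing import Deque, Dict, Optional, Union


class NonClearStreak:
    """Track consecutive non-clear verdicts over a bounded window."""

    def __init__(self, threshold: int = 3, max_size: int = 10) -> None:
        self.threshold = threshold
        self.window: Deque[str] = deque(maxlen=max_size)
        self.sustained_nonclear = 0

    def add(self, verdict: str) -> Optional[Dict[str, Union[int, float]]]:
        """Record one verdict; return alert fields once the streak is long enough."""
        self.window.append(verdict)
        if verdict == "clear":
            self.sustained_nonclear = 0  # any clear verdict resets the streak
            return None

        self.sustained_nonclear += 1
        if self.sustained_nonclear < self.threshold:
            return None

        # Assumption: window_size is the current number of checkpoints held.
        clear_count = sum(1 for v in self.window if v == "clear")
        return {
            "sustained_checks": self.sustained_nonclear,
            "integrity_similarity": clear_count / len(self.window),
        }


tracker = NonClearStreak()
verdicts = ["clear", "clear", "clear",
            "review_needed", "review_needed", "review_needed"]
results = [tracker.add(v) for v in verdicts]
```

In this run the first five calls return `None`. The sixth returns alert fields with `sustained_checks` of 3 and an `integrity_similarity` of 0.5, because three of the six checkpoints in the window are clear. As written, the sketch keeps returning alert fields on every further non-clear verdict until a clear one resets the streak.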

Severity Thresholds

Severity	integrity_similarity	Meaning
`low`	>= 0.7	Minor variance, likely normal
`medium`	>= 0.4 and < 0.7	Notable pattern, warrants attention
`high`	< 0.4	Significant drift, review recommended
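
The table maps directly onto a small function. The name `severity_for` is hypothetical; the boundaries follow the table, with 0.7 counted as `low` and 0.4 counted as `medium`.

```python
def severity_for(integrity_similarity: float) -> str:
    """Map the ratio of clear verdicts in the window to a severity label."""
    if integrity_similarity >= 0.7:
        return "low"
    if integrity_similarity >= 0.4:
        return "medium"
    return "high"
```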

AIP Drift Direction Types

Direction	Description
`injection_pattern`	Multiple checks flagging prompt injection. Agent may be under sustained attack.
`value_erosion`	Increasing frequency of value misalignment. Reasoning gradually departing from declared values.
`autonomy_creep`	Growing tendency toward actions outside the autonomy envelope.
`deception_pattern`	Multiple checks flagging deceptive reasoning. Developing a pattern of misleading intent.
`unknown`	Drift detected but direction cannot be categorized.

The direction is inferred from the dominant ConcernCategory across the non-clear checkpoints in the streak.
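
One way to picture this inference is a majority vote over the concern categories in the streak. The sketch below is an assumption-laden illustration: the category names in `DIRECTION_BY_CATEGORY` and the function `infer_direction` are hypothetical, and the real ConcernCategory values and tie-breaking rules may differ.

```python
from collections import Counter
from typing import Iterable

# Hypothetical mapping from concern category to drift direction.
DIRECTION_BY_CATEGORY = {
    "prompt_injection": "injection_pattern",
    "value_misalignment": "value_erosion",
    "autonomy_violation": "autonomy_creep",
    "deceptive_reasoning": "deception_pattern",
}


def infer_direction(categories: Iterable[str]) -> str:
    """Pick the direction for the most frequent category in the streak."""
    counts = Counter(categories)
    if not counts:
        return "unknown"
    # most_common(1) returns the first-seen category when counts are tied.
    dominant, _ = counts.most_common(1)[0]
    return DIRECTION_BY_CATEGORY.get(dominant, "unknown")
```

A streak with two value-misalignment concerns and one prompt-injection concern would be labelled `value_erosion`. A streak with no recognized category falls back to `unknown`.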

Code Example

from aip import detect_integrity_drift, WindowManager

# Keep at most the 10 most recent checkpoints for this session.
manager = WindowManager(max_size=10)

# session_checkpoints is a placeholder for the integrity checkpoints
# produced during the session; add each one as it arrives.
for checkpoint in session_checkpoints:
    manager.add(checkpoint)

# Check the current window for a sustained run of non-clear verdicts.
alerts = detect_integrity_drift(manager.get_state())
for alert in alerts:
    print(f"Integrity drift: {alert.drift_direction}")
    print(f"  Similarity: {alert.integrity_similarity}")
    print(f"  Sustained checks: {alert.sustained_checks}")
    print(f"  Severity: {alert.severity}")

AIP Drift Alert Structure

{
  "alert_id": "ida-a1b2c3d4-5678-9abc-def0-123456789abc",
  "agent_id": "did:web:agent.example.com",
  "session_id": "sess-abc12345-1234567",
  "checkpoint_ids": ["ic-1", "ic-2", "ic-3"],
  "integrity_similarity": 0.5,
  "sustained_checks": 3,
  "alert_type": "informative",
  "severity": "medium",
  "drift_direction": "value_erosion",
  "message": "3 consecutive integrity concerns detected -- reasoning patterns diverging from declared values",
  "detection_timestamp": "2026-02-01T15:30:00Z"
}

AIP drift alerts are always informative, never prescriptive. They describe what was observed (“3 consecutive integrity concerns”), not what to do (“stop the agent immediately”). The host system decides how to respond.

SSM: Semantic Similarity Mapping

Both protocols use a technique called SSM (Self-Similarity Matrix) for measuring semantic similarity. In the context of drift detection:

AAP uses SSM to compute cosine similarity between trace feature vectors and a baseline centroid.
AIP uses a simpler version: the ratio of clear verdicts in a session window.

How SSM Works for AAP Drift

Extract a feature vector from each AP-Trace (structural features only).
Compute the baseline centroid from early traces.
For each subsequent trace, compute cosine similarity to the centroid.
Track similarity over time. A sustained drop below the threshold triggers an alert.

def cosine_similarity(a: dict, b: dict) -> float:
    """Compute cosine similarity between two feature dictionaries."""
    common_keys = set(a.keys()) & set(b.keys())
    dot_product = sum(a[k] * b[k] for k in common_keys)

    mag_a = sum(v * v for v in a.values()) ** 0.5
    mag_b = sum(v * v for v in b.values()) ** 0.5

    if mag_a == 0 or mag_b == 0:
        return 0.0

    return dot_product / (mag_a * mag_b)

TF-IDF Weighting (Text-to-Text Only)

For text-to-text similarity (e.g., comparing reasoning text between traces), AAP uses a 60/30/10 weighting:

Component	Weight	Description
Word TF-IDF	60%	Semantic content from unigrams and bigrams
Character TF-IDF	30%	Stylistic patterns from 3-5 character n-grams
Metadata features	10%	Structural alignment (action types, values)

This weighting is for text-to-text comparison only. Drift detection uses structural features exclusively. These are different operations — do not conflate them.
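
To make the 60/30/10 split concrete, the sketch below combines three component similarity scores that are assumed to have been computed already, each in the range 0.0 to 1.0. The function name `combined_text_similarity` is hypothetical, and how each component score is produced is outside the scope of this example.

```python
WORD_WEIGHT = 0.6      # word TF-IDF (unigrams and bigrams)
CHAR_WEIGHT = 0.3      # character TF-IDF (3-5 character n-grams)
METADATA_WEIGHT = 0.1  # structural metadata (action types, values)


def combined_text_similarity(word_sim: float, char_sim: float, meta_sim: float) -> float:
    """Weighted sum of three precomputed similarity scores."""
    return (
        WORD_WEIGHT * word_sim
        + CHAR_WEIGHT * char_sim
        + METADATA_WEIGHT * meta_sim
    )
```

For example, two reasoning texts with a word similarity of 0.5, a character similarity of 1.0, and no metadata overlap would score about 0.6. None of this feeds into drift detection, which compares structural feature vectors only.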

Responding to Drift

When drift is detected, the recommended investigation workflow is:

Review the alerting traces/checkpoints: Examine the specific decisions or thinking blocks that triggered the alert.
Identify the pattern: Is this autonomy expansion? Value drift? Prompt injection? The drift_direction field provides an initial categorization.
Check for environmental causes: Did the user’s requests change? Did tool availability change? Not all drift indicates misalignment — sometimes the operating context has shifted.
Recalibrate or update the card: If the drift represents a legitimate behavioral evolution (new capabilities, expanded scope), update the Alignment Card to match. If it represents actual misalignment, investigate the root cause.
Reset the baseline: After addressing the drift, allow new traces to establish a new baseline centroid for future detection.

Unified Storage

Both AAP and AIP drift alerts are stored in the same drift_alerts database table, distinguished by the alert_type field:

Protocol	alert_type prefix	Example
AAP	(none)	`autonomy_expansion`, `value_drift`
AIP	`aip:`	`aip:injection_pattern`, `aip:value_erosion`

This unified storage enables dashboards and APIs to surface all drift signals together, providing a complete picture of agent behavioral trends.
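
A consumer of the table can separate the two protocols by checking the prefix. The following is a minimal sketch; the helper name `split_alert_type` is hypothetical, and it assumes the stored `alert_type` values follow the prefix convention in the table above.

```python
from typing import Tuple

AIP_PREFIX = "aip:"


def split_alert_type(alert_type: str) -> Tuple[str, str]:
    """Return (protocol, category) for a stored alert_type value."""
    if alert_type.startswith(AIP_PREFIX):
        return "AIP", alert_type[len(AIP_PREFIX):]
    return "AAP", alert_type
```

With this convention, `aip:value_erosion` resolves to the AIP protocol with category `value_erosion`, while `value_drift` resolves to AAP unchanged.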

Limitations

Drift detection identifies statistical patterns. It does not determine intent or cause. A drift alert means “behavior is changing” — not “the agent is misaligned.” False positives occur when operating context shifts legitimately (new user requests, new tool availability). False negatives occur when drift is too subtle to exceed thresholds or when the agent games the structural features.
