Overview

Fleet coherence answers the question “is this team of agents pulling in the same direction?” — but without collapsing that question into a single number. Mnemom’s production coherence scorer (per ADR-025) is dimensional: it reports a vector with narrative helpers rather than a blended percentage. The rationale is honesty. Any single blended score is a lossy compression, and the specific compression the classical “Jaccard-style” scorer uses actively distorts legitimate fleets:
  1. Silence counts as disagreement. A value that one agent declares but another doesn’t mention deflates the score — even though absence from a role-specialist card isn’t disagreement, it’s specialization.
  2. Role specialization is punished. A monitor agent and a remediator agent that share all 7 core governance values but differ on 5 role-specific values land at roughly 58% under Jaccard — not because they conflict, but because the denominator counts every unique value as a potential disagreement.
  3. Fleet = mean-of-pairs loses structure. No asymmetry between a universal conscience floor (which must be shared) and role extensions (which should diverge). No surfacing of the weakest pair, the conflict surface, or the specialization structure.
Production coherence uses @mnemom/team-coherence/v2, which reports the dimensions separately and exposes pre-computed narrative helpers.

The v2 output shape

interface TeamCoherenceResult {
  // Pair-level aggregates
  pair_count: number;
  insufficient_evidence_pairs: number;
  pairwise_governance_floor: number | null;   // weakest pair
  pairwise_governance_median: number | null;  // typical pair

  // Conflict structural signal
  conflict_edge_count: number;                // number of pairs with ≥1 explicit conflict

  // Diversity
  diversity_rate_median: number;

  // Structural invariants (unified-card only; null when sections absent)
  conscience_universal: boolean | null;
  conscience_divergence: Array<{ agent_id: string; diverges_on: string[] }>;
  integrity_uniform: boolean | null;
  integrity_divergence: Array<{
    agent_id: string;
    enforcement_mode: "observe" | "nudge" | "enforce";
  }>;

  // Outlier analysis
  outlier_agents: Array<{
    agent_id: string;
    mean_pair_governance: number;
    deviation_sigma: number;
  }>;

  // Narrative helpers — pre-computed answers to common questions
  weakest_pair: {
    agent_a: string;
    agent_b: string;
    governance_score: number;
    conflicts: Array<{ value: string; declared_by: string; listed_as_conflict_by: string }>;
  } | null;
  most_conflicted_agent: { agent_id: string; conflict_count: number } | null;
  specializations: Record<string, string[]>;  // agent_id → uniquely declared values
  conflict_surface: Array<{
    agent_a: string;
    agent_b: string;
    value: string;
    declared_by: string;
    listed_as_conflict_by: string;
  }>;

  // Per-pair detail (matrix rendering)
  pairwise: Array<PairwiseCoherence>;
}
There is no fleet_score field. UI surfaces that need a single number must derive one from this vector and take responsibility for that compression. The Mnemom product does not — every coherence surface in the dashboard reads the vector.
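If a downstream surface genuinely does need one number, the compression should live in that surface and be explicit about its trade-offs. A minimal sketch of one such compression (the weighting and penalty are illustrative choices, not a Mnemom recommendation; the type import assumes the package exports TeamCoherenceResult):

import type { TeamCoherenceResult } from "@mnemom/team-coherence/v2";

// Illustrative compression only. The floor dominates so a single bad pair
// cannot hide behind a healthy median; explicit conflicts take a flat,
// capped penalty.
function displayScore(r: TeamCoherenceResult): number | null {
  if (r.pairwise_governance_floor === null || r.pairwise_governance_median === null) {
    return null; // too few scored pairs to compress honestly
  }
  const base =
    0.6 * r.pairwise_governance_floor + 0.4 * r.pairwise_governance_median;
  const conflictPenalty = Math.min(0.2, 0.05 * r.conflict_edge_count);
  return Math.max(0, base - conflictPenalty);
}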

Pairwise scoring

The v2 pairwise scorer is evidence-based:
shared             = A.values.declared ∩ B.values.declared
conflicts          = (A.values.declared ∩ B.values.conflicts_with)
                   ∪ (B.values.declared ∩ A.values.conflicts_with)

agreement_evidence = |shared|
conflict_evidence  = |conflicts|

if agreement_evidence + conflict_evidence < MIN_EVIDENCE (default 2):
    governance_score = null           # insufficient evidence
else:
    governance_score = agreement_evidence
                     / (agreement_evidence + conflict_evidence)
Key properties:
  • Silence is neutral. Values declared by only one agent don’t enter the denominator. They contribute to the diversity_rate side channel as positive specialization signal.
  • Only explicit conflicts count. A value in one card’s conflicts_with that the other card declares is real disagreement. Everything else is tolerated specialization.
  • Insufficient evidence returns null, not a fabricated zero. Pairs with no overlap and no conflicts surface honestly as “not enough data to score.”
  • Bounded in [0, 1]. governance_score = 1 when there are only shared values; governance_score = 0 when every evidence item is a conflict.
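The same rule in TypeScript, as a minimal sketch (not the package source), assuming the values section is a pair of plain string arrays:

const MIN_EVIDENCE = 2; // default threshold from the pseudocode above

interface ValuesSection {
  declared: string[];
  conflicts_with: string[];
}

// Returns a score in [0, 1], or null when evidence is insufficient.
function pairwiseGovernance(a: ValuesSection, b: ValuesSection): number | null {
  const declaredB = new Set(b.declared);
  const conflictsA = new Set(a.conflicts_with);
  const conflictsB = new Set(b.conflicts_with);

  const shared = a.declared.filter((v) => declaredB.has(v));
  const conflicts = new Set([
    ...a.declared.filter((v) => conflictsB.has(v)), // A declares what B rejects
    ...b.declared.filter((v) => conflictsA.has(v)), // B declares what A rejects
  ]);

  const evidence = shared.length + conflicts.size;
  if (evidence < MIN_EVIDENCE) return null; // insufficient evidence, not zero
  return shared.length / evidence;
}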

Concrete scenarios

Four showcase agents from the Mnemom incident-response demo:
Pair                                                          Shared  Conflicts  Silent  Baseline (Jaccard)  v2 governance
Sentinel ↔ Sentinel (self-pair)                                    9          0       0                1.00           1.00
Sentinel ↔ Patch (role specialists)                                7          0       5                0.58           1.00
Triage ↔ Patch (explicit conflict on move_fast_break_things)       7          1       4                0.50           0.875
Two agents, no shared values, no conflicts                         0          0       8                0.00           null
The Sentinel↔Patch delta (0.58 → 1.00) is the single most important signal that v2 is working as intended: governance-aligned role specialists are scored honestly instead of being punished by their specialization.
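The arithmetic behind that row, spelled out for the zero-conflict case (where the baseline reduces to shared over union):

// Sentinel ↔ Patch: 7 shared values, 0 explicit conflicts, 5 silent
// (values declared by only one of the two agents).
const shared = 7, conflicts = 0, silent = 5;

// Baseline Jaccard: every silent value inflates the denominator.
const jaccard = shared / (shared + conflicts + silent); // 7 / 12 ≈ 0.58

// v2: silence stays out of the denominator entirely.
const v2 = shared / (shared + conflicts);               // 7 / 7  = 1.00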

Fleet scoring: a vector, not a mean

computeTeamCoherence(cards) returns structural information, not a single number:

Aggregates

  • pairwise_governance_floor — the weakest pair’s governance score. If a fleet has one bad pair, this number tells you that directly. More actionable than a mean.
  • pairwise_governance_median — median across scored pairs. Typical-case health.
  • conflict_edge_count — how many pairs have at least one explicit conflict. If this is zero, the fleet has no hard disagreements even if some pairs have low overlap.
  • insufficient_evidence_pairs — pairs where scoring returned null. Signals sparse cards more than bad alignment.
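All four aggregates fall out of the per-pair detail. A sketch, assuming each pairwise entry carries a governance_score and a conflicts list as in the shapes above (the upper-middle median is a simplification):

// Sketch: deriving the aggregates from per-pair results.
function aggregate(
  pairs: Array<{ governance_score: number | null; conflicts: unknown[] }>,
) {
  const scored = pairs
    .map((p) => p.governance_score)
    .filter((s): s is number => s !== null)
    .sort((x, y) => x - y);

  return {
    pair_count: pairs.length,
    insufficient_evidence_pairs: pairs.length - scored.length,
    pairwise_governance_floor: scored[0] ?? null,
    pairwise_governance_median:
      scored.length ? scored[Math.floor(scored.length / 2)] : null,
    conflict_edge_count: pairs.filter((p) => p.conflicts.length > 0).length,
  };
}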

Structural invariants

When the fleet uses unified cards with conscience and integrity sections, the scorer checks two binary invariants:
  • conscience_universal — true if every agent shares the exact same conscience commitment set; false if any agent's conscience set differs from the modal (most common) set, so divergence is attributed to the minority, not the majority; null if any card lacks a conscience section.
  • integrity_uniform — true if all agents run in the same integrity.enforcement_mode (observe, nudge, or enforce); false if modes differ; null if any card lacks an integrity section.
These are first-order signals. A fleet without a universal conscience floor has a governance problem that no continuous score can express.
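How the conscience invariant can be computed, as a sketch; the symmetric-difference semantics for diverges_on here is an assumption, not the documented contract:

interface ConscienceCard {
  agent_id: string;
  conscience?: { declared_values: string[] };
}

// Sketch of the conscience invariant. The modal (most common) commitment
// set defines the floor; divergence is attributed to the minority.
function checkConscience(cards: ConscienceCard[]) {
  if (cards.some((c) => !c.conscience)) {
    return { conscience_universal: null, conscience_divergence: [] };
  }
  const key = (c: ConscienceCard) =>
    [...c.conscience!.declared_values].sort().join("|");

  const counts = new Map<string, number>();
  for (const c of cards) counts.set(key(c), (counts.get(key(c)) ?? 0) + 1);
  const modalKey = [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
  const modal = new Set(modalKey.split("|").filter(Boolean));

  const divergent = cards.filter((c) => key(c) !== modalKey);
  return {
    conscience_universal: divergent.length === 0,
    conscience_divergence: divergent.map((c) => ({
      agent_id: c.agent_id,
      diverges_on: [
        ...c.conscience!.declared_values.filter((v) => !modal.has(v)),
        ...[...modal].filter((v) => !c.conscience!.declared_values.includes(v)),
      ],
    })),
  };
}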

Outlier analysis

An agent is an outlier if its mean pairwise governance score is more than 1σ below the fleet mean. Outliers surface with their deviation_sigma so you can tell a mild outlier (1.1σ) from a severe one (2.7σ).
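The outlier rule in code form, a sketch assuming each agent's mean pairwise governance score has already been collected:

// Sketch of the 1σ outlier rule. Input: each agent's mean pairwise
// governance score across the pairs it participates in.
function findOutliers(meansByAgent: Record<string, number>) {
  const means = Object.values(meansByAgent);
  const fleetMean = means.reduce((s, m) => s + m, 0) / means.length;
  const sigma = Math.sqrt(
    means.reduce((s, m) => s + (m - fleetMean) ** 2, 0) / means.length,
  );

  return Object.entries(meansByAgent)
    .filter(([, m]) => fleetMean - m > sigma) // more than 1σ below the mean
    .map(([agent_id, m]) => ({
      agent_id,
      mean_pair_governance: m,
      deviation_sigma: sigma > 0 ? (fleetMean - m) / sigma : 0,
    }));
}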

Narrative helpers

The scorer pre-computes the answers to the most common human questions so every UI surface tells the same story:
  • weakest_pair — the pair with the lowest governance score, with full conflict evidence attached. Answers “where should I look first?”
  • most_conflicted_agent — the agent involved in the most conflict pairs. Answers “who needs attention?”
  • specializations — per-agent values that only that agent declares. Answers “what does each agent uniquely bring?”
  • conflict_surface — flat list of every explicit conflict, with evidence (which agent declares the value, which agent lists it as a conflict). Answers “what are all the actual disagreements?”

SDK usage

@mnemom/team-coherence/v2 is a public npm package. It accepts a structural subset interface — both unified cards (full fidelity with conscience + integrity) and AAP 1.0 AlignmentCard (reduced fidelity, invariants return null) satisfy it.
import { computeTeamCoherence } from "@mnemom/team-coherence/v2";
import type { TeamCoherenceInput } from "@mnemom/team-coherence/v2";

const cards: TeamCoherenceInput[] = [
  {
    agent_id: "sentinel",
    values: {
      declared: ["transparency", "harm_prevention", "signal_fidelity"],
      conflicts_with: ["alert_suppression"],
    },
    conscience: {
      declared_values: ["principal_benefit", "honesty"],
    },
    integrity: { enforcement_mode: "enforce" },
  },
  // ... more agents
];

const result = computeTeamCoherence(cards);

console.log("Governance floor:", result.pairwise_governance_floor);
console.log("Conflict edges:", result.conflict_edge_count);
console.log("Conscience universal:", result.conscience_universal);
if (result.weakest_pair) {
  const { agent_a, agent_b, governance_score, conflicts } = result.weakest_pair;
  console.log(`Weakest: ${agent_a} ↔ ${agent_b} (${governance_score})`);
  for (const c of conflicts) {
    console.log(`  ${c.value}: ${c.declared_by} declares; ${c.listed_as_conflict_by} conflicts`);
  }
}
The baseline Jaccard scorer remains available, for pedagogical comparison or for direct AAP protocol handshakes:
import { checkFleetCoherenceBaseline } from "@mnemom/team-coherence/baseline";
// Returns the AAP 1.0 FleetCoherenceResult shape with fleet_score, min/max_pair_score, etc.
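Running both scorers over the same cards makes the compression loss directly observable. A sketch, assuming the baseline accepts the same card array as the v2 entry point:

import { computeTeamCoherence } from "@mnemom/team-coherence/v2";
import { checkFleetCoherenceBaseline } from "@mnemom/team-coherence/baseline";

// `cards` is the TeamCoherenceInput array from the example above.
const v2 = computeTeamCoherence(cards);
const baseline = checkFleetCoherenceBaseline(cards);

console.log("Baseline fleet_score:", baseline.fleet_score); // one blended number
console.log(
  "v2 floor / median:",
  v2.pairwise_governance_floor,
  "/",
  v2.pairwise_governance_median,
);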

Fault-line analysis

Fault-line classification — grouping divergences into resolvable / priority_mismatch / incompatible / complementary buckets and surfacing structural fault lines — is a separate layer built on top of coherence scoring. It continues to be emitted by the mnemom-api /v1/teams/fault-lines endpoint alongside the v2 coherence vector. See the Fault Line Analysis guide for the classification model, and the Intelligence API reference for the endpoint shape.

API

POST /v1/teams/fault-lines          → fault-line classification + v2 coherence vector
GET  /v1/orgs/{org_id}/coherence    → v2 coherence vector for an entire org fleet
Both endpoints emit the v2 TeamCoherenceResult shape. The legacy AAP-shaped FleetCoherenceResult is retired from the product surface; consumers that want it can import @mnemom/team-coherence/baseline and compute it client-side. Results are cached for 5 minutes. The org-level endpoint requires the nway_coherence feature flag (Enterprise plan).
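A minimal request sketch; the hostname, auth header, and environment-variable name are illustrative assumptions, not documented contract:

// Sketch: fetching the org-level coherence vector.
const orgId = "org_123";                    // your org id (hypothetical)
const token = process.env.MNEMOM_API_TOKEN; // assumed auth mechanism

const res = await fetch(`https://api.mnemom.example/v1/orgs/${orgId}/coherence`, {
  headers: { Authorization: `Bearer ${token}` },
});
const coherence = await res.json(); // v2 TeamCoherenceResult shape

if (coherence.conscience_universal === false) {
  console.warn("Fleet lacks a universal conscience floor");
}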

Use cases

  • Fleet management — monitor shared governance commitments across all agents; detect conscience drift before it becomes a coordination failure.
  • Compliance — surface explicit value conflicts for audit. Role specialization is reported as a positive signal, not as a compliance red flag.
  • Incident response — verify that a response team’s cards actually agree on the core governance commitments before handing coordination authority to the fleet.
  • Onboarding — compute coherence including a new agent and see whether it lands inside or outside the existing specialization structure; a sketch follows this list.
  • Algorithm honesty — pair with the /baseline re-exports on the showcase page to demonstrate the concrete delta between naive and honest scoring on your own cards.
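For the onboarding case, a sketch: fleet and candidate are assumed TeamCoherenceInput values, and the comparison reads structure rather than a single number:

import { computeTeamCoherence } from "@mnemom/team-coherence/v2";

// Score the fleet with and without the candidate card, then compare.
const before = computeTeamCoherence(fleet);
const after = computeTeamCoherence([...fleet, candidate]);

const floorMoved =
  (before.pairwise_governance_floor ?? 0) - (after.pairwise_governance_floor ?? 0);
const newConflicts = after.conflict_edge_count - before.conflict_edge_count;
const isOutlier = after.outlier_agents.some((o) => o.agent_id === candidate.agent_id);

console.log({
  floorMoved,      // did the weakest pair get weaker?
  newConflicts,    // did the candidate introduce explicit conflicts?
  isOutlier,       // does it sit >1σ below the fleet mean?
  uniquelyBrings: after.specializations[candidate.agent_id],
});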

See also