Verifiable verdicts

Verifiable Verdicts extend the AIP attestation layer with zero-knowledge proofs that cryptographically demonstrate a verdict was derived correctly from its inputs. While signatures prove who signed a verdict and hash chains prove when it was signed, ZK proofs prove how it was computed — closing the remaining trust gap.

What gets proven

The ZK proof covers the verdict derivation rules: the deterministic mapping from analysis outputs to the final verdict. Specifically, the proof demonstrates:
  1. Category-to-severity mapping: Each concern category (e.g., boundary_violation, prompt_injection, value_misalignment) is mapped to the correct severity level (low, medium, high, critical) according to the protocol rules.
  2. Severity-to-verdict escalation: The most severe concern determines the checkpoint verdict:
    • No concerns or all low severity results in clear
    • medium severity results in review_needed
    • high or critical severity results in boundary_violation
  3. Concerns hash integrity: The hash of the concerns array in the proof journal matches the hash in the certificate, proving the verdict was derived from the exact concerns that were reported.
  4. Card constraint evaluation: The verdict accounts for the agent’s declared autonomy envelope — bounded actions, forbidden actions, and escalation triggers — as specified in the Alignment Card.
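The mapping and escalation rules above are deterministic, which is what makes them provable. A minimal sketch in Rust, assuming an illustrative category-to-severity table (the real table is defined by the protocol rules, not by this example):

```rust
// Illustrative sketch of the derivation rules. The category-to-severity
// mapping below is an assumption for the example, not the protocol's table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity { Low, Medium, High, Critical }

fn category_severity(category: &str) -> Severity {
    match category {
        "prompt_injection" => Severity::High,       // assumed
        "boundary_violation" => Severity::Critical, // assumed
        "value_misalignment" => Severity::Medium,   // assumed
        _ => Severity::Low,
    }
}

// The most severe concern determines the checkpoint verdict.
fn derive_verdict(severities: &[Severity]) -> &'static str {
    match severities.iter().copied().max() {
        None | Some(Severity::Low) => "clear",
        Some(Severity::Medium) => "review_needed",
        Some(Severity::High) | Some(Severity::Critical) => "boundary_violation",
    }
}

fn main() {
    let sevs = [category_severity("value_misalignment")];
    println!("{}", derive_verdict(&sevs)); // prints "review_needed"
}
```

Deriving `Ord` on the enum gives the severity ordering directly from the variant declaration order, so the escalation rule reduces to a `max()` over the concern list.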

What does NOT get proven

The ZK proof intentionally excludes:
  • LLM inference: The analysis LLM’s reasoning process is not proven. LLM inference is non-deterministic and computationally infeasible to prove inside a zkVM. The proof covers only the deterministic post-processing of the LLM’s structured output.
  • Thinking block content: The raw thinking block is never included in the proof. Only the SHA-256 hash (thinking_block_hash) is referenced, preserving the agent’s reasoning privacy.
  • Analysis model selection: The choice of which LLM to use for analysis is an operational decision outside the proof scope.
The ZK proof guarantees that the verdict was correctly derived from the analysis output, not that the analysis itself was correct. The analysis quality remains bounded by the analysis LLM’s capability. See Limitations.

Guest program

The verdict derivation logic runs inside an SP1 zkVM guest program. The guest program is a deterministic Rust binary that takes structured inputs, applies the verdict rules, and produces public outputs.

Inputs (private)

These are provided to the guest program but are not revealed in the proof:
| Input | Type | Description |
| --- | --- | --- |
| concerns | Vec&lt;Concern&gt; | The array of concerns from the analysis LLM, each with category, severity, and description. |
| card | CardSummary | Compressed Alignment Card: values, bounded actions, forbidden actions, escalation triggers. |
| model | String | Analysis model identifier. |

Computation

The guest program executes the following deterministic steps:
  1. Validate input structure: Verify concerns array and card summary are well-formed.
  2. Apply category mapping: Map each concern’s category to the protocol-defined severity rules.
  3. Compute verdict: Determine the final verdict from the highest severity across all concerns.
  4. Compute concerns hash: SHA-256 hash of the serialized concerns array (deterministic key ordering).
  5. Write journal: Output the verdict, concerns hash, and card ID as public commitments.
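Step 4 hinges on a canonical serialization: if the prover and verifier serialized the concerns array with different field orders, the hashes would diverge. A sketch of what "deterministic key ordering" means in practice (the struct and JSON shape are assumptions for illustration; the real program hashes these bytes with SHA-256):

```rust
// Assumed concern shape, mirroring the inputs table above.
struct Concern { category: String, severity: String, description: String }

// Emit fields in a fixed order so the byte stream (and therefore the
// SHA-256 hash computed over it in step 4) is identical on every run.
fn serialize_concerns(concerns: &[Concern]) -> String {
    let items: Vec<String> = concerns
        .iter()
        .map(|c| format!(
            "{{\"category\":{:?},\"severity\":{:?},\"description\":{:?}}}",
            c.category, c.severity, c.description
        ))
        .collect();
    format!("[{}]", items.join(","))
}

fn main() {
    let c = Concern {
        category: "prompt_injection".into(),
        severity: "high".into(),
        description: "tool output contained instructions".into(),
    };
    println!("{}", serialize_concerns(&[c]));
}
```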

Outputs (public journal)

These are committed to the proof’s public journal and are visible to verifiers:
| Output | Type | Description |
| --- | --- | --- |
| verdict | String | The computed verdict (clear, review_needed, boundary_violation). |
| concerns_hash | String | SHA-256 of the serialized concerns array. Verifiers compare this to the certificate's concerns to confirm consistency. |
| card_id | String | The Alignment Card ID, proving the verdict was derived against a specific card. |
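The three public outputs can be modeled as a small struct. This is a hypothetical shape for illustration (the real type definition lives in the guest program crate):

```rust
// Hypothetical journal type; field names mirror the public journal table.
#[derive(Debug, PartialEq)]
struct VerdictJournal {
    verdict: String,       // "clear" | "review_needed" | "boundary_violation"
    concerns_hash: String, // SHA-256 hex of the serialized concerns array
    card_id: String,       // Alignment Card the verdict was derived against
}

fn main() {
    let journal = VerdictJournal {
        verdict: "clear".into(),
        concerns_hash: "e3b0c442...".into(), // truncated for the example
        card_id: "card-001".into(),          // hypothetical card ID
    };
    println!("{:?}", journal);
}
```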

STARK proof format

The proof is an SP1 STARK (Scalable Transparent ARgument of Knowledge), stored and transmitted in three components:
| Component | Description |
| --- | --- |
| receipt | The STARK proof receipt. Contains the cryptographic proof data. Base64-encoded. |
| journal | The public journal output. Contains the committed public values (verdict, concerns_hash, card_id). Base64-encoded. |
| image_id | The hash of the guest program binary. Uniquely identifies which program was executed. Used to verify the proof was generated by the correct verdict derivation program. |
The image_id is a content hash of the compiled guest program. It acts as a commitment to the exact computation that was proven. A verifier checks:
  1. The image_id matches the expected verdict derivation program (published in the Mnemom GitHub repository).
  2. The receipt is a valid STARK proof for the given image_id and journal.
  3. The journal outputs are consistent with the certificate’s claims.

Tiered proving strategy

Not every checkpoint receives a ZK proof. Proving has non-trivial cost and latency, so AIP uses a tiered proving strategy that distinguishes real violations from configuration errors:
| Verdict | Proving policy | Rationale |
| --- | --- | --- |
| boundary_violation (no card gaps) | Proven (subject to proof_boundary_cap) | Confirmed behavioral violations must be independently verifiable. |
| boundary_violation (with card gaps) | Deferred | Card-gap violations are configuration errors (stale alignment cards), not agent misbehavior. Proving is deferred until DDR reconciliation classifies the violation. |
| review_needed | Stochastic sampling at proof_rate (default 10%) | Sampled proofs provide statistical confidence without proving every checkpoint. |
| clear | Stochastic sampling at proof_rate (default 10%) | Sampling clear verdicts catches systematic errors in the derivation logic. |
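The policy table reduces to a small decision function. A sketch under assumed names (`has_card_gaps` and the enum variants are illustrative, not the gateway's actual identifiers):

```rust
#[derive(Debug, PartialEq)]
enum ProvingPolicy { Prove, Defer, Sample }

// Sketch of the tiered proving policy; names are assumptions.
fn proving_policy(verdict: &str, has_card_gaps: bool) -> ProvingPolicy {
    match verdict {
        "boundary_violation" if has_card_gaps => ProvingPolicy::Defer,
        "boundary_violation" => ProvingPolicy::Prove,
        // clear / review_needed: stochastic sampling at proof_rate
        _ => ProvingPolicy::Sample,
    }
}

fn main() {
    println!("{:?}", proving_policy("boundary_violation", true)); // prints "Defer"
}
```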

Deferred proofs

When the policy engine detects card_gaps alongside a boundary_violation, the gateway creates a proof record with status: deferred instead of immediately dispatching to the prover. This creates a tamper-evident audit trail without incurring GPU cost. The observer’s DDR (Distributed Deliberation Review) reconciliation later classifies the violation:
| DDR outcome | Proof resolution | GPU cost |
| --- | --- | --- |
| card_gap | skipped_card_gap — proof permanently skipped | None |
| observer_noise | skipped_noise — proof permanently skipped | None |
| aip_miss | Upgraded to pending and dispatched to prover | Normal |
This means only confirmed real violations receive the expensive GPU treatment. During rapid development when alignment cards frequently drift, this can reduce proving costs by 80-95%.
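The reconciliation step above is a straightforward mapping from DDR outcome to proof resolution. A sketch, using the outcome and status strings from the table (the function itself is illustrative):

```rust
// Maps a DDR reconciliation outcome to the deferred proof's new status.
// Outcome and status strings come from the table; the function is a sketch.
fn resolve_deferred(ddr_outcome: &str) -> Option<&'static str> {
    match ddr_outcome {
        "card_gap" => Some("skipped_card_gap"),   // no GPU cost
        "observer_noise" => Some("skipped_noise"), // no GPU cost
        "aip_miss" => Some("pending"),             // dispatched to the prover
        _ => None,                                 // unknown outcome
    }
}

fn main() {
    println!("{:?}", resolve_deferred("aip_miss")); // prints Some("pending")
}
```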

On-demand proving

Any checkpoint can be proven on demand via POST /v1/checkpoints/{checkpoint_id}/prove. This is useful when:
  • An operator wants to verify a specific review_needed verdict
  • An auditor requires proof for a specific checkpoint
  • The stochastic sample missed a checkpoint of interest
  • A deferred proof needs to be forced without waiting for DDR

Cost and latency

| Metric | Value | Notes |
| --- | --- | --- |
| Proving cost | ~$0.001/proof | Modal H100 GPU, per-second billing. Scales to zero when idle. |
| Proving latency | ~700 ms | Wall-clock time from proof request to receipt on H100 GPU. Previously 3-15 minutes on CPU. |
| Verification latency | <100 ms | STARK verification is fast. Local verification avoids network round-trips. |
| Proof size | ~2.7 MB | SP1 compressed STARK receipt. |
Proving is asynchronous. The checkpoint verdict is returned immediately; the proof is generated in the background (~700ms on GPU) and attached to the certificate when complete. With GPU acceleration, proof generation completes before the next integrity checkpoint in most conversations.

Proof input persistence

When the gateway creates a proof request, it stores the full proving inputs alongside the verdict_proofs row:
| Column | Description |
| --- | --- |
| analysis_json | The full analysis response text from the AIP observer |
| thinking_hash | SHA-256 hash of the thinking block |
| card_hash | SHA-256 hash of the Alignment Card at the time of analysis |
| values_hash | SHA-256 hash of the conscience values at the time of analysis |
| model | The analysis model identifier |
This ensures that proof inputs survive prover restarts, network failures, and machine stops. If the initial proving attempt fails, the retry loop can reconstruct the full proof request from the database without needing the original HTTP payload.

Self-healing retry loop

The prover service runs a background retry loop every 30 seconds that:
  1. Queries get_pending_proofs() for proofs with status pending or failed (up to 5 retries)
  2. Filters out proofs without stored input data (legacy orphans)
  3. Sets status to proving and spawns prove_verdict_derivation() with the stored inputs
  4. On success, calls complete_proof() to persist the receipt, journal, and verification result
  5. On failure, calls fail_proof() to increment the retry counter
This makes the proving pipeline self-healing: transient failures (prover unavailable, machine restarts, OOM) are automatically recovered without operator intervention.
If a proof fails to generate after 5 retries, it remains in failed status. The checkpoint is still valid — it just lacks the additional computational integrity guarantee. Proofs without stored inputs (created before input persistence was added) are skipped by the retry loop and marked as permanently failed.
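The status transitions driven by the retry loop can be modeled as a single tick over one proof record. A sketch, assuming field and status names from the prose rather than the service's actual schema:

```rust
const MAX_RETRIES: u32 = 5;

// Illustrative model of one proof record; names follow the prose.
struct ProofRow { status: &'static str, retries: u32, has_inputs: bool }

// One pass of the 30-second retry loop over a single record.
fn retry_tick(row: &mut ProofRow, prove_succeeds: bool) {
    if !row.has_inputs {
        row.status = "failed"; // legacy orphan: skipped, permanently failed
        return;
    }
    if row.retries >= MAX_RETRIES {
        row.status = "failed"; // retries exhausted; checkpoint remains valid
        return;
    }
    row.status = "proving"; // dispatched to prove_verdict_derivation()
    if prove_succeeds {
        row.status = "completed"; // complete_proof(): persist receipt + journal
    } else {
        row.retries += 1;         // fail_proof(): increment retry counter
        row.status = "failed";    // picked up again on the next sweep
    }
}

fn main() {
    let mut row = ProofRow { status: "pending", retries: 0, has_inputs: true };
    retry_tick(&mut row, false); // transient failure
    retry_tick(&mut row, true);  // recovered on the next sweep
    println!("{} after {} retries", row.status, row.retries); // completed after 1 retries
}
```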

Verification

Server-side

Submit the full certificate to POST /v1/verify. The API delegates STARK verification to the prover service and returns a structured result including checks.verdict_derivation.valid.

Local verification

Use the SP1 verifier SDK to verify locally:
use sp1_sdk::{ProverClient, SP1ProofWithPublicValues};

// ELF is the compiled verdict derivation guest program binary,
// whose hash must match the published image_id.
let client = ProverClient::from_env();
let (_, vk) = client.setup(ELF);

// The receipt is base64-encoded in the certificate; decode and
// deserialize it first (e.g. with the base64 and bincode crates).
let mut proof: SP1ProofWithPublicValues =
    deserialize(&cert.proofs.verdict_derivation.receipt);

// Verify the STARK proof against the guest program's verifying key
client.verify(&proof, &vk).expect("proof verification failed");

// Read the committed public values and check them against the certificate
let journal = proof.public_values.read::<VerdictJournal>();
assert_eq!(journal.verdict, cert.claims.verdict);

Checking proof status

For checkpoints where proving is in progress, query the status:
curl https://api.mnemom.ai/v1/checkpoints/{checkpoint_id}/proof
The status field progresses through: pending -> proving -> completed (or failed). For deferred proofs, the progression is: deferred -> skipped_card_gap | skipped_noise | pending (then normal flow).

See also