Verifiable verdicts

Verifiable Verdicts extend the AIP attestation layer with zero-knowledge proofs that cryptographically demonstrate a verdict was derived correctly from its inputs. While signatures prove who signed a verdict and hash chains prove when it was signed, ZK proofs prove how it was computed — closing the remaining trust gap.

What gets proven

The ZK proof covers the verdict derivation rules: the deterministic mapping from analysis outputs to the final verdict. Specifically, the proof demonstrates:
  1. Category-to-severity mapping: Each concern category (e.g., boundary_violation, prompt_injection, value_misalignment) is mapped to the correct severity level (low, medium, high, critical) according to the protocol rules.
  2. Severity-to-verdict escalation: The most severe concern determines the checkpoint verdict:
    • No concerns or all low severity results in clear
    • medium severity results in review_needed
    • high or critical severity results in boundary_violation
  3. Concerns hash integrity: The hash of the concerns array in the proof journal matches the hash in the certificate, proving the verdict was derived from the exact concerns that were reported.
  4. Card constraint evaluation: The verdict accounts for the agent’s declared autonomy envelope — bounded actions, forbidden actions, and escalation triggers — as specified in the Alignment Card.
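The mapping and escalation rules above are deterministic, which is what makes them provable. A minimal sketch in Rust, assuming an illustrative category-to-severity table (the real table is defined by the protocol rules, not by this example):

```rust
// Illustrative sketch of the derivation rules. The category-to-severity
// mapping below is an assumption for the example, not the protocol's table.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Severity { Low, Medium, High, Critical }

fn category_severity(category: &str) -> Severity {
    match category {
        "prompt_injection" => Severity::High,       // assumed
        "boundary_violation" => Severity::Critical, // assumed
        "value_misalignment" => Severity::Medium,   // assumed
        _ => Severity::Low,
    }
}

// The most severe concern determines the checkpoint verdict.
fn derive_verdict(severities: &[Severity]) -> &'static str {
    match severities.iter().copied().max() {
        None | Some(Severity::Low) => "clear",
        Some(Severity::Medium) => "review_needed",
        Some(Severity::High) | Some(Severity::Critical) => "boundary_violation",
    }
}

fn main() {
    let sevs = [category_severity("value_misalignment")];
    println!("{}", derive_verdict(&sevs)); // prints "review_needed"
}
```

Deriving `Ord` on the enum gives the severity ordering directly from the variant declaration order, so the escalation rule reduces to a `max()` over the concern list.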

What does NOT get proven

The ZK proof intentionally excludes:
  • LLM inference: The analysis LLM’s reasoning process is not proven. LLM inference is non-deterministic and computationally infeasible to prove inside a zkVM. The proof covers only the deterministic post-processing of the LLM’s structured output.
  • Thinking block content: The raw thinking block is never included in the proof. Only the SHA-256 hash (thinking_block_hash) is referenced, preserving the agent’s reasoning privacy.
  • Analysis model selection: The choice of which LLM to use for analysis is an operational decision outside the proof scope.
The ZK proof guarantees that the verdict was correctly derived from the analysis output, not that the analysis itself was correct. The analysis quality remains bounded by the analysis LLM’s capability. See Limitations.

Guest program

The verdict derivation logic runs inside an SP1 zkVM guest program. The guest program is a deterministic Rust binary that takes structured inputs, applies the verdict rules, and produces public outputs.

Inputs (private)

These are provided to the guest program but are not revealed in the proof:
| Input | Type | Description |
| --- | --- | --- |
| concerns | Vec&lt;Concern&gt; | The array of concerns from the analysis LLM, each with category, severity, and description. |
| card | CardSummary | Compressed Alignment Card: values, bounded actions, forbidden actions, escalation triggers. |
| model | String | Analysis model identifier. |

Computation

The guest program executes the following deterministic steps:
  1. Validate input structure: Verify concerns array and card summary are well-formed.
  2. Apply category mapping: Map each concern’s category to the protocol-defined severity rules.
  3. Compute verdict: Determine the final verdict from the highest severity across all concerns.
  4. Compute concerns hash: SHA-256 hash of the serialized concerns array (deterministic key ordering).
  5. Write journal: Output the verdict, concerns hash, and card ID as public commitments.
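Step 4 hinges on a canonical serialization: if the prover and verifier serialized the concerns array with different field orders, the hashes would diverge. A sketch of what "deterministic key ordering" means in practice (the struct and JSON shape are assumptions for illustration; the real program hashes these bytes with SHA-256):

```rust
// Assumed concern shape, mirroring the inputs table above.
struct Concern { category: String, severity: String, description: String }

// Emit fields in a fixed order so the byte stream (and therefore the
// SHA-256 hash computed over it in step 4) is identical on every run.
fn serialize_concerns(concerns: &[Concern]) -> String {
    let items: Vec<String> = concerns
        .iter()
        .map(|c| format!(
            "{{\"category\":{:?},\"severity\":{:?},\"description\":{:?}}}",
            c.category, c.severity, c.description
        ))
        .collect();
    format!("[{}]", items.join(","))
}

fn main() {
    let c = Concern {
        category: "prompt_injection".into(),
        severity: "high".into(),
        description: "tool output contained instructions".into(),
    };
    println!("{}", serialize_concerns(&[c]));
}
```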

Outputs (public journal)

These are committed to the proof’s public journal and are visible to verifiers:
| Output | Type | Description |
| --- | --- | --- |
| verdict | String | The computed verdict (clear, review_needed, boundary_violation). |
| concerns_hash | String | SHA-256 of the serialized concerns array. Verifiers compare this to the certificate's concerns to confirm consistency. |
| card_id | String | The Alignment Card ID, proving the verdict was derived against a specific card. |
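The three public outputs can be modeled as a small struct. This is a hypothetical shape for illustration (the real type definition lives in the guest program crate):

```rust
// Hypothetical journal type; field names mirror the public journal table.
#[derive(Debug, PartialEq)]
struct VerdictJournal {
    verdict: String,       // "clear" | "review_needed" | "boundary_violation"
    concerns_hash: String, // SHA-256 hex of the serialized concerns array
    card_id: String,       // Alignment Card the verdict was derived against
}

fn main() {
    let journal = VerdictJournal {
        verdict: "clear".into(),
        concerns_hash: "e3b0c442...".into(), // truncated for the example
        card_id: "card-001".into(),          // hypothetical card ID
    };
    println!("{:?}", journal);
}
```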

STARK proof format

The proof is an SP1 STARK (Scalable Transparent ARgument of Knowledge), stored and transmitted in three components:
| Component | Description |
| --- | --- |
| receipt | The STARK proof receipt. Contains the cryptographic proof data. Base64-encoded. |
| journal | The public journal output. Contains the committed public values (verdict, concerns_hash, card_id). Base64-encoded. |
| image_id | The hash of the guest program binary. Uniquely identifies which program was executed. Used to verify the proof was generated by the correct verdict derivation program. |
The image_id is a content hash of the compiled guest program. It acts as a commitment to the exact computation that was proven. A verifier checks:
  1. The image_id matches the expected verdict derivation program (published in the Mnemom GitHub repository).
  2. The receipt is a valid STARK proof for the given image_id and journal.
  3. The journal outputs are consistent with the certificate’s claims.

Tiered proving strategy

Not every checkpoint receives a ZK proof. Proving has non-trivial cost and latency, so AIP uses a tiered proving strategy that distinguishes real violations from configuration errors:
| Verdict | Proving policy | Rationale |
| --- | --- | --- |
| boundary_violation (no card gaps) | Proven (subject to proof_boundary_cap) | Confirmed behavioral violations must be independently verifiable. |
| boundary_violation (with card gaps) | Deferred | Card-gap violations are configuration errors (stale alignment cards), not agent misbehavior. Proving is deferred until DDR reconciliation classifies the violation. |
| review_needed | Stochastic sampling at proof_rate (default 10%) | Sampled proofs provide statistical confidence without proving every checkpoint. |
| clear | Stochastic sampling at proof_rate (default 10%) | Sampling clear verdicts catches systematic errors in the derivation logic. |
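The policy table reduces to a small decision function. A sketch under assumed names (`has_card_gaps` and the enum variants are illustrative, not the gateway's actual identifiers):

```rust
#[derive(Debug, PartialEq)]
enum ProvingPolicy { Prove, Defer, Sample }

// Sketch of the tiered proving policy; names are assumptions.
fn proving_policy(verdict: &str, has_card_gaps: bool) -> ProvingPolicy {
    match verdict {
        "boundary_violation" if has_card_gaps => ProvingPolicy::Defer,
        "boundary_violation" => ProvingPolicy::Prove,
        // clear / review_needed: stochastic sampling at proof_rate
        _ => ProvingPolicy::Sample,
    }
}

fn main() {
    println!("{:?}", proving_policy("boundary_violation", true)); // prints "Defer"
}
```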

Deferred proofs

When the policy engine detects card_gaps alongside a boundary_violation, the gateway creates a proof record with status: deferred instead of immediately dispatching to the prover. This creates a tamper-evident audit trail without incurring GPU cost. The observer’s DDR (Distributed Deliberation Review) reconciliation later classifies the violation:
| DDR outcome | Proof resolution | GPU cost |
| --- | --- | --- |
| card_gap | skipped_card_gap — proof permanently skipped | None |
| observer_noise | skipped_noise — proof permanently skipped | None |
| aip_miss | Upgraded to pending and dispatched to prover | Normal |
This means only confirmed real violations receive the expensive GPU treatment. During rapid development when alignment cards frequently drift, this can reduce proving costs by 80-95%.
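The reconciliation step above is a straightforward mapping from DDR outcome to proof resolution. A sketch, using the outcome and status strings from the table (the function itself is illustrative):

```rust
// Maps a DDR reconciliation outcome to the deferred proof's new status.
// Outcome and status strings come from the table; the function is a sketch.
fn resolve_deferred(ddr_outcome: &str) -> Option<&'static str> {
    match ddr_outcome {
        "card_gap" => Some("skipped_card_gap"),   // no GPU cost
        "observer_noise" => Some("skipped_noise"), // no GPU cost
        "aip_miss" => Some("pending"),             // dispatched to the prover
        _ => None,                                 // unknown outcome
    }
}

fn main() {
    println!("{:?}", resolve_deferred("aip_miss")); // prints Some("pending")
}
```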

On-demand proving

Any checkpoint can be proven on demand via POST /v1/checkpoints/{checkpoint_id}/prove. This is useful when:
  • An operator wants to verify a specific review_needed verdict
  • An auditor requires proof for a specific checkpoint
  • The stochastic sample missed a checkpoint of interest
  • A deferred proof needs to be forced without waiting for DDR

Cost and latency

| Metric | Value | Notes |
| --- | --- | --- |
| Proving cost | ~$0.001/proof | Modal H100 GPU, per-second billing. Scales to zero when idle. |
| Proving latency | ~700 ms | Wall-clock time from proof request to receipt on H100 GPU. Previously 3-15 minutes on CPU. |
| Verification latency | <100 ms | STARK verification is fast. Local verification avoids network round-trips. |
| Proof size | ~2.7 MB | SP1 compressed STARK receipt. |
Proving is asynchronous. The checkpoint verdict is returned immediately; the proof is generated in the background (~700ms on GPU) and attached to the certificate when complete. With GPU acceleration, proof generation completes before the next integrity checkpoint in most conversations.

Proof input persistence

When the gateway creates a proof request, it stores the full proving inputs alongside the verdict_proofs row:
| Column | Description |
| --- | --- |
| analysis_json | The full analysis response text from the AIP observer |
| thinking_hash | SHA-256 hash of the thinking block |
| card_hash | SHA-256 hash of the Alignment Card at the time of analysis |
| values_hash | SHA-256 hash of the conscience values at the time of analysis |
| model | The analysis model identifier |
This ensures that proof inputs survive prover restarts, network failures, and machine stops. If the initial proving attempt fails, the retry loop can reconstruct the full proof request from the database without needing the original HTTP payload.

Self-healing retry loop

The prover service runs a background retry loop every 30 seconds that:
  1. Queries get_pending_proofs() for proofs with status pending or failed (up to 5 retries)
  2. Filters out proofs without stored input data (legacy orphans)
  3. Sets status to proving and spawns prove_verdict_derivation() with the stored inputs
  4. On success, calls complete_proof() to persist the receipt, journal, and verification result
  5. On failure, calls fail_proof() to increment the retry counter
This makes the proving pipeline self-healing: transient failures (prover unavailable, machine restarts, OOM) are automatically recovered without operator intervention.
If a proof fails to generate after 5 retries, it remains in failed status. The checkpoint is still valid — it just lacks the additional computational integrity guarantee. Proofs without stored inputs (created before input persistence was added) are skipped by the retry loop and marked as permanently failed.
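The status transitions driven by the retry loop can be modeled as a single tick over one proof record. A sketch, assuming field and status names from the prose rather than the service's actual schema:

```rust
const MAX_RETRIES: u32 = 5;

// Illustrative model of one proof record; names follow the prose.
struct ProofRow { status: &'static str, retries: u32, has_inputs: bool }

// One pass of the 30-second retry loop over a single record.
fn retry_tick(row: &mut ProofRow, prove_succeeds: bool) {
    if !row.has_inputs {
        row.status = "failed"; // legacy orphan: skipped, permanently failed
        return;
    }
    if row.retries >= MAX_RETRIES {
        row.status = "failed"; // retries exhausted; checkpoint remains valid
        return;
    }
    row.status = "proving"; // dispatched to prove_verdict_derivation()
    if prove_succeeds {
        row.status = "completed"; // complete_proof(): persist receipt + journal
    } else {
        row.retries += 1;         // fail_proof(): increment retry counter
        row.status = "failed";    // picked up again on the next sweep
    }
}

fn main() {
    let mut row = ProofRow { status: "pending", retries: 0, has_inputs: true };
    retry_tick(&mut row, false); // transient failure
    retry_tick(&mut row, true);  // recovered on the next sweep
    println!("{} after {} retries", row.status, row.retries); // completed after 1 retries
}
```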

Verification

Server-side

Submit the full certificate to POST /v1/verify. The API delegates STARK verification to the prover service and returns a structured result including checks.verdict_derivation.valid.

Local verification

Use the SP1 verifier SDK to verify locally:
use sp1_sdk::{ProverClient, SP1ProofWithPublicValues};

// ELF is the compiled verdict derivation guest program binary,
// whose hash must match the published image_id.
let client = ProverClient::from_env();
let (_, vk) = client.setup(ELF);

// The receipt is base64-encoded in the certificate; decode and
// deserialize it first (e.g. with the base64 and bincode crates).
let mut proof: SP1ProofWithPublicValues =
    deserialize(&cert.proofs.verdict_derivation.receipt);

// Verify the STARK proof against the guest program's verifying key
client.verify(&proof, &vk).expect("proof verification failed");

// Read the committed public values and check them against the certificate
let journal = proof.public_values.read::<VerdictJournal>();
assert_eq!(journal.verdict, cert.claims.verdict);

Checking proof status

For checkpoints where proving is in progress, query the status:
curl https://api.mnemom.ai/v1/checkpoints/{checkpoint_id}/proof
The status field progresses through: pending -> proving -> completed (or failed). For deferred proofs, the progression is: deferred -> skipped_card_gap | skipped_noise | pending (then normal flow).

See also