Design Spec · Draft v1

Memory Engine

2026-05-16 · branch feat-mem-mcp · memory v1

TL;DR

Personal memory as a first-class engine on the ai-gateway. The Cloudflare Worker is the control plane and the sole billing emitter — multi-tenancy, rate limits, ownership re-check, per-user usage events to payment-gateway. The Railway side runs 100% vanilla mem0ai/mem0 — no source patch, no extension files, no factory-dict swaps. Configuration only, plus one init.sql that seeds the pgvector extension and mem0's default User row. Worker→engine auth uses mem0's stock ADMIN_API_KEY path: X-API-Key: <MEM0_ADMIN_API_KEY> matched via secrets.compare_digest. Mem0 talks to Google's Gemini API directly via its native provider: "gemini" (uses the google-genai SDK, properly translates response_format → Gemini's response_mime_type). Vector store: native 3072-d gemini embeddings stored as vector(3072) with hnsw=False — exact sequential scan inside each user's partition, best retrieval quality, fast enough at v1 per-user volumes. Billing is by request count: the Worker emits one { operation, count: 1 } usage event per successful MCP tool call. Credit pricing per operation lives on payment-gateway, not in the Worker. MCP at /engines/memory/mcp matches the existing /engines/<id>/mcp convention.

V1 scope (ship)

Memory MCP at /engines/memory/mcp; seven tools.
Two write modes via MCP: remember (agent-curated) and ingest (turn capture for runtime hooks).
100% vanilla mem0ai/mem0 on Railway. No source patch, no extension files. Config only + one init.sql.
Mem0 talks to Google's Gemini API directly via its native provider: "gemini" (google-genai SDK).
Native 3072-d gemini embeddings, vector(3072), hnsw=False — exact sequential scan, best retrieval quality.
Worker→engine auth: X-API-Key: <MEM0_ADMIN_API_KEY> — mem0's stock admin path.
Worker control plane: content-hash dedupe (for remember), ownership re-check, rate limits, recency decay.
Per-user billing: one usage event per successful MCP tool call with { operation, count: 1 }. Failed requests (4xx/5xx, rate limits, ownership rejects) emit nothing. Memory bills by request, not by tokens (mem0's cost per call is bounded). Credit pricing per operation lives on payment-gateway, not in the Worker.
Disaster recovery: Railway native scheduled backups (daily, 6-day retention). Target RTO 4 h, RPO 24 h. Tighter posture (5-min RPO via WAL-G to R2 or Railway PITR) deferred to before-external-GA hardening.

V1 non-goals (defer)

No preferences toggle. Opt-in is by wiring up the stop hook (or not).
No browser-style REST endpoints. Agents handle list/search/delete via MCP.
No hybrid search (BM25 + RRF). Vector + recency decay only in v1.
No reranker. Pilot once eval data accumulates.
No persistent volume. Mem0 history is disabled; revisit before external users.
No organization memory or admin controls.

Two ways memory is populated, both via MCP:

Agent-curated: agent calls remember({memory: "..."}) during reasoning. Worker hashes, checks D1 for prior hash hit, forwards to mem0 POST /memories with infer=False. No LLM call.
Turn capture: a runtime hook calls ingest({messages: [...]}). Worker forwards to mem0.add(messages, user_id=..., infer=True). Mem0 runs its full native pipeline (top-10 search, LLM extraction with ADDITIVE_EXTRACTION_PROMPT, cross-memory hash dedup, batched embeddings, entity store with linked_memory_ids, prompt auto-updates). Mem0's LLM and embedder calls go to Google's Gemini API directly via its native gemini providers with the engine's GOOGLE_API_KEY_MEMORY; Google bills the dedicated "memory" GCP project at exact-token rates. After mem0 returns 2xx, the Worker emits one { operation: "ingest", count: 1 } usage event to payment-gateway. One round-trip per ingest.
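The remember path described above can be sketched as one small decision function. This is a minimal illustration under stated assumptions, not the contents of dedupe.ts: the plain object `hashIndex` stands in for the D1 hash table, `EngineStub` stands in for the mem0 HTTP client, and the SHA-256 digest is passed in precomputed. All names here are hypothetical.

```typescript
// Hypothetical stand-in for the HTTP client that talks to mem0.
interface EngineStub {
  exists(memoryId: string): boolean;    // cheap GET /memories/{id}
  add(content: string): string | null;  // POST /memories (infer=False); null on non-2xx
}

type RememberOutcome =
  | { kind: "duplicate"; memoryId: string }
  | { kind: "stored"; memoryId: string }
  | { kind: "engine_error" };

function rememberWithDedupe(
  hashIndex: Record<string, string>, // content hash -> mem0 memory id (D1 in the spec)
  engine: EngineStub,
  contentHash: string,               // SHA-256 of the content, computed by the caller
  content: string,
): RememberOutcome {
  const known = hashIndex[contentHash];
  if (known !== undefined) {
    // Hash hit: confirm the mem0 row still exists before short-circuiting.
    if (engine.exists(known)) return { kind: "duplicate", memoryId: known };
    delete hashIndex[contentHash]; // orphaned hash, fall through to a normal write
  }
  const memoryId = engine.add(content);
  if (memoryId === null) return { kind: "engine_error" }; // no hash written on failure
  hashIndex[contentHash] = memoryId; // hash is recorded only after mem0 succeeds
  return { kind: "stored", memoryId };
}
```

A real key would presumably combine the user id with the content hash so that two users storing the same sentence do not collide; the sketch leaves that out for brevity.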

Architecture

Runtime path

Trust boundary. ai-gateway is the only caller of the memory engine. The engine accepts only inbound requests carrying X-API-Key: <MEM0_ADMIN_API_KEY> — matched via secrets.compare_digest by mem0's stock verify_auth. All multi-tenancy enforcement and per-user usage emission lives in the Worker. Mem0 talks to Google's Gemini API directly via its native provider: "gemini" using the engine's GOOGLE_API_KEY_MEMORY; Google bills the dedicated "memory" GCP project at exact-token rates. Billing is by request count: the Worker emits one { operation, count: 1 } usage event per successful MCP tool call via the existing USAGE_INGEST_API_KEY path; failed requests are free. Credit pricing lives on payment-gateway. No engine-side billing machinery; no custom auth file; no subclasses; 100% vanilla mem0.

No circular dependency. Mem0's internal LLM/embedding calls go to Google's Gemini API directly via its native gemini providers (uses the google-genai SDK). Google bills the dedicated "memory" GCP project. The Worker emits one usage event per successful MCP tool call (no event on failure) via the existing USAGE_INGEST_API_KEY path — payload { operation, count: 1 }.

ai-gateway owns (control plane)

MCP transport, JWT verification, scope derivation.
Content-hash dedupe before forwarding remember.
Ownership re-check on every mutation (read → verify → mutate).
Per-user rate limits.
Recency decay on search results.
Emits one { operation, count: 1 } usage event per successful MCP tool call — sole billing emitter, no engine-side billing code. Pricing per operation is configured on payment-gateway.

Memory engine owns (100% vanilla mem0)

Mem0's full native pipeline under infer=True: extraction with ADDITIVE_EXTRACTION_PROMPT, hash dedup, batched embeddings, entity store with linked_memory_ids.
Auth via mem0's stock ADMIN_API_KEY path (secrets.compare_digest); no custom verifier.
LLM + embedder via mem0's native provider: "gemini" (google-genai SDK). response_format → Gemini's response_mime_type handled inside mem0's GeminiLLM (line 170).
Vector store: vector(3072), hnsw=False, exact sequential scan inside each user's partition. Best retrieval quality.
init.sql seeds the vector extension and the default User row idempotently.
Railway native scheduled backups (daily, 6-day retention).

Components

ai-gateway changes (`src/`)

Path	Responsibility
`src/mcp/engines/memory/`	Memory MCP server (image-gen pattern). Hosts all seven memory tools. Exports `memoryServer` and `memoryMetadata`.
`src/routes/engines/index.ts`	One line added: enginesApp.route(`/${memoryMetadata.id}`, memoryServer). Same pattern as image-gen.
`src/mcp/engines/memory/scope.ts`	Branded `MemoryScope` type; constructor reads JWT.sub + JWT.azp after verification.
`src/mcp/engines/memory/client.ts`	Thin HTTP client around mem0's stock REST API. Methods take `scope: MemoryScope`. Attaches `X-API-Key: <MEM0_ADMIN_API_KEY>` (Wrangler secret) on every call. No wrapping endpoint shape; no token-exchange hop.
`src/mcp/engines/memory/dedupe.ts`	Light idempotency for `remember` only (not `ingest` — mem0 handles dedup natively under `infer=True`). SHA-256 → D1 check; write hash after mem0 success; verify mem0 row on hash hit; nightly TTL sweep.
`src/mcp/engines/memory/ratelimit.ts`	Per-user windows in `GATEWAY_KV`. 429 + `Retry-After`.
`src/mcp/engines/memory/search.ts`	Recency decay on returned scores; re-sort; return top-k. Honors `metadata.pinned`.
`src/mcp/engines/memory/billing.ts`	After each successful mem0 round-trip, emits one usage event to payment-gateway via `USAGE_INGEST_API_KEY` with payload `{ user_id, product_id: "memory", operation, count: 1, request_id }`. No credit field — pricing per operation lives on payment-gateway. D1 outbox fallback on payment-gateway 5xx.

Memory engine (`infra/mem0/`) — 100% vanilla mem0

The engine is the vanilla mem0ai/mem0 Docker image, unmodified. No COPY over source. No Python extension files. No factory-dict swaps. Zero lines of our own Python. Two things make this possible:

Mem0 has a native provider: "gemini" for both LLM and embedder (uses Google's google-genai SDK). It translates response_format={"type": "json_object"} to Gemini-native response_mime_type: "application/json", so no subclass is needed to coerce JSON output.
hnsw=False uses mem0's stock pgvector schema (vector(<dim>)) with no index, so search is an exact sequential scan over the user's partition. This avoids pgvector's 2000-d HNSW cap, so no schema subclass is needed for native 3072-d embeddings. At v1 per-user volumes (hundreds to low thousands of memories) a sequential scan stays well inside the latency budget (the engine tests target < 100 ms p99 at 1k rows) and gives better recall than HNSW would (exact vs. approximate).
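The dimension cap and the scan cost behind the second point can be checked with a few lines of arithmetic. The limits and per-dimension sizes below come from pgvector's documentation rather than from this spec, and the helper names are hypothetical; treat it as a back-of-envelope sketch.

```typescript
type PgVectorType = "vector" | "halfvec";

// Limits and sizes as documented by pgvector (not taken from this spec).
const HNSW_MAX_DIMS: Record<PgVectorType, number> = { vector: 2000, halfvec: 4000 };
const BYTES_PER_DIM: Record<PgVectorType, number> = { vector: 4, halfvec: 2 };

// Whether an HNSW index can be built on a column of this type and width.
function canUseHnsw(dims: number, type: PgVectorType): boolean {
  return dims <= HNSW_MAX_DIMS[type];
}

// Storage per embedding: bytes-per-dimension * dims + 8 bytes of header.
function bytesPerEmbedding(dims: number, type: PgVectorType): number {
  return BYTES_PER_DIM[type] * dims + 8;
}

// Bytes an exact sequential scan has to read for one user's partition.
function scanBytes(rowCount: number, dims: number, type: PgVectorType): number {
  return rowCount * bytesPerEmbedding(dims, type);
}
```

For 3072 dimensions, `vector` cannot take an HNSW index and costs 12,296 bytes per row, so a user with 1,000 memories implies roughly 12.3 MB read per exact scan. `halfvec` halves that to 6,152 bytes per row and fits under the 4000-d HNSW limit.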

File	Purpose
`Dockerfile`	`FROM mem0ai/mem0@sha256:<digest>` (pinned by digest, not tag). Vanilla — no `COPY`, no `RUN pip install` beyond mem0's defaults.
`init.sql`	Runs once against the Postgres add-on after provision: `CREATE EXTENSION IF NOT EXISTS vector;` + `INSERT INTO users (username) VALUES ('memory-engine') ON CONFLICT DO NOTHING;`. Idempotent. The User row is the single account mem0's `require_auth` resolves to under `ADMIN_API_KEY` auth.
`mem0.config.json` / env	Mem0's native config: `provider: "gemini"` for both LLM and embedder; model `gemini-3-flash` with `max_tokens=8000` for reasoning headroom; embedder model `models/gemini-embedding-2`; `embedding_dims=3072`; vector store pgvector with `hnsw=False`, `embedding_model_dims=3072`, cosine distance. History disabled.
`Dockerfile HEALTHCHECK`	Boot-time self-test: container fires an unauthenticated `GET /memories` against itself and fails liveness on anything other than 401. Catches an upstream `AUTH_DISABLED` default flip or an accidentally-empty `MEM0_ADMIN_API_KEY` that would otherwise open the engine to the public.
`tests/test_admin_auth_wired.py`	CI integration test: boots the built Docker image; sends POST `/memories` with no `X-API-Key`, with a bogus key, and with the real `MEM0_ADMIN_API_KEY`; asserts 401/401/2xx. Failure blocks the Railway deploy.
`railway.json`	One service + Postgres add-on. `init.sql` mounted as a post-provision migration. No persistent volume.
`README.md`	Deploy / rotate / backup / restore / DR-drill runbook.

MCP Tool Surface

All seven tools live at /engines/memory/mcp, mounted via enginesApp exactly like image-gen. The Worker translates each tool to direct calls against mem0's native REST endpoints.

Tool	Worker behavior	Mem0 call
`search_memory`	Forward query; apply recency decay to non-pinned rows on the response. Worker emits one `{operation: "search_memory", count: 1}` event after success.	`POST /memories/search`
`remember`	SHA-256 content hash → D1 check. Hit → verify mem0 row exists (cheap GET); if 404 delete orphan + fall through. Miss → forward; write hash only on mem0 2xx. Optional `pinned: true` opts out of recency decay. Worker emits one `{operation: "remember", count: 1}` event after success.	`POST /memories` (`infer=False`) — single curated fact, no extraction needed
`ingest`	Forward to mem0 with `infer=True`. Mem0 runs its full native pipeline (top-10 search, LLM extraction, hash-dedup, batched embed + insert, entity store). Worker emits one `{operation: "ingest", count: 1}` event after mem0 returns 2xx.	one `POST /memories` with `infer=True`
`update_memory`	`GET /memories/{id}` → verify `user_id` matches scope → forward.	`GET` then `PUT /memories/{id}`
`delete_memory`	Same ownership check → forward.	`GET` then `DELETE /memories/{id}`
`list_memory`	Forward with `user_id` filter; pass cursor.	`GET /memories?user_id=...`
`clear_all_memory`	Validate `confirm: true`; atomic 5/hour check via per-user Durable Object (KV is too eventually-consistent for a destructive op); forward.	`DELETE /memories?user_id=...`

Billing rule. billing.ts emits exactly one { operation, count: 1 } event per successful MCP tool call (mem0 returns 2xx and Worker post-processing — ownership check, decay, dedupe — succeeds). No event on any failure: rate limit reject (429), ownership mismatch / not_found, schema violation, payload too large, engine 5xx, timeout, mutex contention. Failed requests are free.
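The billing rule reduces to a pure decision: build an event only when both the engine call and the Worker's post-processing succeeded. The sketch below shows that decision in isolation. The `CallResult` shape and the function name are assumptions for illustration; the event fields follow the payload listed for billing.ts.

```typescript
type Operation =
  | "search_memory" | "remember" | "ingest" | "update_memory"
  | "delete_memory" | "list_memory" | "clear_all_memory";

interface UsageEvent {
  user_id: string;
  product_id: "memory";
  operation: Operation;
  count: 1;            // always one event per successful call, no credit field
  request_id: string;
}

// Hypothetical summary of how a tool call ended.
interface CallResult {
  engineStatus: number | null; // null: engine never reached (429, validation) or timed out
  postProcessingOk: boolean;   // ownership check, decay, dedupe all succeeded
}

// Returns the event to emit, or null when the call must stay free.
function usageEventFor(
  userId: string,
  operation: Operation,
  requestId: string,
  result: CallResult,
): UsageEvent | null {
  const engineOk =
    result.engineStatus !== null &&
    result.engineStatus >= 200 &&
    result.engineStatus < 300;
  if (!engineOk || !result.postProcessingOk) return null;
  return { user_id: userId, product_id: "memory", operation, count: 1, request_id: requestId };
}
```

Keeping the decision separate from the HTTP emit makes the "failed requests are free" property testable without a payment-gateway stub.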

What `infer=False` vs `infer=True` means in mem0

Mem0's add() API has two paths through it, controlled by the infer argument. The choice matters because it determines where the LLM call happens, which in turn determines who gets billed.

	`infer=False` — raw store	`infer=True` — full pipeline (mem0's default)
What it does	Each user message becomes one row, stored verbatim. System messages skipped.	Mem0 retrieves the user's top-10 existing memories, calls the LLM with `ADDITIVE_EXTRACTION_PROMPT` (new messages + existing memories + last-k context + profile summary). The LLM emits ADD/UPDATE/DELETE decisions; mem0 dispatches each.
LLM calls per `add()`	Zero.	One extraction call, plus embeddings for whatever was added/updated.
Dedup against existing memories	None. Same content twice = two rows.	The LLM is shown existing memories and decides whether to ADD a new row, UPDATE an existing one, or DELETE a stale fact.
Cost per call	Embedding only (~tiny).	Extraction LLM + embeddings (~30× the `infer=False` cost in our setup).
Quality	Stores whatever you sent. If you sent a curated fact, good. If you sent a raw turn, you get raw turn rows.	Cleans up the input — splits compound facts, consolidates with existing knowledge, reconciles updates.

Concrete example — same input, two outcomes

Input: [{user: "I just adopted a Welsh Corgi named Otis. He loves fetch."}, {user: "He's 8 weeks."}]. Existing memory for the user: "User has a dog named Otis."

With infer=False: two new rows stored verbatim. Existing "User has a dog named Otis" untouched. Total: 3 memories, partially redundant.
With infer=True: one LLM call extracts → UPDATE old="User has a dog named Otis" → "User adopted Otis, a Welsh Corgi, 8 weeks old" + ADD "Otis enjoys playing fetch". Total: 2 memories, deduplicated and consolidated.

Which path our two write tools use:

remember → infer=False. The agent has already curated a single fact. Running an LLM to "extract" it would paraphrase, split, or drop nuance. We just embed and store.
ingest → infer=True. Mem0 runs its full native pipeline against the raw turn — extraction, hash-dedup, entity store, batched embeds, prompt auto-updates. We get the quality of mem0's pipeline (verified in mem0/memory/main.py:699-971).

How per-user billing works under infer=True without re-implementing extraction in the Worker: mem0 talks to Google's Gemini API directly via its native gemini providers — Google bills the dedicated "memory" GCP project at exact-token rates. The Worker emits one usage event per successful MCP tool call (no emission on failure): { user_id, product_id: "memory", operation, count: 1 }. Memory bills by request (mem0's cost per call is bounded), not by tokens; payment-gateway maps operation → credit price via its own config.

Stop-hook wiring (the `ingest` path)

curl -sS -X POST "$GATEWAY_URL/engines/memory/mcp" \
  -H "Authorization: Bearer $USER_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d @- <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "ingest",
    "arguments": {
      "messages": [
        { "role": "user",      "content": "..." },
        { "role": "assistant", "content": "..." }
      ],
      "metadata": { "session_id": "abc123" }
    }
  }
}
EOF

Validation Findings

Validated end-to-end on 2026-05-18 / 19 against a local docker-compose stack: pgvector/pgvector:pg16 + mem0ai==2.0.2 + mem0's native provider: "gemini" pointed at Google's Gemini API with gemini-3-flash (LLM) and gemini-embedding-2 (embeds). Full add/search/update/delete cycle confirmed; example infer=True extraction from a three-message turn: "User adopted a Welsh Corgi named Otis around May 19, 2026, who loves playing fetch."

1. Mem0 has a native provider: "gemini" for both LLM (mem0/llms/gemini.py) and embedder (mem0/embeddings/gemini.py) — uses Google's google-genai SDK directly. The LLM provider translates response_format={"type": "json_object"} to Gemini-native response_mime_type: "application/json" (line 170 of gemini.py). Whitelist accepts "gemini" directly. This is what makes the engine 100% vanilla — no subclass, no factory-dict swap.

2. Mem0 v2 changed the search/get_all API to filters={...}. Calling mem0.search(query, user_id="alice") raises ValueError with the message "Top-level entity parameters not supported in search(). Use filters={'user_id': '...'} instead." The top-k argument is also named top_k, not limit. The Mem0Client in the Worker uses the v2 shape.

3. gemini-embedding-2 outputs 3072 dimensions natively. Probed directly via Google's Gemini API: the response includes a 3072-length vector. The model supports Matryoshka truncation via the output_dimensionality parameter (mem0's stock GoogleGenAIEmbedding passes it through from config.embedding_dims), but we use the native dimension in v1. pgvector's embedding_model_dims MUST match exactly, so it is set to 3072.

4. gemini-3-flash is a reasoning model. Responses include completion_tokens_details.reasoning_tokens in the usage block. A trivial echo prompt used 74 reasoning tokens + 5 output tokens. Mem0's default max_tokens=2000 may starve the model on a real ADDITIVE_EXTRACTION_PROMPT. Fix: set MEM0_LLM_MAX_TOKENS=8000 in the engine's mem0 config. Billing implications: Google bills the "memory" GCP project for the full completion line (reasoning included); the Worker emits one ingest usage event regardless of tokens consumed. Reasoning headroom is a quality concern, not a billing one.

5. pgvector's vector type caps HNSW indexes at 2000 dimensions. Native gemini-embedding-2 at 3072-d would fail table creation with column cannot have more than 2000 dimensions for hnsw index if HNSW were enabled. Resolution: ship v1 with hnsw=False — mem0's stock pgvector schema uses vector(3072) with no index, search runs exact sequential scan inside each user's partition. Fast enough at v1 per-user volumes; better recall than HNSW would give (seq scan is exact, HNSW is approximate). When a power user makes seq-scan latency painful, the upgrade path is to switch to halfvec(3072) + halfvec_cosine_ops + HNSW — a one-time online schema migration, not v1 work.

6. Mem0's update wipes metadata. Returns {"message": "Memory updated successfully!"} — not the row. Subsequent get(id) shows metadata: null even when the original had metadata. The Worker must explicitly re-pass metadata on every update to preserve it.
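Findings 2 and 6 both shape what the Worker's client has to send. The sketch below shows one way to build those two request bodies: a search body that uses `filters` and `top_k`, and an update body that always carries the existing metadata forward. The JSON field names `text` and `metadata` on the update body are assumptions about the REST payload, and the helper names are hypothetical.

```typescript
interface SearchBody {
  query: string;
  filters: { user_id: string }; // v2 shape: entity ids live under filters
  top_k: number;                // v2 name, not "limit"
}

function buildSearchBody(userId: string, query: string, topK: number): SearchBody {
  return { query, filters: { user_id: userId }, top_k: topK };
}

interface StoredMemory {
  id: string;
  memory: string;
  metadata: Record<string, unknown> | null;
}

// Assumed update payload; mem0 drops metadata that is not re-sent.
interface UpdateBody {
  text: string;
  metadata: Record<string, unknown>;
}

// Merge the row's current metadata with any caller patch so nothing is wiped.
function buildUpdateBody(
  existing: StoredMemory,
  newText: string,
  metadataPatch: Record<string, unknown> = {},
): UpdateBody {
  return {
    text: newText,
    metadata: { ...(existing.metadata ?? {}), ...metadataPatch },
  };
}
```

Because update_memory already does a GET for the ownership check, the row fetched there can supply `existing` without an extra round-trip.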

Mem0 response shapes (captured verbatim)

// add(infer=False)
{
  "results": [
    { "id": "...", "memory": "User is allergic to peanuts.",
      "event": "ADD", "actor_id": null, "role": "user" }
  ]
}

// add(infer=True) — three-message turn, one extracted fact
{
  "results": [
    { "id": "...",
      "memory": "User adopted a Welsh Corgi named Otis around May 19, 2026, who loves playing fetch.",
      "event": "ADD" }
  ]
}

// search — note score, metadata round-trips, role on user-source memories
{
  "results": [
    { "id": "...", "memory": "...", "hash": "...",
      "metadata": { "pinned": true, "source_type": "explicit" },
      "score": 0.8834746264057188,
      "created_at": "...", "updated_at": "...",
      "user_id": "...", "role": "user" }
  ]
}

// update — message only, no row data
{ "message": "Memory updated successfully!" }

Multi-tenancy And Security

Memories are separated by user. Three enforcement layers (scope derivation and request-body sanitization in the Worker, the admin-key gate at the engine), plus a Worker-side ownership re-check that substitutes for same-DB-transaction enforcement (which is no longer possible without engine code).

Type-enforced scope derivation
MemoryScope is a branded type whose only constructor reads JWT.sub post-verification. Mem0Client methods accept only MemoryScope. Passing a raw string is a compile-time error.
Request-body sanitization
Any user_id-shaped field in MCP tool arguments is dropped before forwarding. Only the server-derived scope reaches the engine.
Admin API key gate at the engine
Inbound calls must carry X-API-Key: <MEM0_ADMIN_API_KEY>, matched by mem0's stock verify_auth via secrets.compare_digest (constant-time). The key is a Wrangler secret on the Worker side and a Railway env var on the engine side — same shared-secret posture as USAGE_INGEST_API_KEY. Non-gateway callers cannot reach the engine.
Worker-side ownership re-check + per-id mutex
Update / delete / clear_all do GET /memories/{id} → verify user_id == scope.userId → mutate, wrapped by a KV mutex (SETNX svc_mutex:mem:{id} 2-second TTL) so concurrent Worker requests on the same id serialize. Eliminates the TOCTOU race that retry-on-fetch would otherwise expose. Cross-user id returns not_found.
Worker is the sole billing emitter
Every per-user usage event to payment-gateway originates from the Worker. The engine never reaches payment-gateway and only sees the user_id string as request-body metadata. A Railway env leak (GOOGLE_API_KEY_MEMORY or MEM0_ADMIN_API_KEY) lets an attacker hit Google and our engine until we rotate, but cannot mint usage events for arbitrary users.
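The Worker-side layers above (scope derivation, body sanitization, ownership re-check) can be illustrated together. This is a sketch under assumptions: the brand is a simple string-literal field, the list of "user_id-shaped" keys is a guess, and every name is hypothetical rather than taken from scope.ts or client.ts.

```typescript
// Branded type: a plain string is not assignable where MemoryScope is required.
type MemoryScope = { readonly userId: string; readonly __brand: "MemoryScope" };

interface VerifiedClaims { sub: string; azp?: string }

// Intended as the only place a MemoryScope is created, after JWT verification.
function scopeFromVerifiedClaims(claims: VerifiedClaims): MemoryScope {
  if (claims.sub.length === 0) throw new Error("JWT has no subject");
  return { userId: claims.sub, __brand: "MemoryScope" };
}

// Assumption: these are the fields treated as "user_id-shaped".
const SCOPE_KEYS = ["user_id", "userId", "agent_id", "run_id"];

// Drop caller-supplied identity fields so only the derived scope reaches the engine.
function sanitizeArgs(args: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    if (SCOPE_KEYS.indexOf(key) === -1) clean[key] = args[key];
  }
  return clean;
}

// Ownership re-check: a missing row and another user's row look identical (not_found).
function ownsMemory(scope: MemoryScope, row: { user_id: string } | null): boolean {
  return row !== null && row.user_id === scope.userId;
}
```

A client method typed as `(scope: MemoryScope, ...)` then rejects `client.search("alice", ...)` at compile time, which is the property the type-level test in the rollout plan checks.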

Mem0 itself partitions by user_id in its pgvector schema; every query requires at least one of user_id / agent_id / run_id and refuses to run without it. We rely on user_id only.

Hop	Auth	Notes
Client → ai-gateway	User JWT (Keycloak) or `sk-cs-*` API key	Existing `identity()` middleware.
ai-gateway → engine	`X-API-Key: <MEM0_ADMIN_API_KEY>` (mem0's stock admin path)	Constant-time match via `secrets.compare_digest`; vanilla mem0, no source patch.
engine → Google Gemini API (LLM + embeddings under `infer=True`)	`GOOGLE_API_KEY_MEMORY` (engine's own Google API key, "memory" GCP project)	Bills the GCP project at exact-token rates via mem0's native `google-genai` SDK.
Worker → payment-gateway (per-user usage events)	`USAGE_INGEST_API_KEY` (existing)	Worker emits one `{operation, count:1}` event per successful MCP tool call (no emission on failure). Credit pricing lives on payment-gateway.

Hardening Before External Users

V1 ships with internal users only. Before flipping the switch to external, these items must land. They are not blocking the initial implementation but ARE blocking the external GA gate.

Audit log on destructive ops
V1 has none — debugging "where did my memory go?" requires SSH-ing into Railway. Before external users, add a Worker-side audit table (one row per update/delete/clear_all) with nightly export to R2 object-lock for HIPAA/SOC2-Type-II evidence.
Postgres-backed history
Mem0 history is currently disabled (":memory:"). Before external users, a ~50 LOC custom PostgresHistoryManager subclass restores content-diff forensics without re-introducing a SQLite volume.
Eval recall@5 stable for 14 days
At staging-traffic volume; size the eval set to detect a 10pp regression with significance.
Tighter DR posture
Either move Postgres backups to WAL-G shipping WAL to R2 every 5 min (5-min RPO, off-Railway redundancy), or wait for Railway's native PITR if/when it ships. v1's 24-hour Railway-native RPO is below the bar for irreversible personal-memory data at external-user scale.
Successful DR drill
RTO ≤ 4 h, RPO meeting the new posture.

What Has No Worker-side Alternative

Two things genuinely cannot be recovered on the Worker side under pure-mem0. Calling them out explicitly so future revisits know what they cost and how to bring them back.

1. BM25 keyword scoring (part of hybrid search). Needs direct Postgres access from the Worker (ts_rank_cd over the tsvector index mem0 already creates). Workers cannot speak Postgres without Cloudflare Hyperdrive + Railway external connectivity, which is new infra not in v1 scope. v1 ships vector + recency decay only. If eval shows recall-at-top-5 regresses materially on exact-token queries (proper nouns, code identifiers, unusual phrases), the recovery paths are a reranker (Cohere/Jina, deferred) or adding Hyperdrive to let the Worker run BM25 directly.

2. Engine-internal metrics. Mem0's stock server does not expose Prometheus, and we are not adding wrapper code that could. We lose engine-side request latency histograms, Postgres pool saturation, embedder/LLM call timing inside mem0. Root-causing mem0-internal slowness requires SSH-ing into Railway and pulling stdout/Postgres logs by hand. Acceptable for v1 internal traffic; revisit if engine perf becomes the binding constraint.

Everything else (ownership check, per-user usage emission, rate limit, dedupe, recency decay) has a Worker-side alternative — see the Components and Multi-tenancy sections.

Deployment And DR

State	Production backing	Rule
Mem0 vectors + payloads	Railway Postgres with `pgvector`	`vector(3072)`, `hnsw=False`, exact sequential scan per user partition. Native `gemini-embedding-2` dim preserved.
Mem0 history	Disabled (in-memory)	No reader in v1; revisit `PostgresHistoryManager` subclass before external users.
Postgres backups	Railway native scheduled (daily)	6-day retention per Railway docs. RPO ~24 h. Tighter posture deferred to before-external-GA hardening.
Dedupe table	D1 (existing ai-gateway DB)	One migration: `memory_content_hashes`. Worker-managed.
Engine LLM + embedding calls (mem0 internal under `infer=True`)	Google Gemini API directly	Engine key `GOOGLE_API_KEY_MEMORY`; charges the dedicated "memory" GCP project at exact-token rates via mem0's native `google-genai` SDK. Worker emits one `{operation, count:1}` usage event per successful MCP tool call; credit pricing lives on payment-gateway.

v1 DR posture: Railway native backups only. Railway runs scheduled snapshots (daily, 6-day retention per their docs) — periodic full backups, no point-in-time recovery, no off-platform redundancy. RPO is ~24 hours: a bad migration or accidental DELETE can lose up to a full day of memory writes. Two targets describe the quality of this DR posture:

RPO (Recovery Point Objective) = 24 hours — worst-case data loss between daily snapshots.
RTO (Recovery Time Objective) = 4 hours — wall-clock budget to restore the latest snapshot and cut over.

This is acceptable for v1 internal users. Before external-user GA, the Hardening section calls for tightening to either WAL-G shipping WAL to R2 every 5 min (5-min RPO + off-platform redundancy) or Railway's PITR offering if it ships in the meantime.

Required staging validation

Round-trip
remember → search_memory → update_memory → delete_memory.
Cross-user attempt
From a fresh JWT, try to read/update/delete another user's id; must receive not_found.
Recreate engine service
Tear down the FastAPI service while keeping Postgres; redeploy; verify memories survive (no volume to lose).
Quarterly DR drill
Restore the latest snapshot into a scratch Railway project (plus WAL replay once the tighter DR posture lands; v1 has snapshots only); run eval suite; record actual recovery time.

Implementation Libraries

Versions pinned to the latest stable as of 2026-05-18.

ai-gateway (TypeScript, Cloudflare Worker)

Concern	Library	Latest	Why this lib
MCP transport	`@modelcontextprotocol/sdk`	1.29.0	Reference SDK; same one image-gen uses.
Schemas	`zod`	4.4.3	MCP SDK + Hono target zod.
JWT / JWKS	`jose`	6.2.3	Pure Web Crypto, runs in V8 isolates.
HTTP server	`hono`	4.12.19	Workers-native router; existing gateway uses it.

Memory engine (Python — zero of our own code)

The engine is the vanilla mem0ai/mem0 Docker image plus an init.sql. We write zero lines of Python.

Concern	Library	Latest	Why this lib
Memory primitives + REST server + admin auth + Gemini providers	`mem0ai`	2.0.2	Native `ADMIN_API_KEY` auth; native `provider: "gemini"` for both LLM and embedder. No wrapper, no subclass, no fork.

Testing And Rollout

ai-gateway tests (vitest)

getMemoryScope produces MemoryScope only from JWT.sub.
Mem0Client rejects non-MemoryScope args (type-level).
Mem0Client attaches X-API-Key on every request; never sends Authorization.
Content-hash dedupe: D1 hit → no engine call; miss → forward + write hash only on mem0 2xx.
Ownership check: cross-user id → not_found (engine never called).
Rate limits: 31st remember in a minute → 429; clock advances → reset.
billing.ts emits exactly one usage event per successful mem0 round-trip; event shape { user_id, product_id: "memory", operation, count: 1, request_id } (no credit field — pricing on payment-gateway); no emission on engine 5xx; D1 outbox catches payment-gateway 5xx.
Recency decay: 60-day-old at 0.85 ranks below 1-day-old at 0.80; pinned: true does not decay.
Engine 5xx → backend_unavailable.
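The rate-limit test case above ("31st remember in a minute → 429; clock advances → reset") implies a windowed counter. The sketch below uses a fixed window as one possible reading; the spec does not say whether the window is fixed or sliding, and the state shape and function name are hypothetical. The caller would persist `state` in KV between requests.

```typescript
interface WindowState { windowStart: number; count: number }

interface RateDecision {
  allowed: boolean;
  retryAfterSec: number; // value for the Retry-After header when rejected
  state: WindowState;    // state to persist for the next request
}

// Fixed-window counter: at most `limit` calls per `windowMs`.
function checkRateLimit(
  prev: WindowState | null,
  nowMs: number,
  limit: number,
  windowMs: number,
): RateDecision {
  let state: WindowState;
  if (prev === null || nowMs - prev.windowStart >= windowMs) {
    state = { windowStart: nowMs, count: 0 }; // first call or window expired
  } else {
    state = prev;
  }
  if (state.count >= limit) {
    const retryAfterSec = Math.ceil((state.windowStart + windowMs - nowMs) / 1000);
    return { allowed: false, retryAfterSec, state };
  }
  return {
    allowed: true,
    retryAfterSec: 0,
    state: { windowStart: state.windowStart, count: state.count + 1 },
  };
}
```

Passing the clock in as `nowMs` is what lets a vitest case advance time without timers.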

Engine tests (pytest)

End-to-end add(infer=True) against the dev Postgres + Google Gemini returns a non-empty results array with event: "ADD".
init.sql is idempotent: running twice doesn't duplicate the User row or the vector extension.
Mem0's native gemini provider translates response_format={"type": "json_object"} → Gemini's response_mime_type: "application/json" (parity smoke test against mem0/llms/gemini.py:170).
hnsw=False schema check: \d memories shows vector column with no HNSW index. Search latency at 1k rows < 100 ms p99.
Wired-up admin-auth integration test against built Docker image: missing X-API-Key → 401; bogus key → 401; real MEM0_ADMIN_API_KEY → 2xx with event: "ADD".

Eval harness (infra/mem0/eval/golden.jsonl): 50–100 hand-crafted memory/query pairs. CI runs the suite on PRs; recall@5 must not regress by more than 5 percentage points vs main.
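The CI gate can be read as two small functions: compute recall@k over the golden set, then compare against main. The record shape of golden.jsonl is not specified here, so `GoldenCase` (one expected memory id per query) is an assumption, as are the function names.

```typescript
// Hypothetical shape of one golden.jsonl record.
interface GoldenCase { query: string; expectedId: string }

// Fraction of cases whose expected memory appears in the top-k results.
function recallAtK(
  cases: GoldenCase[],
  retrieve: (query: string) => string[], // returns ranked memory ids
  k: number,
): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    if (retrieve(c.query).slice(0, k).indexOf(c.expectedId) !== -1) hits += 1;
  }
  return hits / cases.length;
}

// Gate: fail when the PR drops recall by more than maxDropPp percentage points.
function passesRecallGate(mainRecall: number, prRecall: number, maxDropPp: number = 5): boolean {
  // Small epsilon so a drop of exactly 5 points is not failed by float rounding.
  return (mainRecall - prRecall) * 100 <= maxDropPp + 1e-9;
}
```

With 50 to 100 cases, a single case is worth 1 to 2 percentage points, so a 5-point budget tolerates a handful of misses before CI fails.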

Provision per env
Generate MEM0_ADMIN_API_KEY (32-byte random); mint a Google API key (GOOGLE_API_KEY_MEMORY) in a dedicated "memory" GCP project restricted to the Generative Language API; Railway project with Postgres and Railway native daily backups enabled; run init.sql against Postgres; deploy the vanilla mem0ai/mem0 Docker image.
Land code and smoke-test dev
Catalog shows memory; MCP tools list at /engines/memory/mcp; full cycle works; D1 dedupe rows appear; payment-gateway shows one {operation, count:1} usage event per successful MCP tool call (failures emit nothing).
Promote to staging via main push
Repeat smoke + DR drill.
Production with internal users
Gate external rollout on the tighter DR posture from Hardening landing (WAL-G to R2 or Railway PITR) and eval recall@5 stable for 14 days.

Source markdown: docs/superpowers/specs/2026-05-16-memory-engine-design.md in ai-gateway
Companion specs: Sarea Memory System · Memory Candidate Validation
