Design Spec · Draft v1

Memory Engine

2026-05-16 · branch feat-mem-mcp · memory v1

TL;DR

Personal memory as a first-class engine on the ai-gateway. The Cloudflare Worker is the control plane and the sole billing emitter — multi-tenancy, rate limits, ownership re-check, per-user usage events to payment-gateway. The Railway side runs 100% vanilla mem0ai/mem0 — no source patch, no extension files, no factory-dict swaps. Configuration only, plus one init.sql that seeds the pgvector extension and mem0's default User row. Worker→engine auth uses mem0's stock ADMIN_API_KEY path: X-API-Key: <MEM0_ADMIN_API_KEY> matched via secrets.compare_digest. Mem0 talks to Google's Gemini API directly via its native provider: "gemini" (uses the google-genai SDK, properly translates response_format → Gemini's response_mime_type). Vector store: native 3072-d gemini embeddings stored as vector(3072) with hnsw=False — exact sequential scan inside each user's partition, best retrieval quality, fast enough at v1 per-user volumes. Billing is by request count: the Worker emits one { operation, count: 1 } usage event per successful MCP tool call. Credit pricing per operation lives on payment-gateway, not in the Worker. MCP at /engines/memory/mcp matches the existing /engines/<id>/mcp convention.

V1 scope (ship)

  • Memory MCP at /engines/memory/mcp; seven tools.
  • Two write modes via MCP: remember (agent-curated) and ingest (turn capture for runtime hooks).
  • 100% vanilla mem0ai/mem0 on Railway. No source patch, no extension files. Config only + one init.sql.
  • Mem0 talks to Google's Gemini API directly via its native provider: "gemini" (google-genai SDK).
  • Native 3072-d gemini embeddings, vector(3072), hnsw=False — exact sequential scan, best retrieval quality.
  • Worker→engine auth: X-API-Key: <MEM0_ADMIN_API_KEY> — mem0's stock admin path.
  • Worker control plane: content-hash dedupe (for remember), ownership re-check, rate limits, recency decay.
  • Per-user billing: one usage event per successful MCP tool call with { operation, count: 1 }. Failed requests (4xx/5xx, rate limits, ownership rejects) emit nothing. Memory bills by request, not by tokens (mem0's cost per call is bounded). Credit pricing per operation lives on payment-gateway, not in the Worker.
  • Disaster recovery: Railway native scheduled backups (daily, 6-day retention). Target RTO 4 h, RPO 24 h. Tighter posture (5-min RPO via WAL-G to R2 or Railway PITR) deferred to before-external-GA hardening.

V1 non-goals (defer)

  • No preferences toggle. Opt-in is by wiring up the stop hook (or not).
  • No browser-style REST endpoints. Agents handle list/search/delete via MCP.
  • No hybrid search (BM25 + RRF). Vector + recency decay only in v1.
  • No reranker. Pilot once eval data accumulates.
  • No persistent volume. Mem0 history is disabled; revisit before external users.
  • No organization memory or admin controls.

Two ways memory is populated, both via MCP:
  1. Agent-curated: agent calls remember({memory: "..."}) during reasoning. Worker hashes, checks D1 for prior hash hit, forwards to mem0 POST /memories with infer=False. No LLM call.
  2. Turn capture: a runtime hook calls ingest({messages: [...]}). Worker forwards to mem0.add(messages, user_id=..., infer=True). Mem0 runs its full native pipeline (top-10 search, LLM extraction with ADDITIVE_EXTRACTION_PROMPT, cross-memory hash dedup, batched embeddings, entity store with linked_memory_ids, prompt auto-updates). Mem0's LLM and embedder calls go to Google's Gemini API directly via its native gemini providers with the engine's GOOGLE_API_KEY_MEMORY; Google bills the dedicated "memory" GCP project at exact-token rates. After mem0 returns 2xx, the Worker emits one { operation: "ingest", count: 1 } usage event to payment-gateway. One round-trip per ingest.
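
The agent-curated path (hash, check for a prior hit, forward only on a miss) can be sketched as below. This is a minimal illustration rather than the gateway's dedupe.ts: HashStore, EngineClient, and the injected Hasher are hypothetical stand-ins for the D1 table, the mem0 HTTP client, and a SHA-256 digest, and scoping the hash key by user id is an assumption.

```typescript
// Hypothetical stand-in for the D1 hash table.
interface HashStore {
  get(key: string): Promise<string | undefined>; // memory id on a hash hit
  put(key: string, memoryId: string): Promise<void>;
  delete(key: string): Promise<void>;
}

// Hypothetical stand-in for the mem0 HTTP client.
interface EngineClient {
  exists(memoryId: string): Promise<boolean>; // cheap GET; false on 404
  add(userId: string, memory: string): Promise<string>; // infer=False write; resolves to the new id on 2xx, rejects otherwise
}

// SHA-256 in the real Worker (assumed); injected so the flow stays testable.
type Hasher = (input: string) => Promise<string>;

async function rememberWithDedupe(
  userId: string,
  memory: string,
  store: HashStore,
  engine: EngineClient,
  hash: Hasher,
): Promise<{ id: string; deduped: boolean }> {
  // Assumption: the key is scoped per user so identical text from two users never collides.
  const key = `${userId}:${await hash(memory)}`;
  const priorId = await store.get(key);
  if (priorId !== undefined) {
    if (await engine.exists(priorId)) {
      return { id: priorId, deduped: true }; // hash hit and the mem0 row still exists
    }
    await store.delete(key); // orphaned hash: drop it and fall through to a fresh write
  }
  const id = await engine.add(userId, memory); // rejects on non-2xx, so no hash is written
  await store.put(key, id); // write the hash only after mem0 succeeded
  return { id, deduped: false };
}
```

Writing the hash only after the engine call succeeds means a failed write can be retried without being mistaken for a duplicate.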

Architecture

Runtime path
Components (diagram flattened to text):

  • ai-gateway (Cloudflare Worker): control plane and sole billing emitter. /engines/memory/mcp (7 MCP tools) · /catalog · MemoryScope branded type from JWT.sub · Ownership re-check on update/delete/clear · Per-user rate limits (GATEWAY_KV) · Recency decay on search · Per-request usage event emission ({ user_id, product_id: "memory", operation, count: 1 }).
  • Memory engine (Railway): 100% vanilla mem0ai/mem0; no source patch, no extension files; config only, plus init.sql (CREATE EXTENSION vector; default User row for ADMIN_API_KEY) · provider: gemini · hnsw=False · vector(3072) native, exact seq scan · Postgres + pgvector + Railway backups.
  • payment-gateway: /usage/ingest · count: 1 per call.
  • Google Gemini API: "memory" GCP project, exact-token rates.
  • Railway Postgres + pgvector: vector(3072) · seq scan · daily backups.

Edges:

  • ai-gateway → memory engine: X-API-Key: MEM0_ADMIN_API_KEY
  • memory engine → Google Gemini API: x-goog-api-key: GOOGLE_API_KEY_MEMORY (mem0's native google-genai SDK)
  • ai-gateway → payment-gateway: USAGE_INGEST_API_KEY (existing path)

Trust boundary. ai-gateway is the only caller of the memory engine. The engine accepts only inbound requests carrying X-API-Key: <MEM0_ADMIN_API_KEY>, matched via secrets.compare_digest by mem0's stock verify_auth. All multi-tenancy enforcement and per-user usage emission live in the Worker. Mem0 talks to Google's Gemini API directly via its native provider: "gemini" using the engine's GOOGLE_API_KEY_MEMORY; Google bills the dedicated "memory" GCP project at exact-token rates. Billing is by request count: the Worker emits one { operation, count: 1 } usage event per successful MCP tool call via the existing USAGE_INGEST_API_KEY path; failed requests are free. Credit pricing lives on payment-gateway. No engine-side billing machinery; no custom auth file; no subclasses; 100% vanilla mem0.
No circular dependency. Mem0's internal LLM/embedding calls go to Google's Gemini API directly via its native gemini providers (uses the google-genai SDK). Google bills the dedicated "memory" GCP project. The Worker emits one usage event per successful MCP tool call (no event on failure) via the existing USAGE_INGEST_API_KEY path — payload { operation, count: 1 }.

ai-gateway owns (control plane)

  • MCP transport, JWT verification, scope derivation.
  • Content-hash dedupe before forwarding remember.
  • Ownership re-check on every mutation (read → verify → mutate).
  • Per-user rate limits.
  • Recency decay on search results.
  • Emits one { operation, count: 1 } usage event per successful MCP tool call — sole billing emitter, no engine-side billing code. Pricing per operation is configured on payment-gateway.

Memory engine owns (100% vanilla mem0)

  • Mem0's full native pipeline under infer=True: extraction with ADDITIVE_EXTRACTION_PROMPT, hash dedup, batched embeddings, entity store with linked_memory_ids.
  • Auth via mem0's stock ADMIN_API_KEY path (secrets.compare_digest); no custom verifier.
  • LLM + embedder via mem0's native provider: "gemini" (google-genai SDK). response_format → Gemini's response_mime_type handled inside mem0's GeminiLLM (line 170).
  • Vector store: vector(3072), hnsw=False, exact sequential scan inside each user's partition. Best retrieval quality.
  • init.sql seeds the vector extension and the default User row idempotently.
  • Railway native scheduled backups (daily, 6-day retention).

Components

ai-gateway changes (src/)

Path | Responsibility
src/mcp/engines/memory/ | Memory MCP server (image-gen pattern). Hosts all seven memory tools. Exports memoryServer and memoryMetadata.
src/routes/engines/index.ts | One line added: enginesApp.route(`/${memoryMetadata.id}`, memoryServer). Same pattern as image-gen.
src/mcp/engines/memory/scope.ts | Branded MemoryScope type; constructor reads JWT.sub + JWT.azp after verification.
src/mcp/engines/memory/client.ts | Thin HTTP client around mem0's stock REST API. Methods take scope: MemoryScope. Attaches X-API-Key: <MEM0_ADMIN_API_KEY> (Wrangler secret) on every call. No wrapping endpoint shape; no token-exchange hop.
src/mcp/engines/memory/dedupe.ts | Light idempotency for remember only (not ingest, where mem0 handles dedup natively under infer=True). SHA-256 → D1 check; write hash after mem0 success; verify mem0 row on hash hit; nightly TTL sweep.
src/mcp/engines/memory/ratelimit.ts | Per-user windows in GATEWAY_KV. 429 + Retry-After.
src/mcp/engines/memory/search.ts | Recency decay on returned scores; re-sort; return top-k. Honors metadata.pinned.
src/mcp/engines/memory/billing.ts | After each successful mem0 round-trip, emits one usage event to payment-gateway via USAGE_INGEST_API_KEY with payload { user_id, product_id: "memory", operation, count: 1, request_id }. No credit field: pricing per operation lives on payment-gateway. D1 outbox fallback on payment-gateway 5xx.
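
The recency-decay step in search.ts could look roughly like the sketch below. The spec does not state the decay function, so the exponential form, the 30-day half-life, and the use of createdAt as the age reference are all assumptions made for illustration.

```typescript
interface ScoredMemory {
  id: string;
  score: number; // similarity score as returned by mem0
  createdAt: Date; // assumed age reference; the spec does not say which timestamp is used
  pinned: boolean; // metadata.pinned
}

const HALF_LIFE_DAYS = 30; // assumed value, not taken from the spec
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function applyRecencyDecay(rows: ScoredMemory[], now: Date, topK: number): ScoredMemory[] {
  return rows
    .map((row) => {
      if (row.pinned) return row; // pinned rows keep their raw score
      const ageDays = Math.max(0, (now.getTime() - row.createdAt.getTime()) / MS_PER_DAY);
      // Halve the score for every HALF_LIFE_DAYS of age.
      return { ...row, score: row.score * Math.pow(0.5, ageDays / HALF_LIFE_DAYS) };
    })
    .sort((a, b) => b.score - a.score) // re-sort by decayed score, highest first
    .slice(0, topK);
}
```

Under these assumed parameters a 60-day-old row scored 0.85 decays to about 0.21 and ranks below a 1-day-old row scored 0.80, while a pinned row keeps its original score.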

Memory engine (infra/mem0/) — 100% vanilla mem0

The engine is the vanilla mem0ai/mem0 Docker image, unmodified. No COPY over source. No Python extension files. No factory-dict swaps. Zero lines of our own Python. Two things make this possible:

  1. Mem0 has a native provider: "gemini" for both LLM and embedder (uses Google's google-genai SDK). It translates response_format={"type": "json_object"} to Gemini-native response_mime_type: "application/json", so no subclass is needed to coerce JSON output.
  2. hnsw=False uses mem0's stock pgvector schema (vector(<dim>)) with no index — exact sequential scan over the user's partition. Avoids pgvector's 2000-d HNSW cap, so no schema subclass is needed for native 3072-d embeddings. At v1 per-user volumes (hundreds to low thousands of memories) seq scan is sub-millisecond and gives better recall than HNSW would (exact vs. approximate).

File | Purpose
Dockerfile | FROM mem0ai/mem0@sha256:<digest> (pinned by digest, not tag). Vanilla: no COPY, no RUN pip install beyond mem0's defaults.
init.sql | Runs once against the Postgres add-on after provision: CREATE EXTENSION IF NOT EXISTS vector; + INSERT INTO users (username) VALUES ('memory-engine') ON CONFLICT DO NOTHING;. Idempotent. The User row is the single account mem0's require_auth resolves to under ADMIN_API_KEY auth.
mem0.config.json / env | Mem0's native config: provider: "gemini" for both LLM and embedder; model gemini-3-flash with max_tokens=8000 for reasoning headroom; embedder model models/gemini-embedding-2; embedding_dims=3072; vector store pgvector with hnsw=False, embedding_model_dims=3072, cosine distance. History disabled.
Dockerfile HEALTHCHECK | Boot-time self-test: container fires an unauthenticated GET /memories against itself and fails liveness on anything other than 401. Catches an upstream AUTH_DISABLED default flip or an accidentally-empty MEM0_ADMIN_API_KEY that would otherwise open the engine to the public.
tests/test_admin_auth_wired.py | CI integration test: boots the built Docker image; sends POST /memories with no X-API-Key, with a bogus key, and with the real MEM0_ADMIN_API_KEY; asserts 401/401/2xx. Failure blocks the Railway deploy.
railway.json | One service + Postgres add-on. init.sql mounted as a post-provision migration. No persistent volume.
README.md | Deploy / rotate / backup / restore / DR-drill runbook.

MCP Tool Surface

All seven tools live at /engines/memory/mcp, mounted via enginesApp exactly like image-gen. The Worker translates each tool to direct calls against mem0's native REST endpoints.

Tool | Worker behavior | Mem0 call
search_memory | Forward query; apply recency decay to non-pinned rows on the response. Worker emits one {operation: "search_memory", count: 1} event after success. | POST /memories/search
remember | SHA-256 content hash → D1 check. Hit → verify mem0 row exists (cheap GET); if 404 delete orphan + fall through. Miss → forward; write hash only on mem0 2xx. Optional pinned: true opts out of recency decay. Worker emits one {operation: "remember", count: 1} event after success. | POST /memories (infer=False); single curated fact, no extraction needed
ingest | Forward to mem0 with infer=True. Mem0 runs its full native pipeline (top-10 search, LLM extraction, hash-dedup, batched embed + insert, entity store). Worker emits one {operation: "ingest", count: 1} event after mem0 returns 2xx. | One POST /memories with infer=True
update_memory | GET /memories/{id} → verify user_id matches scope → forward. | GET then PUT /memories/{id}
delete_memory | Same ownership check → forward. | GET then DELETE /memories/{id}
list_memory | Forward with user_id filter; pass cursor. | GET /memories?user_id=...
clear_all_memory | Validate confirm: true; atomic 5/hour check via per-user Durable Object (KV is too eventually-consistent for a destructive op); forward. | DELETE /memories?user_id=...

Billing rule. billing.ts emits exactly one { operation, count: 1 } event per successful MCP tool call (mem0 returns 2xx and Worker post-processing — ownership check, decay, dedupe — succeeds). No event on any failure: rate limit reject (429), ownership mismatch / not_found, schema violation, payload too large, engine 5xx, timeout, mutex contention. Failed requests are free.
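
The billing rule reduces to a single decision: build one event on full success, nothing otherwise. A minimal sketch follows, assuming a hypothetical ToolOutcome record that summarizes what happened to the call; the function name and that record are illustrative, while the event fields follow the payload described in this spec.

```typescript
type Operation =
  | "search_memory"
  | "remember"
  | "ingest"
  | "update_memory"
  | "delete_memory"
  | "list_memory"
  | "clear_all_memory";

interface UsageEvent {
  user_id: string;
  product_id: "memory";
  operation: Operation;
  count: 1; // always one per successful tool call; pricing lives on payment-gateway
  request_id: string;
}

// Hypothetical summary of a tool call, used only for this illustration.
interface ToolOutcome {
  engineStatus: number | null; // null when the request was rejected before reaching mem0 (e.g. 429)
  postProcessingOk: boolean; // ownership check, decay, and dedupe all succeeded
}

function usageEventFor(
  userId: string,
  operation: Operation,
  requestId: string,
  outcome: ToolOutcome,
): UsageEvent | null {
  const status = outcome.engineStatus;
  const engineOk = status !== null && status >= 200 && status < 300;
  if (!engineOk || !outcome.postProcessingOk) {
    return null; // failed requests are free: emit nothing
  }
  return { user_id: userId, product_id: "memory", operation, count: 1, request_id: requestId };
}
```

Returning null rather than throwing keeps the "no event on failure" path explicit at the call site.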

What infer=False vs infer=True means in mem0

Mem0's add() API has two paths through it, controlled by the infer argument. The choice matters because it determines where the LLM call happens, which in turn determines who gets billed.

Aspect | infer=False (raw store) | infer=True (full pipeline, mem0's default)
What it does | Each user message becomes one row, stored verbatim. System messages skipped. | Mem0 retrieves the user's top-10 existing memories, calls the LLM with ADDITIVE_EXTRACTION_PROMPT (new messages + existing memories + last-k context + profile summary). The LLM emits ADD/UPDATE/DELETE decisions; mem0 dispatches each.
LLM calls per add() | Zero. | One extraction call, plus embeddings for whatever was added/updated.
Dedup against existing memories | None. Same content twice = two rows. | The LLM is shown existing memories and decides whether to ADD a new row, UPDATE an existing one, or DELETE a stale fact.
Cost per call | Embedding only (~tiny). | Extraction LLM + embeddings (~30× the infer=False cost in our setup).
Quality | Stores whatever you sent. If you sent a curated fact, good. If you sent a raw turn, you get raw turn rows. | Cleans up the input: splits compound facts, consolidates with existing knowledge, reconciles updates.

Concrete example — same input, two outcomes

Input: [{user: "I just adopted a Welsh Corgi named Otis. He loves fetch."}, {user: "He's 8 weeks."}]. Existing memory for the user: "User has a dog named Otis."

Which path our two write tools use:
  • remember → infer=False. The agent has already curated a single fact. Running an LLM to "extract" it would paraphrase, split, or drop nuance. We just embed and store.
  • ingest → infer=True. Mem0 runs its full native pipeline against the raw turn: extraction, hash-dedup, entity store, batched embeds, prompt auto-updates. We get the quality of mem0's pipeline (verified in mem0/memory/main.py:699-971).

How per-user billing works under infer=True without re-implementing extraction in the Worker: mem0 talks to Google's Gemini API directly via its native gemini providers, and Google bills the dedicated "memory" GCP project at exact-token rates. The Worker emits one usage event per successful MCP tool call (no emission on failure): { user_id, product_id: "memory", operation, count: 1 }. Memory bills by request (mem0's cost per call is bounded), not by tokens; payment-gateway maps operation → credit price via its own config.

Stop-hook wiring (the ingest path)

curl -sS -X POST "$GATEWAY_URL/engines/memory/mcp" \
  -H "Authorization: Bearer $USER_TOKEN" \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "ingest",
    "arguments": {
      "messages": [
        { "role": "user",      "content": "..." },
        { "role": "assistant", "content": "..." }
      ],
      "metadata": { "session_id": "abc123" }
    }
  }
}
EOF
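
For hooks written in TypeScript rather than shell, the same JSON-RPC envelope can be built with a small typed helper. This is a sketch that mirrors the curl payload above; the helper name buildIngestCall and the type names are illustrative, not part of the gateway.

```typescript
type Role = "user" | "assistant";

interface TurnMessage {
  role: Role;
  content: string;
}

interface IngestCall {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: "ingest";
    arguments: {
      messages: TurnMessage[];
      metadata: { session_id: string };
    };
  };
}

// Builds the request body a stop hook would POST to /engines/memory/mcp.
function buildIngestCall(id: number, messages: TurnMessage[], sessionId: string): IngestCall {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "ingest",
      arguments: { messages, metadata: { session_id: sessionId } },
    },
  };
}
```

The serialized result of buildIngestCall is the body of the POST, sent with the same Authorization and Content-Type headers as the curl example.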

Validation Findings

Validated end-to-end on 2026-05-18 / 19 against a local docker-compose stack: pgvector/pgvector:pg16 + mem0ai==2.0.2 + mem0's native provider: "gemini" pointed at Google's Gemini API with gemini-3-flash (LLM) and gemini-embedding-2 (embeds). Full add/search/update/delete cycle confirmed; example infer=True extraction from a three-message turn: "User adopted a Welsh Corgi named Otis around May 19, 2026, who loves playing fetch."

1. Mem0 has a native provider: "gemini" for both LLM (mem0/llms/gemini.py) and embedder (mem0/embeddings/gemini.py) — uses Google's google-genai SDK directly. The LLM provider translates response_format={"type": "json_object"} to Gemini-native response_mime_type: "application/json" (line 170 of gemini.py). Whitelist accepts "gemini" directly. This is what makes the engine 100% vanilla — no subclass, no factory-dict swap.
2. Mem0 v2 changed search/get_all API to filters={...}. mem0.search(query, user_id=alice) raises ValueError: Top-level entity parameters not supported in search(). Use filters={'user_id': '...'} instead. Also the top-k arg is top_k, not limit. The Mem0Client in the Worker uses the v2 shape.
3. google/gemini-embedding-2 outputs 3072 dimensions natively. Probed directly via Google's Gemini API — response includes 3072-length vector. The model supports Matryoshka via the output_dimensionality parameter (mem0's stock GoogleGenAIEmbedding passes it through from config.embedding_dims), but we use the native dim in v1. pgvector's embedding_model_dims MUST match exactly — set to 3072.
4. google/gemini-3-flash is a reasoning model. Responses include completion_tokens_details.reasoning_tokens in the usage block. A trivial echo prompt used 74 reasoning tokens + 5 output tokens. Mem0's default max_tokens=2000 may starve the model on a real ADDITIVE_EXTRACTION_PROMPT. Fix: set MEM0_LLM_MAX_TOKENS=8000 in the engine's mem0 config. Billing implications: Google bills the "memory" GCP project for the full completion line (reasoning included); the Worker emits one ingest usage event regardless of tokens consumed. Reasoning headroom is a quality concern, not a billing one.
5. pgvector's vector type caps HNSW indexes at 2000 dimensions. Native gemini-embedding-2 at 3072-d would fail table creation with column cannot have more than 2000 dimensions for hnsw index if HNSW were enabled. Resolution: ship v1 with hnsw=False — mem0's stock pgvector schema uses vector(3072) with no index, search runs exact sequential scan inside each user's partition. Fast enough at v1 per-user volumes; better recall than HNSW would give (seq scan is exact, HNSW is approximate). When a power user makes seq-scan latency painful, the upgrade path is to switch to halfvec(3072) + halfvec_cosine_ops + HNSW — a one-time online schema migration, not v1 work.
6. Mem0's update wipes metadata. Returns {"message": "Memory updated successfully!"} — not the row. Subsequent get(id) shows metadata: null even when the original had metadata. The Worker must explicitly re-pass metadata on every update to preserve it.
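
Finding 6 implies the Worker has to read the row before every update and send the merged metadata back. Since that read is the same GET the ownership check already needs, the two can share one step. The sketch below assumes a hypothetical Mem0Api interface; the method names and the update body fields (text, metadata) are placeholders, not mem0's confirmed REST shape.

```typescript
interface MemoryRow {
  id: string;
  user_id: string;
  memory: string;
  metadata: Record<string, unknown> | null;
}

// Hypothetical client surface; real endpoint shapes may differ.
interface Mem0Api {
  get(id: string): Promise<MemoryRow | null>; // null on 404
  update(id: string, body: { text: string; metadata: Record<string, unknown> }): Promise<void>;
}

async function updatePreservingMetadata(
  api: Mem0Api,
  scopeUserId: string,
  id: string,
  newText: string,
  metadataPatch: Record<string, unknown> = {},
): Promise<"updated" | "not_found"> {
  const current = await api.get(id);
  // Cross-user ids are reported as not_found so their existence is not revealed.
  if (current === null || current.user_id !== scopeUserId) {
    return "not_found";
  }
  // Re-pass the existing metadata explicitly, because the update would otherwise wipe it.
  const metadata = { ...(current.metadata ?? {}), ...metadataPatch };
  await api.update(id, { text: newText, metadata });
  return "updated";
}
```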

Mem0 response shapes (captured verbatim)

// add(infer=False)
{
  "results": [
    { "id": "...", "memory": "User is allergic to peanuts.",
      "event": "ADD", "actor_id": null, "role": "user" }
  ]
}

// add(infer=True) — three-message turn, one extracted fact
{
  "results": [
    { "id": "...",
      "memory": "User adopted a Welsh Corgi named Otis around May 19, 2026, who loves playing fetch.",
      "event": "ADD" }
  ]
}

// search — note score, metadata round-trips, role on user-source memories
{
  "results": [
    { "id": "...", "memory": "...", "hash": "...",
      "metadata": { "pinned": true, "source_type": "explicit" },
      "score": 0.8834746264057188,
      "created_at": "...", "updated_at": "...",
      "user_id": "...", "role": "user" }
  ]
}

// update — message only, no row data
{ "message": "Memory updated successfully!" }

Multi-tenancy And Security

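Memories are separated by user. Five controls enforce this: scope derivation and request-body sanitization in the Worker, the admin API key gate at the engine, a Worker-side ownership re-check that substitutes for same-DB-transaction enforcement (which is no longer possible without engine code), and the Worker's role as sole billing emitter.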

  1. Type-enforced scope derivation

    MemoryScope is a branded type whose only constructor reads JWT.sub post-verification. Mem0Client methods accept only MemoryScope. Passing a raw string is a compile-time error.

  2. Request-body sanitization

    Any user_id-shaped field in MCP tool arguments is dropped before forwarding. Only the server-derived scope reaches the engine.

  3. Admin API key gate at the engine

    Inbound calls must carry X-API-Key: <MEM0_ADMIN_API_KEY>, matched by mem0's stock verify_auth via secrets.compare_digest (constant-time). The key is a Wrangler secret on the Worker side and a Railway env var on the engine side — same shared-secret posture as USAGE_INGEST_API_KEY. Non-gateway callers cannot reach the engine.

  4. Worker-side ownership re-check + per-id mutex

    Update / delete / clear_all do GET /memories/{id} → verify user_id == scope.userId → mutate, wrapped by a KV mutex (SETNX svc_mutex:mem:{id} 2-second TTL) so concurrent Worker requests on the same id serialize. Eliminates the TOCTOU race that retry-on-fetch would otherwise expose. Cross-user id returns not_found.

  5. Worker is the sole billing emitter

    Every per-user usage event to payment-gateway originates from the Worker. The engine never reaches payment-gateway and only sees the user_id string as request-body metadata. A Railway env leak (GOOGLE_API_KEY_MEMORY or MEM0_ADMIN_API_KEY) lets an attacker hit Google and our engine until we rotate, but cannot mint usage events for arbitrary users.

Mem0 itself partitions by user_id in its pgvector schema; every query requires at least one of user_id / agent_id / run_id and refuses to run without it. We rely on user_id only.
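
The first two layers (type-enforced scope and request-body sanitization) can be sketched together. This is an illustration under stated assumptions: the VerifiedClaims shape, the exact set of keys treated as "user_id-shaped", and the helper names other than MemoryScope are hypothetical.

```typescript
// Branded type: a plain object or string cannot be passed where MemoryScope is required.
interface MemoryScope {
  readonly userId: string;
  readonly __brand: "MemoryScope";
}

// Assumed shape of the claims after JWT verification has already succeeded.
interface VerifiedClaims {
  sub: string;
  azp?: string;
}

// The only place a MemoryScope is created.
function getMemoryScope(claims: VerifiedClaims): MemoryScope {
  if (claims.sub.length === 0) {
    throw new Error("JWT.sub is empty");
  }
  return { userId: claims.sub } as MemoryScope;
}

// Assumed list of caller-supplied keys that must never reach the engine.
const SCOPE_KEYS = new Set<string>(["user_id", "userId"]);

function sanitizeArguments(args: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const key of Object.keys(args)) {
    if (!SCOPE_KEYS.has(key)) {
      clean[key] = args[key];
    }
  }
  return clean;
}

// Only the server-derived scope supplies user_id in the forwarded body.
function buildEngineBody(scope: MemoryScope, args: Record<string, unknown>): Record<string, unknown> {
  return { ...sanitizeArguments(args), user_id: scope.userId };
}

// buildEngineBody("user-1", {}) would not type-check: a raw string is not a MemoryScope.
```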

Hop | Auth | Notes
Client → ai-gateway | User JWT (Keycloak) or sk-cs-* API key | Existing identity() middleware.
ai-gateway → engine | X-API-Key: <MEM0_ADMIN_API_KEY> (mem0's stock admin path) | Constant-time match via secrets.compare_digest; vanilla mem0, no source patch.
engine → Google Gemini API (LLM + embeddings under infer=True) | GOOGLE_API_KEY_MEMORY (engine's own Google API key, "memory" GCP project) | Bills the GCP project at exact-token rates via mem0's native google-genai SDK.
Worker → payment-gateway (per-user usage events) | USAGE_INGEST_API_KEY (existing) | Worker emits one {operation, count:1} event per successful MCP tool call (no emission on failure). Credit pricing lives on payment-gateway.

Hardening Before External Users

V1 ships with internal users only. Before flipping the switch to external, these items must land. They are not blocking the initial implementation but ARE blocking the external GA gate.

  1. Audit log on destructive ops

    V1 has none — debugging "where did my memory go?" requires SSH-ing into Railway. Before external users, add a Worker-side audit table (one row per update/delete/clear_all) with nightly export to R2 object-lock for HIPAA/SOC2-Type-II evidence.

  2. Postgres-backed history

    Mem0 history is currently disabled (":memory:"). Before external users, a ~50 LOC custom PostgresHistoryManager subclass restores content-diff forensics without re-introducing a SQLite volume.

  3. Eval recall@5 stable for 14 days

    At staging-traffic volume; size the eval set to detect a 10pp regression with significance.

  4. Tighter DR posture

    Either move Postgres backups to WAL-G shipping WAL to R2 every 5 min (5-min RPO, off-Railway redundancy), or wait for Railway's native PITR if/when it ships. v1's 24-hour Railway-native RPO is below the bar for irreversible personal-memory data at external-user scale.

  5. Successful DR drill

    RTO ≤ 4 h, RPO meeting the new posture.

What Has No Worker-side Alternative

Two things genuinely cannot be recovered on the Worker side under pure-mem0. Calling them out explicitly so future revisits know what they cost and how to bring them back.

1. BM25 keyword scoring (part of hybrid search). Needs direct Postgres access from the Worker (ts_rank_cd over the tsvector index mem0 already creates). Workers cannot speak Postgres without Cloudflare Hyperdrive + Railway external connectivity, which is new infra not in v1 scope. v1 ships vector + recency decay only. If eval shows recall-at-top-5 regresses materially on exact-token queries (proper nouns, code identifiers, unusual phrases), the recovery paths are a reranker (Cohere/Jina, deferred) or adding Hyperdrive to let the Worker run BM25 directly.
2. Engine-internal metrics. Mem0's stock server does not expose Prometheus, and we are not adding wrapper code that could. We lose engine-side request latency histograms, Postgres pool saturation, embedder/LLM call timing inside mem0. Root-causing mem0-internal slowness requires SSH-ing into Railway and pulling stdout/Postgres logs by hand. Acceptable for v1 internal traffic; revisit if engine perf becomes the binding constraint.

Everything else (ownership check, per-user usage emission, rate limit, dedupe, recency decay) has a Worker-side alternative — see the Components and Multi-tenancy sections.

Deployment And DR

State | Production backing | Rule
Mem0 vectors + payloads | Railway Postgres with pgvector | vector(3072), hnsw=False, exact sequential scan per user partition. Native gemini-embedding-2 dim preserved.
Mem0 history | Disabled (in-memory) | No reader in v1; revisit PostgresHistoryManager subclass before external users.
Postgres backups | Railway native scheduled (daily) | 6-day retention per Railway docs. RPO ~24 h. Tighter posture deferred to before-external-GA hardening.
Dedupe table | D1 (existing ai-gateway DB) | One migration: memory_content_hashes. Worker-managed.
Engine LLM + embedding calls (mem0 internal under infer=True) | Google Gemini API directly | Engine key GOOGLE_API_KEY_MEMORY; charges the dedicated "memory" GCP project at exact-token rates via mem0's native google-genai SDK. Worker emits one {operation, count:1} usage event per successful MCP tool call; credit pricing lives on payment-gateway.

v1 DR posture: Railway native backups only. Railway runs scheduled snapshots (daily, 6-day retention per their docs) — periodic full backups, no point-in-time recovery, no off-platform redundancy. RPO is ~24 hours: a bad migration or accidental DELETE can lose up to a full day of memory writes. Two targets describe the quality of this DR posture:
  • RPO (Recovery Point Objective) = 24 hours — worst-case data loss between daily snapshots.
  • RTO (Recovery Time Objective) = 4 hours — wall-clock budget to restore the latest snapshot and cut over.
This is acceptable for v1 internal users. Before external-user GA, the Hardening section calls for tightening to either WAL-G shipping WAL to R2 every 5 min (5-min RPO + off-platform redundancy) or Railway's PITR offering if it ships in the meantime.

Required staging validation

  1. Round-trip

    remember → search_memory → update_memory → delete_memory.

  2. Cross-user attempt

    From a fresh JWT, try to read/update/delete another user's id; must receive not_found.

  3. Recreate engine service

    Tear down the FastAPI service while keeping Postgres; redeploy; verify memories survive (no volume to lose).

  4. Quarterly DR drill

    Restore the latest snapshot into a scratch Railway project (plus WAL replay once the tighter DR posture from Hardening lands; v1 has snapshots only); run eval suite; record actual recovery time.

Implementation Libraries

Versions pinned to the latest stable as of 2026-05-18.

ai-gateway (TypeScript, Cloudflare Worker)

Concern | Library | Latest | Why this lib
MCP transport | @modelcontextprotocol/sdk | 1.29.0 | Reference SDK; same one image-gen uses.
Schemas | zod | 4.4.3 | MCP SDK + Hono target zod.
JWT / JWKS | jose | 6.2.3 | Pure Web Crypto, runs in V8 isolates.
HTTP server | hono | 4.12.19 | Workers-native router; existing gateway uses it.

Memory engine (Python — zero of our own code)

The engine is the vanilla mem0ai/mem0 Docker image plus an init.sql. We write zero lines of Python.

Concern | Library | Latest | Why this lib
Memory primitives + REST server + admin auth + Gemini providers | mem0ai | 2.0.2 | Native ADMIN_API_KEY auth; native provider: "gemini" for both LLM and embedder. No wrapper, no subclass, no fork.

Testing And Rollout

ai-gateway tests (vitest)

  • getMemoryScope produces MemoryScope only from JWT.sub.
  • Mem0Client rejects non-MemoryScope args (type-level).
  • Mem0Client attaches X-API-Key on every request; never sends Authorization.
  • Content-hash dedupe: D1 hit → no engine call; miss → forward + write hash only on mem0 2xx.
  • Ownership check: cross-user id → not_found (engine never called).
  • Rate limits: 31st remember in a minute → 429; clock advances → reset.
  • billing.ts emits exactly one usage event per successful mem0 round-trip; event shape { user_id, product_id: "memory", operation, count: 1, request_id } (no credit field — pricing on payment-gateway); no emission on engine 5xx; D1 outbox catches payment-gateway 5xx.
  • Recency decay: 60-day-old at 0.85 ranks below 1-day-old at 0.80; pinned: true does not decay.
  • Engine 5xx → backend_unavailable.
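
The rate-limit case above (31st remember in a minute is rejected, then the window resets) matches a fixed-window counter. The sketch below uses an in-memory Map as a stand-in for GATEWAY_KV and takes the clock as a parameter so the test can advance it; the fixed-window algorithm and the helper name are assumptions, since the spec only says "per-user windows".

```typescript
interface RateDecision {
  allowed: boolean;
  retryAfterSeconds: number; // value for the Retry-After header on a 429
}

function createFixedWindowLimiter(limit: number, windowMs: number) {
  // Stand-in for GATEWAY_KV. Old windows are never evicted here; a real store
  // would expire them (for example via a TTL on each key).
  const counts = new Map<string, number>();

  return function check(userId: string, nowMs: number): RateDecision {
    const windowStart = Math.floor(nowMs / windowMs) * windowMs;
    const key = `${userId}:${windowStart}`;
    const used = counts.get(key) ?? 0;
    if (used >= limit) {
      const retryAfterSeconds = Math.ceil((windowStart + windowMs - nowMs) / 1000);
      return { allowed: false, retryAfterSeconds };
    }
    counts.set(key, used + 1);
    return { allowed: true, retryAfterSeconds: 0 };
  };
}
```

A counter in an eventually consistent store can over-admit under concurrency, which is why the spec routes clear_all_memory through a Durable Object instead.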

Engine tests (pytest)

  • End-to-end add(infer=True) against the dev Postgres + Google Gemini returns a non-empty results array with event: "ADD".
  • init.sql is idempotent: running twice doesn't duplicate the User row or the vector extension.
  • Mem0's native gemini provider translates response_format={"type": "json_object"} → Gemini's response_mime_type: "application/json" (parity smoke test against mem0/llms/gemini.py:170).
  • hnsw=False schema check: \d memories shows vector column with no HNSW index. Search latency at 1k rows < 100 ms p99.
  • Wired-up admin-auth integration test against built Docker image: missing X-API-Key → 401; bogus key → 401; real MEM0_ADMIN_API_KEY → 2xx with event: "ADD".

Eval harness (infra/mem0/eval/golden.jsonl): 50–100 hand-crafted memory/query pairs. CI runs the suite on PRs; recall@5 must not regress by more than 5 percentage points vs main.
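
The CI gate can be expressed as two small functions: one computing recall@k over the golden set and one comparing a candidate against main. This is a sketch; the EvalCase shape (one expected memory id per query) and the Retriever signature are assumptions about golden.jsonl and the harness, not their actual format.

```typescript
// Assumed shape of one golden.jsonl entry.
interface EvalCase {
  query: string;
  expectedId: string;
}

// Returns memory ids ranked best-first for a query.
type Retriever = (query: string) => string[];

function recallAtK(cases: EvalCase[], retrieve: Retriever, k: number): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const testCase of cases) {
    const topK = retrieve(testCase.query).slice(0, k);
    if (topK.indexOf(testCase.expectedId) !== -1) hits += 1;
  }
  return hits / cases.length;
}

// Passes when the candidate is at most maxDropPoints percentage points below the baseline.
function passesRecallGate(candidate: number, baseline: number, maxDropPoints = 5): boolean {
  const dropPoints = (baseline - candidate) * 100;
  return dropPoints <= maxDropPoints + 1e-9; // small tolerance for floating-point error
}
```

With only 50 to 100 pairs, a single case moves recall@5 by 1 to 2 points, so a 5-point threshold corresponds to roughly three to five flipped cases.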

  1. Provision per env

    Generate MEM0_ADMIN_API_KEY (32-byte random); mint a Google API key (GOOGLE_API_KEY_MEMORY) in a dedicated "memory" GCP project restricted to the Generative Language API; Railway project with Postgres and Railway native daily backups enabled; run init.sql against Postgres; deploy the vanilla mem0ai/mem0 Docker image.

  2. Land code and smoke-test dev

    Catalog shows memory; MCP tools list at /engines/memory/mcp; full cycle works; D1 dedupe rows appear; payment-gateway shows one {operation, count:1} usage event per successful MCP tool call (failures emit nothing).

  3. Promote to staging via main push

    Repeat smoke + DR drill.

  4. Production with internal users

    Gate external rollout on the tighter DR posture from Hardening landing (WAL-G to R2 or Railway PITR) and eval recall@5 stable for 14 days.