feat-mem-mcp · memory v1
Personal memory as a first-class engine on the ai-gateway. The Cloudflare Worker is the control plane and the sole billing emitter — multi-tenancy, rate limits, ownership re-check, per-user usage events to payment-gateway. The Railway side runs 100% vanilla mem0ai/mem0 — no source patch, no extension files, no factory-dict swaps. Configuration only, plus one init.sql that seeds the pgvector extension and mem0's default User row. Worker→engine auth uses mem0's stock ADMIN_API_KEY path: X-API-Key: <MEM0_ADMIN_API_KEY> matched via secrets.compare_digest. Mem0 talks to Google's Gemini API directly via its native provider: "gemini" (uses the google-genai SDK, properly translates response_format → Gemini's response_mime_type). Vector store: native 3072-d gemini embeddings stored as vector(3072) with hnsw=False — exact sequential scan inside each user's partition, best retrieval quality, fast enough at v1 per-user volumes. Billing is by request count: the Worker emits one { operation, count: 1 } usage event per successful MCP tool call. Credit pricing per operation lives on payment-gateway, not in the Worker. MCP at /engines/memory/mcp matches the existing /engines/<id>/mcp convention.
/engines/memory/mcp; seven tools.
remember (agent-curated) and ingest (turn capture for runtime hooks).
mem0ai/mem0 on Railway. No source patch, no extension files. Config only + one init.sql.
provider: "gemini" (google-genai SDK).
vector(3072), hnsw=False: exact sequential scan, best retrieval quality.
X-API-Key: <MEM0_ADMIN_API_KEY>, mem0's stock admin path.
remember), ownership re-check, rate limits, recency decay.
{ operation, count: 1 }. Failed requests (4xx/5xx, rate limits, ownership rejects) emit nothing. Memory bills by request, not by tokens (mem0's cost per call is bounded). Credit pricing per operation lives on payment-gateway, not in the Worker.
remember({memory: "..."}) during reasoning. Worker hashes, checks D1 for prior hash hit, forwards to mem0 POST /memories with infer=False. No LLM call.
ingest({messages: [...]}). Worker forwards to mem0.add(messages, user_id=..., infer=True). Mem0 runs its full native pipeline (top-10 search, LLM extraction with ADDITIVE_EXTRACTION_PROMPT, cross-memory hash dedup, batched embeddings, entity store with linked_memory_ids, prompt auto-updates). Mem0's LLM and embedder calls go to Google's Gemini API directly via its native gemini providers with the engine's GOOGLE_API_KEY_MEMORY; Google bills the dedicated "memory" GCP project at exact-token rates. After mem0 returns 2xx, the Worker emits one { operation: "ingest", count: 1 } usage event to payment-gateway. One round-trip per ingest.
X-API-Key: <MEM0_ADMIN_API_KEY>, matched via secrets.compare_digest by mem0's stock verify_auth. All multi-tenancy enforcement and per-user usage emission lives in the Worker. Mem0 talks to Google's Gemini API directly via its native provider: "gemini" using the engine's GOOGLE_API_KEY_MEMORY; Google bills the dedicated "memory" GCP project at exact-token rates.
Billing is by request count: the Worker emits one { operation, count: 1 } usage event per successful MCP tool call via the existing USAGE_INGEST_API_KEY path; failed requests are free. Credit pricing lives on payment-gateway. No engine-side billing machinery; no custom auth file; no subclasses; 100% vanilla mem0.
google-genai SDK). Google bills the dedicated "memory" GCP project. The Worker emits one usage event per successful MCP tool call (no event on failure) via the existing USAGE_INGEST_API_KEY path — payload { operation, count: 1 }.
remember.
{ operation, count: 1 } usage event per successful MCP tool call; sole billing emitter, no engine-side billing code. Pricing per operation is configured on payment-gateway.
infer=True: extraction with ADDITIVE_EXTRACTION_PROMPT, hash dedup, batched embeddings, entity store with linked_memory_ids.
ADMIN_API_KEY path (secrets.compare_digest); no custom verifier.
provider: "gemini" (google-genai SDK). response_format → Gemini's response_mime_type handled inside mem0's GeminiLLM (line 170).
vector(3072), hnsw=False, exact sequential scan inside each user's partition. Best retrieval quality.
init.sql seeds the vector extension and the default User row idempotently.
src/)
| Path | Responsibility |
|---|---|
| src/mcp/engines/memory/ | Memory MCP server (image-gen pattern). Hosts all seven memory tools. Exports memoryServer and memoryMetadata. |
| src/routes/engines/index.ts | One line added: enginesApp.route(`/${memoryMetadata.id}`, memoryServer). Same pattern as image-gen. |
| src/mcp/engines/memory/scope.ts | Branded MemoryScope type; constructor reads JWT.sub + JWT.azp after verification. |
| src/mcp/engines/memory/client.ts | Thin HTTP client around mem0's stock REST API. Methods take scope: MemoryScope. Attaches X-API-Key: <MEM0_ADMIN_API_KEY> (Wrangler secret) on every call. No wrapping endpoint shape; no token-exchange hop. |
| src/mcp/engines/memory/dedupe.ts | Light idempotency for remember only (not ingest; mem0 handles dedup natively under infer=True). SHA-256 → D1 check; write hash after mem0 success; verify mem0 row on hash hit; nightly TTL sweep. |
| src/mcp/engines/memory/ratelimit.ts | Per-user windows in GATEWAY_KV. 429 + Retry-After. |
| src/mcp/engines/memory/search.ts | Recency decay on returned scores; re-sort; return top-k. Honors metadata.pinned. |
| src/mcp/engines/memory/billing.ts | After each successful mem0 round-trip, emits one usage event to payment-gateway via USAGE_INGEST_API_KEY with payload { user_id, product_id: "memory", operation, count: 1, request_id }. No credit field: pricing per operation lives on payment-gateway. D1 outbox fallback on payment-gateway 5xx. |
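The recency-decay step that search.ts is responsible for could look roughly like the sketch below. The source does not say which decay curve is used, so this assumes exponential decay with a hypothetical 30-day half-life. `applyRecencyDecay`, `ScoredMemory`, and `HALF_LIFE_DAYS` are illustrative names, not the actual implementation; the row fields follow the search response shape shown later in this document.

```typescript
// Hypothetical subset of the fields returned by mem0's search response.
interface ScoredMemory {
  id: string;
  memory: string;
  score: number;
  updated_at: string; // ISO-8601 timestamp
  metadata?: { pinned?: boolean } | null;
}

// Assumption: the half-life is illustrative, not a value from the design.
const HALF_LIFE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function applyRecencyDecay(
  rows: ScoredMemory[],
  now: Date,
  topK: number,
): ScoredMemory[] {
  const decayed = rows.map((row) => {
    // Pinned rows opt out of decay and keep their raw similarity score.
    if (row.metadata?.pinned === true) return row;
    const ageMs = now.getTime() - Date.parse(row.updated_at);
    const ageDays = Math.max(0, ageMs / MS_PER_DAY);
    // Score halves every HALF_LIFE_DAYS.
    const factor = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
    return { ...row, score: row.score * factor };
  });
  // Re-sort by decayed score, highest first, then keep the top-k.
  return decayed.sort((a, b) => b.score - a.score).slice(0, topK);
}
```

Under these assumptions a 60-day-old unpinned row keeps a quarter of its similarity score, so a fresher or pinned row with a lower raw score can outrank it.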
infra/mem0/) — 100% vanilla mem0
The engine is the vanilla mem0ai/mem0 Docker image, unmodified. No COPY over source. No Python extension files. No factory-dict swaps. Zero lines of our own Python. Two things make this possible:
provider: "gemini" for both LLM and embedder (uses Google's google-genai SDK). It translates response_format={"type": "json_object"} to Gemini-native response_mime_type: "application/json", so no subclass is needed to coerce JSON output.
hnsw=False uses mem0's stock pgvector schema (vector(<dim>)) with no index, so search is an exact sequential scan over the user's partition. This avoids pgvector's 2000-d HNSW cap, so no schema subclass is needed for native 3072-d embeddings. At v1 per-user volumes (hundreds to low thousands of memories) a sequential scan is fast enough (the test plan targets under 100 ms p99 at 1k rows) and gives better recall than HNSW would (exact vs. approximate).
| File | Purpose |
|---|---|
| Dockerfile | FROM mem0ai/mem0@sha256:<digest> (pinned by digest, not tag). Vanilla: no COPY, no RUN pip install beyond mem0's defaults. |
| init.sql | Runs once against the Postgres add-on after provision: CREATE EXTENSION IF NOT EXISTS vector; + INSERT INTO users (username) VALUES ('memory-engine') ON CONFLICT DO NOTHING;. Idempotent. The User row is the single account mem0's require_auth resolves to under ADMIN_API_KEY auth. |
| mem0.config.json / env | Mem0's native config: provider: "gemini" for both LLM and embedder; model gemini-3-flash with max_tokens=8000 for reasoning headroom; embedder model models/gemini-embedding-2; embedding_dims=3072; vector store pgvector with hnsw=False, embedding_model_dims=3072, cosine distance. History disabled. |
| Dockerfile HEALTHCHECK | Boot-time self-test: container fires an unauthenticated GET /memories against itself and fails liveness on anything other than 401. Catches an upstream AUTH_DISABLED default flip or an accidentally-empty MEM0_ADMIN_API_KEY that would otherwise open the engine to the public. |
| tests/test_admin_auth_wired.py | CI integration test: boots the built Docker image; sends POST /memories with no X-API-Key, with a bogus key, and with the real MEM0_ADMIN_API_KEY; asserts 401/401/2xx. Failure blocks the Railway deploy. |
| railway.json | One service + Postgres add-on. init.sql mounted as a post-provision migration. No persistent volume. |
| README.md | Deploy / rotate / backup / restore / DR-drill runbook. |
All seven tools live at /engines/memory/mcp, mounted via enginesApp exactly like image-gen. The Worker translates each tool to direct calls against mem0's native REST endpoints.
| Tool | Worker behavior | Mem0 call |
|---|---|---|
| search_memory | Forward query; apply recency decay to non-pinned rows on the response. Worker emits one {operation: "search_memory", count: 1} event after success. | POST /memories/search |
| remember | SHA-256 content hash → D1 check. Hit → verify mem0 row exists (cheap GET); if 404 delete orphan + fall through. Miss → forward; write hash only on mem0 2xx. Optional pinned: true opts out of recency decay. Worker emits one {operation: "remember", count: 1} event after success. | POST /memories (infer=False); single curated fact, no extraction needed |
| ingest | Forward to mem0 with infer=True. Mem0 runs its full native pipeline (top-10 search, LLM extraction, hash-dedup, batched embed + insert, entity store). Worker emits one {operation: "ingest", count: 1} event after mem0 returns 2xx. | one POST /memories with infer=True |
| update_memory | GET /memories/{id} → verify user_id matches scope → forward. | GET then PUT /memories/{id} |
| delete_memory | Same ownership check → forward. | GET then DELETE /memories/{id} |
| list_memory | Forward with user_id filter; pass cursor. | GET /memories?user_id=... |
| clear_all_memory | Validate confirm: true; atomic 5/hour check via per-user Durable Object (KV is too eventually-consistent for a destructive op); forward. | DELETE /memories?user_id=... |
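The remember row in the table above reduces to a small decision table, sketched here as pure functions. `decideRemember`, `RememberAction`, and `shouldWriteHash` are hypothetical names. The source does not state what the tool returns on a confirmed hash hit, so the `return_existing` branch is an assumption that the Worker short-circuits without a second write.

```typescript
// Hypothetical names; the branches mirror the remember row of the tool table.
type RememberAction =
  | { kind: "forward" } // no prior hash: send to mem0
  | { kind: "return_existing" } // hash hit and mem0 still has the row (assumed short-circuit)
  | { kind: "delete_orphan_then_forward" }; // hash hit but mem0 answered 404

function decideRemember(hashHit: boolean, mem0RowExists: boolean): RememberAction {
  if (!hashHit) return { kind: "forward" };
  return mem0RowExists
    ? { kind: "return_existing" }
    : { kind: "delete_orphan_then_forward" };
}

// The hash is written only after mem0 answers 2xx, so a failed forward
// never leaves a hash pointing at a row mem0 does not have.
function shouldWriteHash(mem0Status: number): boolean {
  return mem0Status >= 200 && mem0Status < 300;
}
```

Keeping the decision separate from the D1 and HTTP calls makes the orphan case easy to unit-test without a live engine.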
billing.ts emits exactly one { operation, count: 1 } event per successful MCP tool call (mem0 returns 2xx and Worker post-processing — ownership check, decay, dedupe — succeeds). No event on any failure: rate limit reject (429), ownership mismatch / not_found, schema violation, payload too large, engine 5xx, timeout, mutex contention. Failed requests are free.
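The emit-or-skip rule above can be sketched as a single pure function. The event fields come from the payload described for billing.ts; `ToolCallOutcome` and `usageEventFor` are hypothetical names used only for illustration, and the real module may carry more context.

```typescript
type Operation =
  | "search_memory"
  | "remember"
  | "ingest"
  | "update_memory"
  | "delete_memory"
  | "list_memory"
  | "clear_all_memory";

interface UsageEvent {
  user_id: string;
  product_id: "memory";
  operation: Operation;
  count: 1;
  request_id: string;
}

// Hypothetical summary of one MCP tool call as seen by the Worker.
interface ToolCallOutcome {
  userId: string;
  requestId: string;
  operation: Operation;
  mem0Status: number | null; // null when the engine was never reached or timed out
  postProcessingOk: boolean; // ownership check, decay, dedupe all succeeded
}

// Returns the event to emit, or null when the request is free.
function usageEventFor(outcome: ToolCallOutcome): UsageEvent | null {
  const engineOk =
    outcome.mem0Status !== null &&
    outcome.mem0Status >= 200 &&
    outcome.mem0Status < 300;
  if (!engineOk || !outcome.postProcessingOk) return null;
  return {
    user_id: outcome.userId,
    product_id: "memory",
    operation: outcome.operation,
    count: 1,
    request_id: outcome.requestId,
  };
}
```

Note that the event carries no credit amount: mapping `operation` to a price stays on payment-gateway.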
What infer=False vs infer=True means in mem0
Mem0's add() API has two paths through it, controlled by the infer argument. The choice matters because it determines where the LLM call happens, which in turn determines who gets billed.
| | infer=False: raw store | infer=True: full pipeline (mem0's default) |
|---|---|---|
| What it does | Each user message becomes one row, stored verbatim. System messages skipped. | Mem0 retrieves the user's top-10 existing memories, calls the LLM with ADDITIVE_EXTRACTION_PROMPT (new messages + existing memories + last-k context + profile summary). The LLM emits ADD/UPDATE/DELETE decisions; mem0 dispatches each. |
| LLM calls per add() | Zero. | One extraction call, plus embeddings for whatever was added/updated. |
| Dedup against existing memories | None. Same content twice = two rows. | The LLM is shown existing memories and decides whether to ADD a new row, UPDATE an existing one, or DELETE a stale fact. |
| Cost per call | Embedding only (~tiny). | Extraction LLM + embeddings (~30× the infer=False cost in our setup). |
| Quality | Stores whatever you sent. If you sent a curated fact, good. If you sent a raw turn, you get raw turn rows. | Cleans up the input — splits compound facts, consolidates with existing knowledge, reconciles updates. |
Input: [{user: "I just adopted a Welsh Corgi named Otis. He loves fetch."}, {user: "He's 8 weeks."}]. Existing memory for the user: "User has a dog named Otis."
infer=False: two new rows stored verbatim. Existing "User has a dog named Otis" untouched. Total: 3 memories, partially redundant.
infer=True: one LLM call extracts → UPDATE old="User has a dog named Otis" → "User adopted Otis, a Welsh Corgi, 8 weeks old" + ADD "Otis enjoys playing fetch". Total: 2 memories, deduplicated and consolidated.
remember → infer=False. The agent has already curated a single fact. Running an LLM to "extract" it would paraphrase, split, or drop nuance. We just embed and store.
ingest → infer=True. Mem0 runs its full native pipeline against the raw turn: extraction, hash-dedup, entity store, batched embeds, prompt auto-updates. We get the quality of mem0's pipeline (verified in mem0/memory/main.py:699-971).
infer=True without re-implementing extraction in the Worker: mem0 talks to Google's Gemini API directly via its native gemini providers; Google bills the dedicated "memory" GCP project at exact-token rates. The Worker emits one usage event per successful MCP tool call (no emission on failure): { user_id, product_id: "memory", operation, count: 1 }. Memory bills by request (mem0's cost per call is bounded), not by tokens; payment-gateway maps operation → credit price via its own config.
ingest path)
curl -sS -X POST "$GATEWAY_URL/engines/memory/mcp" \
  -H "Authorization: Bearer $USER_TOKEN" \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "ingest",
    "arguments": {
      "messages": [
        { "role": "user", "content": "..." },
        { "role": "assistant", "content": "..." }
      ],
      "metadata": { "session_id": "abc123" }
    }
  }
}
EOF
Validated end-to-end on 2026-05-18 / 19 against a local docker-compose stack: pgvector/pgvector:pg16 + mem0ai==2.0.2 + mem0's native provider: "gemini" pointed at Google's Gemini API with gemini-3-flash (LLM) and gemini-embedding-2 (embeds). Full add/search/update/delete cycle confirmed; example infer=True extraction from a three-message turn: "User adopted a Welsh Corgi named Otis around May 19, 2026, who loves playing fetch."
provider: "gemini" for both LLM (mem0/llms/gemini.py) and embedder (mem0/embeddings/gemini.py) — uses Google's google-genai SDK directly. The LLM provider translates response_format={"type": "json_object"} to Gemini-native response_mime_type: "application/json" (line 170 of gemini.py). Whitelist accepts "gemini" directly. This is what makes the engine 100% vanilla — no subclass, no factory-dict swap.
filters={...}.
mem0.search(query, user_id=alice) raises ValueError: Top-level entity parameters not supported in search(). Use filters={'user_id': '...'} instead. Also the top-k arg is top_k, not limit. The Mem0Client in the Worker uses the v2 shape.
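A minimal sketch of the v2 search payload the Worker's Mem0Client would build. The `filters` and `top_k` names come from the finding above. It is an assumption that mem0's REST endpoint accepts the same field names as the Python call, and `buildSearchBody` plus the default of 10 are hypothetical.

```typescript
// Request body in the v2 shape: entity scoping goes inside `filters`,
// and the result cap is `top_k` (not `limit`).
interface SearchBody {
  query: string;
  filters: { user_id: string };
  top_k: number;
}

// Hypothetical helper; the default of 10 is illustrative only.
function buildSearchBody(
  scopeUserId: string,
  query: string,
  topK: number = 10,
): SearchBody {
  if (!Number.isInteger(topK) || topK < 1) {
    throw new RangeError("topK must be a positive integer");
  }
  return { query, filters: { user_id: scopeUserId }, top_k: topK };
}
```

Because the user id is only ever placed under `filters`, a top-level `user_id` can never be sent by accident.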
google/gemini-embedding-2 outputs 3072 dimensions natively.
Probed directly via Google's Gemini API — response includes 3072-length vector. The model supports Matryoshka via the output_dimensionality parameter (mem0's stock GoogleGenAIEmbedding passes it through from config.embedding_dims), but we use the native dim in v1. pgvector's embedding_model_dims MUST match exactly — set to 3072.
google/gemini-3-flash is a reasoning model.
Responses include completion_tokens_details.reasoning_tokens in the usage block. A trivial echo prompt used 74 reasoning tokens + 5 output tokens. Mem0's default max_tokens=2000 may starve the model on a real ADDITIVE_EXTRACTION_PROMPT. Fix: set MEM0_LLM_MAX_TOKENS=8000 in the engine's mem0 config. Billing implications: Google bills the "memory" GCP project for the full completion line (reasoning included); the Worker emits one ingest usage event regardless of tokens consumed. Reasoning headroom is a quality concern, not a billing one.
pgvector's vector type caps HNSW indexes at 2000 dimensions.
Native gemini-embedding-2 at 3072-d would fail table creation with column cannot have more than 2000 dimensions for hnsw index if HNSW were enabled. Resolution: ship v1 with hnsw=False — mem0's stock pgvector schema uses vector(3072) with no index, search runs exact sequential scan inside each user's partition. Fast enough at v1 per-user volumes; better recall than HNSW would give (seq scan is exact, HNSW is approximate). When a power user makes seq-scan latency painful, the upgrade path is to switch to halfvec(3072) + halfvec_cosine_ops + HNSW — a one-time online schema migration, not v1 work.
update wipes metadata.
Returns {"message": "Memory updated successfully!"} — not the row. Subsequent get(id) shows metadata: null even when the original had metadata. The Worker must explicitly re-pass metadata on every update to preserve it.
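One way to keep metadata across updates is to merge the row fetched during the ownership check into the outgoing body, as in the sketch below. The body field names (`text`, `metadata`) are assumptions about mem0's update endpoint and should be checked against the pinned version; `buildUpdateBody` is a hypothetical helper.

```typescript
type Metadata = Record<string, unknown>;

// Assumed request body for the update call; field names are not confirmed.
interface UpdateBody {
  text: string;
  metadata: Metadata;
}

// Merges the existing row's metadata with any caller-supplied changes so
// the update does not null out fields such as `pinned`.
function buildUpdateBody(
  existingMetadata: Metadata | null,
  newText: string,
  patch: Metadata = {},
): UpdateBody {
  return {
    text: newText,
    metadata: { ...(existingMetadata ?? {}), ...patch },
  };
}
```

Since update_memory already does a GET for the ownership check, the existing metadata is available without an extra round-trip.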
// add(infer=False)
{
  "results": [
    { "id": "...", "memory": "User is allergic to peanuts.",
      "event": "ADD", "actor_id": null, "role": "user" }
  ]
}
// add(infer=True): three-message turn, one extracted fact
{
  "results": [
    { "id": "...",
      "memory": "User adopted a Welsh Corgi named Otis around May 19, 2026, who loves playing fetch.",
      "event": "ADD" }
  ]
}
// search: note score, metadata round-trips, role on user-source memories
{
  "results": [
    { "id": "...", "memory": "...", "hash": "...",
      "metadata": { "pinned": true, "source_type": "explicit" },
      "score": 0.8834746264057188,
      "created_at": "...", "updated_at": "...",
      "user_id": "...", "role": "user" }
  ]
}
// update: message only, no row data
{ "message": "Memory updated successfully!" }
Memories are separated by user. Three enforcement layers in the Worker, plus a Worker-side ownership re-check that substitutes for same-DB-transaction enforcement (which is no longer possible without engine code).
MemoryScope is a branded type whose only constructor reads JWT.sub post-verification. Mem0Client methods accept only MemoryScope. Passing a raw string is a compile-time error.
Any user_id-shaped field in MCP tool arguments is dropped before forwarding. Only the server-derived scope reaches the engine.
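The two layers above, a branded scope type and argument stripping, could be sketched as follows. `getMemoryScope` and `MemoryScope` are named in this document; `VerifiedClaims`, `stripUserIdFields`, and the key-matching rule are assumptions for illustration. The stripping shown is shallow (top-level keys only).

```typescript
// The brand makes MemoryScope impossible to build from a plain string or
// object literal without going through getMemoryScope.
type MemoryScope = {
  readonly userId: string;
  readonly clientId: string | null;
} & { readonly __brand: "MemoryScope" };

// Hypothetical shape of claims after the JWT has already been verified.
interface VerifiedClaims {
  sub: string;
  azp?: string;
}

function getMemoryScope(claims: VerifiedClaims): MemoryScope {
  if (!claims.sub) throw new Error("verified JWT has no sub claim");
  return { userId: claims.sub, clientId: claims.azp ?? null } as MemoryScope;
}

// Drops top-level keys that look like a user id (user_id, userId, user-id)
// so only the server-derived scope can reach the engine.
function stripUserIdFields(
  args: Record<string, unknown>,
): Record<string, unknown> {
  const cleaned: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(args)) {
    const normalized = key.toLowerCase().replace(/[_-]/g, "");
    if (normalized !== "userid") cleaned[key] = value;
  }
  return cleaned;
}
```

A client method declared as `search(scope: MemoryScope, query: string)` then rejects `search("alice", "q")` at compile time.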
Inbound calls must carry X-API-Key: <MEM0_ADMIN_API_KEY>, matched by mem0's stock verify_auth via secrets.compare_digest (constant-time). The key is a Wrangler secret on the Worker side and a Railway env var on the engine side — same shared-secret posture as USAGE_INGEST_API_KEY. Non-gateway callers cannot reach the engine.
Update / delete / clear_all do GET /memories/{id} → verify user_id == scope.userId → mutate, wrapped by a KV mutex (SETNX svc_mutex:mem:{id} 2-second TTL) so concurrent Worker requests on the same id serialize. Eliminates the TOCTOU race that retry-on-fetch would otherwise expose. Cross-user id returns not_found.
Every per-user usage event to payment-gateway originates from the Worker. The engine never reaches payment-gateway and only sees the user_id string as request-body metadata. A Railway env leak (GOOGLE_API_KEY_MEMORY or MEM0_ADMIN_API_KEY) lets an attacker hit Google and our engine until we rotate, but cannot mint usage events for arbitrary users.
Mem0 itself partitions by user_id in its pgvector schema; every query requires at least one of user_id / agent_id / run_id and refuses to run without it. We rely on user_id only.
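The ownership re-check described above (fetch the row, compare user_id to the scope, answer not_found on mismatch) is sketched below as a pure function. `Mem0Row`, `OwnershipResult`, and `checkOwnership` are hypothetical names, and the mutex and HTTP calls are deliberately left out.

```typescript
// Hypothetical subset of the row returned by GET /memories/{id}.
interface Mem0Row {
  id: string;
  user_id: string;
}

type OwnershipResult =
  | { ok: true; row: Mem0Row }
  | { ok: false; error: "not_found" };

// A missing row and a row owned by another user produce the same answer,
// so a caller cannot probe for the existence of other users' ids.
function checkOwnership(
  row: Mem0Row | null,
  scopeUserId: string,
): OwnershipResult {
  if (row === null || row.user_id !== scopeUserId) {
    return { ok: false, error: "not_found" };
  }
  return { ok: true, row };
}
```

The mutation is forwarded only when `ok` is true; any other result ends the request before the engine sees a write.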
| Hop | Auth | Notes |
|---|---|---|
| Client → ai-gateway | User JWT (Keycloak) or sk-cs-* API key | Existing identity() middleware. |
| ai-gateway → engine | X-API-Key: <MEM0_ADMIN_API_KEY> (mem0's stock admin path) | Constant-time match via secrets.compare_digest; vanilla mem0, no source patch. |
| engine → Google Gemini API (LLM + embeddings under infer=True) | GOOGLE_API_KEY_MEMORY (engine's own Google API key, "memory" GCP project) | Bills the GCP project at exact-token rates via mem0's native google-genai SDK. |
| Worker → payment-gateway (per-user usage events) | USAGE_INGEST_API_KEY (existing) | Worker emits one {operation, count:1} event per successful MCP tool call (no emission on failure). Credit pricing lives on payment-gateway. |
V1 ships with internal users only. Before flipping the switch to external, these items must land. They are not blocking the initial implementation but ARE blocking the external GA gate.
V1 has none — debugging "where did my memory go?" requires SSH-ing into Railway. Before external users, add a Worker-side audit table (one row per update/delete/clear_all) with nightly export to R2 object-lock for HIPAA/SOC2-Type-II evidence.
Mem0 history currently disabled (":memory:"). Before external users, a ~50 LOC custom PostgresHistoryManager subclass restores content-diff forensics without re-introducing a SQLite volume.
recall@5 stable for 14 days
At staging-traffic volume; size the eval set to detect a 10pp regression with significance.
Either move Postgres backups to WAL-G shipping WAL to R2 every 5 min (5-min RPO, off-Railway redundancy), or wait for Railway's native PITR if/when it ships. v1's 24-hour Railway-native RPO is below the bar for irreversible personal-memory data at external-user scale.
RTO ≤ 4 h, RPO meeting the new posture.
Two things genuinely cannot be recovered on the Worker side under pure-mem0. Calling them out explicitly so future revisits know what they cost and how to bring them back.
ts_rank_cd over the tsvector index mem0 already creates). Workers cannot speak Postgres without Cloudflare Hyperdrive + Railway external connectivity, which is new infra not in v1 scope. v1 ships vector + recency decay only. If eval shows recall-at-top-5 regresses materially on exact-token queries (proper nouns, code identifiers, unusual phrases), the recovery paths are a reranker (Cohere/Jina, deferred) or adding Hyperdrive to let the Worker run BM25 directly.
Everything else (ownership check, per-user usage emission, rate limit, dedupe, recency decay) has a Worker-side alternative — see the Components and Multi-tenancy sections.
| State | Production backing | Rule |
|---|---|---|
| Mem0 vectors + payloads | Railway Postgres with pgvector | vector(3072), hnsw=False, exact sequential scan per user partition. Native gemini-embedding-2 dim preserved. |
| Mem0 history | Disabled (in-memory) | No reader in v1; revisit PostgresHistoryManager subclass before external users. |
| Postgres backups | Railway native scheduled (daily) | 6-day retention per Railway docs. RPO ~24 h. Tighter posture deferred to before-external-GA hardening. |
| Dedupe table | D1 (existing ai-gateway DB) | One migration: memory_content_hashes. Worker-managed. |
| Engine LLM + embedding calls (mem0 internal under infer=True) | Google Gemini API directly | Engine key GOOGLE_API_KEY_MEMORY; charges the dedicated "memory" GCP project at exact-token rates via mem0's native google-genai SDK. Worker emits one {operation, count:1} usage event per successful MCP tool call; credit pricing lives on payment-gateway. |
DELETE can lose up to a full day of memory writes. Two targets describe the quality of this DR posture:
remember → search_memory → update_memory → delete_memory.
From a fresh JWT, try to read/update/delete another user's id; must receive not_found.
Tear down the FastAPI service while keeping Postgres; redeploy; verify memories survive (no volume to lose).
Restore latest snapshot + replay WAL into a scratch Railway project; run eval suite; record actual recovery time.
Versions pinned to the latest stable as of 2026-05-18.
| Concern | Library | Latest | Why this lib |
|---|---|---|---|
| MCP transport | @modelcontextprotocol/sdk | 1.29.0 | Reference SDK; same one image-gen uses. |
| Schemas | zod | 4.4.3 | MCP SDK + Hono target zod. |
| JWT / JWKS | jose | 6.2.3 | Pure Web Crypto, runs in V8 isolates. |
| HTTP server | hono | 4.12.19 | Workers-native router; existing gateway uses it. |
The engine is the vanilla mem0ai/mem0 Docker image plus an init.sql. We write zero lines of Python.
| Concern | Library | Latest | Why this lib |
|---|---|---|---|
| Memory primitives + REST server + admin auth + Gemini providers | mem0ai | 2.0.2 | Native ADMIN_API_KEY auth; native provider: "gemini" for both LLM and embedder. No wrapper, no subclass, no fork. |
getMemoryScope produces MemoryScope only from JWT.sub.
Mem0Client rejects non-MemoryScope args (type-level).
Mem0Client attaches X-API-Key on every request; never sends Authorization.
not_found (engine never called).
billing.ts emits exactly one usage event per successful mem0 round-trip; event shape { user_id, product_id: "memory", operation, count: 1, request_id } (no credit field; pricing on payment-gateway); no emission on engine 5xx; D1 outbox catches payment-gateway 5xx.
pinned: true does not decay.
backend_unavailable.
add(infer=True) against the dev Postgres + Google Gemini returns a non-empty results array with event: "ADD".
init.sql is idempotent: running twice doesn't duplicate the User row or the vector extension.
response_format={"type": "json_object"} → Gemini's response_mime_type: "application/json" (parity smoke test against mem0/llms/gemini.py:170).
hnsw=False schema check: \d memories shows vector column with no HNSW index. Search latency at 1k rows < 100 ms p99.
X-API-Key → 401; bogus key → 401; real MEM0_ADMIN_API_KEY → 2xx with event: "ADD".
Eval harness (infra/mem0/eval/golden.jsonl): 50–100 hand-crafted memory/query pairs. CI runs the suite on PRs; recall@5 must not regress by more than 5 percentage points vs main.
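The CI gate above could be computed along these lines. The source does not define recall@5 precisely, so this sketch assumes a hit-rate definition: a query counts as a hit when any expected memory id appears in the top k results. `EvalCase`, `recallAtK`, and `regressed` are hypothetical names, and the layout of golden.jsonl is not assumed.

```typescript
// One golden pair after running the query: ids expected vs. ids returned,
// with retrievedIds ordered by rank (best first).
interface EvalCase {
  expectedIds: string[];
  retrievedIds: string[];
}

// Fraction of cases with at least one expected id in the top k.
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  let hits = 0;
  for (const c of cases) {
    const topK = new Set(c.retrievedIds.slice(0, k));
    if (c.expectedIds.some((id) => topK.has(id))) hits += 1;
  }
  return hits / cases.length;
}

// True when the PR's recall dropped by more than maxDropPp percentage
// points relative to main. The epsilon absorbs floating-point noise so a
// drop of exactly 5pp is not flagged.
function regressed(
  mainRecall: number,
  prRecall: number,
  maxDropPp: number = 5,
): boolean {
  return (mainRecall - prRecall) * 100 > maxDropPp + 1e-9;
}
```

With 50 to 100 cases each query moves recall by 1 to 2 percentage points, so a 5pp threshold tolerates a handful of flips before failing the PR.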
Generate MEM0_ADMIN_API_KEY (32-byte random); mint a Google API key (GOOGLE_API_KEY_MEMORY) in a dedicated "memory" GCP project restricted to the Generative Language API; Railway project with Postgres and Railway native daily backups enabled; run init.sql against Postgres; deploy the vanilla mem0ai/mem0 Docker image.
Catalog shows memory; MCP tools list at /engines/memory/mcp; full cycle works; D1 dedupe rows appear; payment-gateway shows one {operation, count:1} usage event per successful MCP tool call (failures emit nothing).
Repeat smoke + DR drill.
Gate external rollout on the tighter DR posture from Hardening landing (WAL-G to R2 or Railway PITR) and eval recall@5 stable for 14 days.