Architecture patterns | Zep Documentation

Every Zep integration is shaped by a few key architectural choices — what context to store, how to ingest it, and how to retrieve it. This page first walks through those choices, then presents some of the most common architecture patterns that result from combining them. The patterns below are not exhaustive; your architecture may combine these choices differently depending on your use case.

Key architectural choices

These are the key architectural choices to consider before implementing Zep for your use case:

Context scope

Use case condition	Recommendation	Relevant Docs
You want to persist and retrieve user-specific context (for example: chat history, preferences, support interactions)	Use a user context graph	Create a Zep user
You want to persist and retrieve domain context (for example: company policies, product data, runbooks)	Use a standalone context graph	Create Graph
You want to persist and retrieve both user-specific context and domain context in the same workflow	Use user + standalone context graphs and assemble context from both	Create a Zep user, Create Graph

What data you persist to Zep

Use case condition	Recommendation	Relevant Docs
You want to persist conversational context from turns in an agent/chat loop	Use Thread API for chat ingestion and thread continuity	Create a Zep thread
You want to persist non-chat system context (CRM, logs, tickets, emails, docs, events)	Use Graph API for ingestion	Adding Business Data
You want to persist both conversational context and business/system context	Use Thread API + Graph API together	Create a Zep thread, Adding Business Data

Retrieval strategy

Use case condition	Recommendation	Relevant Docs
You want context retrieved deterministically, before every agent response	Retrieve or assemble a context block from Zep on each turn. Options include: Zep’s default context block, custom context templates, or advanced context block construction	Assembling Context Methods
You want retrieval to happen through agent tool calls	Expose `graph.search()` as a tool and let the LLM call it when needed	Searching the Graph

Retrieval strategy describes when and how context is pulled. What kinds of context exist — facts, entities, episodes, thread summaries, observations, and the user summary — is a separate decision. See Context Types for descriptions of each primitive and when to reach for it.

Deployment and governance

Each pattern above runs across Zep’s deployment models: Cloud (fully managed), Cloud + BYOK (you supply your own keys), and BYOC (Zep runs inside your VPC). The trust boundary moves with your deployment, so you can adopt any pattern without changing where your context lives. See Security & Compliance for details.

Pattern 1: ingesting conversations + user data

Problem: Your conversational agent forgets chat history between sessions and has no context about the user beyond the current conversation.

This is the most common pattern for chatbots and conversational assistants. Chat messages are persisted through the Thread API, and user-specific business data (CRM records, support history, etc.) is sent to the user’s context graph via the Graph API. Context is retrieved from the user’s context graph before each response.

Architecture diagram

Messages flow into Zep via the Thread API. Zep extracts facts and entities into the user’s context graph. Business data (CRM records, user events, etc.) can be sent directly to the graph via graph.add(). When you call get_user_context(), Zep uses the latest messages in the thread as a search query to assemble relevant context from the user’s context graph.

When to use

You need user-specific context in a conversational experience.
You want to persist chat messages to Zep using the Thread API.
You may also want to persist user-specific business data (CRM/events) using Graph API.
You want automatic context retrieval per turn via get_user_context().

How to implement

This is the pattern used in the Quick Start Guide, which walks through the complete end-to-end setup.

Pattern 2: ingesting domain data

Problem: Your agent needs access to company knowledge, business data, or other domain context that doesn’t come from a chat conversation.

Use this pattern when you want to populate Zep with domain knowledge, business data, or external data sources that exist outside of conversations. This is common for background data ingestion pipelines and agents that don’t have a traditional chat interface.

Architecture diagram

Data from any source is ingested directly into a context graph via graph.add(). The agent retrieves context via graph.search() without needing a thread.

When to use

You need domain knowledge such as company policies, product data, or runbooks.
You are ingesting business data, events, logs, or transcripts directly with Graph API.
You do not need thread-based chat persistence for this workflow.
You want deterministic retrieval via direct graph.search() calls in application code.

How to implement

This is the pattern used in the Give Your Agent Domain Knowledge cookbook, which walks through ingesting data into a graph and searching it end-to-end.

Related guides: Adding Business Data, Searching the Graph, Group Chat FAQ

Pattern 3: ingesting conversations + user data with tool-call retrieval

Problem: Your conversational agent forgets chat history between sessions and has no context about the user beyond the current conversation — but you want the agent to decide when to retrieve context rather than injecting it on every turn.

This pattern uses the same ingestion as Pattern 1 — messages are persisted through the Thread API and business data is sent to the graph — but retrieval happens through agent tool calls instead of automatically. The LLM decides whether and when to search based on the conversation context.

Architecture diagram

Ingestion is identical to Pattern 1: messages flow into Zep via the Thread API, facts are auto-extracted into the graph, and business data can be sent directly via graph.add(). The difference is retrieval — instead of calling get_user_context() on every turn, the agent exposes graph.search() as a tool and the LLM decides when to call it.

When to use

You are ingesting conversations and user data the same way as Pattern 1.
You want the LLM to decide when context retrieval is needed, rather than retrieving on every turn.
You are building a multi-tool agent where context search is one of several capabilities.
Your agent has variable context needs per interaction — some turns need retrieval, others don’t.

How to implement

Follow the Quick Start Guide for ingestion setup, but instead of calling get_user_context() on every turn, expose graph.search() as a tool in your agent framework and let the LLM call it when needed.

Related guides: Searching the Graph, LangGraph Integration

Pattern 4: ingesting conversations, user data, and domain data

Problem: Your conversational agent needs both user-specific context (chat history, preferences, account details) and domain context (company policies, product catalog, runbooks) to generate informed responses.

This pattern combines Patterns 1 and 2. Chat messages are persisted through the Thread API, user-specific business data goes to the user graph, and domain knowledge goes to a standalone graph. At retrieval time, the agent assembles context from both graphs and includes both in the LLM’s context window.

Architecture diagram

Chat messages flow into Zep via the Thread API, and facts are auto-extracted into the user graph. User-specific business data is sent to the user graph via graph.add(), while domain data (product catalogs, policies, etc.) is sent to a standalone graph. At retrieval time, the agent calls get_user_context() for user context and graph.search() on the standalone graph for domain context, then includes both in the LLM’s context window.

When to use

Your agent needs both user-specific context and domain knowledge in the same conversation.
You are persisting chat messages via the Thread API and user data to a user graph.
You are also ingesting domain data into a standalone graph shared across users.
You want to assemble context from both graphs before each agent response.

How to implement

This pattern combines the implementations from the Quick Start Guide (user context and thread ingestion), the Give Your Agent Domain Knowledge cookbook (standalone graph ingestion), and the Share Context Across Users Using Graphs cookbook (retrieving from both graphs and combining them into a single context block).

Pattern 5: low latency architecture

Problem: Your agent needs to provide fast, responsive interactions — and even small increases in context retrieval latency meaningfully degrade the user experience.

Zep retrieval is sub-200ms regardless of graph size or count. This pattern applies to any of the four architectures above. The difference is that it incorporates Zep’s latency optimizations to minimize the time between a user’s input and the agent’s response. This is especially important for voice and video agents, where latency is immediately perceptible, but it benefits any agent using Zep where user experience matters.

When to use

You are building a voice or video agent where response latency is directly felt by the user.
Your agent has strict latency requirements regardless of modality.
You want to minimize round trips and overlap Zep operations with other work in your agent loop.

How to implement

Follow the Performance Best Practices guide, which covers the key latency optimizations: requesting context in the same call as message ingestion, running graph operations concurrently, and warming the user cache ahead of retrieval.