Performance Optimization Guide
This guide covers best practices for optimizing Zep’s performance in production environments.
Reuse the Zep SDK Client
The Zep SDK client maintains an HTTP connection pool that enables connection reuse, significantly reducing latency by avoiding the overhead of establishing new connections. To optimize performance:
- Create a single client instance and reuse it across your application
- Avoid creating new client instances for each request or function
- Consider implementing a client singleton pattern in your application
- For serverless environments, initialize the client outside the handler function
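The points above can be sketched as a cached accessor. This is a minimal illustration, not the SDK's prescribed pattern: `StandInClient`, `get_client`, and `handler` are hypothetical names, and the stand-in class takes the place of the real Zep client so that the snippet runs on its own.

```python
from functools import lru_cache


class StandInClient:
    """Hypothetical placeholder for the real Zep SDK client class."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key


@lru_cache(maxsize=1)
def get_client() -> StandInClient:
    # Built once on the first call; every later call returns the same cached
    # instance, so one connection pool is shared across the application.
    return StandInClient(api_key="YOUR_API_KEY")


# Module scope: runs once per process (or once per warm serverless container),
# outside the handler function.
CLIENT = get_client()


def handler(event: dict) -> bool:
    # The handler reuses the module-level client instead of constructing
    # a new one on each invocation.
    return CLIENT is get_client()
```

In a real application the stand-in would be replaced by the actual client constructor; the caching logic stays the same.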
Optimizing Memory Operations
The thread.add_messages and thread.get_user_context methods are optimized for conversational messages and low-latency retrieval. For optimal performance:
- Keep individual messages under 10,000 characters
- Use graph.add for larger documents, tool outputs, or business data
- Consider chunking large documents before adding them to the graph (the graph.add endpoint has a 10,000 character limit)
- Remove unnecessary metadata or content before persistence
- For bulk document ingestion, process documents in parallel while respecting rate limits
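One way to chunk and ingest in parallel is sketched below. The helper names `chunk_text` and `ingest_parallel` are hypothetical, and `add_chunk` stands for whatever callable wraps your graph.add request. Capping the worker count is only a crude way to stay under rate limits, assuming your plan tolerates that many concurrent requests.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_CHARS = 10_000  # graph.add character limit stated in this guide


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split text into pieces of at most max_chars, preferring paragraph breaks."""
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last blank line inside the window.
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut + 2
        chunks.append(text[start:end])
        start = end
    return chunks


def ingest_parallel(
    chunks: List[str],
    add_chunk: Callable[[str], object],
    max_workers: int = 4,
) -> List[object]:
    """Send chunks concurrently; max_workers bounds the requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in the returned results.
        return list(pool.map(add_chunk, chunks))
```

Joining the chunks reproduces the original text exactly, so nothing is lost at the boundaries. A production version would likely add retries with backoff when a rate-limit response is returned.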
Use the Basic Context Block
Zep’s context block can be in either summarized or basic form (summarized by default). Retrieving the basic form reduces latency (P95 < 200 ms) because it bypasses the final summarization step.
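A minimal sketch of requesting the basic form follows. The guide does not name the parameter that selects it, so the `mode="basic"` keyword is an assumption to verify against your SDK version; `get_basic_context` is a hypothetical helper and `client` is an already constructed Zep client.

```python
from typing import Any


def get_basic_context(client: Any, thread_id: str) -> Any:
    # ASSUMPTION: the basic (non-summarized) form is selected with
    # mode="basic". Check the SDK reference for the exact parameter name.
    return client.thread.get_user_context(thread_id=thread_id, mode="basic")
```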
Get the Context Block Sooner
Additionally, you can request the Context Block directly in the response to the thread.add_messages() call. This eliminates the need for a separate thread.get_user_context() call, though it always returns the basic Context Block type.
Read more about our Context Block.
In this scenario, pass the return_context=True flag to the thread.add_messages() method.
Zep will perform a user graph search right after persisting the memory and return the context relevant to the recently added memory.
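The call might look like the sketch below. The `return_context=True` flag comes from this guide; reading the result from a `context` attribute on the response is an assumption, and `add_messages_with_context` is a hypothetical helper. `messages` should be whatever message objects your SDK version expects.

```python
from typing import Any, Sequence


def add_messages_with_context(
    client: Any, thread_id: str, messages: Sequence[Any]
) -> Any:
    # One round trip: persist the messages and ask for the context block
    # in the same response.
    response = client.thread.add_messages(
        thread_id, messages=messages, return_context=True
    )
    # ASSUMPTION: the context block is exposed as `response.context`.
    # Returns None if the attribute is absent.
    return getattr(response, "context", None)
```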
Optimizing Search Queries
Zep uses hybrid search combining semantic similarity and BM25 full-text search. For optimal performance:
- Keep your queries concise. Queries are automatically truncated to 8,192 tokens (approximately 32,000 Latin characters)
- Longer queries may not improve search quality and will increase latency
- Consider breaking down complex searches into smaller, focused queries
- Use specific, contextual queries rather than generic ones
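A small client-side helper can keep queries well below the truncation limit. This is a sketch only: `prepare_query` is a hypothetical name, the 500-character default budget is an arbitrary illustration rather than a Zep recommendation, and the 32,000-character ceiling is the approximation quoted above.

```python
APPROX_MAX_QUERY_CHARS = 32_000  # ~8,192 tokens of Latin text, per this guide


def prepare_query(query: str, max_chars: int = 500) -> str:
    """Collapse whitespace and trim the query to a character budget."""
    if max_chars > APPROX_MAX_QUERY_CHARS:
        raise ValueError("budget exceeds the approximate truncation limit")
    collapsed = " ".join(query.split())
    if len(collapsed) <= max_chars:
        return collapsed
    cut = collapsed[:max_chars]
    if collapsed[max_chars] == " ":
        # The budget ends exactly on a word boundary; keep the cut as is.
        return cut
    # Otherwise drop the trailing partial word, if there is a space to cut at.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut
```

Trimming by character count is a blunt tool. Rewriting a long prompt into a short, specific question usually serves search quality better than truncation.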
Best practices for search:
- Keep search queries concise and specific
- Structure queries to target relevant information
- Use natural language queries for better semantic matching
- Consider the scope of your search (graphs versus user graphs)
Summary
- Reuse Zep SDK client instances to optimize connection management
- Use appropriate methods for different types of content (thread.add_messages for conversations, graph.add for large documents)
- Keep search queries focused and under the token limit for optimal performance