This guide covers best practices for optimizing Zep’s performance in production environments.
Zep delivers sub-200ms context retrieval regardless of graph size or number of graphs. On public benchmarks, Zep records 155ms retrieval latency on LoCoMo (94.7% accuracy) and 162ms on LongMemEval (90.2% accuracy). The optimizations below are application-side techniques that preserve this baseline and minimize end-to-end latency in your agent loop.
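The Zep SDK client maintains an HTTP connection pool that enables connection reuse, which reduces latency by avoiding the overhead of establishing new connections. To benefit from the pool, create the client once and reuse that instance across requests rather than constructing a new client for each call.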
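The connection reuse described above only helps if the same client object is shared. A minimal sketch of one way to do that in Python follows; `make_client_provider` is a hypothetical helper, not part of the Zep SDK, and the factory you pass in is whatever constructs your client.

```python
from functools import lru_cache
from typing import Any, Callable


def make_client_provider(factory: Callable[[], Any]) -> Callable[[], Any]:
    """Wrap a client factory so the client is built once and then reused.

    `factory` is any zero-argument callable that builds a client.
    Assumption: with the Zep SDK this would be something like
    `lambda: Zep(api_key=...)`; check the SDK reference for the exact
    constructor.
    """

    @lru_cache(maxsize=1)
    def get_client() -> Any:
        # The first call runs the factory; later calls return the cached
        # instance, so its HTTP connection pool is shared.
        return factory()

    return get_client
```

Calling the returned provider from every request handler yields the same instance each time, so pooled connections stay warm between calls.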
The thread.add_messages and thread.get_user_context methods are optimized for conversational messages and low-latency retrieval. For optimal performance:
Use graph.add for larger documents, tool outputs, or business data (up to 10,000 characters per call)
You can request the Context Block directly in the response to the thread.add_messages() call.
This optimization eliminates the need for a separate thread.get_user_context() call.
Read more about our Context Block.
In this scenario you can pass in the return_context=True flag to the thread.add_messages() method.
Zep will perform a user graph search right after persisting the data and return the context relevant to the recently added messages.
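A minimal sketch of this pattern, written against a client passed in as a parameter so it does not depend on a particular SDK version. The `return_context=True` flag comes from the text above; the `messages` keyword and the `context` attribute on the response are assumptions to verify against the SDK reference.

```python
from typing import Any, Optional, Sequence


def add_messages_with_context(
    client: Any, thread_id: str, messages: Sequence[Any]
) -> Optional[str]:
    """Persist messages and return the Context Block from the same call."""
    response = client.thread.add_messages(
        thread_id,
        messages=list(messages),  # assumed keyword name
        return_context=True,      # flag described in this guide
    )
    # Assumption: the response exposes the Context Block as `.context`.
    # Fall back to None if the attribute is absent.
    return getattr(response, "context", None)
```

With this in place, the agent loop makes one round trip per turn instead of an add followed by a separate thread.get_user_context() call.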
Instead of using thread.get_user_context, you might want to search the graph directly with custom parameters and construct your own context block. When doing this, you can run the graph search and the data write concurrently rather than one after the other.
You would then need to construct a custom context block using the search results. Learn more about customizing your context block.
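The concurrent search-and-add pattern can be sketched with `asyncio.gather`. This assumes an async client whose `thread.add_messages` and `graph.search` methods are coroutines; the `graph.search` name and the `user_id` and `query` keywords are assumptions, and `build_context_block` is a hypothetical formatter that takes plain fact strings you have already extracted from the search results.

```python
import asyncio
from typing import Any, Sequence


async def add_and_search(
    client: Any,
    user_id: str,
    thread_id: str,
    messages: Sequence[Any],
    query: str,
) -> Any:
    """Write new messages and search the graph at the same time."""
    # gather() schedules both coroutines concurrently and returns their
    # results in argument order.
    _, results = await asyncio.gather(
        client.thread.add_messages(thread_id, messages=list(messages)),
        client.graph.search(user_id=user_id, query=query),  # assumed signature
    )
    return results


def build_context_block(facts: Sequence[str]) -> str:
    """Format fact strings into a simple custom context block."""
    return "FACTS:\n" + "\n".join(f"- {fact}" for fact in facts)
```

Because the search runs alongside the write, its results may not reflect the messages being added in the same turn, so include the latest user message in the prompt yourself.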
Zep uses hybrid retrieval combining semantic (vector) similarity, BM25 full-text search, and graph traversal in a single ranked result. For optimal performance:
Best practices for search:
Zep’s proprietary runtime, the Context Graph Engine, serves retrieval over a three-tier data layer. The highest tier is a “hot” cache where a user’s context retrieval is fastest. After several hours of no activity, a user’s data will be moved to a lower tier.
You can hint to Zep that a retrieval may be made soon, allowing Zep to move user data into cache ahead of this retrieval. A good time to do this is when a user logs in to your service or opens your app.
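A minimal sketch of issuing the hint from a login handler. The method name `user.warm` and its `user_id` keyword are assumptions here; confirm them in the SDK reference. `warm_on_login` is a hypothetical helper.

```python
from typing import Any


def warm_on_login(client: Any, user_id: str) -> bool:
    """Ask Zep to move a user's data into cache; never block login on it."""
    try:
        client.user.warm(user_id=user_id)  # assumed SDK method name
        return True
    except Exception:
        # Warming is only a hint, so a failure here should not surface
        # to the user; retrieval still works, just from a lower tier.
        return False
```

Calling this when a session starts gives Zep time to move the data before the first retrieval of the conversation.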
Use thread.add_messages for conversations and graph.add for large documents.
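For documents longer than the 10,000-character limit on graph.add mentioned earlier, one option is to split the text before sending it. The sketch below uses naive fixed-width chunks; the `user_id`, `type`, and `data` keywords on `graph.add` are assumptions, and `chunk_text` and `add_document` are hypothetical helpers.

```python
from typing import Any, List


def chunk_text(text: str, max_chars: int = 10_000) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def add_document(client: Any, user_id: str, text: str) -> int:
    """Send a large document to the graph chunk by chunk.

    Returns the number of graph.add calls made.
    """
    chunks = chunk_text(text)
    for chunk in chunks:
        # Assumed keyword names; check the SDK reference.
        client.graph.add(user_id=user_id, type="text", data=chunk)
    return len(chunks)
```

Fixed-width splitting can cut a sentence in half; splitting on paragraph or section boundaries that stay under the limit is likely to produce cleaner chunks.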