Quickstart
Run end-to-end memory evaluations using Zep’s evaluation framework
This guide shows you how to use Zep’s evaluation harness to systematically test your memory implementation.
Why use the evaluation harness?
With this evaluation harness, you can:
- Evaluate Zep’s performance for your use case: Test how well Zep retrieves relevant information and answers questions specific to your domain and conversation patterns.
- Systematically experiment with Zep ontologies, search strategies, and other capabilities: Compare different configurations to optimize retrieval accuracy and response quality.
- Develop a suite of tests that can be run in CI: Continuously evaluate your application for regressions, ensuring that changes to your data model or Zep configuration don’t degrade memory performance over time.
The harness provides objective metrics for context completeness and answer accuracy, enabling data-driven decisions about memory configuration and search strategies.
Steps
Clone the Zep repository
Clone the Zep repository that includes the evaluation harness:
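A hedged example, assuming the harness lives in the main getzep/zep repository (substitute the URL and directory from Zep's documentation if it points elsewhere):

```bash
# Assumed repository location; use the URL from Zep's docs if it differs
git clone https://github.com/getzep/zep.git
cd zep   # then change into the directory containing the evaluation harness
```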
Set up your environment
Install UV package manager for macOS/Linux:
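UV's standalone installer for macOS/Linux is typically run as:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```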
For other platforms, visit the UV installation guide.
Install all required dependencies using UV:
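Assuming the harness ships a `pyproject.toml` and `uv.lock` (which `uv sync` expects), the dependencies install with:

```bash
uv sync
```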
Set up your API keys by copying the example file and adding your keys:
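A typical sequence, assuming the repository includes a `.env.example` template:

```bash
cp .env.example .env
```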
Edit .env and add your keys:
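The variable names below are an assumption; use whatever names appear in the example file:

```bash
ZEP_API_KEY=your-zep-api-key
OPENAI_API_KEY=your-openai-api-key
```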
Get your Zep API key at app.getzep.com and OpenAI API key at platform.openai.com/api-keys.
Write down 3-5 example interactions
Most important step: Take time to write down 3-5 specific examples that show how you want your agent to behave once it has memory. In the next step, these examples are dropped into an AI prompt to automatically generate your evaluation data.
For each example, simply note what the user asks and what the agent should respond with:
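For instance (a purely illustrative pair; replace with scenarios from your own domain):

```text
Example 1
User asks:    "What restaurants would you recommend for dinner tonight?"
Agent should: Remember that the user is vegetarian and previously enjoyed Thai food,
              and recommend vegetarian-friendly Thai options without asking again.

Example 2
User asks:    "When is my next follow-up appointment?"
Agent should: Recall the appointment date mentioned in an earlier conversation and
              confirm it, rather than saying it has no record.
```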
Use an AI coding assistant to update the test data
Use Cursor, Copilot, Claude Code, or another AI coding assistant to automatically update the test files based on your examples.
Provide this prompt to your AI assistant:
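The exact prompt is not reproduced here; a rough sketch of what to ask, with your examples pasted in, might be:

```text
Here are 3-5 example interactions showing how I want my agent to behave once it
has memory: [paste your examples]. Update the evaluation harness's test data
files (the sample conversations and their test questions/golden answers) to
reflect these examples, keeping the existing file structure and format intact.
```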
Run the ingestion script
Load your test conversations into Zep:
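A hypothetical invocation (the script name `ingest.py` is an assumption; check the repository for the actual entry point):

```bash
uv run python ingest.py
```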
The ingestion process creates numbered run directories (e.g., 1_20251103T123456) containing manifest files that document created users, thread IDs, and configuration details.
For ingestion with a custom ontology:
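Again hypothetical; the flag name is an assumption, so check the ingestion script's `--help` output for the real option:

```bash
# Flag name is illustrative only
uv run python ingest.py --use-custom-ontology
```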
Wait for graph processing to complete
After ingestion completes, the knowledge graph needs time to process all messages and extract facts, entities, and relationships. Graph processing happens sequentially to preserve the temporal sequence of events.
Processing time: 5-10 seconds per message. With 5 conversations of 6 messages each (30 messages total), expect processing to take approximately 2.5-5 minutes.
You can monitor processing status in the Zep dashboard or wait for the recommended time before proceeding to evaluation.
Run the evaluation script
Execute the evaluation pipeline:
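For example, assuming the script runs directly with UV:

```bash
uv run python zep_evaluate.py
```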
To evaluate a specific run:
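The run-selection flag below is an assumption; check the script's `--help` for the actual option:

```bash
# Hypothetical flag; the run number matches a directory under runs/
uv run python zep_evaluate.py --run 1
```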
The script processes each test question through four automated steps:
- Search: Query Zep’s knowledge graph using a cross-encoder reranker to retrieve relevant information
- Evaluate context: Assess whether the retrieved information is sufficient to answer the test question (produces the primary metric: COMPLETE, PARTIAL, or INSUFFICIENT)
- Generate response: Use GPT-4o-mini with the retrieved context to generate an answer
- Grade answer: Evaluate the generated response against the golden answer using GPT-4o (produces the secondary metric: CORRECT or WRONG)
The context completeness evaluation (step 2) is the primary metric as it measures Zep’s core capability: retrieving relevant information. The answer grading (step 4) is secondary since it also depends on the LLM’s ability to use that context.
Results are saved to runs/{run_number}/evaluation_results_{timestamp}.json.
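The sketch below is not the harness's actual code; it only illustrates the shape of the four-step loop, assuming the zep-cloud and openai Python SDKs. The function name, prompts, and the model used for the context check are assumptions, and error handling is omitted.

```python
from zep_cloud.client import Zep
from openai import OpenAI

zep = Zep(api_key="...")   # Zep Cloud client
llm = OpenAI()             # reads OPENAI_API_KEY from the environment


def evaluate_test_case(user_id: str, question: str, golden_answer: str) -> dict:
    # 1. Search: retrieve candidate facts from the user's knowledge graph
    results = zep.graph.search(
        user_id=user_id,
        query=question,
        reranker="cross_encoder",
        limit=20,
    )
    context = "\n".join(edge.fact for edge in (results.edges or []))

    # 2. Evaluate context: is the retrieved information sufficient? (primary metric)
    completeness = llm.chat.completions.create(
        model="gpt-4o",  # assumed judge model for this step
        messages=[{"role": "user", "content":
            f"Question: {question}\nContext: {context}\n"
            "Answer with one word: COMPLETE, PARTIAL, or INSUFFICIENT."}],
    ).choices[0].message.content

    # 3. Generate response: answer the question using only the retrieved context
    answer = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
            f"Context: {context}\nQuestion: {question}\nAnswer concisely."}],
    ).choices[0].message.content

    # 4. Grade answer: judge the generated answer against the golden answer (secondary metric)
    grade = llm.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            f"Golden answer: {golden_answer}\nCandidate answer: {answer}\n"
            "Reply with one word: CORRECT or WRONG."}],
    ).choices[0].message.content

    return {"completeness": completeness, "answer": answer, "grade": grade}
```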
Interpret your results
The evaluation results include overall accuracy on the test questions and a detailed per-test breakdown. Look at these key metrics:
- Context completeness: Whether Zep retrieved all necessary information (COMPLETE, PARTIAL, or INSUFFICIENT). This is your primary indicator of Zep’s retrieval performance.
- Answer accuracy: Whether the generated answer matched your golden answer criteria (CORRECT or WRONG). This measures both retrieval and generation quality.
- Per-user breakdown: Performance metrics for each user to identify patterns.
- Detailed test results: Individual test case results with retrieved context, generated answers, and the LLM judge’s reasoning.
The script prints overall scores and saves detailed results including which questions the agent answered correctly versus missed, along with the LLM judge’s reasoning for each evaluation.
Review results and iterate
Look at the evaluation results to identify any missed questions. For each incorrect answer:
- Check if the conversation data contains the necessary information
- Verify the golden_answer is clear and specific
- Review the retrieved context in the results JSON to understand what Zep found
- Adjust your conversations or test questions as needed
If context is consistently incomplete, consider adjusting your data ingestion strategy, search parameters, or graph configuration.
Iterate by modifying your data files, then re-run the ingestion and evaluation scripts.
Next steps
Once you have the basic evaluation working, consider these next steps:
- Add more examples and variations: Expand your test set with additional examples and variations of existing scenarios to cover more edge cases.
- Evaluate Zep’s performance with your existing agent: Once you’ve validated Zep’s retrieval capabilities with the evaluation harness, integrate Zep into your existing agent and evaluate end-to-end performance. Create test cases based on real user conversations from your application to reflect actual usage patterns. This helps you understand how Zep performs in your complete system, including your agent’s prompt engineering, tool calling, and response generation.
- Define a custom ontology for your domain: Create entity and edge types tailored to your specific use case for better knowledge graph structure and retrieval. Use an AI coding assistant to define custom types based on your conversation data (a prompt sketch appears after this list). After updating ontology.py, run ingestion with the custom ontology flag, as shown in the ingestion step above. Learn more about customizing graph structure.
- Add background data: Ingest a larger dataset before your test conversations to evaluate retrieval performance when relevant information is buried in a larger knowledge graph.
- Test with JSON and unstructured data: Add JSON documents, transcripts, or business data alongside conversations, then create test questions that require retrieving this non-conversational data. See Adding Data to the Graph.
- Tune search strategy and graph parameters: Experiment with different rerankers, search scopes, and graph creation settings like ignoring assistant messages to optimize performance for your use case. You can customize the evaluation parameters in zep_evaluate.py:
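What follows is a hedged sketch of the kind of parameters involved, assuming the harness wraps the zep-cloud SDK's graph.search call; the actual variable names in zep_evaluate.py will differ.

```python
from zep_cloud.client import Zep

zep = Zep(api_key="...")

# Illustrative only: the actual names and structure in zep_evaluate.py will differ
results = zep.graph.search(
    user_id="some-user-id",
    query="What are the user's dietary restrictions?",
    scope="edges",              # "edges" (facts), "nodes" (entities), or "episodes"
    reranker="cross_encoder",   # or "rrf", "mmr", "node_distance", "episode_mentions"
    limit=20,                   # number of results to retrieve per query
)
```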
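For the custom-ontology item above, the prompt you give your AI assistant might look roughly like this (a sketch, not the repository's wording):

```text
Based on the conversation data in this repository, define custom entity and edge
types in ontology.py that capture the key concepts in my domain (for example:
people, products, preferences, and the relationships between them). Keep the
existing structure of ontology.py and register the new types the same way the
current ones are registered.
```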