Quickstart
Evaluate Zep’s memory retrieval and question-answering capabilities
The Zep Eval Harness is an end-to-end evaluation framework for testing Zep’s memory retrieval and question-answering capabilities for general conversational scenarios. This guide will walk you through setting up and running the harness to evaluate your Zep implementation.
Prerequisites
Before getting started, ensure you have:
- Zep API Key: Available at app.getzep.com
- OpenAI API Key: Obtainable from platform.openai.com/api-keys
- UV Package Manager: The harness uses UV for Python dependency management
Installation
Install UV
Install the UV package manager on macOS/Linux using the official installer script:
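```bash
# Official UV installer for macOS/Linux (see the UV installation guide for other methods)
curl -LsSf https://astral.sh/uv/install.sh | sh
```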
For other platforms, visit the UV installation guide.
Data structure
The harness expects data files in the following structure:
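At a glance, the layout looks like this (each file type is described in the subsections that follow):

```
data/
├── users.json
├── conversations/
│   └── {user_id}_{conversation_id}.json
├── test_cases/
│   └── {user_id}_tests.json
└── telemetry/
    └── {user_id}_{data_type}.json
```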
Users file
Location: data/users.json
Contains user records with the fields user_id, first_name, last_name, and email, plus any optional metadata fields.
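A minimal example, assuming the file holds a list of user objects (the values and the extra occupation field are purely illustrative):

```json
[
  {
    "user_id": "user_001",
    "first_name": "Jane",
    "last_name": "Doe",
    "email": "jane.doe@example.com",
    "occupation": "product designer"
  }
]
```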
Conversations
Location: data/conversations/
Files named {user_id}_{conversation_id}.json containing:
conversation_id, user_id, and messages (an array of objects with role, content, and timestamp).
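For example (values and the timestamp format are illustrative; the field names match the list above):

```json
{
  "conversation_id": "conv_001",
  "user_id": "user_001",
  "messages": [
    {
      "role": "user",
      "content": "I just moved to Lisbon to start a new job as a product designer.",
      "timestamp": "2024-05-01T10:15:00Z"
    },
    {
      "role": "assistant",
      "content": "Congratulations on the move! How are you finding Lisbon so far?",
      "timestamp": "2024-05-01T10:15:08Z"
    }
  ]
}
```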
Test cases
Location: data/test_cases/
Files named {user_id}_tests.json with test structure:
id, category, query, golden_answer, and a requires_telemetry flag.
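For example, assuming the file holds a list of test objects (values are illustrative):

```json
[
  {
    "id": "test_001",
    "category": "recent_events",
    "query": "Which city did the user recently move to?",
    "golden_answer": "Lisbon",
    "requires_telemetry": false
  }
]
```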
Optional telemetry
Location: data/telemetry/
Files named {user_id}_{data_type}.json containing any JSON data with a user_id field.
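For example, a file named user_001_app_usage.json (everything beyond the required user_id field is illustrative):

```json
{
  "user_id": "user_001",
  "events": [
    { "event": "login", "timestamp": "2024-05-02T08:00:00Z" },
    { "event": "viewed_design_templates", "timestamp": "2024-05-02T08:05:00Z" }
  ]
}
```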
Running the evaluation
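Assuming the harness is run as UV-managed scripts (the exact entry points and environment variable names may differ from the sketch below; zep_evaluate.py is the script referenced in the Configuration section), a run typically looks like:

```bash
# Install the project's Python dependencies
uv sync

# API keys for Zep and OpenAI (variable names assumed here; check the harness README)
export ZEP_API_KEY="your-zep-api-key"
export OPENAI_API_KEY="your-openai-api-key"

# Run the evaluation (an ingestion step, if separate, would run before this)
uv run zep_evaluate.py
```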
Understanding the evaluation pipeline
The harness performs four automated steps for each test case: it retrieves context from Zep for the test query, evaluates whether that context is complete, generates an answer from the context, and grades the answer against the golden answer.
Evaluate context
Assess whether the retrieved information is sufficient to answer the test question. This produces the primary metric (a small aggregation sketch follows the list):
- COMPLETE: All necessary information present
- PARTIAL: Some relevant information, but incomplete
- INSUFFICIENT: Missing critical information
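The sketch below shows one way these per-test ratings can be rolled up into an aggregate score; it is illustrative only, not the harness's actual implementation:

```python
from collections import Counter
from enum import Enum


class ContextRating(str, Enum):
    COMPLETE = "COMPLETE"
    PARTIAL = "PARTIAL"
    INSUFFICIENT = "INSUFFICIENT"


def completeness_breakdown(ratings: list[ContextRating]) -> dict[str, float]:
    """Return the fraction of test cases at each rating level."""
    counts = Counter(ratings)
    total = len(ratings) or 1  # guard against an empty run
    return {rating.value: counts[rating] / total for rating in ContextRating}


# Example: 2 COMPLETE, 1 PARTIAL, 1 INSUFFICIENT -> 50% of contexts were complete
print(completeness_breakdown([
    ContextRating.COMPLETE,
    ContextRating.COMPLETE,
    ContextRating.PARTIAL,
    ContextRating.INSUFFICIENT,
]))
```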
Configuration
You can customize the evaluation parameters in zep_evaluate.py:
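(The parameter names below are hypothetical illustrations of the kinds of settings typically exposed; check zep_evaluate.py itself for the actual names and defaults.)

```python
# Hypothetical configuration constants -- not the script's real contents.
EVAL_MODEL = "gpt-4o"          # OpenAI model used to judge context completeness
ANSWER_MODEL = "gpt-4o-mini"   # OpenAI model used to generate answers from the context
MAX_CONCURRENT_TESTS = 4       # how many test cases to evaluate in parallel
RESULTS_DIR = "results/"       # where per-run output files are written
```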
The context completeness evaluation (step 2) is the primary metric as it measures Zep’s core capability: retrieving relevant information. The answer grading (step 4) is secondary since it also depends on the LLM’s ability to use that context.
Output metrics
The evaluation results include:
- Aggregate scores: Overall context completeness and answer accuracy rates
- Per-user breakdown: Performance metrics for each user
- Detailed test results: Individual test case results with context and answers
- Performance timing: Processing time for each step
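A hypothetical results payload that mirrors these categories (field names and numbers are illustrative, not the harness's exact output schema):

```json
{
  "aggregate": {
    "context_complete_rate": 0.82,
    "answer_accuracy": 0.75
  },
  "per_user": {
    "user_001": { "context_complete_rate": 0.90, "answer_accuracy": 0.80 }
  },
  "tests": [
    {
      "id": "test_001",
      "context_rating": "COMPLETE",
      "answer": "The user lives in Lisbon.",
      "answer_correct": true
    }
  ],
  "timing": {
    "retrieval_seconds": 1.4,
    "evaluation_seconds": 2.1
  }
}
```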
Best practices
Design fair tests
Ensure the answer to each test question is present somewhere in the ingested data. Tests should evaluate Zep’s retrieval capabilities, not whether the information exists.
Account for processing time
Graph processing is asynchronous and typically takes 5-20 seconds per message. Episode processing time can vary significantly. Allow sufficient time between ingestion and evaluation.
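One naive way to budget for this in your own ingestion script (a sketch based on the 5-20 second figure above, not part of the harness; polling Zep for processing status would be more robust):

```python
import time


def wait_for_graph_processing(message_count: int, seconds_per_message: float = 20.0) -> None:
    """Sleep long enough for asynchronous graph processing to (probably) finish.

    Uses the upper end of the 5-20 seconds-per-message estimate. A production
    script would poll Zep for processing status instead of sleeping blindly.
    """
    time.sleep(message_count * seconds_per_message)


# e.g. after ingesting a 12-message conversation, wait up to ~4 minutes
wait_for_graph_processing(12)
```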
Use multiple test categories
Categorize your test cases to understand performance across different types of queries (e.g., personal preferences, work history, recent events).
Monitor context completeness
Focus on the context completeness metric as your primary indicator of Zep’s performance. If context is consistently incomplete, consider adjusting your data ingestion strategy or search parameters.
Next steps
- Learn more about customizing your context block
- Explore graph search parameters to optimize retrieval
- Understand best practices for memory management