Quickstart

Evaluate Zep’s memory retrieval and question-answering capabilities

The Zep Eval Harness is an end-to-end framework for evaluating Zep’s memory retrieval and question answering in general conversational scenarios. This guide walks you through setting up and running the harness to evaluate your Zep implementation.

Prerequisites

Before getting started, ensure you have:

  • A Zep account and API key
  • An OpenAI API key
  • Git installed to clone the repository

Installation

1

Clone the repository

Clone the Zep repository and navigate to the eval harness directory:

$ git clone https://github.com/getzep/zep.git
$ cd zep/zep-eval-harness

2

Install UV

Install the UV package manager for macOS/Linux:

$ curl -LsSf https://astral.sh/uv/install.sh | sh

For other platforms, visit the UV installation guide.

3

Install dependencies

Install all required dependencies using UV:

$ uv sync

4

Configure environment variables

Copy the example environment file and add your API keys:

$ cp .env.example .env

Edit the .env file to include your API keys:

ZEP_API_KEY=your_zep_api_key_here
OPENAI_API_KEY=your_openai_api_key_here

Data structure

The harness expects data files in the following structure:

Users file

Location: data/users.json

Contains user information with fields: user_id, first_name, last_name, email, and optional metadata fields.
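
For illustration, a minimal users.json might look like the following (values are placeholders; the top-level array and the occupation metadata field are illustrative assumptions):

[
  {
    "user_id": "user_001",
    "first_name": "Jane",
    "last_name": "Doe",
    "email": "jane.doe@example.com",
    "occupation": "Software engineer"
  }
]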

Conversations

Location: data/conversations/

Files named {user_id}_{conversation_id}.json containing:

  • conversation_id
  • user_id
  • messages array with role, content, and timestamp
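
A minimal sketch of one conversation file, with illustrative values (the ISO 8601 timestamp format is an assumption):

{
  "conversation_id": "conv_001",
  "user_id": "user_001",
  "messages": [
    {
      "role": "user",
      "content": "I just moved to Seattle for a new job.",
      "timestamp": "2025-01-15T09:30:00Z"
    },
    {
      "role": "assistant",
      "content": "Congratulations on the move! How are you settling in?",
      "timestamp": "2025-01-15T09:30:05Z"
    }
  ]
}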

Test cases

Location: data/test_cases/

Files named {user_id}_tests.json with test structure:

  • id
  • category
  • query
  • golden_answer
  • requires_telemetry flag
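
For example, a user_001_tests.json file might contain entries like this (values are illustrative; the top-level array is an assumption):

[
  {
    "id": "test_001",
    "category": "recent_events",
    "query": "Which city did the user recently move to?",
    "golden_answer": "The user recently moved to Seattle.",
    "requires_telemetry": false
  }
]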

Optional telemetry

Location: data/telemetry/

Files named {user_id}_{data_type}.json containing any JSON data with a user_id field.
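
Telemetry files are free-form; the only field assumed above is user_id. An illustrative example:

{
  "user_id": "user_001",
  "device_type": "mobile",
  "app_events": [
    {"event": "login", "timestamp": "2025-01-15T08:00:00Z"}
  ]
}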

Running the evaluation

1

Ingest data

Run the ingestion script to load your data into Zep:

$ uv run zep_ingest.py

For ingestion with a custom ontology:

$ uv run zep_ingest.py --custom-ontology

The ingestion process creates numbered run directories (e.g., 1_20251103T123456) containing manifest files that document created users, thread IDs, and configuration details.
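
The exact manifest schema is not shown here, but conceptually it captures something like the following (all field names are hypothetical, for illustration only):

{
  "run_number": 1,
  "created_at": "2025-11-03T12:34:56Z",
  "custom_ontology": false,
  "users": [
    {"user_id": "user_001", "thread_ids": ["thread_abc123"]}
  ]
}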

2

Run evaluation

Evaluate the most recent ingestion run:

$ uv run zep_evaluate.py

To evaluate a specific run:

$ uv run zep_evaluate.py 1

Results are saved to runs/{run_number}/evaluation_results_{timestamp}.json.

Understanding the evaluation pipeline

The harness performs four automated steps for each test case:

1

Retrieve context

Search Zep’s graph for information relevant to the test question, retrieving facts (edges), entities (nodes), and optionally episodes according to the limits in the configuration below.

2

Evaluate context

Assess whether the retrieved information is sufficient to answer the test question. This produces the primary metric:

  • COMPLETE: All necessary information present
  • PARTIAL: Some relevant information, but incomplete
  • INSUFFICIENT: Missing critical information
3

Generate response

Use GPT-4o-mini with the retrieved context to generate an answer to the test question.

4

Grade answer

Evaluate the generated response against the golden answer using GPT-4o. This produces the secondary metric:

  • CORRECT: Response matches golden answer
  • WRONG: Response does not match golden answer

Configuration

You can customize the evaluation parameters in zep_evaluate.py:

# Search limits
FACTS_LIMIT = 20  # Number of edges to return
ENTITIES_LIMIT = 10  # Number of nodes to return
EPISODES_LIMIT = 0  # Disabled by default

# Reranker options: cross_encoder (default), rrf, or mmr

The context completeness evaluation (step 2) is the primary metric as it measures Zep’s core capability: retrieving relevant information. The answer grading (step 4) is secondary since it also depends on the LLM’s ability to use that context.

Output metrics

The evaluation results include:

  • Aggregate scores: Overall context completeness and answer accuracy rates
  • Per-user breakdown: Performance metrics for each user
  • Detailed test results: Individual test case results with context and answers
  • Performance timing: Processing time for each step

Best practices

Ensure the answer to each test question is present somewhere in the ingested data. Tests should evaluate Zep’s retrieval capabilities, not whether the information exists.

Graph processing is asynchronous and typically takes 5-20 seconds per message. Episode processing time can vary significantly. Allow sufficient time between ingestion and evaluation.

Categorize your test cases to understand performance across different types of queries (e.g., personal preferences, work history, recent events).

Focus on the context completeness metric as your primary indicator of Zep’s performance. If context is consistently incomplete, consider adjusting your data ingestion strategy or search parameters.

Next steps