Retrieval philosophy

Understanding Zep's approach to optimizing for recall and latency.

Zep’s retrieval system is designed with two primary goals: high recall and low latency. This is a deliberate architectural choice that differs from systems optimized for precision.

Understanding recall vs. precision

Think of recall and precision as two different ways to measure retrieval quality:

  • Recall measures completeness: “Did we find all the relevant information?”
  • Precision measures accuracy: “Is everything we returned actually relevant?”

In practical terms:

  • High recall means you get all the relevant results, but might also get some less relevant ones
  • High precision means everything returned is highly relevant, but you might miss important information
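To make the two measures concrete, here is a minimal Python sketch that computes them from sets of item IDs. The function name and the sample IDs are illustrative assumptions and are not part of Zep's API.

```python
def recall_and_precision(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (recall, precision) for a single query.

    recall    = relevant items retrieved / all relevant items
    precision = relevant items retrieved / all retrieved items
    """
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical IDs: three items are truly relevant to the query.
relevant = {"a", "b", "c"}

# A broad result set finds everything relevant, plus two extras.
broad = recall_and_precision({"a", "b", "c", "d", "e"}, relevant)

# A narrow result set returns only relevant items, but misses one.
narrow = recall_and_precision({"a", "b"}, relevant)
```

In this sketch the broad result set has full recall and lower precision, while the narrow one has full precision and lower recall, which mirrors the two bullets above.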

The tradeoff in practice

Approach | What You Get | What You Risk | Best For
Optimize for Recall (Zep’s approach) | All relevant facts, plus some less relevant results | Larger context with some noise | Agents that need complete information to make decisions; real-time applications
Optimize for Precision | Only highly relevant results | Missing critical facts that could cause task failure | Use cases where context size is severely constrained; manual review workflows

Example scenario

User query: “What did we discuss about the Q2 marketing budget?”

Retrieval approach: Recall-Optimized (Zep)

Results returned:

  • Q2 marketing budget discussion ✓
  • Related Q2 sales projections ✓
  • Q3 budget planning mention ⚠️
  • Q2 hiring costs mentioning marketing ⚠️

Outcome: Agent has complete context, including tangentially related information. Can successfully answer follow-up questions about budget revisions.

Retrieval approach: Precision-Optimized

Results returned:

  • Q2 marketing budget discussion ✓
  • Related Q2 sales projections ✓

Outcome: Clean, focused results, but missing a separate conversation about budget revisions that didn’t explicitly mention “marketing budget.” Agent may provide incomplete information.

Why recall over precision?

Agents need comprehensive context to make informed decisions. Missing a critical fact can cause an agent to fail its task or provide incorrect information. By optimizing for recall, Zep ensures that relevant information is available to the agent, even if that means returning more results than strictly necessary.

The underlying principle: it’s better to provide complete information and let the agent or downstream LLM filter what’s relevant than to risk omitting something important.

Why latency matters

Real-time applications like conversational AI, live customer support, and interactive agents require fast responses. Zep’s retrieval architecture is optimized to return results in milliseconds, enabling seamless user experiences without perceptible delays.

Tuning the recall-precision tradeoff

The recall-optimized approach described here is how Zep is tuned out of the box. However, Zep provides several mechanisms to adjust this tradeoff for different use cases:

  • Limit search results: Control the maximum number of results returned
  • Apply filters: Narrow retrieval to specific time ranges, Entity and/or Edge labels, or other criteria
  • Adjust search parameters: Fine-tune ranking and relevance thresholds

These controls allow you to shift toward precision when your application demands it, while maintaining Zep’s fast retrieval performance.
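The effect of these three controls can be illustrated with a small Python sketch. This is not Zep's API: the Result class, its field names, the narrow_results function, and the sample scores and dates are all hypothetical, and the sketch simply post-processes a list of hits that is already in hand.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Result:
    """Hypothetical search hit; the field names are assumptions."""
    text: str
    score: float          # relevance score, higher is better
    label: str            # e.g. an entity or edge label
    created_at: datetime


def narrow_results(
    results: list[Result],
    limit: Optional[int] = None,
    labels: Optional[set[str]] = None,
    since: Optional[datetime] = None,
    min_score: float = 0.0,
) -> list[Result]:
    """Apply filters and a score threshold, then keep the top `limit` hits."""
    kept = [
        r for r in results
        if r.score >= min_score
        and (labels is None or r.label in labels)
        and (since is None or r.created_at >= since)
    ]
    kept.sort(key=lambda r: r.score, reverse=True)
    return kept if limit is None else kept[:limit]


# Made-up scores, labels, and dates, for illustration only.
hits = [
    Result("Q2 marketing budget discussion", 0.92, "Budget", datetime(2024, 4, 2)),
    Result("Related Q2 sales projections", 0.81, "Sales", datetime(2024, 4, 5)),
    Result("Q3 budget planning mention", 0.55, "Budget", datetime(2024, 5, 20)),
    Result("Q2 hiring costs mentioning marketing", 0.48, "Hiring", datetime(2024, 3, 15)),
]

# Shift toward precision: drop low-scoring hits and cap the result count.
focused = narrow_results(hits, limit=2, min_score=0.5)
```

Each argument trades some recall for precision: a lower limit or a higher threshold removes noise, but it can also remove a fact the agent would have needed.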

Balancing context size

While recall is our priority, Zep does consider token count when returning results. We balance the size of the resulting context with the goal of providing complete information, but when in doubt, we err on the side of ensuring your agent has what it needs to succeed.
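This page does not describe how Zep weighs token count against completeness. As one simple illustration of the idea, the sketch below greedily keeps the highest-scoring facts that fit a token budget. The function name, the scores, and the word-based token estimate are assumptions made for the example.

```python
def fit_to_token_budget(facts: list[tuple[str, float]], max_tokens: int) -> list[str]:
    """Greedily keep the highest-scoring facts that fit within max_tokens.

    Token counts are approximated by whitespace-separated words, which is
    an assumption; a real tokenizer would give different counts.
    """
    selected: list[str] = []
    used = 0
    for text, _score in sorted(facts, key=lambda f: f[1], reverse=True):
        cost = len(text.split())
        if used + cost <= max_tokens:
            selected.append(text)
            used += cost
    return selected


# Made-up (text, score) pairs, for illustration only.
facts = [
    ("Q2 marketing budget discussion", 0.9),
    ("Related Q2 sales projections", 0.8),
    ("Q3 budget planning mention", 0.5),
    ("Q2 hiring costs mentioning marketing", 0.4),
]

context = fit_to_token_budget(facts, max_tokens=12)
```

A generous budget keeps the behavior recall-oriented, while a tight one drops the lowest-scoring facts first.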