Document Collections — Zep Documentation

Document Collections are deprecated and have been removed from Zep Community Edition. We will be removing this feature from Zep Cloud in a future release.

Zep’s document vector store lets you embed and search documents using vector similarity search, Maximum Marginal Relevance Re-Ranking, and metadata filtering.
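To make the re-ranking idea concrete, here is a minimal, self-contained sketch of MMR-style re-ranking over toy vectors. It is not Zep's implementation: the function names, the lambda_mult parameter, and the two-dimensional vectors are all assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k documents picked by marginal relevance.

    lambda_mult=1.0 ranks purely by similarity to the query; lower values
    penalise candidates that resemble documents already selected.
    """
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
query_vec = [1.0, 0.0]

pure_relevance = mmr_rerank(query_vec, docs, k=2, lambda_mult=1.0)
diversified = mmr_rerank(query_vec, docs, k=2, lambda_mult=0.3)
print(pure_relevance, diversified)
```

With lambda_mult=1.0 the two near-duplicates are returned together; with a lower value the second pick shifts to the less redundant document.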

You can manage collections, ingest documents, and search using Zep’s SDKs, LangChain, or LlamaIndex.

zep-python supports asynchronous operations.

All methods come in sync and async flavors, with async methods prefixed with "a".

For instance, zep-python offers both zep_client.memory.add_memory and zep_client.memory.aadd_memory.
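The Python snippets on this page use await, which only works inside a coroutine. A minimal sketch of how such calls could be wrapped in a script follows; fake_get_collection is a hypothetical stand-in for an awaitable SDK call, not part of Zep.

```python
import asyncio


async def fake_get_collection(name):
    """Hypothetical stand-in for an awaitable SDK call."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return {"name": name, "status": "ready"}


async def main():
    # Inside a coroutine, awaitable calls can be awaited directly.
    collection = await fake_get_collection("babbagedocs")
    return collection


# asyncio.run starts an event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result["name"], result["status"])
```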

Key Concepts

Collections

A Collection is a group of documents that use the same embedding strategy and model. Zep automatically creates embeddings for the documents you provide.

Documents

Documents are the texts you want to embed and search. You can add documents to collections and optionally assign them a unique ID and metadata. If you add metadata, it can help filter search results.

Initializing the Zep Client

For details on initializing the Zep client, check out the SDK documentation.

Creating a Collection

Python

# Assumes the Zep Cloud Python SDK; adjust the import to match your SDK version.
from zep_cloud.client import AsyncZep

client = AsyncZep(
    api_key=API_KEY,
)
collection_name = "babbagedocs"  # the name of your collection; alphanumeric characters only

collection = await client.document.add_collection(
    collection_name,  # required
    description="Babbage's Calculating Engine",  # optional
    metadata={"foo": "bar"},  # optional metadata to associate with this collection
)

Loading an Existing Collection

Python

collection = await client.document.get_collection(collection_name)

Adding Documents to a Collection

Python

chunks = read_chunks_from_file(file, max_chunk_size)  # your custom function to read chunks from a file

# CreateDocumentRequest is a request type provided by the Zep SDK
documents = [
    CreateDocumentRequest(
        content=chunk,
        document_id=f"{collection_name}-{i}",  # optional document ID
        metadata={"bar": i},  # optional metadata
    )
    for i, chunk in enumerate(chunks)
]

# AsyncZep methods are coroutines, so the call is awaited; the target collection
# is passed first, as in the other document calls on this page
uuids = await client.document.add_documents(collection_name, request=documents)

document_id is an optional identifier you can assign to each document. It’s handy for linking a document chunk with a specific ID you choose.

The metadata is an optional dictionary that holds metadata related to your document. Zep leverages this metadata for hybrid searches across a collection, enabling you to filter search results more effectively.

When you use document.add_documents, it returns a list of Zep UUIDs corresponding to the documents you’ve added to the collection.
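If you assign your own document_id values, it can be useful to keep a lookup from those IDs to the returned UUIDs for later retrieval or deletion. The sketch below assumes the UUIDs come back in the same order as the submitted documents; the source does not state this, so verify it before relying on it. The helper name and the placeholder values are hypothetical.

```python
def map_ids_to_uuids(document_ids, uuids):
    """Pair caller-chosen document IDs with the UUIDs returned for them.

    ASSUMPTION: uuids is in the same order as the submitted documents.
    """
    if len(document_ids) != len(uuids):
        raise ValueError("expected one UUID per submitted document")
    return dict(zip(document_ids, uuids))


# Placeholder values for illustration only; real UUIDs come from add_documents.
example_ids = ["babbagedocs-0", "babbagedocs-1"]
example_uuids = ["uuid-a", "uuid-b"]
id_to_uuid = map_ids_to_uuids(example_ids, example_uuids)
print(id_to_uuid)
```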

Chunking your documents

Choosing the right chunking strategy is crucial and highly dependent on your specific needs. A variety of third-party libraries, including LangChain, offer support for processing documents from numerous sources and dividing them into smaller segments suitable for embedding.

We recommend experimenting with various extractors, chunking strategies, sizes, and overlaps to discover the optimal approach for your project.
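As a starting point for such experiments, here is a minimal fixed-size character chunker with overlap. It is only one possible strategy, sketched under the assumption that plain character windows are acceptable for your text; chunk_text and its parameters are illustrative names, not part of any Zep SDK.

```python
def chunk_text(text, max_chunk_size, overlap=0):
    """Split text into chunks of at most max_chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a chunk
    boundary appears in both neighbouring chunks.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


example_chunks = chunk_text("abcdefghij", max_chunk_size=4, overlap=1)
print(example_chunks)
```

Varying max_chunk_size and overlap here is a cheap way to compare how chunk boundaries affect search quality before moving to a sentence-aware or token-aware splitter.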

Monitoring Embedding Progress

The process of embedding documents in Zep is asynchronous. To keep track of your collection’s embedding progress, you can periodically check the collection’s status:

Python

import asyncio

while True:
    c = await client.document.get_collection(collection_name)
    print(
        "Embedding status: "
        f"{c.document_embedded_count}/{c.document_count} documents embedded"
    )
    if c.status == "ready":
        break
    # asyncio.sleep pauses without blocking the event loop (time.sleep would block it)
    await asyncio.sleep(1)

Once the collection’s status changes to ready, it means all documents have been successfully embedded and are now searchable.
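A polling loop like this can run forever if embedding stalls, so in practice you may want a timeout. The following sketch shows one way to add it. The helper takes any zero-argument coroutine function that returns a status string, so it can be exercised here with a hypothetical fake; the "pending" value is an assumption, since the source only names the "ready" status.

```python
import asyncio
import time


async def wait_until_ready(get_status, timeout=60.0, interval=1.0):
    """Poll get_status() until it returns "ready" or the timeout elapses.

    Returns True if "ready" was seen, False if the timeout was reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if await get_status() == "ready":
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(interval)  # non-blocking wait between polls


# Hypothetical stand-in that reports "ready" on the third poll.
_statuses = iter(["pending", "pending", "ready"])


async def fake_status():
    return next(_statuses)


became_ready = asyncio.run(
    wait_until_ready(fake_status, timeout=5.0, interval=0.01)
)
print(became_ready)
```

With a real client, get_status could be a small coroutine that calls client.document.get_collection(collection_name) and returns the status field of the result.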

Searching a Collection with Hybrid Vector Search

Zep enables hybrid vector search across your collections, allowing you to pinpoint the most relevant documents based on semantic similarity. Additionally, you have the option to refine your search by filtering through document metadata.

You can initiate a search using either a text query or an embedding vector, depending on your needs.

Zep’s Collection and Memory search support semantic search queries, JSONPath-based metadata filters, and a combination of both. Memory search also supports querying by message creation date.

Python

# search for documents using only a query string
query = "the moon"
results = await client.document.search(collection_name, text=query, limit=5)

# hybrid search for documents using a query string and metadata filter
metadata_query = {
    "where": {"jsonpath": '$[*] ? (@.baz == "qux")'},
}
results = await client.document.search(
    collection_name, text=query, metadata=metadata_query, limit=5
)

metadata is an optional dictionary of JSONPath filters used to match on metadata associated with your documents.

limit is an optional integer indicating the maximum number of results to return.
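To illustrate what a metadata filter such as @.baz == "qux" selects, here is a rough local equivalent over in-memory records. Zep evaluates the filter on the server; this sketch only mimics the equality test, and the record layout and helper name are assumptions.

```python
def filter_by_metadata(documents, key, value):
    """Keep documents whose metadata has `key` set to exactly `value`."""
    matches = []
    for doc in documents:
        metadata = doc.get("metadata", {})
        if key in metadata and metadata[key] == value:
            matches.append(doc)
    return matches


# Hypothetical in-memory records standing in for stored documents.
records = [
    {"document_id": "doc-0", "metadata": {"baz": "qux"}},
    {"document_id": "doc-1", "metadata": {"baz": "other"}},
    {"document_id": "doc-2", "metadata": {}},
]
matched_ids = [d["document_id"] for d in filter_by_metadata(records, "baz", "qux")]
print(matched_ids)
```

In a hybrid search, a filter of this kind narrows the candidate set, and the semantic query ranks whatever remains.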

Retrieving Documents by UUID

Zep supports retrieving a list of documents by Zep UUID:

Python

docs_to_get = uuids[40:50]
documents = await client.document.batch_get_documents(
    collection_name, uuids=docs_to_get
)
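The example above fetches a single slice of ten UUIDs. If you need to walk a longer list in fixed-size batches, a small helper like the one below is one option. The helper name and the batch size are assumptions; the source does not document a maximum batch size.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder UUID strings for illustration only.
placeholder_uuids = [f"uuid-{i}" for i in range(25)]
batch_sizes = [len(batch) for batch in batched(placeholder_uuids, 10)]
print(batch_sizes)
```

Each yielded batch could then be passed as the uuids argument of a batch retrieval call.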

Other Common Operations

This section covers additional common operations: updating a collection’s description or metadata, batch-updating documents, deleting documents, and deleting a collection.

Updating a Collection’s Description or Metadata

Python

# AsyncZep methods are coroutines, so the call is awaited
await client.document.update_collection(
    collection_name,
    description="Charles Babbage's Calculating Engine 2",
    metadata={"newfoo": "newbar"},
)

Batch Update Documents’ ID or Metadata

Python

await client.document.batch_update_documents(
    collection_name,
    request=[
        UpdateDocumentListRequest(
            uuid_="uuid",  # Zep UUID of the document to update
            document_id="new_id",
            metadata={"foo": "bar"},
        )
    ],
)

Deleting Documents

Zep supports deleting documents from a collection by UUID:

Python

await client.document.delete_document(collection_name, document_uuid)

Deleting a Collection

Deleting a collection will delete all documents in the collection, as well as the collection itself.

Python

await client.document.delete_collection(collection_name)
