Structured Data Extraction

Structured Data Extraction is a low-latency, high-fidelity tool for extracting the data you need from Chat History stored in Zep.

Structured Data Extraction for Python requires pydantic v2 and is not compatible with pydantic v1.

Many business and consumer apps need to extract structured data from a conversation between an assistant and a human user. Often, the extracted data is the objective of the conversation. Consider completing a sales order, making a reservation, or submitting a leave request. All of these tasks require progressively collecting data from the conversation.

Often, you will want to identify the data values you have collected and which values you still need to collect in order to prompt the LLM to request the latter.

This can be a slow and inaccurate exercise, and frustrating to your users. If you’re making multiple calls to an LLM to extract and validate data on every chat turn, you’re likely adding seconds to your response time.

Zep’s structured data extraction (SDE) is a low-latency, high-fidelity tool for extracting the data you need from Chat History stored in Zep. For many multi-field extraction tasks you can expect latency of under 400ms, with the addition of fields increasing latency sub-linearly.

Quick Start

An end-to-end SDE example (in Python) can be found in the Zep By Example repo.

The example covers:

  • defining a model using many of the field types that SDE supports
  • extracting data from a Chat History
  • merging newly extracted data with an already partially populated model.

SDE vs JSON Mode

Many model providers offer a JSON inference mode which guarantees that the output will be well-formed JSON. There are, however, no guarantees that the field values will conform to the JSON Schema you define, nor that the field values are correct (rather than hallucinated). Additionally, all fields are extracted in a single inference call, so each additional field increases extraction latency linearly or worse.

SDE’s Preprocessing, Guided LLM Output, and Validation

Zep uses a combination of dialog preprocessing, guided LLM output, and post-inference validation to ensure that the extracted data is in the format you expect and is valid given the current dialog. When using a structured Field Type (ZepText excluded), you will not receive back data in an incorrect format.

While there are limits to the accuracy of extraction when the conversation is very nuanced or ambiguous, with careful crafting of field descriptions, you can achieve high accuracy in most cases.

Concurrent Extraction Scales Sub-Linearly

SDE’s extraction latency scales sub-linearly with the number of fields in your model. That is, you may add fields with a low marginal increase in latency. You can expect extraction times of 400ms or lower when extracting fairly complex models from a 500-character dialog (a count that includes both message content and your Role and RoleType designations).

Defining Your Model

To extract data with Zep, you will need to define a model of the data you require from a Chat History. Each model is composed of a set of fields, each of which has a type and description. Key to successful extraction of data is careful construction of the field description.

from typing import Optional

from pydantic import Field
from zep_cloud.extractor import (
    ZepEmail, ZepFloat, ZepModel, ZepPhoneNumber, ZepRegex, ZepText, ZepZipCode,
)


class SalesLead(ZepModel):
    # Every field is Optional with default=None so missing data does not raise.
    company_name: Optional[ZepText] = Field(
        description="The company name", default=None
    )
    lead_name: Optional[ZepText] = Field(
        description="The lead's name", default=None
    )
    lead_email: Optional[ZepEmail] = Field(
        description="The lead's email", default=None
    )
    lead_phone: Optional[ZepPhoneNumber] = Field(
        description="The lead's phone number", default=None
    )
    budget: Optional[ZepFloat] = Field(
        description="The lead's budget for the product", default=None
    )
    # ZepRegex fields take the pattern to match via Field(pattern=...).
    product_name: Optional[ZepRegex] = Field(
        description="The name of the product the lead is interested in",
        pattern=r"(TimeMachine|MagicTransporter)", default=None
    )
    zip_code: Optional[ZepZipCode] = Field(
        description="The company zip code", default=None
    )

When using Python, your model will subclass ZepModel. Zep builds on pydantic and requires correctly typing fields and using the Field class from pydantic to define the field description, default value, and pattern when using a ZepRegex field.

Executing an Extraction

To execute an extraction, you will need to call the extract method on the memory client. This method requires a session_id and a model schema that specifies the types and structures of data to be extracted based on field descriptions.

The lastN parameter, or Python equivalent last_n, specifies the number of prior messages in the Session’s Chat History to look back at for data extraction.

The validate parameter specifies whether to run an additional validation step on the extracted data.

The currentDateTime parameter, or Python equivalent current_date_time, specifies your user’s current date and time. This is used when extracting dates and times from relative phrases like “yesterday” or “last week” and to correctly set the timezone of the extracted data.

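from datetime import datetime
from zoneinfo import ZoneInfo

# `client` is an initialized Zep client; `session_id` identifies an existing Session.
extracted_data: SalesLead = client.memory.extract(
    session_id,
    SalesLead,
    last_n=8,  # look back at the 8 most recent messages
    validate=False,  # skip the additional validation step
    current_date_time=datetime.now(ZoneInfo("America/New_York")),
)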

Using Progressive Data Extraction To Guide LLMs

Your application may need to collect a number of fields in order to accomplish a task. You can guide the LLM through this process by calling extract on every chat turn, identifying which fields are still needed, providing a partially populated model to the LLM, and directing the LLM to collect the remaining data.

Example Prompt
You have already collected the following data:
- Company name: Acme Inc.
- Lead name: John Doe
- Lead email: [email protected]

You still need to collect the following data:
- Lead phone number
- Lead budget
- Product name
- Zip code

Do not ask for all fields at once. Rather, work the fields
into your conversation with the user and gradually collect the data.
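
A prompt like the one above can be generated from whatever values have been extracted so far. The following is a minimal sketch using only the standard library: `FIELD_LABELS` and `build_collection_prompt` are hypothetical names, not part of Zep, and the input is assumed to be a plain dict of field names to values (for a pydantic v2 model, something like `extracted_data.model_dump()` could plausibly supply it).

```python
from typing import Mapping, Optional

# Hypothetical display labels for the SalesLead fields defined earlier.
FIELD_LABELS = {
    "company_name": "Company name",
    "lead_name": "Lead name",
    "lead_email": "Lead email",
    "lead_phone": "Lead phone number",
    "budget": "Lead budget",
    "product_name": "Product name",
    "zip_code": "Zip code",
}


def build_collection_prompt(values: Mapping[str, Optional[object]]) -> str:
    """Render a prompt listing collected fields and fields still missing."""
    collected = [
        f"- {label}: {values[name]}"
        for name, label in FIELD_LABELS.items()
        if values.get(name) is not None
    ]
    missing = [
        f"- {label}"
        for name, label in FIELD_LABELS.items()
        if values.get(name) is None
    ]
    if not missing:
        return "All required data has been collected."
    lines = ["You have already collected the following data:"]
    lines += collected or ["- (nothing yet)"]
    lines += ["", "You still need to collect the following data:"]
    lines += missing
    lines += [
        "",
        "Do not ask for all fields at once. Rather, work the fields",
        "into your conversation with the user and gradually collect the data.",
    ]
    return "\n".join(lines)


prompt = build_collection_prompt(
    {"company_name": "Acme Inc.", "lead_name": "John Doe"}
)
print(prompt)
```

Rebuilding the prompt on every chat turn keeps the LLM's instructions in step with the latest extraction result.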

As each field is populated, you may copy these values into an immutable data structure. Alternatively, if existing values change as the conversation progresses, you can apply a heuristic informed by your business rules to update the data structure with the new values.
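
As one possible shape for this, the sketch below keeps collected values in a frozen dataclass and produces a new record on each merge. `LeadRecord`, `merge`, and the `overwrite` rule are hypothetical illustrations, not Zep APIs; the business rule assumed here is that already-collected values are kept unless a field is explicitly allowed to change.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class LeadRecord:
    """Immutable snapshot of the data collected so far (hypothetical)."""
    company_name: Optional[str] = None
    lead_email: Optional[str] = None
    budget: Optional[float] = None


def merge(
    current: LeadRecord,
    extracted: dict,
    overwrite: frozenset = frozenset(),
) -> LeadRecord:
    """Return a new record: fill empty fields, overwrite only allowed ones."""
    updates = {}
    for f in fields(current):
        new_value = extracted.get(f.name)
        if new_value is None:
            continue  # nothing extracted this turn; keep the existing value
        old_value = getattr(current, f.name)
        if old_value is None or f.name in overwrite:
            updates[f.name] = new_value
    return replace(current, **updates)


record = LeadRecord(company_name="Acme Inc.", budget=1000.0)
record = merge(
    record,
    {"company_name": "Acme Corp", "lead_email": "jane@example.com", "budget": 2500.0},
    overwrite=frozenset({"budget"}),  # assumed rule: budgets may be revised
)
print(record)
```

Here the company name is kept, the email is filled in, and the budget is updated because it is the only field allowed to change.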

Supported Field Types

Zep supports a wide variety of field types natively. Where Zep does not support a native field type, you can use a ZepRegex field to extract a string that matches a structure you define.

| Type | Description | Python Type | TypeScript Type |
|---|---|---|---|
| Text | Plain text values without a set format. | ZepText | zepFields.text |
| Number | Integer values. | ZepNumber | zepFields.number |
| Float | Floating-point numbers. | ZepFloat | zepFields.float |
| Regex | Strings matching a regex pattern. | ZepRegex | zepFields.regex |
| DateTime | Date and time values returned as an ISO 8601 string using your local timezone. | ZepDateTime | zepFields.dateTime |
| Date | Date values returned as an ISO 8601 string using your local timezone. | ZepDate | zepFields.date |
| Email | Email addresses. | ZepEmail | zepFields.email |
| Phone | Phone numbers in North American Numbering Plan format. | ZepPhoneNumber | zepFields.phoneNumber |
| Zip Code | Postal codes in North American ZIP or ZIP+4 format, if available. | ZepZipCode | zepFields.zipCode |

Improving Accuracy

Extraction accuracy may be improved by experimenting with different descriptions and using Zep’s built-in field validation.

Improving Descriptions

When describing fields, ensure that you’ve been both specific and clear as to what value you’d like to extract. You may also provide few-shot examples in your description.

| Bad ❌ | Good ✅ |
|---|---|
| name | the name of the customer |
| phone | the customer’s phone number |
| address | street address |
| address | postal address |
| product name | product name: “WidgetA” or “WidgetB” |

Validating Extracted Data

When validation is enabled on your extract call, Zep will run an additional LLM validation step on the extracted data. This provides improved accuracy and reduces the risk of hallucinated values. The downside to enabling field validation is increased extraction latency and an increased risk of false negatives (empty fields where the data may be present in the dialog).

We recommend running without field validation first to gauge accuracy and latency and only enable field validation if you’ve determined that it is needed given your use case.
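
One way to gauge this is to time each extraction and track how many fields come back empty, once with validation off and once with it on. The sketch below is a hypothetical helper, not part of Zep: `timed`, `fill_rate`, and `fake_extract` are illustrative names, and `fake_extract` is a stand-in for a real extract call.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(call: Callable[[], T]) -> Tuple[T, float]:
    """Run call() and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = call()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fill_rate(values: dict) -> float:
    """Fraction of fields that came back non-empty."""
    if not values:
        return 0.0
    return sum(v is not None for v in values.values()) / len(values)


def fake_extract() -> dict:
    """Stand-in for a real extraction; swap in a lambda around your own call."""
    return {"lead_name": "John Doe", "lead_phone": None}


result, elapsed_ms = timed(fake_extract)
print(f"{elapsed_ms:.1f} ms, fill rate {fill_rate(result):.0%}")
```

A drop in fill rate after enabling validation on the same dialogs would suggest false negatives, while the elapsed time shows the latency cost.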

Working with Dates

Zep understands a wide variety of date and time formats, including relative times such as “yesterday” or “last week”. It is also able to parse partial dates and times, such as “at 3pm” or “on the 15th”. All dates and times are returned in ISO 8601 format and use the timezone of the currentDateTime parameter passed to the extract call.

If you are extracting datetime and date fields, it is important that you provide a currentDateTime value in your extract call and ensure that it is in the correct timezone for your user (or the base timezone your application uses internally).
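
To see why the timezone matters, the following sketch shows the same instant producing different local dates, and therefore different meanings for "yesterday". It uses fixed UTC offsets to stay dependency-free; in an application you would more likely use `zoneinfo.ZoneInfo`. The helper `local_yesterday` is a hypothetical illustration and says nothing about how Zep resolves relative dates internally.

```python
from datetime import datetime, timedelta, timezone

# One instant in time, expressed in UTC.
instant = datetime(2024, 6, 15, 2, 30, tzinfo=timezone.utc)

# Fixed offsets chosen for illustration (no daylight-saving handling).
new_york = timezone(timedelta(hours=-4))  # EDT in June
tokyo = timezone(timedelta(hours=9))      # JST


def local_yesterday(now: datetime) -> str:
    """ISO 8601 date for 'yesterday' relative to a timezone-aware datetime."""
    return (now - timedelta(days=1)).date().isoformat()


ny_now = instant.astimezone(new_york)
tokyo_now = instant.astimezone(tokyo)

print(ny_now.isoformat(), "yesterday:", local_yesterday(ny_now))
print(tokyo_now.isoformat(), "yesterday:", local_yesterday(tokyo_now))
```

The two users share the same moment, but their "yesterday" differs by a calendar day, which is why a timezone-aware value is worth passing.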

Extracting from Speech Transcripts

Zep is able to understand and extract data from machine-transcribed speech. Spelled-out numbers and dates are parsed as if they had been written in numeric form. Filler utterances such as “uh” or “um” are ignored.

| Description | From | To |
|---|---|---|
| Apartment size in square feet | It is a three bedroom with approximately one thousand two hundred and fifty two square feet | 1252 |
| Meeting date and time | I’m available on the uh fifteenth at uh three pm | 2024-06-15T15:00:00 |
| The user’s phone number | It’s uh two five five two three four five six seven uh eight | (255) 234-5678 |

We are constantly improving transcript extraction. Let us know if you have a use case where this does not work well!

Multilingual Data Support

Zep’s Structured Data Extraction supports most major languages.

Tips, Tricks, and Best Practices

Limit the number of Messages from which you extract data

If your use case is latency sensitive, limit the number of messages from which you extract data. The larger the lastN (last_n in Python) value, the longer the extraction will take.

Always make fields optional in Python models

Always make fields optional in your Python model. This will prevent runtime errors when the data is not present in the conversation.

Using Regex when Zep doesn’t support your data type

The ZepRegex field type is a Swiss Army knife for extracting data. It allows you to extract any string that matches a regex pattern you define.

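from typing import Optional
from pydantic import Field
from zep_cloud.extractor import ZepModel, ZepRegex

class OrderInfo(ZepModel):
    order_id: Optional[ZepRegex] = Field(
        description="The order ID in format ABC-12345",
        pattern=r"[A-Z]{3}-\d{5}",
        default=None,  # keeps the field optional when no order ID is found
    )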
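
Before using a pattern in a model, it can help to check it against sample values locally. This sketch uses Python's standard `re` module with `re.fullmatch`, which tests the whole string without needing `^` or `$` in the pattern. It assumes Zep's pattern matching behaves like Python's for simple patterns such as this one, which the source does not state; `matches_whole_value` is a hypothetical helper.

```python
import re

ORDER_ID_PATTERN = r"[A-Z]{3}-\d{5}"


def matches_whole_value(pattern: str, value: str) -> bool:
    """True if the entire value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


samples = ["ABC-12345", "abc-12345", "ABC-1234", "ABC-123456"]
results = {s: matches_whole_value(ORDER_ID_PATTERN, s) for s in samples}

for sample, ok in results.items():
    print(f"{sample!r}: {'match' if ok else 'no match'}")
```

Only the first sample matches: the others fail on letter case or digit count.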

Implementing Enum Fields

The ZepRegex field type can be used to extract data from a list of enums provided in a capture group.

order_currency: Optional[ZepRegex] = Field(
    description="The order currency: USD, GBP, or UNKNOWN",
    default=None,
    pattern=r"(UNKNOWN|USD|GBP)"
)

Results in:

"USD"

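Once the string comes back, you may want a typed value in your application. The sketch below maps the extracted string onto a standard-library `Enum` and also derives the regex alternation from the same `Enum`, so the two cannot drift apart. `Currency` and `to_currency` are hypothetical names, and the input is assumed to be the plain extracted string (or `None`).

```python
from enum import Enum
from typing import Optional


class Currency(Enum):
    UNKNOWN = "UNKNOWN"
    USD = "USD"
    GBP = "GBP"


def to_currency(extracted: Optional[str]) -> Currency:
    """Map an extracted string to a Currency, defaulting to UNKNOWN."""
    if extracted is None:
        return Currency.UNKNOWN
    try:
        return Currency(extracted.strip().upper())
    except ValueError:
        return Currency.UNKNOWN


# Build the capture-group alternation from the Enum members, in definition order.
pattern = "(" + "|".join(member.value for member in Currency) + ")"

print(pattern, to_currency("USD"))
```

The generated `pattern` string can then be passed as the `pattern` argument of the field definition.
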
Comma Separated Lists

You can extract comma-separated lists using the ZepRegex field type:

brand_preferences: Optional[ZepRegex] = Field(
    description="The customer's preferred brands as a comma-separated list",
    default=None,
    pattern=r"\w+(, \w+)+"
)

Results in:

"Nike, Adidas, Puma"

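The extracted value is a single string, so your application will usually split it. The sketch below does that with the standard library and also checks the pattern locally with Python's `re`: as written, the pattern requires at least two items, so a lone brand would not match. Whether Zep treats that case the same way is not stated in the source; `split_list` and the `*` variant are illustrative assumptions.

```python
import re
from typing import List, Optional

LIST_PATTERN = r"\w+(, \w+)+"          # pattern from the field above
LIST_OR_SINGLE_PATTERN = r"\w+(, \w+)*"  # variant that also accepts one item


def split_list(extracted: Optional[str]) -> List[str]:
    """Split a comma-separated string into trimmed, non-empty items."""
    if not extracted:
        return []
    return [item.strip() for item in extracted.split(",") if item.strip()]


brands = split_list("Nike, Adidas, Puma")
single_matches_original = re.fullmatch(LIST_PATTERN, "Nike") is not None
single_matches_variant = re.fullmatch(LIST_OR_SINGLE_PATTERN, "Nike") is not None

print(brands, single_matches_original, single_matches_variant)
```

If users may name only one brand, the `*` variant is worth considering.
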
Unsupported Regex Patterns

The following Regex tokens and features are unsupported when using the Regex field type:

  • Start of and end of string anchors (^ and $) and absolute positioning (\A and \Z).
  • Named groups ((?P<name>...)).
  • Backreferences (\g<name>).
  • Lookaheads and lookbehinds ((?=...), (?!...), (?<=...), (?<!...)).
  • Conditional expressions ((?(condition)yes|no)).
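
A small pre-flight check can catch these features before a pattern reaches Zep. The following is a heuristic sketch, not a full regex parser and not a Zep API: `find_unsupported` and `GROUP_PREFIXES` are hypothetical names, the check covers only the tokens listed above, and it treats `^` inside a character class such as `[^a-z]` as negation rather than an anchor.

```python
from typing import List

GROUP_PREFIXES = {
    "(?P<": "named group",
    "(?=": "lookahead",
    "(?!": "negative lookahead",
    "(?<=": "lookbehind",
    "(?<!": "negative lookbehind",
    "(?(": "conditional expression",
}


def find_unsupported(pattern: str) -> List[str]:
    """Return labels for unsupported features found in the pattern."""
    found: List[str] = []

    def note(label: str) -> None:
        if label not in found:
            found.append(label)

    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            if pattern.startswith("\\g<", i):
                note("backreference")
            elif not in_class and pattern[i:i + 2] in ("\\A", "\\Z"):
                note("absolute position anchor")
            i += 2  # skip the escaped character so \$ or \( is not misread
            continue
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "^":
            note("start-of-string anchor")
        elif ch == "$":
            note("end-of-string anchor")
        elif ch == "(":
            for prefix, label in GROUP_PREFIXES.items():
                if pattern.startswith(prefix, i):
                    note(label)
        i += 1
    return found


print(find_unsupported(r"^\d+$"))
print(find_unsupported(r"(UNKNOWN|USD|GBP)"))
```

An empty list means none of the listed features were detected, which is not a guarantee that the pattern is otherwise supported.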