Rate limits

How the Zep API limits request rates and how to handle them

The Zep API enforces rate limits on incoming requests to ensure consistent performance and reliability across all accounts. Rate limits are measured in requests per minute (RPM) and applied per account.

The exact RPM limit for your account depends on your plan. See the Zep pricing page for details.

Rate limit headers

Every response from the Zep API includes headers that describe your current rate limit state. Inspect these headers to monitor usage and pace your client before you hit the limit.

| Header | Description |
| --- | --- |
| `X-RateLimit-Limit` | The per-minute request limit for your account. |
| `X-RateLimit-Remaining` | The number of requests remaining in the current window. |
| `X-RateLimit-Reset` | Unix timestamp (in seconds) at which the current window resets. |
| `X-RateLimit-Increment` | The cost of the current request, in units of the limit. Always `1`. |
| `Retry-After` | The number of seconds to wait before retrying. Only set on `429` responses. |
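
As a quick illustration, `X-RateLimit-Reset` can be turned into a wait duration by subtracting the current Unix time. The header values shown here are made-up examples; real values arrive as strings and header names are case-insensitive:

```python
import time

# Example header values as they might appear on a response (illustrative only).
headers = {
    "x-ratelimit-limit": "600",
    "x-ratelimit-remaining": "42",
    "x-ratelimit-reset": str(int(time.time()) + 30),
}

# Seconds until a fresh allowance is available; clamp at zero in case
# the reset timestamp is already in the past.
seconds_until_reset = max(0, int(headers["x-ratelimit-reset"]) - int(time.time()))
```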

Reading rate limit headers from the SDK

The Zep SDKs do not return response headers from a normal method call. To read headers, use the SDK’s raw response accessor, which returns both the parsed response data and the raw HTTP response.

```python
response = client.thread.with_raw_response.add_messages(
    thread_id="thread_123",
    messages=messages,
)

remaining = response.headers.get("x-ratelimit-remaining")
reset = response.headers.get("x-ratelimit-reset")
data = response.data
```

Handling 429 responses

When you exceed your rate limit, the Zep API returns HTTP 429 Too Many Requests. The SDK surfaces this as a typed error whose response headers include Retry-After, indicating how many seconds to wait before retrying.

Catch the error, read Retry-After, wait, and retry.

```python
import time

from zep_cloud import ApiError

try:
    client.thread.add_messages(thread_id="thread_123", messages=messages)
except ApiError as err:
    if err.status_code == 429:
        retry_after = int(err.headers.get("retry-after", "1"))
        time.sleep(retry_after)
        # retry your call
```

For best results, combine Retry-After with exponential backoff and jitter to avoid synchronized retries when many clients are throttled at the same time.
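
One way to combine the two, sketched below, is to treat `Retry-After` as a floor and layer exponential backoff with full jitter on top. The helper name and parameters here are illustrative, not part of the Zep SDK:

```python
import random


def compute_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (0-based).

    Exponential backoff with full jitter, never less than the
    server-provided Retry-After value (when one was returned).
    """
    backoff = min(cap, base * (2 ** attempt))
    delay = random.uniform(0, backoff)  # full jitter: spread retries out
    if retry_after is not None:
        delay = max(delay, retry_after)  # honor the server's floor
    return delay
```

Full jitter randomizes each client's wait across the whole backoff window, which prevents a burst of throttled clients from all retrying at the same instant.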

Pacing requests proactively

To avoid hitting 429 responses in the first place, use X-RateLimit-Remaining and X-RateLimit-Reset to pace your requests:

  • If X-RateLimit-Remaining is approaching zero, slow your request rate or pause until the window resets.
  • The current window ends at the Unix timestamp in X-RateLimit-Reset. After this time, a fresh allowance is available.

This is particularly useful for bulk operations, such as batch ingestion, where you control the cadence of outgoing requests.
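
The pacing rule above can be sketched as a small helper that sleeps until the window resets whenever the remaining allowance drops below a threshold. The function name and threshold are assumptions for illustration; `headers` is any mapping of lowercased header names to string values, such as the `headers` attribute of a raw SDK response:

```python
import time


def pace(headers, min_remaining=5):
    """Pause until the rate-limit window resets when allowance is low."""
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining <= min_remaining:
        reset = int(headers.get("x-ratelimit-reset", "0"))
        wait = max(0, reset - int(time.time()))  # seconds until the window resets
        time.sleep(wait)
```

Calling `pace(response.headers)` between requests in a bulk-ingestion loop keeps the client just under the limit instead of bouncing off `429` responses.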