Rate limits

How the Zep API limits request rates and how to handle them

The Zep API enforces rate limits on incoming requests to ensure consistent performance and reliability across all accounts. Rate limits are measured in requests per minute (RPM) and applied per account.

The exact RPM limit for your account depends on your plan. See the Zep pricing page for details.

Rate limit headers

Every response from the Zep API includes headers that describe your current rate limit state. Inspect these headers to monitor usage and pace your client before you hit the limit.

| Header | Description |
| --- | --- |
| `X-RateLimit-Limit` | The per-minute request limit for your account. |
| `X-RateLimit-Remaining` | The number of requests remaining in the current window. |
| `X-RateLimit-Reset` | Unix timestamp (in seconds) at which the current window resets. |
| `X-RateLimit-Increment` | The cost of the current request, in units of the limit. Always `1`. |
| `Retry-After` | The number of seconds to wait before retrying. Only set on `429` responses. |
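
As a quick illustration, `X-RateLimit-Reset` can be turned into a wait duration by subtracting the current Unix time. The header values shown here are made-up examples; real values arrive as strings and header names are case-insensitive:

```python
import time

# Example header values as they might appear on a response (illustrative only).
headers = {
    "x-ratelimit-limit": "600",
    "x-ratelimit-remaining": "42",
    "x-ratelimit-reset": str(int(time.time()) + 30),
}

# Seconds until a fresh allowance is available; clamp at zero in case
# the reset timestamp is already in the past.
seconds_until_reset = max(0, int(headers["x-ratelimit-reset"]) - int(time.time()))
```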

Reading rate limit headers from the SDK

The Zep SDKs do not return response headers from a normal method call. To read headers, use the SDK’s raw response accessor, which returns both the parsed response data and the raw HTTP response.

```python
response = client.thread.with_raw_response.add_messages(
    thread_id="thread_123",
    messages=messages,
)

remaining = response.headers.get("x-ratelimit-remaining")
reset = response.headers.get("x-ratelimit-reset")
data = response.data
```

Handling 429 responses

When you exceed your rate limit, the Zep API returns HTTP 429 Too Many Requests. The SDK surfaces this as a typed error whose response headers include Retry-After, indicating how many seconds to wait before retrying.

Catch the error, read Retry-After, wait, and retry.

```python
import time

from zep_cloud import ApiError

try:
    client.thread.add_messages(thread_id="thread_123", messages=messages)
except ApiError as err:
    if err.status_code == 429:
        retry_after = int(err.headers.get("retry-after", "1"))
        time.sleep(retry_after)
        # retry your call
```

For best results, combine Retry-After with exponential backoff and jitter to avoid synchronized retries when many clients are throttled at the same time.
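
One way to combine the two, sketched below, is to treat `Retry-After` as a floor and layer exponential backoff with full jitter on top. The helper name and parameters here are illustrative, not part of the Zep SDK:

```python
import random


def compute_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Delay in seconds before retry number `attempt` (0-based).

    Exponential backoff with full jitter, never less than the
    server-provided Retry-After value (when one was returned).
    """
    backoff = min(cap, base * (2 ** attempt))
    delay = random.uniform(0, backoff)  # full jitter: spread retries out
    if retry_after is not None:
        delay = max(delay, retry_after)  # honor the server's floor
    return delay
```

Full jitter randomizes each client's wait across the whole backoff window, which prevents a burst of throttled clients from all retrying at the same instant.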

Pacing requests proactively

To avoid hitting 429 responses in the first place, use X-RateLimit-Remaining and X-RateLimit-Reset to pace your requests:

  • If X-RateLimit-Remaining is approaching zero, slow your request rate or pause until the window resets.
  • The current window ends at the Unix timestamp in X-RateLimit-Reset. After this time, a fresh allowance is available.

This is particularly useful for bulk operations, such as batch ingestion, where you control the cadence of outgoing requests.
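
The pacing rule above can be sketched as a small helper that sleeps until the window resets whenever the remaining allowance drops below a threshold. The function name and threshold are assumptions for illustration; `headers` is any mapping of lowercased header names to string values, such as the `headers` attribute of a raw SDK response:

```python
import time


def pace(headers, min_remaining=5):
    """Pause until the rate-limit window resets when allowance is low."""
    remaining = int(headers.get("x-ratelimit-remaining", "1"))
    if remaining <= min_remaining:
        reset = int(headers.get("x-ratelimit-reset", "0"))
        wait = max(0, reset - int(time.time()))  # seconds until the window resets
        time.sleep(wait)
```

Calling `pace(response.headers)` between requests in a bulk-ingestion loop keeps the client just under the limit instead of bouncing off `429` responses.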