Embeddings

Embedding models are neural networks that encode information into representative vectors that can be used for tasks like semantic retrieval, clustering, and recommender systems.

zembed-1

zembed-1 is ZeroEntropy’s flagship, state-of-the-art, open-weight, multilingual embedding model. You can read more about its performance in this blog post.
zembed-1 is the default embedding model used in zsearch, ZeroEntropy’s search engine.
There are multiple ways to use zembed-1:
  • Calling the models/embed API endpoint, which is available via the Python and Node SDKs.
  • Downloading the weights from HuggingFace and self-hosting the model.
  • Deploying on the AWS Marketplace through SageMaker.
from zeroentropy import ZeroEntropy

# Initialize the ZeroEntropy client (reads ZEROENTROPY_API_KEY from env)
zclient = ZeroEntropy()

response = zclient.models.embed(
    model="zembed-1",
    input_type="query",  # "query" or "document"
    input="What is retrieval augmented generation?",  # string or list[str]
    dimensions=2560,  # 2560 (default), 1280, 640, 320, 160, 80, or 40
    encoding_format="float",  # "float" or "base64"
    latency="fast",  # "fast" or "slow"; omit for auto
)
There are three parameters you can configure when using zembed-1:
  • Latency mode: Control the trade-off between latency and throughput based on your use case.
  • Embedding type: Specify whether you are embedding a query or a passage to take advantage of asymmetrical retrieval.
  • Embedding size: Choose an output dimension from the available options: 2560 (default), 1280, 640, 320, 160, 80, or 40.
Higher-dimension embeddings yield greater accuracy at the cost of increased storage.
To read more about these trade-offs, refer to our blog. For guidance on running inference with the model, check out the examples, as well as our cookbook here.
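As a sketch of how the reduced embedding sizes behave, the snippet below uses plain Python with made-up toy vectors (no SDK calls). It assumes the smaller dimensions are prefixes of the full embedding, as with Matryoshka-trained models, so a stored full-size vector can be truncated and renormalized client-side before computing cosine similarity:

```python
import math

def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    # Keep the first `dim` components, then rescale to unit length so
    # cosine similarity stays meaningful after truncation.
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a: list[float], b: list[float]) -> float:
    # Both inputs are unit vectors, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for real zembed-1 outputs.
query_vec = truncate_and_normalize([0.3, 0.1, -0.2, 0.4, 0.05, -0.1], 4)
doc_vec = truncate_and_normalize([0.28, 0.12, -0.18, 0.35, 0.4, 0.2], 4)
print(round(cosine(query_vec, doc_vec), 4))
```

In practice you would pass the desired size via the `dimensions` parameter and let the API return the truncated vector directly.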

Rerankers

Rerankers are cross-encoder neural networks that can boost the accuracy of any search system. You can read more about what rerankers are and when they are most useful in this blog post.

zerank-2 and zerank-1

zerank-2 is our flagship, state-of-the-art reranker; you can read more about its performance in this blog post. zerank-1 and zerank-1-small are our first generation of SOTA rerankers. There are multiple ways to use our rerankers:
  • Calling the models/rerank API endpoint, which is available via the Python and Node SDKs.
  • Passing the reranker query parameter to top-snippets.
  • Downloading the weights from our HuggingFace and self-hosting the models.
  • Deploying on the AWS Marketplace through SageMaker.
  • For zerank-1-small, using Baseten.
We’ve open-sourced zerank-1-small under an Apache 2.0 license; it is available through HuggingFace and Baseten. Our flagship model zerank-2 can be downloaded from HuggingFace under a non-commercial license. To use it in a commercial setting, contact us at [email protected] and we’ll get you a license ASAP!

Using the ZeroEntropy SDK

# Create an API Key at https://dashboard.zeroentropy.dev
# pip install zeroentropy
from zeroentropy import ZeroEntropy

# Initialize the ZeroEntropy client (reads ZEROENTROPY_API_KEY from env)
zclient = ZeroEntropy()

response = zclient.models.rerank(
    model="zerank-2",
    query="What is 2+2?",
    documents=[
        "4",
        "The answer is definitely 1 million.",
    ],
)
print(response.model_dump_json(indent=4))

Using top-snippets

When querying for /top-snippets from a ZeroEntropy collection, you can easily apply the reranker and get a significantly better ranking. Scores from a reranker are deterministic and more readily interpretable, which is another benefit over hybrid search alone.
from zeroentropy import ZeroEntropy
zclient = ZeroEntropy()

# Assuming you have already added documents to the collection "pdfs"
response = zclient.queries.top_snippets(
    collection_name="pdfs",
    query="What is Retrieval Augmented Generation?",
    k=10,
    reranker="zerank-2", # All K results will be reranked using our reranker.
)

print(response.results)

Rate Limiting and Pricing

Rate limits

Each API key is limited to 2,500,000 UTF-8 bytes per minute on the default latency mode "fast", both for embedding and reranking.
| Model | Latency Mode | TPM | RPM |
|---|---|---|---|
| zembed-1 | "fast" | 2,500,000 UTF-8 bytes | 100 |
| zembed-1 | "slow" | 25,000,000 UTF-8 bytes | 100 |
| zerank-2 | "fast" | 2,500,000 UTF-8 bytes | 100 |
| zerank-2 | "slow" | 25,000,000 UTF-8 bytes | 100 |
| zerank-1 | "fast" | 2,500,000 UTF-8 bytes | 100 |
| zerank-1 | "slow" | 25,000,000 UTF-8 bytes | 100 |
| zerank-1-small | "fast" | 2,500,000 UTF-8 bytes | 100 |
| zerank-1-small | "slow" | 25,000,000 UTF-8 bytes | 100 |
A reranker request consumes bytes based on the number of documents and the total length of the input. The formula is:
Total bytes = 150 
+ len(query.encode("utf-8")) 
+ len(document.encode("utf-8"))
This is calculated per document, so the 150-byte overhead and the query length are each counted once per document you pass in. For example, if you send a request with 10 documents, the total usage is:
10 × 150
+ 10 × len(query.encode("utf-8"))
+ ∑ len(document_i.encode("utf-8")) for i in 1…10
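The per-document accounting above can be captured in a small helper (a sketch for estimating usage client-side; the function name is illustrative, not part of the SDK):

```python
def rerank_request_bytes(query: str, documents: list[str]) -> int:
    # Each document contributes a fixed 150-byte overhead plus the
    # UTF-8 lengths of the query and of that document.
    return sum(
        150 + len(query.encode("utf-8")) + len(doc.encode("utf-8"))
        for doc in documents
    )

query = "What is retrieval augmented generation?"  # 39 UTF-8 bytes
docs = ["RAG combines retrieval with generation.", "4", "Unrelated text."]
print(rerank_request_bytes(query, docs))
```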
An embedding request consumes bytes based on the total length of the input being embedded, whether it is a document or a query.
If you exceed your RPM or TPM limit:
  • Your requests will still be served.
  • However, they will be throttled to a high-throughput, high-latency mode.
  • You may experience latency of several seconds per request.
  • In this degraded mode, throughput can go up to 25,000,000 bytes per minute, but with reduced responsiveness.
To avoid throttling, keep your per-minute usage below the 2,500,000-byte (2.5 MB) soft limit.
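One simple way to stay under the soft limit client-side (a sketch, not an official SDK feature) is to track bytes sent in the current one-minute window and pause before exceeding the budget:

```python
import time

BYTE_BUDGET_PER_MINUTE = 2_500_000  # the "fast"-mode soft limit

class ByteThrottle:
    def __init__(self, budget: int = BYTE_BUDGET_PER_MINUTE):
        self.budget = budget
        self.window_start = time.monotonic()
        self.used = 0

    def acquire(self, nbytes: int) -> None:
        now = time.monotonic()
        # Reset the window every 60 seconds.
        if now - self.window_start >= 60:
            self.window_start = now
            self.used = 0
        # Sleep out the rest of the window if this request would exceed the budget.
        if self.used + nbytes > self.budget:
            time.sleep(max(0.0, 60 - (now - self.window_start)))
            self.window_start = time.monotonic()
            self.used = 0
        self.used += nbytes

throttle = ByteThrottle()
throttle.acquire(1_000_000)  # call before each embed/rerank request
```

Compute `nbytes` with the same formula the API uses so the local budget matches the server-side accounting.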

Pricing

Our pricing is simple and transparent.
| Model | Price per 1,000 Tokens | Price per 1M Tokens |
|---|---|---|
| zembed-1 | $0.000050 | $0.050 |
| zerank-2 | $0.000025 | $0.025 |
| zerank-1 | $0.000025 | $0.025 |
| zerank-1-small | $0.000025 | $0.025 |
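For back-of-the-envelope budgeting, the prices above translate directly into code (the dictionary below simply restates the table):

```python
PRICE_PER_1M_TOKENS = {
    "zembed-1": 0.050,
    "zerank-2": 0.025,
    "zerank-1": 0.025,
    "zerank-1-small": 0.025,
}

def estimated_cost_usd(model: str, tokens: int) -> float:
    # Linear pricing: cost scales with token count.
    return tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]

# Embedding 10M tokens with zembed-1:
print(estimated_cost_usd("zembed-1", 10_000_000))
```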

Deployment Options

All of our models are open-weight and available through different deployment options. For help choosing the right option for your use case, reach out to our team.
The hosted API is the fastest way to get started: fully managed infrastructure with no deployment overhead.
  • SDKs: Python | Node
  • Authentication: API key via dashboard. All requests authenticated over TLS. SSO SAML through Okta available for enterprise customers.
  • Regions available: US-East, US-West, Europe.
  • Rate limits: You can refer to the rate limits shown above.
  • Latency: We benchmarked our models' latency in this open-source repository.
  • Status Page: Visit our Status Page to monitor uptime.
    pip install zeroentropy
Your data is never used for model training.
MSA, DPA, and BAA available on request.
See our Trust Portal for SOC 2 Type II, Pentest, and other compliance documentation.