
Best practices for Mosaic AI Vector Search

This article gives some tips for how to use Mosaic AI Vector Search most effectively.

Recommendations for optimizing latency

  • Use the service principal authorization flow to take advantage of network-optimized routes. Service principal authorization can improve per-query latency by up to 100 ms compared with personal access tokens.

  • Use the latest version of the Python SDK.

  • When testing, start with a concurrency of around 16 to 32. Higher concurrency does not yield higher throughput.

  • Use a model served with provisioned throughput (for example, bge-large-en or a fine-tuned version), instead of a pay-per-token foundation model.

  • Make sure you fetch the index only once, not on every query. Calling client.get_index(...).similarity_search(...) adds latency to every request. Instead, use the following:

    Python
    # Initialize index
    index = client.get_index(...)

    # Then later, for every query
    index.similarity_search(...)

    This is especially important when using a Vector Search index in MLflow environments: create the index object when you create the endpoint, then reuse it for every query.
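
The pattern above can be sketched end to end. In this illustrative example, `run_queries` is a hypothetical helper (not part of the SDK) that reuses a single pre-fetched index object and fans queries out at the recommended concurrency; the column names and `num_results` value are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def run_queries(index, queries, concurrency=16):
    """Run similarity searches against a single, pre-fetched index object.

    `index` is fetched once (index = client.get_index(...)) and reused for
    every query, avoiding the extra round trip incurred by calling
    client.get_index(...).similarity_search(...) per request.
    """
    def one_query(query_text):
        return index.similarity_search(
            query_text=query_text,
            columns=["id", "text"],  # illustrative column names
            num_results=5,
        )

    # Start with around 16-32 workers; higher concurrency does not
    # yield higher throughput.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one_query, queries))
```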

When to use GPUs

  • Use CPUs only for basic testing and for small datasets (up to hundreds of rows).
  • For GPU compute type, Databricks recommends using GPU-small or GPU-medium.
  • For GPU compute scale-out, increasing concurrency can improve ingestion times, but the benefit depends on factors such as total dataset size and index metadata.

Working with images, video, or non-text data

  • Pre-compute the embeddings and use a Delta Sync Index with self-managed embeddings.
  • Don’t store binary formats such as images as metadata, as this adversely affects latency. Instead, store the path of the file as metadata.
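
As a sketch of these two recommendations, the following assembles rows for the Delta table that backs a Delta Sync Index with self-managed embeddings. `build_image_records` and `embed_image` are hypothetical names and the column names are placeholders; the point is that each row carries a precomputed embedding and a file path, never the raw image bytes.

```python
def build_image_records(image_paths, embed_image):
    """Precompute embeddings and keep only the file path as metadata.

    `embed_image` is any function mapping a file path to an embedding
    vector (a list of floats). Raw image bytes are deliberately NOT
    stored in the record, since storing binaries as metadata hurts
    query latency.
    """
    records = []
    for i, path in enumerate(image_paths):
        records.append(
            {
                "id": i,                        # primary key for the index
                "image_path": path,             # metadata: the path, not bytes
                "embedding": embed_image(path), # self-managed embedding
            }
        )
    return records
```

When creating the Delta Sync Index, point its self-managed embedding column parameter at the precomputed `embedding` column.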

Embedding sequence length

  • Check the embedding model sequence length to make sure documents are not being truncated. For example, BGE supports a context of 512 tokens. For longer context requirements, use gte-large-en-v1.5.
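
A minimal way to guard against truncation is to chunk documents before embedding them. In this sketch, whitespace word count stands in for the model's tokenizer, so treat the limit as approximate; in practice, count tokens with the embedding model's own tokenizer, since subword tokenization usually produces more tokens than words.

```python
def chunk_words(text, max_tokens=512):
    """Split text into chunks of at most `max_tokens` words.

    Word count is a rough stand-in for the embedding model's token
    count; use the model's tokenizer for an exact limit.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```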

Use Triggered sync mode to reduce costs

  • The most cost-effective option for updating a vector search index is Triggered. Select Continuous only if you need the index to track changes in the source table with a latency of seconds. Both sync modes perform incremental updates: only data that has changed since the last sync is processed.
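
The trade-off can be captured in a small illustrative helper. `choose_pipeline_type` and its 60-second threshold are assumptions for this sketch, not part of any API; the returned strings reflect the two sync modes described above.

```python
def choose_pipeline_type(max_staleness_seconds):
    """Pick a sync mode from a freshness requirement.

    Continuous sync keeps the index within seconds of the source table
    but runs continuously; triggered sync is the most cost-effective
    option when some staleness is acceptable. Both modes are
    incremental: only rows changed since the last sync are processed.
    """
    # Needing second-level freshness is the only reason to choose
    # continuous sync; the 60 s cutoff is an illustrative stand-in
    # for "latency of seconds".
    return "CONTINUOUS" if max_staleness_seconds < 60 else "TRIGGERED"
```

In triggered mode, run the index's sync operation on your own schedule (for example, from a scheduled job after the source table is updated).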