LangChain on Databricks for LLM development

Important

These are experimental features and the API definitions might change.

This article describes the LangChain integrations that facilitate the development and deployment of large language models (LLMs) on Databricks.

With these LangChain integrations you can:

  • Use Databricks-served models as LLMs or embeddings in your LangChain application.

  • Integrate Mosaic AI Vector Search for vector storage and retrieval.

  • Manage and track your LangChain models and performance in MLflow experiments.

  • Trace the development and production phases of your LangChain application with MLflow Tracing.

  • Seamlessly load data from a PySpark DataFrame with the PySpark DataFrame loader.

  • Interactively query your data using natural language with the Spark DataFrame Agent or Databricks SQL Agent.

What is LangChain?

LangChain is a software framework for building applications that use large language models (LLMs). Its strength lies in its wide array of integrations and capabilities, including API wrappers, web scraping subsystems, code analysis tools, and document summarization tools. It also supports LLMs from providers such as OpenAI, Anthropic, and Hugging Face out of the box, along with various data sources and types.

Leverage MLflow for LangChain development

LangChain is available as an MLflow flavor, which enables users to harness MLflow’s robust tools for experiment tracking and observability in both development and production environments directly within Databricks. For more details and guidance on using MLflow with LangChain, see the MLflow LangChain flavor documentation.
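As a minimal sketch of how the flavor is commonly switched on, the helper below selects an experiment and enables LangChain autologging so that later chain invocations are recorded. The helper name and the experiment path are illustrative assumptions, not part of the integration. The MLflow calls involved are mlflow.set_experiment and mlflow.langchain.autolog, and what gets logged depends on your MLflow version.

```python
def enable_langchain_tracing(experiment_path):
    """Select an MLflow experiment and turn on LangChain autologging.

    `experiment_path` is a hypothetical workspace path such as
    "/Shared/langchain-demo"; replace it with your own.
    """
    # Imported lazily so this helper can be defined without MLflow installed.
    import mlflow

    mlflow.set_experiment(experiment_path)
    # From here on, LangChain invocations in this process are autologged.
    mlflow.langchain.autolog()
    return experiment_path
```

In a Databricks notebook you would call this once, before invoking any chain or model.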

MLflow on Databricks offers additional features that distinguish it from the open-source version, enhancing your development experience with the following capabilities:

  • Fully managed MLflow Tracking Server: Instantly available within your Databricks workspace, allowing you to start tracking experiments without setup delays.

  • Seamless integration with Databricks Notebooks: Experiments are automatically linked to notebooks, streamlining the tracking process.

  • MLflow Tracing on Databricks: Provides production-level monitoring with inference table integration, ensuring end-to-end observability from development to production.

  • Model lifecycle management with Unity Catalog: Centralized control over access, auditing, lineage, and model discovery across your workspaces.

Together, these features let you develop, monitor, and manage your LangChain-based projects in one place.

Requirements

  • Databricks Runtime 13.3 ML or above.

  • Install the LangChain Databricks integration package and Databricks SQL connector. Databricks also recommends pip installing the latest version of LangChain to ensure you have the most recent updates.

    • %pip install --upgrade langchain-databricks langchain-community langchain databricks-sql-connector

Use Databricks-served models as LLMs or embeddings

If you have an LLM or embeddings model served using Databricks Model Serving, you can use it directly within LangChain in place of OpenAI, Hugging Face, or any other LLM provider.

To use a model serving endpoint as an LLM or embeddings model in LangChain you need:

  • A registered LLM or embeddings model deployed to a Databricks model serving endpoint.

    • Alternatively, you can use the models made available by Foundation Model APIs, a curated list of open-source models deployed within your workspace and ready for immediate use.

  • CAN QUERY permission to the endpoint.

Chat models

The following example shows how to use Meta’s Llama 3.1 70B Instruct model as an LLM component in LangChain using Foundation Model APIs.


from langchain_databricks import ChatDatabricks

chat_model = ChatDatabricks(
    endpoint="databricks-meta-llama-3-1-70b-instruct",
    temperature=0.1,
    max_tokens=250,
)
chat_model.invoke("How to use Databricks?")

You can replace the endpoint with the name of the serving endpoint for your custom model. For additional examples, such as streaming, async invocation, and function calling, see the LangChain documentation.
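Streaming can be sketched as follows. The helper names are illustrative assumptions. The sketch relies only on the standard LangChain chat model interface, in which stream() yields message chunks that carry their text in .content. It assumes plain-text chunks, so treat it as a starting point rather than a complete implementation.

```python
def build_chat_model():
    """Create the chat model from the example above (needs langchain-databricks)."""
    # Lazy import: only required when you actually call a serving endpoint.
    from langchain_databricks import ChatDatabricks

    return ChatDatabricks(
        endpoint="databricks-meta-llama-3-1-70b-instruct",
        temperature=0.1,
        max_tokens=250,
    )


def stream_answer(chat_model, prompt):
    """Print the response chunk by chunk and return the full text."""
    parts = []
    for chunk in chat_model.stream(prompt):
        # Each streamed chunk exposes its text in `.content`.
        print(chunk.content, end="", flush=True)
        parts.append(chunk.content)
    print()
    return "".join(parts)


# Usage on Databricks:
# stream_answer(build_chat_model(), "How to use Databricks?")
```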

Embeddings

The following example shows how to use the databricks-bge-large-en embedding model as an embeddings component in LangChain using Foundation Model APIs.


from langchain_databricks import DatabricksEmbeddings

embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en")

For additional details, see the LangChain documentation.
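Once an embeddings object exists, a typical next step is to embed a query and some texts and compare them. The sketch below assumes only the standard LangChain embeddings interface (embed_query and embed_documents). The helper names and the choice of cosine similarity are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(embeddings, query, texts):
    """Return `texts` ordered from most to least similar to `query`."""
    query_vector = embeddings.embed_query(query)
    doc_vectors = embeddings.embed_documents(texts)
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for vector, text in zip(doc_vectors, texts)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored]


# Usage on Databricks, with `embeddings` from the example above:
# rank_documents(embeddings, "What is Delta Lake?", ["text one", "text two"])
```

For more than a handful of texts, a vector index is a better fit than ranking in Python.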

LLMs

Warning

Completion models are considered a legacy feature. Most modern models utilize the chat completion interface and should be used with the ChatModel component instead.

The following example shows how to use your completion model API as an LLM component in LangChain.

from langchain_community.llms import Databricks

llm = Databricks(endpoint_name="falcon-7b-instruct", model_kwargs={"temperature": 0.1, "max_tokens": 100})
llm.invoke("How are you?")

Use Mosaic AI Vector Search as a vector store

Mosaic AI Vector Search is a serverless similarity search engine on Databricks, enabling you to store vector representations of your data, including metadata, in a vector database. You can create auto-updating vector search indexes from Delta tables managed by Unity Catalog and query them via a simple API to retrieve the most similar vectors.

To use this feature in LangChain, create a DatabricksVectorSearch instance:

from langchain_databricks import DatabricksVectorSearch

vector_store = DatabricksVectorSearch(index_name="<YOUR_VECTOR_SEARCH_INDEX_NAME>")
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
retriever.invoke("What is Databricks?")

Refer to the DatabricksVectorSearch documentation for further details.
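A retriever returns a list of documents, and a common next step is to fold their text into a prompt for a chat model. The sketch below shows one way to do that. The helper names and the prompt wording are assumptions made for illustration. The only interface it relies on is the page_content attribute of the returned documents.

```python
def format_context(documents, separator="\n\n"):
    """Join the text of the retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in documents)


def build_prompt(question, documents):
    """Combine the retrieved context and the user question into one prompt."""
    context = format_context(documents)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Usage on Databricks, with `retriever` from the example above:
# docs = retriever.invoke("What is Databricks?")
# prompt = build_prompt("What is Databricks?", docs)
```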

Use Unity Catalog functions as tools

Note

The Unity Catalog function integration is in the langchain-community package. You must install it using %pip install langchain-community to access its functionality. This integration will migrate to the langchain-databricks package in an upcoming release.

You can expose SQL or Python functions in Unity Catalog as tools for your LangChain agent. For full guidance on creating Unity Catalog functions and using them in LangChain, see the Databricks UC Toolkit documentation.

Load data with the PySpark DataFrame loader

The PySpark DataFrame loader in LangChain simplifies loading data from a PySpark DataFrame with a single method.

from langchain_community.document_loaders import PySparkDataFrameLoader

loader = PySparkDataFrameLoader(spark, wikipedia_dataframe, page_content_column="text")
documents = loader.load()
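To make the mapping concrete, the following pure-Python sketch mimics what the loader is understood to do for each row: the page_content_column value becomes the document text and the remaining columns become metadata. The Document class here is a stand-in, rows are plain dictionaries instead of Spark rows, and the function name is hypothetical. Check the loader's own documentation for its exact behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column="text"):
    """Turn each row (a dict of column name to value) into a Document."""
    documents = []
    for row in rows:
        metadata = dict(row)  # copy so the input row is left untouched
        text = metadata.pop(page_content_column)
        documents.append(Document(page_content=text, metadata=metadata))
    return documents
```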

The following notebook shows how to use the PySpark DataFrame loader to create a retrieval-based chatbot that is logged with MLflow. Logging the chatbot with MLflow lets you load it as a generic Python function for inference with mlflow.pyfunc.load_model().

PySpark DataFrame loader and MLflow in Langchain notebook


Spark DataFrame Agent

The Spark DataFrame Agent in LangChain allows interaction with a Spark DataFrame, optimized for question answering. LangChain’s Spark DataFrame Agent documentation provides a detailed example of how to create and use the Spark DataFrame Agent with a DataFrame.

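# Assumes the langchain-experimental and langchain-openai packages are installed.
from langchain_experimental.agents.agent_toolkits import create_spark_dataframe_agent
from langchain_openai import OpenAI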

df = spark.read.csv("/databricks-datasets/COVID/coronavirusdataset/Region.csv", header=True, inferSchema=True)
display(df)

agent = create_spark_dataframe_agent(llm=OpenAI(temperature=0), df=df, verbose=True)
...

The following notebook demonstrates how to create and use the Spark DataFrame Agent to help you gain insights on your data.

Use LangChain to interact with a Spark DataFrame notebook


Databricks SQL Agent

With the Databricks SQL Agent, any Databricks user can interact with a specified schema in Unity Catalog and generate insights on their data.

Important

The Databricks SQL Agent can only query tables. It does not create tables.

In the following example, the database instance is created by the SQLDatabase.from_databricks(catalog="...", schema="...") command, and the required tools and the agent are created by SQLDatabaseToolkit(db=db, llm=llm) and create_sql_agent(llm=llm, toolkit=toolkit, **kwargs), respectively.

from langchain_community.agent_toolkits import SQLDatabaseToolkit, create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_databricks import ChatDatabricks

# Note: Databricks SQL connections eventually time out. We set pool_pre_ping: True to
# try to ensure connection health is checked before a SQL query is made
db = SQLDatabase.from_databricks(catalog="samples", schema="nyctaxi", engine_args={"pool_pre_ping": True})
llm = ChatDatabricks(
    endpoint="databricks-meta-llama-3-1-70b-instruct",
    temperature=0.1,
    max_tokens=250,
)

toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent.invoke({"input": "What is the longest trip distance and how long did it take?"})
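To give a feel for what the agent does with such a question, the sketch below runs the kind of SQL it might generate. Everything here is an assumption made for illustration: sqlite3 stands in for Databricks SQL, the column names are modeled on the samples.nyctaxi trips table, the rows are made up, and the SQL an agent actually produces varies from run to run.

```python
import sqlite3

# Assumed column names, modeled on the samples.nyctaxi trips table.
LONGEST_TRIP_SQL = """
SELECT trip_distance,
       (strftime('%s', tpep_dropoff_datetime)
        - strftime('%s', tpep_pickup_datetime)) / 60.0 AS duration_minutes
FROM trips
ORDER BY trip_distance DESC
LIMIT 1
"""


def make_demo_connection():
    """Build an in-memory table holding a few made-up trips."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trips ("
        "tpep_pickup_datetime TEXT, tpep_dropoff_datetime TEXT, trip_distance REAL)"
    )
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?, ?)",
        [
            ("2016-02-14 16:52:13", "2016-02-14 17:16:04", 4.94),
            ("2016-02-04 18:44:19", "2016-02-04 18:46:00", 0.28),
            ("2016-02-17 17:13:57", "2016-02-17 17:47:57", 30.6),
        ],
    )
    return conn


def longest_trip(conn):
    """Return (trip_distance, duration_minutes) for the longest trip."""
    return conn.execute(LONGEST_TRIP_SQL).fetchone()
```

Because the query only reads from the table, it matches the read-only behavior of the agent described above.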

The following notebook demonstrates how to create and use the Databricks SQL Agent to help you better understand the data in your database.

Use LangChain to interact with a SQL database notebook
