LangChain on Databricks for LLM development

Important

These are experimental features and the API definitions might change.

This article describes the LangChain integrations that facilitate the development and deployment of large language models (LLMs) on Databricks.

With these LangChain integrations you can:

  • Seamlessly load data from a PySpark DataFrame with the PySpark DataFrame loader.

  • Interactively query your data using natural language with the Spark DataFrame Agent or Databricks SQL Agent.

  • Wrap your Databricks served model as a large language model (LLM) in LangChain.

What is LangChain?

LangChain is a software framework designed to help create applications that utilize large language models (LLMs). LangChain’s strength lies in its wide array of integrations and capabilities. It includes API wrappers, web scraping subsystems, code analysis tools, document summarization tools, and more. It also supports large language models from OpenAI, Anthropic, HuggingFace, etc. out of the box along with various data sources and types.

LangChain is available as an experimental MLflow flavor, which allows LangChain users to leverage MLflow's robust tools and experiment tracking capabilities directly from the Databricks environment. See the LangChain flavor MLflow documentation.

Requirements

  • Databricks Runtime 13.3 ML and above.

  • Databricks recommends pip installing the latest version of LangChain to ensure you have the most recent updates.

    • %pip install --upgrade langchain

Load data with the PySpark DataFrame loader

The PySpark DataFrame loader in LangChain simplifies loading data from a PySpark DataFrame with a single method.

from langchain.document_loaders import PySparkDataFrameLoader

loader = PySparkDataFrameLoader(spark, wikipedia_dataframe, page_content_column="text")
documents = loader.load()
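loader.load() returns a list of LangChain Document objects: the column named in page_content_column supplies each document's page_content, and the remaining columns are typically carried along as metadata (check the behavior in your LangChain version). The mapping can be illustrated with a minimal stand-in class, not the actual LangChain implementation:

```python
from dataclasses import dataclass, field

# Minimal stand-in for LangChain's Document class, for illustration only.
@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

# Rows as a PySpark DataFrame might yield them (hypothetical data).
rows = [
    {"title": "Spark", "text": "Apache Spark is a unified analytics engine."},
    {"title": "MLflow", "text": "MLflow manages the ML lifecycle."},
]

# How a loader maps rows to documents: the page_content_column ("text")
# becomes page_content; the other columns become metadata.
docs = [
    Document(
        page_content=row["text"],
        metadata={k: v for k, v in row.items() if k != "text"},
    )
    for row in rows
]

print(docs[0].page_content)  # Apache Spark is a unified analytics engine.
print(docs[0].metadata)      # {'title': 'Spark'}
```

The resulting documents can then be passed to downstream LangChain components such as text splitters or vector stores.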

The following notebook shows how to use the PySpark DataFrame loader to create a retrieval-based chatbot that is logged with MLflow, which in turn lets the model be loaded as a generic Python function for inference with mlflow.pyfunc.load_model().

PySpark DataFrame loader and MLflow in LangChain notebook

Open notebook in new tab

Spark DataFrame Agent

The Spark DataFrame Agent in LangChain allows interaction with a Spark DataFrame, optimized for question answering. LangChain’s Spark DataFrame Agent documentation provides a detailed example of how to create and use the Spark DataFrame Agent with a DataFrame​.

from langchain.agents import create_spark_dataframe_agent
from langchain.llms import OpenAI

df = spark.read.csv("/databricks-datasets/COVID/coronavirusdataset/Region.csv", header=True, inferSchema=True)
display(df)

agent = create_spark_dataframe_agent(llm=OpenAI(temperature=0), df=df, verbose=True)
...

The following notebook demonstrates how to create and use the Spark DataFrame Agent to help you gain insights on your data.

Use LangChain to interact with a Spark DataFrame notebook

Open notebook in new tab

Databricks SQL Agent

The Databricks SQL Agent is a variant of the standard SQL Database Agent that LangChain provides, and is a more powerful alternative to the Spark DataFrame Agent.

With the Databricks SQL Agent, any Databricks user can interact with a specified schema in Unity Catalog and generate insights on their data.

Important

The Databricks SQL Agent can only query tables, and does not create tables.

In the following example, the database instance is created with the SQLDatabase.from_databricks(catalog="...", schema="...") command, and the agent and its required tools are created by SQLDatabaseToolkit(db=db, llm=llm) and create_sql_agent(llm=llm, toolkit=toolkit, **kwargs), respectively.

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain import OpenAI

db = SQLDatabase.from_databricks(catalog="samples", schema="nyctaxi")
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=.7)
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent.run("What is the longest trip distance and how long did it take?")

Note

OpenAI models require a paid subscription if the free subscription hits a rate limit.

The following notebook demonstrates how to create and use the Databricks SQL Agent to help you better understand the data in your database.

Use LangChain to interact with a SQL database notebook

Open notebook in new tab

Wrap Databricks served models as LLMs

If you have an LLM that you created on Databricks, you can use it directly within LangChain in the place of OpenAI, HuggingFace, or any other LLM provider.

This integration supports two endpoint types:

  • Model serving endpoints, recommended for production and development.

  • Cluster driver proxy app, recommended for interactive development.

Wrap a model serving endpoint

You can wrap a Databricks model serving endpoint as an LLM in LangChain. To do so, you need:

  • A registered LLM deployed to a Databricks model serving endpoint.

  • CAN QUERY permission to the endpoint.

Models often require or recommend important parameters, such as temperature or max_tokens. The following example shows how to pass those parameters for a deployed model named falcon-7b-instruct. Additional details can be found in the Wrapping a serving endpoint LangChain documentation.

from langchain.llms import Databricks

llm = Databricks(endpoint_name="falcon-7b-instruct", model_kwargs={"temperature": 0.1, "max_tokens": 100})
llm("How are you?")
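Some endpoints expect a request or response shape that differs from LangChain's default. The Databricks wrapper accepts transform_input_fn and transform_output_fn hooks for this purpose; the prompt template and endpoint name below are illustrative, so adapt them to your model:

```python
# Hedged sketch: transform hooks that reshape the request and response for
# an endpoint whose schema differs from LangChain's default.
def transform_input(**request):
    # Prepend an instruction prefix that the hypothetical model expects.
    request["prompt"] = f"Answer briefly: {request['prompt']}"
    return request

def transform_output(response):
    # Strip surrounding whitespace from the model's reply.
    return response.strip()

# On Databricks, these would be passed to the wrapper, for example:
# from langchain.llms import Databricks
# llm = Databricks(
#     endpoint_name="falcon-7b-instruct",
#     transform_input_fn=transform_input,
#     transform_output_fn=transform_output,
# )

print(transform_input(prompt="How are you?"))
print(transform_output("  I'm fine.  "))
```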

Wrap a cluster driver proxy application

To wrap a cluster driver proxy application as an LLM in LangChain you need:

  • An LLM loaded on a Databricks interactive cluster in “single user” or “no isolation shared” mode.

  • A local HTTP server running on the driver node to serve the model at “/” using HTTP POST with JSON input/output.

  • An app that uses a port number between 3000 and 8000 and listens on the driver IP address, or simply 0.0.0.0, instead of localhost only.

  • The CAN ATTACH TO permission to the cluster.
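A minimal driver proxy server satisfying these requirements can be sketched with Python's standard library alone. The echo "model" below is a placeholder for your actual LLM call, port 7777 is an arbitrary choice within the allowed range, and the response schema is illustrative, so match it to what your LangChain configuration expects:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, e.g. {"prompt": "...", "temperature": 0.1}.
        length = int(self.headers["Content-Length"])
        request = json.loads(self.rfile.read(length))

        # Placeholder "model": echo the prompt back. Replace this with a
        # real LLM invocation on the driver node.
        result = f"Echo: {request.get('prompt', '')}"

        # Illustrative response schema; align it with your transform_output_fn.
        body = json.dumps({"candidates": [result]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=7777):
    # Bind to 0.0.0.0 (not localhost) so LangChain can reach the driver.
    HTTPServer(("0.0.0.0", port), ModelHandler).serve_forever()
```

Run serve() on the driver node; on the LangChain side, the endpoint can then be wrapped with the Databricks LLM class using its cluster driver port parameter, as shown in the LangChain documentation.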

See the LangChain documentation for Wrapping a cluster driver proxy app for an example.