%pip install --upgrade langchain faiss-cpu mlflow
# For GPU clusters use the following
# %pip install --upgrade langchain faiss-gpu mlflow
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
Requirement already satisfied: langchain in /local_disk0/.ephemeral_nfs/envs/pythonEnv-d99e329b-1aa4-455e-8f42-f00c2383c3de/lib/python3.10/site-packages (0.0.235)
Requirement already satisfied: faiss-cpu in /local_disk0/.ephemeral_nfs/envs/pythonEnv-d99e329b-1aa4-455e-8f42-f00c2383c3de/lib/python3.10/site-packages (1.7.4)
Requirement already satisfied: mlflow in /local_disk0/.ephemeral_nfs/envs/pythonEnv-d99e329b-1aa4-455e-8f42-f00c2383c3de/lib/python3.10/site-packages (2.5.0)
from langchain.document_loaders import PySparkDataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load each row of the Spark DataFrame as a document, using the "text" column as page content
loader = PySparkDataFrameLoader(spark, wikipedia_dataframe, page_content_column="text")
documents = loader.load()

# Split the documents into chunks of at most 3000 characters with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print(f"Number of documents: {len(texts)}")
Number of documents: 421
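To make the effect of chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not LangChain's RecursiveCharacterTextSplitter, which additionally prefers to split on separators such as paragraph and line breaks. The function name chunk_text and the sample string are hypothetical and used only for illustration.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4))                   # no overlap
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # windows share 2 characters
```

With chunk_overlap=0, as in the cell above, every character lands in exactly one chunk; a positive overlap repeats text across neighbouring chunks, which can help when an answer straddles a chunk boundary.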
query = "Who is Harrison Schmitt"
result = retrieval_qa({"query": query})
print("Result:", result["result"])
Result: Harrison Hagan "Jack" Schmitt is an American geologist, retired NASA astronaut, university professor and former U.S. senator from New Mexico. He was the twelfth person to set foot on the Moon, and the second-to-last person to step off of the Moon. He is also the first and only professional scientist to have flown beyond low Earth orbit and to have visited the Moon.
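The cell that builds retrieval_qa is not shown here. Conceptually, a retrieval QA chain embeds the query, fetches the most similar chunks from the vector index, and passes them to the LLM as context. The sketch below illustrates only the retrieval step, with hand-written toy vectors standing in for real embeddings and brute-force cosine similarity standing in for FAISS. All names and vectors (toy_index, query_vec, top_k) are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k entries of index most similar to query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": made-up vectors, not produced by any real model
toy_index = {
    "Harrison Schmitt biography": [0.9, 0.1, 0.0],
    "Apollo 17 mission": [0.7, 0.6, 0.1],
    "New Mexico geography": [0.1, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

for name, score in top_k(query_vec, toy_index, k=2):
    print(f"{name}: {score:.3f}")
```

In the actual chain, the retrieved chunks would be inserted into a prompt and sent to the LLM, which produces the answer printed above.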
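import mlflow
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Persist the FAISS index so it can be reloaded alongside the logged chain
persist_directory = "langchain/faiss_index"
db.save_local(persist_directory)


def load_retriever(persist_directory):
    # Rebuild the retriever from the saved index when the model is loaded
    embeddings = OpenAIEmbeddings()
    db = FAISS.load_local(persist_directory, embeddings)
    return db.as_retriever()


# Log the RetrievalQA chain
with mlflow.start_run() as mlflow_run:
    logged_model = mlflow.langchain.log_model(
        retrieval_qa,
        "retrieval_qa_chain",
        loader_fn=load_retriever,
        persist_dir=persist_directory,
    )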
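The loader_fn and persist_dir arguments follow a save-then-reload contract: the index is written to a directory, and a function that takes that directory and returns the rebuilt object is handed over for use at load time. The pure-Python sketch below mimics that contract with a JSON file in place of a FAISS index. It is an illustration under those assumptions, not how MLflow is implemented, and save_index, load_index, and restore are hypothetical names.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

Index = Dict[str, List[float]]


def save_index(index: Index, persist_dir: str) -> None:
    """Write the toy index to persist_dir (plays the role of db.save_local)."""
    os.makedirs(persist_dir, exist_ok=True)
    with open(os.path.join(persist_dir, "index.json"), "w") as f:
        json.dump(index, f)


def load_index(persist_dir: str) -> Index:
    """Rebuild the toy index from persist_dir (plays the role of load_retriever)."""
    with open(os.path.join(persist_dir, "index.json")) as f:
        return json.load(f)


def restore(loader_fn: Callable[[str], Index], persist_dir: str) -> Index:
    """Stand-in for the load-time step: call the loader with the saved directory."""
    return loader_fn(persist_dir)


toy_index = {"Harrison Schmitt": [0.9, 0.1]}

with tempfile.TemporaryDirectory() as tmp:
    persist_dir = os.path.join(tmp, "toy_index")
    save_index(toy_index, persist_dir)
    restored = restore(load_index, persist_dir)

print(restored == toy_index)
```

The point of the design is that the loader only needs the directory path, so the saved files can be copied with the model and reloaded in a different process.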
model_uri = f"runs:/{mlflow_run.info.run_id}/retrieval_qa_chain"
loaded_pyfunc_model = mlflow.pyfunc.load_model(model_uri)

langchain_input = {"query": "Who is Harrison Schmitt"}
loaded_pyfunc_model.predict([langchain_input])
Out[72]: [' Harrison Schmitt is an American geologist, retired NASA astronaut, university professor and former U.S. senator from New Mexico. He was the twelfth person to set foot on the Moon and is the second-to-last person to step off the Moon. He was influential within the community of geologists supporting the Apollo program and, before starting his own preparations for an Apollo mission, had been one of the scientists training those Apollo astronauts chosen to visit the lunar surface. He was appointed as Secretary of the New Mexico Energy, Minerals and Natural Resources Department in the cabinet of Governor Susana Martinez, but was forced to give up the appointment the following month after refusing to submit to a required background investigation.']
PySpark DataFrame Loader and MLflow in LangChain
This notebook showcases the integration between PySpark and LangChain and includes how to:
Requirements