English SDK for Apache Spark

Note

This article covers the English SDK for Apache Spark. The English SDK for Apache Spark is not supported directly by Databricks. To provide feedback, ask questions, and report issues, use the Issues tab in the English SDK for Apache Spark repository in GitHub.

The English SDK for Apache Spark takes English instructions and compiles them into Spark objects. Its goal is to make Spark more user-friendly and accessible, which enables you to focus your efforts on extracting insights from your data.

The following example describes how to use a Databricks Python notebook to call the English SDK for Apache Spark. The example uses a plain English question to guide the English SDK for Apache Spark to run a SQL query on a table from your Databricks workspace.

Requirements

  • Databricks has found that GPT-4 works optimally with the English SDK for Apache Spark. This article uses GPT-4 and assumes that you have an OpenAI API key that is associated with an OpenAI billing plan. To start an OpenAI billing plan, sign in at https://platform.openai.com/account/billing/overview, click Start payment plan, and follow the on-screen directions. After you start an OpenAI billing plan, to generate an OpenAI API key, sign in at https://platform.openai.com/account/api-keys and click Create new secret key.

  • This example uses a Databricks Python notebook in a Databricks workspace that is attached to a Databricks cluster.

Step 1: Install the Python package for the English SDK for Apache Spark

In the notebook’s first cell, run the following code, which installs the latest version of the Python package for the English SDK for Apache Spark on the attached compute resource:

%pip install pyspark-ai --upgrade

Step 2: Restart the Python kernel to use the updated package

In the notebook’s second cell, run the following code, which restarts the Python kernel to use the updated Python package for the English SDK for Apache Spark and its updated package dependencies:

dbutils.library.restartPython()

Step 3: Set your OpenAI API key

In the notebook’s third cell, run the following code, which sets an environment variable named OPENAI_API_KEY to the value of your OpenAI API key. The English SDK for Apache Spark uses this key to authenticate with OpenAI. Replace <your-openai-api-key> with your OpenAI API key:

import os

os.environ['OPENAI_API_KEY'] = '<your-openai-api-key>'

Important

In this example, for speed and ease of use, you hard-code your OpenAI API key into the notebook. In production scenarios, it is a security best practice not to hard-code your OpenAI API key into your notebooks. One alternative approach is to set this environment variable on the attached cluster. See Environment variables.
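
Another alternative is to store the key in a Databricks secret and read it at run time with the Databricks Utilities (dbutils) secrets API. The following minimal sketch assumes that you have already created a secret scope and a secret for your OpenAI API key; the scope and key names shown are placeholders:

import os

# Placeholder scope and key names; replace them with the secret scope and
# secret that you created for your OpenAI API key.
os.environ['OPENAI_API_KEY'] = dbutils.secrets.get(scope='<your-scope>', key='<your-openai-key-name>')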

Step 4: Set and activate the LLM

In the notebook’s fourth cell, run the following code, which sets the LLM that you want the English SDK for Apache Spark to use and then activates the English SDK for Apache Spark with the selected model. For this example, you use GPT-4. By default, the English SDK for Apache Spark looks for an environment variable named OPENAI_API_KEY and uses its value to authenticate with OpenAI to use GPT-4:

from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI

# Select GPT-4 as the chat model for the English SDK for Apache Spark.
chatOpenAI = ChatOpenAI(model='gpt-4')

# Activate the English SDK for Apache Spark with the selected model.
spark_ai = SparkAI(llm=chatOpenAI)
spark_ai.activate()

Tip

To use GPT-4 as the default LLM, you can simplify this code as follows:

from pyspark_ai import SparkAI

spark_ai = SparkAI()
spark_ai.activate()

Step 5: Create a source DataFrame

In the notebook’s fifth cell, run the following code, which selects all of the data in the samples.nyctaxi.trips table from your Databricks workspace and stores this data in a DataFrame that is optimized to work with the English SDK for Apache Spark. This DataFrame is represented here by the variable df:

df = spark_ai._spark.sql("SELECT * FROM samples.nyctaxi.trips")

Step 6: Query the DataFrame by using a plain English question

In the notebook’s sixth cell, run the following code, which asks the English SDK for Apache Spark to print the average trip distance, to the nearest tenth, for each day during January of 2016.

df.ai.transform("What was the average trip distance for each day during the month of January 2016? Print the averages to the nearest tenth.").display()

The English SDK for Apache Spark prints its analysis and final answer as follows:

> Entering new AgentExecutor chain...
Thought: This can be achieved by using the date function to extract the date from the timestamp and then grouping by the date.
Action: query_validation
Action Input: SELECT DATE(tpep_pickup_datetime) as pickup_date, ROUND(AVG(trip_distance), 1) as avg_trip_distance FROM spark_ai_temp_view_2a0572 WHERE MONTH(tpep_pickup_datetime) = 1 AND YEAR(tpep_pickup_datetime) = 2016 GROUP BY pickup_date ORDER BY pickup_date
Observation: OK
Thought:I now know the final answer.
Final Answer: SELECT DATE(tpep_pickup_datetime) as pickup_date, ROUND(AVG(trip_distance), 1) as avg_trip_distance FROM spark_ai_temp_view_2a0572 WHERE MONTH(tpep_pickup_datetime) = 1 AND YEAR(tpep_pickup_datetime) = 2016 GROUP BY pickup_date ORDER BY pickup_date

> Finished chain.

The English SDK for Apache Spark runs its final answer and prints the results as follows:

+-----------+-----------------+
|pickup_date|avg_trip_distance|
+-----------+-----------------+
| 2016-01-01|              3.1|
| 2016-01-02|              3.0|
| 2016-01-03|              3.2|
| 2016-01-04|              3.0|
| 2016-01-05|              2.6|
| 2016-01-06|              2.6|
| 2016-01-07|              3.0|
| 2016-01-08|              2.9|
| 2016-01-09|              2.8|
| 2016-01-10|              3.0|
| 2016-01-11|              2.8|
| 2016-01-12|              2.9|
| 2016-01-13|              2.7|
| 2016-01-14|              3.3|
| 2016-01-15|              3.0|
| 2016-01-16|              3.0|
| 2016-01-17|              2.7|
| 2016-01-18|              2.9|
| 2016-01-19|              3.1|
| 2016-01-20|              2.8|
+-----------+-----------------+
only showing top 20 rows

Next steps

  • Try creating the DataFrame, represented in this example by the variable df, with different data (see the sketch after this list).

  • Try using different plain English questions for the df.ai.transform function.

  • Try using different GPT-4 models. See GPT-4.

  • Explore additional code examples. See the following additional resources.
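
For example, the following minimal sketch combines the first three suggestions in this list. The model name, table name, and question are placeholders; replace them with a GPT-4 model that your OpenAI account can access, a table from your Databricks workspace, and a plain English question about that table:

from langchain.chat_models import ChatOpenAI
from pyspark_ai import SparkAI

# Placeholder model name; use any GPT-4 model that your OpenAI account can access.
spark_ai = SparkAI(llm=ChatOpenAI(model='<your-gpt-4-model>'))
spark_ai.activate()

# Placeholder table name; replace it with a table from your Databricks workspace.
df = spark_ai._spark.sql("SELECT * FROM <catalog>.<schema>.<table>")

# Placeholder question; ask something that matches the table's columns.
df.ai.transform("<your-plain-English-question>").display()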

Additional resources