Databricks Assistant: sample tasks
Preview
This feature is in Public Preview.
Databricks Assistant works as an AI-based pair programmer to make you more efficient as you create notebooks, queries, and files. It can help you rapidly answer questions by generating, optimizing, completing, explaining, and fixing code and queries.
For general information about Databricks Assistant, see Databricks Assistant FAQ.
The prompt you provide can significantly change the output of the assistant. Try adding one of the following to your prompts:
“No explanatory text” when generating code.
“Explain the code to me step by step”.
“Show me two/three options that I can try”.
“Be concise”.
You can also experiment with the following types of queries:
Write a SQL UDF to reverse a string.
Add a date filter to this query to restrict results to the last 30 days.
Help me plot a graph from the results of a SQL query. The query results are in the format of a Pandas DataFrame. The x-axis should be labeled ‘Week’ and the y-axis should be labeled ‘Distinct weekly users’.
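For the plotting prompt above, the Assistant might return something like the following sketch. The sample DataFrame here is hypothetical stand-in data (in a notebook, it would come from your SQL query results), and the `Agg` backend is used only so the code runs outside a notebook:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs outside a notebook
import matplotlib.pyplot as plt

# Hypothetical query results: one row per week with a distinct-user count
pdf = pd.DataFrame({
    "week": ["2023-W01", "2023-W02", "2023-W03", "2023-W04"],
    "distinct_users": [120, 135, 128, 150],
})

fig, ax = plt.subplots()
ax.plot(pdf["week"], pdf["distinct_users"], marker="o")
ax.set_xlabel("Week")
ax.set_ylabel("Distinct weekly users")
fig.tight_layout()
```

In a notebook, the figure renders inline; here the labeled axes match the prompt's requirements.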
Generate code examples
Analyze data
Starting code:
import pandas as pd
# Read the sample NYC Taxi Trips dataset and load it into a DataFrame
df = spark.read.table('samples.nyctaxi.trips')
Assistant prompt:
generate pandas code to convert the pyspark dataframe to a pandas dataframe and select the 10 most expensive trips from df based on the fare_amount column
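A typical Assistant response to this prompt looks like the following sketch. A small in-memory DataFrame stands in for the Spark table here; in a notebook, the first line would instead be `pdf = df.toPandas()`:

```python
import pandas as pd

# In a notebook this would be: pdf = df.toPandas()
pdf = pd.DataFrame({
    "trip_id": range(1, 16),
    "fare_amount": [5.5, 42.0, 8.0, 99.5, 12.0, 3.5, 27.0, 61.0,
                    18.5, 7.0, 33.0, 88.0, 4.0, 55.5, 21.0],
})

# Select the 10 most expensive trips based on the fare_amount column
most_expensive_trips = pdf.nlargest(10, "fare_amount")
```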
Transform or optimize code examples
Translate Pandas to PySpark
Starting code:
import pandas as pd
# Convert Spark DataFrame to Pandas DataFrame
pdf = df.toPandas()
# Select the 10 most expensive trips based on the fare_amount column
most_expensive_trips = pdf.nlargest(10, 'fare_amount')
# Show the result
most_expensive_trips
Assistant prompt:
convert this code to PySpark
Generate more efficient code
Assistant prompt:
Show me a code example of inefficient python code, explain why it is inefficient, and then show me an improved version of that code that is more efficient. Explain why it is more efficient, then give me a list of strings to test this out with and the code to benchmark trying each one out.
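A response to this prompt might pair string concatenation in a loop (which can repeatedly copy the accumulated string) with `str.join` (which builds the result in one pass), plus a simple `timeit` benchmark. This is a sketch of the kind of answer the Assistant produces, not its exact output:

```python
import timeit

def join_inefficient(strings):
    # Inefficient: each += can copy the whole accumulated string,
    # so the loop can cost quadratic time in total characters
    result = ""
    for s in strings:
        result += s
    return result

def join_efficient(strings):
    # Efficient: str.join allocates the result once, linear overall
    return "".join(strings)

# A list of strings to test with, and a benchmark of each version
strings = ["lorem", "ipsum", "dolor", "sit", "amet"] * 1000
slow = timeit.timeit(lambda: join_inefficient(strings), number=50)
fast = timeit.timeit(lambda: join_efficient(strings), number=50)
print(f"inefficient: {slow:.4f}s  efficient: {fast:.4f}s")
```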
Assistant prompt:
Write me a function to benchmark the execution of code in this cell, then give me another way to write this code that is more efficient and would perform better in the benchmark.
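For this prompt, the Assistant might return a small benchmarking helper like the sketch below, built on `timeit.repeat`. The compared functions are hypothetical examples standing in for "the code in this cell":

```python
import timeit
from statistics import median

def benchmark(fn, *args, repeat=5, number=100):
    """Run fn(*args) `number` times per trial; return the median trial time."""
    times = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return median(times)

# Hypothetical cell code: two ways of squaring a list of numbers
def squares_loop(xs):
    out = []
    for x in xs:
        out.append(x * x)
    return out

def squares_comprehension(xs):
    # Usually faster: avoids repeated attribute lookup and method calls
    return [x * x for x in xs]

xs = list(range(1000))
print("loop:", benchmark(squares_loop, xs))
print("comprehension:", benchmark(squares_comprehension, xs))
```

Taking the median of several trials makes the measurement less sensitive to one-off system noise than a single `timeit.timeit` call.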
Complete code examples
You can use Databricks Assistant to generate code from comments in a cell.
On macOS, press Shift+Option+Space or Control+Option+Space directly in a cell.
On Windows, press Ctrl+Shift+Space directly in a cell.
To accept the suggested code, press Tab.
Explain code examples
Basic code explanation
Starting code:
PySpark code that gets the total number of trips and sum of the fare amounts between the pickup and dropoff zip codes.
import pyspark.sql.functions as F
fare_by_route = df.groupBy(
'pickup_zip', 'dropoff_zip'
).agg(
F.sum('fare_amount').alias('total_fare'),
F.count('fare_amount').alias('num_trips')
).sort(F.col('num_trips').desc())
display(fare_by_route)
Assistant prompt:
Explain what this code does
Fast documentation lookups
Assistant prompt:
When should I use repartition() vs. coalesce() in Apache Spark?
Assistant prompt:
What is the difference between the various pandas_udf functions (in PySpark and Pandas on Spark/Koalas) and when should I choose each? Can you show me an example of each with the diamonds dataset?
Fix code examples
Debugging
Starting code:
This is the same code used in the basic code explanation example, but missing the import statement. It throws the error: NameError: name 'F' is not defined.
fare_by_route = df.groupBy(
'pickup_zip', 'dropoff_zip'
).agg(
F.sum('fare_amount').alias('total_fare'),
F.count('fare_amount').alias('num_trips')
).sort(F.col('num_trips').desc())
display(fare_by_route)
Assistant prompt:
How do I fix this error? What is 'F'?
Help with errors
Starting code:
This code throws the error “AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION]”.
from pyspark.sql.functions import col
# create a dataframe with two columns: a and b
df = spark.range(5).select(col('id').alias('a'), col('id').alias('b'))
# try to select a non-existing column c
df.select(col('c')).show()
Assistant prompt:
Why am I getting this error and how do I fix it?