Databricks Assistant: sample tasks

Preview

This feature is in Public Preview.

Databricks Assistant works as an AI-based pair programmer to make you more efficient as you create notebooks, queries, and files. It can help you rapidly answer questions by generating, optimizing, completing, explaining, and fixing code and queries.

For general information about Databricks Assistant, see Databricks Assistant FAQ.

The prompt you provide can significantly change the output of the assistant. Try adding one of the following to your prompts:

  • “No explanatory text” when generating code.

  • “Explain the code to me step by step”.

  • “Show me two/three options that I can try”.

  • “Be concise”.

You can also experiment with the following types of queries:

  • Write a SQL UDF to reverse a string.

  • Add a date filter to this query to restrict results to the last 30 days.

  • Help me plot a graph from the results of a SQL query. The query results are in the format of a Pandas DataFrame. The x-axis should be labeled ‘Week’ and the y-axis should be labeled ‘Distinct weekly users’.
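As an illustration, the last prompt above might produce something like the following sketch. The column names `week` and `distinct_users`, and the sample values, are assumptions; a real response would plot your actual query results.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt

# Hypothetical query results: one row per week
query_results = pd.DataFrame({
    "week": ["2023-01", "2023-02", "2023-03"],
    "distinct_users": [120, 135, 150],
})

# Bar chart with the axis labels requested in the prompt
fig, ax = plt.subplots()
ax.bar(query_results["week"], query_results["distinct_users"])
ax.set_xlabel("Week")
ax.set_ylabel("Distinct weekly users")
```

In a notebook, the figure renders inline; the same labeling calls work regardless of where the data comes from.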

Generate code examples

Analyze data

Starting code:

import pandas as pd

# Read the sample NYC Taxi Trips dataset and load it into a DataFrame
df = spark.read.table('samples.nyctaxi.trips')

Assistant prompt:

generate pandas code to convert the pyspark dataframe to a pandas dataframe and select the 10 most expensive trips from df based on the fare_amount column

Create a DataFrame reader

Starting code:

View the data in the bikeSharing dataset.

display(dbutils.fs.ls("dbfs:/databricks-datasets/bikeSharing/data-001/"))

Assistant prompt:

Generate code to read the day.csv file in the bikeSharing dataset
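A typical response uses `spark.read.csv` with `header` and `inferSchema` options on the `dbfs:` path shown above. The same pattern can be sketched locally with pandas; the inline sample below stands in for `day.csv`, and the column names (`instant`, `dteday`, `cnt`) are a subset of the real file's columns.

```python
import pandas as pd
from io import StringIO

# Stand-in for dbfs:/databricks-datasets/bikeSharing/data-001/day.csv
sample_csv = StringIO(
    "instant,dteday,cnt\n"
    "1,2011-01-01,985\n"
    "2,2011-01-02,801\n"
)

# Read the CSV, parsing the date column into datetimes
day_df = pd.read_csv(sample_csv, parse_dates=["dteday"])
print(day_df.head())
```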

Transform or optimize code examples

Translate Pandas to PySpark

Starting code:

import pandas as pd

# Convert Spark DataFrame to Pandas DataFrame
pdf = df.toPandas()

# Select the 10 most expensive trips based on the fare_amount column
most_expensive_trips = pdf.nlargest(10, 'fare_amount')

# Show the result
most_expensive_trips

Assistant prompt:

convert this code to PySpark

Generate more efficient code

Assistant prompt:

Show me a code example of inefficient python code, explain why it is inefficient, and then show me an improved version of that code that is more efficient. Explain why it is more efficient, then give me a list of strings to test this out with and the code to benchmark trying each one out.
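A response to this prompt might resemble the following sketch: string concatenation with `+=` in a loop copies the accumulated string on every iteration, while `str.join` allocates the final string once. The function names and test data are illustrative.

```python
import timeit

def concat_slow(strings):
    # Inefficient: each += copies the accumulated string
    result = ""
    for s in strings:
        result += s
    return result

def concat_fast(strings):
    # Efficient: join allocates the final string once
    return "".join(strings)

# Strings to test with, plus a simple benchmark of each approach
words = ["spark"] * 1000
assert concat_slow(words) == concat_fast(words)
print("slow:", timeit.timeit(lambda: concat_slow(words), number=100))
print("fast:", timeit.timeit(lambda: concat_fast(words), number=100))
```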

Assistant prompt:

Write me a function to benchmark the execution of code in this cell, then give me another way to write this code that is more efficient and would perform better in the benchmark.
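For the benchmarking half of this prompt, the Assistant might produce a helper along these lines. The name `benchmark` and the best-of-N design are assumptions; `time.perf_counter` is the standard-library clock suited to this.

```python
import time

def benchmark(fn, *args, repeats=5, **kwargs):
    # Run fn several times and return the best wall-clock time,
    # which reduces noise from other processes
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best

# Example: time sorting a small list
print(benchmark(sorted, list(range(1000))))
```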

Complete code examples

You can use LakeSense to generate code from comments in a cell.

  • On macOS, press shift + option + space or control + option + space directly in a cell.

  • On Windows, press ctrl + shift + space directly in a cell.

To accept the suggested code, press tab.

Reverse a string

Starting code:

# Write code to reverse a string.
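One completion the comment above might trigger looks like this; the function name is illustrative, and slicing with a step of -1 is the idiomatic Python approach.

```python
# Write code to reverse a string.
def reverse_string(s: str) -> str:
    # Slicing with a step of -1 walks the string backwards
    return s[::-1]

print(reverse_string("Databricks"))  # skcirbataD
```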

Perform exploratory data analysis

Starting code:

# Load the wine dataset into a DataFrame from sklearn, bucket the data into 3 groups by quality, then visualize in a plotly barchart.

Explain code examples

Basic code explanation

Starting code:

This PySpark code computes the total number of trips and the sum of the fare amounts for each pair of pickup and dropoff ZIP codes.

import pyspark.sql.functions as F

fare_by_route = df.groupBy(
    'pickup_zip', 'dropoff_zip'
).agg(
    F.sum('fare_amount').alias('total_fare'),
    F.count('fare_amount').alias('num_trips')
).sort(F.col('num_trips').desc())

display(fare_by_route)

Assistant prompt:

Explain what this code does

Fast documentation lookups

Assistant prompt:

When should I use repartition() vs. coalesce() in Apache Spark?

Assistant prompt:

What is the difference between the various pandas_udf functions (in PySpark and Pandas on Spark/Koalas), and when should I choose each? Can you show me an example of each with the diamonds dataset?

Fix code examples

Debugging

Starting code:

This is the same code used in the basic code explanation example, but without the import statement. It throws the error NameError: name 'F' is not defined.

fare_by_route = df.groupBy(
    'pickup_zip', 'dropoff_zip'
).agg(
    F.sum('fare_amount').alias('total_fare'),
    F.count('fare_amount').alias('num_trips')
).sort(F.col('num_trips').desc())

display(fare_by_route)

Assistant prompt:

How do I fix this error? What is 'F'?

Help with errors

Starting code:

This code throws the error “AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION]”.

from pyspark.sql.functions import col

# create a dataframe with two columns: a and b
df = spark.range(5).select(col('id').alias('a'), col('id').alias('b'))

# try to select a non-existing column c
df.select(col('c')).show()

Assistant prompt:

Why am I getting this error and how do I fix it?