Tutorial: Create, run, and test dbt models locally
This tutorial walks you through how to create, run, and test dbt models locally. You can also run dbt projects as Databricks job tasks. For more information, see Use dbt transformations in Lakeflow Jobs.
Before you begin
To follow this tutorial, you must first connect your Databricks workspace to dbt Core. For more information, see Connect to dbt Core.
Step 1: Create and run models
In this step, you use your favorite text editor to create models, which are select statements that create either a new view (the default) or a new table in a database, based on existing data in that same database. This procedure creates a model based on the sample diamonds table from the Sample datasets.
Use the following code to create this table.
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")
- 
In the project's modelsdirectory, create a file nameddiamonds_four_cs.sqlwith the following SQL statement. This statement selects only the carat, cut, color, and clarity details for each diamond from thediamondstable. Theconfigblock instructs dbt to create a table in the database based on this statement.{{ config(
 materialized='table',
 file_format='delta'
 ) }}SQLselect carat, cut, color, clarity
 from diamondstipFor additional configoptions such as using the Delta file format and themergeincremental strategy, see Databricks configurations in the dbt documentation.
- 
In the project's modelsdirectory, create a second file nameddiamonds_list_colors.sqlwith the following SQL statement. This statement selects unique values from thecolorscolumn in thediamonds_four_cstable, sorting the results in alphabetical order first to last. Because there is noconfigblock, this model instructs dbt to create a view in the database based on this statement.SQLselect distinct color
 from {{ ref('diamonds_four_cs') }}
 sort by color asc
- 
In the project's modelsdirectory, create a third file nameddiamonds_prices.sqlwith the following SQL statement. This statement averages diamond prices by color, sorting the results by average price from highest to lowest. This model instructs dbt to create a view in the database based on this statement.SQLselect color, avg(price) as price
 from diamonds
 group by color
 order by price desc
- 
With the virtual environment activated, run the dbt runcommand with the paths to the three preceding files. In thedefaultdatabase (as specified in theprofiles.ymlfile), dbt creates one table nameddiamonds_four_csand two views nameddiamonds_list_colorsanddiamonds_prices. dbt gets these view and table names from their related.sqlfile names.Bashdbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sqlConsole...
 ... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]
 ... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]
 ... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]
 ... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]
 ... | 3 of 3 START view model default.diamonds_prices...................... [RUN]
 ... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]
 ... |
 ... | Finished running 1 table model, 2 view models ...
 Completed successfully
 Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
- 
Run the following SQL code to list information about the new views and to select all rows from the table and views. If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query. SQLSHOW views IN default;Console+-----------+----------------------+-------------+
 | namespace | viewName | isTemporary |
 +===========+======================+=============+
 | default | diamonds_list_colors | false |
 +-----------+----------------------+-------------+
 | default | diamonds_prices | false |
 +-----------+----------------------+-------------+SQLSELECT * FROM diamonds_four_cs;Console+-------+---------+-------+---------+
 | carat | cut | color | clarity |
 +=======+=========+=======+=========+
 | 0.23 | Ideal | E | SI2 |
 +-------+---------+-------+---------+
 | 0.21 | Premium | E | SI1 |
 +-------+---------+-------+---------+
 ...SQLSELECT * FROM diamonds_list_colors;Console+-------+
 | color |
 +=======+
 | D |
 +-------+
 | E |
 +-------+
 ...SQLSELECT * FROM diamonds_prices;Console+-------+---------+
 | color | price |
 +=======+=========+
 | J | 5323.82 |
 +-------+---------+
 | I | 5091.87 |
 +-------+---------+
 ...
Step 2: Create and run more complex models
In this step, you create more complex models for a set of related data tables. These data tables contain information about a fictional sports league of three teams playing a season of six games. This procedure creates the data tables, creates the models, and runs the models.
- 
Run the following SQL code to create the necessary data tables. If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query. The tables and views in this step start with zzz_to help identify them as part of this example. You do not need to follow this pattern for your own tables and views.SQLDROP TABLE IF EXISTS zzz_game_opponents;
 DROP TABLE IF EXISTS zzz_game_scores;
 DROP TABLE IF EXISTS zzz_games;
 DROP TABLE IF EXISTS zzz_teams;
 CREATE TABLE zzz_game_opponents (
 game_id INT,
 home_team_id INT,
 visitor_team_id INT
 ) USING DELTA;
 INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
 INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
 INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
 INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
 INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
 INSERT INTO zzz_game_opponents VALUES (6, 3, 2);
 -- Result:
 -- +---------+--------------+-----------------+
 -- | game_id | home_team_id | visitor_team_id |
 -- +=========+==============+=================+
 -- | 1 | 1 | 2 |
 -- +---------+--------------+-----------------+
 -- | 2 | 1 | 3 |
 -- +---------+--------------+-----------------+
 -- | 3 | 2 | 1 |
 -- +---------+--------------+-----------------+
 -- | 4 | 2 | 3 |
 -- +---------+--------------+-----------------+
 -- | 5 | 3 | 1 |
 -- +---------+--------------+-----------------+
 -- | 6 | 3 | 2 |
 -- +---------+--------------+-----------------+
 CREATE TABLE zzz_game_scores (
 game_id INT,
 home_team_score INT,
 visitor_team_score INT
 ) USING DELTA;
 INSERT INTO zzz_game_scores VALUES (1, 4, 2);
 INSERT INTO zzz_game_scores VALUES (2, 0, 1);
 INSERT INTO zzz_game_scores VALUES (3, 1, 2);
 INSERT INTO zzz_game_scores VALUES (4, 3, 2);
 INSERT INTO zzz_game_scores VALUES (5, 3, 0);
 INSERT INTO zzz_game_scores VALUES (6, 3, 1);
 -- Result:
 -- +---------+-----------------+--------------------+
 -- | game_id | home_team_score | visitor_team_score |
 -- +=========+=================+====================+
 -- | 1 | 4 | 2 |
 -- +---------+-----------------+--------------------+
 -- | 2 | 0 | 1 |
 -- +---------+-----------------+--------------------+
 -- | 3 | 1 | 2 |
 -- +---------+-----------------+--------------------+
 -- | 4 | 3 | 2 |
 -- +---------+-----------------+--------------------+
 -- | 5 | 3 | 0 |
 -- +---------+-----------------+--------------------+
 -- | 6 | 3 | 1 |
 -- +---------+-----------------+--------------------+
 CREATE TABLE zzz_games (
 game_id INT,
 game_date DATE
 ) USING DELTA;
 INSERT INTO zzz_games VALUES (1, '2020-12-12');
 INSERT INTO zzz_games VALUES (2, '2021-01-09');
 INSERT INTO zzz_games VALUES (3, '2020-12-19');
 INSERT INTO zzz_games VALUES (4, '2021-01-16');
 INSERT INTO zzz_games VALUES (5, '2021-01-23');
 INSERT INTO zzz_games VALUES (6, '2021-02-06');
 -- Result:
 -- +---------+------------+
 -- | game_id | game_date |
 -- +=========+============+
 -- | 1 | 2020-12-12 |
 -- +---------+------------+
 -- | 2 | 2021-01-09 |
 -- +---------+------------+
 -- | 3 | 2020-12-19 |
 -- +---------+------------+
 -- | 4 | 2021-01-16 |
 -- +---------+------------+
 -- | 5 | 2021-01-23 |
 -- +---------+------------+
 -- | 6 | 2021-02-06 |
 -- +---------+------------+
 CREATE TABLE zzz_teams (
 team_id INT,
 team_city VARCHAR(15)
 ) USING DELTA;
 INSERT INTO zzz_teams VALUES (1, "San Francisco");
 INSERT INTO zzz_teams VALUES (2, "Seattle");
 INSERT INTO zzz_teams VALUES (3, "Amsterdam");
 -- Result:
 -- +---------+---------------+
 -- | team_id | team_city |
 -- +=========+===============+
 -- | 1 | San Francisco |
 -- +---------+---------------+
 -- | 2 | Seattle |
 -- +---------+---------------+
 -- | 3 | Amsterdam |
 -- +---------+---------------+
- 
In the project's modelsdirectory, create a file namedzzz_game_details.sqlwith the following SQL statement. This statement creates a table that provides the details of each game, such as team names and scores. Theconfigblock instructs dbt to create a table in the database based on this statement.SQL-- Create a table that provides full details for each game, including
 -- the game ID, the home and visiting teams' city names and scores,
 -- the game winner's city name, and the game date.{{ config(
 materialized='table',
 file_format='delta'
 ) }}SQL-- Step 4 of 4: Replace the visitor team IDs with their city names.
 select
 game_id,
 home,
 t.team_city as visitor,
 home_score,
 visitor_score,
 -- Step 3 of 4: Display the city name for each game's winner.
 case
 when
 home_score > visitor_score
 then
 home
 when
 visitor_score > home_score
 then
 t.team_city
 end as winner,
 game_date as date
 from (
 -- Step 2 of 4: Replace the home team IDs with their actual city names.
 select
 game_id,
 t.team_city as home,
 home_score,
 visitor_team_id,
 visitor_score,
 game_date
 from (
 -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
 select
 g.game_id,
 go.home_team_id,
 gs.home_team_score as home_score,
 go.visitor_team_id,
 gs.visitor_team_score as visitor_score,
 g.game_date
 from
 zzz_games as g,
 zzz_game_opponents as go,
 zzz_game_scores as gs
 where
 g.game_id = go.game_id and
 g.game_id = gs.game_id
 ) as all_ids,
 zzz_teams as t
 where
 all_ids.home_team_id = t.team_id
 ) as visitor_ids,
 zzz_teams as t
 where
 visitor_ids.visitor_team_id = t.team_id
 order by game_date desc
- 
In the project's modelsdirectory, create a file namedzzz_win_loss_records.sqlwith the following SQL statement. This statement creates a view that lists team win-loss records for the season.SQL-- Create a view that summarizes the season's win and loss records by team.
 -- Step 2 of 2: Calculate the number of wins and losses for each team.
 select
 winner as team,
 count(winner) as wins,
 -- Each team played in 4 games.
 (4 - count(winner)) as losses
 from (
 -- Step 1 of 2: Determine the winner and loser for each game.
 select
 game_id,
 winner,
 case
 when
 home = winner
 then
 visitor
 else
 home
 end as loser
 from {{ ref('zzz_game_details') }}
 )
 group by winner
 order by wins desc
- 
With the virtual environment activated, run the dbt runcommand with the paths to the two preceding files. In thedefaultdatabase (as specified in theprofiles.ymlfile), dbt creates one table namedzzz_game_detailsand one view namedzzz_win_loss_records. dbt gets these view and table names from their related.sqlfile names.Bashdbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sqlConsole...
 ... | 1 of 2 START table model default.zzz_game_details.................... [RUN]
 ... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]
 ... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]
 ... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]
 ... |
 ... | Finished running 1 table model, 1 view model ...
 Completed successfully
 Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
- 
Run the following SQL code to list information about the new view and to select all rows from the table and view. If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query. SQLSHOW VIEWS FROM default LIKE 'zzz_win_loss_records';Console+-----------+----------------------+-------------+
 | namespace | viewName | isTemporary |
 +===========+======================+=============+
 | default | zzz_win_loss_records | false |
 +-----------+----------------------+-------------+SQLSELECT * FROM zzz_game_details;Console+---------+---------------+---------------+------------+---------------+---------------+------------+
 | game_id | home | visitor | home_score | visitor_score | winner | date |
 +=========+===============+===============+============+===============+===============+============+
 | 1 | San Francisco | Seattle | 4 | 2 | San Francisco | 2020-12-12 |
 +---------+---------------+---------------+------------+---------------+---------------+------------+
 | 2 | San Francisco | Amsterdam | 0 | 1 | Amsterdam | 2021-01-09 |
 +---------+---------------+---------------+------------+---------------+---------------+------------+
 | 3 | Seattle | San Francisco | 1 | 2 | San Francisco | 2020-12-19 |
 +---------+---------------+---------------+------------+---------------+---------------+------------+
 | 4 | Seattle | Amsterdam | 3 | 2 | Seattle | 2021-01-16 |
 +---------+---------------+---------------+------------+---------------+---------------+------------+
 | 5 | Amsterdam | San Francisco | 3 | 0 | Amsterdam | 2021-01-23 |
 +---------+---------------+---------------+------------+---------------+---------------+------------+
 | 6 | Amsterdam | Seattle | 3 | 1 | Amsterdam | 2021-02-06 |
 +---------+---------------+---------------+------------+---------------+---------------+------------+SQLSELECT * FROM zzz_win_loss_records;Console+---------------+------+--------+
 | team | wins | losses |
 +===============+======+========+
 | Amsterdam | 3 | 1 |
 +---------------+------+--------+
 | San Francisco | 2 | 2 |
 +---------------+------+--------+
 | Seattle | 1 | 3 |
 +---------------+------+--------+
Step 3: Create and run tests
In this step, you create tests, which are assertions you make about your models. When you run these tests, dbt tells you if each test in your project passes or fails.
There are two type of tests. Schema tests, applied in YAML, return the number of records that do not pass an assertion. When this number is zero, all records pass, therefore the tests pass. Data tests are specific queries that must return zero records to pass.
- 
In the project's modelsdirectory, create a file namedschema.ymlwith the following content. This file includes schema tests that determine whether the specified columns have unique values, are not null, have only the specified values, or a combination.YAMLversion: 2
 models:
 - name: zzz_game_details
 columns:
 - name: game_id
 tests:
 - unique
 - not_null
 - name: home
 tests:
 - not_null
 - accepted_values:
 values: ['Amsterdam', 'San Francisco', 'Seattle']
 - name: visitor
 tests:
 - not_null
 - accepted_values:
 values: ['Amsterdam', 'San Francisco', 'Seattle']
 - name: home_score
 tests:
 - not_null
 - name: visitor_score
 tests:
 - not_null
 - name: winner
 tests:
 - not_null
 - accepted_values:
 values: ['Amsterdam', 'San Francisco', 'Seattle']
 - name: date
 tests:
 - not_null
 - name: zzz_win_loss_records
 columns:
 - name: team
 tests:
 - unique
 - not_null
 - relationships:
 to: ref('zzz_game_details')
 field: home
 - name: wins
 tests:
 - not_null
 - name: losses
 tests:
 - not_null
- 
In the project's testsdirectory, create a file namedzzz_game_details_check_dates.sqlwith the following SQL statement. This file includes a data test to determine whether any games happened outside of the regular season.SQL-- This season's games happened between 2020-12-12 and 2021-02-06.
 -- For this test to pass, this query must return no results.
 select date
 from {{ ref('zzz_game_details') }}
 where date < '2020-12-12'
 or date > '2021-02-06'
- 
In the project's testsdirectory, create a file namedzzz_game_details_check_scores.sqlwith the following SQL statement. This file includes a data test to determine whether any scores were negative or any games were tied.SQL-- This sport allows no negative scores or tie games.
 -- For this test to pass, this query must return no results.
 select home_score, visitor_score
 from {{ ref('zzz_game_details') }}
 where home_score < 0
 or visitor_score < 0
 or home_score = visitor_score
- 
In the project's testsdirectory, create a file namedzzz_win_loss_records_check_records.sqlwith the following SQL statement. This file includes a data test to determine whether any teams had negative win or loss records, had more win or loss records than games played, or played more games than were allowed.SQL-- Each team participated in 4 games this season.
 -- For this test to pass, this query must return no results.
 select wins, losses
 from {{ ref('zzz_win_loss_records') }}
 where wins < 0 or wins > 4
 or losses < 0 or losses > 4
 or (wins + losses) > 4
- 
With the virtual environment activated, run the dbt testcommand.Bashdbt test --models zzz_game_details zzz_win_loss_recordsConsole...
 ... | 1 of 19 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
 ... | 1 of 19 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]
 ...
 ... |
 ... | Finished running 19 tests ...
 Completed successfully
 Done. PASS=19 WARN=0 ERROR=0 SKIP=0 TOTAL=19
Step 4: Clean up
You can delete the tables and views you created for this example by running the following SQL code.
If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query.
DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;
DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;
Troubleshooting
For information about common issues when using dbt Core with Databricks and how to resolve them, see Getting help on the dbt Labs website.
Next steps
Run dbt Core projects as Databricks job tasks. See Use dbt transformations in Lakeflow Jobs.
Additional resources
Explore the following resources on the dbt Labs website: