Tutorial: Create, run, and test dbt models locally

This tutorial walks you through how to create, run, and test dbt models locally. You can also run dbt projects as Databricks job tasks. For more information, see Use dbt transformations in Lakeflow Jobs.

Before you begin

To follow this tutorial, you must first connect your Databricks workspace to dbt Core. For more information, see Connect to dbt Core.

Step 1: Create and run models

In this step, you use your favorite text editor to create models, which are select statements that create either a new view (the default) or a new table in a database, based on existing data in that same database. This procedure creates a model based on the sample diamonds table from the Sample datasets.

Use the following code to create this table.

SQL
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")

In the project's models directory, create a file named diamonds_four_cs.sql with the following SQL statement. This statement selects only the carat, cut, color, and clarity details for each diamond from the diamonds table. The config block instructs dbt to create a table in the database based on this statement.
```
{{ config(
  materialized='table',
  file_format='delta'
) }}
```
SQL
```
select carat, cut, color, clarity
from diamonds
```
tip
For additional config options such as using the Delta file format and the merge incremental strategy, see Databricks configurations in the dbt documentation.
In the project's models directory, create a second file named diamonds_list_colors.sql with the following SQL statement. This statement selects unique values from the colors column in the diamonds_four_cs table, sorting the results in alphabetical order first to last. Because there is no config block, this model instructs dbt to create a view in the database based on this statement.
SQL
```
select distinct color
from {{ ref('diamonds_four_cs') }}
sort by color asc
```
In the project's models directory, create a third file named diamonds_prices.sql with the following SQL statement. This statement averages diamond prices by color, sorting the results by average price from highest to lowest. This model instructs dbt to create a view in the database based on this statement.
SQL
```
select color, avg(price) as price
from diamonds
group by color
order by price desc
```

With the virtual environment activated, run the dbt run command with the paths to the three preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named diamonds_four_cs and two views named diamonds_list_colors and diamonds_prices. dbt gets these view and table names from their related .sql file names.

Bash
dbt run --model models/diamonds_four_cs.sql models/diamonds_list_colors.sql models/diamonds_prices.sql

Console
...
... | 1 of 3 START table model default.diamonds_four_cs.................... [RUN]
... | 1 of 3 OK created table model default.diamonds_four_cs............... [OK ...]
... | 2 of 3 START view model default.diamonds_list_colors................. [RUN]
... | 2 of 3 OK created view model default.diamonds_list_colors............ [OK ...]
... | 3 of 3 START view model default.diamonds_prices...................... [RUN]
... | 3 of 3 OK created view model default.diamonds_prices................. [OK ...]
... |
... | Finished running 1 table model, 2 view models ...

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

Run the following SQL code to list information about the new views and to select all rows from the table and views.

If you are connecting to a cluster, you can run this SQL code from a notebook that is connected to the cluster, specifying SQL as the default language for the notebook. If you are connecting to a SQL warehouse, you can run this SQL code from a query.

SQL
SHOW views IN default;

Console
+-----------+----------------------+-------------+
| namespace | viewName             | isTemporary |
+===========+======================+=============+
| default   | diamonds_list_colors | false       |
+-----------+----------------------+-------------+
| default   | diamonds_prices      | false       |
+-----------+----------------------+-------------+

SQL
SELECT * FROM diamonds_four_cs;

Console
+-------+---------+-------+---------+
| carat | cut     | color | clarity |
+=======+=========+=======+=========+
| 0.23  | Ideal   | E     | SI2     |
+-------+---------+-------+---------+
| 0.21  | Premium | E     | SI1     |
+-------+---------+-------+---------+
...

SQL
SELECT * FROM diamonds_list_colors;

Console
+-------+
| color |
+=======+
| D     |
+-------+
| E     |
+-------+
...

SQL
SELECT * FROM diamonds_prices;

Console
+-------+---------+
| color | price   |
+=======+=========+
| J     | 5323.82 |
+-------+---------+
| I     | 5091.87 |
+-------+---------+
...

Step 2: Create and run more complex models

In this step, you create more complex models for a set of related data tables. These data tables contain information about a fictional sports league of three teams playing a season of six games. This procedure creates the data tables, creates the models, and runs the models.

Run the following SQL code to create the necessary data tables.

The tables and views in this step start with zzz_ to help identify them as part of this example. You do not need to follow this pattern for your own tables and views.

SQL
DROP TABLE IF EXISTS zzz_game_opponents;
DROP TABLE IF EXISTS zzz_game_scores;
DROP TABLE IF EXISTS zzz_games;
DROP TABLE IF EXISTS zzz_teams;

CREATE TABLE zzz_game_opponents (
game_id INT,
home_team_id INT,
visitor_team_id INT
) USING DELTA;

INSERT INTO zzz_game_opponents VALUES (1, 1, 2);
INSERT INTO zzz_game_opponents VALUES (2, 1, 3);
INSERT INTO zzz_game_opponents VALUES (3, 2, 1);
INSERT INTO zzz_game_opponents VALUES (4, 2, 3);
INSERT INTO zzz_game_opponents VALUES (5, 3, 1);
INSERT INTO zzz_game_opponents VALUES (6, 3, 2);

-- Result:
-- +---------+--------------+-----------------+
-- | game_id | home_team_id | visitor_team_id |
-- +=========+==============+=================+
-- | 1       | 1            | 2               |
-- +---------+--------------+-----------------+
-- | 2       | 1            | 3               |
-- +---------+--------------+-----------------+
-- | 3       | 2            | 1               |
-- +---------+--------------+-----------------+
-- | 4       | 2            | 3               |
-- +---------+--------------+-----------------+
-- | 5       | 3            | 1               |
-- +---------+--------------+-----------------+
-- | 6       | 3            | 2               |
-- +---------+--------------+-----------------+

CREATE TABLE zzz_game_scores (
game_id INT,
home_team_score INT,
visitor_team_score INT
) USING DELTA;

INSERT INTO zzz_game_scores VALUES (1, 4, 2);
INSERT INTO zzz_game_scores VALUES (2, 0, 1);
INSERT INTO zzz_game_scores VALUES (3, 1, 2);
INSERT INTO zzz_game_scores VALUES (4, 3, 2);
INSERT INTO zzz_game_scores VALUES (5, 3, 0);
INSERT INTO zzz_game_scores VALUES (6, 3, 1);

-- Result:
-- +---------+-----------------+--------------------+
-- | game_id | home_team_score | visitor_team_score |
-- +=========+=================+====================+
-- | 1       | 4               | 2                  |
-- +---------+-----------------+--------------------+
-- | 2       | 0               | 1                  |
-- +---------+-----------------+--------------------+
-- | 3       | 1               | 2                  |
-- +---------+-----------------+--------------------+
-- | 4       | 3               | 2                  |
-- +---------+-----------------+--------------------+
-- | 5       | 3               | 0                  |
-- +---------+-----------------+--------------------+
-- | 6       | 3               | 1                  |
-- +---------+-----------------+--------------------+

CREATE TABLE zzz_games (
game_id INT,
game_date DATE
) USING DELTA;

INSERT INTO zzz_games VALUES (1, '2020-12-12');
INSERT INTO zzz_games VALUES (2, '2021-01-09');
INSERT INTO zzz_games VALUES (3, '2020-12-19');
INSERT INTO zzz_games VALUES (4, '2021-01-16');
INSERT INTO zzz_games VALUES (5, '2021-01-23');
INSERT INTO zzz_games VALUES (6, '2021-02-06');

-- Result:
-- +---------+------------+
-- | game_id | game_date  |
-- +=========+============+
-- | 1       | 2020-12-12 |
-- +---------+------------+
-- | 2       | 2021-01-09 |
-- +---------+------------+
-- | 3       | 2020-12-19 |
-- +---------+------------+
-- | 4       | 2021-01-16 |
-- +---------+------------+
-- | 5       | 2021-01-23 |
-- +---------+------------+
-- | 6       | 2021-02-06 |
-- +---------+------------+

CREATE TABLE zzz_teams (
team_id INT,
team_city VARCHAR(15)
) USING DELTA;

INSERT INTO zzz_teams VALUES (1, "San Francisco");
INSERT INTO zzz_teams VALUES (2, "Seattle");
INSERT INTO zzz_teams VALUES (3, "Amsterdam");

-- Result:
-- +---------+---------------+
-- | team_id | team_city     |
-- +=========+===============+
-- | 1       | San Francisco |
-- +---------+---------------+
-- | 2       | Seattle       |
-- +---------+---------------+
-- | 3       | Amsterdam     |
-- +---------+---------------+

In the project's models directory, create a file named zzz_game_details.sql with the following SQL statement. This statement creates a table that provides the details of each game, such as team names and scores. The config block instructs dbt to create a table in the database based on this statement.

SQL
-- Create a table that provides full details for each game, including
-- the game ID, the home and visiting teams' city names and scores,
-- the game winner's city name, and the game date.

{{ config(
  materialized='table',
  file_format='delta'
) }}

SQL
-- Step 4 of 4: Replace the visitor team IDs with their city names.
select
  game_id,
  home,
  t.team_city as visitor,
  home_score,
  visitor_score,
  -- Step 3 of 4: Display the city name for each game's winner.
  case
    when
      home_score > visitor_score
        then
          home
    when
      visitor_score > home_score
        then
          t.team_city
  end as winner,
  game_date as date
from (
  -- Step 2 of 4: Replace the home team IDs with their actual city names.
  select
    game_id,
    t.team_city as home,
    home_score,
    visitor_team_id,
    visitor_score,
    game_date
  from (
    -- Step 1 of 4: Combine data from various tables (for example, game and team IDs, scores, dates).
    select
      g.game_id,
      go.home_team_id,
      gs.home_team_score as home_score,
      go.visitor_team_id,
      gs.visitor_team_score as visitor_score,
      g.game_date
    from
      zzz_games as g,
      zzz_game_opponents as go,
      zzz_game_scores as gs
    where
      g.game_id = go.game_id and
      g.game_id = gs.game_id
  ) as all_ids,
    zzz_teams as t
  where
    all_ids.home_team_id = t.team_id
) as visitor_ids,
  zzz_teams as t
where
  visitor_ids.visitor_team_id = t.team_id
order by game_date desc

In the project's models directory, create a file named zzz_win_loss_records.sql with the following SQL statement. This statement creates a view that lists team win-loss records for the season.

SQL
-- Create a view that summarizes the season's win and loss records by team.

-- Step 2 of 2: Calculate the number of wins and losses for each team.
select
  winner as team,
  count(winner) as wins,
  -- Each team played in 4 games.
  (4 - count(winner)) as losses
from (
  -- Step 1 of 2: Determine the winner and loser for each game.
  select
    game_id,
    winner,
    case
      when
        home = winner
          then
            visitor
      else
        home
    end as loser
  from {{ ref('zzz_game_details') }}
)
group by winner
order by wins desc

With the virtual environment activated, run the dbt run command with the paths to the two preceding files. In the default database (as specified in the profiles.yml file), dbt creates one table named zzz_game_details and one view named zzz_win_loss_records. dbt gets these view and table names from their related .sql file names.

Bash
dbt run --model models/zzz_game_details.sql models/zzz_win_loss_records.sql

Console
...
... | 1 of 2 START table model default.zzz_game_details.................... [RUN]
... | 1 of 2 OK created table model default.zzz_game_details............... [OK ...]
... | 2 of 2 START view model default.zzz_win_loss_records................. [RUN]
... | 2 of 2 OK created view model default.zzz_win_loss_records............ [OK ...]
... |
... | Finished running 1 table model, 1 view model ...

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Run the following SQL code to list information about the new view and to select all rows from the table and view.

SQL
SHOW VIEWS FROM default LIKE 'zzz_win_loss_records';

Console
+-----------+----------------------+-------------+
| namespace | viewName             | isTemporary |
+===========+======================+=============+
| default   | zzz_win_loss_records | false       |
+-----------+----------------------+-------------+

SQL
SELECT * FROM zzz_game_details;

Console
+---------+---------------+---------------+------------+---------------+---------------+------------+
| game_id | home          | visitor       | home_score | visitor_score | winner        | date       |
+=========+===============+===============+============+===============+===============+============+
| 1       | San Francisco | Seattle       | 4          | 2             | San Francisco | 2020-12-12 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 2       | San Francisco | Amsterdam     | 0          | 1             | Amsterdam     | 2021-01-09 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 3       | Seattle       | San Francisco | 1          | 2             | San Francisco | 2020-12-19 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 4       | Seattle       | Amsterdam     | 3          | 2             | Seattle       | 2021-01-16 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 5       | Amsterdam     | San Francisco | 3          | 0             | Amsterdam     | 2021-01-23 |
+---------+---------------+---------------+------------+---------------+---------------+------------+
| 6       | Amsterdam     | Seattle       | 3          | 1             | Amsterdam     | 2021-02-06 |
+---------+---------------+---------------+------------+---------------+---------------+------------+

SQL
SELECT * FROM zzz_win_loss_records;

Console
+---------------+------+--------+
| team          | wins | losses |
+===============+======+========+
| Amsterdam     | 3    | 1      |
+---------------+------+--------+
| San Francisco | 2    | 2      |
+---------------+------+--------+
| Seattle       | 1    | 3      |
+---------------+------+--------+

Step 3: Create and run tests

In this step, you create tests, which are assertions you make about your models. When you run these tests, dbt tells you if each test in your project passes or fails.

There are two type of tests. Schema tests, applied in YAML, return the number of records that do not pass an assertion. When this number is zero, all records pass, therefore the tests pass. Data tests are specific queries that must return zero records to pass.

In the project's models directory, create a file named schema.yml with the following content. This file includes schema tests that determine whether the specified columns have unique values, are not null, have only the specified values, or a combination.

YAML
version: 2

models:
  - name: zzz_game_details
    columns:
      - name: game_id
        tests:
          - unique
          - not_null
      - name: home
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: visitor
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: home_score
        tests:
          - not_null
      - name: visitor_score
        tests:
          - not_null
      - name: winner
        tests:
          - not_null
          - accepted_values:
              values: ['Amsterdam', 'San Francisco', 'Seattle']
      - name: date
        tests:
          - not_null
  - name: zzz_win_loss_records
    columns:
      - name: team
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('zzz_game_details')
              field: home
      - name: wins
        tests:
          - not_null
      - name: losses
        tests:
          - not_null

In the project's tests directory, create a file named zzz_game_details_check_dates.sql with the following SQL statement. This file includes a data test to determine whether any games happened outside of the regular season.

SQL
-- This season's games happened between 2020-12-12 and 2021-02-06.
-- For this test to pass, this query must return no results.

select date
from {{ ref('zzz_game_details') }}
where date < '2020-12-12'
or date > '2021-02-06'

In the project's tests directory, create a file named zzz_game_details_check_scores.sql with the following SQL statement. This file includes a data test to determine whether any scores were negative or any games were tied.

SQL
-- This sport allows no negative scores or tie games.
-- For this test to pass, this query must return no results.

select home_score, visitor_score
from {{ ref('zzz_game_details') }}
where home_score < 0
or visitor_score < 0
or home_score = visitor_score

In the project's tests directory, create a file named zzz_win_loss_records_check_records.sql with the following SQL statement. This file includes a data test to determine whether any teams had negative win or loss records, had more win or loss records than games played, or played more games than were allowed.
SQL
```
-- Each team participated in 4 games this season.
-- For this test to pass, this query must return no results.

select wins, losses
from {{ ref('zzz_win_loss_records') }}
where wins < 0 or wins > 4
or losses < 0 or losses > 4
or (wins + losses) > 4
```

With the virtual environment activated, run the dbt test command.

Bash
dbt test --models zzz_game_details zzz_win_loss_records

Console
...
... | 1 of 19 START test accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [RUN]
... | 1 of 19 PASS accepted_values_zzz_game_details_home__Amsterdam__San_Francisco__Seattle [PASS ...]
...
... |
... | Finished running 19 tests ...

Completed successfully

Done. PASS=19 WARN=0 ERROR=0 SKIP=0 TOTAL=19

Step 4: Clean up

You can delete the tables and views you created for this example by running the following SQL code.

SQL
DROP TABLE zzz_game_opponents;
DROP TABLE zzz_game_scores;
DROP TABLE zzz_games;
DROP TABLE zzz_teams;
DROP TABLE zzz_game_details;
DROP VIEW zzz_win_loss_records;

DROP TABLE diamonds;
DROP TABLE diamonds_four_cs;
DROP VIEW diamonds_list_colors;
DROP VIEW diamonds_prices;

Troubleshooting

For information about common issues when using dbt Core with Databricks and how to resolve them, see Getting help on the dbt Labs website.

Next steps

Run dbt Core projects as Databricks job tasks. See Use dbt transformations in Lakeflow Jobs.

Additional resources

Explore the following resources on the dbt Labs website:

Before you begin​

Step 1: Create and run models​

Step 2: Create and run more complex models​

Step 3: Create and run tests​

Step 4: Clean up​

Troubleshooting​

Next steps​

Additional resources​