Get started with Databricks as a data engineer

The goal of a data engineer is to take data in its most raw form, enrich it, and make it easily available to other authorized users, typically data scientists and data analysts. This quickstart walks you through ingesting data, transforming it, and writing it to a table for easy consumption.

Before you begin

Before you can run through this quickstart, you must have:

Data Science & Engineering UI

Landing page

From the sidebar at the left and the Common Tasks list on the landing page, you access fundamental Databricks Data Science & Engineering entities: the Workspace, clusters, tables, notebooks, jobs, and libraries. The Workspace is the special root folder that stores your Databricks assets, such as notebooks and libraries, and the data that you import.

Use the sidebar

You can access all of your Databricks assets using the sidebar. The sidebar’s contents depend on the selected persona: Data Science & Engineering, Machine Learning, or SQL.

  • By default, the sidebar appears in a collapsed state and only the icons are visible. Move your cursor over the sidebar to expand to the full view.

  • To change the persona, click the icon below the Databricks logo and select a persona.

  • To pin a persona so that it appears the next time you log in, click the pin icon next to the persona. Click it again to remove the pin.

  • Use the menu at the bottom of the sidebar to set the sidebar mode to Auto (default behavior), Expand, or Collapse.

Get help

To get help, click Help in the lower left corner.


Step 1: Create a cluster

To do exploratory data analysis and data engineering, first create a cluster of compute resources to execute commands against.

  1. Log into Databricks and make sure you’re in the Data Science & Engineering workspace.

    See Data Science & Engineering UI.

  2. In the sidebar, click Compute.

  3. On the Compute page, click Create Cluster.

  4. On the Create Cluster page, specify the cluster name Quickstart, accept the remaining defaults, and click Create Cluster.
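If you prefer automation over the UI, clusters can also be created programmatically against the Databricks Clusters API (POST /api/2.0/clusters/create). The payload below is an illustrative sketch only; the runtime version and node type are placeholders you would replace with values available in your workspace.

```python
import json

# Illustrative payload for POST /api/2.0/clusters/create.
# spark_version and node_type_id are placeholders; list the values
# available in your workspace before using them.
payload = {
    "cluster_name": "Quickstart",
    "spark_version": "<runtime-version>",  # e.g. a current LTS runtime
    "node_type_id": "<node-type>",         # e.g. a small general-purpose node
    "num_workers": 1,
}

print(json.dumps(payload, indent=2))
```

Sending this payload (with real values) to the Clusters API endpoint of your workspace creates the same cluster the UI steps above do.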

Step 2: Ingest data

The easiest way to ingest your data into Databricks is to use the Create Table Wizard. In the sidebar, click Data and then click the Create Table button.


On the Create New Table dialog, drag and drop a CSV file from your computer into the Files section. If you need an example file to test, download the diamonds dataset to your local computer and drag it to upload.

  1. Click the Create Table with UI button.

  2. Select the Quickstart cluster you created in Step 1.

  3. Click the Preview Table button.

  4. Scroll down to see the Specify Table Attributes section and preview the data.

  5. Select the First row is header option.

  6. Select the Infer Schema option.

  7. Click Create Table.

You have successfully created a Delta Lake table that can be queried.
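The two wizard options you selected map to common CSV-parsing behavior: the first row supplies the column names, and each column's type is inferred from its values. The plain-Python sketch below illustrates that idea on a tiny inline sample; it is not the wizard's actual implementation.

```python
import csv
import io

# A tiny stand-in for an uploaded CSV file.
sample = io.StringIO("carat,cut,price\n0.23,Ideal,326\n0.21,Premium,327\n")

rows = list(csv.reader(sample))
header, data = rows[0], rows[1:]          # "First row is header"

def infer(value):
    """Naive type inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

records = [dict(zip(header, map(infer, row))) for row in data]  # "Infer Schema"
print(records[0])   # {'carat': 0.23, 'cut': 'Ideal', 'price': 326}
```

The wizard performs the equivalent work at scale and writes the result out as a Delta Lake table.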

Additional data ingestion options

Alternatively, you can click the Create Table in Notebook button to inspect and modify code in a notebook to create a table. You can use this technique to generate code for ingesting data from other data sources such as Redshift, Kinesis, or JDBC by clicking the Other Data Sources selector.

If there are other data sources to ingest data from, such as Salesforce, you can use a Databricks partner solution by clicking Partner Connect in the sidebar. When you select a partner from Partner Connect, you can connect the partner’s application to Databricks and even start a free trial if you are not already a customer of the partner. See the Databricks Partner Connect guide.

Step 3: Query data

A notebook is a collection of cells that run computations on a cluster. To create a notebook in the workspace:

  1. In the sidebar, click Workspace.

  2. In the Workspace folder, select Create > Notebook.

  3. On the Create Notebook dialog, enter a name and select Python in the Default Language drop-down.

  4. Click Create. The notebook opens with an empty cell at the top.

  5. Enter the following code in the first cell and run it by pressing SHIFT+ENTER.

     df = table("diamonds_csv")
     display(df)

    The notebook displays the contents of the diamonds_csv table.

  6. Create another cell, this time using the %sql magic command to enter a SQL query:

     %sql
     select * from diamonds_csv

    You can use the %sql, %r, %python, or %scala magic commands at the beginning of a cell to override the notebook’s default language.

  7. Press SHIFT+ENTER to run the command.

Step 4: Visualize data

Display a chart of the average diamond price by color.

  1. Click the bar chart icon.

  2. Click Plot Options.

    • Drag color into the Keys box.

    • Drag price into the Values box.

    • In the Aggregation drop-down, select AVG.

  3. Click Apply to display the bar chart.

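The chart configuration above (Keys: color, Values: price, Aggregation: AVG) corresponds to a GROUP BY average. A small plain-Python sketch of the same computation, on hypothetical (color, price) rows standing in for the diamonds table:

```python
from collections import defaultdict

# Hypothetical (color, price) rows standing in for the diamonds table.
rows = [("E", 326), ("E", 554), ("I", 334), ("J", 335), ("J", 2757)]

totals = defaultdict(lambda: [0, 0])      # color -> [running sum, count]
for color, price in rows:
    totals[color][0] += price
    totals[color][1] += 1

avg_price = {color: s / n for color, (s, n) in totals.items()}
print(avg_price)   # {'E': 440.0, 'I': 334.0, 'J': 1546.0}
```

The plot options produce the same aggregation server-side, so you never have to write this loop yourself.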

Step 5: Transform data

The best way to create trusted and scalable data pipelines is to use Delta Live Tables.

To learn how to build an effective pipeline and run it end to end, follow the steps in the Delta Live Tables quickstart.
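To give a flavor of what such a pipeline looks like, here is a minimal sketch of a Delta Live Tables dataset definition. This code runs only inside a DLT pipeline, where the dlt module and the spark session are provided by the runtime; the table names are illustrative.

```python
# Sketch of a Delta Live Tables dataset definition. Runs only inside a
# DLT pipeline, where `dlt` and `spark` are provided by the runtime.
import dlt

@dlt.table(comment="Cleaned diamonds data")
def diamonds_cleaned():
    # Read the table created earlier and drop rows with missing values.
    return spark.table("diamonds_csv").dropna()
```

The Delta Live Tables quickstart walks through defining, validating, and scheduling a full pipeline of such tables.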

Step 6: Set up data governance

To control access to a table in Databricks:

  1. Use the persona switcher in the sidebar to switch to the Databricks SQL environment.

    Click the icon below the Databricks logo and select SQL.

  2. Click Data in the sidebar.

  3. In the drop-down list at the top right, select a SQL endpoint, such as Starter Endpoint.

  4. Filter for the diamonds_csv table you created in Step 2.

    Type dia in the text box next to the default database.

  5. On the Permissions tab, click the Grant button.

  6. Give All Users the SELECT and READ_METADATA privileges for the table.

  7. Click OK.

Now all users can query the table that you created.
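The same grants can also be issued as SQL statements, for example from the Databricks SQL query editor. The statement below assumes the table landed in the default database, as in this quickstart; `users` is the built-in group containing all workspace users.

```sql
-- Grant all workspace users read access to the quickstart table.
GRANT SELECT, READ_METADATA ON TABLE default.diamonds_csv TO `users`;
```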

Step 7: Schedule a job

You can schedule a job to run a data processing task in a Databricks cluster with scalable resources. Your job can consist of a single task or be a large, multi-task application with complex dependencies.

To learn how to create a job that orchestrates tasks to read and process a sample dataset, follow the steps in the Jobs quickstart.
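As with clusters, jobs can also be defined programmatically. The sketch below builds a single-task job payload for the Jobs API (POST /api/2.1/jobs/create); the notebook path and cluster ID are placeholders you would replace with your own.

```python
import json

# Illustrative single-task payload for POST /api/2.1/jobs/create.
# notebook_path and existing_cluster_id are placeholders.
job = {
    "name": "Quickstart job",
    "tasks": [
        {
            "task_key": "process_diamonds",
            "notebook_task": {"notebook_path": "/Users/<you>/quickstart"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",  # daily at 06:00
        "timezone_id": "UTC",
    },
}

print(json.dumps(job, indent=2))
```

A multi-task job adds more entries to the tasks list, each with a task_key and optional depends_on references, which is how the complex dependencies mentioned above are expressed.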