The goal of a data engineer is to take data in its most raw form, enrich it, and make it easily available to other authorized users, typically data scientists and data analysts. This quickstart walks you through ingesting data, transforming it, and writing it to a table for easy consumption.
Before you can run through this quickstart, you must have:
Access to a Databricks workspace. For more information, see Sign up for a free trial and Set up your Databricks account and create a workspace.
Permission to create a cluster in that Databricks workspace.
From the sidebar at the left and the Common Tasks list on the landing page, you access fundamental Databricks Data Science & Engineering entities: the Workspace, clusters, tables, notebooks, jobs, and libraries. The Workspace is the special root folder that stores your Databricks assets, such as notebooks and libraries, and the data that you import.
In order to do exploratory data analysis and data engineering, you must first create a cluster of computation resources to execute commands against.
Log into Databricks and make sure you’re in the Data Science & Engineering workspace.
In the sidebar, click Compute.
On the Compute page, click Create Cluster.
On the Create Cluster page, specify the cluster name Quickstart, accept the remaining defaults, and click Create Cluster.
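If you prefer to script this step instead of using the UI, you can create an equivalent cluster with the Clusters REST API. The following is only a rough sketch: the workspace URL, token environment variable, runtime version, and node type are placeholders you must replace with values valid in your own workspace.

# Sketch: create a cluster named "Quickstart" via the Clusters API (2.0).
# The host URL, DATABRICKS_TOKEN variable, spark_version, and node_type_id
# below are assumptions; substitute values from your workspace.
import os
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = os.environ["DATABRICKS_TOKEN"]

payload = {
    "cluster_name": "Quickstart",
    "spark_version": "11.3.x-scala2.12",  # assumed runtime; pick one listed in your workspace
    "node_type_id": "i3.xlarge",          # assumed node type; use one available in your cloud
    "num_workers": 1,
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])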
The easiest way to ingest your data into Databricks is to use the Create Table Wizard. In the sidebar, click Data and then click the Create Table button.
On the Create New Table dialog, drag and drop a CSV file from your computer into the Files section. If you need an example file to test, download the diamonds dataset to your local computer and drag it to upload.
Click the Create Table with UI button.
Select the Quickstart cluster you created in step 2.
Click the Preview Table button.
Scroll down to see the Specify Table Attributes section and preview the data.
Select the First row is header option.
Select the Infer Schema option.
Click Create Table.
You have successfully created a Delta Lake table that can be queried.
Alternatively, you can click the Create Table in Notebook button to inspect and modify code in a notebook to create a table. You can use this technique to generate code for ingesting data from other data sources such as Redshift, Kinesis, or JDBC by clicking the Other Data Sources selector.
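For reference, the code generated for a CSV upload looks roughly like the sketch below. The exact generated code may differ, and the upload path /FileStore/tables/diamonds.csv is an assumption; check the path that the upload UI reports for your file.

# Sketch: read an uploaded CSV and register it as a Delta table by name.
# The file path is an assumption; use the path shown by the upload UI.
file_path = "/FileStore/tables/diamonds.csv"

df = (spark.read.format("csv")
      .option("header", "true")       # first row is the header
      .option("inferSchema", "true")  # infer column types
      .load(file_path))

# Save as a managed Delta table so it can be queried as diamonds_csv.
df.write.format("delta").mode("overwrite").saveAsTable("diamonds_csv")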
If there are other data sources to ingest data from, such as Salesforce, you can leverage Databricks partner integrations by clicking Partner Connect in the sidebar. When you select a partner from Partner Connect, you can connect the partner's application to Databricks and even start a free trial if you are not already a customer of that partner. See the Databricks Partner Connect guide.
A notebook is a collection of cells that run computations on a cluster. To create a notebook in the workspace:
In the sidebar, click Workspace.
In the Workspace folder, select Create > Notebook.
On the Create Notebook dialog, enter a name and select Python in the Default Language drop-down.
Click Create. The notebook opens with an empty cell at the top.
Enter the following code in the first cell and run it by pressing SHIFT+ENTER.
df = spark.table("diamonds_csv")
display(df)
The notebook displays a table of the diamonds data.
Create another cell, this time using the %sql magic command to enter a SQL query:
%sql select * from diamonds_csv
You can use the %sql, %r, %python, or %scala magic commands at the beginning of a cell to override the notebook’s default language.
Press SHIFT+ENTER to run the command.
Display a chart of the average diamond price by color.
Click the Bar chart icon.
Click Plot Options.
Drag color into the Keys box.
Drag price into the Values box.
In the Aggregation drop-down, select AVG.
Click Apply to display the bar chart.
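The same aggregation that drives the chart can also be computed directly in a notebook cell. The following is a small illustrative example, not part of the wizard output:

# Compute average diamond price by color, the same aggregation the chart plots.
from pyspark.sql.functions import avg

price_by_color = (spark.table("diamonds_csv")
                  .groupBy("color")
                  .agg(avg("price").alias("avg_price"))
                  .orderBy("color"))

display(price_by_color)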
The best way to create trusted and scalable data pipelines is to use Delta Live Tables.
To learn how to build an effective pipeline and run it end to end, follow the steps in the Delta Live Tables quickstart.
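As a preview of what that quickstart covers, a Delta Live Tables pipeline is defined as ordinary notebook code using dlt decorators. The sketch below is a minimal illustration that assumes the diamonds_csv table created earlier as its source; it is not the pipeline built in the quickstart.

# Minimal Delta Live Tables sketch: a raw table and a cleaned table derived from it.
# Assumes the diamonds_csv table created earlier exists in the default database.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw diamonds data, read from the table created earlier.")
def diamonds_raw():
    return spark.table("diamonds_csv")

@dlt.table(comment="Diamonds with a positive price, ready for analysis.")
def diamonds_clean():
    return dlt.read("diamonds_raw").filter(col("price") > 0)

Note that Delta Live Tables code runs as part of a pipeline, not as an interactive notebook cell.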
To control access to a table in Databricks:
Use the persona switcher in the sidebar to switch to the Databricks SQL environment.
Click the icon below the Databricks logo and select SQL.
Click Data in the sidebar.
In the drop-down list at the top right, select a SQL endpoint, such as Starter Endpoint.
Filter for the diamonds_csv table you created in Step 2 by typing dia in the text box following the default database, and then select the table.
On the Permissions tab, click the Grant button.
Grant All Users the ability to SELECT and READ_METADATA for the table.
Now all users can query the table that you created.
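You can also issue the same grant with SQL from a notebook. This is a sketch assuming the table lives in the default database and that table access control is enabled on the cluster or SQL endpoint:

# Sketch: grant all workspace users SELECT and READ_METADATA on the table.
# Assumes the table is in the default database and table ACLs are enabled.
spark.sql("GRANT SELECT, READ_METADATA ON TABLE default.diamonds_csv TO `users`")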
You can schedule a job to run a data processing task in a Databricks cluster with scalable resources. Your job can consist of a single task or be a large, multi-task application with complex dependencies.
To learn how to create a job that orchestrates tasks to read and process a sample dataset, follow the steps in the Jobs quickstart.
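If you would rather define a job programmatically than through the UI, the Jobs REST API accepts a similar definition. The sketch below is illustrative only: the workspace URL, token variable, notebook path, and cluster ID are all placeholders you must replace.

# Sketch: create a single-task job that runs a notebook, via the Jobs API (2.1).
# Host, token, notebook path, and cluster ID are placeholders.
import os
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "Quickstart job",
    "tasks": [
        {
            "task_key": "process_diamonds",
            "notebook_task": {"notebook_path": "/Users/<you>/<notebook-name>"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json()["job_id"])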
Learn about more tools that can help you perform data engineering tasks in Databricks: