This article introduces the set of fundamental concepts you need to understand in order to use Databricks effectively.
The workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.
This section describes the objects contained in the Databricks workspace folders.
A notebook is a web-based interface to documents that contain runnable commands, visualizations, and narrative text.
A dashboard is an interface that provides organized access to visualizations.
A library is a package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries, and you can add your own.
An experiment is a collection of MLflow runs for training a machine learning model.
This section describes the interfaces that Databricks supports for accessing your assets: UI, API, and command-line (CLI).
The Databricks UI provides an easy-to-use graphical interface to workspace folders and their contained objects, data objects, and computational resources.
This section describes the objects that hold the data that you analyze and feed into machine learning algorithms.
The Databricks File System (DBFS) is a filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images) and other directories. DBFS is automatically populated with some datasets that you can use to learn Databricks.
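On cluster nodes, DBFS is typically also reachable through local file APIs under a `/dbfs` mount point, so one file can be addressed either as `dbfs:/...` or as `/dbfs/...`. The following is a minimal sketch of that mapping, assuming the `/dbfs` mount is available; the helper name `to_local_path` is hypothetical and the example path is illustrative.

```python
from pathlib import PurePosixPath

DBFS_SCHEME = "dbfs:"
LOCAL_MOUNT = PurePosixPath("/dbfs")  # assumed local mount point on cluster nodes


def to_local_path(dbfs_uri: str) -> str:
    """Map a dbfs:/ URI to the path that local file APIs would use."""
    if not dbfs_uri.startswith(DBFS_SCHEME + "/"):
        raise ValueError(f"not a DBFS URI: {dbfs_uri!r}")
    # Drop the scheme and leading slashes, then re-root under the mount point.
    relative = dbfs_uri[len(DBFS_SCHEME):].lstrip("/")
    return str(LOCAL_MOUNT / relative)


# Illustrative path only; any dbfs:/ URI is handled the same way.
print(to_local_path("dbfs:/databricks-datasets/README.md"))
```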
A database is a collection of information that is organized so that it can be easily accessed, managed, and updated.
A table is a representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs.
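The two query styles can be sketched as follows. This assumes a `SparkSession` is passed in as `spark` (Databricks notebooks predefine one under that name). The function names are hypothetical, and any table name you pass is your own.

```python
def preview_with_sql(spark, table_name, limit=10):
    """Query a table with Spark SQL; returns a DataFrame."""
    return spark.sql(f"SELECT * FROM {table_name} LIMIT {int(limit)}")


def preview_with_api(spark, table_name, limit=10):
    """Query the same table with the DataFrame API; returns a DataFrame."""
    return spark.table(table_name).limit(int(limit))
```

In a notebook, `preview_with_sql(spark, "events").show()` and `preview_with_api(spark, "events").show()` would each display up to ten rows of a hypothetical `events` table.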
The metastore is the component that stores all the structure information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. You also have the option to use an existing external Hive metastore.
This section describes concepts that you need to know to run computations in Databricks.
A cluster is a set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job.
- You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
- The Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
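As a rough illustration of the REST API route for all-purpose clusters, the sketch below builds the JSON body for a cluster-creation request (the Clusters API `clusters/create` endpoint). The request is only constructed, not sent. The function name is hypothetical, and the cluster name, runtime version string, and node type are placeholder values that you would replace with ones valid for your workspace.

```python
import json


def all_purpose_cluster_spec(name, spark_version, node_type_id,
                             num_workers, autotermination_minutes=60):
    """Build the request body for creating an all-purpose cluster."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,   # runtime version key
        "node_type_id": node_type_id,     # instance type, provider-specific
        "num_workers": num_workers,
        "autotermination_minutes": autotermination_minutes,
    }


# Placeholder values, for illustration only.
spec = all_purpose_cluster_spec(
    name="shared-analysis",
    spark_version="7.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=2,
)
print(json.dumps(spec, indent=2))
```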
A pool is a set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. If the pool does not have sufficient idle resources to accommodate the cluster’s request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
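The allocate, expand, and return cycle described above can be pictured with a toy model. This is a simplified sketch for intuition only, not how Databricks implements pools; the class `ToyPool` and its attributes are hypothetical, and real pools have further settings (such as capacity limits) that are ignored here.

```python
class ToyPool:
    """Toy model of an instance pool (illustrative, not the real implementation)."""

    def __init__(self, idle_instances):
        self.idle = idle_instances   # instances ready to use
        self.provisioned = 0         # instances requested from the provider

    def acquire(self, count):
        """Hand out `count` instances, expanding the pool if it runs short."""
        shortfall = max(0, count - self.idle)
        self.provisioned += shortfall              # pool expands via the provider
        self.idle = self.idle + shortfall - count  # all requested instances leave the pool
        return count

    def release(self, count):
        """Return instances to the pool when an attached cluster terminates."""
        self.idle += count


pool = ToyPool(idle_instances=3)
pool.acquire(5)   # one driver plus four workers: two more than are idle
pool.release(5)   # cluster terminated; instances become reusable
print(pool.idle, pool.provisioned)  # prints "5 2"
```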
The Databricks runtime is the set of core components that run on the clusters managed by Databricks. Databricks offers several types of runtimes:
- Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
- Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
- Databricks Runtime for Genomics is a version of Databricks Runtime optimized for working with genomic and biomedical data.
- Databricks Light is the Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.
A job is a non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis.
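For instance, a notebook job can be described by a JSON body along the lines of what the Jobs API `jobs/create` endpoint accepts. The sketch below only builds that body and sends nothing. The function name is hypothetical, and the job name, notebook path, cluster settings, and cron expression are illustrative placeholders.

```python
import json


def notebook_job_spec(name, notebook_path, new_cluster, cron=None, timezone="UTC"):
    """Build a job definition that runs a notebook on a new job cluster."""
    spec = {
        "name": name,
        "new_cluster": new_cluster,
        "notebook_task": {"notebook_path": notebook_path},
    }
    if cron is not None:
        # Scheduled job: Quartz cron syntax plus a timezone.
        spec["schedule"] = {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        }
    return spec


# Placeholder cluster settings, for illustration only.
cluster = {"spark_version": "7.3.x-scala2.12", "node_type_id": "i3.xlarge", "num_workers": 2}

on_demand = notebook_job_spec("nightly-report", "/Shared/report", cluster)
scheduled = notebook_job_spec("nightly-report", "/Shared/report", cluster,
                              cron="0 0 2 * * ?")  # every day at 02:00
print(json.dumps(scheduled, indent=2))
```

A definition without a `schedule` block would be triggered on demand rather than on a timer.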
Databricks identifies two types of workloads subject to different pricing schemes: data engineering (job) and data analytics (all-purpose).
- Data engineering: An automated workload runs on a job cluster, which the Databricks job scheduler creates for each workload.
- Data analytics: An interactive workload runs on an all-purpose cluster. Interactive workloads typically run commands within a Databricks notebook. However, running a job on an existing all-purpose cluster is also treated as an interactive workload.
An execution context is the state for a REPL environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.
This section describes concepts that you need to know to train machine learning models.
A model is a mathematical function that represents the relationship between a set of predictors and an outcome. Machine learning consists of training and inference steps. You train a model using an existing dataset, and then use that model to predict the outcomes (inference) of new data.
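The training and inference steps can be illustrated with the simplest possible model, a straight line fitted by ordinary least squares. This is a minimal sketch using only the standard library; the function names and the toy dataset are made up for illustration, and real workloads would normally use a machine learning library instead.

```python
def train(xs, ys):
    """Training: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Inference: apply the fitted function to a new predictor value."""
    slope, intercept = model
    return slope * x + intercept


# Toy dataset generated from y = 2x + 1.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(predict(model, 5.0))  # prints 11.0
```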
A run is a collection of parameters, metrics, and tags related to training a machine learning model.
An experiment is the primary unit of organization and access control for runs; all MLflow runs belong to an experiment. An experiment lets you visualize, search, and compare runs, as well as download run artifacts or metadata for analysis in other tools.
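The relationship between runs and experiments can be sketched with the MLflow tracking API. This assumes the `mlflow` package is installed and a tracking server is configured (as it is in a Databricks notebook). The function name `log_training_run` is hypothetical, as are any experiment names and values you pass to it.

```python
def log_training_run(experiment_name, params, metrics, tags):
    """Record one training run under the named experiment; returns the run ID."""
    import mlflow  # imported lazily; assumes the MLflow package is installed

    # Every run belongs to an experiment; select it (creating it if needed).
    mlflow.set_experiment(experiment_name)
    with mlflow.start_run() as run:
        mlflow.log_params(params)    # e.g. hyperparameters
        mlflow.log_metrics(metrics)  # e.g. evaluation scores
        mlflow.set_tags(tags)        # free-form labels for searching
        return run.info.run_id
```

A call such as `log_training_run("/Shared/demo", {"alpha": 0.5}, {"rmse": 0.8}, {"stage": "dev"})` would add one run to a hypothetical `/Shared/demo` experiment, where it can then be compared with other runs.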