This topic introduces the set of fundamental concepts you need to understand in order to use Databricks effectively.
A Workspace is the root folder for your Databricks deployment. The Workspace stores notebooks, libraries, dashboards, and experiments.
This section describes the objects that you work with in a Databricks Workspace.
- Databricks File System (DBFS)
- A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images), and other directories. DBFS is automatically populated with some datasets that you can use to learn Databricks.
- Notebook
- A web-based interface to documents that contain runnable commands, visualizations, and narrative text.
- Command
- Code that runs in a notebook. A command operates on files and tables. Commands can be run in sequence, referring to the output of one or more previously run commands.
- Visualization
- A graphical rendering of table data and the output of notebook commands.
- Dashboard
- An interface that provides organized access to visualizations.
- Library
- A package of code available to the execution context running on your cluster. Databricks runtimes include many libraries, and you can add your own.
- Folder
- A container for notebooks, dashboards, libraries, and experiments.
- Databricks archive
- A package of notebooks that can be exported from and imported into Databricks.
- Databricks Runtime
- The set of core components that run on the clusters managed by Databricks. Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
- Experiment
- A collection of MLflow runs for training a machine learning model.
This section describes the objects that hold the data on which you perform analytics and feed into machine learning algorithms.
- Database
- A collection of information that is organized so that it can be easily accessed, managed, and updated.
- Table
- A representation of structured data. You query tables with Spark SQL and Apache Spark APIs. A table typically consists of multiple partitions.
- Partition
- A portion of a table. By splitting a large table into smaller parts, queries that access only a fraction of the data can run faster because there is less data to scan.
- Metastore
- The component that stores all of the structure information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored.
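The speedup that partitioning provides can be pictured with a toy sketch in plain Python (an analogy for partition pruning, not Spark's implementation): rows are bucketed by a partition column, and a query that filters on that column scans only the matching bucket.

```python
from collections import defaultdict

# Toy illustration of partition pruning (not Spark's implementation):
# rows are stored in per-partition buckets keyed by a partition column.
rows = [
    {"country": "US", "amount": 10},
    {"country": "DE", "amount": 20},
    {"country": "US", "amount": 30},
    {"country": "FR", "amount": 40},
]

partitions = defaultdict(list)
for row in rows:
    partitions[row["country"]].append(row)  # partition by "country"

def query_total(country):
    # Only the matching partition is scanned; the others are skipped entirely.
    return sum(r["amount"] for r in partitions[country])

print(query_total("US"))  # scans 2 rows instead of all 4
```

The query touches two rows rather than four; with a real table holding billions of rows across many partitions, skipping non-matching partitions is where the savings come from.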
This section describes concepts that you need to know to run analytic and machine learning computations in Databricks.
- Cluster
- A set of computation resources and configurations on which you run notebooks and jobs.
- Execution context
- The state for a REPL environment for each supported programming language. The languages supported are Python, R, Scala, and SQL.
- Job
- A way of running a notebook or library either immediately or on a scheduled basis.
- Databricks runtime
- The set of core components that run on the clusters managed by Databricks. Databricks offers several types of runtimes:
- Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
- Databricks Runtime for Machine Learning is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
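One way to picture the execution context described above is as a persistent namespace that successive notebook commands read from and write to. This minimal Python sketch is an analogy, not the Databricks implementation:

```python
# Toy analogy for an execution context: a persistent namespace (the REPL
# state) shared by successive notebook commands.
context = {}

# "Command 1" defines a variable in the context.
exec("x = 21", context)

# "Command 2" refers to the output of the previous command, because
# both commands run against the same state.
exec("y = x * 2", context)

print(context["y"])  # 42
```

This is why a notebook command can use variables defined by earlier commands, and why detaching a notebook from a cluster (which discards the execution context) clears that state.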
This section describes concepts that you need to know to train machine learning models.
- Model
- A set of known dimensions that serves as the framework for training a machine to make predictions; the initial structure imposed on a function before training.
- Trained model
- The outcome of the training process. A mathematical mapping from input to output.
- Run
- A collection of parameters, metrics, and tags related to training a machine learning model.
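The relationship between runs and experiments can be sketched in plain Python (a simplified model for illustration, not the MLflow API): an experiment collects runs, and each run records the parameters, metrics, and tags of one training attempt.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    # Simplified stand-in for an MLflow run: what was tried and how it scored.
    params: dict = field(default_factory=dict)   # e.g. hyperparameters
    metrics: dict = field(default_factory=dict)  # e.g. accuracy, loss
    tags: dict = field(default_factory=dict)     # free-form annotations

# An experiment is a collection of runs for one modeling task.
experiment = [
    Run(params={"lr": 0.1},  metrics={"accuracy": 0.91}, tags={"model": "xgboost"}),
    Run(params={"lr": 0.01}, metrics={"accuracy": 0.94}, tags={"model": "xgboost"}),
]

# Comparing runs across an experiment is how you pick a trained model.
best = max(experiment, key=lambda r: r.metrics["accuracy"])
print(best.params)  # parameters of the highest-accuracy run
```

Grouping runs this way is what makes experiments useful: you can compare many parameter settings side by side and keep the run whose trained model performs best.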