Exploratory data analysis on Databricks: Tools and techniques

This article describes tools and techniques for exploratory data analysis (EDA) on Databricks.

What is EDA and why is it useful?

Exploratory data analysis (EDA) includes methods for exploring data sets to summarize their main characteristics and identify problems with the data. Using statistical methods and visualizations, you can learn enough about a data set to judge whether it is ready for analysis and to decide which data preparation techniques to apply. EDA can also influence which algorithms you choose for training ML models.
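As a rough illustration of the statistical side of EDA, the following sketch uses pandas to summarize a small data set and flag common problems such as missing values and duplicate rows. The DataFrame, its column names, and its values are hypothetical stand-ins for a real table; this is one possible starting point, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame(
    {
        "age": [34, 29, None, 41, 29],
        "income": [52000.0, 48000.0, 61000.0, None, 48000.0],
        "segment": ["a", "b", "a", "c", "b"],
    }
)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = df.describe()

# Number of missing values in each column.
missing = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicates = int(df.duplicated().sum())

# Frequency of each category in a non-numeric column.
segment_counts = df["segment"].value_counts()

print(summary)
print(missing)
print("duplicate rows:", duplicates)
print(segment_counts)
```

Checks like these can suggest preparation steps, for example imputing or dropping rows with missing values, or removing duplicates before training a model.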

What are the EDA tools in Databricks?

Databricks has built-in analysis and visualization tools for working with data.

The Databricks Runtime and Databricks Runtime ML provide pre-built environments with popular data exploration libraries already installed. The release notes list the libraries built into each runtime version.
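To check from a notebook which libraries the attached environment actually provides, a minimal sketch using the Python standard library is shown below. The helper name `installed_versions` and the package names in `candidates` are assumptions chosen for illustration; the release notes remain the authoritative list.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical list of libraries to look for; adjust to your needs.
candidates = ["pandas", "numpy", "matplotlib", "seaborn"]

for name, ver in installed_versions(candidates).items():
    print(f"{name}: {ver or 'not installed'}")
```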

In addition, the following articles show examples of visualization tools in Databricks:

EDA in Databricks SQL

Databricks SQL also has data visualization and exploration tools. Here are some helpful articles: