Exploratory data analysis on Databricks: Tools and techniques

This article describes tools and techniques for exploratory data analysis (EDA) on Databricks.

What is EDA and why is it useful?

Exploratory data analysis (EDA) includes methods for exploring data sets to summarize their main characteristics and identify any problems with the data. Using statistical methods and visualizations, you can learn enough about a data set to judge whether it is ready for analysis and to decide which data preparation techniques to apply. EDA can also influence which algorithms you choose for training ML models.
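As a rough illustration of the kind of summary EDA produces, the following sketch counts missing values and computes basic statistics for each column. The sample rows, the column names, and the `summarize` helper are hypothetical and exist only for this example. It uses only the Python standard library so that it runs anywhere; in a notebook you would more likely use a DataFrame library for the same purpose.

```python
import statistics

# Hypothetical sample rows; None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "city": "Austin"},
    {"age": 29, "income": None, "city": "Boston"},
    {"age": 41, "income": 61000.0, "city": "Austin"},
    {"age": None, "income": 58000.0, "city": None},
    {"age": 38, "income": 250000.0, "city": "Denver"},
]


def summarize(rows):
    """Return per-column missing and distinct counts, plus basic
    statistics for columns whose present values are all numeric."""
    summary = {}
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            info["min"] = min(present)
            info["max"] = max(present)
            info["mean"] = statistics.mean(present)
            info["median"] = statistics.median(present)
        summary[col] = info
    return summary


summary = summarize(rows)
for col, info in summary.items():
    print(col, info)
```

In this sample, the gap between the mean and the median of `income` hints at an outlier, and the missing counts show which columns need cleaning before analysis. Both are the kind of problem EDA is meant to surface.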

What are the EDA tools in Databricks?

Databricks has built-in analysis and visualization tools for working with data.

The Databricks Runtime and Databricks Runtime ML provide pre-built environments with popular data exploration libraries already installed. The release notes for each runtime version list the built-in libraries.
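Besides reading the release notes, you can check from a notebook which library versions are present in the current environment. The following is a minimal sketch using the standard library's `importlib.metadata`. The package names are illustrative assumptions, not a statement of what any particular runtime version ships, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata

# Illustrative package names; which ones are installed depends on the runtime.
packages = ["pandas", "numpy", "matplotlib", "seaborn"]


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


versions = installed_versions(packages)
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```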

In addition, the following articles show examples of visualization tools in Databricks:

With Databricks, you can combine SQL and Python to explore data. In a Databricks Python notebook, table results from a SQL language cell are automatically made available as a Python DataFrame. For details, see Explore SQL cell results in Python notebooks.
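As a sketch of that workflow, the snippet below assumes it runs in a Python cell immediately after a SQL cell, and that the SQL result is exposed as a PySpark DataFrame in the implicit variable `_sqldf`. Confirm the variable name and its behavior for your runtime in the linked article. The lookup through `globals()` is only there so that the snippet degrades gracefully outside a Databricks notebook.

```python
# Assumption: a preceding SQL cell has run and Databricks has assigned its
# table result to the implicit variable `_sqldf` (a PySpark DataFrame).
sql_result = globals().get("_sqldf")  # None when run outside Databricks

if sql_result is None:
    status = "no SQL cell result available"
else:
    # Standard PySpark DataFrame methods for a first look at the data.
    sql_result.printSchema()        # column names and types
    sql_result.describe().show()    # count, mean, stddev, min, max
    status = f"{sql_result.count()} rows"

print(status)
```

Because the implicit variable is replaced by each later SQL cell, assigning it to a named variable, as `sql_result` does here, keeps the result available for later exploration.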

EDA in Databricks SQL

Databricks SQL also has data visualization and exploration tools. Here are some helpful articles: