Databricks SQL visualizations that use X and Y axes are called charts. There are eight different types of charts. Because the types are similar, you can often switch seamlessly between them to find the one that best conveys your meaning. In the following animation, all eight types were built from the same SQL query result:
Your query should return at least two columns: one column of values for the X axis and one column of values for the Y axis. It can also return values for trace grouping, displaying error bars, and bubble sizes.
The charts in the preceding animation were all produced from the following tabular result:
Once your query returns the right columns, start by setting your X and Y axis values. The visualization preview updates automatically; you don’t need to save the visualization to see how a change affects its appearance.
Tabs in the visualization editor screen give you fine-grained control over the rest of the chart.
Use the X Axis and Y Axis tabs to modify the axes ranges and labels.
Use the Series tab to change your data aliases, adjust z-index behavior, and assign traces to the left or right Y axis. It also lets you combine different trace types on one chart, as in the following chart.
Use the Colors tab to change the appearance of the traces on your charts.
Use the Data Labels tab to configure what displays when you hover your mouse over a chart.
Use the Group by setting to generate multiple traces against the same X and Y axes. This setting groups records into distinct traces instead of drawing one line. Almost every time you see multiple line or bar colors in a chart, it’s because the query results included a grouping column.
As shown in the following example, the grouping column is used to sort (x,y) pairs together.
Group by is often easier than writing queries that return multiple Y columns for an X value. The following two data sets are identical.
Use the Group by column for melted data sets. Use multiple Y-columns for pivoted data sets.
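To make the melted versus pivoted distinction concrete, here is a minimal Python sketch that reshapes a melted result (one row per X value and group) into a pivoted one (one row per X value, one Y column per group). The sample rows and the `Free` group name are hypothetical, and the `pivot` helper is only an illustration, not something Databricks SQL provides.

```python
from collections import defaultdict

# Hypothetical melted (long) result: one row per (x, group, y).
melted = [
    ("2024-01-01", "Free", 10),
    ("2024-01-01", "Paid", 4),
    ("2024-01-02", "Free", 12),
    ("2024-01-02", "Paid", 5),
]

def pivot(rows):
    """Turn (x, group, y) rows into one row per x with a Y column per group."""
    groups = []                # group names in order of first appearance
    by_x = defaultdict(dict)   # x -> {group: y}
    for x, group, y in rows:
        if group not in groups:
            groups.append(group)
        by_x[x][group] = y
    header = ["x"] + groups
    # A group with no value for a given x becomes None.
    table = [[x] + [cells.get(g) for g in groups] for x, cells in by_x.items()]
    return header, table

header, pivoted = pivot(melted)
print(header)
for row in pivoted:
    print(row)
```

Both shapes carry the same information. The melted form maps onto the Group by setting, while the pivoted form maps onto multiple Y columns.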
Databricks SQL can “stack” your Y axis values on top of one another. The name is borrowed from stacked bar charts, but it can be useful with area charts as well. The following image shows the same data, unstacked on the left and stacked on the right.
Each Y axis value is displayed as the sum of itself and the Y values “beneath” it.
Stacking and grouping are related. You won’t stack data unless you have also grouped it.
You can use the Series tab to control the order in which traces are stacked. You can also control it by adding an ORDER BY clause to your query. The stack follows the order in which your group names first appear in your query result. Stacking is available only for line, bar, and area charts.
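The stacking arithmetic described above can be sketched in a few lines of Python. This is an illustration of the idea only, not how Databricks SQL implements it. The sample rows are hypothetical, and treating a missing (x, group) combination as zero is an assumption.

```python
def stack(rows):
    """rows: (x, group, y) tuples in query-result order.

    Returns (x, group, bottom, top) tuples, where each trace rests on the
    traces whose group names appeared earlier in the result.
    """
    order = []    # groups in order of first appearance: the stack order
    xs = []       # x values in order of first appearance
    values = {}   # (x, group) -> y
    for x, group, y in rows:
        if group not in order:
            order.append(group)
        if x not in xs:
            xs.append(x)
        values[(x, group)] = y

    stacked = []
    for x in xs:
        running = 0
        for group in order:
            y = values.get((x, group), 0)  # assumption: missing means 0
            stacked.append((x, group, running, running + y))
            running += y
    return stacked

rows = [(1, "Free", 10), (1, "Paid", 4), (2, "Free", 12), (2, "Paid", 5)]
for entry in stack(rows):
    print(entry)
```

Reordering the input rows so that `Paid` appears first would place `Paid` at the bottom of the stack, which is the effect an ORDER BY clause has on the chart.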
For certain chart types, Databricks SQL can draw error bars around your data points using values from your query result.
- Error bars are always symmetrical. The distance above and below a given (x,y) pair is always the same.
- Error bars are the same color as their target trace.
- Error bars are shown for all traces or no traces. You cannot configure them to appear on some traces and not others.
- The values in your errors column are charted on the same axis as their associated trace. This means your error values must be absolute. You cannot, for example, have errors expressed in percentages for Y values expressed in hundreds.
Errors are not aggregated when you stack records; an error bar will be shown for each trace. You can work around this by only providing non-zero error values for those records where the error should be displayed prominently. In the preceding example a flat error bar is shown at every trace point, but only the Paid trace error bars have any length.
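Because error values must be absolute and are drawn symmetrically, a relative error has to be converted into the units of the Y value before it is charted. The following Python sketch shows that arithmetic. The function names and the sample numbers are hypothetical, and in practice you would more likely do the conversion in the query itself.

```python
def percent_to_absolute(y, percent):
    """Convert an error given as a percentage of y into the units of y."""
    return abs(y) * percent / 100.0

def error_bounds(y, error):
    """Symmetrical error bar: the same distance below and above the point."""
    return (y - error, y + error)

y = 400
err = percent_to_absolute(y, 5)   # 5 percent of 400
low, high = error_bounds(y, err)
print(err, low, high)
```

Passing the raw percentage (5) as the error value instead would draw a bar only 5 units long on a Y value of 400, which is the mismatch the last item in the list above warns about.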
Each chart type is useful for certain kinds of presentation. You can mix and match multiple types on the same chart as needed:
Line: present change in one or more metrics over time.
Bar: present change in metrics over time or show proportionality, like a pie chart. You can combine bar charts using stacking.
Area: combine features of line and bar charts to show how the numeric values of one or more groups change over the progression of a second variable, typically time. They are often used to show sales funnel changes through time. You can combine area charts using stacking.
Pie: show proportionality between metrics. They are not meant for conveying time series data.
Scatter: excel at showing many groups of data points. A scatter chart is just like a line plot, but without the connecting lines. A scatter chart is more precise but less useful for time series data.
Scatter charts are necessary for visualizations where some groups appear just once. The line chart does not display singleton values because it can only show data where two or more points are present. One option is to force singletons into scatter type on the Series tab of the visualization editor while keeping other traces in line type.
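To find out in advance which groups would be invisible on a line chart, you can count the points per group in the query result. The sketch below does this in Python on hypothetical rows. The `singleton_groups` helper is illustrative only.

```python
from collections import Counter

def singleton_groups(rows):
    """rows: (x, group, y) tuples. Returns groups that have exactly one point."""
    counts = Counter(group for _, group, _ in rows)
    return [group for group, n in counts.items() if n == 1]

rows = [(1, "A", 3), (2, "A", 4), (1, "B", 7)]
print(singleton_groups(rows))
```

Any group this reports is a candidate for switching to the scatter type on the Series tab.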
Bubble: scatter charts where the size of each point marker reflects a relevant metric.
Heatmap: blend features of bar charts, stacking, and bubble charts. You can choose from several built-in color schemes. A heatmap cannot be grouped since the entire chart is one trace.
Box: automatically show the distribution of data points across grouped categories.
Databricks SQL can draw unexpected shapes if your query returns two or more rows with the same X axis value. This often happens in SQL if you unintentionally JOIN a table with a one-to-many relationship.
In this example a vertical line is drawn because there are two records for 1 January. You can resolve this by filtering out the doubled entries on the X axis or by revising the query to include the grouping field, as shown in the following image.
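A quick way to check for this problem is to count how often each X value occurs in the result, first on its own and then together with the grouping field. The following Python sketch uses hypothetical rows and a hypothetical `duplicated` helper.

```python
from collections import Counter

# Hypothetical result of a one-to-many JOIN: two records for 1 January.
rows = [
    ("2024-01-01", "A", 5),
    ("2024-01-01", "B", 8),
    ("2024-01-02", "A", 6),
]

def duplicated(keys):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]

dup_x = duplicated(x for x, _, _ in rows)                         # X alone
dup_x_group = duplicated((x, group) for x, group, _ in rows)      # X plus group
print(dup_x, dup_x_group)
```

Here the X value repeats when considered alone, but each (X, group) pair is unique, so adding the grouping field gives every trace at most one point per X value.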
Databricks SQL automatically determines the most common X axis scales: timestamp, linear, and logarithmic. If it can’t parse your X column into an ordered series, it falls back to treating each X value as a “category”. This can have mixed results:
If you see shapes you don’t expect, check on the X Axis tab whether your X axis has been sorted by toggling the Sort Values option. If Sort Values is disabled, Databricks SQL retains the ordering of the source query.
These two charts come from the same base data. The only difference is whether or not Databricks SQL sorted the X axis values.
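The effect is easy to reproduce outside the chart editor. The sketch below compares source order with sorted order for a hypothetical category column of month abbreviations. It assumes a plain lexicographic sort, which may not match exactly how Databricks SQL orders category values.

```python
# Hypothetical X values in the order the query returned them.
source_order = ["Jan", "Feb", "Mar", "Apr"]

# Assumption: categories are compared as plain strings.
sorted_order = sorted(source_order)

print(source_order)
print(sorted_order)
```

A line drawn through the sorted categories visits the months out of calendar order, which is the kind of unexpected shape that disabling Sort Values avoids.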