Chart and Graph Types with Python

This notebook covers the various charts and graphs that are built into Databricks.

While Python is used to generate the test data displayed in the visualizations in this notebook, the information about how to configure these charts and graphs applies to notebooks in any language.

A Table View is the most basic way to view data.

Only the first 1000 rows are displayed in the table view.

from pyspark.sql import Row
 
array = map(lambda x: Row(key="k_%04d" % x, value = x), range(1, 5001))
largeDataFrame = spark.createDataFrame(sc.parallelize(array))
largeDataFrame.registerTempTable("largeTable")
display(spark.sql("select * from largeTable"))
 
key      value
k_0001   1
k_0002   2
k_0003   3
…        …
k_0022   22

Showing all 5000 rows.

Configure tables with Plot Options....

  • The Keys section specifies the control variable, which is typically displayed as the X-Axis on most graph types. Most graphs can plot roughly 1000 values for the keys, though the exact limit varies by graph type.
  • The Values section specifies the observed variable, which is typically displayed on the Y-Axis and tends to be a numerical value on most graph types.
  • The Series groupings section specifies how to break out the data. For a bar graph, each series grouping gets a different bar color, with a legend denoting the value of each grouping. Many graph types can only handle series groupings with 10 or fewer unique values.

Some graph types also allow specifying even more options - and those will be discussed as applicable.
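
To make the mapping concrete, here is a minimal, hypothetical DataFrame in which one column plays each role (x for Keys, y for Values, group for Series groupings); the real examples below follow the same pattern.

from pyspark.sql import Row

# Hypothetical example of the three roles: x -> Keys, y -> Values, group -> Series groupings.
plotOptionsExample = spark.createDataFrame([
  Row(x="2020", group="a", y=1.0),
  Row(x="2020", group="b", y=2.0),
  Row(x="2021", group="a", y=3.0),
  Row(x="2021", group="b", y=4.0)
])
display(plotOptionsExample)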

A Pivot table is another way to view data in a table format.

Instead of just returning the raw rows of the table, it can automatically sort the data and compute counts, totals, or averages of the values stored in the table.

  • Read more about Pivot Tables here: http://en.wikipedia.org/wiki/Pivot_table
  • For a Pivot Table, the key, series grouping, and value fields can be specified.
  • The key is the first column, and there will be one row per key in the Pivot Table.
  • There will be an additional column for each unique value of the series grouping.
  • The cells of the table contain the values field, which must be a numerical field that can be combined using aggregation functions.
  • Each cell in the Pivot Table is calculated from multiple rows of the original table.
    • Select SUM, AVG, MIN, MAX, or COUNT as the way to combine those original rows into the cell.
  • Pivoting is done on the server side of Databricks Cloud to calculate the cell values.

To create a Pivot table, click the Graph icon below a result and select Pivot.

# Click on the Plot Options Button...to see how this pivot table was configured.
from pyspark.sql import Row
 
largePivotSeries = map(lambda x: Row(key="k_%03d" % (x % 200), series_grouping = "group_%d" % (x % 3), value = x), range(1, 5001))
largePivotDataFrame = spark.createDataFrame(sc.parallelize(largePivotSeries))
largePivotDataFrame.registerTempTable("table_to_be_pivoted")
display(spark.sql("select * from table_to_be_pivoted"))
 
key     group_1   group_2   group_0
k_001   21609     20008     18408
k_002   18416     21618     20016
k_003   20024     18424     21627
…       …         …         …
k_021   20168     18568     21789

Another way to think of a pivot table is that it does a group by on your original table by the key and series grouping, but instead of outputting (key, series_grouping, aggregation_function(value)) tuples, it outputs a table whose schema is the key plus one column per unique value of the series grouping.

  • See the results of the group by statement below, which contain all the data that is in the pivot table above, though the schema of the results is different. (A DataFrame API sketch of the same pivot follows the results.)
%sql select key, series_grouping, sum(value) from table_to_be_pivoted group by key, series_grouping order by key, series_grouping
 
key     series_grouping   sum(value)
k_000   group_0           21600
k_000   group_1           20000
k_000   group_2           23400
k_001   group_0           18408
k_001   group_1           21609
k_001   group_2           20008
…       …                 …

Showing all 600 rows.
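
The same reshaping can also be sketched directly with the DataFrame API; this is roughly the computation the pivot table above performs when SUM is selected as the aggregation.

from pyspark.sql import functions as F

# Roughly what the server-side pivot computes: one row per key, one column per
# unique series_grouping value, each cell holding sum(value).
pivoted = (spark.table("table_to_be_pivoted")
  .groupBy("key")
  .pivot("series_grouping")
  .agg(F.sum("value"))
  .orderBy("key"))
display(pivoted)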

A Bar Chart is a type of visual pivot table graph and a great basic way to visualize data.

  • Plot Options... was used to configure the graph below.
  • The Key is Year and appears on the X-Axis.
  • The Series groupings is Product and there is a different color to denote each of those.
  • The Values is salesAmount and appears on the Y-Axis.
  • Sum was selected as the aggregation method, which means rows are summed for pivoting.
from pyspark.sql import Row
salesEntryDataFrame = spark.createDataFrame(sc.parallelize([
  Row(category="fruits_and_vegetables", product="apples", year=2012, salesAmount=100.50),
  Row(category="fruits_and_vegetables", product="oranges", year=2012, salesAmount=100.75),
  Row(category="fruits_and_vegetables", product="apples", year=2013, salesAmount=200.25),
  Row(category="fruits_and_vegetables", product="oranges", year=2013, salesAmount=300.65),
  Row(category="fruits_and_vegetables", product="apples", year=2014, salesAmount=300.65),
  Row(category="fruits_and_vegetables", product="oranges", year=2015, salesAmount=100.35),
  Row(category="butcher_shop", product="beef", year=2012, salesAmount=200.50),
  Row(category="butcher_shop", product="chicken", year=2012, salesAmount=200.75),
  Row(category="butcher_shop", product="pork", year=2013, salesAmount=400.25),
  Row(category="butcher_shop", product="beef", year=2013, salesAmount=600.65),
  Row(category="butcher_shop", product="beef", year=2014, salesAmount=600.65),
  Row(category="butcher_shop", product="chicken", year=2015, salesAmount=200.35),
  Row(category="misc", product="gum", year=2012, salesAmount=400.50),
  Row(category="misc", product="cleaning_supplies", year=2012, salesAmount=400.75),
  Row(category="misc", product="greeting_cards", year=2013, salesAmount=800.25),
  Row(category="misc", product="kitchen_utensils", year=2013, salesAmount=1200.65),
  Row(category="misc", product="cleaning_supplies", year=2014, salesAmount=1200.65),
  Row(category="misc", product="cleaning_supplies", year=2015, salesAmount=400.35)
]))
salesEntryDataFrame.registerTempTable("test_sales_table")
display(spark.sql("select * from test_sales_table"))
[Bar chart: salesAmount (Y-Axis) by year (X-Axis), one colored bar series per category]

Tip: Hover over each bar in the chart below to see the exact values plotted.
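
For reference, the numbers behind the bars can be reproduced with a simple group by; this is a sketch of the kind of sum-per-key-and-grouping aggregation the chart performs, here totaling salesAmount per year and category.

from pyspark.sql import functions as F

# The aggregation behind the bars: total salesAmount per year and category.
barData = (spark.table("test_sales_table")
  .groupBy("year", "category")
  .agg(F.sum("salesAmount").alias("totalSales"))
  .orderBy("year", "category"))
display(barData)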

A Line Graph is another example of a pivot table graph that can highlight trends for your data set.

  • Plot Options... was used to configure the graph below.
  • The Key is Year and appears on the X-Axis.
  • The Series groupings is Category and there is a different color to denote each category.
  • The Values is salesAmount and appears on the Y-Axis.
  • Sum is selected as the aggregation method.
%sql select cast(string(year) as date) as year, category, salesAmount from test_sales_table
[Line graph: salesAmount (Y-Axis) over time (X-Axis, Jan 2012 through 2015), one line per category]

A Pie Chart is a pivot table graph type that lets you see what percentage of the whole your values represent.

  • NOTE: As opposed to the previous examples, Key & Series Groupings have been switched.
  • Plot Options... was used to configure the graph below.
  • The Key is Category and one color is used for each category.
  • The Series groupings is Year and there is a different pie chart for each year.
  • The Values is salesAmount and is used to calculate the percentage of the pie.
  • Sum is selected as the aggregation method.
%sql select * from test_sales_table
[Pie charts: salesAmount share by category (fruits_and_vegetables, butcher_shop, misc), one pie per year, 2012-2015]

A Map Graph is a way to visualize your data on a map.

  • Plot Options... was used to configure the graph below.
  • Keys should contain the field with the location.
  • Series groupings is always ignored for World Map graphs.
  • Values should contain exactly one field with a numerical value.
  • Since there can be multiple rows with the same location key, choose SUM, AVG, MIN, MAX, or COUNT as the way to combine the values for a single key.
  • Different values are denoted by color on the map, and ranges are always spaced evenly.

Tip: Apply a smoothing function to your graph if your values are not evenly distributed (one such transform is sketched after the map below).

from pyspark.sql import Row
stateRDD = spark.createDataFrame(sc.parallelize([
  Row(state="MO", value=1), Row(state="MO", value=10),
  Row(state="NH", value=4),
  Row(state="MA", value=8),
  Row(state="NY", value=4),
  Row(state="CA", value=7)
]))
stateRDD.registerTempTable("test_state_table")
display(spark.sql("Select * from test_state_table"))
[Map of the US: states shaded by aggregated value, color ranges 0-12]
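
As a sketch of the smoothing tip above: any monotonic transform of the value column can be applied before display so the evenly spaced color ranges are not dominated by a few large values. log1p is used here purely for illustration.

from pyspark.sql import functions as F

# Illustrative only: log-transform the value column so the map's color ranges
# spread more evenly when the raw values are heavily skewed.
smoothedStates = spark.table("test_state_table").withColumn("value", F.log1p("value"))
display(smoothedStates)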

To plot a graph of the world, use country codes in ISO 3166-1 alpha-3 format as the key.

from pyspark.sql import Row
worldRDD = spark.createDataFrame(sc.parallelize([
  Row(country="USA", value=1000),
  Row(country="JPN", value=23),
  Row(country="GBR", value=23),
  Row(country="FRA", value=21),
  Row(country="TUR", value=3)
]))
display(worldRDD)
[World map: countries shaded by value, color ranges 0-1000]

A Scatter Plot allows you to see if there is a correlation between two variables.

  • Plot Options... was selected to configure the graph below.
  • Keys will be used to color the points on the graphs - with a legend on the side.
  • Series Grouping is ignored.
  • Value must contain at least two numerical fields. This graph has a, b, and c as the values.
  • The diagonal of the resulting plot matrix shows the kernel density plot of each variable.
  • Each row of the matrix has the same variable on the Y-Axis, and each column has the same variable on the X-Axis.
from pyspark.sql import Row
scatterPlotRDD = spark.createDataFrame(sc.parallelize([
  Row(key="k1", a=0.2, b=120, c=1), Row(key="k1", a=0.4, b=140, c=1), Row(key="k1", a=0.6, b=160, c=1), Row(key="k1", a=0.8, b=180, c=1),
  Row(key="k2", a=0.2, b=220, c=1), Row(key="k2", a=0.4, b=240, c=1), Row(key="k2", a=0.6, b=260, c=1), Row(key="k2", a=0.8, b=280, c=1),
  Row(key="k1", a=1.8, b=120, c=1), Row(key="k1", a=1.4, b=140, c=1), Row(key="k1", a=1.6, b=160, c=1), Row(key="k1", a=1.8, b=180, c=1),
  Row(key="k2", a=1.8, b=220, c=2), Row(key="k2", a=1.4, b=240, c=2), Row(key="k2", a=1.6, b=260, c=2), Row(key="k2", a=1.8, b=280, c=2),
  Row(key="k1", a=2.2, b=120, c=1), Row(key="k1", a=2.4, b=140, c=1), Row(key="k1", a=2.6, b=160, c=1), Row(key="k1", a=2.8, b=180, c=1),
  Row(key="k2", a=2.2, b=220, c=3), Row(key="k2", a=2.4, b=240, c=3), Row(key="k2", a=2.6, b=260, c=3), Row(key="k2", a=2.8, b=280, c=3)
]))
display(scatterPlotRDD)
[Scatter plot matrix of a, b, and c, with points colored by key and kernel density plots on the diagonal]
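
To complement the visual check for correlation, the correlation coefficient can also be computed directly on the DataFrame; a small sketch using the built-in stat functions:

# A numeric companion to the visual check: Pearson correlation between plotted fields.
print(scatterPlotRDD.stat.corr("a", "b"))
print(scatterPlotRDD.stat.corr("a", "c"))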

LOESS Fit Curves for Scatter Plots

LOESS is a method of performing local regression on your data to produce a smooth estimation curve that describes the trend of your scatter plot. It does this by interpolating a curve within each neighborhood of data points. The LOESS fit curve is controlled by a bandwidth parameter that specifies how many neighboring points should be used to smooth each point of the plot. A high bandwidth parameter (close to 1) gives a very smooth curve that may gloss over local structure in the data, while a low bandwidth parameter (close to 0) smooths very little and tends to follow the noise.

LOESS fit curves are now available for scatter plots. Here is an example of how you can create a LOESS fit for your scatter plots.

NOTE: If your dataset has more than 5000 data points, the LOESS fit is computed using the first 5000 points.

import numpy as np
import math
 
# Create data points for scatter plot
np.random.seed(0)
points = sc.parallelize(range(0,1000)).map(lambda x: (x/100.0, 4 * math.sin(x/100.0) + np.random.normal(4,1))).toDF()

You can turn this data into a scatter plot using the controls on the bottom left of the display table.

[Screenshot: selecting Scatter from the plot menu]

You can now access the LOESS fit option when you select Plot Options:

[Screenshot: the LOESS fit option in the Plot Options dialog]

You can experiment with the bandwidth parameter to see how the curve adapts to noisy data.
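
If you want to see the effect of the bandwidth outside the plot UI, one option is to fit the curve locally, for example with statsmodels (assuming it is installed on the cluster; this is only a sketch, not how the built-in LOESS fit is computed):

from statsmodels.nonparametric.smoothers_lowess import lowess

# Sketch: fit LOWESS curves locally to see how the bandwidth (frac) changes the fit.
# Column names _1 (x) and _2 (y) come from the tuple-based toDF() call above.
pdf = points.toPandas()
smooth_wide  = lowess(pdf["_2"], pdf["_1"], frac=0.8)   # close to 1: very smooth
smooth_tight = lowess(pdf["_2"], pdf["_1"], frac=0.1)   # close to 0: follows the noise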

Once you accept the change, you will see the LOESS fit on your scatter plot!

display(points)
[Scatter plot of the generated points with the LOESS fit curve]

A Histogram allows you to determine the distribution of values.

  • Plot Options... was selected to configure the graph below.
  • Value should contain exactly one field.
  • Series Grouping is always ignored.
  • Keys can support up to 2 fields.
    • When no key is specified, exactly one histogram is output.
    • When 2 fields are specified, then there is a trellis of histograms.
  • Aggregation is not applicable.
  • Number of bins is a special option that appears only for histogram plots, and controls the number of bins in the histogram.
  • Bins are computed on the server side for histograms, so all of the rows in a table can be plotted (a sketch of the binning is shown after the plot below).
from pyspark.sql import Row
# Hover over the entry in the histogram to read off the exact valued plotted.
histogramRDD = spark.createDataFrame(sc.parallelize([
  Row(key1="a", key2="x", val=0.2), Row(key1="a", key2="x", val=0.4), Row(key1="a", key2="x", val=0.6), Row(key1="a", key2="x", val=0.8), Row(key1="a", key2="x", val=1.0), 
  Row(key1="b", key2="z", val=0.2), Row(key1="b", key2="x", val=0.4), Row(key1="b", key2="x", val=0.6), Row(key1="b", key2="y", val=0.8), Row(key1="b", key2="x", val=1.0), 
  Row(key1="a", key2="x", val=0.2), Row(key1="a", key2="y", val=0.4), Row(key1="a", key2="x", val=0.6), Row(key1="a", key2="x", val=0.8), Row(key1="a", key2="x", val=1.0), 
  Row(key1="b", key2="x", val=0.2), Row(key1="b", key2="x", val=0.4), Row(key1="b", key2="x", val=0.6), Row(key1="b", key2="z", val=0.8), Row(key1="b", key2="x", val=1.0)]))
display(histogramRDD)
[Trellis of histograms of val, one panel per key1/key2 combination]
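
The bin counts themselves can also be computed with the RDD histogram helper, which bins across all rows; a sketch (the built-in plot manages this for you):

# Sketch: the same kind of binning the histogram performs, here with 5 equal-width bins.
binEdges, binCounts = histogramRDD.rdd.map(lambda row: row.val).histogram(5)
print(binEdges)
print(binCounts)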

A Quantile plot lets you view the value of a field at a given quantile.

  • For more information on Quantile Plots, see http://en.wikipedia.org/wiki/Normal_probability_plot.
  • Plot Options... was selected to configure the graph below.
  • Value should contain exactly one field.
  • Series Grouping is always ignored.
  • Keys can support up to 2 fields.
    • When no key is specified, exactly one quantile plot is output.
    • When 2 fields are specified, then there is a trellis of quantile plots.
  • Aggregation is not applicable.
  • Quantiles are not calculated on the server side for now, so only the first 1000 rows are reflected in the plot (a sketch of computing quantiles over the full table follows the plot below).
from pyspark.sql import Row
quantileSeries = map(lambda x: Row(key="key_%01d" % (x % 4), grouping="group_%01d" % (x % 3), otherField=x, value=x*x), range(1, 5001))
quantileSeriesRDD = spark.createDataFrame(sc.parallelize(quantileSeries))
display(quantileSeriesRDD)
[Trellis of quantile plots of value, one panel per key/grouping combination]
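
If you need quantiles computed over the full table rather than only the first 1000 rows, approxQuantile runs on the cluster; a small sketch:

# Sketch: approximate quartiles of value over all 5000 rows (last argument is the relative error).
print(quantileSeriesRDD.approxQuantile("value", [0.25, 0.5, 0.75], 0.01))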

A Q-Q plot shows you how a field of values is distributed.

  • For more information on Q-Q plots, see http://en.wikipedia.org/wiki/Q%E2%80%93Q_plot.
  • Value should contain one or two fields.
  • Series Grouping is always ignored.
  • Keys can support up to 2 fields.
    • When no key is specified, exactly one Q-Q plot is output.
    • When 2 fields are specified, then there is a trellis of Q-Q plots.
  • Aggregation is not applicable.
  • Q-Q plots are not calculated on the server side for now, so only the first 1000 rows are reflected in the plot.
from pyspark.sql import Row
qqPlotSeries = map(lambda x: Row(key="key_%03d" % (x % 5), grouping="group_%01d" % (x % 3), value=x, value_squared=x*x), range(1, 5001))
qqPlotRDD = spark.createDataFrame(sc.parallelize(qqPlotSeries))

When there is only one field specified for Values, a Q-Q plot will just compare the distribution of the field with a normal distribution.

display(qqPlotRDD)
[Q-Q plot of value quantiles against normal quantiles]

When there are two fields specified for Values, a Q-Q plot will compare the distribution of the two fields with each other.

display(qqPlotRDD)
[Q-Q plot comparing the quantiles of value and value_squared]

Up to two keys can be configured with a Q-Q plot to create a trellis of plots.

display(qqPlotRDD)
[Trellis of Q-Q plots of value vs. value_squared, one panel per key/grouping combination]

A Box plot gives you an idea of what the expected range of values is and shows the outliers.

  • See http://en.wikipedia.org/wiki/Box_plot for more information on Box Plots.
  • Value should contain exactly one field.
  • Series Grouping is always ignored.
  • Keys can be added.
    • There will be one box and whisker plot for each combination of values for the keys.
  • Aggregation is not applicable.
  • Box plots are not calculated on the server side for now, so only the first 1000 rows are reflected in the plot (a sketch of computing the statistics over all rows follows the plot below).
  • The Median value of the Box plot is displayed when you hover over the box.
from pyspark.sql import Row
import random
# Hovering over the Box will display the exact median value.
boxSeries = map(lambda x: Row(key="key_%01d" % (x % 2), grouping="group_%01d" % (x % 3), value=random.randint(0, x)), range(1, 5001))
boxSeriesRDD = spark.createDataFrame(sc.parallelize(boxSeries))
display(boxSeriesRDD)
[Box plots of value, one box per key/grouping combination]
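
Since the plot only reflects the first 1000 rows, the underlying box statistics can also be computed over the full table on the cluster, for example with percentile_approx; a sketch:

from pyspark.sql import functions as F

# Sketch: approximate box statistics per key/grouping combination over the full table.
boxStats = (boxSeriesRDD
  .groupBy("key", "grouping")
  .agg(F.min("value").alias("lo"),
       F.expr("percentile_approx(value, array(0.25, 0.5, 0.75))").alias("quartiles"),
       F.max("value").alias("hi")))
display(boxStats)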