Cohort visualization

A cohort analysis examines the outcomes of predetermined groups, called cohorts, as they progress through a set of stages. The signature characteristic of a cohort chart is its comparison of the change in a variable across two different time series. For example, a common cohort definition is users by sign-up period and their usage pattern by day. Other examples include:

  • Monthly hard drive failure statistics by month

  • Weekly supplier delivery performance by week

  • Monthly average class GPAs by month

While there are many ways to define the stages of a cohort analysis, Databricks supports cohort visualizations with daily, weekly, or monthly stages. Databricks cohort charts also compare a cohort’s measurement in a given period against that cohort’s initial population size.

Data format

Databricks expects your input samples to have the following fields:

  • Date (bucket): the date that uniquely identifies a cohort. Suppose you’re visualizing monthly user activity by sign-up date. The cohort date for all users who signed up in January 2018 would be January 1, 2018. The cohort date for any user who signed up in February 2018 would be February 1, 2018.

  • Stage: the number of stages that have transpired since the cohort date as of this sample. If you are grouping users by sign-up month, then your stage is the count of months since these users signed up. In the above example, a measurement of activity in July for users who signed up in January would have a stage of 7 because seven stages have transpired between January and July.

  • Stage value: your actual measurement of this cohort’s performance in the given stage. In the above example, if 30 users who signed up in January showed activity in July then the stage value would be 30.

  • Bucket population size: the denominator used to calculate the percentage of the cohort that met the target in a given stage. Continuing the example above, if 72 users signed up in January then the bucket population size would be 72. When the visualization is rendered, the value would be displayed as 41.67% (30 ÷ 72).

  • Time interval: lets you define the cohort on a daily, weekly, or monthly basis.
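
The fields above can be illustrated with a minimal Python sketch that builds one input row for the running example and reproduces the displayed percentage. The column names (date, stage, stage_value, population) and the helper cohort_row are hypothetical, chosen only for illustration. The stage is counted inclusively so that January to July gives 7, as in the example above.

```python
from datetime import date

def cohort_row(cohort_date: date, sample_date: date, active: int, population: int) -> dict:
    """Build one cohort row for monthly stages (hypothetical column names)."""
    # Whole months between the two dates, plus 1 so that the sign-up month
    # itself is stage 1 and January -> July is stage 7, matching the text.
    stage = (
        (sample_date.year - cohort_date.year) * 12
        + (sample_date.month - cohort_date.month)
        + 1
    )
    return {
        "date": cohort_date.replace(day=1),  # full date identifying the cohort
        "stage": stage,
        "stage_value": active,               # users active in this stage
        "population": population,            # users who signed up in the cohort
    }

# 72 users signed up in January 2018; 30 of them were active in July 2018.
row = cohort_row(date(2018, 1, 1), date(2018, 7, 15), active=30, population=72)

# Percentage of the cohort's initial population, rounded to two decimals.
pct = round(100 * row["stage_value"] / row["population"], 2)
print(row["stage"], pct)
```

In practice these rows would usually come from a query over your own tables; the sketch only shows how the four values relate to each other.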

Cohort date notes

Even if you define your cohorts by month or week, Databricks expects the values in your Date column to be full dates. If you are grouping by month, 2018-01-18 should be converted to 2018-01-01 or any other full date in January, not truncated to 2018-01.
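
As a rough sketch of this normalization in Python, the helpers below map any date to a single full date per month or per week. The function names are hypothetical, and treating Monday as the first day of the week is an assumption, not something stated above.

```python
from datetime import date, timedelta

def month_bucket(d: date) -> date:
    """Map any date to the first day of its month (still a full date)."""
    return d.replace(day=1)

def week_bucket(d: date) -> date:
    """Map any date to the Monday of its week (assumed week start)."""
    return d - timedelta(days=d.weekday())

signup = date(2018, 1, 18)
print(month_bucket(signup).isoformat())
print(week_bucket(signup).isoformat())
```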

The cohort visualizer converts all date and time values to GMT before rendering. To avoid rendering issues, adjust the date and time values returned from your database by your local UTC offset.
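
The following Python sketch shows why the offset matters: a timestamp close to midnight can fall in a different cohort month depending on the time zone used. The helper shift_by_offset and the fixed offset of -5 hours are assumptions made for illustration. A fixed offset ignores daylight saving time, and the direction of the adjustment you need depends on how your database stores timestamps.

```python
from datetime import datetime, timedelta, timezone

def shift_by_offset(utc_dt: datetime, offset_hours: int) -> datetime:
    """Shift a timezone-aware UTC timestamp by a fixed offset and drop tzinfo."""
    local = utc_dt.astimezone(timezone(timedelta(hours=offset_hours)))
    return local.replace(tzinfo=None)

# A sign-up recorded at 02:30 UTC on February 1.
signup_utc = datetime(2018, 2, 1, 2, 30, tzinfo=timezone.utc)

# With a hypothetical local offset of UTC-5, the same instant is still January 31.
signup_local = shift_by_offset(signup_utc, offset_hours=-5)

print(signup_utc.date(), signup_local.date())
```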