Marketplace Starter Notebook for Data Providers (Python)


The goal of this notebook is to provide an outline for a common workflow that consumers can use to examine new data shared with them. This is not an exhaustive data exploration presentation. When you create sample notebooks for your own listings, you are encouraged to highlight interesting examples to demonstrate the value of the data.

When you build your notebooks, it's often good to show:

  • Schema diagram
  • Dataset overview or description (a few paragraphs)
  • A minimum of four to eight queries, preferably:
    • 2 to 4 that explore the data and demonstrate data coverage
    • 2 to 4 that show how to drive insights from the data

Notebook basics and goals

Notebooks provide cell-by-cell execution of code, and multiple languages can be mixed in a single notebook. Users can add plots, images, and markdown text to enhance their code. Customers can easily deploy notebooks as production code in Databricks, and notebooks also provide a robust toolset for data exploration, reporting, and dashboarding.

Running a Cell

  • Run the cell below using one of the following options:
    • CTRL+ENTER or CTRL+RETURN

    • SHIFT+ENTER or SHIFT+RETURN to run the cell and move to the next one

    • Using Run Cell, Run All Above, or Run All Below as seen here:

      Run menu

Databricks visualizations

When you use the Databricks built-in visualizations, tabs appear above the output, to the right of the data table. To set the default visualization, drag its tab to the leftmost position.

Visualization tabs

You can also generate data profiles, which give users concise summary statistics and a quick view of the data's structure.

Data profile
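
As a minimal sketch, you can also generate the same profile programmatically with dbutils; here we assume the dailyprice example table used later in this notebook:

# Generate a data profile for a Spark DataFrame programmatically.
# 'dailyprice' is the example table from later in this document.
df = spark.table("dailyprice")
dbutils.data.summarize(df)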

Goals

Show:

  • The content in the dataset.
  • What differentiates the dataset and makes it significant.
  • Data history and coverage.
  • Key concepts for each dataset.

Inspire:

  • Thought and customer engagement

Design Principles

Simplicity

  • Make sure the notebooks highlight the content, and try to keep consumers from getting lost in the 'tech'
  • Users can change their catalog name: parameterize the catalog name so they only need to make a single change to the code (see the sketch below this list).
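
A minimal sketch of this pattern using notebook widgets (the default values here are placeholders):

# Parameterize catalog and schema so consumers edit one widget value, not the code.
dbutils.widgets.text("catalogName", "sample_catalog")  # placeholder default
dbutils.widgets.text("schemaName", "sample_schema")    # placeholder default
catalogName = dbutils.widgets.get("catalogName")
schemaName = dbutils.widgets.get("schemaName")
spark.sql(f"use {catalogName}.{schemaName}")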

Visuals

  • Use visuals when possible to make the data less boring and enrich the story
  • Along with the native visualizations, Databricks supports many popular visualization libraries, such as Seaborn, Plotly, and Matplotlib (see the sketch below this list).
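
For instance, a minimal Matplotlib sketch, assuming the dailyprice example table used later in this notebook; any query result converted to pandas works the same way:

import matplotlib.pyplot as plt

# Pull one ticker's adjusted closes into pandas and plot them.
pdf = spark.sql(
    "select date, adjClose from dailyprice where ticker = 'AAPL' order by date"
).toPandas()

plt.plot(pdf["date"], pdf["adjClose"])
plt.title("AAPL adjusted close")
plt.xlabel("date")
plt.ylabel("adjClose")
plt.show()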

General structure of notebooks

Your notebooks should have the following components:

  1. Title
  2. Section 1 - Introduction
  3. Section 2 - Exploring the Data
  4. Section 3 - Sample Queries
  5. Section 4 - Insights

Section 1 - Understanding the Dataset

References

You can provide links to:

  • Data Marketing pages
  • Support Resources
  • Data Dictionary (if one exists)

Dataset Overview

Present only essential information about the dataset here, keeping it to two or three paragraphs at most. The aim is to give users relevant context about what they are reviewing without repeating content already found in the Marketplace listing.

Schemas (optional but useful)

Share an image of the schema(s) that visually represents the significant relationships within the data. Host the images online so they can be referenced from an HTML img tag.
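
A minimal sketch using displayHTML; the URL is a placeholder for your own hosted image:

# Render a hosted schema diagram inside a notebook cell.
displayHTML('<img src="https://example.com/schema_diagram.png" width="800">')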

Section 2 - Exploring the Data

Data and Key Data Elements

This section helps customers understand the data and the key elements of the most important tables. Minimal text and images are fine; aim to add interactivity by incorporating example queries wherever feasible.

Coverage and Summary Statistics

Use this space for basic coverage-related queries, paired with visualizations that enhance the story. Some examples:

Sector Coverage:

  • Query: What are the top sectors covered in the dataset?
  • Visualization: A bar chart showcasing the distribution of data across different sectors.

Country Coverage:

  • Query: Which countries are represented in the dataset?
  • Visualization: A world map highlighting the countries covered, with color intensity indicating the data density.

Coverage over Time:

  • Query: How does the dataset's coverage vary over time?
  • Visualization: A line chart illustrating the dataset's coverage over a specific time period, showcasing the number of data points or events over each time point.

Geographical Coverage:

  • Query: Where does the dataset have the most comprehensive geographical coverage?
  • Visualization: A heat map highlighting regions or countries with the highest concentration of data points, emphasizing the extent of geographical coverage.

Demographic Coverage:

  • Query: What demographics are included in the dataset?
  • Visualization: A pie chart representing the distribution of different demographic groups within the dataset.

Industry Coverage:

  • Query: Which industries or sub-industries are prominently covered in the dataset?
  • Visualization: A stacked bar chart displaying the distribution of data across various industries or sub-industries, allowing users to identify the most prevalent ones.

These initial ideas are a starting point; personalize and expand on them to match your dataset and the coverage criteria relevant to your customers. For example:
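
As one concrete sketch, here is what an industry-coverage query might look like against the company_profile table from the example later in this notebook; substitute your own table and grouping column, then pick the bar chart tab above the output:

# Coverage sketch: company counts per industry.
coverage_df = spark.sql("""
    select industry, count(*) as companyCount
    from company_profile
    group by industry
    order by companyCount desc
""")
display(coverage_df)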

Section 3 - Sample Queries

Use this section for a selection of sample queries that demonstrate practical uses of the data. Begin with straightforward examples and gradually increase the complexity. Aim for three to five sample queries, which gives a well-rounded picture of the data's potential applications.
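
For example, a straightforward opening query (assuming the dailyprice table from the example later in this notebook) might simply show how current the data is for each ticker:

# A simple starter query: the latest available date per ticker.
latest_df = spark.sql("""
    select ticker, max(date) as latestDate
    from dailyprice
    group by ticker
    order by ticker
""")
display(latest_df)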

Section 4 - Insights

This section is your opportunity to get creative. The goal is to go deeper into the data than standard user guides and marketing material and present customers with real insights. Here are some suggested focal points:

Advanced Queries:

  • Demonstrate complex queries that extract valuable information beyond basic data retrieval.

Trend Analysis in Data:

  • Explore and visualize trends over time within the dataset, uncovering patterns and insights that might not be apparent at first glance.

Spark, Python, and Other Advanced Tools:

  • Show how to use Spark, Python, or other advanced tools for data manipulation, transformation, and analysis, giving customers more sophisticated techniques.

Data Science Analysis, such as Correlations:

  • Illustrate data science techniques like correlation analysis to reveal relationships between variables, enabling customers to gain a deeper understanding of the data (see the sketch at the end of this section).

Integration with Other Data Sources:

  • Highlight the possibilities of integrating the dataset with customers' own data, demonstrating how the combination of multiple data sources can yield richer insights.

Additionally, it may be appropriate to dedicate a separate subsection to key findings, noteworthy patterns, and significant discoveries derived from your analysis.

Feel free to explore these ideas and adapt them to your dataset, leveraging this section to provide customers with captivating and thought-provoking insights.
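
As one hedged sketch of the correlation idea, assuming the dailyprice table from the example later in this notebook:

# Correlation sketch: pivot adjusted closes into one column per ticker,
# then compute pairwise correlations of daily returns in pandas.
pdf = (
    spark.table("dailyprice")
    .groupBy("date")
    .pivot("ticker")
    .max("adjClose")
    .toPandas()
    .set_index("date")
    .sort_index()
)
returns_corr = pdf.pct_change().corr()
display(returns_corr)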

Additional Considerations for Consistency

Consider these best practices when you create notebooks:

  • Parameterize the catalog name so the customer only needs to make a single change to update the notebook for their use (should they choose to rename the catalog).
  • Please be conscious of the data you are using. Don't join data the customer doesn't have access to in your examples.
  • When done writing SQL, use the "Format SQL" option in the editor so that all SQL has a similar look and feel (though it makes the SQL more verbose).
  • Always point to your support site for links that help the customer make sense of any complexities.
  • Try to eliminate unnecessary columns from output.
  • Always use an alias for a transformed, aggregated, or cast column for consistency in table output
  • Do not use special characters (*, &, $, %) in notebook or folder names

Market Data - Marketplace Notebook Example

Marketplace data example

In this example notebook, we use basic market data for a small portfolio of stocks. Following the framework outlined above, we walk through the example step by step. We will use the following tables:

  • dailyprice - Contains the open, high, low, and close prices, as well as the volume and adjusted close price.
  • earnings_surprise - Contains quarterly earnings surprise data.
  • company_profile - Basic company-level information for a given stock.

Sample Market Data

Introduction

In this notebook, we will provide an understanding of the Sample Market Dataset with example queries to help you quickly understand the data and best practices. Additionally, this notebook seeks to demonstrate the range of capabilities within Databricks. We will use mostly SQL and Python to highlight the multilanguage capabilities.


The Sample Market Dataset provides a multi-year history of daily price data, 10+ years of earnings data, and basic company-level information (address, website, etc.).

Data Exploration

# Use this section to update the catalog and schema names to your naming convention
catalogName = 'tjs_sample_data'
schemaName = 'market_data'
spark.sql(f"use {catalogName}.{schemaName}")
%sql
select
  ticker,
  count(*) as recordCount,
  min(date) as startingDate,
  max(date) as latestDate
from
  dailyprice
group by
  ticker
 
ticker   recordCount   startingDate   latestDate
AAPL     5930          1999-11-01     2023-05-25
META     2773          2012-05-18     2023-05-25
TSLA     3250          2010-06-29     2023-05-25
JNJ      5930          1999-11-01     2023-05-25
GOOGL    4725          2004-08-19     2023-05-25
AMD      5930          1999-11-01     2023-05-25
LLY      5930          1999-11-01     2023-05-25
AMZN     5930          1999-11-01     2023-05-25
CRM      4765          2004-06-23     2023-05-25
MMM      5930          1999-11-01     2023-05-25
RPRX     742           2020-06-16     2023-05-25
BMY      5930          1999-11-01     2023-05-25
MRK      5930          1999-11-01     2023-05-25
CTLT     2221          2014-07-31     2023-05-25
PFE      5930          1999-11-01     2023-05-25
ZTS      2597          2013-02-01     2023-05-25
MSFT     5930          1999-11-01     2023-05-25

18 rows
%sql
DESCRIBE dailyprice
 
col_name      data_type   comment
date          date        null
open          double      null
high          double      null
low           double      null
close         double      null
adjClose      double      null
vol           double      null
divAmount     double      null
splitFactor   double      null
ticker        string      null

10 rows
# Get data into a dataframe using Spark
df = spark.sql("select * from dailyprice")
# Convert to a Pandas dataframe and describe the data
df.describe().toPandas()
%sql
select
  *
from
  earnings_surprise

Sample queries

%sql
SELECT
  industry,
  exchange,
  count(*) as companyCount
FROM
  company_profile
group by
  industry,
  exchange
(9 rows returned)


%sql
with prc as (
  select
    row_number() over(
      partition by ticker
      order by
        date desc
    ) as reldate,
    *
  from
    dailyprice
  where
    ticker = 'AAPL'
)
select
  p1.ticker,
  p1.date,
  p1.adjclose,
  ((p1.adjclose - p2.adjclose) / p2.adjclose) * 100 as percentChange
from
  prc p1
  join prc p2 on p1.reldate = p2.reldate - 1
order by
  percentChange desc
limit
  10
 
ticker   date         adjclose            percentChange
AAPL     2008-10-13   3.34653152475106    13.904958677686038
AAPL     2000-03-01   0.988768644545417   13.688710521723937
AAPL     2004-10-14   0.682600163174781   13.157232704402483
AAPL     2001-04-19   0.390317389881177   12.856516015796249
AAPL     2008-11-24   2.82115096341022    12.557519980624932
AAPL     2020-03-13   68.0448286896084    11.980824235588074
AAPL     2006-07-20   1.83625210636168    11.829944547135
AAPL     2001-04-18   0.345852772760188   11.715686274509835
AAPL     2003-05-05   0.244176003234376   11.349480968858122
AAPL     2004-07-15   0.499733734400746   11.32521974306985

10 rows
%sql
select
  *
from
  earnings_surprise
where
  ticker in ('LLY', 'JNJ')
  and year(reportedDate) = 2022

(8 rows returned)
%sql
select
  ticker,
  count(*) as missedEarnings,
  avg(surprise) as averageDifference
from
  earnings_surprise
where
  surprise < 0
group by
  ticker
order by
  ticker

(18 rows returned)
%sql
select
  ticker,
  date,
  open,
  high,
  low,
  close,
  vol
from
  dailyprice
where
  ticker = 'TSLA'
  and date >= date_add(current_date(), -60)

(38 rows returned)