Marketplace Starter Notebook for Data Providers (Python)


The goal of this notebook is to provide an outline for a common workflow that consumers can use to examine new data shared with them. This is not an exhaustive data exploration presentation. When you create sample notebooks for your own listings, you are encouraged to highlight interesting examples to demonstrate the value of the data.

When you build your notebooks, it's often good to show:

  • Schema diagram
  • Dataset overview or description (a few paragraphs)
  • A minimum of four to eight queries, preferably:
    • 2 to 4 that explore the data and demonstrate data coverage
    • 2 to 4 that show how to drive insights from the data

Notebook basics and goals

Notebooks provide cell-by-cell execution of code, and multiple languages can be mixed in a single notebook. Users can add plots, images, and markdown text to enhance their code. Customers can easily deploy notebooks as production code in Databricks, and notebooks also provide a robust toolset for data exploration, reporting, and dashboarding.

Running a Cell

  • Run the cell below using one of the following options:
    • CTRL+ENTER or CTRL+RETURN

    • SHIFT+ENTER or SHIFT+RETURN to run the cell and move to the next one

    • Using Run Cell, Run All Above, or Run All Below as seen here:

      Run menu

Databricks visualizations

When you use the Databricks built-in visualizations, tabs appear above the output, to the right of the data table. To set the default visualization, drag its tab to the leftmost position.

Visualization tabs

You can also generate data profiles, which give users concise summary statistics and a quick view of the data's structure.

Data profile
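
As a minimal sketch, you can also generate the same profile programmatically with dbutils; here we assume the dailyprice example table used later in this notebook:

# Generate a data profile for a Spark DataFrame programmatically.
# 'dailyprice' is the example table from later in this document.
df = spark.table("dailyprice")
dbutils.data.summarize(df)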

Goals

Show:

  • The content in the dataset.
  • What differentiates the dataset and makes it significant.
  • Data history and coverage.
  • Key concepts for each dataset.

Inspire:

  • Thought and customer engagement

Design Principles

Simplicity

  • Make sure the notebooks highlight the content, and try to keep consumers from getting lost in the 'tech'
  • Users can change their catalog name: parameterize the catalog name so they only need to make a single change to the code (see the sketch below this list).
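
A minimal sketch of this pattern using notebook widgets (the default values here are placeholders):

# Parameterize catalog and schema so consumers edit one widget value, not the code.
dbutils.widgets.text("catalogName", "sample_catalog")  # placeholder default
dbutils.widgets.text("schemaName", "sample_schema")    # placeholder default
catalogName = dbutils.widgets.get("catalogName")
schemaName = dbutils.widgets.get("schemaName")
spark.sql(f"use {catalogName}.{schemaName}")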

Visuals

  • Use visuals when possible to make the data less boring and enrich the story
  • Along with the native visualizations, Databricks supports many popular visualization libraries, such as Seaborn, Plotly, and Matplotlib (see the sketch below this list).
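
For instance, a minimal Matplotlib sketch, assuming the dailyprice example table used later in this notebook; any query result converted to pandas works the same way:

import matplotlib.pyplot as plt

# Pull one ticker's adjusted closes into pandas and plot them.
pdf = spark.sql(
    "select date, adjClose from dailyprice where ticker = 'AAPL' order by date"
).toPandas()

plt.plot(pdf["date"], pdf["adjClose"])
plt.title("AAPL adjusted close")
plt.xlabel("date")
plt.ylabel("adjClose")
plt.show()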

General structure of notebooks

Your notebooks should have the following components:

  1. Title
  2. Section 1 - Introduction
  3. Section 2 - Exploring the Data
  4. Section 3 - Sample Queries
  5. Section 4 - Insights

Section 1 - Understanding the Dataset

References

You can provide links to:

  • Data Marketing pages
  • Support Resources
  • Data Dictionary (if one exists)

Dataset Overview

Present only essential information about the dataset here, keeping it to two or three paragraphs at most. The aim is to give users relevant context about what they are reviewing without repeating content already found in the Marketplace listing.

Schemas (optional but useful)

Share an image of the schema(s) that visually represents the significant relationships within the data. Host the images online so they can be referenced from an HTML img tag.
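
A minimal sketch using displayHTML; the URL is a placeholder for your own hosted image:

# Render a hosted schema diagram inside a notebook cell.
displayHTML('<img src="https://example.com/schema_diagram.png" width="800">')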

Section 2 - Exploring the Data

Data and Key Data Elements

This section helps customers understand the data and the key elements of the most important tables. Minimal text and images are fine; aim to add interactivity by incorporating example queries wherever feasible.

Coverage and Summary Statistics

Use this space for basic coverage-related queries, paired with visualizations that enhance the story. Some examples:

Sector Coverage:

  • Query: What are the top sectors covered in the dataset?
  • Visualization: A bar chart showcasing the distribution of data across different sectors.

Country Coverage:

  • Query: Which countries are represented in the dataset?
  • Visualization: A world map highlighting the countries covered, with color intensity indicating the data density.

Coverage over Time:

  • Query: How does the dataset's coverage vary over time?
  • Visualization: A line chart illustrating the dataset's coverage over a specific time period, showcasing the number of data points or events over each time point.

Geographical Coverage:

  • Query: Where does the dataset have the most comprehensive geographical coverage?
  • Visualization: A heat map highlighting regions or countries with the highest concentration of data points, emphasizing the extent of geographical coverage.

Demographic Coverage:

  • Query: What demographics are included in the dataset?
  • Visualization: A pie chart representing the distribution of different demographic groups within the dataset.

Industry Coverage:

  • Query: Which industries or sub-industries are prominently covered in the dataset?
  • Visualization: A stacked bar chart displaying the distribution of data across various industries or sub-industries, allowing users to identify the most prevalent ones.

These initial ideas are a starting point; personalize and expand on them to match your dataset and the coverage criteria relevant to your customers. For example:
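
As one concrete sketch, here is what an industry-coverage query might look like against the company_profile table from the example later in this notebook; substitute your own table and grouping column, then pick the bar chart tab above the output:

# Coverage sketch: company counts per industry.
coverage_df = spark.sql("""
    select industry, count(*) as companyCount
    from company_profile
    group by industry
    order by companyCount desc
""")
display(coverage_df)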

Section 3 - Sample Queries

Use this section for a selection of sample queries that demonstrate practical uses of the data. Begin with straightforward examples and gradually increase the complexity. Aim for three to five sample queries, which gives a well-rounded picture of the data's potential applications.
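
For example, a straightforward opening query (assuming the dailyprice table from the example later in this notebook) might simply show how current the data is for each ticker:

# A simple starter query: the latest available date per ticker.
latest_df = spark.sql("""
    select ticker, max(date) as latestDate
    from dailyprice
    group by ticker
    order by ticker
""")
display(latest_df)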

Section 4 - Insights

This section is your opportunity to get creative. The goal is to go deeper into the data than standard user guides and marketing material and present customers with real insights. Here are some suggested focal points:

Advanced Queries:

  • Demonstrate complex queries that extract valuable information beyond basic data retrieval.

Trend Analysis in Data:

  • Explore and visualize trends over time within the dataset, uncovering patterns and insights that might not be apparent at first glance.

Spark, Python, and Other Advanced Tools:

  • Show how to use Spark, Python, or other advanced tools for data manipulation, transformation, and analysis, giving customers more sophisticated techniques.

Data Science Analysis, such as Correlations:

  • Illustrate data science techniques like correlation analysis to reveal relationships between variables, enabling customers to gain a deeper understanding of the data (see the sketch at the end of this section).

Integration with Other Data Sources:

  • Highlight the possibilities of integrating the dataset with customers' own data, demonstrating how the combination of multiple data sources can yield richer insights.

Additionally, it may be appropriate to dedicate a separate subsection to key findings, noteworthy patterns, and significant discoveries derived from your analysis.

Feel free to explore these ideas and adapt them to your dataset, leveraging this section to provide customers with captivating and thought-provoking insights.
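
As one hedged sketch of the correlation idea, assuming the dailyprice table from the example later in this notebook:

# Correlation sketch: pivot adjusted closes into one column per ticker,
# then compute pairwise correlations of daily returns in pandas.
pdf = (
    spark.table("dailyprice")
    .groupBy("date")
    .pivot("ticker")
    .max("adjClose")
    .toPandas()
    .set_index("date")
    .sort_index()
)
returns_corr = pdf.pct_change().corr()
display(returns_corr)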

Additional Considerations for Consistency

Consider these best practices when you create notebooks:

  • Parameterize the catalog name so the customer only needs to make a single change to update the notebook for their use (should they choose to rename the catalog).
  • Please be conscious of the data you are using. Don't join data the customer doesn't have access to in your examples.
  • When done writing SQL, use the "Format SQL" option in the editor so that all SQL has a similar look and feel (though it makes the SQL more verbose).
  • Always point to your support site for links that help the customer make sense of any complexities.
  • Try to eliminate unnecessary columns from output.
  • Always use an alias for a transformed, aggregated, or cast column for consistency in table output
  • Do not use special characters (*, &, $, %) in notebook or folder names

Market Data - Marketplace Notebook Example

Marketplace data example

In this example notebook, we use basic market data for a small portfolio of stocks. Following the framework outlined above, we walk through the example step by step. We will use the following tables:

  • dailyprice - Contains the open, high, low, and close prices, as well as the volume and adjusted close price.
  • earnings_surprise - Contains quarterly earnings surprise data.
  • company_profile - Basic company-level information for a given stock.

Sample Market Data

Introduction

In this notebook, we will provide an understanding of the Sample Market Dataset with example queries to help you quickly understand the data and best practices. Additionally, this notebook seeks to demonstrate the range of capabilities within Databricks. We will use mostly SQL and Python to highlight the multilanguage capabilities.


The Sample Market Dataset provides a multi-year history of daily price data, 10+ years of earnings data, and basic company-level information (address, website, etc.).

Data Exploration

# Use this section to update the catalog and schema names to your naming convention
catalogName = 'tjs_sample_data'
schemaName = 'market_data'
spark.sql(f"use {catalogName}.{schemaName}")
%sql
select
  ticker,
  count(*) as recordCount,
  min(date) as startingDate,
  max(date) as latestDate
from
  dailyprice
group by
  ticker
 
ticker   recordCount   startingDate   latestDate
AAPL     5930          1999-11-01     2023-05-25
META     2773          2012-05-18     2023-05-25
TSLA     3250          2010-06-29     2023-05-25
JNJ      5930          1999-11-01     2023-05-25
GOOGL    4725          2004-08-19     2023-05-25
AMD      5930          1999-11-01     2023-05-25
LLY      5930          1999-11-01     2023-05-25
AMZN     5930          1999-11-01     2023-05-25
CRM      4765          2004-06-23     2023-05-25
MMM      5930          1999-11-01     2023-05-25
RPRX     742           2020-06-16     2023-05-25
BMY      5930          1999-11-01     2023-05-25
MRK      5930          1999-11-01     2023-05-25
CTLT     2221          2014-07-31     2023-05-25
PFE      5930          1999-11-01     2023-05-25
ZTS      2597          2013-02-01     2023-05-25
MSFT     5930          1999-11-01     2023-05-25

18 rows
%sql
DESCRIBE dailyprice
 
col_name      data_type   comment
date          date        null
open          double      null
high          double      null
low           double      null
close         double      null
adjClose      double      null
vol           double      null
divAmount     double      null
splitFactor   double      null
ticker        string      null

10 rows
# Get data into a dataframe using Spark
df = spark.sql("select * from dailyprice")
# Convert to a Pandas dataframe and describe the data
df.describe().toPandas()
%sql
select
  *
from
  earnings_surprise

Sample queries

%sql
SELECT
  industry,
  exchange,
  count(*) as companyCount
FROM
  company_profile
group by
  industry,
  exchange
(9 rows returned)


%sql
with prc as (
  select
    row_number() over(
      partition by ticker
      order by
        date desc
    ) as reldate,
    *
  from
    dailyprice
  where
    ticker = 'AAPL'
)
select
  p1.ticker,
  p1.date,
  p1.adjclose,
  ((p1.adjclose - p2.adjclose) / p2.adjclose) * 100 as percentChange
from
  prc p1
  join prc p2 on p1.reldate = p2.reldate - 1
order by
  percentChange desc
limit
  10
 
ticker   date         adjclose            percentChange
AAPL     2008-10-13   3.34653152475106    13.904958677686038
AAPL     2000-03-01   0.988768644545417   13.688710521723937
AAPL     2004-10-14   0.682600163174781   13.157232704402483
AAPL     2001-04-19   0.390317389881177   12.856516015796249
AAPL     2008-11-24   2.82115096341022    12.557519980624932
AAPL     2020-03-13   68.0448286896084    11.980824235588074
AAPL     2006-07-20   1.83625210636168    11.829944547135
AAPL     2001-04-18   0.345852772760188   11.715686274509835
AAPL     2003-05-05   0.244176003234376   11.349480968858122
AAPL     2004-07-15   0.499733734400746   11.32521974306985

10 rows
%sql
select
  *
from
  earnings_surprise
where
  ticker in ('LLY', 'JNJ')
  and year(reportedDate) = 2022

(8 rows returned)
%sql
select
  ticker,
  count(*) as missedEarnings,
  avg(surprise) as averageDifference
from
  earnings_surprise
where
  surprise < 0
group by
  ticker
order by
  ticker

(18 rows returned)
%sql
select
  ticker,
  date,
  open,
  high,
  low,
  close,
  vol
from
  dailyprice
where
  ticker = 'TSLA'
  and date >= date_add(current_date(), -60)

(38 rows returned)