%md
# Marketplace Starter Notebook for Data Providers
The goal of this notebook is to provide an outline for a common workflow that consumers can use to examine new data shared with them. This is not an exhaustive data exploration presentation. When you create sample notebooks for your own listings, you are encouraged to highlight interesting examples that demonstrate the value of the data.
When you build your notebooks, it's often good to show:
- Schema diagram
- Dataset overview or description (a few paragraphs)
- Minimum of 4-8 queries, preferably:
  - 2 to 4 that explore the data and demonstrate data coverage
  - 2 to 4 that show how to derive insights from the data
%md ## Notebook basics and goals
Notebooks provide cell-by-cell execution of code. **Multiple languages can be mixed in a notebook**. Users can add plots, images, and markdown text to enhance their code. Customers can easily deploy notebooks as production code in Databricks, and notebooks also provide a robust toolset for data exploration, reporting, and dashboarding.

<b>Running a Cell</b>
* Run the cell below using one of the following options:
  * **CTRL+ENTER** or **CTRL+RETURN**
  * **SHIFT+ENTER** or **SHIFT+RETURN** to run the cell and move to the next one
  * Using **Run Cell**, **Run All Above**, or **Run All Below** from the cell's run menu

### Databricks visualizations
When you use the [Databricks built-in visualizations](https://docs.databricks.com/notebooks/visualizations/index.html), you will see tabs above the output, to the right of the data table. To set the default visualization, drag its tab to the leftmost position.

You can also generate [data profiles](https://docs.databricks.com/visualizations/index.html#create-a-new-data-profile), which give users concise summary statistics so they can quickly grasp the structure of the data.

### Goals
**Show:**
- The content in the dataset.
- What differentiates the dataset and makes it significant.
- Data history and coverage.
- Key concepts for each dataset.

**Tip:**
- Use plenty of [markdown content](https://www.markdownguide.org/cheat-sheet/#basic-syntax) to explain what the notebook is doing.

**Inspire:**
- Thought and customer engagement.
%md ## Design Principles
**Simplicity**
- Make sure the notebooks highlight the content, and try to prevent consumers from getting lost in 'tech'.
- Users can change their catalog name: parameterize the catalog name so they only need to make a single change to the code.

**Visuals**
- Use visuals when possible to make the data less boring and enrich the story.
- Along with the native visualizations, Databricks supports many popular visualization libraries such as Seaborn, Plotly, and Matplotlib.
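One lightweight way to implement the catalog-parameterization principle is to define the catalog name once and build fully qualified table names from it. This is a minimal Python sketch: `sample_market_data` is a hypothetical catalog name, and in a real notebook you might instead expose the value with a Databricks widget.

```python
# Consumers only need to change this one line if they rename the catalog.
catalog = "sample_market_data"  # hypothetical catalog name


def table(name: str, schema: str = "default") -> str:
    """Build a fully qualified table name from the single catalog variable."""
    return f"{catalog}.{schema}.{name}"


# Every query references table() instead of hard-coding the catalog.
query = f"select * from {table('dailyprice')} limit 10"
```

With this pattern, renaming the catalog is a single-line change rather than a search-and-replace across every query.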
%md # General structure of notebooks
Your notebooks should have the following components:
1. Title
1. Section 1 - Introduction
1. Section 2 - Exploring the Data
1. Section 3 - Sample Queries
1. Section 4 - Insights
%md #### Dataset Overview
In this section, present only essential information about the dataset, keeping it concise at a maximum of two to three paragraphs. The aim is to give users relevant context about what they are reviewing while avoiding excessive repetition of content already found in the Marketplace listing.
%md #### Schemas (optional but useful)
Share an illustrative image of the schema(s) that visually represents the significant relationships within the data. Upload the images to online storage so they can be referenced in an `img` HTML tag.
%md #### Data and Key Data Elements
This section helps customers understand the data and extract key insights from essential tables. Minimal text and images are fine, but try to enhance interactivity by incorporating example queries whenever feasible.
%md #### Coverage and Summary Statistics
Use this space for some basic coverage-related queries, with visualizations to enhance the story. For example:

**Sector Coverage**:
- Query: What are the top sectors covered in the dataset?
- Visualization: A bar chart showing the distribution of data across different sectors.

**Country Coverage**:
- Query: Which countries are represented in the dataset?
- Visualization: A world map highlighting the countries covered, with color intensity indicating the data density.

**Coverage over Time**:
- Query: How does the dataset's coverage vary over time?
- Visualization: A line chart showing the number of data points or events at each point over a given time period.

**Geographical Coverage**:
- Query: Where does the dataset have the most comprehensive geographical coverage?
- Visualization: A heat map highlighting regions or countries with the highest concentration of data points.

**Demographic Coverage**:
- Query: What demographics are included in the dataset?
- Visualization: A pie chart representing the distribution of demographic groups within the dataset.

**Industry Coverage**:
- Query: Which industries or sub-industries are prominently covered in the dataset?
- Visualization: A stacked bar chart displaying the distribution of data across industries or sub-industries, allowing users to identify the most prevalent ones.

These ideas are a starting point; personalize and expand on them according to the characteristics of your dataset and the coverage criteria relevant to your customers.
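The "sector coverage" idea above boils down to a count per group. Here is a plain-Python sketch of that aggregation using hypothetical toy rows; in a real notebook this would be a `GROUP BY` query over the shared tables, feeding a bar chart.

```python
from collections import Counter

# Hypothetical sample rows standing in for a table of company records.
rows = [
    {"ticker": "AAPL", "sector": "Technology"},
    {"ticker": "MSFT", "sector": "Technology"},
    {"ticker": "JNJ", "sector": "Healthcare"},
]

# Count records per sector -- the numbers behind a coverage bar chart.
sector_coverage = Counter(r["sector"] for r in rows)
```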
%md ### Section 3 - Sample Queries
Use this section for a selection of sample queries that demonstrate practical uses of the data. Begin with straightforward examples and gradually increase the complexity. Aim for three to five sample queries, which provides a well-rounded representation of the data's potential applications.
%md ### Section 4 - Insights
This section is your opportunity to get creative. The primary objective is to go deeper into the data than standard user guides and marketing information, and present customers with genuine insights. Here are some suggested focal points:

**Advanced Queries**:
- Demonstrate complex queries that extract valuable information beyond basic data retrieval.

**Trend Analysis in Data**:
- Explore and visualize trends over time within the dataset, uncovering patterns that might not be apparent at first glance.

**Spark, Python, and Other Advanced Tools**:
- Showcase Spark, Python, or other advanced tools for data manipulation, transformation, and analysis, giving customers more sophisticated techniques.

**Data Science Analysis, such as Correlations**:
- Illustrate data science techniques like correlation analysis to reveal relationships between variables, enabling customers to gain a deeper understanding of the data.

**Integration with Other Data Sources**:
- Highlight how the dataset can be integrated with customers' own data, demonstrating how combining multiple data sources can yield richer insights.

Adapt these ideas to your dataset, and use this section to provide customers with captivating and thought-provoking insights.
%md ## Additional Considerations for Consistency
Consider these best practices when you create notebooks:
- Parameterize the catalog name so the customer only needs to make a single change to update the notebook for their use (should they choose to rename the catalog).
- Be conscious of the data you are using. Don't join data the customer doesn't have access to in your examples.
- When done writing SQL, use the "Format SQL" option in the editor so that all SQL has a similar look and feel (though it will make it more verbose).
- Point to your support site for links that help the customer make sense of any complexities.
- Eliminate unnecessary columns from output.
- Always use an alias for a transformed, aggregated, or cast column for consistency in table output.
- Do not use special characters (*, &, $, %) in notebook or folder names.
%md ## Marketplace data example
In this illustrative notebook, we use a set of basic market data for a small portfolio of stocks. Following the framework outlined above, we walk through the example step by step. We will use the following tables:
- **dailyprice** - Contains the open, high, low, and close prices, as well as the volume and adjusted close price.
- **earnings** - Contains quarterly earnings surprise data.
- **company_profile** - Basic company-level information for a given stock.
%md ### Introduction
In this notebook, we will provide an understanding of the Sample Market Dataset with example queries to help you quickly understand the data and best practices. Additionally, this notebook seeks to demonstrate the range of capabilities within Databricks. We will use mostly SQL and Python in this notebook to highlight the multilanguage capabilities.
<hr>
The Sample Market Dataset provides you with the most recent year of daily price data, 10+ years of earnings data, and basic company-level information (address, website, etc.)
%sql
select
  ticker,
  count(*) as recordCount,
  min(date) as startingDate,
  max(date) as latestDate
from dailyprice
group by ticker
%sql DESCRIBE dailyprice
%sql
select
  industry,
  exchange,
  count(*) as companyCount
from company_profile
group by industry, exchange
%sql
with prc as (
  select
    row_number() over (partition by ticker order by date desc) as reldate,
    *
  from dailyprice
  where ticker = 'AAPL'
)
select
  p1.ticker,
  p1.date,
  p1.adjclose,
  ((p1.adjclose - p2.adjclose) / p2.adjclose) * 100 as percentChange
from prc p1
join prc p2
  on p1.reldate = p2.reldate - 1
order by 4 desc
limit 10
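The day-over-day percent change computed by the self-join above can also be expressed in Python. This is a minimal sketch with made-up adjusted close prices (oldest first), not actual AAPL data:

```python
# Hypothetical adjusted close prices, oldest first.
adjclose = [150.0, 153.0, 151.47]

# Percent change between each consecutive pair of days:
# (today - yesterday) / yesterday * 100.
percent_change = [
    (curr - prev) / prev * 100
    for prev, curr in zip(adjclose, adjclose[1:])
]
```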
%sql
select *
from earnings
where ticker in ('LLY', 'JNJ')
  and year(reportedDate) = 2022
%sql
select
  ticker,
  count(*) as missedEarnings,
  avg(surprise) as averageDifference
from earnings
where surprise < 0
group by ticker
order by ticker
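The missed-earnings aggregation amounts to filtering for negative surprises and averaging them. A pure-Python sketch with hypothetical surprise values for a single ticker:

```python
# Hypothetical earnings-surprise values for one ticker.
surprises = [-0.05, 0.12, -0.01, 0.30, -0.06]

# Keep only misses (negative surprise) and average them.
missed = [s for s in surprises if s < 0]
missed_count = len(missed)
average_difference = sum(missed) / missed_count
```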
%sql
select
  ticker,
  date,
  open,
  high,
  low,
  close,
  vol
from dailyprice
where ticker = 'TSLA'
  and date >= date_add(current_date(), -60)
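The 60-day cutoff used in the TSLA query can be computed the same way in Python with the standard library, which is handy when a query string is built programmatically:

```python
from datetime import date, timedelta

# The earliest date to include: 60 days before today.
cutoff = date.today() - timedelta(days=60)
```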