Tutorial: EDA techniques(Python)

Loading...

🌍🔋 EDA tutorial: Exploring global energy and emissions data 🔍🌱


In this notebook, we dive deep into the world's energy consumption and greenhouse gas emissions data, exploring insights that highlight global trends, regional disparities, and the growing impact of renewable energy. Through a series of data visualizations and analyses, we'll investigate topics such as:

  • Top energy-consuming countries and leaders in renewable energy 🌍🔋
  • Trends in greenhouse gas emissions across regions 🏭🌱
  • Projected global energy demand 📊👥

Join us on this exploration to understand the dynamics of energy usage and its implications for a sustainable future! 🌱

Download the dataset

This tutorial demonstrates EDA techniques by examining global energy and emissions data. To follow along, download the Energy Consumption Dataset by Our World in Data from Kaggle. This tutorial uses the owid-energy-data.csv file.

To import the dataset to your Databricks workspace from this notebook:

  1. Click the folder icon on the left to open the workspace side panel.
  2. Drag and drop the CSV file, owid-energy-data.csv into the panel. Alternatively, you can click the kebab menu, choose Import and select the file.
  3. Click Import. The file should appear in your workspace.

Note the file path for the dataset. You’ll need to use this in the notebook. To copy the file path to your clipboard, right-click on the file name, then select Copy URL/path > Full path.

Load data

Understand your data

Understanding the basics of the dataset is crucial for any data science project. It involves familiarizing oneself with the structure, types, and quality of the data at hand.

In Databricks, you can use the display(df) command to display the dataset. Because the dataset has more than 10,000 rows, this command returns a truncated dataset. On the left of each column, you can see the column’s data type.

6

    Use pandas for data insights

    To understand your dataset effectively, use the following pandas commands:

    • The df.shape command returns the dimensions of the DataFrame, giving you a quick overview of the number of rows and columns.
    • The df.dtypes command provides the data types of each column, helping you understand the kind of data you are dealing with. You can also see the datatype for each column in the results table.
    • The df.describe() command generates descriptive statistics for numerical columns, such as mean, standard deviation, and percentiles, which can help you identify patterns, detect anomalies, and understand the distribution of your data. It also tells you how many rows and columns there are.
    8

      (21812, 130)
      9

        country object year int64 iso_code object population float64 gdp float64 ... wind_elec_per_capita float64 wind_electricity float64 wind_energy_per_capita float64 wind_share_elec float64 wind_share_energy float64 Length: 130, dtype: object
        10

          Generate a data profile

          Databricks notebooks include built-in data profiling capabilities. When viewing a DataFrame with the Databricks display function, you can generate a data profile.

          Run the code below, then click + -> Data Profile next to the Table in the output. This will automatically execute a new command that generates a profile of the data in the data frame. The profile will include summary statistics for numeric, string, and date columns as well as histograms of the value distributions for each column. Note that this command will profile the entire data set in the data frame or SQL query results, not just the portion displayed in the table (which can be truncated).

          12

          Clean the data

          Cleaning data is a vital step in EDA to ensure the dataset is accurate, consistent, and ready for meaningful analysis. This process involves several key tasks to ensure the data is ready for analysis, including:

          • Identifying and removing any duplicate data.
          • Handling missing values, which might involve replacing them with a specific value or removing the affected rows.
          • Standardizing data types (for example, converting strings to datetime) through conversions and transformations to ensure consistency. You may also want to convert data to a format that’s easier for you to work with.

          This cleaning phase is essential as it improves the quality and reliability of the data, enabling more accurate and insightful analysis.

          Tip: Try asking the assistant to help you generate code for data cleaning tasks

          You can use the Databricks Assistant to help you generate code. Create a new code cell and click the "generate" link or use the Assistant icon in the top right to open the Assistant. Enter a query for the Assistant. For example, ask the assistant to describe your data.

          For example, try using the following prompts to quickly generate code to clean the data:

          • Check if df contains any duplicate columns or rows. Print the duplicates. Then, delete the duplicate.
          • What format are date columns in? Change it to 'YYYY-MM-DD'
          • I'm not going to use the XXX column. Delete it.

          (Optional) Ask the Assistant to help you remove duplicate data

          16

          Remove duplicate data

          Handle null or missing values

          20

          Reformat dates

          22

          dtype('int64')

          Explore the data using the Databricks notebook output table

          Databricks provides built-in features to help you explore your data using the output table.

          In a new cell, use display(df) to display the dataset as a table.

          From the output table, you can:

          • Search the data for a specific string or value. Click the search icon on the top right of the table and enter your search.

          • Filter for specific conditions. You can use built-in table filters to filter your columns for specific conditions.

            Tip: Use the assistant to generate filters. Click the filter icon, then type in your desired filter conditions.

          • Create visualizations using the dataset. At the top of the output table, click + > Visualization to open the visualization editor.

            The image below shows how to edit a visualization to add multiple line charts to view the consumption of various renewable energy sources over time.

          24

            Explore and visualize the data using Python libraries

            Exploring data through visualizations is a fundamental aspect of EDA. Visualizations help to uncover patterns, trends, and relationships within the data that might not be immediately apparent through numerical analysis alone. Use libraries such as Plotly or Matplotlib for common visualization techniques including scatter plots, bar charts, line graphs, and histograms. These visual tools allow data scientists to identify anomalies, understand data distributions, and observe correlations between variables. For instance, scatter plots can highlight outliers, while time series plots can reveal trends and seasonality.

            Create an array for unique countries

            27

            array(['ASEAN (Ember)', 'Afghanistan', 'Africa', 'Africa (EI)', 'Africa (EIA)', 'Africa (Ember)', 'Africa (Shift)', 'Albania', 'Algeria', 'American Samoa', 'Angola', 'Antarctica', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba', 'Asia', 'Asia & Oceania (EIA)', 'Asia (Ember)', 'Asia Pacific (EI)', 'Asia and Oceania (Shift)', 'Australia', 'Australia and New Zealand (EIA)', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'British Virgin Islands', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'CIS (EI)', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands', 'Central & South America (EIA)', 'Central African Republic', 'Central America (EI)', 'Central and South America (Shift)', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo', 'Cook Islands', 'Costa Rica', "Cote d'Ivoire", 'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czechia', 'Czechoslovakia', 'Democratic Republic of Congo', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'EU28 (Shift)', 'East Germany', 'East Timor', 'Eastern Africa (EI)', 'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia', 'Eurasia (EIA)', 'Eurasia (Shift)', 'Europe', 'Europe (EI)', 'Europe (EIA)', 'Europe (Ember)', 'Europe (Shift)', 'European Union (27)', 'Falkland Islands', 'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Guiana', 'French Polynesia', 'G20 (Ember)', 'G7 (Ember)', 'Gabon', 'Gambia', 'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland', 'Grenada', 'Guadeloupe', 'Guam', 'Guatemala', 'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', 'High-income countries', 'Honduras', 'Hong Kong', 'Hungary', 'Iceland', 'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati', 'Kosovo', 'Kuwait', 'Kyrgyzstan', 'Laos', 'Latin America and Caribbean (Ember)', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya', 'Lithuania', 'Low-income countries', 'Lower-middle-income countries', 'Luxembourg', 'Macao', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives', 'Mali', 'Malta', 'Martinique', 'Mauritania', 'Mauritius', 'Mexico', 'Micronesia (country)', 'Middle Africa (EI)', 'Middle East (EI)', 'Middle East (EIA)', 'Middle East (Ember)', 'Middle East (Shift)', 'Moldova', 'Mongolia', 'Montenegro', 'Montserrat', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nauru', 'Nepal', 'Netherlands', 'Netherlands Antilles', 'New Caledonia', 'New Zealand', 'Nicaragua', 'Niger', 'Nigeria', 'Niue', 'Non-OECD (EI)', 'Non-OECD (EIA)', 'Non-OPEC (EI)', 'Non-OPEC (EIA)', 'North America', 'North America (EI)', 'North America (Ember)', 'North America (Shift)', 'North Korea', 'North Macedonia', 'Northern Mariana Islands', 'Norway', 'OECD (EI)', 'OECD (EIA)', 'OECD (Ember)', 'OECD (Shift)', 'OPEC (EI)', 'OPEC (EIA)', 'OPEC (Shift)', 'Oceania', 'Oceania (Ember)', 'Oman', 'Pakistan', 'Palestine', 'Panama', 'Papua New Guinea', 'Paraguay', 'Persian Gulf (EIA)', 'Persian Gulf (Shift)', 'Peru', 'Philippines', 'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Reunion', 'Romania', 'Russia', 'Rwanda', 'Saint Helena', 'Saint Kitts and Nevis', 'Saint Lucia', 'Saint Pierre and Miquelon', 'Saint Vincent and the Grenadines', 'Samoa', 'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia', 'Serbia and Montenegro', 'Seychelles', 'Sierra Leone', 'Singapore', 'Slovakia', 'Slovenia', 'Solomon Islands', 'Somalia', 'South Africa', 'South America', 'South Korea', 'South Sudan', 'South and Central America (EI)', 'Spain', 'Sri Lanka', 'Sudan', 'Suriname', 'Sweden', 'Switzerland', 'Syria', 'Taiwan', 'Tajikistan', 'Tanzania', 'Thailand', 'Togo', 'Tonga', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu', 'U.S. Pacific Islands (EIA)', 'U.S. Territories (EIA)', 'USSR', 'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom', 'United States', 'United States Pacific Islands (Shift)', 'United States Territories (Shift)', 'United States Virgin Islands', 'Upper-middle-income countries', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Wake Island (EIA)', 'Wake Island (Shift)', 'West Germany', 'Western Africa (EI)', 'Western Sahara', 'World', 'Yemen', 'Yugoslavia', 'Zambia', 'Zimbabwe'], dtype=object)

            Insight: This tells us that the country column includes various entities, including World, High-income countries, Asia, and United States, that aren’t always directly comparable. It could be more useful to filter the data by region.

            Chart emission trends for top 10 emitters (2000 - 2022)

            Insight:From this chart, we can see that overall greenhouse gas emissions trended upwards from 2000-2022, with the exception of a few countries where emissions were relatively stable with a slight decline over that time frame.

            Bar chart: Filter and chart emissions by region

            33

            Insight: From this chart, we can see that Asia is the region with the highest greenhouse gas emissions. Oceania, South America, and Africa produce the lowest greenhouse gas emissions.

            Calculate and graph renewable energy share growth for top 10 renewable leaders (2000-2022)

            Insight: Norway and Iceland are leading the world in renewable energy, with more than half of their consumption coming from renewable energy. Iceland and Sweden saw the largest growth in their renewable energy share. All countries saw occasional dips and rises, which tells us that renewable energy share isn’t linearly growing. Interestingly, Middle Africa saw a dip in the early 2010s but bounced back in 2020.

            Scatter plot: Show impact of renewable energy on emissions for top 10 emitters

            Insight: As a country uses more renewable energy, it also has more greenhouse gas emissions, meaning that its total energy consumption rises faster than its renewable consumption. North America is an exception in that its greenhouse gas emissions stayed relatively constant throughout the years as its renewable share continued to increase.

            Model projected global energy consumption

            42

            /databricks/python/lib/python3.10/site-packages/statsmodels/tsa/base/tsa_model.py:471: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting. /databricks/python/lib/python3.10/site-packages/statsmodels/tsa/base/tsa_model.py:471: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting. /databricks/python/lib/python3.10/site-packages/statsmodels/tsa/base/tsa_model.py:471: ValueWarning: An unsupported index was provided and will be ignored when e.g. forecasting. /databricks/python/lib/python3.10/site-packages/statsmodels/tsa/base/tsa_model.py:834: ValueWarning: No supported index is available. Prediction results will be given with an integer index beginning at `start`.

            Insight: This model projects that global energy consumption will continue to rise.

            Next steps

            Now that you’ve performed some initial exploratory data analysis on your dataset, try these next steps:

            • See the Appendix section of this notebook for additional EDA visualization examples.
            • If you ran into any errors while going through this tutorial, try using the built-in debugger to step through your code. See Debug notebooks (AWS | Azure | GCP).
            • Share your notebook (AWS | Azure | GCP) with your team so that they can understand your analysis. Depending on what permissions you give them, they can help develop code to further the analysis and/or add comments and suggestions for further investigation.
            • Once you’ve finalized your analysis, create a notebook dashboard (AWS | Azure | GCP) or an AI/BI dashboard (AWS | Azure | GCP)with the key visualizations to share with stakeholders.

            Appendix: Additional examples of EDA visualizations

            Rank the top 20 energy-consuming countries

            47

            Insight: The output includes geographies that aren’t countries. Instead, you can filter the data by region, as we’ll do in other visualization in this tutorial. However, this output still offers some useful insights. For example, the United States consumed more energy than the entire European Union.

            Rank the top 20 countries in renewable energy

            50

            Insight: From this analysis, we can see that Norway and Iceland are leading the world in renewable energy, with more than half of their consumption coming from renewable energy.

            Chart consumption per capita by income group

            53

            Insight: Energy consumption is higher for high-income countries and much lower for low-income countries.

            Calculate and chart energy consumption per capita by region

            56

            Insight: North America consumes the most energy per capita. Africa and Asia have the lowest energy consumption per capita. Even though Asia produced the most greenhouse gas emissions, it had one of the lowest energy consumption per capita. This correlation might be interesting to investigate further.

            Plot global energy consumption per capita over time

            59

            Insight: Generally, energy consumption per capita trends upward over time. We can see that there was a large increase in the 60s and 70s, as well as the early 2000s.

            Scatter plot: Energy consumption growth vs. GDP growth

            62

            Insight: There are a couple of interesting outliers:

            • Kuwait had a large GDP growth with a small growth in energy consumption.
            • Oman and Equatorial Guinea both saw huge growths in energy consumption but had a small growth in GDP.

            It would be interesting to further investigate these outliers and understand what was going on in those countries that could explain these discrepancies.

            Scatter plot: Population growth vs. energy consumption growth

            65

            Insight: Generally, population growth doesn’t have a huge impact on energy consumption. However, both Equatorial Guinea and Oman had a year where they saw a huge jump in energy growth with relatively small population growth.

            Plot energy intensity over time for top economies

            68

            Insight: Generally, energy intensity declined over time for the top economies. Data for aggregated for the world is also missing before ~2015.