Get started: Enhance and cleanse data

This get started article walks you through using a Databricks notebook to cleanse and enhance the New York State baby name data that was previously loaded into a table in Unity Catalog by using Python, Scala, and R. In this article, you change column names, change capitalization, and spell out the sex of each baby name from the raw data table - and then save the DataFrame into a silver table. Then you filter the data to only include data for 2021, group the data at the state level, and then sort the data by count. Finally, you save this DataFrame into a gold table and visualize the data in a bar chart. For more information on silver and gold tables, see medallion architecture.

Important

This get started article builds on Get started: Ingest and insert additional data. You must complete the steps in that article to complete this article. For the complete notebook for that getting started article, see Ingest additional data notebooks.

Requirements

To complete the tasks in this article, you must meet the following requirements:

Tip

For a completed notebook for this article, see Cleanse and enhance data notebooks.

Step 1: Create a new notebook

To create a notebook in your workspace:

  1. Click New Icon New in the sidebar, and then click Notebook.

  2. On the Create Notebook page:

    • Specify a unique name for your notebook.

    • Set the default language for your notebook and then click Confirm if prompted.

    • Click Connect and select a compute resource. To create a new compute resource, see Use compute.

To learn more about creating and managing notebooks, see Manage notebooks.

Step 2: Define variables

In this step, you define variables for use in the example notebook you create in this article.

  1. Copy and paste the following code into the new empty notebook cell. Replace <catalog-name>, <schema-name>, and <volume-name> with the catalog, schema, and volume names for a Unity Catalog volume. Replace <table_name> with a table name of your choice. You will save the baby name data into this table later in this article.

  2. Press Shift+Enter to run the cell and create a new blank cell.

    catalog = "<catalog_name>"
    schema = "<schema_name>"
    table_name = "baby_names"
    silver_table_name = "baby_names_prepared"
    gold_table_name = "top_baby_names_2021"
    path_table = catalog + "." + schema
    print(path_table) # Show the complete path
    
    val catalog = "<catalog_name>"
    val schema = "<schema_name>"
    val tableName = "baby_names"
    val silverTableName = "baby_names_prepared"
    val goldTableName = "top_baby_names_2021"
    val pathTable = s"${catalog}.${schema}"
    print(pathTable) // Show the complete path
    
    catalog <- "<catalog_name>"
    schema <- "<schema_name>"
    volume <- "<volume_name>"
    table_name <- "baby_names"
    silver_table_name <- "baby_names_prepared"
    gold_table_name <- "top_baby_names_2021"
    path_table <- paste(catalog, ".", schema, sep = "")
    print(path_table) # Show the complete path
    

Step 3: Load the raw data into a new DataFrame

This step loads the raw data previously saved into a Delta table into a new DataFrame in preparation for cleansing and enhancing this data for further analysis.

  1. Copy and paste the following code into the new empty notebook cell.

    df_raw = spark.read.table(f"{path_table}.{table_name}")
    display(df_raw)
    
    val dfRaw = spark.read.table(s"${pathTable}.${tableName}")
    display(dfRaw)
    
    # Load the SparkR package that is already preinstalled on the cluster.
    library(SparkR)
    df_raw = sql(paste0("SELECT * FROM ", path_table, ".", table_name))
    display(df_raw)
    
  2. Press Shift+Enter to run the cell and then move to the next cell.

Step 4: Cleanse and enhance raw data and save

In this step, you change the name of the Year column, change the data in the First_Name column to initial capitals, and update the values for the sex to spell of the sex, and then save the DataFrame to a new table.

  1. Copy and paste the following code into an empty notebook cell.

    from pyspark.sql.functions import col, initcap, when
    
    # Rename "Year" column to "Year_Of_Birth"
    df_rename_year = df_raw.withColumnRenamed("Year", "Year_Of_Birth")
    
    # Change the case of "First_Name" column to initcap
    df_init_caps = df_rename_year.withColumn("First_Name", initcap(col("First_Name").cast("string")))
    
    # Update column values from "M" to "male" and "F" to "female"
    df_baby_names_sex = df_init_caps.withColumn(
    "Sex",
        when(col("Sex") == "M", "Male")
        .when(col("Sex") == "F", "Female")
    )
    
    # display
    display(df_baby_names_sex)
    
    # Save DataFrame to table
    df_baby_names_sex.write.mode("overwrite").saveAsTable(f"{path_table}.{silver_table_name}")
    
    import org.apache.spark.sql.functions.{col, initcap, when}
    
    // Rename "Year" column to "Year_Of_Birth"
    val dfRenameYear = dfRaw.withColumnRenamed("Year", "Year_Of_Birth")
    
    // Change the case of "First_Name" data to initial caps
    val dfNameInitCaps = dfRenameYear.withColumn("First_Name", initcap(col("First_Name").cast("string")))
    
    // Update column values from "M" to "Male" and "F" to "Female"
    val dfBabyNamesSex = dfNameInitCaps.withColumn("Sex",
        when(col("Sex") equalTo "M", "Male")
        .when(col("Sex") equalTo "F", "Female"))
    
    // Display the data
    display(dfBabyNamesSex)
    
    // Save DataFrame to a table
    dfBabyNamesSex.write.mode("overwrite").saveAsTable(s"${pathTable}.${silverTableName}")
    
    # Rename "Year" column to "Year_Of_Birth"
    df_rename_year <- withColumnRenamed(df_raw, "Year", "Year_Of_Birth")
    
    # Change the case of "First_Name" data to initial caps
    df_init_caps <- withColumn(df_rename_year, "First_Name", initcap(df_rename_year$First_Name))
    
    # Update column values from "M" to "Male" and "F" to "Female"
    df_baby_names_sex <- withColumn(df_init_caps, "Sex",
                                    ifelse(df_init_caps$Sex == "M", "Male",
                                          ifelse(df_init_caps$Sex == "F", "Female", df_init_caps$Sex)))
    # Display the data
    display(df_baby_names_sex)
    
    # Save DataFrame to a table
    saveAsTable(df_baby_names_sex, paste(path_table, ".", silver_table_name), mode = "overwrite")
    
  2. Press Shift+Enter to run the cell and then move to the next cell.

Step 5: Group and visualize data

In this step, you filter the data to only the year 2021, group the data by sex and name, aggregate by count, and order by count. You then save the DataFrame to a table and then visualize the data in a bar chart.

  1. Copy and paste the following code into an empty notebook cell.

    from pyspark.sql.functions import expr, sum, desc
    from pyspark.sql import Window
    
    # Count of names for entire state of New York by sex
    df_baby_names_2021_grouped=(df_baby_names_sex
    .filter(expr("Year_Of_Birth == 2021"))
    .groupBy("Sex", "First_Name")
    .agg(sum("Count").alias("Total_Count"))
    .sort(desc("Total_Count")))
    
    # Display data
    display(df_baby_names_2021_grouped)
    
    # Save DataFrame to a table
    df_baby_names_2021_grouped.write.mode("overwrite").saveAsTable(f"{path_table}.{gold_table_name}")
    
    import org.apache.spark.sql.functions.{expr, sum, desc}
    import org.apache.spark.sql.expressions.Window
    
    // Count of male and female names for entire state of New York by sex
    val dfBabyNames2021Grouped = dfBabyNamesSex
      .filter(expr("Year_Of_Birth == 2021"))
      .groupBy("Sex", "First_Name")
      .agg(sum("Count").alias("Total_Count"))
      .sort(desc("Total_Count"))
    
    // Display data
    display(dfBabyNames2021Grouped)
    
    // Save DataFrame to a table
    dfBabyNames2021Grouped.write.mode("overwrite").saveAsTable(s"${pathTable}.${goldTableName}")
    
    # Filter to only 2021 data
    df_baby_names_2021 <- filter(df_baby_names_sex, df_baby_names_sex$Year_Of_Birth == 2021)
    
    # Count of names for entire state of New York by sex
    df_baby_names_grouped <- agg(
      groupBy(df_baby_names_2021, df_baby_names_2021$Sex, df_baby_names_2021$First_Name),
      Total_Count = sum(df_baby_names_2021$Count)
    )
    # Display data
    display(arrange(select(df_baby_names_grouped, df_baby_names_grouped$Sex, df_baby_names_grouped$First_Name, df_baby_names_grouped$Total_Count), desc(df_baby_names_grouped$Total_Count)))
    
    # Save DataFrame to a table
    saveAsTable(df_baby_names_2021_grouped, paste(path_table, ".", gold_table_name), mode = "overwrite")
    
  2. Press Ctrl+Enter to run the cell.

    1. Next to the Table tab, click + and then click Visualization.

  3. In the visualization editor, click Visualization Type, and verify that Bar is selected.

  4. In the X column, select`First_Name`.

  5. Click Add column under Y columns and then select Total_Count.

  6. In Group by, select Sex.

    gold table
  7. Click Save.

Cleanse and enhance data notebooks

Use one of the following notebooks to perform the steps in this article.

Cleanse and enhance data using Python

Open notebook in new tab

Cleanse and enhance data using Scala

Open notebook in new tab

Cleanse and enhance data using R

Open notebook in new tab