Get started: Enhance and cleanse data

This get started article walks you through using a Databricks notebook, with examples in Python, Scala, and R, to cleanse and enhance the New York State baby name data that was previously loaded into a table in Unity Catalog. In this article, you change column names, change capitalization, and spell out the sex of each baby name from the raw data table, and then save the DataFrame into a silver table. Then you filter the data to include only 2021, group the data at the state level, and sort the data by count. Finally, you save this DataFrame into a gold table and visualize the data in a bar chart. For more information on silver and gold tables, see medallion architecture.

Important

This get started article builds on Get started: Ingest and insert additional data. You must complete the steps in that article before you start this one. For the complete notebook for that get started article, see Ingest additional data notebooks.

Requirements

To complete the tasks in this article, you must meet the following requirements:

Your workspace must have Unity Catalog enabled. For information on getting started with Unity Catalog, see Set up and manage Unity Catalog.
You must have the WRITE VOLUME privilege on a volume, the USE SCHEMA privilege on the parent schema, and the USE CATALOG privilege on the parent catalog.
You must have permission to use an existing compute resource or create a new compute resource. See Get started with Databricks or see your Databricks administrator.

Tip

For a completed notebook for this article, see Cleanse and enhance data notebooks.

Step 1: Create a new notebook

To create a notebook in your workspace, click New in the sidebar, and then click Notebook. A blank notebook opens in the workspace.

To learn more about creating and managing notebooks, see Manage notebooks.

Step 2: Define variables

In this step, you define variables for use in the example notebook you create in this article.

Copy and paste the following code into the new empty notebook cell. Replace <catalog_name>, <schema_name>, and <volume_name> with the catalog, schema, and volume names for a Unity Catalog volume (only the R example defines a volume variable). Optionally replace the table_name value with a table name of your choice. You load the raw baby name data from this table later in this article, so it must match the table that you created in the previous article.

Press Shift+Enter to run the cell and create a new blank cell.

Python
catalog = "<catalog_name>"
schema = "<schema_name>"
table_name = "baby_names"
silver_table_name = "baby_names_prepared"
gold_table_name = "top_baby_names_2021"
path_table = catalog + "." + schema
print(path_table) # Show the complete path

Scala
val catalog = "<catalog_name>"
val schema = "<schema_name>"
val tableName = "baby_names"
val silverTableName = "baby_names_prepared"
val goldTableName = "top_baby_names_2021"
val pathTable = s"${catalog}.${schema}"
print(pathTable) // Show the complete path

R
catalog <- "<catalog_name>"
schema <- "<schema_name>"
volume <- "<volume_name>"
table_name <- "baby_names"
silver_table_name <- "baby_names_prepared"
gold_table_name <- "top_baby_names_2021"
path_table <- paste(catalog, ".", schema, sep = "")
print(path_table) # Show the complete path
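
The variables above combine into a three-level Unity Catalog name of the form catalog.schema.table. As a minimal sketch, the following Python helper builds that name and fails early if a placeholder such as <catalog_name> was left unreplaced. The helper name qualified_name and the example values "main" and "default" are assumptions for illustration and are not part of the original notebook.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return catalog.schema.table, rejecting empty or unreplaced <placeholder> values."""
    parts = [catalog, schema, table]
    for part in parts:
        # Placeholders in this article look like <catalog_name>.
        if not part or "<" in part or ">" in part:
            raise ValueError(f"Replace the placeholder value: {part!r}")
    return ".".join(parts)


# Hypothetical example values, not taken from this article.
print(qualified_name("main", "default", "baby_names"))
```

Calling the helper with the values from the cell above would surface a forgotten placeholder before any table is read or written.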

Step 3: Load the raw data into a new DataFrame

In this step, you load the raw data that was previously saved to a Delta table into a new DataFrame, in preparation for cleansing and enhancing the data for further analysis.

Copy and paste the following code into the new empty notebook cell.

Python
df_raw = spark.read.table(f"{path_table}.{table_name}")
display(df_raw)

Scala
val dfRaw = spark.read.table(s"${pathTable}.${tableName}")
display(dfRaw)

R
# Load the SparkR package that is already preinstalled on the cluster.
library(SparkR)
df_raw <- sql(paste0("SELECT * FROM ", path_table, ".", table_name))
display(df_raw)

Press Shift+Enter to run the cell and then move to the next cell.
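
Before transforming the data, it can help to confirm that the loaded DataFrame has the columns that the later steps rely on. The sketch below is a plain-Python check that runs without Spark. The helper name missing_columns is hypothetical, the expected list is assumed from the columns referenced in Steps 4 and 5 (Year, First_Name, Sex, Count), and the sample list, including the extra County column, is made up for illustration.

```python
# Columns referenced by the later steps of this article.
EXPECTED_COLUMNS = ["Year", "First_Name", "Sex", "Count"]


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `actual`, in order."""
    present = set(actual)
    return [name for name in expected if name not in present]


# In the notebook you would pass df_raw.columns instead of this made-up list.
sample_columns = ["Year", "First_Name", "County", "Sex", "Count"]
print(missing_columns(sample_columns))
```

An empty list means every column used later is present. Extra columns are ignored.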

Step 4: Cleanse and enhance raw data and save

In this step, you rename the Year column to Year_Of_Birth, convert the values in the First_Name column to initial capitals, spell out the values in the Sex column, and then save the DataFrame to a new table.

Copy and paste the following code into an empty notebook cell.

Python
from pyspark.sql.functions import col, initcap, when

# Rename "Year" column to "Year_Of_Birth"
df_rename_year = df_raw.withColumnRenamed("Year", "Year_Of_Birth")

# Change the case of "First_Name" column to initcap
df_init_caps = df_rename_year.withColumn("First_Name", initcap(col("First_Name").cast("string")))

# Update column values from "M" to "Male" and "F" to "Female"
df_baby_names_sex = df_init_caps.withColumn(
    "Sex",
    when(col("Sex") == "M", "Male")
    .when(col("Sex") == "F", "Female")
)

# Display the data
display(df_baby_names_sex)

# Save DataFrame to table
df_baby_names_sex.write.mode("overwrite").saveAsTable(f"{path_table}.{silver_table_name}")

Scala
import org.apache.spark.sql.functions.{col, initcap, when}

// Rename "Year" column to "Year_Of_Birth"
val dfRenameYear = dfRaw.withColumnRenamed("Year", "Year_Of_Birth")

// Change the case of "First_Name" data to initial caps
val dfNameInitCaps = dfRenameYear.withColumn("First_Name", initcap(col("First_Name").cast("string")))

// Update column values from "M" to "Male" and "F" to "Female"
val dfBabyNamesSex = dfNameInitCaps.withColumn("Sex",
    when(col("Sex") equalTo "M", "Male")
    .when(col("Sex") equalTo "F", "Female"))

// Display the data
display(dfBabyNamesSex)

// Save DataFrame to a table
dfBabyNamesSex.write.mode("overwrite").saveAsTable(s"${pathTable}.${silverTableName}")

R
# Rename "Year" column to "Year_Of_Birth"
df_rename_year <- withColumnRenamed(df_raw, "Year", "Year_Of_Birth")

# Change the case of "First_Name" data to initial caps
df_init_caps <- withColumn(df_rename_year, "First_Name", initcap(df_rename_year$First_Name))

# Update column values from "M" to "Male" and "F" to "Female"
df_baby_names_sex <- withColumn(df_init_caps, "Sex",
                                ifelse(df_init_caps$Sex == "M", "Male",
                                      ifelse(df_init_caps$Sex == "F", "Female", df_init_caps$Sex)))
# Display the data
display(df_baby_names_sex)

# Save DataFrame to a table
# paste0 concatenates without separators; paste would insert spaces into the table name
saveAsTable(df_baby_names_sex, paste0(path_table, ".", silver_table_name), mode = "overwrite")

Press Shift+Enter to run the cell and then move to the next cell.
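
To make the effect of the three transformations concrete, the following plain-Python sketch applies them to a single made-up row without Spark. It is an illustration under stated assumptions, not the notebook's implementation: init_cap and cleanse_row are hypothetical helpers, init_cap only approximates Spark's initcap for space-separated words, and the sample row is invented.

```python
def init_cap(text: str) -> str:
    """Capitalize the first letter of each space-separated word and lowercase the rest."""
    return " ".join(word.capitalize() for word in text.split(" "))


SEX_LABELS = {"M": "Male", "F": "Female"}


def cleanse_row(row: dict) -> dict:
    """Apply the Step 4 rename, capitalization, and sex-label changes to one row."""
    return {
        "Year_Of_Birth": row["Year"],  # renamed from "Year"
        "First_Name": init_cap(row["First_Name"]),
        # Codes other than "M" or "F" are kept unchanged here, as in the R example.
        # The Python and Scala `when` chains have no otherwise clause, so they
        # produce null for such codes instead.
        "Sex": SEX_LABELS.get(row["Sex"], row["Sex"]),
        "Count": row["Count"],
    }


# Made-up sample row for illustration only.
raw_row = {"Year": 2021, "First_Name": "OLIVIA", "Sex": "F", "Count": 16}
print(cleanse_row(raw_row))
```

The comment on unmatched codes points at a small behavioral difference between the language examples that is worth keeping in mind if the raw data contains values other than M and F.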

Step 5: Group and visualize data

In this step, you filter the data to only the year 2021, group the data by sex and name, aggregate by count, and order by count. You then save the DataFrame to a table and then visualize the data in a bar chart.

Copy and paste the following code into an empty notebook cell.

Python
# Note: this import shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import expr, sum, desc

# Total count of each name for the entire state of New York in 2021, by sex
df_baby_names_2021_grouped = (
    df_baby_names_sex
    .filter(expr("Year_Of_Birth == 2021"))
    .groupBy("Sex", "First_Name")
    .agg(sum("Count").alias("Total_Count"))
    .sort(desc("Total_Count"))
)

# Display data
display(df_baby_names_2021_grouped)

# Save DataFrame to a table
df_baby_names_2021_grouped.write.mode("overwrite").saveAsTable(f"{path_table}.{gold_table_name}")

Scala
import org.apache.spark.sql.functions.{expr, sum, desc}

// Total count of each name for the entire state of New York in 2021, by sex
val dfBabyNames2021Grouped = dfBabyNamesSex
  .filter(expr("Year_Of_Birth == 2021"))
  .groupBy("Sex", "First_Name")
  .agg(sum("Count").alias("Total_Count"))
  .sort(desc("Total_Count"))

// Display data
display(dfBabyNames2021Grouped)

// Save DataFrame to a table
dfBabyNames2021Grouped.write.mode("overwrite").saveAsTable(s"${pathTable}.${goldTableName}")

R
# Filter to only 2021 data
df_baby_names_2021 <- filter(df_baby_names_sex, df_baby_names_sex$Year_Of_Birth == 2021)

# Count of names for entire state of New York by sex
df_baby_names_grouped <- agg(
  groupBy(df_baby_names_2021, df_baby_names_2021$Sex, df_baby_names_2021$First_Name),
  Total_Count = sum(df_baby_names_2021$Count)
)

# Sort by total count, highest first
df_baby_names_2021_grouped <- arrange(df_baby_names_grouped, desc(df_baby_names_grouped$Total_Count))

# Display data
display(df_baby_names_2021_grouped)

# Save DataFrame to a table (paste0 avoids the spaces that paste would insert)
saveAsTable(df_baby_names_2021_grouped, paste0(path_table, ".", gold_table_name), mode = "overwrite")

Press Ctrl+Enter to run the cell.

1. Next to the Table tab, click + and then click Visualization.
2. In the visualization editor, click Visualization Type, and verify that Bar is selected.
3. In the X column, select First_Name.
4. Click Add column under Y columns and then select Total_Count.
5. In Group by, select Sex.
6. Click Save.
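
The filter, group, sum, and sort sequence in this step can be sketched in plain Python to show what the gold table contains. This is an illustration that runs without Spark, not the notebook's code: the helper name top_names is hypothetical and the sample rows are made up. Each input row stands for one lower-level record, so the same name can appear more than once for a given year.

```python
from collections import defaultdict


def top_names(rows, year):
    """Sum Count per (Sex, First_Name) for one year, sorted by total, highest first."""
    totals = defaultdict(int)
    for row in rows:
        if row["Year_Of_Birth"] == year:  # filter to the requested year
            totals[(row["Sex"], row["First_Name"])] += row["Count"]  # group and sum
    return sorted(
        ((sex, name, total) for (sex, name), total in totals.items()),
        key=lambda item: item[2],
        reverse=True,
    )


# Made-up sample rows for illustration only.
sample_rows = [
    {"Year_Of_Birth": 2021, "First_Name": "Olivia", "Sex": "Female", "Count": 10},
    {"Year_Of_Birth": 2021, "First_Name": "Olivia", "Sex": "Female", "Count": 7},
    {"Year_Of_Birth": 2021, "First_Name": "Liam", "Sex": "Male", "Count": 12},
    {"Year_Of_Birth": 2020, "First_Name": "Liam", "Sex": "Male", "Count": 30},
]
for sex, name, total in top_names(sample_rows, 2021):
    print(sex, name, total)
```

The 2020 row is excluded by the filter, and the two 2021 rows for the same name collapse into a single total, which mirrors how the Total_Count column is produced.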

Cleanse and enhance data notebooks

Use one of the following notebooks to perform the steps in this article. Replace <catalog_name>, <schema_name>, and <volume_name> with the catalog, schema, and volume names for a Unity Catalog volume. Optionally replace the table_name value with a table name of your choice.

Cleanse and enhance data using Python

Cleanse and enhance data using Scala

Cleanse and enhance data using R
