Get started: Import and visualize CSV data from a notebook

This get-started article walks you through using a Databricks notebook to import data from a CSV file containing baby name data from health.data.ny.gov into your Unity Catalog volume using Python, Scala, or R. You also learn to rename a column, visualize the data, and save the data to a table.

Requirements

To complete the tasks in this article, you must meet the following requirements:

Tip

For a completed notebook for this article, see Import and visualize data notebook.

Step 1: Create a new notebook

To create a notebook in your workspace:

  1. Click New Icon New in the sidebar, and then click Notebook.

  2. On the Create Notebook page:

    • Specify a unique name for your notebook.

    • Set the default language for your notebook and then click Confirm if prompted.

    • Click Connect and select a compute resource. To create a new compute resource, see Use compute.

To learn more about creating and managing notebooks, see Manage notebooks.

Step 2: Define variables

In this step, you define variables for use in the example notebook you create in this article.

  1. Copy and paste the following code into the new empty notebook cell. Replace <catalog_name>, <schema_name>, and <volume_name> with the catalog, schema, and volume names for a Unity Catalog volume. Optionally, replace the value of table_name with a table name of your choice. You will save the baby name data into this table later in this article.

  2. Press Shift+Enter to run the cell and create a new blank cell.

    catalog = "<catalog_name>"
    schema = "<schema_name>"
    volume = "<volume_name>"
    download_url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
    file_name = "baby_names.csv"
    table_name = "baby_names"
    path_volume = "/Volumes/" + catalog + "/" + schema + "/" + volume
    path_tables = catalog + "." + schema
    print(path_tables) # Show the complete path
    print(path_volume) # Show the complete path
    
    val catalog = "<catalog_name>"
    val schema = "<schema_name>"
    val volume = "<volume_name>"
    val download_url = "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
    val file_name = "baby_names.csv"
    val table_name = "baby_names"
    val path_volume = s"/Volumes/$catalog/$schema/$volume"
    val path_tables = s"$catalog.$schema"
    println(path_volume) // Show the complete path
    println(path_tables) // Show the complete path
    
    catalog <- "<catalog_name>"
    schema <- "<schema_name>"
    volume <- "<volume_name>"
    download_url <- "https://health.data.ny.gov/api/views/jxy9-yhdk/rows.csv"
    file_name <- "baby_names.csv"
    table_name <- "baby_names"
    path_volume <- paste("/Volumes/", catalog, "/", schema, "/", volume, sep = "")
    path_tables <- paste(catalog, ".", schema, sep = "")
    print(path_volume) # Show the complete path
    print(path_tables) # Show the complete path
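
If you want to check the path logic outside a notebook, the same strings can be built with plain Python f-strings. In this sketch, the catalog, schema, and volume values are hypothetical placeholders, not values from your workspace:

```python
# Hypothetical placeholder values -- replace with your own catalog, schema, and volume.
catalog = "main"
schema = "default"
volume = "my_volume"

# Build the volume path and the two-level table namespace the same way the notebook does.
path_volume = f"/Volumes/{catalog}/{schema}/{volume}"
path_tables = f"{catalog}.{schema}"

print(path_volume)  # /Volumes/main/default/my_volume
print(path_tables)  # main.default
```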
    

Step 3: Import CSV file

In this step, you import a CSV file containing baby name data from health.data.ny.gov into your Unity Catalog volume.

  1. Copy and paste the following code into the new empty notebook cell. This code copies the rows.csv file from health.data.ny.gov into your Unity Catalog volume using the Databricks dbutils command.

  2. Press Shift+Enter to run the cell and then move to the next cell.

    dbutils.fs.cp(download_url, f"{path_volume}/{file_name}")
    
    dbutils.fs.cp(download_url, s"$path_volume/$file_name")
    
    dbutils.fs.cp(download_url, paste(path_volume, "/", file_name, sep = ""))
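
dbutils is only available inside Databricks, but dbutils.fs.cp behaves like an ordinary file copy from a source to a destination path. A minimal local analogue using only the Python standard library (the temp directories and the sample rows here are hypothetical stand-ins, not the real download):

```python
import os
import shutil
import tempfile

# Stand-in source and destination directories (in Databricks, the destination
# would be a /Volumes/... path instead).
src_dir = tempfile.mkdtemp()
dst_dir = tempfile.mkdtemp()

# Write a tiny hypothetical CSV to act as the downloaded file.
src = os.path.join(src_dir, "baby_names.csv")
with open(src, "w") as f:
    f.write("Year,First Name,County,Sex,Count\n2007,GAVIN,ULSTER,M,27\n")

# Copy it to the destination, analogous to dbutils.fs.cp(download_url, volume_path).
dst = os.path.join(dst_dir, "baby_names.csv")
shutil.copy(src, dst)
print(os.path.exists(dst))  # True
```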
    

Step 4: Load CSV data into a DataFrame

In this step, you create a DataFrame named df from the CSV file that you previously loaded into your Unity Catalog volume by using the spark.read.csv method.

  1. Copy and paste the following code into the new empty notebook cell. This code loads baby name data into DataFrame df from the CSV file.

  2. Press Shift+Enter to run the cell and then move to the next cell.

    df = spark.read.csv(f"{path_volume}/{file_name}",
      header=True,
      inferSchema=True,
      sep=",")
    
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", ",")
      .csv(s"$path_volume/$file_name")
    
    # Load the SparkR package that is already preinstalled on the cluster.
    library(SparkR)
    
    df <- read.df(paste(path_volume, "/", file_name, sep=""),
      source="csv",
      header = TRUE,
      inferSchema = TRUE,
      delimiter = ",")
    

You can load data from many supported file formats.
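Conceptually, header=True treats the first row of the CSV as column names, and inferSchema=True parses string values into typed columns. A plain-Python sketch of the same idea using the csv module (the sample rows are hypothetical, and only two columns are type-converted for brevity):

```python
import csv
import io

# A tiny hypothetical CSV: the first row supplies the column names ("header"),
# and the numeric columns are converted from strings ("schema inference").
sample = "Year,First Name,Count\n2007,GAVIN,27\n2007,SOPHIA,18\n"
reader = csv.DictReader(io.StringIO(sample))
rows = [{**r, "Year": int(r["Year"]), "Count": int(r["Count"])} for r in reader]

print(rows[0])  # {'Year': 2007, 'First Name': 'GAVIN', 'Count': 27}
```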

Step 5: Visualize data from notebook

In this step, you use the display() method to display the contents of the DataFrame in a table in the notebook, and then visualize the data in a word cloud chart in the notebook.

  1. Copy and paste the following code into the new empty notebook cell, and then click Run cell to display the data in a table.

    display(df)
    
    display(df)
    
    display(df)
    
  2. Review the results in the table.

  3. Next to the Table tab, click + and then click Visualization.

  4. In the visualization editor, click Visualization Type, and verify that Word cloud is selected.

  5. In the Words column, verify that First Name is selected.

  6. In Frequencies limit, enter 35.

    (Word cloud chart of first-name frequencies)
  7. Click Save.

Step 6: Save the DataFrame to a table

Important

To save your DataFrame in Unity Catalog, you must have CREATE TABLE privileges on the catalog and schema. For information on permissions in Unity Catalog, see Privileges and securable objects in Unity Catalog, Manage privileges in Unity Catalog, and Manage Unity Catalog permissions in Catalog Explorer.

  1. Copy and paste the following code into an empty notebook cell. This code replaces the space in a column name with an underscore before writing the DataFrame to a table, because special characters, such as spaces, are not allowed in column names. This code uses the Apache Spark withColumnRenamed() method.

    df = df.withColumnRenamed("First Name", "First_Name")
    df.printSchema()
    
    val df_Renamed = df.withColumnRenamed("First Name", "First_Name")
    // when modifying a DataFrame in Scala, you must assign it to a new variable
    df_Renamed.printSchema()
    
    df <- withColumnRenamed(df, "First Name", "First_Name")
    printSchema(df)
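
The same idea generalizes: any character that is not allowed in a column name can be replaced with an underscore. A minimal plain-Python sketch of that rule (the sanitize helper is hypothetical, not a Spark function):

```python
import re

def sanitize(name: str) -> str:
    """Hypothetical helper: replace every character that is not a letter,
    digit, or underscore with an underscore, so the result is safe to use
    as a table column name."""
    return re.sub(r"\W", "_", name)

print(sanitize("First Name"))  # First_Name
```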
    
  2. Copy and paste the following code into an empty notebook cell. This code saves the contents of the DataFrame to a table in Unity Catalog using the table name variable that you defined at the start of this article.

    df.write.saveAsTable(f"{path_tables}.{table_name}")
    
    # To overwrite an existing table, use the following code:
    # df.write.mode("overwrite").saveAsTable(f"{path_tables}.{table_name}")
    
    df_Renamed.write.saveAsTable(s"$path_tables.$table_name")
    
    // To overwrite an existing table, use the following code:
    // df_Renamed.write.mode("overwrite").saveAsTable(s"$path_tables.$table_name")
    
    saveAsTable(df, paste(path_tables, ".", table_name, sep = ""))
    # To overwrite an existing table, use the following code:
    # saveAsTable(df, paste(path_tables, ".", table_name, sep = ""), mode = "overwrite")
    
  3. To verify that the table was saved, click Catalog in the left sidebar to open the Catalog Explorer UI. Open your catalog and then your schema to verify that the table appears.

  4. Click your table to view the table schema on the Overview tab.

  5. Click Sample Data to view 100 rows of data from the table.
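
saveAsTable raises an error by default when the target table already exists, which is why the overwrite variants above set the write mode. A toy in-memory sketch of that error-versus-overwrite behavior (the save_as_table helper and the table registry are hypothetical, not part of the Spark API):

```python
# Toy in-memory stand-in for a table catalog, to illustrate saveAsTable's
# default error-if-exists behavior versus mode("overwrite").
tables = {}

def save_as_table(name, data, mode="errorifexists"):
    if name in tables and mode != "overwrite":
        raise ValueError(f"Table {name} already exists")
    tables[name] = data

save_as_table("main.default.baby_names", [("GAVIN", 27)])
# Saving again without overwrite would raise; with overwrite it replaces the data.
save_as_table("main.default.baby_names", [("SOPHIA", 18)], mode="overwrite")
print(tables["main.default.baby_names"])  # [('SOPHIA', 18)]
```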

Import and visualize data notebook

Use the following notebooks to perform the steps in this article.

Import data from CSV using Python notebook

Import data from CSV using Scala notebook

Import data from CSV using R notebook