bamboolib

Preview

This feature is in Public Preview.

Note

bamboolib is supported in Databricks Runtime 11.0 and above.

bamboolib is a user interface component that allows no-code data analysis and transformations from within a Databricks notebook. bamboolib helps users more easily work with their data and speeds up common data wrangling, exploration, and visualization tasks. As users complete these kinds of tasks with their data, bamboolib automatically generates Python code in the background. Users can share this code with others, who can run this code in their own notebooks to quickly reproduce those original tasks. They can also use bamboolib to extend those original tasks with additional data tasks, all without needing to know how to code. Those who are experienced with coding can extend this code to create even more sophisticated results.

Behind the scenes, bamboolib uses ipywidgets, which is an interactive HTML widget framework for the IPython kernel. ipywidgets runs inside of the IPython kernel.

Requirements

Quickstart

  1. Create a Python notebook.

  2. Attach the notebook to a cluster that meets the requirements.

  3. In the notebook’s first cell, enter the following code, and then run the cell. This step can be skipped if bamboolib is already installed in the workspace or cluster.

    %pip install bamboolib
    
  4. In the notebook’s second cell, enter the following code, and then run the cell.

    import bamboolib as bam
    
  5. In the notebook’s third cell, enter the following code, and then run the cell.

    bam
    

    Note

    Alternatively, you can print an existing pandas DataFrame to display bamboolib for use with that specific DataFrame.

  6. Continue with key tasks.

Walkthroughs

You can use bamboolib by itself or with an existing pandas DataFrame.

Use bamboolib by itself

In this walkthrough, you use bamboolib to display in your notebook the contents of an example sales data set. You then experiment with some of the related notebook code that bamboolib automatically generates for you. You finish by querying and sorting a copy of the sales data set’s contents.

  1. Create a Python notebook.

  2. Attach the notebook to a cluster that meets the requirements.

  3. In the notebook’s first cell, enter the following code, and then run the cell. This step can be skipped if bamboolib is already installed in the workspace or cluster.

    %pip install bamboolib
    
  4. In the notebook’s second cell, enter the following code, and then run the cell.

    import bamboolib as bam
    
  5. In the notebook’s third cell, enter the following code, and then run the cell.

    bam
    
  6. Click Load dummy data.

  7. In the Load dummy data pane, for Load a dummy data set for testing bamboolib, select Sales dataset.

  8. Click Execute.

  9. Display all of the rows where item_type is Baby Food:

    1. In the Search actions list, select Filter rows.

    2. In the Filter rows pane, in the Choose list (above where), select Select rows.

    3. In the list below where, select item_type.

    4. In the Choose list next to item_type, select has value(s).

    5. In the Choose value(s) box next to has value(s), select Baby Food.

    6. Click Execute.

  10. Copy the automatically generated Python code for this query:

    1. Cick Copy Code below the data preview.

  11. Paste and modify the code:

    1. In the notebook’s fourth cell, paste the code that you copied. It should look like this:

      import pandas as pd
      df = pd.read_csv(bam.sales_csv)
      # Step: Keep rows where item_type is one of: Baby Food
      df = df.loc[df['item_type'].isin(['Baby Food'])]
      
    2. Add to this code so that it displays only those rows where order_prio is C, and then run the cell:

      import pandas as pd
      df = pd.read_csv(bam.sales_csv)
      # Step: Keep rows where item_type is one of: Baby Food
      df = df.loc[df['item_type'].isin(['Baby Food'])]
      
      # Add the following code.
      # Step: Keep rows where order_prio is one of: C
      df = df.loc[df['order_prio'].isin(['C'])]
      df
      

    Tip

    Instead of writing this code, you can also do the same thing by just using bamboolib in the third cell to display only those rows where order_prio is C. This step is an example of extending the code that bamboolib automatically generated earlier.

  12. Sort the rows by region in ascending order:

    1. In the widget within the fourth cell, in the Search actions list, select Sort rows.

    2. In the Sort column(s) pane, in the Choose column list, select region.

    3. In the list next to region, select ascending (A-Z).

    4. Click Execute.

    Note

    This is equivalent to writing the following code yourself:

    df = df.sort_values(by=['region'], ascending=[True])
    df
    

    You could have also just used bamboolib in the third cell to sort the rows by region in ascending order. This step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it automatically generates the additional code for you in the background, so that you can further extend your already-extended code!

  13. Continue with key tasks.

Use bamboolib with an existing DataFrame

In this walkthrough, you use bamboolib to display in your notebook the contents of a pandas DataFrame. This DataFrame contains a copy of an example sales data set. You then experiment with some of the related notebook code that bamboolib automatically generates for you. You finish by querying and sorting some of the DataFrame’s contents.

  1. Create a Python notebook.

  2. Attach the notebook to a cluster that meets the requirements.

  3. In the notebook’s first cell, enter the following code, and then run the cell. This step can be skipped if bamboolib is already installed in the workspace or cluster.

    %pip install bamboolib
    
  4. In the notebook’s second cell, enter the following code, and then run the cell.

    import bamboolib as bam
    
  5. In the notebook’s third cell, enter the following code, and then run the cell.

    import pandas as pd
    
    df = pd.read_csv(bam.sales_csv)
    df
    

    Note that bamboolib only supports pandas DataFrames. To convert a PySpark DataFrame to a pandas DataFrame, call toPandas on the PySpark DataFrame. To convert a Pandas API on Spark DataFrame to a pandas DataFrame, call to_pandas on the Pandas API on Spark DataFrame.

  6. Click Show bamboolib UI.

  7. Display all of the rows where item_type is Baby Food:

    1. In the Search actions list, select Filter rows.

    2. In the Filter rows pane, in the Choose list (above where), select Select rows.

    3. In the list below where, select item_type.

    4. In the Choose list next to item_type, select has value(s).

    5. In the Choose value(s) box next to has value(s), select Baby Food.

    6. Click Execute.

  8. Copy the automatically generated Python code for this query. To do this, click Copy Code below the data preview.

  9. Paste and modify the code:

    1. In the notebook’s fourth cell, paste the code that you copied. It should look like this:

      # Step: Keep rows where item_type is one of: Baby Food
      df = df.loc[df['item_type'].isin(['Baby Food'])]
      
    2. Add to this code so that it displays only those rows where order_prio is C, and then run the cell:

      # Step: Keep rows where item_type is one of: Baby Food
      df = df.loc[df['item_type'].isin(['Baby Food'])]
      
      # Add the following code.
      # Step: Keep rows where order_prio is one of: C
      df = df.loc[df['order_prio'].isin(['C'])]
      df
      

    Tip

    Instead of writing this code, you can also do the same thing by just using bamboolib in the third cell to display only those rows where order_prio is C. This step is an example of extending the code that bamboolib automatically generated earlier.

  10. Sort the rows by region in ascending order:

    a. In the widget within the fourth cell, click Sort rows.

    1. In the Sort column(s) pane, in the Choose column list, select region.

    2. In the list next to region, select ascending (A-Z).

    3. Click Execute.

    Note

    This is equivalent to writing the following code yourself:

    df = df.sort_values(by=['region'], ascending=[True])
    df
    

    You could have also just used bamboolib in the third cell to sort the rows by region in ascending order. This step demonstrates how you can use bamboolib to extend the code that you write. As you use bamboolib, it automatically generates the additional code for you in the background, so that you can further extend your already-extended code!

  11. Continue with key tasks.

Key tasks

Add the widget to a cell

Scenario: You want the bamboolib widget to display in a cell.

  1. Make sure the notebook meets the requirements for bamboolib.

  2. If bamboolib is not already installed in the workspace or cluster run the following code in a cell in the notebook, preferably in the first cell:

    %pip install bamboolib
    
  3. Run the following code in the notebook, preferably in the notebook’s first or second cell:

    import bamboolib as bam
    
  4. Option 1: In the cell where you want the widget to appear, add the following code, and then run the cell:

    bam
    

    The widget appears in the cell below the code.

    Or:

    Option 2: In a cell that contains a reference to a pandas DataFrame, print the DataFrame. For example, given the following DataFrame definition, run the cell:

    import pandas as pd
    from datetime import datetime, date
    
    df = pd.DataFrame({
      'a': [ 1, 2, 3 ],
      'b': [ 2., 3., 4. ],
      'c': [ 'string1', 'string2', 'string3' ],
      'd': [ date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1) ],
      'e': [ datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0) ]
    })
    
    df
    

    The widget appears in the cell below the code.

    Note that bamboolib only supports pandas DataFrames. To convert a PySpark DataFrame to a pandas DataFrame, call toPandas on the PySpark DataFrame. To convert a Pandas API on Spark DataFrame to a pandas DataFrame, call to_pandas on the Pandas API on Spark DataFrame.

Clear the widget

Scenario: You want to clear the contents of a widget and then read new data into the existing widget.

Option 1: Run the following code within the cell that contains the target widget:

bam

The widget clears and then redisplays the Databricks: Read CSV file from DBFS, Databricks: Load database table, and Load dummy data buttons.

Note

If the error name 'bam' is not defined appears, run the following code in the notebook (preferably in the notebook’s first cell), and then try again:

import bamboolib as bam

Option 2: In a cell that contains a reference to a pandas DataFrame, print the DataFrame again by running the cell again. The widget clears and then displays the new data.

Data loading tasks

Read an example dataset’s contents into the widget

Scenario: You want to read some example data into the widget, for example some pretend sales data, so that you can test out the widget’s functionality.

  1. Click Load dummy data.

    Note

    If Load dummy data is not visible, clear the widget with Option 1 and try again.

  2. In the Load dummy data pane, for Load a dummy data set for testing bamboolib, select the name of the dataset that you want to load.

  3. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  4. Click Execute.

    The widget displays the contents of the dataset.

Tip

You can switch the current widget to display the contents of a different example dataset:

  1. In the current widget, click the Load dummy data tab.

  2. Follow the preceding steps to read the other example dataset’s contents into the widget.

Read a CSV file’s contents into the widget

Scenario: You want to read the contents of a CSV file within your Databricks workspace into the widget.

  1. Click Databricks: Read CSV file from DBFS.

    Note

    If Databricks: Read CSV file from DBFS is not visible, clear the widget with Option 1 and try again.

  2. In the Read CSV from DBFS pane, browse to the location that contains the target CSV file.

  3. Select the target CSV file.

  4. For Dataframe name, enter a name for the programmatic identifier of the CSV file’s contents as a DataFrame, or leave df as the default programmatic identifier.

  5. For CSV value separator, enter the character that separates values in the CSV file, or leave the , (comma) character as the default value separator.

  6. For Decimal separator, enter the character that separates decimals in the CSV file, or leave the . (dot) character as the default value separator.

  7. For Row limit: read the first N rows - leave empty for no limit, enter the maximum number of rows to read into the widget, or leave 100000 as the default number of rows, or leave this box empty to specify no row limit.

  8. Click Open CSV file.

    The widget displays the contents of the CSV file, based on the settings that you specified.

Tip

You can switch the current widget to display the contents of a different CSV file:

  1. In the current widget, click the Read CSV from DBFS tab.

  2. Follow the preceding steps to read the other CSV file’s contents into the widget.

Read a database table’s contents into the widget

Scenario: You want to read the contents of a database table within your Databricks workspace into the widget.

  1. Click Databricks: Load database table.

    Note

    If Databricks: Load database table is not visible, clear the widget with Option 1 and try again.

  2. In the Databricks: Load database table pane, for Database - leave empty for default database, enter the name of the database in which the target table is located, or leave this box empty to specify the default database.

  3. For Table, enter the name of the target table.

  4. For Row limit: read the first N rows - leave empty for no limit, enter the maximum number of rows to read into the widget, or leave 100000 as the default number of rows, or leave this box empty to specify no row limit.

  5. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  6. Click Execute.

    The widget displays the contents of the table, based on the settings that you specified.

Tip

You can switch the current widget to display the contents of a different table:

  1. In the current widget, click the Databricks: Load database table tab.

  2. Follow the preceding steps to read the other table’s contents into the widget.

Data action tasks

bamboolib offers over 50 data actions. Following are some of the more common getting-started data action tasks.

Select columns

Scenario: You want to show only specific table columns by name, by data type, or that match some regular expression. For example, in the dummy Sales dataset, you want to show only the item_type and sales_channel columns, or you want to show only the columns that contain the string _date in their column names.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type select, and then select Select or drop columns.

    • Select Select or drop columns.

  2. In the Select or drop columns pane, in the Choose drop-down list, select Select.

  3. Select the target column names or inclusion criterion.

  4. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  5. Click Execute.

Drop columns

Scenario: You want to hide specific table columns by name, by data type, or that match some regular expression. For example, in the dummy Sales dataset, you want to hide the order_prio, order_date, and ship_date columns, or you want to hide all columns that contain only date-time values.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type drop, and then select Select or drop columns.

    • Select Select or drop columns.

  2. In the Select or drop columns pane, in the Choose drop-down list, select Drop.

  3. Select the target column names or inclusion criterion.

  4. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  5. Click Execute.

Filter rows

Scenario: You want to show or hide specific table rows based on criteria such as specific column values that are matching or missing. For example, in the dummy Sales dataset, you want to show only those rows where the item_type column’s value is set to Baby Food.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type filter, and then select Filter rows.

    • Select Filter rows.

  2. In the Filter rows pane, in the Choose drop-down list above where, select Select rows or Drop rows.

  3. Specify the first filter criterion.

  4. To add another filter criterion, click add condition, and specify the next filter criterion. Repeat as desired.

  5. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  6. Click Execute.

Sort rows

Scenario: You want to sort table rows based on the values within one or more columns. For example, in the dummy Sales dataset, you want to show the rows by the region column’s values in alphabetical order from A to Z.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type sort, and then select Sort rows.

    • Select Sort rows.

  2. In the Sort column(s) pane, choose the first column to sort by and the sort order.

  3. To add another sort criterion, click add column, and specify the next sort criterion. Repeat as desired.

  4. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  5. Click Execute.

Grouping rows and columns tasks

Group rows and columns by a single aggregate function

Scenario: You want to show row and column results by calculated groupings, and you want to assign custom names to those groupings. For example, in the dummy Sales dataset, you want to group the rows by the country column’s values, showing the numbers of rows containing the same country value, and giving the list of calculated counts the name country_count.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type group, and then select Group by and aggregate (with renaming).

    • Select Group by and aggregate (with renaming).

  2. In the Group by with column rename pane, select the columns to group by, the first calculation, and optionally specify a name for the calculated column.

  3. To add another calculation, click add calculation, and specify the next calculation and column name. Repeat as desired.

  4. Specify where to store the result.

  5. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  6. Click Execute.

Group rows and columns by multiple aggregate functions

Scenario: You want to show row and column results by calculated groupings. For example, in the dummy Sales dataset, you want to group the rows by the region, country, and sales_channel columns’ values, showing the numbers of rows containing the same region and country value by sales_channel, as well as the total_revenue by unique combination of region, country, and sales_channel.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type group, and then select Group by and aggregate (default).

    • Select Group by and aggregate (default).

  2. In the Group by with column rename pane, select the columns to group by and the first calculation.

  3. To add another calculation, click add calculation, and specify the next calculation. Repeat as desired.

  4. Specify where to store the result.

  5. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  6. Click Execute.

Remove rows with missing values

Scenario: You want to remove any row that has a missing value for the specified columns. For example, in the dummy Sales dataset, you want to remove any rows that have a missing item_type value.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type drop or remove, and then select Drop missing values.

    • Select Drop missing values.

  2. In the Drop missing values pane, select the columns to remove any row that has a missing value for that column.

  3. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  4. Click Execute.

Remove duplicated rows

Scenario: You want to to remove any row that has a duplicated value for the specified columns. For example, in the dummy Sales dataset, you want to remove any rows that are exact duplicates of each other.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type drop or remove, and then select Drop/Remove duplicates.

    • Select Drop/Remove duplicates.

  2. In the Remove Duplicates pane, select the columns to remove any row that has a duplicated value for those columns, and then select whether to keep the first or last row that has the duplicated value.

  3. For Dataframe name, enter a name for the programmatic identifier of the table’s contents as a DataFrame, or leave df as the default programmatic identifier.

  4. Click Execute.

Find and replace missing values

Scenario: You want to replace the missing value with a replacement value for any row with the specified columns. For example, in the dummy Sales dataset, you want to replace any row with a missing value in the item_type column with the value Unknown Item Type.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type find or replace, and then select Find and replace missing values.

    • Select Find and replace missing values.

  2. In the Replace missing values pane, select the columns to replace missing values for, and then specify the replacement value.

  3. Click Execute.

Create a column formula

Scenario: You want to create a column that uses a unique formula. For example, in the dummy Sales dataset, you want to create a column named profit_per_unit that displays the result of dividing the total_profit column value by the units_sold column value for each row.

  1. On the Data tab, in the Search actions drop-down list, do one of the following:

    • Type formula, and then select New column formula.

    • Select New column formula.

  2. In the Replace missing values pane, select the columns to replace missing values for, and then specify the replacement value.

  3. Click Execute.

Data action history tasks

View the list of actions taken in the widget

Scenario: You want to see a list of all of the changes that were made in the widget, starting with the most recent change.

Click History. The list of actions appears in the Transformations history pane.

Undo the most recent action taken in the widget

Scenario: You want to revert the most recent change that was made in the widget.

Do one of the following:

  • Click the counterclockwise arrow icon.

  • Click History, and in the Transformations history pane, click Undo last step.

Redo the most recent action taken in the widget

Scenario: You want to revert the most recent revert that was made in the widget.

Do one of the following:

  • Click the clockwise arrow icon.

  • Click History, and in the Transformations history pane, click Recover last step.

Change the most recent action taken in the widget

Scenario: You want to change the most recent change that was taken in the widget.

  1. Do one of the following:

    • Click the pencil icon.

    • Click History, and in the Transformations history pane, click Edit last step.

  2. Make the desired change, and then click Execute.

Get code to programmatically recreate the widget’s current state as a DataFrame

Scenario: You want to get Python code that programmatically recreates the current widget’s state, represented as a pandas DataFrame. You want to run this code in a different cell in this workbook or a different workbook altogether.

  1. Click Get Code.

  2. In the Export code pane, click Copy code. The code is copied to your system’s clipboard.

  3. Paste the code into a different cell in this workbook or into a different workbook.

  4. Write additional code to work with this pandas DataFrame programmatically, and then run the cell. For example, to display the DataFrame’s contents, assuming that your DataFrame is represented programmatically by df:

    # Your pasted code here, followed by...
    df
    

Limitations

  • Using bamboolib for data wrangling is limited to approximately 10 million rows. This limit is based on pandas and your cluster’s compute resources.

  • Using bamboolib for data visualizations is limited to approximately 10 thousand rows. This limit is based on plotly.