Batch embedding generation for Amazon beauty product reviews
In this notebook, you learn how to generate embeddings in batch using ai_query on Databricks. The embeddings are generated by a model serving endpoint that uses provisioned throughput. You can use these embeddings for downstream tasks such as sentiment analysis, clustering, and semantic search.
This notebook also:
- Uses ai_query to generate embeddings efficiently for large datasets.
- Runs entirely on Databricks, leveraging its scalable compute environment.
- Writes processed embeddings to a Delta table for further analysis.
Dataset: Amazon Reviews 2023 - All Beauty
The Amazon reviews 2023 dataset, specifically the All Beauty category, is a large-scale collection of customer reviews for beauty products available on Amazon. It contains structured data, including:
- Product ratings
- Review text
- Timestamps
- User IDs
- Other metadata
This dataset captures customer feedback at scale and is well suited to analyzing sentiment, measuring customer satisfaction, and identifying trends in product reviews.
Together, these steps form an end-to-end workflow for generating embeddings at scale and storing them for downstream use.
Part 1: Prepare data for embedding generation
In this section, we download the Amazon Beauty Product Reviews dataset from the McAuley-Lab/Amazon-Reviews-2023 repository on Hugging Face, perform basic data cleaning, and prepare the data for embedding generation. The cleaned data is written out to a Delta table for downstream tasks.
Data fields
Field | Description |
---|---|
rating | The numerical rating given by the customer (for example, 1–5 stars). |
title | The title of the review. |
text | The detailed review text. |
images | A list of image URLs associated with the review. |
asin | The unique product identifier (Amazon Standard Identification Number). |
parent_asin | The parent ASIN of the product (useful for product variants). |
user_id | An anonymized identifier for the user. |
timestamp | The time when the review was created (in epoch format). |
helpful_vote | The count of votes marking the review as helpful. |
verified_purchase | Boolean indicating whether the review is from a verified purchase. |
The following cell reads the parameters you selected in the notebook widgets and validates them.
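As a rough illustration of the validation step, the sketch below checks that each parameter is a non-empty, simple identifier before building a three-level table name. The parameter names (`catalog`, `schema`, `table`) and the sample values are assumptions for illustration, not the notebook's actual widgets; in a notebook the values would typically be read with `dbutils.widgets.get`.

```python
import re

# Hypothetical parameter names; the notebook's actual widgets may differ.
REQUIRED_PARAMS = ("catalog", "schema", "table")

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def validate_params(params: dict) -> dict:
    """Return stripped parameter values, raising ValueError on bad input."""
    cleaned = {}
    for name in REQUIRED_PARAMS:
        value = (params.get(name) or "").strip()
        if not value:
            raise ValueError(f"Parameter '{name}' must not be empty")
        if not _IDENTIFIER.match(value):
            raise ValueError(
                f"Parameter '{name}' is not a simple identifier: {value!r}"
            )
        cleaned[name] = value
    return cleaned


# Sample values stand in for what dbutils.widgets.get(name) would return.
params = validate_params(
    {"catalog": "main", "schema": "default", "table": "beauty_reviews"}
)
full_table_name = ".".join(params[name] for name in REQUIRED_PARAMS)
```

Failing early on an empty or malformed name avoids discovering the problem only when the Delta table write runs.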
The following loads the All Beauty category of the Amazon Reviews 2023 dataset and converts it to a pandas DataFrame and a Spark DataFrame.
The following defines functions for cleaning and processing your data before writing it to a Delta table.
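The cleaning logic could look something like the sketch below, written in plain Python over dictionaries so it runs anywhere. It collapses whitespace, drops reviews with no text, removes exact duplicates, and adds a parsed timestamp. The function names, the `review_time` field, the duplicate key, and the assumption that `timestamp` is in epoch milliseconds are all illustrative; the notebook's own functions may clean the data differently.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def clean_review(record: dict) -> Optional[dict]:
    """Return a cleaned copy of one review, or None if it has no usable text."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = " ".join((record.get("text") or "").split())
    if not text:
        return None
    cleaned = dict(record)
    cleaned["text"] = text
    cleaned["title"] = " ".join((record.get("title") or "").split())
    # Assumption: timestamps are epoch milliseconds.
    ts = record.get("timestamp")
    if ts is not None:
        cleaned["review_time"] = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
    return cleaned


def clean_reviews(records: Iterable[dict]) -> List[dict]:
    """Clean every review and drop duplicates of (user_id, asin, text)."""
    seen = set()
    result = []
    for record in records:
        cleaned = clean_review(record)
        if cleaned is None:
            continue
        key = (cleaned.get("user_id"), cleaned.get("asin"), cleaned["text"])
        if key in seen:
            continue
        seen.add(key)
        result.append(cleaned)
    return result


raw = [
    {"user_id": "u1", "asin": "B001", "title": " Great ",
     "text": "Works  well.\nWould buy again.", "timestamp": 1588687728923},
    {"user_id": "u1", "asin": "B001", "title": "Great",
     "text": "Works well. Would buy again.", "timestamp": 1588687728923},
    {"user_id": "u2", "asin": "B002", "title": "Empty",
     "text": "   ", "timestamp": 1588687728923},
]
cleaned_reviews = clean_reviews(raw)
```

Dropping empty reviews before inference matters here because every row sent to the endpoint consumes throughput, whether or not it carries any text worth embedding.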
Part 2: Create a provisioned throughput endpoint
This section creates a large provisioned throughput endpoint for batch inference. The endpoint size is specified in PTUs (provisioned throughput units). The endpoint is stopped after batch inference is complete.
Set the size of the batch inference endpoint in PTUs. More PTUs give higher inference throughput. Endpoint creation can fail because of GPU capacity limitations; if it does, try a smaller number of PTUs.
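The advice to reduce PTUs when creation fails can be automated with a simple fallback loop. The sketch below is only an illustration of that pattern: `create_endpoint` is a hypothetical callable you would supply (wrapping whatever call actually creates the endpoint), the PTU values are made up, and it assumes a capacity failure surfaces as a `RuntimeError`.

```python
from typing import Callable, Optional, Sequence, Tuple


def create_with_fallback(
    create_endpoint: Callable[[int], str],
    ptu_sizes: Sequence[int],
) -> Tuple[str, int]:
    """Try PTU sizes from largest to smallest; return (endpoint name, PTUs)."""
    last_error: Optional[Exception] = None
    for ptus in sorted(ptu_sizes, reverse=True):
        try:
            return create_endpoint(ptus), ptus
        except RuntimeError as err:  # assumption: capacity errors raise this
            last_error = err
    raise RuntimeError("No requested PTU size could be provisioned") from last_error


def fake_create(ptus: int) -> str:
    """Stand-in for a real creation call; pretends capacity tops out at 100."""
    if ptus > 100:
        raise RuntimeError("insufficient GPU capacity")
    return f"embedding-endpoint-{ptus}"


endpoint_name, granted_ptus = create_with_fallback(fake_create, [50, 200, 100])
```

Trying the largest size first favors throughput and only gives it up when capacity is unavailable.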
Part 3: Use ai_query for batch inference
The following sets up ai_query to run batch inference with your data, querying the provisioned throughput endpoint.
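A minimal sketch of what such a query might look like is shown below. It builds a SQL statement that calls `ai_query(endpoint, request)` once per row and keeps the result in an `embedding` column. The helper name, table name, and endpoint name are placeholders chosen for illustration, and the helper assumes the table and column names are already validated identifiers.

```python
def build_embedding_query(
    source_table: str,
    endpoint_name: str,
    text_column: str = "text",
) -> str:
    """Build a SQL statement that embeds one text column with ai_query."""
    # Escape single quotes so the endpoint name is a valid SQL string literal.
    endpoint_literal = endpoint_name.replace("'", "''")
    return (
        f"SELECT *, ai_query('{endpoint_literal}', {text_column}) AS embedding "
        f"FROM {source_table}"
    )


# Placeholder names; substitute your own table and endpoint.
query = build_embedding_query(
    source_table="main.default.beauty_reviews_clean",
    endpoint_name="beauty-embeddings",
)

# In a Databricks notebook, the result could then be written to a Delta table:
# spark.sql(query).write.mode("overwrite").saveAsTable("main.default.beauty_embeddings")
```

Expressing the inference as a single SQL statement lets Spark distribute the requests across the cluster rather than looping over rows in Python.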