Batch embedding generation for Amazon beauty product reviews

This notebook shows how to generate embeddings in batch using ai_query on Databricks. The embeddings are produced by a model serving endpoint that uses provisioned throughput, and they can be used for downstream tasks such as sentiment analysis, clustering, and semantic search.

This notebook also:

Uses ai_query to generate embeddings efficiently for large datasets.
Runs entirely on Databricks, leveraging its scalable compute environment.
Writes processed embeddings to a Delta table for further analysis.
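
As a rough illustration of the pattern described above, the sketch below builds a SQL statement that calls ai_query once per row and writes the result to a new table. The helper build_embedding_query, the table names, the endpoint name, and the embedding column name are all hypothetical placeholders, not values taken from this notebook. The block only builds and prints the statement; running it requires a Databricks workspace with a deployed embedding endpoint.

```python
def build_embedding_query(source_table, target_table, endpoint, text_column="text"):
    """Build a CREATE TABLE ... AS SELECT statement that embeds one column with ai_query.

    All arguments are illustrative placeholders; substitute your own
    catalog, schema, table, and endpoint names.
    """
    return (
        f"CREATE OR REPLACE TABLE {target_table} AS\n"
        f"SELECT *, ai_query('{endpoint}', {text_column}) AS embedding\n"
        f"FROM {source_table}"
    )


query = build_embedding_query(
    source_table="main.default.beauty_reviews_clean",        # hypothetical table
    target_table="main.default.beauty_reviews_embeddings",   # hypothetical table
    endpoint="my-embedding-endpoint",                        # hypothetical endpoint name
)
print(query)

# In a Databricks notebook, the statement could then be run with:
#   spark.sql(query)
```

Keeping the whole job in one SQL statement lets Databricks distribute the endpoint calls across the dataset, so no Python loop over rows is needed.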

Dataset: Amazon Reviews 2023 - All Beauty

The All Beauty category of the Amazon Reviews 2023 dataset is a large-scale collection of customer reviews for beauty products sold on Amazon. It contains structured data, including:

Product ratings
Review text
Timestamps
User IDs
Other metadata

This dataset captures customer feedback in detail and is well suited to analyzing sentiment, measuring customer satisfaction, and identifying trends in product reviews.

Together, these steps form an end-to-end workflow: prepare the review data, generate embeddings at scale, and store them for downstream use.

Part 1: Prepare data for embedding generation

In this section, we download the Amazon Beauty Product Reviews dataset from the McAuley-Lab/Amazon-Reviews-2023 repository on Hugging Face, perform basic data cleaning, and prepare the data for embedding generation. The cleaned data is written out to a Delta table for downstream tasks.
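
The exact cleaning steps are not spelled out in the text above, so the following is only a minimal sketch of what "basic data cleaning" might look like, written in plain Python for clarity. It assumes three steps that are not confirmed by the notebook: collapsing whitespace in the review text, dropping reviews with no usable text, and removing exact duplicates by user, product, and text. The function name clean_reviews and the sample records are hypothetical.

```python
def clean_reviews(records, min_chars=1):
    """Yield reviews that have usable text, with whitespace normalized.

    Assumed cleaning rules (not taken from the notebook):
      1. collapse runs of whitespace in `text`
      2. drop rows whose text is missing or shorter than `min_chars`
      3. drop exact duplicates of (user_id, asin, text)
    """
    seen = set()
    for rec in records:
        text = " ".join((rec.get("text") or "").split())
        if len(text) < min_chars:
            continue
        key = (rec.get("user_id"), rec.get("asin"), text)
        if key in seen:
            continue
        seen.add(key)
        yield {**rec, "text": text}


# Made-up sample rows for illustration only.
sample = [
    {"user_id": "u1", "asin": "B001", "text": "  Great   lotion \n"},
    {"user_id": "u1", "asin": "B001", "text": "Great lotion"},
    {"user_id": "u2", "asin": "B002", "text": "   "},
    {"user_id": "u3", "asin": "B003", "text": None},
]
cleaned = list(clean_reviews(sample))
print(cleaned)
```

On a cluster, the same rules would more naturally be expressed as DataFrame filters before writing the result to a Delta table.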

Data fields

Field	Description
rating	The numerical rating given by the customer (for example, 1–5 stars).
title	The title of the review.
text	The detailed review text.
images	A list of image URLs associated with the review.
asin	The unique product identifier (Amazon Standard Identification Number).
parent_asin	The parent ASIN of the product (useful for product variants).
user_id	An anonymized identifier for the user.
timestamp	The time when the review was created (in epoch format).
helpful_vote	The count of votes marking the review as helpful.
verified_purchase	Boolean indicating whether the review is from a verified purchase.
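
To make the field list concrete, here is a small sketch that maps one raw record onto a typed structure. The Review dataclass and parse_review helper are hypothetical names introduced for illustration, and the sample values are made up. The sketch also assumes that the epoch timestamp is expressed in milliseconds; if your copy of the data uses seconds, drop the division by 1000.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class Review:
    rating: float
    title: str
    text: str
    images: List[str]
    asin: str
    parent_asin: str
    user_id: str
    timestamp: datetime
    helpful_vote: int
    verified_purchase: bool


def parse_review(raw):
    """Convert a raw dict into a Review (assumes `timestamp` is epoch milliseconds)."""
    return Review(
        rating=float(raw["rating"]),
        title=raw["title"],
        text=raw["text"],
        images=list(raw["images"]),
        asin=raw["asin"],
        parent_asin=raw["parent_asin"],
        user_id=raw["user_id"],
        timestamp=datetime.fromtimestamp(raw["timestamp"] / 1000, tz=timezone.utc),
        helpful_vote=int(raw["helpful_vote"]),
        verified_purchase=bool(raw["verified_purchase"]),
    )


# Made-up record for illustration only.
review = parse_review({
    "rating": 5.0,
    "title": "Works well",
    "text": "Leaves my skin soft.",
    "images": [],
    "asin": "B001",
    "parent_asin": "B001",
    "user_id": "u1",
    "timestamp": 1577836800000,
    "helpful_vote": 2,
    "verified_purchase": True,
})
print(review.timestamp.isoformat())
```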

Part 2: Create a provisioned throughput endpoint

Part 3: Use ai_query for batch inference