Generate and save the dataset (Python)

Generate a dataset and save to Delta tables

This notebook generates five Delta tables:

  • user_profile: user_id and their static profiles
  • item_profile: item_id and their static profiles
  • user_item_interaction: events when a user interacts with an item
    • this table is randomly split into three tables for model training and evaluation: train, val, and test
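The three-way random split described above can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's own split. The notebook imports scikit-learn's train_test_split, which is presumably what it uses, but this section does not state the split ratios. The helper name split_three_ways, the 80/10/10 fractions, the seed, and the toy rows are all assumptions made for this example.

```python
import random

def split_three_ways(rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle rows and cut them into disjoint train/val/test lists.

    val_frac, test_frac, and seed are hypothetical defaults, not values
    taken from the notebook.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is repeatable
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test

# Toy (user_id, item_id, label) rows, made up purely for illustration.
interactions = [(i % 7, i % 13, i % 2) for i in range(100)]
train, val, test = split_three_ways(interactions)
print(len(train), len(val), len(test))
```

Because every row lands in exactly one of the three lists, no interaction leaks from the training set into the validation or test set.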

For simplicity, the user and item profiles contain only two attributes: age and topic. The user_age column is the user's age, and the item_age column is the average age of users who interact with the item. The user_topic column is the user's favorite topic, and the item_topic column is the most relevant topic of the item. Also for simplicity, the user_item_interaction table ignores event timestamps and includes only user_id, item_id, and a binary label column representing whether the user interacts with the item.

How the label is calculated

This notebook randomly assigns a label representing whether the user interacts with the item. The label is based on the similarity of the user and item, which is determined by their age and topic attributes.

The calculation divides users into three age ranges: under 18, 18-34, and 35-60. If user_age and item_age fall in the same range, the probability of interaction is higher.

  • same age range: P(interact_age) = 0.3
  • different age range: P(interact_age) = 0.1

The topic has 10 categories. This calculation assumes that topics (1,2,4) are related, (3,6,9) are related, and (0,5,7,8) are related.

  • related topic: P(interact_topic) = 0.3
  • different topic: P(interact_topic) = 0.05

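The overall probability that a user interacts with an item is P(interact) = P(interact_age OR interact_topic). The two events are treated as independent, so P(interact) = 1 - (1 - P(interact_age)) * (1 - P(interact_topic)). This notebook randomly generates a label based on that probability.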
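To make the combined probability concrete, the sketch below tabulates P(interact) for the four age/topic cases. It assumes the age and topic events are independent, which matches how the label code later in this notebook multiplies the two "no interaction" probabilities. The names P_AGE, P_TOPIC, p_interact, and combined are introduced here for illustration only.

```python
from itertools import product

# Per-attribute interaction probabilities, taken from the lists above.
P_AGE = {"same age range": 0.3, "different age range": 0.1}
P_TOPIC = {"related topic": 0.3, "different topic": 0.05}

def p_interact(p_age, p_topic):
    # Assumption: the two events are independent, so
    # P(A or B) = 1 - (1 - P(A)) * (1 - P(B)).
    return 1 - (1 - p_age) * (1 - p_topic)

# Combined probability for every (age case, topic case) pair.
combined = {
    (age_case, topic_case): round(p_interact(P_AGE[age_case], P_TOPIC[topic_case]), 4)
    for age_case, topic_case in product(P_AGE, P_TOPIC)
}

for (age_case, topic_case), p in combined.items():
    print(f"{age_case:20s} | {topic_case:15s} | P(interact) = {p}")
```

The best case (same age range and related topic) gives 1 - 0.7 * 0.7 = 0.51, and the worst case (different age range and different topic) gives 1 - 0.9 * 0.95 = 0.145.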

import numpy as np
import pandas as pd
 
from sklearn.model_selection import train_test_split   
 
from pyspark.sql.functions import *
NUM_USERS = 400
NUM_ITEMS = 2000
NUM_INTERACTIONS = 4000
NUM_TOPICS = 10
MAX_AGE = 60
 
DATA_DBFS_ROOT_DIR = '/tmp/recommender/data'
 
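def export_pd_in_delta(pdf, name):
  # Convert the pandas DataFrame to a Spark DataFrame and overwrite the Delta table at <root>/<name>.
  # `spark` is the SparkSession that the Databricks notebook runtime provides.
  spark.createDataFrame(pdf).write.format("delta").mode("overwrite").save(f"{DATA_DBFS_ROOT_DIR}/{name}")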

Generate features

user_pdf = pd.DataFrame({
  "user_id": [i for i in range(NUM_USERS)],
  "user_age": [np.random.randint(MAX_AGE) for _ in range(NUM_USERS)],
  "user_topic": [np.random.randint(NUM_TOPICS) for _ in range(NUM_USERS)],
})
item_pdf = pd.DataFrame({
  "item_id": [i for i in range(NUM_ITEMS)],
  "item_age": [np.random.randint(MAX_AGE) for _ in range(NUM_ITEMS)],
  "item_topic": [np.random.randint(NUM_TOPICS) for _ in range(NUM_ITEMS)],
})
export_pd_in_delta(item_pdf, "item_profile")
export_pd_in_delta(user_pdf, "user_profile")

Generate labels

user_id = [np.random.randint(NUM_USERS) for _ in range(NUM_INTERACTIONS)]
item_id = [np.random.randint(NUM_ITEMS) for _ in range(NUM_INTERACTIONS)]
pdf = pd.DataFrame({"user_id": user_id, "item_id": item_id})
all_pdf = pdf \
    .set_index('item_id') \
    .join(item_pdf.set_index('item_id'), rsuffix='_it').reset_index() \
    .set_index('user_id') \
    .join(user_pdf.set_index('user_id'), rsuffix='_us').reset_index()
display(all_pdf)
 
    user_id  item_id  item_age  item_topic  user_age  user_topic
 1        0       66        12           9        57           9
 2        0      516        15           5        57           9
 3        0      691        15           3        57           9
 4        0      746         1           5        57           9
 5        0      936        44           1        57           9
 6        0     1031        28           3        57           9
 7        0     1130        34           7        57           9
 8        0     1217         1           8        57           9
 9        0     1324        55           8        57           9
10        0     1387        43           2        57           9
11        0     1446        54           5        57           9
12        0     1532        58           4        57           9
13        0     1660        15           5        57           9
14        0     1867        51           1        57           9
15        0     1930        41           4        57           9
16        0     1934        45           8        57           9
17        1       31        34           6        37           8

Showing the first 1000 rows.

def get_range(age):
  # <18, 18-34, 35-60
  if age < 18:
    return 0
  if age < 35:
    return 1
  return 2
 
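# Map each topic id to its group of related topics:
# (1,2,4) are related, (3,6,9) are related, (0,5,7,8) are related.
d = {1:0, 2:0, 4:0, 3:1, 6:1, 9:1, 0:2, 5:2, 7:2, 8:2}
  
def calc_clicked(ad_age, ad_topic, disp_age, disp_topic):
  # ad_* are the item attributes and disp_* are the user attributes (see the call below).
  # Probability of NOT interacting because of age: 1 - P(interact_age).
  if get_range(ad_age) == get_range(disp_age):
    age_not_click = 0.7
  else:
    age_not_click = 0.9
  # Probability of NOT interacting because of topic: 1 - P(interact_topic).
  if d[ad_topic] == d[disp_topic]:
    topic_not_click = 0.7
  else:
    topic_not_click = 0.95
  # P(interact_age OR interact_topic), treating the two events as independent.
  overall_click = 1 - age_not_click * topic_not_click
  # Draw the binary label with that probability.
  return 1 if np.random.rand() < overall_click else 0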
all_pdf['label'] = all_pdf.apply(lambda row: calc_clicked(
  row['item_age'], row['item_topic'], row['user_age'], row['user_topic']
), axis=1)
export_pdf = all_pdf[['user_id', 'item_id', 'label']]
display(export_pdf)
 
    user_id  item_id  label
 1        0       66      1
 2        0      516      0
 3        0      691      1
 4        0      746      1
 5        0      936      0
 6        0     1031      0
 7        0     1130      0
 8        0     1217      0
 9        0     1324      0
10        0     1387      0
11        0     1446      0
12        0     1532      0
13        0     1660      0
14        0     1867      1
15        0     1930      1
16        0     1934      1
17        1       31      0

Showing the first 1000 rows.
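As a sanity check on the generated labels, the expected fraction of positive labels can be computed exactly by averaging the interaction probability over every possible age and topic combination. The sketch below is standard-library Python and is not part of the original notebook. It assumes ages are uniform over 0 to MAX_AGE - 1 and topics are uniform over 0 to NUM_TOPICS - 1, which is what np.random.randint(MAX_AGE) and np.random.randint(NUM_TOPICS) produce. The names topic_group, p_label, and expected_rate are introduced for this example.

```python
from itertools import product

MAX_AGE = 60      # same value as the notebook constant
NUM_TOPICS = 10   # same value as the notebook constant

def get_range(age):
    # Same buckets as the notebook: under 18, 18-34, 35 and above.
    if age < 18:
        return 0
    if age < 35:
        return 1
    return 2

# Topic id -> group of related topics, copied from the notebook's dictionary d.
topic_group = {1: 0, 2: 0, 4: 0, 3: 1, 6: 1, 9: 1, 0: 2, 5: 2, 7: 2, 8: 2}

def p_label(item_age, item_topic, user_age, user_topic):
    # Probability that the label is 1 for one (item, user) pair.
    age_miss = 0.7 if get_range(item_age) == get_range(user_age) else 0.9
    topic_miss = 0.7 if topic_group[item_topic] == topic_group[user_topic] else 0.95
    return 1 - age_miss * topic_miss

# Average over all equally likely (item_age, user_age, item_topic, user_topic) combinations.
total = 0.0
count = 0
for item_age, user_age in product(range(MAX_AGE), repeat=2):
    for item_topic, user_topic in product(range(NUM_TOPICS), repeat=2):
        total += p_label(item_age, item_topic, user_age, user_topic)
        count += 1

expected_rate = total / count
print(f"Expected fraction of positive labels: {expected_rate:.3f}")
```

Under these assumptions the expected rate is about 0.28, so a generated interaction table whose share of positive labels is far from that value would suggest a problem in the label logic.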

export_pdf.groupby(['user_id']).sum().describe()[['label']]