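import numpy as np

def get_range(age):
    # Map an age to one of three ranges: <18, 18-34, 35-60
    if age < 18:
        return 0
    if age < 35:
        return 1
    return 2

# Topic groups: (1,2,4) are related, (3,6,9) are related, (0,5,7,8) are related.
d = {1:0, 2:0, 4:0, 3:1, 6:1, 9:1, 0:2, 5:2, 7:2, 8:2}

def calc_clicked(ad_age, ad_topic, disp_age, disp_topic):
    # Probability of no click from the age signal (lower when the age ranges match)
    if get_range(ad_age) == get_range(disp_age):
        age_not_click = 0.7
    else:
        age_not_click = 0.9
    # Probability of no click from the topic signal (lower when the topic groups match)
    if d[ad_topic] == d[disp_topic]:
        disp_not_click = 0.7
    else:
        disp_not_click = 0.95
    # A click happens unless both signals fail to produce one
    overall_click = 1 - age_not_click * disp_not_click
    # Draw a binary label with probability overall_click
    return 1 if np.random.rand() < overall_click else 0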
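from sklearn.model_selection import train_test_split

# Hold out 20% for testing, then 20% of the remainder for validation (64/16/20 overall)
train, test = train_test_split(export_pdf, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)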
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
export_pd_in_delta(train, "user_item_interaction_train")
export_pd_in_delta(val, "user_item_interaction_val")
export_pd_in_delta(test, "user_item_interaction_test")
2560 train examples
640 validation examples
800 test examples
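The sizes printed above follow from applying a 20% holdout twice. The sketch below reproduces that arithmetic in plain Python. It assumes a total of 4000 rows (the sum of the three printed counts) and assumes that the held-out size is rounded up, which is how scikit-learn's `train_test_split` is understood to treat a fractional `test_size`. The helper name `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_rows, test_size):
    # Assumption: the held-out part is ceil(test_size * n_rows); the rest is kept.
    n_held_out = math.ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out

total_rows = 4000  # assumption: 2560 + 640 + 800 from the output above

# First split: 80% train+val, 20% test
n_train_val, n_test = split_sizes(total_rows, 0.2)
# Second split: 20% of the remaining rows become validation
n_train, n_val = split_sizes(n_train_val, 0.2)

print(n_train, 'train examples')
print(n_val, 'validation examples')
print(n_test, 'test examples')
```

Overall, this leaves 64% of the rows for training, 16% for validation, and 20% for testing.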
Generate a dataset and save to Delta tables
This notebook generates five Delta tables:
user_profile: user_id and their static profiles
item_profile: item_id and their static profiles
user_item_interaction: events when a user interacts with an item, stored as three splits: train, val, and test
For simplicity, the user and item profiles contain only two attributes: age and topic. The user_age column is the user's age, and the item_age column is the average age of users who interact with the item. The user_topic column is the user's favorite topic, and the item_topic column is the most relevant topic of the item. Also for simplicity, the user_item_interaction table ignores event timestamps and includes only user_id, item_id, and a binary label column representing whether the user interacts with the item.
How the label is calculated
This notebook randomly assigns a label representing whether the user interacts with the item. The label is based on the similarity of the user and item, which is determined by their age and topic attributes.
The calculation divides users into three age ranges: under 18, 18-34, and 35-60. If user_age and item_age are in the same range, the probability of interaction is higher.
The topic has 10 categories. This calculation assumes that topics (1,2,4) are related, (3,6,9) are related, and (0,5,7,8) are related. If user_topic and item_topic are in the same group, the probability of interaction is also higher.
The overall probability that a user interacts with an item is P(interact) = P(interact_age OR interact_topic). This notebook randomly generates a label based on that probability.
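As a worked illustration of the formula above, the sketch below enumerates the four possible match cases. It uses the constants from `calc_clicked` and treats the age and topic signals as independent, so that P(interact) = 1 - P(no interact_age) * P(no interact_topic). The names `AGE_NOT_CLICK`, `TOPIC_NOT_CLICK`, and `interaction_probability` are hypothetical and introduced only for this example.

```python
# Probability that each signal does NOT lead to an interaction,
# keyed by whether the user and item match on that attribute.
AGE_NOT_CLICK = {True: 0.7, False: 0.9}
TOPIC_NOT_CLICK = {True: 0.7, False: 0.95}

def interaction_probability(same_age_range, same_topic_group):
    # P(age OR topic) = 1 - P(not age) * P(not topic), assuming independence
    return 1 - AGE_NOT_CLICK[same_age_range] * TOPIC_NOT_CLICK[same_topic_group]

for same_age in (True, False):
    for same_topic in (True, False):
        p = interaction_probability(same_age, same_topic)
        print(f"same age range={same_age}, same topic group={same_topic}: {p:.3f}")
```

Under these constants, a user and item that match on both attributes interact with probability 0.51, while a pair that matches on neither interacts with probability 0.145. The label is therefore noisy even for well-matched pairs.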