This Reddit Machine Learning Engineer interview is a practical tabular modeling exercise done in a Jupyter notebook. You are given a clean JSON dataset of post impressions, where each row represents one impression of a post shown to a Reddit user. The task is to predict whether the user will click on the post after it has been shown to them.
The dataset contains the following columns:
post_id: Unique identifier for the post.
user_id: Unique identifier for the user.
click: Binary target variable (1 if the user clicked on the post, 0 otherwise).
timestamp: Timestamp of when the post was shown to the user.
post_title: Title of the post.
post_body: Body of the post.
post_upvotes: Number of upvotes on the post.
post_comments: Number of comments on the post.
user_age: Age of the user.
user_karma: Total karma of the user.
user_is_moderator: Whether the user is a moderator (1 if yes, 0 otherwise).
For example, if user_id 1234 clicks on a post with post_id 5678, the corresponding row in the dataset would have click = 1.
If user_id 2345 does not click on a post with post_id 6789, the corresponding row in the dataset would have click = 0.
Text features can be extracted from the post_title and post_body columns.
Additional features can be derived from the existing columns, such as post_length or post_upvotes_to_comments_ratio.
To solve this problem, you can follow these steps:
Data Preprocessing:
Convert categorical variables (e.g., user_is_moderator) to numerical variables.
Feature Engineering:
Extract features from post_title and post_body using natural language processing techniques.
Create new features such as post_length or post_upvotes_to_comments_ratio.
Model Selection:
Model Training:
Model Evaluation:
Model Deployment:
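The preprocessing and feature engineering steps above can be sketched in plain Python. This is a minimal sketch, not the expected interview solution: the two sample rows and the timestamp format are made up, the helper name engineer_features is hypothetical, and the definitions of post_length (characters in title plus body) and of the ratio (upvotes divided by comments plus one, to avoid dividing by zero) are assumptions, since the problem statement only names these features.

```python
import json

# Two made-up impressions following the column list above (illustrative values only;
# the timestamp format is an assumption, the source does not specify it).
RAW = """[
  {"post_id": 5678, "user_id": 1234, "click": 1,
   "timestamp": "2024-01-01T12:00:00",
   "post_title": "Hello world", "post_body": "First post here",
   "post_upvotes": 120, "post_comments": 30,
   "user_age": 25, "user_karma": 900, "user_is_moderator": 0},
  {"post_id": 6789, "user_id": 2345, "click": 0,
   "timestamp": "2024-01-01T12:05:00",
   "post_title": "Ask me anything", "post_body": "",
   "post_upvotes": 5, "post_comments": 0,
   "user_age": 31, "user_karma": 15000, "user_is_moderator": 1}
]"""


def engineer_features(row):
    """Return a flat dict of numeric features for one impression (hypothetical helper)."""
    title = row.get("post_title") or ""
    body = row.get("post_body") or ""
    return {
        # Assumed definition: total characters in title and body.
        "post_length": len(title) + len(body),
        # Assumed definition: +1 in the denominator guards against zero comments.
        "post_upvotes_to_comments_ratio": row["post_upvotes"] / (row["post_comments"] + 1),
        "post_upvotes": row["post_upvotes"],
        "post_comments": row["post_comments"],
        "user_age": row["user_age"],
        "user_karma": row["user_karma"],
        # int() maps both 0/1 and True/False to a numeric flag.
        "user_is_moderator": int(row["user_is_moderator"]),
    }


rows = json.loads(RAW)
X = [engineer_features(r) for r in rows]  # feature dicts, one per impression
y = [r["click"] for r in rows]            # binary target
```

In a notebook the same logic would more typically be written as a few pandas column operations; plain Python is used here only to keep the sketch free of dependencies.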
After following these steps, you should have a model that can accurately predict whether a user will click on a post after it has been shown to them.
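How accurate the model is can be checked on held-out impressions. The sketch below computes two metrics commonly used for binary click prediction, log loss and ROC AUC. The choice of metrics is an assumption (the problem statement does not name any), and the labels and predicted probabilities are hypothetical values standing in for a real validation split.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def roc_auc(y_true, y_score):
    """Probability that a random click is scored above a random non-click (ties count 0.5).

    Pairwise O(n_pos * n_neg) version, fine for a sketch but slow on large data.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and predicted click probabilities.
y_true = [1, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.5, 0.1]

auc = roc_auc(y_true, y_prob)
ll = log_loss(y_true, y_prob)
print(f"ROC AUC: {auc:.3f}, log loss: {ll:.3f}")
```

ROC AUC measures how well clicks are ranked above non-clicks, while log loss also penalizes poorly calibrated probabilities. With scikit-learn available, sklearn.metrics.roc_auc_score and sklearn.metrics.log_loss compute the same quantities.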