Post Click Prediction

Problem Overview

This Reddit Machine Learning Engineer interview is a practical tabular modeling exercise done in a Jupyter notebook. You are given a clean JSON dataset where each row represents a post impression for a Reddit user. The task is to predict whether a user will click on a post after it has been shown to them.

Problem Statement

Given a dataset of post impressions on Reddit, predict whether a user will click on a post after it has been shown to them. Each row in the dataset represents a post impression for a Reddit user. The dataset is a clean JSON file.

Features

post_id: Unique identifier for the post.
user_id: Unique identifier for the user.
click: Binary target variable (1 if the user clicked on the post, 0 otherwise).
timestamp: Timestamp of when the post was shown to the user.
post_title: Title of the post.
post_body: Body of the post.
post_upvotes: Number of upvotes on the post.
post_comments: Number of comments on the post.
user_age: Age of the user.
user_karma: Total karma of the user.
user_is_moderator: Whether the user is a moderator (1 if yes, 0 otherwise).

Constraints

The dataset is relatively small (a few hundred thousand rows).
You are expected to preprocess the data, handle missing values, and engineer relevant features.
You can use any machine learning algorithm or library to build your model.
The model should be able to make predictions on new, unseen data.

Examples

If a user with user_id 1234 clicks on a post with post_id 5678, the corresponding row in the dataset would have click = 1.
If a user with user_id 2345 does not click on a post with post_id 6789, the corresponding row in the dataset would have click = 0.
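
As an illustration, the two example rows above might look like this once loaded. This is only a sketch: the problem does not say how the JSON file is laid out, so a JSON array of records is assumed, and every value other than the IDs and click labels is invented. The helper name load_impressions is hypothetical.

```python
import io

import pandas as pd

# Two hypothetical rows matching the examples above; all other values are made up.
SAMPLE_JSON = """[
  {"post_id": 5678, "user_id": 1234, "click": 1, "timestamp": "2024-01-01T12:00:00",
   "post_title": "Example title", "post_body": "Example body", "post_upvotes": 120,
   "post_comments": 14, "user_age": 27, "user_karma": 3400, "user_is_moderator": 0},
  {"post_id": 6789, "user_id": 2345, "click": 0, "timestamp": "2024-01-01T12:05:00",
   "post_title": "Another title", "post_body": null, "post_upvotes": 3,
   "post_comments": 0, "user_age": 35, "user_karma": 50, "user_is_moderator": 1}
]"""


def load_impressions(source):
    """Load impressions from a path or file-like object (assumed: JSON array of records)."""
    df = pd.read_json(source, orient="records")
    # Make sure the impression time is a real datetime column.
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df


df = load_impressions(io.StringIO(SAMPLE_JSON))
print(df[["user_id", "post_id", "click"]])
```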

Hints

Consider using natural language processing techniques to extract features from the post_title and post_body columns.
Feature engineering can play a significant role in improving the model's performance. For example, you can create new features like post_length or post_upvotes_to_comments_ratio.
Experiment with different machine learning algorithms to find the best model for this problem.

Solution

To solve this problem, you can follow these steps:

Data Preprocessing:
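- Handle missing values in the dataset (for example, impute numeric columns with the median and treat a missing post_body as an empty string).
- Make sure flag columns such as user_is_moderator are numeric 0/1 (depending on how the JSON is parsed, they may load as booleans or strings), and encode any other categorical variables numerically.
- Normalize or standardize numerical features if the chosen model is sensitive to scale (e.g., logistic regression). Fit the scaler on the training data only so that test-set statistics do not leak into training.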
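
The preprocessing steps above could be sketched with scikit-learn as follows. This is one possible setup, not a prescribed one: the choice of median and most-frequent imputation, the column grouping, and the toy values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

NUMERIC_COLS = ["post_upvotes", "post_comments", "user_age", "user_karma"]
BINARY_COLS = ["user_is_moderator"]

# Numeric columns: fill gaps with the median, then standardize.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Binary flag: fill gaps with the most frequent value and leave it unscaled.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, NUMERIC_COLS),
    ("bin", SimpleImputer(strategy="most_frequent"), BINARY_COLS),
])

# Toy data with a few missing values (invented for illustration).
toy = pd.DataFrame({
    "post_upvotes": [10, 200, np.nan, 35],
    "post_comments": [1, 40, 3, np.nan],
    "user_age": [21, np.nan, 34, 45],
    "user_karma": [100, 5000, 20, 900],
    "user_is_moderator": [0, 1, 0, np.nan],
})

# In a real run, call fit_transform on the training split and transform on the test split.
X = preprocessor.fit_transform(toy)
print(X.shape)
```
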
Feature Engineering:
- Extract features from the post_title and post_body using natural language processing techniques.
- Create new features that might be relevant to the prediction, such as post_length or post_upvotes_to_comments_ratio.
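
A minimal sketch of these feature-engineering ideas follows. The helper add_basic_features, the hour_of_day feature, the +1 in the ratio's denominator, and the TF-IDF settings are illustrative assumptions, not part of the problem statement.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def add_basic_features(df):
    """Return a copy of df with a few hand-crafted features added."""
    out = df.copy()
    title = out["post_title"].fillna("")
    body = out["post_body"].fillna("")
    # Total character count of title plus body.
    out["post_length"] = title.str.len() + body.str.len()
    # The +1 avoids division by zero for posts with no comments.
    out["post_upvotes_to_comments_ratio"] = out["post_upvotes"] / (out["post_comments"] + 1)
    # Hour at which the impression happened (0-23).
    out["hour_of_day"] = pd.to_datetime(out["timestamp"]).dt.hour
    return out


# Toy rows invented for illustration.
toy = pd.DataFrame({
    "timestamp": ["2024-01-01 08:30:00", "2024-01-01 21:15:00"],
    "post_title": ["Cute cat photo", "Python tips for beginners"],
    "post_body": [None, "Use list comprehensions."],
    "post_upvotes": [120, 30],
    "post_comments": [0, 9],
})
features = add_basic_features(toy)

# Bag-of-words text features: one sparse TF-IDF row per impression.
text = toy["post_title"].fillna("") + " " + toy["post_body"].fillna("")
vectorizer = TfidfVectorizer(max_features=1000, stop_words="english")
tfidf = vectorizer.fit_transform(text)
print(features[["post_length", "post_upvotes_to_comments_ratio", "hour_of_day"]])
```
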
Model Selection:
- Choose a suitable machine learning algorithm (e.g., logistic regression, random forest, gradient boosting machines).
- Split the dataset into training and testing sets.
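
Because every impression carries a timestamp, one option is to split chronologically so that the test set only contains impressions shown after everything in the training set. The sketch below assumes that design choice; the function name time_based_split and the 80/20 ratio are hypothetical, and a random stratified split would be a reasonable alternative.

```python
import pandas as pd


def time_based_split(df, test_fraction=0.2, time_col="timestamp"):
    """Sort by time and hold out the most recent test_fraction of rows."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cutoff = int(len(ordered) * (1 - test_fraction))
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Toy impressions in shuffled order (invented for illustration).
toy = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=10, freq="D"),
    "click": [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
}).sample(frac=1.0, random_state=0)

train, test = time_based_split(toy)
print(len(train), len(test))
```
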
Model Training:
- Train the model on the training set.
- Tune hyperparameters using cross-validation.
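
Training and cross-validated tuning could look roughly like this. The sketch uses logistic regression with a small grid over the regularization strength C and synthetic data standing in for the real feature matrix; the grid values and fold count are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered training features and click labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 5))
logits = 1.5 * X_train[:, 0] - 1.0 * X_train[:, 1] - 1.0
y_train = (rng.random(400) < 1 / (1 + np.exp(-logits))).astype(int)

# Scaling lives inside the pipeline so it is refit on each training fold.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    model,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```
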
Model Evaluation:
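- Evaluate the model's performance on the testing set using appropriate metrics (e.g., AUC-ROC, F1-score, accuracy). If clicks are much rarer than non-clicks, accuracy alone can be misleading because always predicting "no click" already scores well, so report it alongside a ranking or probability-based metric.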
- Make predictions on new, unseen data.
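
To make the metric choice concrete, here is a small sketch on hand-made labels and predicted probabilities (all values invented). It computes the metrics named above plus log loss, and compares accuracy against a baseline that never predicts a click. The 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score

# Ten impressions, two of which were clicked.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.08, 0.12, 0.40, 0.70, 0.35])
y_pred = (y_prob >= 0.5).astype(int)  # hard labels at an assumed 0.5 threshold

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),  # uses probabilities, not hard labels
    "log_loss": log_loss(y_true, y_prob),
}

# Baseline that never predicts a click, for comparison with the model's accuracy.
always_zero = np.zeros_like(y_true)
baseline_accuracy = accuracy_score(y_true, always_zero)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```
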
Model Deployment:
- Deploy the trained model to make real-time predictions on Reddit's platform.

After following these steps, you should have a model that predicts whether a user will click on a post after it has been shown to them, together with a held-out estimate of how well it does so.

