Data Scientist · Full Journey · Multiple Types — American Express

American Express — Data Scientist ✅ Passed

Level: Intern

Round: Full Journey · Type: Multiple Types · Difficulty: 4/10 · Duration: 150 min · Interviewer: Unfriendly

Topics: Machine Learning, SQL, Statistics, Gradient Boosting, Random Forest, Regression, Time Series, Gradient Descent, Risk Management, Data Analysis

Location: New York, NY, US

Interview date: 2020-01-30

Summary

Interview Rounds Overview

Round 1: Recruiter Screen
Round 2: Online Assessment
Round 3: Phone Interview
Round 4: Onsite
Round 5: Video Interview

Details

My interview experience for a Data Scientist Intern position at American Express involved several rounds:

Round 1: Recruiter Screen

This was a 5-minute phone call with a recruiter to confirm my availability for the online assessment (OA) and to ask basic questions such as my expected graduation date and how long I had been in the US.

Round 2: Online Assessment (OA)

The OA consisted of two parts:

A statistics problem:

There’s an airplane with 20 seats numbered 1, 2, ..., 20 and everyone has an assigned seat. People go in order to their respective seats... except for one guy. He just sits wherever he pleases when he gets on the plane and he’s pushed his way to the front of the line. So, he gets on first. Then he chooses a seat at random. After that, as the people board the plane, if their seat is open, they sit. If their seat is occupied, they pick another open seat at random. Now, you’re the 12th person to board. What is the probability that your seat is open when you board?
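
The boarding puzzle above can be sanity-checked with a short simulation. This is a minimal sketch, not the answer the assessment expected (the post does not give one). It assumes the usual reading of the puzzle: the k-th person to board is assigned seat k, and the first boarder picks uniformly among all 20 seats, possibly his own. Under that reading, seat k is already taken with probability 1/(n - k + 2), because only seats 1, k, k+1, ..., n can end a displacement chain and each is equally likely to be chosen first. For n = 20 and k = 12 that leaves a 9/10 chance the seat is open. The function names are illustrative.

```python
import random


def seat_is_open(n_seats=20, position=12, rng=random):
    """Simulate one boarding; return True if the passenger boarding in
    `position` finds their own seat free.

    Assumption: the k-th boarder is assigned seat k, and the first
    boarder picks uniformly among all seats (possibly his own).
    """
    free = list(range(1, n_seats + 1))
    free.remove(rng.choice(free))          # first boarder sits anywhere
    for passenger in range(2, position):   # boarders 2 .. position - 1
        if passenger in free:
            free.remove(passenger)         # own seat is open: take it
        else:
            free.remove(rng.choice(free))  # displaced: random open seat
    return position in free


def estimate_open_probability(n_seats=20, position=12, trials=50_000, seed=0):
    """Monte Carlo estimate of the probability that the seat is open."""
    rng = random.Random(seed)
    hits = sum(seat_is_open(n_seats, position, rng) for _ in range(trials))
    return hits / trials


def exact_open_probability(n_seats=20, position=12):
    """Closed form for the classic version of the puzzle (position >= 2)."""
    return 1 - 1 / (n_seats - position + 2)
```

Calling `estimate_open_probability()` should land close to `exact_open_probability()`; if the puzzle is read differently (for example, the first boarder's assigned seat is not seat 1), the assumptions in the docstring need to change accordingly.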

A take-home challenge in a Jupyter Notebook. I was given a CSV file with several columns, some self-explanatory and others not. The goal was to predict the probability of default (PD). I was slow to understand the columns and build visualizations, and I ran out of time before I could write complete answers. The questions were:
1. What do you observe about the provided dataset in distinct loans, length of performance as represented by loan x month on book, loan status and variables that are correlated with it or relevant segment splits?
2. What would you define as the target variable to predict PD? Please share your rationale and thinking.
3. How do you approach the decision on the model specification and the features to be used in training? Please explain.
4. (Bonus question) How would you approach the validation of the model you have just trained?
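
For question 2, one possible way to define the target is "the loan reached a default status within a fixed number of months on book". The sketch below illustrates that idea only; the post does not describe the actual columns or status codes, so `loan_id`, `month_on_book`, `loan_status`, the `"default"` label, and the 12-month horizon are all hypothetical.

```python
# Hypothetical loan x month-on-book records; the real column names and
# status codes in the assessment file are not given in the post.
records = [
    {"loan_id": "A", "month_on_book": 1, "loan_status": "current"},
    {"loan_id": "A", "month_on_book": 2, "loan_status": "30dpd"},
    {"loan_id": "A", "month_on_book": 3, "loan_status": "default"},
    {"loan_id": "B", "month_on_book": 1, "loan_status": "current"},
    {"loan_id": "B", "month_on_book": 2, "loan_status": "current"},
    {"loan_id": "C", "month_on_book": 1, "loan_status": "current"},
    {"loan_id": "C", "month_on_book": 14, "loan_status": "default"},
]

DEFAULT_STATUSES = {"default"}  # assumed definition of a "bad" outcome
HORIZON_MONTHS = 12             # assumed performance window


def build_targets(rows, horizon=HORIZON_MONTHS, bad=DEFAULT_STATUSES):
    """Return {loan_id: 0 or 1}, where 1 means the loan hit a bad status
    within `horizon` months on book."""
    target = {}
    for row in rows:
        loan = row["loan_id"]
        target.setdefault(loan, 0)  # every loan gets an entry
        if row["month_on_book"] <= horizon and row["loan_status"] in bad:
            target[loan] = 1
    return target


targets = build_targets(records)
```

Here loan A is labeled 1, while B and C are labeled 0 (C defaults only after the assumed window). A real analysis would also have to decide how to treat loans observed for fewer months than the horizon, such as loan B, since labeling them 0 understates the default rate.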

Round 3: Phone Interview

I was asked about:

My resume.
My research background.
My preference for the finance industry and why.
Detailed explanations of Gradient Boosting and Random Forest principles.
Quick, one-sentence explanations of terms like regression, time series, and gradient descent.
My strengths and weaknesses.
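
Two of the terms from the quick-fire list, regression and gradient descent, fit into one small example. The sketch below is my own illustration rather than anything shown in the interview: it fits a line y ~ w * x + b by batch gradient descent on mean squared error. The learning rate, step count, and toy data are arbitrary choices.

```python
def fit_line_gd(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by minimizing mean squared error with
    batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        # Step against the gradient to reduce the error
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = fit_line_gd(xs, ys)
```

On this noise-free data the fitted `w` and `b` approach 2 and 1. A learning rate that is too large makes the updates diverge instead of converge, which is usually the follow-up point in this kind of question.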

Round 4: Onsite Interview

During my onsite interview, I was asked:

About my resume.
To explain k-fold cross-validation.
How to determine the importance of features using a neural network, which I found tricky.
Why I wanted to work at American Express.
Again about my strengths and weaknesses.
To solve four simple SQL problems. The interviewer deemed them too easy and moved on.
Which hyperparameters of Gradient Boosting need tuning.
To choose between two problems: (1) which data I would use to predict probability of default (PD), or (2) how I would design a merchant recommendation system for American Express Offers. I chose the first because I was familiar with it from the OA. I suggested collecting information such as age and medical history, then realized this raised privacy concerns. The interviewer pointed out that this wouldn't be done in practice.
How Amex makes its profits.
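
For the feature-importance question, one common model-agnostic approach is permutation importance: shuffle one feature at a time and measure how much the model's error grows. The post does not say which answer the interviewer was looking for, so treat this as one reasonable option rather than the expected one. In the sketch below, `predict` is a hypothetical stand-in for a trained network's forward pass, and the data is synthetic.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return one score per feature: the average increase in MSE when that
    feature's column is shuffled, breaking its link to the target."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j permuted
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores


# Stand-in for a trained network (hypothetical): it relies heavily on
# feature 0, slightly on feature 1, and ignores feature 2.
def predict(row):
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [predict(row) for row in X]

scores = permutation_importance(predict, X, y)
```

The scores come out ordered by how much the stand-in model depends on each feature, with the ignored feature scoring zero. Because the method only needs a prediction function, the same code would work with a real network's inference call in place of `predict`; correlated features can make the scores misleading, which is worth mentioning in an interview answer.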

Round 5: Video Interview

This interview covered topics from previous rounds:

Questions about my resume and why I switched from physics to banking.
Whether I had any pending offers.
My familiarity with Python and SQL.
What steps I would take when given a dataset.
Which machine learning model to use in Amex's business scenario and how to evaluate the trade-offs.

I mentioned my experience with Spark and Amazon EMR, as the team uses them.