Meta — Data Engineer ❌ Failed
Level: Senior-Level
Round: Full Journey · Type: Multiple Types · Difficulty: 7/10 · Duration: 240 min · Interviewer: Unfriendly
Topics: Data Modeling, SQL, Python, ETL, Product Sense, Data Visualization, DAU/MAU Analysis, System Design
Location: San Francisco Bay Area
Interview date: 2021-08-15
Got offer: False
Summary
Interview Rounds Overview
- Round 1: Phone Screen
- Round 2: Onsite (Data Modeling)
- Round 3: Onsite (Tech 1)
- Round 4: Onsite (Tech 2)
Details
The entire interview process took about two months, and unfortunately I was rejected. The main feedback from HR was that I did not do well with the Hiring Manager (HM). I had tried to prepare by researching the HM and lining up discussion topics in advance, but the feedback was that the impact of my experience came across as lacking.
One of the tech rounds involved a difficult interaction with another interviewer; it was hard to explain the schema clearly.
Here's a breakdown of the topics covered:
Data Modeling
- Version 01 - Ride Sharing Company (Uber/Lyft)
- 15-20 mins product sense discussion: How to expand market share.
- 20-30 mins data modeling discussion – fact table(s) & dimension tables related to the product sense.
- 5-10 mins on SQL query – 1-2 queries: Calculate the number of people who only use this app to go to the airport.
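For the airport-only question, a minimal sketch using SQLite from Python. The fact_trips schema and the destination_type column are hypothetical stand-ins for whatever fact table you design in the modeling portion; the idea is to group per rider and require that all of their trips end at an airport:

```python
import sqlite3

# Hypothetical fact table from the data-modeling discussion:
# fact_trips(trip_id, rider_id, destination_type)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_trips (trip_id INTEGER, rider_id INTEGER, destination_type TEXT);
INSERT INTO fact_trips VALUES
  (1, 100, 'airport'),
  (2, 100, 'airport'),
  (3, 200, 'airport'),
  (4, 200, 'home'),
  (5, 300, 'office');
""")

# A rider counts only if every trip (not just some) goes to the airport:
# the number of airport trips must equal the total number of trips.
query = """
SELECT COUNT(*) AS airport_only_riders
FROM (
  SELECT rider_id
  FROM fact_trips
  GROUP BY rider_id
  HAVING SUM(CASE WHEN destination_type = 'airport' THEN 1 ELSE 0 END) = COUNT(*)
)
"""
airport_only = conn.execute(query).fetchone()[0]
print(airport_only)  # only rider 100 goes exclusively to the airport -> 1
```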
- Version 02 - Cloud File Storage Company (Dropbox/Google Drive)
- 15-20 mins product sense discussion: How to evaluate whether the product is successful (name some classical metrics and start the discussion).
- 20-30 mins data modeling discussion – fact table(s) & dimension tables related to product sense.
- I was asked about how to consider building tables if the file is shared among multiple people. I suggested building two fact tables – one specifically for recording upload/download activities and another for recording shared assets.
- I was asked how to record file ownership transfer. I think the question is related to the share activity issue above.
- 5-10 mins on SQL query – 1-2 queries: Find how many files (storage assets) have multiple owners.
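For the multiple-owners question, one sketch that follows from the two-fact-table design suggested above. The file_ownership table (one row per file/owner pair) is a hypothetical shape for the shared-assets table, not a confirmed schema:

```python
import sqlite3

# Hypothetical shared-assets fact table: one row per (file, owner) pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE file_ownership (file_id INTEGER, owner_id INTEGER);
INSERT INTO file_ownership VALUES
  (1, 10), (1, 11),          -- file 1 has two owners
  (2, 10),                   -- file 2 has one owner
  (3, 12), (3, 13), (3, 14); -- file 3 has three owners
""")

# Count files whose distinct-owner count exceeds one.
query = """
SELECT COUNT(*) AS multi_owner_files
FROM (
  SELECT file_id
  FROM file_ownership
  GROUP BY file_id
  HAVING COUNT(DISTINCT owner_id) > 1
)
"""
multi_owner = conn.execute(query).fetchone()[0]
print(multi_owner)  # files 1 and 3 -> 2
```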
Tech 01
- Version 01 – DAU/MAU (The case study in this round generally revolves around DAU/MAU without disclosing too much specific company background).
- The product sense part commonly asked what the possible reasons might be if DAU/MAU suddenly dropped a lot.
- SQL – table schema will be provided, but actual data will not be given. The interviewer generally knows what is right and wrong.
- Data contains N days of login information – calculate the number of users who log in via mobile phone and the number of all users (AU/N by Phone and AU/N).
- Think about how to design a series of ETL processes – how to capture new users / retained users / churned users.
- Calc DAU by different categories (DAU group by platforms/devices).
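The three bullets above can be sketched in one pass over a login log. The logins(user_id, login_date, platform) schema, the two-day window, and the new/retained/churned definitions (relative to the previous day only) are simplifying assumptions, not the interviewer's exact setup:

```python
import sqlite3

# Hypothetical login log: logins(user_id, login_date, platform).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, login_date TEXT, platform TEXT);
INSERT INTO logins VALUES
  (1, '2021-08-01', 'phone'), (2, '2021-08-01', 'web'),
  (1, '2021-08-02', 'phone'), (3, '2021-08-02', 'phone');
""")

# DAU per day, overall and phone-only (the AU/N and AU/N-by-phone idea).
dau = conn.execute("""
SELECT login_date,
       COUNT(DISTINCT user_id) AS dau,
       COUNT(DISTINCT CASE WHEN platform = 'phone' THEN user_id END) AS phone_dau
FROM logins
GROUP BY login_date
ORDER BY login_date
""").fetchall()
print(dau)

# New / retained / churned for '2021-08-02' relative to the previous day:
# new = active today only; retained = active both days; churned = active
# yesterday but not today.
status = conn.execute("""
SELECT
  SUM(CASE WHEN today = 1 AND yesterday = 0 THEN 1 ELSE 0 END) AS new_users,
  SUM(CASE WHEN today = 1 AND yesterday = 1 THEN 1 ELSE 0 END) AS retained,
  SUM(CASE WHEN today = 0 AND yesterday = 1 THEN 1 ELSE 0 END) AS churned
FROM (
  SELECT user_id,
         MAX(CASE WHEN login_date = '2021-08-02' THEN 1 ELSE 0 END) AS today,
         MAX(CASE WHEN login_date = '2021-08-01' THEN 1 ELSE 0 END) AS yesterday
  FROM logins
  GROUP BY user_id
)
""").fetchone()
print(status)  # (new, retained, churned)
```

In production this would run as an incremental ETL step, comparing today's partition against a stored snapshot of active users rather than rescanning raw logs.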
- Version 02 – News Feed
- Product sense – how to evaluate whether users have viewed a post – needs some discussion – how to determine whether the user has effectively read the content of a post (some will be driven to discuss the so-called screen occupancy ratio – if a post’s screen occupancy ratio reaches x% threshold, then it is considered an effective reading).
- SQL & Python
- Use SQL and Python to calculate the number of effective readings of the corresponding post in each session.
Some details need to be confirmed – if the logs show the same session_id multiple times under a post, the post may have been read multiple times; you could then take the first occurrence as the standard, or take the average value.
There are several versions of the schema, but they may be similar:
(Table schema – session id, post id, start_timestamp, end_timestamp, read_percent)
(Table schema – session_id, post_id, time_stamp, event_type, percentage, where event_type is either a start or an end event.) The two schemas are essentially equivalent.
[One version seems more reliable: SQL – determine whether each post in the session (each post will have several start/end pairs) is effectively read, where ≥5 seconds and ≥80% screen occupancy counts as an effective reading. Each post is evaluated interval by interval, and if any one interval is an effective reading, the post counts as effectively read. The crux of this SQL is aligning each start time with its matching end time; once that is written, the rest is simple.]
(To transform the timestamps in the above table into aligned start/end pairs, a window function is the more intuitive solution. If the data is large, you may have to fall back to a self-join, which is more troublesome to write. I will see if there is a more suitable solution in the next couple of days.)
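One window-function sketch for the alignment, using LEAD() to pair each start with the next event for the same (session, post). The events schema, the sample thresholds (5 s, 80%), and the assumption that percentage rides along on the start row are all hypothetical:

```python
import sqlite3

# Hypothetical event log matching the second schema above:
# events(session_id, post_id, ts, event_type, pct)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (session_id INTEGER, post_id INTEGER, ts INTEGER,
                     event_type TEXT, pct REAL);
INSERT INTO events VALUES
  (1, 10, 0,  'start', 0.9),
  (1, 10, 7,  'end',   0.9),  -- 7 s at 90%: effective
  (1, 20, 8,  'start', 0.5),
  (1, 20, 20, 'end',   0.5),  -- long enough but only 50%: not effective
  (2, 10, 0,  'start', 0.9),
  (2, 10, 2,  'end',   0.9);  -- only 2 s: not effective
""")

# LEAD() pulls the matching end timestamp onto each start row, then the
# thresholds are applied; MAX() marks the post effective if any one of
# its intervals qualifies.
query = """
WITH paired AS (
  SELECT session_id, post_id, event_type, pct, ts AS start_ts,
         LEAD(ts) OVER (PARTITION BY session_id, post_id ORDER BY ts) AS end_ts
  FROM events
)
SELECT session_id, post_id,
       MAX(CASE WHEN end_ts - start_ts >= 5 AND pct >= 0.8
                THEN 1 ELSE 0 END) AS effective
FROM paired
WHERE event_type = 'start'
GROUP BY session_id, post_id
ORDER BY session_id, post_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Note that LEAD() assumes starts and ends strictly alternate per (session, post); overlapping intervals would need a more careful pairing (e.g. numbering starts and ends separately and joining on the row number).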
- The Python part pseudo-streams the same logic as the SQL above; each session_id comes with a corresponding session-end signal.
- (It feels like the logic follows session -> post -> validate_whether_in_threshold() -> update dictionary.)
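That session -> post -> threshold-check -> dictionary-update flow might look like the following. The record layout, the session_end signal, and the thresholds are assumptions carried over from the SQL version, not the actual interview harness:

```python
from collections import defaultdict

# Consume log records one at a time; keep open start timestamps per
# (session, post); on a session-end signal, emit how many posts in that
# session were read effectively (>= 5 s and >= 80% occupancy -- assumed
# thresholds).
MIN_SECONDS, MIN_PCT = 5, 0.8

def stream_effective_reads(records):
    open_starts = {}              # (session, post) -> (start_ts, pct)
    effective = defaultdict(set)  # session -> set of effectively read posts
    results = {}                  # session -> count, emitted at session end
    for rec in records:
        if rec["event"] == "session_end":
            results[rec["session"]] = len(effective.pop(rec["session"], set()))
            continue
        key = (rec["session"], rec["post"])
        if rec["event"] == "start":
            open_starts[key] = (rec["ts"], rec["pct"])
        elif rec["event"] == "end" and key in open_starts:
            start_ts, pct = open_starts.pop(key)
            if rec["ts"] - start_ts >= MIN_SECONDS and pct >= MIN_PCT:
                effective[rec["session"]].add(rec["post"])
    return results

log = [
    {"session": 1, "post": 10, "ts": 0,  "event": "start", "pct": 0.9},
    {"session": 1, "post": 10, "ts": 7,  "event": "end",   "pct": 0.9},
    {"session": 1, "post": 20, "ts": 8,  "event": "start", "pct": 0.5},
    {"session": 1, "post": 20, "ts": 20, "event": "end",   "pct": 0.5},
    {"session": 1, "event": "session_end"},
]
result = stream_effective_reads(log)
print(result)  # {1: 1} -- only post 10 qualifies
```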
Tech 02
- Version 01- Photo Upload (similar to Instagram)
- Product sense generally centers on discussing the processing time of the photo upload flow.
- SQL & Python:
- SQL – Calculate average time taken.
- Python coding – same metric as the SQL, but here it must be implemented as a pseudo-streaming process: given a log file, process the data line by line and compute the metric (average time taken).
- Dashboard – data visualization – basically, just briefly introduce how to visualize the data – explain the horizontal and vertical axes and the metrics that need to be visualized.
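The line-by-line Python version reduces to a running sum and count. The `photo_id,upload_start,upload_end` log format is a hypothetical example; the real prompt may carry different fields:

```python
# Pseudo-streaming average of upload duration: read the log line by line,
# keeping only a running total and count rather than loading the file.
def streaming_avg_upload_time(lines):
    total, count = 0, 0
    for line in lines:
        _, start, end = line.strip().split(",")
        total += int(end) - int(start)
        count += 1
    return total / count if count else None

# Assumed log format: photo_id,upload_start,upload_end (epoch seconds).
log = ["p1,100,104", "p2,200,210", "p3,300,303"]
avg = streaming_avg_upload_time(log)
print(avg)  # (4 + 10 + 3) / 3
```

For the dashboard portion, the natural cut is this metric on the vertical axis against time (day/hour) on the horizontal axis, optionally segmented by platform or photo size.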
- Version 02 – FB Messenger
- product sense – DAU/MAU – what factors may cause the DAU/MAU number to decrease
- SQL and Python
- SQL: Given a log table, find how many times each user signed in per day, how many messages they sent, their first sign-in date, how many messages they have sent since that first sign-in date, and whether they are active today.
(The log seems to have user_id, date, send_on_ios, send_on_android)
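A simplified sketch of that aggregation against the noted schema. It treats each log row as one sign-in and collapses the per-day and since-first-sign-in breakdowns into per-user totals; the real prompt likely wants these split out by date as well:

```python
import sqlite3

# Schema as noted above: log(user_id, date, send_on_ios, send_on_android),
# assumed to have one row per sign-in.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log (user_id INTEGER, date TEXT,
                  send_on_ios INTEGER, send_on_android INTEGER);
INSERT INTO log VALUES
  (1, '2021-08-14', 2, 0),
  (1, '2021-08-15', 1, 3),
  (2, '2021-08-15', 0, 5);
""")

today = '2021-08-15'
rows = conn.execute("""
SELECT user_id,
       COUNT(*)                                  AS sign_ins,
       SUM(send_on_ios + send_on_android)        AS messages_sent,
       MIN(date)                                 AS first_sign_in,
       MAX(CASE WHEN date = ? THEN 1 ELSE 0 END) AS active_today
FROM log
GROUP BY user_id
ORDER BY user_id
""", (today,)).fetchall()
print(rows)
```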
- Python: Given several tables, use Python to write SQL statements to update these tables every day.
(Given table A, table B, table C, where table C is SELECT col1, col2 FROM A WHERE condition1 UNION ALL SELECT col3, col4 FROM B WHERE condition2 – it amounts to being given the tables, forming the SQL in Python, and inserting into the DB – pay attention to SQL injection.)
(The tables are similar to logs, broken down by different metrics, with data on device, platform, and location.)
(None of the Python questions require pandas. Simply put, you are given several SQL tables and you use Python to build a string whose content is the SQL statement.)
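A sketch of the daily rebuild of table C from A and B, using sqlite3. The table and column names, the `status` condition columns, and the delete-then-insert refresh strategy are all hypothetical; the point is that condition values go through parameter binding (`?` placeholders) rather than string concatenation, which is the standard defense against SQL injection:

```python
import sqlite3

def rebuild_c(conn, condition1_value, condition2_value):
    # Condition values are bound as parameters, never interpolated into
    # the SQL string, so malicious input cannot alter the statement.
    sql = """
    INSERT INTO c (col1, col2)
    SELECT col1, col2 FROM a WHERE status = ?
    UNION ALL
    SELECT col3, col4 FROM b WHERE status = ?
    """
    conn.execute("DELETE FROM c")  # naive full refresh for the daily run
    conn.execute(sql, (condition1_value, condition2_value))
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (col1 INTEGER, col2 INTEGER, status TEXT);
CREATE TABLE b (col3 INTEGER, col4 INTEGER, status TEXT);
CREATE TABLE c (col1 INTEGER, col2 INTEGER);
INSERT INTO a VALUES (1, 2, 'ok'), (3, 4, 'bad');
INSERT INTO b VALUES (5, 6, 'ok');
""")
rebuild_c(conn, 'ok', 'ok')
c_rows = conn.execute("SELECT * FROM c ORDER BY col1").fetchall()
print(c_rows)  # [(1, 2), (5, 6)]
```

Identifiers (table/column names) cannot be bound as parameters; if they vary at runtime, they should be validated against an allow-list before being formatted into the string.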