This is an open-ended, hands-on machine learning coding exercise. You will train a binary classifier using scikit-learn on a toy dataset and iteratively improve the model's evaluation metrics.
The interviewer expects you to:
- Choose an appropriate toy dataset from scikit-learn
- Select and train a binary classifier
- Evaluate the model using appropriate metrics
- Identify and implement strategies to improve performance
- Explain your reasoning at each step
This question tests your practical knowledge of the ML workflow, understanding of evaluation metrics, and ability to iterate on model improvements.
- Start with a simple baseline and iteratively improve
- Explain your choices for metrics, models, and hyperparameters
- Be prepared to discuss trade-offs and production considerations
- Write clean, reproducible code with proper train/test splits
You can choose from any of these binary-friendly datasets:
```python
from sklearn.datasets import (
    load_breast_cancer,   # Binary: malignant/benign, 569 samples, 30 features
    load_iris,            # Multi-class, but take 2 classes for binary
    make_classification,  # Synthetic with controllable difficulty
    make_moons,           # Non-linearly separable, good for testing
    make_circles,         # Concentric circles, tests non-linear models
)
```
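For instance, `load_iris` is three-class, but filtering to the first two classes yields a clean binary problem (a quick sketch):

```python
from sklearn.datasets import load_iris

# load_iris has 3 classes; keep only setosa (0) and versicolor (1)
X, y = load_iris(return_X_y=True)
mask = y < 2
X_bin, y_bin = X[mask], y[mask]

print(X_bin.shape)         # 100 samples remain (50 per class)
print(sorted(set(y_bin)))  # [0, 1]
```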
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
- Use stratified splitting to preserve class distribution
- Scale features using StandardScaler (fit on train, transform on test)
- Never fit the scaler on test data (data leakage)
- Set random_state for reproducibility
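One way to make the no-leakage rule automatic is to put scaling inside a `Pipeline`, so cross-validation refits the scaler on each training fold. A minimal sketch, assuming the breast-cancer dataset chosen above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline re-fits StandardScaler on each CV training fold,
# so the held-out fold never influences the scaling parameters
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
print(f"5-fold F1: {scores.mean():.4f} +/- {scores.std():.4f}")
```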
Common Classifiers

| Model | Strengths | When to Use |
|---|---|---|
| Logistic Regression | Fast, interpretable, good baseline | Linear relationships |
| Random Forest | Handles non-linearity, feature interactions | Mixed features, no scaling needed |
| SVM | Works well in high dimensions | Clean data, smaller datasets |
| Gradient Boosting | High accuracy, handles imbalance | When accuracy is critical |
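A quick way to compare these candidates is to fit each one on the same split and report a common metric. This is only a sketch with default hyperparameters; tree-based models don't need scaling, but they tolerate it, so the scaled features are reused for all four:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    results[name] = f1_score(y_test, model.predict(X_test_s))
    print(f"{name:20s} F1: {results[name]:.4f}")
```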
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.4f}")
```
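To iterate beyond this baseline, one common step is tuning the regularization strength `C` with cross-validated grid search. A sketch (the grid values here are illustrative, not prescribed by the exercise):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000, random_state=42)
)
# Pipeline param names are "<step name>__<param>";
# make_pipeline names the step after the lowercased class
grid = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)
print("best C:", grid.best_params_["logisticregression__C"])
print(f"test F1: {grid.score(X_test, y_test):.4f}")
```

Explaining why you chose the grid and the scoring metric is exactly the kind of reasoning the interviewer is listening for.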
- Accuracy: good for balanced classes, misleading for imbalanced data
- Precision: important when false positives are costly (e.g., spam detection)
- Recall: important when false negatives are costly (e.g., cancer detection)
- F1 Score: balanced metric when both precision and recall matter
- AUC-ROC: good for ranking and threshold-agnostic evaluation
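AUC-ROC needs scores rather than hard 0/1 labels, so it is computed from `predict_proba` (or `decision_function`). A sketch continuing the logistic-regression baseline above, with a confusion matrix to ground the precision/recall discussion:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(scaler.fit_transform(X_train), y_train)

# AUC uses the predicted probability of the positive class,
# not the thresholded 0/1 predictions
y_score = model.predict_proba(scaler.transform(X_test))[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, y_score):.4f}")
print(confusion_matrix(y_test, model.predict(scaler.transform(X_test))))
```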