Dataset Join

Problem Overview

You are building a reconciliation tool that joins two CSV-style datasets — one coming from Stripe's internal customer ledger, the other from an external payment processor — so downstream analytics can compare and match records between the two datasets.

Problem Statement

Given two CSV files, ledger.csv and processor.csv, implement a function that joins these two datasets on a common key (e.g., customer ID). The function should return a new CSV file with the combined records from both datasets.

Constraints

Both CSV files have a header row.
The common key column is named customer_id in both datasets.
The datasets may have different numbers of records.
The datasets may have different additional columns.

Examples

Example 1

Input: ` ledger.csv: customer_id,amount,date 1,100,2022-01-01 2,200,2022-01-02 3,300,2022-01-03

processor.csv: customer_id,transaction_id,amount 1,101,150 2,102,250 3,103,350 `

Output: customer_id,amount,date,transaction_id 1,100,2022-01-01,101 2,200,2022-01-02,102 3,300,2022-01-03,103

Example 2

Input: ` ledger.csv: customer_id,amount,date 1,100,2022-01-01 2,200,2022-01-02

processor.csv: customer_id,transaction_id,amount 1,101,150 2,102,250 3,103,350 `

Output: customer_id,amount,date,transaction_id 1,100,2022-01-01,101 2,200,2022-01-02,102

Hints

Consider using a hash table to store records from one dataset for efficient lookups when processing the other dataset.
Handle cases where a customer_id may not be present in both datasets.

Solution

Here's a sample Python solution using the pandas library:

`python import pandas as pd

def join_datasets(ledger_path, processor_path): # Read the CSV files into DataFrames ledger_df = pd.read_csv(ledger_path) processor_df = pd.read_csv(processor_path)

# Join the DataFrames on the 'customer_id' column
result_df = pd.merge(ledger_df, processor_df, on='customer_id', how='inner')

# Save the result to a new CSV file
result_df.to_csv('joined.csv', index=False)

Example usage

join_datasets('ledger.csv', 'processor.csv') `

This solution reads the input CSV files into pandas DataFrames, joins them on the customer_id column, and saves the result to a new CSV file. The how='inner' parameter ensures that only matching records are included in the output.

Dataset Join

Problem Overview

Problem Statement

Constraints

Both CSV files have a header row.
The common key column is named customer_id in both datasets.
The datasets may have different numbers of records.
The datasets may have different additional columns.

Examples

Example 1

Input: ` ledger.csv: customer_id,amount,date 1,100,2022-01-01 2,200,2022-01-02 3,300,2022-01-03

processor.csv: customer_id,transaction_id,amount 1,101,150 2,102,250 3,103,350 `

Output: customer_id,amount,date,transaction_id 1,100,2022-01-01,101 2,200,2022-01-02,102 3,300,2022-01-03,103

Example 2

Input: ` ledger.csv: customer_id,amount,date 1,100,2022-01-01 2,200,2022-01-02

processor.csv: customer_id,transaction_id,amount 1,101,150 2,102,250 3,103,350 `

Output: customer_id,amount,date,transaction_id 1,100,2022-01-01,101 2,200,2022-01-02,102

Hints

Consider using a hash table to store records from one dataset for efficient lookups when processing the other dataset.
Handle cases where a customer_id may not be present in both datasets.

Solution

Here's a sample Python solution using the pandas library:

`python import pandas as pd

def join_datasets(ledger_path, processor_path): # Read the CSV files into DataFrames ledger_df = pd.read_csv(ledger_path) processor_df = pd.read_csv(processor_path)

# Join the DataFrames on the 'customer_id' column
result_df = pd.merge(ledger_df, processor_df, on='customer_id', how='inner')

# Save the result to a new CSV file
result_df.to_csv('joined.csv', index=False)

Example usage

join_datasets('ledger.csv', 'processor.csv') `

OA - Dataset Join

Dataset Join

Problem Overview

Problem Statement

Constraints

Examples

Example 1

Example 2

Hints

Solution

Example usage

OA - Dataset Join

Dataset Join

Problem Overview

Problem Statement

Constraints

Examples

Example 1

Example 2

Hints

Solution

Example usage