1. Business Problem (Industry Scenario)
You are hired as a Data Scientist for an online marketplace similar to Amazon/Flipkart.
The company has 3 major issues
- Customers are silently churning
- Marketing budget is wasted on wrong users
- Revenue forecasting is inaccurate
Management Questions
- Who are our most valuable customers?
- Which customers will churn next month?
- What will be next 30 days revenue?
- Which payment type leads to higher spend?
- Are delivery delays causing churn?
2. Data Source (REAL)
We will use Brazilian Olist E-commerce Dataset
https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce
Sample Database Look,
| Table | Purpose |
|---|---|
| customers | user info |
| orders | purchase header |
| order_items | product level |
| payments | payment type |
| reviews | satisfaction |
| products | category |
3. WHAT YOU MUST BUILD
You must deliver 5 layers
A. SQL Analytics Layer
You must write SQL to produce
- RFM Table
- Cohort Table
- Churn Label
- AOV & LTV
- Payment analysis
Example Expectations
- Customers who ordered > 3 times
- Revenue by state
- Late delivery impact
B. Pandas + Statistics Layer
You must
- Handle missing values
- Remove outliers
- Feature engineering
- Statistical tests
Hypothesis Examples
- Do credit card users spend more than Boleto?
- Does bad review → churn?
- Is delivery delay correlated with rating?
Techniques:
- T-test
- Chi-square
- Correlation
- CLT concepts
C. PySpark Layer
You must convert above to scalable pipeline
- CSV → Parquet
- Window functions
- Aggregations
- Feature store
D. Airflow Layer (Optional)
ingest → clean → transform → train → report
4. What Is Expected From YOU
1. SQL Skills
You should produce
- JOINs
- GROUP BY
- Window functions
- CTE
- Performance thinking
2. Pandas Skills
- EDA notebook
- Clean dataset
- Visuals
- Stats reasoning
3. PySpark Skills
- DataFrame API
- Partitioning
- Optimization
4. Airflow Skills
- DAG
- Operators
- Monitoring
5. Final Deliverables
You must submit
- SQL scripts
- Pandas notebook
- PySpark job
- Airflow DAG
- Report PDF