Imagine
You are working as a Data Scientist at
“StreamFlix” – a Netflix-like streaming platform
Management Problems
- Users install the app → watch a few movies → disappear
- Homepage recommendations are weak
- The content team doesn’t know:
- Which genres retain users
- What time people watch
- Marketing wants to know:
- Whom to target with offers
DATA SOURCE
MovieLens Dataset (Core)
https://grouplens.org/datasets/movielens/25m/
Tables:
- ratings.csv
- movies.csv
- tags.csv
- genome-scores.csv
BUSINESS OBJECTIVES
You must build
- User churn prediction
- Personalized recommendation
- Engagement analytics
- Automated pipeline
PROBLEM STATEMENT FOR YOU
You must act as a Data Scientist and deliver:
A. Analytics Layer
- Who are active users?
- Who are binge watchers?
- Genre demand
- Retention curve
B. ML Layer
- ALS Recommendation
- Churn Model
- User segmentation
C. Engineering Layer
- Spark processing
- Airflow automation
DETAILED EXPECTATIONS
PHASE 1 – SQL ANALYTICS
You must load the CSV files into a SQL database
Tables
ratings(user_id, movie_id, rating, ts)
movies(movie_id, title, genres)
TASKS
1. User Summary
- total_movies
- avg_rating
- last_watch
- favorite_genre
SELECT user_id,
       COUNT(*) AS movies_watched,
       AVG(rating) AS avg_rating,
       MAX(ts)    AS last_active
FROM ratings
GROUP BY user_id;
(favorite_genre additionally requires joining movies and exploding the genres column.)
2. Genre Popularity
- explode genres
- rating by genre
3. Churn Label
Definition:
No activity in last 30 days = churn
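The churn label (Task 3) can be sketched end to end with Python’s built-in sqlite3. The rows, the reference date, and the Unix-second timestamps are illustrative assumptions; the CASE expression is the part you would reuse against the real ratings table:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL, ts INTEGER)"
)

# Toy rows: an assumed reference date, timestamps as Unix seconds (as in MovieLens).
now = datetime(2024, 6, 30)
rows = [
    (1, 10, 4.0, int((now - timedelta(days=5)).timestamp())),   # active recently
    (2, 11, 3.5, int((now - timedelta(days=90)).timestamp())),  # inactive -> churned
]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)

# Churn definition from above: no activity in the 30 days before the reference date.
cutoff = int((now - timedelta(days=30)).timestamp())
labels = conn.execute("""
    SELECT user_id,
           CASE WHEN MAX(ts) < ? THEN 1 ELSE 0 END AS churned
    FROM ratings
    GROUP BY user_id
""", (cutoff,)).fetchall()
```

Keeping the label as a SQL view makes it reusable by both the analytics queries and the churn-model training set later on.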
PHASE 2 – PANDAS + EDA
You must:
- Convert timestamp → date
- Create:
- watch frequency
- rating distribution
- genre split
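The steps above can be sketched in pandas on toy MovieLens-shaped frames — the column names follow the Phase 1 schema, but the data itself is illustrative:

```python
import pandas as pd

# Toy frames standing in for ratings.csv and movies.csv.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 11, 10, 12, 11],
    "rating":   [4.0, 5.0, 3.0, 4.5, 2.0],
    "ts":       [1700000000, 1700086400, 1700172800, 1700259200, 1700345600],
})
movies = pd.DataFrame({
    "movie_id": [10, 11, 12],
    "title":    ["A", "B", "C"],
    "genres":   ["Action|Thriller", "Drama", "Action|Comedy"],
})

# Timestamp -> date (MovieLens ts is Unix seconds).
ratings["date"] = pd.to_datetime(ratings["ts"], unit="s").dt.date

# Watch frequency per user.
watch_freq = ratings.groupby("user_id").size().rename("watch_frequency")

# Rating distribution.
rating_dist = ratings["rating"].value_counts().sort_index()

# Genre split: explode the pipe-delimited genres, then average rating per genre.
exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
genre_ratings = (ratings.merge(exploded[["movie_id", "genre"]], on="movie_id")
                        .groupby("genre")["rating"].mean())
```

The same exploded frame also answers the weekend question: add `pd.to_datetime(ratings["ts"], unit="s").dt.dayofweek >= 5` as a flag and group by it.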
Questions to Answer
- Do highly rated movies increase engagement?
- Which genre retains users?
- Are weekends different?
PHASE 3 – STATISTICS
You MUST perform:
- T-Test
Action vs Drama watch time difference
- ANOVA across genres
- Correlation
- rating vs frequency
- recency vs rating
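The three analyses above map directly onto scipy.stats calls. A sketch on synthetic watch-time data — the distributions, sample sizes, and effect sizes are assumptions for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-user weekly watch times (hours), standing in for real data.
action = rng.normal(6.0, 1.5, 200)   # Action watchers
drama  = rng.normal(5.5, 1.5, 200)   # Drama watchers
comedy = rng.normal(5.5, 1.5, 200)   # Comedy watchers

# T-test: Action vs Drama watch-time difference.
t_stat, t_p = stats.ttest_ind(action, drama)

# One-way ANOVA across genres.
f_stat, f_p = stats.f_oneway(action, drama, comedy)

# Correlation: rating vs watch frequency (Pearson).
freq = rng.poisson(20, 200)
rating = 2.5 + 0.05 * freq + rng.normal(0, 0.3, 200)
r, r_p = stats.pearsonr(freq, rating)
```

Report each test as statistic, p-value, and a plain-language conclusion; the recency-vs-rating correlation follows the same `pearsonr` pattern with a recency column.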
PHASE 4 – PYSPARK
Build a Scalable Pipeline
- Sessionization
Window.partitionBy("userId").orderBy("timestamp")
- Features
- recency
- frequency
- diversity
- avg rating
- ALS Model
from pyspark.ml.recommendation import ALS
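Before scaling this out on Spark, the three steps can be sketched on toy data: pandas stands in for the `Window.partitionBy` session logic, and a small numpy loop illustrates the alternating least-squares idea behind pyspark’s ALS. The 30-minute session gap, rank, and regularization values are assumptions:

```python
import numpy as np
import pandas as pd

# Toy ratings log standing in for ratings.csv.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "movie_id": [10, 11, 12, 10, 12, 11],
    "rating":   [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
    "ts": pd.to_datetime([
        "2024-01-01 20:00", "2024-01-01 20:10", "2024-01-05 21:00",
        "2024-01-02 19:00", "2024-01-03 19:00", "2024-01-04 18:00",
    ]),
})

# Sessionization: pandas analogue of Window.partitionBy("userId").orderBy("timestamp").
# A new session starts when the gap since the user's previous event exceeds 30 min.
ratings = ratings.sort_values(["user_id", "ts"])
gap = ratings.groupby("user_id")["ts"].diff() > pd.Timedelta(minutes=30)
ratings["session_id"] = gap.groupby(ratings["user_id"]).cumsum()

# Per-user features: recency, frequency, diversity, average rating.
now = ratings["ts"].max()
features = ratings.groupby("user_id").agg(
    recency_days=("ts", lambda s: (now - s.max()).days),
    frequency=("movie_id", "size"),
    diversity=("movie_id", "nunique"),
    avg_rating=("rating", "mean"),
)

# Minimal ALS: alternate closed-form least-squares solves for user and item
# factors on the observed entries (Spark's ALS distributes the same idea).
R = (ratings.pivot_table(index="user_id", columns="movie_id", values="rating")
            .fillna(0).to_numpy())
mask = R > 0
k, lam, rng = 2, 0.1, np.random.default_rng(0)
U = rng.normal(size=(R.shape[0], k))
V = rng.normal(size=(R.shape[1], k))
for _ in range(20):
    for i in range(R.shape[0]):      # solve each user's factors, V fixed
        Vi = V[mask[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(k), Vi.T @ R[i, mask[i]])
    for j in range(R.shape[1]):      # solve each item's factors, U fixed
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(k), Uj.T @ R[mask[:, j], j])
rmse = np.sqrt(((R - U @ V.T)[mask] ** 2).mean())
```

On the full 25M dataset the same feature logic runs as Spark window/group aggregations, and the numpy loop is replaced by `ALS(userCol="user_id", itemCol="movie_id", ratingCol="rating")`.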
PHASE 5 – AIRFLOW
DAG
1. ingest_csv
2. clean
3. features_spark
4. train_als
5. churn_dataset
6. report
Schedule: Daily
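The DAG above is a linear chain. Before wiring it into Airflow, the ordering can be sketched in plain Python — the task names mirror the DAG steps, and the bodies are placeholders for the real SQL/Spark jobs from the earlier phases:

```python
# Linear pipeline sketch: each task consumes its predecessor's output.
# Bodies are placeholders; in Airflow each becomes an operator/task.
def ingest_csv(_):      return "raw"
def clean(data):        return f"clean({data})"
def features_spark(d):  return f"features({d})"
def train_als(d):       return f"als({d})"
def churn_dataset(d):   return f"churn({d})"
def report(d):          return f"report({d})"

PIPELINE = [ingest_csv, clean, features_spark, train_als, churn_dataset, report]

def run_daily():
    """Run tasks in DAG order, threading each result into the next task."""
    result = None
    for task in PIPELINE:
        result = task(result)
    return result

artifact = run_daily()
```

In the Airflow DAG the same chain is expressed with `>>` between tasks and `schedule="@daily"`.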
WHAT YOU MUST SUBMIT
1. SQL
- 8 queries
- churn label
- genre analytics
2. Pandas
- EDA notebook
- 5 insights
3. Stats
- 2 hypothesis tests
4. Spark
- ALS recommender
- feature job
5. Airflow
- DAG