Parottasalna

AI, Backend Engineering & Architecture Guides

MovieLens 360: User Retention, Recommendation & Engagement Analytics

Imagine

You are working as a Data Scientist at “StreamFlix”, a Netflix-like platform.

Management Problems

  1. Users install the app → watch a few movies → disappear
  2. Homepage recommendations are weak
  3. Content team doesn’t know:
    • Which genres retain users
    • What time people watch
  4. Marketing wants:
    • Who to target with offers

DATA SOURCE

MovieLens Dataset (Core)

https://grouplens.org/datasets/movielens/25m/

Tables:

  • ratings.csv
  • movies.csv
  • tags.csv
  • genome-scores.csv

BUSINESS OBJECTIVES

You must build

  1. User churn prediction
  2. Personalized recommendation
  3. Engagement analytics
  4. Automated pipeline

PROBLEM STATEMENT FOR YOU

You must act as a Data Scientist and deliver:

A. Analytics Layer

  • Who are active users?
  • Who are binge watchers?
  • Genre demand
  • Retention curve
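Of the four questions above, the retention curve is the one that most needs a working definition. A minimal pandas sketch, assuming the raw `ratings.csv` column names (`userId`, `timestamp` in Unix seconds) and treating a rating as the only available activity signal. The function name `retention_curve` and the 30-day period length are illustrative choices, not part of the brief.

```python
import pandas as pd

def retention_curve(ratings: pd.DataFrame, max_periods: int = 12,
                    period_days: int = 30) -> pd.Series:
    """Share of all users with at least one rating in each period
    counted from their own first rating (period 0 = first 30 days)."""
    # Each user's first activity, broadcast back onto every row.
    first_ts = ratings.groupby("userId")["timestamp"].transform("min")
    period = (ratings["timestamp"] - first_ts) // (period_days * 86400)

    # One row per (user, period) in which the user was active.
    active = ratings.assign(period=period)[["userId", "period"]].drop_duplicates()

    n_users = ratings["userId"].nunique()
    curve = active.groupby("period")["userId"].nunique() / n_users

    # Fill periods with no active users so the curve has no holes.
    return curve.reindex(range(max_periods + 1), fill_value=0.0)
```

Users who joined shortly before the end of the dataset cannot be observed for many periods, so the tail of this curve understates retention. Restricting the cohort to users with enough observation time is one way to handle that.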

B. ML Layer

  1. ALS Recommendation
  2. Churn Model
  3. User segmentation

C. Engineering Layer

  • Spark processing
  • Airflow automation

DETAILED EXPECTATIONS

PHASE 1 – SQL ANALYTICS

You must load the CSV files into a SQL database

Tables

ratings(user_id, movie_id, rating, ts)
movies(movie_id, title, genres)
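A minimal sketch of the load step, assuming SQLite as the SQL database (the brief does not name one) and the raw MovieLens column order: `userId,movieId,rating,timestamp` in ratings.csv and `movieId,title,genres` in movies.csv. The helper names `load_csv` and `build_db` and the file paths are hypothetical.

```python
import csv
import sqlite3

# Target schema from the brief. SQLite is an assumption; any SQL database works.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ratings (
    user_id  INTEGER,
    movie_id INTEGER,
    rating   REAL,
    ts       INTEGER   -- Unix epoch seconds, as shipped in ratings.csv
);
CREATE TABLE IF NOT EXISTS movies (
    movie_id INTEGER PRIMARY KEY,
    title    TEXT,
    genres   TEXT      -- pipe-separated, e.g. 'Action|Drama'
);
"""

def load_csv(conn, path, table, n_cols):
    """Bulk-insert a CSV file with a header row into `table`."""
    placeholders = ", ".join("?" * n_cols)
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
    conn.commit()

def build_db(db_path, ratings_csv, movies_csv):
    """Create the two tables and fill them from the CSV files."""
    conn = sqlite3.connect(db_path)
    conn.executescript(SCHEMA)
    load_csv(conn, ratings_csv, "ratings", 4)
    load_csv(conn, movies_csv, "movies", 3)
    return conn
```

Usage would look like `conn = build_db("streamflix.db", "ratings.csv", "movies.csv")`. The CSV values arrive as text, and SQLite's column affinity converts them to integers and reals on insert.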

TASKS

1. User Summary

  • total_movies
  • avg_rating
  • last_watch
  • favorite_genre

SELECT
    user_id,
    COUNT(*)    AS total_movies,
    AVG(rating) AS avg_rating,
    MAX(ts)     AS last_watch
    -- favorite_genre needs a join to movies with genres exploded (task 2)
FROM ratings
GROUP BY user_id;

2. Genre Popularity

  • explode genres
  • rating by genre
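One possible way to explode the pipe-separated `genres` column is to materialize a `movie_genres` table with one row per movie and genre, then join it to `ratings`. The sketch below assumes SQLite (3.25 or later for the window function), the `ratings` and `movies` tables from this phase, and defines "favorite genre" as the genre a user rated most often, with ties broken alphabetically. That definition, the table name `movie_genres`, and the function name `explode_genres` are assumptions.

```python
import sqlite3

def explode_genres(conn):
    """Build movie_genres(movie_id, genre): one row per movie-genre pair."""
    conn.execute("DROP TABLE IF EXISTS movie_genres")
    conn.execute("CREATE TABLE movie_genres (movie_id INTEGER, genre TEXT)")
    rows = conn.execute("SELECT movie_id, genres FROM movies").fetchall()
    pairs = [
        (movie_id, genre)
        for movie_id, genres in rows
        for genre in (genres or "").split("|")
        if genre
    ]
    conn.executemany("INSERT INTO movie_genres VALUES (?, ?)", pairs)
    conn.commit()

# Rating volume and average rating per genre.
GENRE_POPULARITY = """
SELECT g.genre,
       COUNT(*)                AS n_ratings,
       ROUND(AVG(r.rating), 2) AS avg_rating
FROM ratings r
JOIN movie_genres g ON g.movie_id = r.movie_id
GROUP BY g.genre
ORDER BY n_ratings DESC, g.genre;
"""

# Most-rated genre per user (assumed definition of favorite_genre).
FAVORITE_GENRE = """
WITH counts AS (
    SELECT r.user_id, g.genre, COUNT(*) AS n
    FROM ratings r
    JOIN movie_genres g ON g.movie_id = r.movie_id
    GROUP BY r.user_id, g.genre
)
SELECT user_id, genre AS favorite_genre
FROM (
    SELECT user_id, genre,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY n DESC, genre) AS rn
    FROM counts
)
WHERE rn = 1
ORDER BY user_id;
"""
```

After `explode_genres(conn)`, both queries run with `conn.execute(GENRE_POPULARITY).fetchall()` and `conn.execute(FAVORITE_GENRE).fetchall()`. A movie tagged with several genres contributes its rating to each of them, so genre counts add up to more than the number of ratings.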

3. Churn Label

Definition:

No activity in the last 30 days = churn
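A sketch of the churn label in SQL, run through Python's `sqlite3`. MovieLens 25M is a historical snapshot, so measuring "the last 30 days" from today would mark every user as churned. This sketch measures it from the latest timestamp in `ratings` instead, which is an assumption; a fixed cutoff date would work just as well. The function name `label_churn` is hypothetical.

```python
import sqlite3

CHURN_LABEL = """
WITH ref AS (
    SELECT MAX(ts) AS ref_ts FROM ratings          -- assumed reference date
),
last_seen AS (
    SELECT user_id, MAX(ts) AS last_active
    FROM ratings
    GROUP BY user_id
)
SELECT l.user_id,
       l.last_active,
       CASE WHEN ref.ref_ts - l.last_active > :window_days * 86400
            THEN 1 ELSE 0 END AS churned
FROM last_seen l
CROSS JOIN ref
ORDER BY l.user_id;
"""

def label_churn(conn, window_days=30):
    """Return (user_id, last_active, churned) rows; ts is in Unix seconds."""
    return conn.execute(CHURN_LABEL, {"window_days": window_days}).fetchall()
```

If this label later feeds a churn model, the features should be computed only from activity before the 30-day window, otherwise the label leaks into the inputs.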

PHASE 2 – PANDAS + EDA

You must:

  1. Convert timestamp → date
  2. Create:
    • watch frequency
    • rating distribution
    • genre split
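The two steps above could be sketched in pandas as follows. This assumes DataFrames read straight from the raw CSVs, so the columns are `userId`, `movieId`, `rating`, `timestamp` and `movieId`, `genres`. "Watch frequency" is interpreted here as ratings per active day, which is an assumption since MovieLens records ratings rather than views. All function names are illustrative.

```python
import pandas as pd

def add_dates(ratings: pd.DataFrame) -> pd.DataFrame:
    """Convert the Unix-seconds timestamp into datetime and date columns."""
    out = ratings.copy()
    out["datetime"] = pd.to_datetime(out["timestamp"], unit="s", utc=True)
    out["date"] = out["datetime"].dt.date
    return out

def watch_frequency(ratings: pd.DataFrame) -> pd.DataFrame:
    """Per-user rating count, active days, and ratings per active day.
    Expects the `date` column produced by add_dates."""
    grouped = ratings.groupby("userId")
    freq = pd.DataFrame({
        "n_ratings": grouped.size(),
        "active_days": grouped["date"].nunique(),
    })
    freq["ratings_per_active_day"] = freq["n_ratings"] / freq["active_days"]
    return freq

def rating_distribution(ratings: pd.DataFrame) -> pd.Series:
    """Share of ratings at each rating value, ordered by value."""
    return ratings["rating"].value_counts(normalize=True).sort_index()

def genre_split(ratings: pd.DataFrame, movies: pd.DataFrame) -> pd.Series:
    """Share of rating events per genre (multi-genre movies count once per genre)."""
    exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
    merged = ratings.merge(exploded[["movieId", "genre"]], on="movieId")
    return merged["genre"].value_counts(normalize=True)
```

A typical notebook flow would be `ratings = add_dates(pd.read_csv("ratings.csv"))` followed by the three summaries, each of which plots directly with `.plot(kind="bar")` or `.hist()`.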

Questions to Answer

  1. Do highly rated movies increase engagement?
  2. Which genre retains users?
  3. Are weekends different?
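For the weekend question, one simple starting point is to compare rating volume per day and average rating between weekdays and weekends. The sketch assumes the raw `timestamp` column in Unix seconds and interprets it in UTC; users' local time zones are not in the data, so the weekday assignment is approximate. The function name `weekend_summary` is hypothetical.

```python
import pandas as pd

def weekend_summary(ratings: pd.DataFrame) -> pd.DataFrame:
    """Compare weekday vs weekend activity: volume per day and mean rating."""
    ts = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)
    # dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
    day_type = (ts.dt.dayofweek >= 5).map({True: "weekend", False: "weekday"})

    labelled = ratings.assign(date=ts.dt.date, day_type=day_type)
    summary = labelled.groupby("day_type").agg(
        n_ratings=("rating", "size"),
        n_days=("date", "nunique"),
        avg_rating=("rating", "mean"),
    )
    # Normalize by the number of distinct days, since there are
    # more weekdays than weekend days in any date range.
    summary["ratings_per_day"] = summary["n_ratings"] / summary["n_days"]
    return summary
```

Whether an observed difference is meaningful is a question for the statistics phase; this table only shows where to look.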

PHASE 3 – STATISTICS

You MUST perform

  1. T-Test: Action vs Drama watch time difference
  2. ANOVA across genres
  3. Correlation
    • rating vs frequency
    • recency vs rating
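A sketch of the three tests with `scipy.stats`. MovieLens contains ratings, not watch time, so this example compares ratings per genre as a stand-in; that substitution is an assumption and should be stated in the report. It also uses Welch's t-test (no equal-variance assumption) and Spearman rank correlation, both of which are choices made here rather than requirements of the brief. Column names follow the raw CSVs, and all function names are illustrative.

```python
import pandas as pd
from scipy import stats

def ratings_by_genre(ratings: pd.DataFrame, movies: pd.DataFrame) -> dict:
    """Map each genre to the array of ratings given to movies in that genre."""
    exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
    merged = ratings.merge(exploded[["movieId", "genre"]], on="movieId")
    return {genre: grp["rating"].to_numpy() for genre, grp in merged.groupby("genre")}

def action_vs_drama(groups: dict):
    """Welch's two-sample t-test on Action vs Drama ratings."""
    return stats.ttest_ind(groups["Action"], groups["Drama"], equal_var=False)

def anova_across_genres(groups: dict, min_n: int = 2):
    """One-way ANOVA across all genres with at least `min_n` ratings."""
    samples = [values for values in groups.values() if len(values) >= min_n]
    return stats.f_oneway(*samples)

def user_correlations(ratings: pd.DataFrame) -> dict:
    """Spearman correlations on per-user aggregates."""
    ref_ts = ratings["timestamp"].max()  # assumed reference point for recency
    per_user = ratings.groupby("userId").agg(
        frequency=("rating", "size"),
        avg_rating=("rating", "mean"),
        last_ts=("timestamp", "max"),
    )
    per_user["recency_days"] = (ref_ts - per_user["last_ts"]) / 86400
    return {
        "rating_vs_frequency": stats.spearmanr(per_user["avg_rating"], per_user["frequency"]),
        "recency_vs_rating": stats.spearmanr(per_user["recency_days"], per_user["avg_rating"]),
    }
```

Ratings from the same user are not independent observations, and with 25 million rows almost any difference will produce a tiny p-value. Reporting effect sizes next to the p-values, or testing on per-user averages, keeps the conclusions honest.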

PHASE 4 – PYSPARK

Build Scalable Pipeline

  1. Sessionization
from pyspark.sql import Window
Window.partitionBy("userId").orderBy("timestamp")
  2. Features
    • recency
    • frequency
    • diversity
    • avg rating
  3. ALS Model
from pyspark.ml.recommendation import ALS


PHASE 5 – AIRFLOW

DAG

1. ingest_csv
2. clean
3. features_spark
4. train_als
5. churn_dataset
6. report

Schedule: Daily

WHAT YOU MUST SUBMIT

1. SQL

  • 8 queries
  • churn label
  • genre analytics

2. Pandas

  • EDA notebook
  • 5 insights

3. Stats

  • 2 hypothesis tests

4. Spark

  • ALS recommender
  • feature job

5. Airflow

  • DAG
