Imagine
You are working as a Data Scientist at
“StreamFlix” – a Netflix-like streaming platform
Management Problems
- Users install the app → watch a few movies → disappear
- Homepage recommendations are weak
- The content team doesn’t know:
- Which genres retain users
- What time people watch
- Marketing wants to know:
- Whom to target with offers
DATA SOURCE
MovieLens Dataset (Core)
https://grouplens.org/datasets/movielens/25m/
Tables:
- ratings.csv
- movies.csv
- tags.csv
- genome-scores.csv
BUSINESS OBJECTIVES
You must build
- User churn prediction
- Personalized recommendation
- Engagement analytics
- Automated pipeline
PROBLEM STATEMENT FOR YOU
You must act as a Data Scientist and deliver:
A. Analytics Layer
- Who are active users?
- Who are binge watchers?
- Genre demand
- Retention curve
B. ML Layer
- ALS Recommendation
- Churn Model
- User segmentation
C. Engineering Layer
- Spark processing
- Airflow automation
DETAILED EXPECTATIONS
PHASE 1 – SQL ANALYTICS
You must load the CSV files into a SQL database
Tables
ratings(user_id, movie_id, rating, ts)
movies(movie_id, title, genres)
TASKS
1. User Summary
- total_movies
- avg_rating
- last_watch
- favorite_genre
SELECT user_id,
       COUNT(*) AS movies_watched,
       AVG(rating) AS avg_rating,
       MAX(ts)    AS last_active
FROM ratings
GROUP BY user_id;
(favorite_genre additionally requires joining movies and exploding the genres column.)
2. Genre Popularity
- explode genres
- rating by genre
3. Churn Label
Definition:
No activity in last 30 days = churn
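The churn label (Task 3) can be sketched end to end with Python’s built-in sqlite3. The rows, the reference date, and the Unix-second timestamps are illustrative assumptions; the CASE expression is the part you would reuse against the real ratings table:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating REAL, ts INTEGER)"
)

# Toy rows: an assumed reference date, timestamps as Unix seconds (as in MovieLens).
now = datetime(2024, 6, 30)
rows = [
    (1, 10, 4.0, int((now - timedelta(days=5)).timestamp())),   # active recently
    (2, 11, 3.5, int((now - timedelta(days=90)).timestamp())),  # inactive -> churned
]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?, ?)", rows)

# Churn definition from above: no activity in the 30 days before the reference date.
cutoff = int((now - timedelta(days=30)).timestamp())
labels = conn.execute("""
    SELECT user_id,
           CASE WHEN MAX(ts) < ? THEN 1 ELSE 0 END AS churned
    FROM ratings
    GROUP BY user_id
""", (cutoff,)).fetchall()
```

Keeping the label as a SQL view makes it reusable by both the analytics queries and the churn-model training set later on.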
PHASE 2 – PANDAS + EDA
You must:
- Convert timestamp → date
- Create:
- watch frequency
- rating distribution
- genre split
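The steps above can be sketched in pandas on toy MovieLens-shaped frames — the column names follow the Phase 1 schema, but the data itself is illustrative:

```python
import pandas as pd

# Toy frames standing in for ratings.csv and movies.csv.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3],
    "movie_id": [10, 11, 10, 12, 11],
    "rating":   [4.0, 5.0, 3.0, 4.5, 2.0],
    "ts":       [1700000000, 1700086400, 1700172800, 1700259200, 1700345600],
})
movies = pd.DataFrame({
    "movie_id": [10, 11, 12],
    "title":    ["A", "B", "C"],
    "genres":   ["Action|Thriller", "Drama", "Action|Comedy"],
})

# Timestamp -> date (MovieLens ts is Unix seconds).
ratings["date"] = pd.to_datetime(ratings["ts"], unit="s").dt.date

# Watch frequency per user.
watch_freq = ratings.groupby("user_id").size().rename("watch_frequency")

# Rating distribution.
rating_dist = ratings["rating"].value_counts().sort_index()

# Genre split: explode the pipe-delimited genres, then average rating per genre.
exploded = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")
genre_ratings = (ratings.merge(exploded[["movie_id", "genre"]], on="movie_id")
                        .groupby("genre")["rating"].mean())
```

The same exploded frame also answers the weekend question: add `pd.to_datetime(ratings["ts"], unit="s").dt.dayofweek >= 5` as a flag and group by it.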
Questions to Answer
- Do highly rated movies increase engagement?
- Which genre retains users?
- Are weekends different?
PHASE 3 – STATISTICS
You MUST perform:
- T-Test
Action vs Drama watch time difference
- ANOVA across genres
- Correlation
- rating vs frequency
- recency vs rating
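The three analyses above map directly onto scipy.stats calls. A sketch on synthetic watch-time data — the distributions, sample sizes, and effect sizes are assumptions for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic per-user weekly watch times (hours), standing in for real data.
action = rng.normal(6.0, 1.5, 200)   # Action watchers
drama  = rng.normal(5.5, 1.5, 200)   # Drama watchers
comedy = rng.normal(5.5, 1.5, 200)   # Comedy watchers

# T-test: Action vs Drama watch-time difference.
t_stat, t_p = stats.ttest_ind(action, drama)

# One-way ANOVA across genres.
f_stat, f_p = stats.f_oneway(action, drama, comedy)

# Correlation: rating vs watch frequency (Pearson).
freq = rng.poisson(20, 200)
rating = 2.5 + 0.05 * freq + rng.normal(0, 0.3, 200)
r, r_p = stats.pearsonr(freq, rating)
```

Report each test as statistic, p-value, and a plain-language conclusion; the recency-vs-rating correlation follows the same `pearsonr` pattern with a recency column.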
PHASE 4 – PYSPARK
Build a Scalable Pipeline
- Sessionization
Window.partitionBy("userId").orderBy("timestamp")
- Features
- recency
- frequency
- diversity
- avg rating
- ALS Model
from pyspark.ml.recommendation import ALS
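Before scaling this out on Spark, the three steps can be sketched on toy data: pandas stands in for the `Window.partitionBy` session logic, and a small numpy loop illustrates the alternating least-squares idea behind pyspark’s ALS. The 30-minute session gap, rank, and regularization values are assumptions:

```python
import numpy as np
import pandas as pd

# Toy ratings log standing in for ratings.csv.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 3],
    "movie_id": [10, 11, 12, 10, 12, 11],
    "rating":   [4.0, 5.0, 3.0, 2.0, 4.0, 5.0],
    "ts": pd.to_datetime([
        "2024-01-01 20:00", "2024-01-01 20:10", "2024-01-05 21:00",
        "2024-01-02 19:00", "2024-01-03 19:00", "2024-01-04 18:00",
    ]),
})

# Sessionization: pandas analogue of Window.partitionBy("userId").orderBy("timestamp").
# A new session starts when the gap since the user's previous event exceeds 30 min.
ratings = ratings.sort_values(["user_id", "ts"])
gap = ratings.groupby("user_id")["ts"].diff() > pd.Timedelta(minutes=30)
ratings["session_id"] = gap.groupby(ratings["user_id"]).cumsum()

# Per-user features: recency, frequency, diversity, average rating.
now = ratings["ts"].max()
features = ratings.groupby("user_id").agg(
    recency_days=("ts", lambda s: (now - s.max()).days),
    frequency=("movie_id", "size"),
    diversity=("movie_id", "nunique"),
    avg_rating=("rating", "mean"),
)

# Minimal ALS: alternate closed-form least-squares solves for user and item
# factors on the observed entries (Spark's ALS distributes the same idea).
R = (ratings.pivot_table(index="user_id", columns="movie_id", values="rating")
            .fillna(0).to_numpy())
mask = R > 0
k, lam, rng = 2, 0.1, np.random.default_rng(0)
U = rng.normal(size=(R.shape[0], k))
V = rng.normal(size=(R.shape[1], k))
for _ in range(20):
    for i in range(R.shape[0]):      # solve each user's factors, V fixed
        Vi = V[mask[i]]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(k), Vi.T @ R[i, mask[i]])
    for j in range(R.shape[1]):      # solve each item's factors, U fixed
        Uj = U[mask[:, j]]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(k), Uj.T @ R[mask[:, j], j])
rmse = np.sqrt(((R - U @ V.T)[mask] ** 2).mean())
```

On the full 25M dataset the same feature logic runs as Spark window/group aggregations, and the numpy loop is replaced by `ALS(userCol="user_id", itemCol="movie_id", ratingCol="rating")`.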
PHASE 5 – AIRFLOW
DAG
1. ingest_csv
2. clean
3. features_spark
4. train_als
5. churn_dataset
6. report
Schedule: Daily
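The DAG above is a linear chain. Before wiring it into Airflow, the ordering can be sketched in plain Python — the task names mirror the DAG steps, and the bodies are placeholders for the real SQL/Spark jobs from the earlier phases:

```python
# Linear pipeline sketch: each task consumes its predecessor's output.
# Bodies are placeholders; in Airflow each becomes an operator/task.
def ingest_csv(_):      return "raw"
def clean(data):        return f"clean({data})"
def features_spark(d):  return f"features({d})"
def train_als(d):       return f"als({d})"
def churn_dataset(d):   return f"churn({d})"
def report(d):          return f"report({d})"

PIPELINE = [ingest_csv, clean, features_spark, train_als, churn_dataset, report]

def run_daily():
    """Run tasks in DAG order, threading each result into the next task."""
    result = None
    for task in PIPELINE:
        result = task(result)
    return result

artifact = run_daily()
```

In the Airflow DAG the same chain is expressed with `>>` between tasks and `schedule="@daily"`.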
WHAT YOU MUST SUBMIT
1. SQL
- 8 queries
- churn label
- genre analytics
2. Pandas
- EDA notebook
- 5 insights
3. Stats
- 2 hypothesis tests
4. Spark
- ALS recommender
- feature job
5. Airflow
- DAG