Project: Investigating Netflix Movies

Netflix! What started in 1997 as a DVD rental service has since exploded into one of the largest entertainment and media companies.

Given the large number of movies and series available on the platform, it is a perfect opportunity to flex your exploratory data analysis skills and dive into the entertainment industry. Your friend has also been brushing up on their Python skills and has taken a first crack at a CSV file containing Netflix data. They believe that the average duration of movies has been declining. Using your friend’s initial research, you’ll delve into the Netflix data to see if you can determine whether movie lengths are actually getting shorter and explain some of the contributing factors, if any.

You have been supplied with the dataset netflix_data.csv, along with the following table detailing the column names and descriptions:

The data

netflix_data.csv

Column        Description
show_id       The ID of the show
type          Type of show
title         Title of the show
director      Director of the show
cast          Cast of the show
country       Country of origin
date_added    Date added to Netflix
release_year  Year of Netflix release
duration      Duration of the show in minutes
description   Description of the show
genre         Show genre

Importing pandas, matplotlib, and seaborn

# Importing pandas, matplotlib, and seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Start coding!

Load the CSV file and store as netflix_df.

#Load the CSV file and store as netflix_df.
netflix_df = pd.read_csv("datasets/netflix_data.csv")
netflix_df.head(2)
show_id type title director cast country date_added release_year duration description genre
0 s1 TV Show 3% NaN João Miguel, Bianca Comparato, Michel Gomes, R... Brazil August 14, 2020 2020 4 In a future where the elite inhabit an island ... International TV
1 s2 Movie 7:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico December 23, 2016 2016 93 After a devastating earthquake hits Mexico Cit... Dramas

Filter the data to remove TV shows and store as netflix_subset

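# Filter the data to remove TV shows and store as netflix_subset
# Bracket indexing is used instead of netflix_df.type so the column lookup cannot be confused with a DataFrame attribute
netflix_subset = netflix_df[netflix_df["type"] != "TV Show"]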
netflix_subset.head()
show_id type title director cast country date_added release_year duration description genre
1 s2 Movie 7:19 Jorge Michel Grau Demián Bichir, Héctor Bonilla, Oscar Serrano, ... Mexico December 23, 2016 2016 93 After a devastating earthquake hits Mexico Cit... Dramas
2 s3 Movie 23:59 Gilbert Chan Tedd Chan, Stella Chung, Henley Hii, Lawrence ... Singapore December 20, 2018 2011 78 When an army recruit is found dead, his fellow... Horror Movies
3 s4 Movie 9 Shane Acker Elijah Wood, John C. Reilly, Jennifer Connelly... United States November 16, 2017 2009 80 In a postapocalyptic world, rag-doll robots hi... Action
4 s5 Movie 21 Robert Luketic Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... United States January 1, 2020 2008 123 A brilliant group of students become card-coun... Dramas
6 s7 Movie 122 Yasir Al Yasiri Amina Khalil, Ahmed Dawood, Tarek Lotfy, Ahmed... Egypt June 1, 2020 2019 95 After an awful accident, a couple admitted to ... Horror Movies

Investigate the Netflix movie data, keeping only the columns “title”, “country”, “genre”, “release_year”, “duration”, and saving this into a new DataFrame called netflix_movies.

#Investigate the Netflix movie data, keeping only the columns "title", "country", "genre", "release_year", "duration", and saving this into a new DataFrame called netflix_movies.
netflix_movies = netflix_subset[["title", "country", "genre", "release_year", "duration"]]
netflix_movies.head(10)
    title  country        genre          release_year  duration
1   7:19   Mexico         Dramas         2016          93
2   23:59  Singapore      Horror Movies  2011          78
3   9      United States  Action         2009          80
4   21     United States  Dramas         2008          123
6   122    Egypt          Horror Movies  2019          95
7   187    United States  Dramas         1997          119
8   706    India          Horror Movies  2019          118
9   1920   India          Horror Movies  2008          143
10  1922   United States  Dramas         2017          103
13  2,215  Thailand       Documentaries  2018          89

Filter netflix_movies to find the movies that are shorter than 60 minutes, saving the resulting DataFrame as short_movies; inspect the result to find possible contributing factors.

#Filter netflix_movies to find the movies that are shorter than 60 minutes, saving the resulting DataFrame as short_movies; inspect the result to find possible contributing factors.
short_movies = netflix_movies[netflix_movies["duration"] < 60]
short_movies.head(2)
print(short_movies.shape)
(420, 5)
short_movies.columns
Index(['title', 'country', 'genre', 'release_year', 'duration'], dtype='object')
sns.countplot(y='genre', data=short_movies)

The four most common genres among the short movies are Documentaries, Children, Stand-Up, and Uncategorized.
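The count plot shows the ranking visually; the same ranking can also be read off as numbers. The following is a minimal sketch, assuming a small made-up DataFrame (`toy_short_movies`, a hypothetical stand-in whose values are illustrative only and are not taken from netflix_data.csv). On the real data the same call would be applied to `short_movies["genre"]`.

```python
import pandas as pd

# Hypothetical toy data standing in for short_movies; not real Netflix rows.
toy_short_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "genre": ["Documentaries", "Children", "Documentaries",
              "Stand-Up", "Children", "Documentaries"],
    "duration": [45, 30, 52, 58, 25, 40],
})

# value_counts() tallies rows per genre and sorts from most to least frequent.
genre_counts = toy_short_movies["genre"].value_counts()

# Keep at most the four most common genres.
top_genres = genre_counts.head(4)
print(top_genres)
```

Reading the counts directly avoids estimating bar lengths by eye when two genres are close.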

Using a for loop and if/elif statements, iterate through the rows of netflix_movies and assign colors of your choice to four genre groups (“Children”, “Documentaries”, “Stand-Up”, and “Other” for everything else). Save the results in a colors list.

# Using a for loop and if/elif statements, iterate through the rows of netflix_movies and assign colors of your choice to four genre groups ("Children", "Documentaries", "Stand-Up", and "Other" for everything else). Save the results in a colors list.
colors = []

for index, row in netflix_movies.iterrows():
    genre = row['genre']
    if genre == "Children":
        colors.append("blue")  # Assigning blue for Children genre
    elif genre == "Documentaries":
        colors.append("green")  # Assigning green for Documentaries genre
    elif genre == "Stand-Up":
        colors.append("yellow")  # Assigning yellow for Stand-Up genre
    else:
        colors.append("red")  # Assigning red for Other genres

#print(colors)
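The exercise asks for an explicit loop, which is what the cell above uses. As a cross-check, the same genre-to-color assignment can be sketched without a loop by mapping a dictionary over the column. This assumes a small made-up DataFrame (`toy_movies`) and the hypothetical names `genre_colors` and `colors_vectorized`; none of these appear in the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for netflix_movies.
toy_movies = pd.DataFrame({
    "genre": ["Children", "Dramas", "Stand-Up", "Documentaries", "Action"],
})

# Same color choices as the loop above; any genre not listed falls back to red.
genre_colors = {
    "Children": "blue",
    "Documentaries": "green",
    "Stand-Up": "yellow",
}

# map() returns NaN for genres missing from the dictionary; fillna() turns those into "red".
colors_vectorized = toy_movies["genre"].map(genre_colors).fillna("red").tolist()
print(colors_vectorized)
```

On the real data, comparing this list with the loop's `colors` list is a quick way to confirm that both approaches agree.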

Initialize a figure object called fig and create a scatter plot of movie duration by release year, using the colors list to color the points, the labels “Release year” for the x-axis and “Duration (min)” for the y-axis, and the title “Movie Duration by Year of Release”.

# Initialize a figure object called fig and create a scatter plot of movie duration by release year, using the colors list to color the points, the labels "Release year" for the x-axis and "Duration (min)" for the y-axis, and the title "Movie Duration by Year of Release".
fig, ax = plt.subplots()
ax.scatter(netflix_movies['release_year'],netflix_movies['duration'],color=colors)
ax.set_xlabel("Release year")
ax.set_ylabel("Duration (min)")
ax.set_title("Movie Duration by Year of Release")

plt.show()
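A scatter plot alone makes it hard to judge whether typical movie length is falling, because a cluster of short titles can pull the average down without feature films getting any shorter. One way to probe this is to compare the mean and median duration per release year. The sketch below assumes a small made-up DataFrame (`toy_movies`) with illustrative values only; `summary` is a hypothetical name, and on the real data the same grouping would be applied to `netflix_movies`.

```python
import pandas as pd

# Hypothetical toy data; the durations are invented to illustrate the effect.
toy_movies = pd.DataFrame({
    "release_year": [2010, 2010, 2020, 2020, 2020],
    "duration": [100, 110, 30, 100, 110],
})

# Per-year mean, median, and number of titles.
summary = (
    toy_movies
    .groupby("release_year")["duration"]
    .agg(["mean", "median", "count"])
)
print(summary)
```

In this toy example the 2020 mean drops to 80 minutes while the median stays at 100, which is the pattern to look for if short documentaries, children's titles, or stand-up specials are driving an apparent decline.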

answer="maybe"