Cleaning Data in Python
# Import the course packages
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import missingno as msno
import fuzzywuzzy
import recordlinkage
# Import the course datasets
ride_sharing = pd.read_csv('datasets/ride_sharing_new.csv', index_col = 'Unnamed: 0')
airlines = pd.read_csv('datasets/airlines_final.csv', index_col = 'Unnamed: 0')
banking = pd.read_csv('datasets/banking_dirty.csv', index_col = 'Unnamed: 0')
restaurants = pd.read_csv('datasets/restaurants_L2.csv', index_col = 'Unnamed: 0')
restaurants_new = pd.read_csv('datasets/restaurants_L2_dirty.csv', index_col = 'Unnamed: 0')
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!

- For each DataFrame, inspect the data types of each column and, where needed, clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.
- Identify and remove all the duplicate rows in ride_sharing.
- Inspect the unique values of all the columns in airlines and clean any inconsistencies.
- For the airlines DataFrame, create a new column called International from dest_region, where values representing US regions map to False and all other regions map to True (a rough sketch of this and the Age step appears after this list).
- The banking DataFrame contains out-of-date ages. Update the Age column using today's date and the birth_date column.
- Clean the restaurants_new DataFrame so that it better matches the categories in the city and type columns of the restaurants DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of rows between restaurants and restaurants_new using whichever criteria you think are best.
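A rough sketch of the International and Age tasks, assuming the cleaned dest_region labels shown later in this notebook and a birth_date column that pandas can parse as dates (not a graded solution):

# Sketch: flag non-US destinations as International
us_regions = ['west us', 'east us', 'midwest us']   # assumed set of US labels
airlines['International'] = ~airlines['dest_region'].str.lower().str.strip().isin(us_regions)

# Sketch: refresh Age from birth_date using today's date (simple year difference)
today = dt.date.today()
banking['Age'] = today.year - pd.to_datetime(banking['birth_date']).dt.year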
Chapter 1: Common data problems
In this chapter, you’ll learn how to overcome some of the most common dirty data problems. You’ll convert data types, apply range constraints to remove future data points, and remove duplicated data points to avoid double-counting.
Numeric data or … ?
In this exercise, and throughout this chapter, you’ll be working with bicycle ride sharing data in San Francisco called ride_sharing. It contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.
The user_type column contains information on whether a user is taking a free ride and takes on the following values:
1 for free riders.
2 for pay per ride.
3 for monthly subscribers.
In this instance, you will print the information of ride_sharing using .info() and see a firsthand example of how an incorrect data type can compromise your analysis of the dataset. The pandas package is imported as pd. ### Instructions - Print the information of ride_sharing. - Use .describe() to print the summary statistics of the user_type column from ride_sharing. - Convert user_type to the 'category' data type, store it in the user_type_cat column, and confirm the conversion with an assert statement. - Print the new summary statistics of user_type_cat.
# Print the information of ride_sharing
print(ride_sharing.info())
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 duration 25760 non-null object
1 station_A_id 25760 non-null int64
2 station_A_name 25760 non-null object
3 station_B_id 25760 non-null int64
4 station_B_name 25760 non-null object
5 bike_id 25760 non-null int64
6 user_type 25760 non-null int64
7 user_birth_year 25760 non-null int64
8 user_gender 25760 non-null object
dtypes: int64(5), object(4)
memory usage: 2.0+ MB
None
count 25760.000000
mean 2.008385
std 0.704541
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 3.000000
Name: user_type, dtype: float64
# Print the information of ride_sharing
print(ride_sharing.info())
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 25760 entries, 0 to 25759
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 duration 25760 non-null object
1 station_A_id 25760 non-null int64
2 station_A_name 25760 non-null object
3 station_B_id 25760 non-null int64
4 station_B_name 25760 non-null object
5 bike_id 25760 non-null int64
6 user_type 25760 non-null int64
7 user_birth_year 25760 non-null int64
8 user_gender 25760 non-null object
dtypes: int64(5), object(4)
memory usage: 2.0+ MB
None
count 25760.000000
mean 2.008385
std 0.704541
min 1.000000
25% 2.000000
50% 2.000000
75% 3.000000
max 3.000000
Name: user_type, dtype: float64
count 25760
unique 3
top 2
freq 12972
Name: user_type_cat, dtype: int64
Summing strings and concatenating numbers
In the previous exercise, you were able to identify that category is the correct data type for user_type and convert it in order to extract relevant statistical summaries that shed light on the distribution of user_type.
Another common data type problem is importing what should be numerical values as strings, as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs.
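For example, summing a column of numbers stored as strings concatenates them instead of adding them (a minimal illustration, not part of the exercise):

# Strings concatenate, numbers add
print(pd.Series(['2', '3']).sum())   # '23' - concatenation
print(pd.Series([2, 3]).sum())       # 5 - numeric sum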
In this exercise, you’ll be converting the string column duration to the type int. Before that, however, you will need to strip “minutes” from the column so that pandas reads it as numerical. The pandas package has been imported as pd. ### Instructions - Use the .strip() method to strip duration of “minutes” and store it in the duration_trim column. - Convert duration_trim to int and store it in the duration_time column. - Write an assert statement that checks if duration_time’s data type is now an int. - Print the average ride duration.
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip("minutes")

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype("int")
# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'
# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
duration duration_trim duration_time
0 12 minutes 12 12
1 24 minutes 24 24
2 8 minutes 8 8
3 4 minutes 4 4
4 11 minutes 11 11
... ... ... ...
25755 11 minutes 11 11
25756 10 minutes 10 10
25757 14 minutes 14 14
25758 14 minutes 14 14
25759 29 minutes 29 29
[25760 rows x 3 columns]
11.389052795031056
Tire size constraints
In this lesson, you’re going to build on top of the work you’ve been doing with the ride_sharing DataFrame. You’ll be working with the tire_sizes column which contains data on each bike’s tire size.
Bicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″.
In this exercise, you will make sure the tire_sizes column has the correct range by first converting it to an integer, then setting and testing the new upper limit of 27″ for tire sizes. ### Instructions - Convert the tire_sizes column from category to ‘int’. - Use .loc[] to set all values of tire_sizes above 27 to 27. - Reconvert tire_sizes from int back to ‘category’. - Print the description of the tire_sizes column.
ride_sharing.columns
Index(['duration', 'station_A_id', 'station_A_name', 'station_B_id',
'station_B_name', 'bike_id', 'user_type', 'user_birth_year',
'user_gender', 'user_type_cat', 'duration_trim', 'duration_time'],
dtype='object')
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
# Print tire size description
print(ride_sharing['tire_sizes'].describe())
Back to the future
The data pipeline feeding into the ride_sharing DataFrame has been updated to register each ride’s date. This information is stored in the ride_date column as type object, which represents strings in pandas.
A bug was discovered which was relaying rides taken today as taken next year. To fix this, you will find all instances of the ride_date column that occur anytime in the future, and set the maximum possible value of this column to today’s date. Before doing so, you would need to convert ride_date to a datetime object.
The datetime package has been imported as dt, alongside all the packages you’ve been using till now. ### Instructions - Convert ride_date to a datetime object using to_datetime(), then convert the datetime object into a date and store it in ride_dt column. - Create the variable today, which stores today’s date by using the dt.date.today() function. - For all instances of ride_dt in the future, set them to today’s date. - Print the maximum date in the ride_dt column.
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
Finding duplicates
A new update to the data pipeline feeding into ride_sharing has added the ride_id column, which represents a unique identifier for each ride.
The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the ride_sharing DataFrame.
In this exercise, you will confirm this suspicion by finding those duplicates. A sample of ride_sharing is in your environment, as well as all the packages you’ve been working with thus far. ### Instructions - Find duplicated rows of ride_id in the ride_sharing DataFrame while setting keep to False. - Subset ride_sharing on duplicates and sort by ride_id and assign the results to duplicated_rides. - Print the ride_id, duration and user_birth_year columns of duplicated_rides in that order.
# Find duplicates
duplicates = ride_sharing.duplicated(subset=['ride_id'], keep=False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values(by='ride_id')
# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id', 'duration' , 'user_birth_year']])
Treating duplicates
In the last exercise, you were able to verify that the new update feeding into ride_sharing contains a bug generating both complete and incomplete duplicated rows for some values of the ride_id column, with occasional discrepant values for the user_birth_year and duration columns.
In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average duration, and the minimum user_birth_year for each set of incomplete duplicate rows. ### Instructions - Drop complete duplicates in ride_sharing and store the results in ride_dup. - Create the statistics dictionary which holds minimum aggregation for user_birth_year and mean aggregation for duration. - Drop incomplete duplicates by grouping by ride_id and applying the aggregation in statistics. - Find duplicates again and run the assert statement to verify de-duplication.
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
Chapter 2: Text and categorical data problems
Categorical and text data can often be some of the messiest parts of a dataset due to their unstructured nature. In this chapter, you’ll learn how to fix whitespace and capitalization inconsistencies in category labels, collapse multiple categories into one, and reformat strings for consistency.
Finding consistency
In this exercise and throughout this chapter, you’ll be working with the airlines DataFrame which contains survey responses on the San Francisco Airport from airline customers.
The DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction. Another DataFrame named categories was created, containing all correct possible values for the survey columns.
In this exercise, you will use both of these DataFrames to find survey answers with inconsistent values, and drop them, effectively performing an anti join and an inner join on these DataFrames, as seen in the video exercise. The pandas package has been imported as pd, and the airlines and categories DataFrames are in your environment. ### Instructions - Print the categories DataFrame and take a close look at all possible correct categories of the survey columns. - Print the unique values of the survey columns in airlines using the .unique() method. - Create a set out of the cleanliness column in airlines using set() and find the inconsistent category by taking the difference with the cleanliness column of categories. - Find rows of airlines with a cleanliness value not in categories and print the output.
categories = pd.DataFrame({'cleanliness': ["Clean","Average","Somewhat clean","Somewhat dirty","Dirty"],
                           'safety': ["Neutral","Very safe","Somewhat safe","Very unsafe","Somewhat unsafe"],
                           'satisfaction': ["Very satisfied","Neutral","Somewhat satisfied","Somewhat unsatisfied","Very unsatisfied"]})
print(categories)
cleanliness safety satisfaction
0 Clean Neutral Very satisfied
1 Average Very safe Neutral
2 Somewhat clean Somewhat safe Somewhat satisfied
3 Somewhat dirty Very unsafe Somewhat unsatisfied
4 Dirty Somewhat unsafe Very unsatisfied
# Print categories DataFrame
print(categories)
# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")
cleanliness safety satisfaction
0 Clean Neutral Very satisfied
1 Average Very safe Neutral
2 Somewhat clean Somewhat safe Somewhat satisfied
3 Somewhat dirty Very unsafe Somewhat unsatisfied
4 Dirty Somewhat unsafe Very unsatisfied
Cleanliness: ['Clean' 'Average' 'Somewhat clean' 'Somewhat dirty' 'Dirty']
Safety: ['Neutral' 'Very safe' 'Somewhat safe' 'Very unsafe' 'Somewhat unsafe']
Satisfaction: ['Very satisfied' 'Neutral' 'Somewhat satsified' 'Somewhat unsatisfied'
'Very unsatisfied']
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
Empty DataFrame
Columns: [id, day, airline, destination, dest_region, dest_size, boarding_area, dept_time, wait_min, cleanliness, safety, satisfaction]
Index: []
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
Empty DataFrame
Columns: [id, day, airline, destination, dest_region, dest_size, boarding_area, dept_time, wait_min, cleanliness, safety, satisfaction]
Index: []
id day ... safety satisfaction
0 1351 Tuesday ... Neutral Very satisfied
1 373 Friday ... Very safe Very satisfied
2 2820 Thursday ... Somewhat safe Neutral
3 1157 Tuesday ... Very safe Somewhat satsified
4 2992 Wednesday ... Very safe Somewhat satsified
... ... ... ... ... ...
2804 1475 Tuesday ... Neutral Somewhat satsified
2805 2222 Thursday ... Very safe Very satisfied
2806 2684 Friday ... Very safe Very satisfied
2807 2549 Tuesday ... Somewhat safe Very satisfied
2808 2162 Saturday ... Very safe Somewhat satsified
[2477 rows x 12 columns]
Inconsistent categories
In this exercise, you’ll be revisiting the airlines DataFrame from the previous lesson.
As a reminder, the DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction on the San Francisco Airport.
In this exercise, you will examine two categorical columns from this DataFrame, dest_region and dest_size respectively, assess how to address them, and make sure that they are cleaned and ready for analysis. The pandas package has been imported as pd, and the airlines DataFrame is in your environment. ### Instructions - Print the unique values in dest_region and dest_size respectively. - Change the capitalization of all values of dest_region to lowercase. - Replace ‘eur’ with ‘europe’ in dest_region using the .replace() method. - Strip white spaces from the dest_size column using the .strip() method. - Verify that the changes have taken effect by printing the unique values of the columns using .unique().
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
['Asia' 'Canada/Mexico' 'West US' 'East US' 'Midwest US' 'EAST US'
'Middle East' 'Europe' 'eur' 'Central/South America'
'Australia/New Zealand' 'middle east']
['Hub' 'Small' ' Hub' 'Medium' 'Large' 'Hub ' ' Small'
'Medium ' ' Medium' 'Small ' ' Large' 'Large ']
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()
# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
'europe' 'central/south america' 'australia/new zealand']
['Hub' 'Small' ' Hub' 'Medium' 'Large' 'Hub ' ' Small'
'Medium ' ' Medium' 'Small ' ' Large' 'Large ']
['asia' 'canada/mexico' 'west us' 'east us' 'midwest us' 'middle east'
'europe' 'central/south america' 'australia/new zealand']
['Hub' 'Small' 'Medium' 'Large']
Remapping categories
To better understand survey respondents from airlines, you want to find out if there is a relationship between certain responses and the day of the week and wait time at the gate.
The airlines DataFrame contains the day and wait_min columns, which are categorical and numerical respectively. The day column contains the exact day a flight took place, and wait_min contains the amount of minutes it took travelers to wait at the gate. To make your analysis easier, you want to create two new categorical variables:
wait_type: 'short' for 0-60 min, 'medium' for 60-180 min, and 'long' for 180+ min
day_week: 'weekday' if day is in the weekday, 'weekend' if day is in the weekend.
The pandas and numpy packages have been imported as pd and np. Let’s create some new categorical data! ### Instructions - Create the ranges and labels for the wait_type column mentioned in the description. - Create the wait_type column from wait_min by using pd.cut(), passing label_ranges and label_names to the correct arguments. - Create the mapping dictionary mapping weekdays to ‘weekday’ and weekend days to ‘weekend’. - Create the day_week column by using .replace().
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges,
                               labels = label_names)

# Create mappings and replace
mappings = {'Monday': 'weekday', 'Tuesday': 'weekday', 'Wednesday': 'weekday',
            'Thursday': 'weekday', 'Friday': 'weekday',
            'Saturday': 'weekend', 'Sunday': 'weekend'}
airlines['day_week'] = airlines['day'].replace(mappings)
Removing titles and taking names
While collecting survey respondent metadata in the airlines DataFrame, the full name of respondents was saved in the full_name column. However upon closer inspection, you found that a lot of the different names are prefixed by honorifics such as “Dr.”, “Mr.”, “Ms.” and “Miss”.
Your ultimate objective is to create two new columns named first_name and last_name, containing the first and last names of respondents respectively. Before doing so however, you need to remove honorifics.
The airlines DataFrame is in your environment, alongside pandas as pd. ### Instructions - Remove “Dr.”, “Mr.”, “Miss” and “Ms.” from full_name by replacing them with an empty string “” in that order. - Run the assert statement using .str.contains() that tests whether full_name still contains any of the honorifics.
# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.", "")

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.", "")

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss", "")

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.", "")
# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False
Keeping it descriptive
To further understand travelers’ experiences in the San Francisco Airport, the quality assurance department sent out a qualitative questionnaire to all travelers who gave the airport the worst score on all possible categories. The objective behind this questionnaire is to identify common patterns in what travelers are saying about the airport.
Their responses are stored in the survey_response column. Upon a closer look, you realized a few of the answers are extremely short and offer little substance. In this exercise, you will isolate the responses with a character count higher than 40, and use an assert statement to make sure your new DataFrame only contains responses longer than 40 characters.
The airlines DataFrame is in your environment, and pandas is imported as pd. ### Instructions - Using the airlines DataFrame, store the length of each instance in the survey_response column in resp_length by using .str.len(). - Isolate the rows of airlines with resp_length higher than 40. - Assert that the smallest survey_response length in airlines_survey is now bigger than 40.
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]
# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40
# Print new survey_response column
print(airlines_survey['survey_response'])
Chapter 3: Advanced data problems
In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.
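The kilograms-versus-pounds example above is not one of the banking exercises, but the pattern mirrors the currency fix in the next exercise; here is a hedged sketch with a hypothetical patients DataFrame and unit column:

# Hypothetical data: weights recorded in mixed units
patients = pd.DataFrame({'weight': [70.0, 154.0, 82.5],
                         'unit': ['kg', 'lb', 'kg']})

# Convert pound rows to kilograms, then unify the unit label
is_lb = patients['unit'] == 'lb'
patients.loc[is_lb, 'weight'] = patients.loc[is_lb, 'weight'] * 0.453592
patients.loc[is_lb, 'unit'] = 'kg'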
Uniform currencies
In this exercise and throughout this chapter, you will be working with a retail banking dataset stored in the banking DataFrame. The dataset contains data on the amount of money stored in accounts (acct_amount), their currency (acct_cur), amount invested (inv_amount), account opening date (account_opened), and last transaction date (last_transaction) that were consolidated from American and European branches.
You are tasked with understanding the average account size and how investments vary by the size of account, however in order to produce this analysis accurately, you first need to unify the currency amount into dollars. The pandas package has been imported as pd, and the banking DataFrame is in your environment. ### Instructions - Find the rows of acct_cur in banking that are equal to ‘euro’ and store them in the variable acct_eu. - Find all the rows of acct_amount in banking that fit the acct_eu condition, and convert them to USD by multiplying them with 1.1. - Find all the rows of acct_cur in banking that fit the acct_eu condition, set them to ‘dollar’.
banking.head(2)

| | cust_id | birth_date | Age | acct_amount | inv_amount | fund_A | fund_B | fund_C | fund_D | account_opened | last_transaction |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 870A9281 | 1962-06-09 | 58 | 63523.31 | 51295 | 30105.0 | 4138.0 | 1420.0 | 15632.0 | 02-09-18 | 22-02-19 |
| 1 | 166B05B0 | 1962-12-16 | 58 | 38175.46 | 15050 | 4995.0 | 938.0 | 6696.0 | 2421.0 | 28-02-19 | 31-10-18 |
# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'
# Assert that only dollar currency remains
assert banking['acct_cur'].unique() == 'dollar'
Uniform dates
After having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The account_opened column represents when customers opened their accounts and is a good proxy for segmenting customer activity and investment over time.
However, since this data was consolidated from multiple sources, you need to make sure that all dates are of the same format. You will do so by converting this column into a datetime object, while making sure that the format is inferred and potentially incorrect formats are set to missing. The banking DataFrame is in your environment and pandas was imported as pd. ### Instructions - Print the header of account_opened from the banking DataFrame and take a look at the different results. - Convert the account_opened column to datetime, while making sure the date format is inferred and that erroneous formats that would raise an error return a missing value instead. - Extract the year from the amended account_opened column, store it in the acct_year column, and print it.
# Print the header of account_opened
print(banking['account_opened'].head())
# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce')
# Note: infer_datetime_format is deprecated in pandas >= 2.0, where format inference is the default

# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
# Print acct_year
print(banking['acct_year'])
How’s our data integrity?
New data has been merged into the banking DataFrame that contains details on how investments in the inv_amount column are allocated across four different funds A, B, C and D.
Furthermore, the age and birthdays of customers are now stored in the age and birth_date columns respectively.
You want to understand how customers of different age groups invest. However, you want to first make sure the data you’re analyzing is correct. You will do so by cross-field checking values of inv_amount and age against the amount invested in different funds and customers’ birthdays. Both pandas and datetime have been imported as pd and dt respectively. ### Instructions - Find the rows where the row-wise sum of the fund_columns in banking is equal to the inv_amount column. - Store the values of banking with consistent inv_amount in consistent_inv, and those with inconsistent ones in inconsistent_inv. - Store today’s date into today, and manually calculate customers’ ages and store them in ages_manual. - Find all rows of banking where the age column is equal to ages_manual and then filter banking into consistent_ages and inconsistent_ages.
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis=1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]
# Print number of inconsistent investments
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = ages_manual == banking['age']

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Print number of inconsistent ages
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
Missing investors
Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.
You just received a new version of the banking DataFrame containing data on the amount held and invested for new and existing customers. However, there are rows with missing inv_amount values.
You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness. The pandas, missingno and matplotlib.pyplot packages have been imported as pd, msno and plt respectively. The banking DataFrame is in your environment. ### Instructions - Print the number of missing values by column in the banking DataFrame. - Plot and show the missingness matrix of banking with the msno.matrix() function. - Isolate the rows of banking with missing inv_amount values into missing_investors and those with non-missing inv_amount values into investors. - Sort the banking DataFrame by the age column and plot the missingness matrix of banking_sorted.
import missingno as msno
import matplotlib.pyplot as plt
# Print number of missing values in banking
print(banking.isna().sum())
# Visualize missingness matrix
msno.matrix(banking)
plt.show()
# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

# Sort banking by age and visualize
banking_sorted = banking.sort_values(by='age')

msno.matrix(banking_sorted)
plt.show()
Follow the money
In this exercise, you’re working with another version of the banking DataFrame that contains missing values for both the cust_id column and the acct_amount column.
You want to produce analysis on how many unique customers the bank has, the average amount held by customers and more. You know that rows with missing cust_id don’t really help you, and that on average acct_amount is usually 5 times the amount of inv_amount.
In this exercise, you will drop rows of banking with missing cust_ids, and impute missing values of acct_amount with some domain knowledge. ### Instructions - Use .dropna() to drop missing values of the cust_id column in banking and store the results in banking_fullid. - Use inv_amount to compute the estimated account amounts for banking_fullid by setting the amounts equal to inv_amount * 5, and assign the results to acct_imp. - Impute the missing values of acct_amount in banking_fullid with the newly created acct_imp using .fillna().
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount': acct_imp})
# Print number of missing values
print(banking_imputed.isna().sum())
Chapter 4: Record linkage
Record linkage is a powerful technique for merging multiple datasets when values have typos or different spellings. In this chapter, you’ll learn how to link records by calculating the similarity between strings, and you’ll then use your new skills to join two restaurant review datasets into one clean master dataset.
The cutoff point
In this exercise, and throughout this chapter, you’ll be working with the restaurants DataFrame which has data on various restaurants. Your ultimate goal is to create a restaurant recommendation engine, but you need to first clean your data.
This version of restaurants has been collected from many sources, where the cuisine_type column is riddled with typos, and should contain only italian, american and asian cuisine types. There are so many unique categories that remapping them manually isn’t scalable, and it’s best to use string similarity instead.
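To build intuition for what a similarity score means before picking a cutoff, you can compare a couple of strings directly; this is a minimal illustration using thefuzz's fuzz module, separate from the exercise:

from thefuzz import fuzz

# WRatio returns a similarity score between 0 and 100
print(fuzz.WRatio('italian', 'italiann'))   # high score - likely a typo of 'italian'
print(fuzz.WRatio('italian', 'asian'))      # noticeably lower score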
Before doing so, you want to establish the cutoff point for the similarity score using thefuzz’s process.extract() function by finding the similarity score of the most distant typo of each category. ### Instructions - Import process from thefuzz. - Store the unique cuisine_types into unique_types. - Calculate the similarity of ‘asian’, ‘american’, and ‘italian’ to all possible cuisine_types using process.extract(), while returning all possible matches.
# Import process from thefuzz
from thefuzz import process
# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()
# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit=len(unique_types)))
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract("italian",unique_types,limit=len(unique_types)))
Remapping categories II
In the last exercise, you determined that the distance cutoff point for remapping typos of ‘american’, ‘asian’, and ‘italian’ cuisine types stored in the cuisine_type column should be 80.
In this exercise, you’re going to put it all together by finding matches with similarity scores equal to or higher than 80 by using thefuzz’s process.extract() function, for each correct cuisine type, and replacing these matches with it. Remember, when comparing a string with an array of strings using process.extract(), the output is a list of tuples where each is formatted like:
(closest match, similarity score, index of match)
The restaurants DataFrame is in your environment, and you have access to a categories list containing the correct cuisine types (‘italian’, ‘asian’, and ‘american’). ### Instructions - Return all of the unique values in the cuisine_type column of restaurants.
# Inspect the unique values of the cuisine_type column
print(restaurants.cuisine_type.unique())
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants.cuisine_type, limit = len(restaurants.cuisine_type))
# Inspect the first 5 matches
print(matches[0:5])
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Iterate through the list of matches to italian
for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
        # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
        restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
# Iterate through categories
for cuisine in categories:
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
# Inspect the final result
print(restaurants['cuisine_type'].unique())
Pairs of restaurants
In the last lesson, you cleaned the restaurants dataset to make it ready for building a restaurant recommendation engine. You have a new DataFrame named restaurants_new, scraped from a new data source, with new restaurants to train your model on.
You’ve already cleaned the cuisine_type and city columns using the techniques learned throughout the course. However, you saw duplicates with typos in restaurant names that require record linkage with restaurants instead of a regular join.
In this exercise, you will perform the first step in record linkage and generate possible pairs of rows between restaurants and restaurants_new. Both DataFrames, pandas and recordlinkage are in your environment. ### Instructions - Instantiate an indexing object by using the Index() function from recordlinkage. - Block your pairing on cuisine_type by using the indexer’s .block() method. - Generate pairs by indexing restaurants and restaurants_new in that order.
# Create an indexer object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)
Similar restaurants
In the last exercise, you generated pairs between restaurants and restaurants_new in an effort to cleanly merge both DataFrames using record linkage.
When performing record linkage, there are different types of matching you can perform between different columns of your DataFrames, including exact matches, string similarities, and more.
Now that your pairs have been generated and stored in pairs, you will find exact matches in the city and cuisine_type columns between each pair, and similar strings for each pair in the rest_name column. Both DataFrames, pandas and recordlinkage are in your environment. ### Instructions - Instantiate a comparison object using the recordlinkage.Compare() function. - Use the appropriate comp_cl method to find exact matches between the city and cuisine_type columns of both DataFrames. - Use the appropriate comp_cl method to find similar strings with a 0.8 similarity threshold in the rest_name column of both DataFrames. - Compute the comparison of the pairs by using the .compute() method of comp_cl.
# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city and cuisine_type
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8)

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)
Linking them together!
In the last lesson, you finished the bulk of the work on your effort to link restaurants and restaurants_new. You generated the different pairs of potentially matching rows, searched for exact matches between the cuisine_type and city columns, and compared similar strings in the rest_name column. You stored the DataFrame containing the scores in potential_matches.
Now it’s finally time to link both DataFrames. You will do so by first extracting from potential_matches all row indices of restaurants_new that match across the columns mentioned above. Then you will subset restaurants_new to exclude these indices and append the remaining non-duplicate rows to restaurants. All DataFrames are in your environment, alongside pandas imported as pd. ### Instructions - Isolate instances of potential_matches where the row sum is greater than or equal to 3 by using the .sum() method. - Extract the second column index from matches, which represents row indices of matching records from restaurants_new, by using the .get_level_values() method. - Subset restaurants_new for rows that are not in matching_indices. - Append non_dup to restaurants.
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis=1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants (note: DataFrame.append was removed in pandas 2.0;
# on newer versions use pd.concat([restaurants, non_dup]) instead)
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)