Booking cancellations have a substantial impact on demand management decisions in the hospitality industry.
Every year, more than 140 million bookings were made on the internet and many hotel bookings were booked through top-visited travel websites like Booking.com, Expedia.com, Hotels.com, etc.
When Analyzing the past 5 years data, D-Edge Hospitality solutions has found that the global average cancellation rate on bookings has reached almost 40% and this trend produces a very negative impact on hotel revenue.
To overcome the problems caused by booking cancellations, hotels implement rigid cancellation policies, inventory management, and overbooking strategies, which can also have a negative influence on revenue and reputation.
Once the reservation has been canceled, there is almost nothing to be done and it creates discomfort for many Hotels and Hotel Technology companies. Therefore, predicting reservations which might get canceled and preventing these cancellations will create a surplus revenue for both Hotels and Hotel Technology companies.
Have you ever wondered what if there was a way we could predict which guests are likely to Cancel the Hotel Booking? That would be great right?
Luckily, by using Machine learning with Python, we can predict the guests who are likely to cancel the reservation at the Hotels and this will create a surplus revenue, better forecasts and reduce uncertainty in business management decisions.
If you want to follow a process while working on a machine learning project, this blog is for you. In this blog, I will walk you through the entire process of solving a real-world Machine learning project right from understanding the Business problem to the Model deployment on the cloud.
The Goal of this project is to Predict the Guests who are likely to Cancel the Hotel Booking using Machine Learning with Python. Therefore, predicting reservations which might get canceled and preventing these cancellations will create a surplus revenue, better forecasts and reduce uncertainty in business management decisions.
This dataset contains booking information for a City hotel and a Resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces etc.
I have collected the dataset from the kaggle. Dataset is available at Kaggle Dataset
In this step, we will apply Exploratory Data Analysis (EDA) to extract insights from the data set to know which features have contributed more in predicting Cancellations by performing Data Analysis using Pandas and Data visualization using Matplotlib & Seaborn. It is always a good practice to understand the data first and try to gather as many insights from it.
Below are tasks to be performed:
MongoDB Atlas is a fully-managed Database-as-a-Service (DBaaS) cloud database that handles all the complexity of deploying, managing, and healing your deployments on the cloud service provider of your choice (AWS , Azure, and GCP). MongoDB Atlas is the best way to deploy, run, and scale MongoDB in the cloud.
In this project, we will store the dataset into MongoDB Atlas cloud database by following industry best practices which helps in managing the data in scalable, robust and secure manner.
1. Create an Atlas Account using Gmail Id
2. Create a Free Cluster in MongoDB Atlas Cloud
3. Add Your Connection IP Address to Your IP Access List
4. Create a Database and User for Your Cluster with username and password to make it secure
5. Connect to Your Cluster
6. Insert and View Data in Your Cluster
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(12,6))
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')
<Figure size 864x432 with 0 Axes>
# Install and import Pymongo libraries to connect with MongoDB Atlas Cloud
!pip install pymongo
!pip install 'pymongo[srv]'
!pip install dnspython
from pymongo import MongoClient
Requirement already satisfied: pymongo in /usr/local/lib/python3.7/dist-packages (3.12.0)
Requirement already satisfied: pymongo[srv] in /usr/local/lib/python3.7/dist-packages (3.12.0)
Requirement already satisfied: dnspython<2.0.0,>=1.16.0 in /usr/local/lib/python3.7/dist-packages (from pymongo[srv]) (1.16.0)
Requirement already satisfied: dnspython in /usr/local/lib/python3.7/dist-packages (1.16.0)
# Establish a connection to a MongoDB Atlas Cluster with Secured Authentication using User Name and Password of the Database
client = MongoClient("mongodb+srv://production:prod_db@cluster0.7u4hj.mongodb.net/myFirstDatabase?retryWrites=true&w=majority")
# Create Database and specify name of database
db = client.get_database('hotel_bookings_db')
# Create a collection
records = db.hotel_records
# Create Dataframe and Read the dataset using Pandas
df = pd.read_csv('/content/bookings.csv')
df.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
# Convert Dataframe into Dictionary as MongoDB stores data in records/documents
data = df.to_dict(orient = 'records')
# Insert records in the dataset into MongoDB collection "hotel_records"
db.hotel_records.insert_many(data)
print("All the Data has been Exported to MongoDB Successfully")
All the Data has been Exported to MongoDB Successfully
#Load all records from MongoDB using find()
all_records = records.find()
print(all_records)
<pymongo.cursor.Cursor object at 0x7fe1dcd4ec50>
#Convert Cursor Object into list
list_cursor = list(all_records)
#Convert list into Dataframe
df1 = pd.DataFrame(list_cursor)
df1.head()
| _id | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 617745f32cb05b4871a1523c | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 3 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | 617745f32cb05b4871a1523d | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | BB | PRT | Direct | Direct | 0 | 0 | 0 | C | C | 4 | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | 617745f32cb05b4871a1523e | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Direct | Direct | 0 | 0 | 0 | A | C | 0 | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | 617745f32cb05b4871a1523f | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | BB | GBR | Corporate | Corporate | 0 | 0 | 0 | A | A | 0 | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | 617745f32cb05b4871a15240 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | BB | GBR | Online TA | TA/TO | 0 | 0 | 0 | A | A | 0 | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 _id 119390 non-null object
1 hotel 119390 non-null object
2 is_canceled 119390 non-null int64
3 lead_time 119390 non-null int64
4 arrival_date_year 119390 non-null int64
5 arrival_date_month 119390 non-null object
6 arrival_date_week_number 119390 non-null int64
7 arrival_date_day_of_month 119390 non-null int64
8 stays_in_weekend_nights 119390 non-null int64
9 stays_in_week_nights 119390 non-null int64
10 adults 119390 non-null int64
11 children 119386 non-null float64
12 babies 119390 non-null int64
13 meal 119390 non-null object
14 country 118902 non-null object
15 market_segment 119390 non-null object
16 distribution_channel 119390 non-null object
17 is_repeated_guest 119390 non-null int64
18 previous_cancellations 119390 non-null int64
19 previous_bookings_not_canceled 119390 non-null int64
20 reserved_room_type 119390 non-null object
21 assigned_room_type 119390 non-null object
22 booking_changes 119390 non-null int64
23 deposit_type 119390 non-null object
24 agent 103050 non-null float64
25 company 6797 non-null float64
26 days_in_waiting_list 119390 non-null int64
27 customer_type 119390 non-null object
28 adr 119390 non-null float64
29 required_car_parking_spaces 119390 non-null int64
30 total_of_special_requests 119390 non-null int64
31 reservation_status 119390 non-null object
32 reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(13)
memory usage: 30.1+ MB
1. Top 10 countries of origin of Hotel visitors (Guests)
df1['country'].value_counts(normalize = True)[:10]
PRT 0.408656
GBR 0.102008
FRA 0.087593
ESP 0.072059
DEU 0.061286
ITA 0.031673
IRL 0.028385
BEL 0.019697
BRA 0.018704
NLD 0.017695
Name: country, dtype: float64
plt.figure(figsize=(12,6))
sns.countplot(x='country', data=df1,order=pd.value_counts(df1['country']).iloc[:10].index,palette= 'colorblind')
plt.title('Top 10 Countries of Origin of the Guests', weight='bold')
plt.xlabel('Country')
plt.ylabel('Reservation Count')
Text(0, 0.5, 'Reservation Count')
2. Which Month is the Most Occupied (Busiest) with Bookings at the Hotel
df1['arrival_date_month'].value_counts(normalize = True)
August 0.116233
July 0.106047
May 0.098760
October 0.093475
April 0.092880
June 0.091624
September 0.088014
March 0.082034
February 0.067577
November 0.056906
December 0.056789
January 0.049661
Name: arrival_date_month, dtype: float64
ordered_months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
df1['arrival_date_month'] = pd.Categorical(df1['arrival_date_month'], categories=ordered_months, ordered=True)
plt.figure(figsize=(12,6))
sns.countplot(x='arrival_date_month', data = df1,palette= 'viridis')
plt.title('Most Occupied (Busiest) Month with Bookings', weight='bold')
plt.xlabel('Month')
plt.xticks(rotation=45)
plt.ylabel('Reservation Count')
Text(0, 0.5, 'Reservation Count')
3. How many Bookings were Cancelled at the Hotel
df1['is_canceled'].value_counts(normalize = True)
0 0.629584
1 0.370416
Name: is_canceled, dtype: float64
proportion = df1['is_canceled'].value_counts()
labels = ['Not Cancelled','Cancelled']
plt.figure(figsize=(12,6))
plt.title('Proportion of Cancelled & Not Cancelled Bookings',weight = 'bold')
plt.pie(proportion,labels=labels,shadow = True, autopct = '%1.1f%%',wedgeprops= {'edgecolor':'black'},textprops={'fontsize': 14})
plt.show()
4. Which Month has Highest Number of Cancellations By Hotel Type
ordered_months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
df1['arrival_date_month'] = pd.Categorical(df1['arrival_date_month'], categories=ordered_months, ordered=True)
plt.figure(figsize=(12,6))
sns.barplot(x = "arrival_date_month", y = "is_canceled", hue="hotel",hue_order = ["City Hotel", "Resort Hotel"],data=df1,palette= 'Blues')
plt.title("Cancelations Per Month", weight = 'bold')
plt.xlabel("Arrival Month")
plt.xticks(rotation=45)
plt.ylabel("Cancelations %")
plt.legend(loc="upper right")
plt.show()
5. How many Bookings were Cancelled by Hotel Type
df1.groupby('is_canceled')['hotel'].value_counts(normalize = True)
is_canceled hotel
0 City Hotel 0.615012
Resort Hotel 0.384988
1 City Hotel 0.748508
Resort Hotel 0.251492
Name: hotel, dtype: float64
plt.figure(figsize=(12,6))
sns.countplot(x= 'is_canceled',data = df1,hue = 'hotel',palette= 'Set1')
plt.title("Cancelation Status By Hotel Type ", weight = 'bold')
plt.xlabel("Cancelation Status")
plt.ylabel(" Reservations Count")
plt.show()
6. Relationship between Average Daily Rate(ADR) and Arrival Month by Booking cancellation status
ordered_months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
df1['arrival_date_month'] = pd.Categorical(df1['arrival_date_month'], categories=ordered_months, ordered=True)
plt.figure(figsize=(12,6))
sns.lineplot(x = "arrival_date_month", y = "adr", hue="is_canceled",hue_order= [1,0],data=df1,palette= 'Set1')
plt.title("Relationship between ADR and Arrival Month by Booking cancellation status", weight = 'bold')
plt.xlabel("Arrival Month")
plt.xticks(rotation=45)
plt.ylabel("Average Daily Rate")
plt.legend(loc="upper right")
plt.show()
7. Total Number of Bookings by Market Segment
df1['market_segment'].value_counts(normalize = True)
Online TA 0.473046
Offline TA/TO 0.202856
Groups 0.165935
Direct 0.105587
Corporate 0.044350
Complementary 0.006223
Aviation 0.001985
Undefined 0.000017
Name: market_segment, dtype: float64
plt.figure(figsize=(12,6))
sns.countplot(df1['market_segment'], palette='colorblind',order=pd.value_counts(df1['market_segment']).index)
plt.title('Total Number of Bookings by Market Segment', weight='bold')
plt.xlabel('Market Segment')
plt.ylabel('Reservation Count')
Text(0, 0.5, 'Reservation Count')
8. Total Nights Spent by Guests at the Hotel by Market Segment and Hotel Type
df1['total_stay'] = df1['stays_in_week_nights'] + df1['stays_in_weekend_nights']
plt.figure(figsize=(12,6))
sns.barplot(x = "market_segment", y = "total_stay", data = df1, hue = "hotel", palette = 'rocket')
plt.title('Total Nights Spent by Guests by Market Segment & Hotel Type', weight='bold')
plt.xlabel('Market Segment')
plt.ylabel('Number of Days')
plt.legend(loc = "upper right")
<matplotlib.legend.Legend at 0x7fe1be830d10>
9. Arrival Date Year vs Lead Time By Booking Cancellation Status
plt.figure(figsize=(12,6))
sns.barplot(x='arrival_date_year', y ='lead_time', hue="is_canceled", data=df1, palette="viridis")
plt.title('Arrival Year vs Lead Time By Cancellation Status', weight='bold')
plt.xlabel(' Arrival Year')
plt.ylabel('Lead Time')
plt.legend(loc = "upper right")
<matplotlib.legend.Legend at 0x7fe1bcec7e10>
10. Total Number of bookings by deposit type
df1['deposit_type'].value_counts(normalize = True)
No Deposit 0.876464
Non Refund 0.122179
Refundable 0.001357
Name: deposit_type, dtype: float64
proportion = df1['deposit_type'].value_counts()
labels = ['No Deposit','Non Refundable','Refundable']
plt.figure(figsize=(12,6))
plt.title('Proportion of Total Bookings by Deposit Type',weight = 'bold')
plt.pie(proportion,labels=labels,shadow = True, autopct = '%1.1f%%',wedgeprops= {'edgecolor':'black'},textprops={'fontsize': 14})
plt.show()
After exploring the dataset, we will find a lot of information that will help you prepare the data. Most important steps in Data Preparation are:
num_feature = [feature for feature in df1.columns if df1[feature].dtype != 'object']
print("Number of Numerical Features are : ",len(num_feature))
Number of Numerical Features are : 22
df1[num_feature][:5]
| is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | agent | company | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests | total_stay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 0 | 3 | NaN | NaN | 0 | 0.0 | 0 | 0 | 0 |
| 1 | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 0 | 4 | NaN | NaN | 0 | 0.0 | 0 | 0 | 0 |
| 2 | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | 0 | 75.0 | 0 | 0 | 1 |
| 3 | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 304.0 | NaN | 0 | 75.0 | 0 | 0 | 1 |
| 4 | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 240.0 | NaN | 0 | 98.0 | 0 | 1 | 2 |
cat_feature = [feature for feature in df1.columns if df1[feature].dtype == 'object']
print("Number of Categorical Features are : ",len(cat_feature))
Number of Categorical Features are : 12
df1[cat_feature][:5]
| _id | hotel | meal | country | market_segment | distribution_channel | reserved_room_type | assigned_room_type | deposit_type | customer_type | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 617745f32cb05b4871a1523c | Resort Hotel | BB | PRT | Direct | Direct | C | C | No Deposit | Transient | Check-Out | 2015-07-01 |
| 1 | 617745f32cb05b4871a1523d | Resort Hotel | BB | PRT | Direct | Direct | C | C | No Deposit | Transient | Check-Out | 2015-07-01 |
| 2 | 617745f32cb05b4871a1523e | Resort Hotel | BB | GBR | Direct | Direct | A | C | No Deposit | Transient | Check-Out | 2015-07-02 |
| 3 | 617745f32cb05b4871a1523f | Resort Hotel | BB | GBR | Corporate | Corporate | A | A | No Deposit | Transient | Check-Out | 2015-07-02 |
| 4 | 617745f32cb05b4871a15240 | Resort Hotel | BB | GBR | Online TA | TA/TO | A | A | No Deposit | Transient | Check-Out | 2015-07-03 |
df1.isnull().sum()
_id 0
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
total_stay 0
dtype: int64
# Check % of Missing Values in the Data set
feature_nan = [feature for feature in df1.columns if df1[feature].isnull().sum()>1]
for feature in feature_nan:
print('{} : {}% Missing values'.format(feature,np.around(df1[feature].isnull().mean(),4)))
children : 0.0% Missing values
country : 0.0041% Missing values
agent : 0.1369% Missing values
company : 0.9431% Missing values
#Dropping the "Company" feature
df1.drop(columns = ['company'],inplace= True)
#Filling missing values by 0 for "Agent" feature
df1['agent']= df1['agent'].fillna(0)
#Imputing missing values of "Country" feature
df1['country'].fillna(df1['country'].mode()[0], inplace=True)
#Filling missing values by 0 for "Children" feature
df1['children']= df1['children'].fillna(0)
#Dropping the "_Id" feature
df1.drop(columns = ['_id'],inplace= True)
df2 = df1.copy()
cat_feature = [feature for feature in df2.columns if df2[feature].dtype == 'object']
print("Number of Categorical Features are : ",len(cat_feature))
Number of Categorical Features are : 11
df2[cat_feature][:5]
| hotel | meal | country | market_segment | distribution_channel | reserved_room_type | assigned_room_type | deposit_type | customer_type | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | BB | PRT | Direct | Direct | C | C | No Deposit | Transient | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | BB | PRT | Direct | Direct | C | C | No Deposit | Transient | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | BB | GBR | Direct | Direct | A | C | No Deposit | Transient | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | BB | GBR | Corporate | Corporate | A | A | No Deposit | Transient | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | BB | GBR | Online TA | TA/TO | A | A | No Deposit | Transient | Check-Out | 2015-07-03 |
# Identify unique (Distinct) labels that exists in each categorical features
for feature in cat_feature:
print("{} : {} labels ".format(feature,len(df2[feature].unique())))
hotel : 2 labels
meal : 5 labels
country : 177 labels
market_segment : 8 labels
distribution_channel : 5 labels
reserved_room_type : 10 labels
assigned_room_type : 12 labels
deposit_type : 3 labels
customer_type : 4 labels
reservation_status : 3 labels
reservation_status_date : 926 labels
#Custom mapping
df2['hotel'] = df2['hotel'].map({'Resort Hotel':0, 'City Hotel':1})
df2['arrival_date_month'] = df2['arrival_date_month'].map({'January':1, 'February': 2, 'March':3, 'April':4, 'May':5, 'June':6, 'July':7,
'August':8, 'September':9, 'October':10, 'November':11, 'December':12})
#Replacing "Undefined" with "SC" as mentioned in data set description
df2["meal"].replace("Undefined", "SC", inplace=True)
#Applying label encoding for categorical features
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df2['meal'] = le.fit_transform(df2['meal'])
df2['deposit_type'] = le.fit_transform(df2['deposit_type'])
df2['customer_type'] = le.fit_transform(df2['customer_type'])
df2['market_segment'] = le.fit_transform(df2['market_segment'])
df2['distribution_channel'] = le.fit_transform(df2['distribution_channel'])
df2['reserved_room_type'] = le.fit_transform(df2['reserved_room_type'])
df2['assigned_room_type']= le.fit_transform(df2['assigned_room_type'])
df2['reservation_status'] = le.fit_transform(df2['reservation_status'])
df2['reservation_status_date'] = le.fit_transform(df2['reservation_status_date'])
df2['country'] = le.fit_transform(df2['country'])
# Relationship between Independent and Dependent feature (Correlation Heat map)
df2.corr()["is_canceled"].sort_values()
reservation_status -0.917196
total_of_special_requests -0.234658
required_car_parking_spaces -0.195498
assigned_room_type -0.176028
reservation_status_date -0.162135
booking_changes -0.144381
is_repeated_guest -0.084793
customer_type -0.068140
reserved_room_type -0.061282
previous_bookings_not_canceled -0.057358
agent -0.046529
babies -0.032491
meal -0.015693
arrival_date_day_of_month -0.006130
stays_in_weekend_nights -0.001791
children 0.005036
arrival_date_week_number 0.008148
arrival_date_year 0.016660
total_stay 0.017779
stays_in_week_nights 0.024765
adr 0.047557
days_in_waiting_list 0.054186
market_segment 0.059338
adults 0.060017
previous_cancellations 0.110133
hotel 0.136531
distribution_channel 0.167600
country 0.267502
lead_time 0.293123
deposit_type 0.468634
is_canceled 1.000000
Name: is_canceled, dtype: float64
plt.figure(figsize = (15,10))
sns.heatmap(df2.corr(), cmap="viridis")
<matplotlib.axes._subplots.AxesSubplot at 0x7fe1bcde00d0>
df2.drop(columns = ['reservation_status','arrival_date_week_number','arrival_date_month','arrival_date_year','stays_in_week_nights','stays_in_weekend_nights','reservation_status_date'],inplace = True)
df3 = df2.copy()
df3.head()
| hotel | is_canceled | lead_time | arrival_date_day_of_month | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | total_stay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 342 | 1 | 2 | 0.0 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 3 | 0 | 0.0 | 0 | 2 | 0.0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 737 | 1 | 2 | 0.0 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 4 | 0 | 0.0 | 0 | 2 | 0.0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 7 | 1 | 1 | 0.0 | 0 | 0 | 59 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0.0 | 0 | 2 | 75.0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 13 | 1 | 1 | 0.0 | 0 | 0 | 59 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 304.0 | 0 | 2 | 75.0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 14 | 1 | 2 | 0.0 | 0 | 0 | 59 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 240.0 | 0 | 2 | 98.0 | 0 | 1 | 2 |
df3.shape
(119390, 25)
The next step now is to build a Machine learning model using a Machine learning Algorithm. In this project, we will build Decision Tree Machine Learning Model using Scikit Learn Library.
We need to divide/split the data into training and test sets using train_test_split.
For better training of a machine learning model, it is necessary to divide the training data with more numbers (70% to 80%) of samples and the test set with 20% to 30% of the dataset depending on the size of the data.
Decision tree algorithm is one of the most versatile algorithms in machine learning which can perform both classification and regression analysis. It is very powerful and works great with complex datasets. Apart from that, it is very easy to understand and read.
As the name suggests, this algorithm works by dividing the whole dataset into a tree-like structure based on some rules and conditions and then gives prediction based on those conditions. Let’s understand the approach to decision tree with a basic scenario.
Let’s assume we want to classify whether a customer could retire or not based on their savings and age.
Attribute Selection Measures
Attribute selection measure can be used as a basic set of rules to select the splitting criterion that helps in dividing the dataset in the best possible way. The splitting feature is always a best score attribute. Entropy, Information Gain, Gini Index and Gain Ratio are the most commonly used and popular selection measures.
Entropy
Entropy is the measure of randomness in the data. In other words, it gives the impurity present in the dataset.
When we split our nodes into two regions and put different observations in both the regions, the main goal is to reduce the entropy i.e. reduce the randomness in the region and divide our data cleanly than it was in the previous node. If splitting the node doesn’t lead into entropy reduction, we try to split based on a different condition, or we stop.
A region is clean (low entropy) when it contains data with the same labels and random if there is a mixture of labels present (high entropy).
Let’s suppose there are ‘m’ observations and we need to classify them into categories 1 and 2. Let’s say that category 1 has ‘n’ observations and category 2 has ‘m-n’ observations.
p= n/m and q = m-n/m = 1-p
then, entropy for the given set is:
E = -plog2(p) – qlog2(q)
When all the observations belong to category 1, then p = 1 and all observations belong to category 2, then p =0, int both cases E =0, as there is no randomness in the categories.
If half of the observations are in category 1 and another half in category 2, then p =1/2 and q =1/2, and the entropy is maximum, E =1.
Information Gain
Information gain calculates the decrease in entropy after splitting a node. It is the difference between entropies before and after the split. The more the information gain, the more entropy is removed.
Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.
Mathematically, IG is represented as:
Where, T is the parent node before split and X is the split node from T.
Gini Index
Gini index is a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions and easy to implement whereas information gain favors smaller partitions with distinct values.
Steps to Calculate Gini for a split
X = df3.drop(['is_canceled'], axis = 1)
y = df3['is_canceled']
#Train and test split data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42,stratify = y,test_size = 0.30)
#Checking if train and test data of target feature is equally distributed
y_train.value_counts(normalize=True)
0 0.629581
1 0.370419
Name: is_canceled, dtype: float64
y_test.value_counts(normalize=True)
0 0.629589
1 0.370411
Name: is_canceled, dtype: float64
Steps to be followed to build Machine Learning Model with Scikit-learn
# import the class
from sklearn.tree import DecisionTreeClassifier
# instantiate the estimator(model)
dt_model = DecisionTreeClassifier(random_state = 42)
#fit the model with data
dt_model.fit(X_train,y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
#Predict the Target feature for a new observation
dt_model.predict(X_train)
array([1, 0, 0, ..., 0, 1, 0], dtype=int64)
#Accuracy score on Training Data
dt_model.score(X_train,y_train)
0.996434255082383
The training set accuracy is close to 100%! But we can’t rely solely on the training set accuracy, we must evaluate the model on the test data set too.
We can make predictions and compute accuracy in one step using model.score
#Accuracy score on Testing Data
dt_model.score(X_test,y_test)
0.8441242985174637
from sklearn.tree import plot_tree, export_text
plt.figure(figsize=(80,20))
plot_tree(dt_model, feature_names=X_train.columns, max_depth=2, filled=True)
The Gini value in each box is the loss function used by the decision tree to decide which column should be used for splitting the data, and at what point the column should be split.
Lower Gini index indicates a better split
Based on the Gini Index computations, a decision tree assigns an “importance” value to each feature. These values can be used to interpret the results given by a decision tree.
dt_model.feature_importances_
array([0.00430979, 0.13396326, 0.0714008 , 0.01165362, 0.00543306,
0.00069664, 0.01043204, 0.08195502, 0.06748312, 0.00350234,
0.00182046, 0.02199886, 0.00573351, 0.01277278, 0.01703412,
0.0153714 , 0.23543993, 0.06189516, 0.00334881, 0.02076981,
0.09174907, 0.02378499, 0.05232125, 0.04513017])
#Convert this into a dataframe and visualize the most important features
importance_df = pd.DataFrame({
'feature': X_train.columns,
'importance': dt_model.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(12,6))
sns.barplot(data=importance_df.head(10), x='importance', y='feature',palette= 'viridis')
plt.title('Top 10 Most Important Features', weight='bold')
plt.xlabel('Feature Importance %',weight='bold')
plt.ylabel('Features',weight='bold')
Text(0, 0.5, 'Features')
We can increase the model performance by hyperparameter tuning and finding these optimal hyperparameters would help us achieve the best-performing model.
GridSearchCV uses a different combination of all the specified hyperparameters and their values and calculates the performance for each combination and selects the best value for the hyperparameters. This will cost us the processing time and expense but will surely give us the best results.
Cross Validation is a statistical method used to estimate the performance (or accuracy) of Machine learning Models. It is used to protect against overfitting in a predictive model.
# Hyper Parameter Tuning using Grid SeachCV on Decision Tree Algorithm to check Best score and Best parameters
from sklearn.model_selection import GridSearchCV
param_grid= { 'criterion' : ['gini', 'entropy'],'min_samples_split' : [2,4,6,8],
'min_samples_leaf': [1,2,3,4,5],'max_features' : ['auto', 'sqrt'],'max_depth': [1,2,3,4,5,6,7,8,9,10]}
clf = GridSearchCV(estimator=dt_model, param_grid = param_grid, cv = 10, n_jobs=-1)
best_clf = clf.fit(X_train,y_train)
print('Decision Tree Best score: {} using best parameters {}'.format(best_clf.best_score_, best_clf.best_params_))
Decision Tree Best score: 0.8155264977113239 using best parameters {'criterion': 'gini', 'max_depth': 10, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 6}
# Stratified K-fold Cross Validation Technique on Decision Tree Alogorithm to know the exact Mean CV accuracy score
# Impute the best parameters obtained in Hyper Parameter tuning for Decision Tree Algorithm
from sklearn.model_selection import cross_val_score,StratifiedKFold
skfold = StratifiedKFold(n_splits= 10,shuffle= True,random_state= 42)
dt_cv_result = cross_val_score(DecisionTreeClassifier(criterion = 'gini',max_depth = 10, max_features = 'auto', min_samples_leaf = 1,min_samples_split = 6),X, y, cv=skfold,scoring="accuracy",n_jobs=-1)
dt_cv = dt_cv_result.mean()*100
print('Decision Tree CV Mean Accuarcy Score is {}'.format(dt_cv))
Decision Tree CV Mean Accuarcy Score is 80.52935756763547
Optimizing Decision Tree Model Performance
Impute the best parameters obtained in Hyper Parameter tuning for the newly created Decision Tree model to obtain Best Accuracy Score
# instantiate the estimator(new model)
dt_model_new = DecisionTreeClassifier(random_state = 42,criterion = 'gini',max_depth = 10, max_features = 'auto',
min_samples_leaf = 1,min_samples_split = 6)
#fit the new model with data
dt_model_new.fit(X_train,y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=10, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=6,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
#Accuracy score on Training Data
dt_model_new.score(X_train,y_train)
0.8099146853648905
#Accuracy score on Test Data
dt_model_new.score(X_test,y_test)
0.8047854370829495
# Visualizing Optimized Decision Tree
plt.figure(figsize=(80,20))
plot_tree(dt_model_new, feature_names=X_train.columns, max_depth=2, filled=True)
# Since the size of the pickle file is more, I will compress the pickle file using BZ2 library**
import bz2,pickle
sfile = bz2.BZ2File('dt_model_new_pickle.pkl','wb')
pickle.dump(dt_model_new,sfile)
sfile.close()
df4 = df3.copy()
df4.head()
| hotel | is_canceled | lead_time | arrival_date_day_of_month | adults | children | babies | meal | country | market_segment | distribution_channel | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | reserved_room_type | assigned_room_type | booking_changes | deposit_type | agent | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | total_stay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 342 | 1 | 2 | 0.0 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 3 | 0 | 0.0 | 0 | 2 | 0.0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 737 | 1 | 2 | 0.0 | 0 | 0 | 135 | 3 | 1 | 0 | 0 | 0 | 2 | 2 | 4 | 0 | 0.0 | 0 | 2 | 0.0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 7 | 1 | 1 | 0.0 | 0 | 0 | 59 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0.0 | 0 | 2 | 75.0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 13 | 1 | 1 | 0.0 | 0 | 0 | 59 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 304.0 | 0 | 2 | 75.0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 14 | 1 | 2 | 0.0 | 0 | 0 | 59 | 6 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 240.0 | 0 | 2 | 98.0 | 0 | 1 | 2 |
df4.shape
(119390, 25)
After building a model, the final stage of the machine learning lifecycle is to deploy the ML model.
ML model should be scalable and accessible to the business users/Customers by creating Web Application and deploying the model on the cloud using AWS, Google Cloud, Heroku cloud platforms.
Inorder to build the Web app, I will use the 10 most important features that are helpful in predicting the Hotel Booking Cancelation from the guests since it would be pain to front end user to fill all 23 features on the web app. I will drop the rest of features so we can build an interactive and user friendly web app
X = df4.drop(['is_canceled','hotel','arrival_date_day_of_month','adults','children','babies','meal','country','distribution_channel','is_repeated_guest',
'previous_bookings_not_canceled','reserved_room_type','agent','days_in_waiting_list'],axis = 1)
y = df4['is_canceled']
#Train and test data again to check if there is any drop in accuracy of the model after eliminating features
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42,stratify = y,test_size = 0.30)
from sklearn.tree import DecisionTreeClassifier
dt_model_new = DecisionTreeClassifier(random_state = 42)
dt_model_new.fit(X_train,y_train)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=42, splitter='best')
dt_model_new.score(X_train,y_train)
0.987675445418975
dt_model_new.score(X_test,y_test)
0.8125191947957674
# Compress the Pickle file to store Machine Learning Model for Web App Development
import bz2,pickle
file = bz2.BZ2File('dtmodelfinal.pkl','wb')
pickle.dump(dt_model_new,file)
file.close()
Once the training is completed, we need to expose the trained model as an API for the user to consume it. For prediction, the saved model is loaded first and then the predictions are made using it. If the web app works fine, the same app is deployed to the cloud platform.
ML Application flow chart
Link to the ML Web app :
https://cancelation-predictor.herokuapp.com/
Link for Python Code available on GitHub :
From our Data Analysis, we have observed that the top 5 most important features in the data set which helps in predicting hotel cancellations from the guests are :