Analysis of Taxi Trip Duration in NYC

22 Jan 2024 - Tags: Regression, XGBoost, Decision Tree, and K-means clustering

import os, sys
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
# 0 = all messages are logged (default behavior)
# 1 = INFO messages are not printed
# 2 = INFO and WARNING messages are not printed
# 3 = INFO, WARNING, and ERROR messages are not printed

# Ensure python module are found
sys.path.insert(0, '../src')
from models.train_model import train_xgboost

# Data visualization 
from matplotlib import pyplot as plt
from matplotlib.collections import EllipseCollection
from matplotlib.colors import  Normalize
import seaborn as sns

# Data manipulation 
import pandas as pd
import numpy as np

# Geospatial 
from geopy import distance
import geopandas as gpd

# Machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import  DecisionTreeRegressor
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mean_squared_error, make_scorer
import xgboost as xgb


import requests

# Random state 
random_state = 42
np.random.seed(random_state)

# Palette color for plots
yellow ='#dfdc7bff'
grey = '#50566dff'
beige = '#d9cebfff'
orange = '#e5b97bff'

Outline

1. Data Preprocessing
2. Outlier
3. Feature Engineering
4. Split Data
5. Exploratory Data Analysis (EDA)
6. Model Training and Validation
7. Model Evaluation on Test Set
8. Conclusion
9. Deployment
- 9.1 Testing Local and Web Service

In this project, we will build a machine learning model to predict the trip duration of taxi rides in New York City. The dataset is from Kaggle. The original dataset has approximately 1.4 million entries for the training set and 630k for the test set, although only the training set is utilized in this project. The dataset is then preprocessed, outliers are removed, and new features are created, followed by splitting it into train, test, and validation sets.

The goal is to predict the trip duration of taxi rides in New York City. The evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE). The RMSLE is calculated by taking the log of the predictions and actual values. This metric ensures that errors in predicting short trip durations are less penalized compared to errors in predicting longer ones. For this purpose, we employ three types of machine learning models: mini-batch k-means to create new cluster features for the locations, and the Decision Tree Regressor and XGBoost to predict the trip duration.

Initially, the dataset has just 11 columns, offering great possibilities for developing new features and visualizations. The dataset has the following features:

id - a unique identifier for each trip.
vendor_id - a code indicating the provider associated with the trip record.
pickup_datetime - date and time when the meter was engaged.
dropoff_datetime - date and time when the meter was disengaged.
passenger_count - the number of passengers in the vehicle (driver entered value).
pickup_longitude - the longitude where the meter was engaged.
pickup_latitude - the latitude where the meter was engaged.
dropoff_longitude - the longitude where the meter was disengaged.
dropoff_latitude - the latitude where the meter was disengaged.
store_and_fwd_flag - This flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server Y=store and forward; N=not a store and forward trip.
trip_duration - duration of the trip in seconds.

1. Data Preprocessing

taxi_trip = pd.read_csv('../data/raw/nyc-yellow-taxi-trip-2016.csv',
                parse_dates=['pickup_datetime','dropoff_datetime'])

display(taxi_trip.head(3))

print("Shape:", taxi_trip.shape)

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude
0	id2875421	2	2016-03-14 17:24:55	2016-03-14 17:32:30	1	-73.982155
1	id2377394	1	2016-06-12 00:43:35	2016-06-12 00:54:38	1	-73.980415
2	id3858529	2	2016-01-19 11:35:24	2016-01-19 12:10:48	1	-73.979027

	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
0	40.767937	-73.964630	40.765602	N	455
1	40.738564	-73.999481	40.731152	N	663
2	40.763939	-74.005333	40.710087	N	2124

Shape: (1458644, 11)

Let’s check the dataset information, and see if there is any missing value or duplicated rows:

print(taxi_trip.info())
print("\n Duplicated rows:", taxi_trip.duplicated().sum())
print("\n Missing values:", taxi_trip.isnull().sum().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   id                  1458644 non-null  object        
 1   vendor_id           1458644 non-null  int64         
 2   pickup_datetime     1458644 non-null  datetime64[ns]
 3   dropoff_datetime    1458644 non-null  datetime64[ns]
 4   passenger_count     1458644 non-null  int64         
 5   pickup_longitude    1458644 non-null  float64       
 6   pickup_latitude     1458644 non-null  float64       
 7   dropoff_longitude   1458644 non-null  float64       
 8   dropoff_latitude    1458644 non-null  float64       
 9   store_and_fwd_flag  1458644 non-null  object        
 10  trip_duration       1458644 non-null  int64         
dtypes: datetime64[ns](2), float64(4), int64(3), object(2)
memory usage: 122.4+ MB
None

 Duplicated rows: 0

 Missing values: 0

There are no missing values or duplicated rows. The only feature that we need to change is store_and_fwd_flag, which is a categorical feature ${N, Y}$. We will change it to a boolean mapping Y to True and N to False:

# Map store_and_fwd_flag to True and False 
taxi_trip['store_and_fwd_flag'] = taxi_trip['store_and_fwd_flag'].map({'N': False, 'Y': True})

Let’s verify if the target feature trip_duration is correct by calculating the time difference between dropoff_datetime and pickup_datetime in seconds :

trip_duration = (taxi_trip['dropoff_datetime'] - taxi_trip['pickup_datetime']).dt.seconds

is_diff = (trip_duration != taxi_trip['trip_duration'])
print( 'Number of incorrect values in trip_duration:',is_diff.sum())

display(taxi_trip[is_diff])

Number of incorrect values in trip_duration: 4

	id	vendor_id	pickup_datetime	dropoff_datetime	passenger_count	pickup_longitude
355003	id1864733	1	2016-01-05 00:19:42	2016-01-27 11:08:38	1	-73.789650
680594	id0369307	1	2016-02-13 22:38:00	2016-03-08 15:57:38	2	-73.921677
924150	id1325766	1	2016-01-05 06:14:15	2016-01-31 01:01:07	1	-73.983788
978383	id0053347	1	2016-02-13 22:46:52	2016-03-25 18:18:14	1	-73.783905

	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag	trip_duration
355003	40.643559	-73.956810	40.773087	False	1939736
680594	40.735252	-73.984749	40.759979	False	2049578
924150	40.742325	-73.985489	40.727676	False	2227612
978383	40.648632	-73.978271	40.750202	False	3526282

It appears that there are four extreme values in the target feature trip_duration that don’t make sense. We can correct then by substituting the values of the actual trip duration in seconds that was calculated. Let’s simple substitute for the correct column that we calculate:

taxi_trip['trip_duration'] = trip_duration

# Let's check again
is_diff = (trip_duration != taxi_trip['trip_duration'])
print( 'Number of incorrect values in trip_duration:',is_diff.sum())

Number of incorrect values in trip_duration: 0

2. Outliers

To identify outliers in the dataset, we will employ geospatial visualization for the pickup and dropoff locations and the interquartile range (IQR) with a box plot to detect outliers in the trip_duration. First, let’s examine all features with data types float64 and int64 in the dataset:

taxi_trip_filtered = taxi_trip.select_dtypes(include=['float64', 'int64'])
display(taxi_trip_filtered.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 6 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   vendor_id          1458644 non-null  int64  
 1   passenger_count    1458644 non-null  int64  
 2   pickup_longitude   1458644 non-null  float64
 3   pickup_latitude    1458644 non-null  float64
 4   dropoff_longitude  1458644 non-null  float64
 5   dropoff_latitude   1458644 non-null  float64
dtypes: float64(4), int64(2)
memory usage: 66.8 MB



None

2.1. Location Features

When looking for outlier in the dataset, we should focus on numerical features. To find out if there is any outliers, we can analyze the features pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude

trip_outliers = taxi_trip.copy()

# can be useful to separate the features
pickup_dropoff_features = [ 'pickup_longitude', 'pickup_latitude', 
                            'dropoff_longitude', 'dropoff_latitude' ]

pickup_long = trip_outliers['pickup_longitude']
pickup_lat = trip_outliers['pickup_latitude']
dropoff_long = trip_outliers['dropoff_longitude']
dropoff_lat = trip_outliers['dropoff_latitude']

# Summary statistics
display(    trip_outliers.groupby('passenger_count')[pickup_dropoff_features[:2]]
            .agg(['min', 'mean', 'max']))

display(    trip_outliers.groupby('passenger_count')[pickup_dropoff_features[2:]]
            .agg(['min', 'mean', 'max']))

	pickup_longitude			pickup_latitude
	min	mean	max	min	mean	max
passenger_count
0	-74.042969	-73.957256	-73.776367	40.598701	40.736596	40.871922
1	-79.569733	-73.973551	-65.897385	36.118538	40.751209	51.881084
2	-121.933342	-73.973343	-61.335529	34.359695	40.749811	41.319164
3	-75.455917	-73.974312	-73.722374	39.803932	40.750224	41.189957
4	-75.354332	-73.973836	-73.559799	34.712234	40.749328	40.945747
5	-74.191154	-73.972432	-71.799896	35.081532	40.750801	40.925323
6	-74.145752	-73.973227	-73.653358	40.213837	40.751611	40.896381
7	-74.173668	-73.948100	-73.631149	40.715031	40.740285	40.768551
8	-73.992653	-73.992653	-73.992653	40.768719	40.768719	40.768719
9	-73.710632	-73.710632	-73.710632	40.671581	40.671581	40.671581

	dropoff_longitude			dropoff_latitude
	min	mean	max	min	mean	max
passenger_count
0	-74.188072	-73.963539	-73.776360	40.598701	40.735779	40.872318
1	-80.355431	-73.973371	-65.897385	36.118538	40.751977	43.911762
2	-121.933304	-73.973464	-61.335529	34.359695	40.751070	43.921028
3	-74.550011	-73.974016	-73.055977	40.435646	40.751601	41.387001
4	-74.569748	-73.974422	-73.170937	32.181141	40.750786	41.029831
5	-79.352837	-73.973206	-72.954590	40.436329	40.751590	41.281796
6	-74.220955	-73.973192	-73.524658	40.553467	40.752375	40.982738
7	-74.173660	-73.948097	-73.631149	40.715019	40.740289	40.768551
8	-74.041374	-74.041374	-74.041374	40.729954	40.729954	40.729954
9	-73.710632	-73.710632	-73.710632	40.671581	40.671581	40.671581

We can observe that there are unrealistic values for passenger_count. The expected values for passenger_count are between 1 and 6, but there are atypical counts of 0 and values greater than 6. Let’s remove those outliers:

# Boolean series for passenger count outliers
passenger_outliers = (trip_outliers['passenger_count'] == 0) | (trip_outliers['passenger_count'] > 6)

not_outliers = ~passenger_outliers
trip_outliers = trip_outliers[not_outliers]


print(f'\nOutliers in passenger_count: {passenger_outliers.sum()}')

Outliers in passenger_count: 65

The boxplot for detecting outliers relies on the assumption that the data has a distribution close to normal (Gaussian), which is often not true with geospatial coordinates due to their unique patterns influenced by urban roads and traffic. Therefore, we will use a different approach to identify outliers in the location features.

For pickup and dropoff locations, the strategy is to use geospatial data using shapfile from NYC Borough Boundaries to identify points that fall outside the boundaries of NYC map as outliers.

def plot_points_map(df, gdf_map, geometry_col, ax, title='', within_map=True, 
                    color_map='beige', color_points='orange'):
    """
    Plots a map with points from a DataFrame onto a given Axes object.

    Parameters:
    - df: DataFrame with the coordinates to plot.
    - gdf_map: GeoDataFrame of the map boundaries.
    - geometry_col: GeoDataFrame column with geometry data.
    - ax: Matplotlib Axes object to plot on.
    - title: Title of the plot.
    - within_map: If True, only plots points within the map boundaries.
    - color_map: Color for the map.
    - color_points: Color for the points.
    """
    # convert to the same coordinate reference system
    gdf_points = gpd.GeoDataFrame(df, geometry=geometry_col)
    # convert to the same coordinate reference system
    gdf_points.crs = gdf_map.crs
    
    # If within_map is True, perform spatial join with the map boundaries
    # only points within the boundaries will be plotted
    if within_map:
        gdf_points = gpd.sjoin(gdf_points, gdf_map, how="inner", predicate='within')
        # Update the original DataFrame 
        df = df.loc[gdf_points.index]
        
    # Plot the map and the points
    gdf_map.plot(ax = ax, color=color_map, edgecolor='grey')
    gdf_points.plot(ax =ax, markersize=1, color = color_points)
    ax.set_xlabel('Longitude', fontsize=14)
    ax.set_ylabel('Latitude', fontsize=14)
    ax.set_title(title, fontsize=20)

    return df

nybb_path = '../data/external/nyc-borough-boundaries/geo_export_e13eede4-6de2-4ed8-98a0-58290fd6b0fa.shp'
nyc_boundary = gpd.read_file(nybb_path)

# Creating the geometry for points
geometry = gpd.points_from_xy(trip_outliers['pickup_longitude'], trip_outliers['pickup_latitude'])

# Creating the plot
fig, ax = plt.subplots(figsize=(12, 12))
pickup_within_nyc = plot_points_map(    df = trip_outliers, gdf_map = nyc_boundary, 
                                        geometry_col = geometry, ax = ax, title =  'Pickup Locations',
                                        color_map= beige, color_points= orange  ) 

# Showing the plot
plt.show()

print(f'Number of outliers removed: {len(trip_outliers) - len(pickup_within_nyc)}')

Number of outliers removed: 1216

geometry = gpd.points_from_xy(pickup_within_nyc['dropoff_longitude'], 
                                pickup_within_nyc['dropoff_latitude'])

fig, ax = plt.subplots(figsize=(12, 12))
pickup_dropoff_within_nyc = plot_points_map(    df = pickup_within_nyc, gdf_map = nyc_boundary, 
                                                geometry_col = geometry, ax = ax, title =  'Dropoff Locations',
                                                color_map= beige, color_points= orange  )
plt.show()

print(f'Number of outliers removed:{len(pickup_within_nyc) - len(pickup_dropoff_within_nyc)}' )

Number of outliers removed:5359

2.2. Target Feature (Trip Duration)

# Copy the dataset with removed outliers from locations
trip_outliers = pickup_dropoff_within_nyc.copy()

The interquartile range (IQR)

The IQR gives a sense of how spread out the values in a dataset are and is a robust measure of dispersion that is not influenced by outliers.

A quartile divides data into four equal parts, each comprising 25% of the data. $Q_1$ represents the 25th percentile of the data, meaning 25% of data points are less than or equal to this value. $Q_3$ represents the 75th percentile, implying that 75% of data points are less than or equal to this value.

Let’s calculate the IQR for the target feature trip_duration and plot a blox plot to visualize the outliers:

def boxplot_stats(series, whis = 1.5):
    """
    Calculate the statistics for a box plot.

    Returns:
    dict: A dictionary containing the quartiles, IQR, and whisker limits.
    """
    Q1 = series.quantile(0.25)
    Q2 = series.quantile(0.50)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    lower_whisker = Q1 - whis * IQR
    upper_whisker = Q3 + whis * IQR

    return {
        'Q1': Q1,
        'Q2': Q2,
        'Q3': Q3,
        'IQR': IQR,
        'lower_whis': lower_whisker,
        'upper_whis': upper_whisker
    }

def plot_distribution_boxplot(  series, ax1, ax2, title='', label='', log1p=True, 
                                draw_quartiles=True, kde=True):
    """
    Plot the distribution and boxplot of a series on given axes.

    Args:
    - series (pandas.Series): The series to plot.
    - ax1 (matplotlib.axes.Axes): The axes for the histogram.
    - ax2 (matplotlib.axes.Axes): The axes for the boxplot.
    - title (str): The title of the plot.
    - label (str): The label for the x-axis.
    - log1p (bool): If True, applies log1p transformation to the series.
    - draw_quartiles (bool): If True, draws quartile lines on the histogram.
    - kde (bool): If True, plots a KDE over the histogram.
    """
    if log1p:
        series = np.log1p(series)
    stats = boxplot_stats(series)

    sns.histplot(   series, bins=40, linewidth=0.5, color='#dfdc7bff', alpha=0.2,
                    ax=ax1, kde=kde, line_kws={'lw': 3})
    ax1.set_title(f'{title} Histogram', fontsize=15)
    ax1.set_xlabel(label, fontsize=14)
    ax1.set_ylabel('Count', fontsize=14)

    sns.boxplot(data=series, color='#dfdc7bff', ax=ax2,
                fliersize=3, flierprops={'color': '#50566dff', 'markeredgecolor': '#50566dff'})
    
    ax2.set_title(f'{title} Boxplot', fontsize=15)
    ax2.set_ylabel(label, fontsize=14)

    if draw_quartiles:
        quartiles = [stats['Q1'], stats['Q3'], stats['lower_whis'], stats['upper_whis']]
        for line in quartiles:
            ax1.axvline(line, color='#50566dff', linestyle='--', alpha=1, lw=2)
            y_center = ax1.get_ylim()[1] / 2
            ax1.text(   line, y_center, f'{line:.2f}',
                        fontsize=18, color='black', va='center', ha='right', rotation=90)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))
plot_distribution_boxplot(  trip_outliers['trip_duration'], ax1, ax2,
                            title='Trip Duration', label='Log of Trip Duration (s)')

plt.tight_layout()
plt.show()

Because of the high values and zeros in the target feature trip_duration, we apply the logarithmic transformation $\log(x+1)$. This transformation reduces the effect of outliers and the overall variance in the dataset.The histogram, illustrated in the first plot, resembles a bell curve after applying this transformation. From the boxplot, we can see:

The first and third quartiles ($Q_1$ and $Q_3$) representing the 25th and 75th percentiles, are:
- $Q_1 = 5.98$
- $Q_3 = 6.98$
The interquartile range (IQR) is the difference between the third and first quartiles:
- IQR = $Q_3 - Q_1 = 0.99$
The whiskers indicate the range of the data:
- The lower whisker : $Q_1 - 1.5 IQR = 4.50$
- The upper whisker : $Q_3 + 1.5 IQR = 8.46$

Data points outside the whiskers are potential outliers that need further validation. For better intuition, we convert these whisker values from the log scale back to the original scale in terms of hours ($\text{Hours} = \text{Seconds}/3600$ ):

$\frac{\exp(4.50) - 1}{3600} \approx 0.024 ~~\text{hours}$ $\frac{\exp(8.46) - 1}{3600} \approx 1.32 ~~\text{hours}$

The lower whisker corresponds to approximately 1.4 minutes, which is very short for a taxi trip duration, suggesting that these could be errors or special cases.The upper whisker suggests a more plausible trip duration of about 1.32 hours, but with very unrealistic high values with more than 20 hours of duration.

def count_outliers(series, whis = 1.5):
        """
        Count the number of upper and lower outliers in the series and print their percentages.

        Args:
        series (pd.Series): Series for which to count outliers.

        Returns:
        (pd.Series, pd.Series): Two boolean series, one for upper outliers and one for lower outliers.
        """

        stats = boxplot_stats(series, whis)


        upper_outliers = (series > stats['upper_whis'])
        lower_outliers = (series < stats['lower_whis'])

        # Percentage of outliers 
        percentage_upper = upper_outliers.sum() / len(series) * 100
        percentage_lower = lower_outliers.sum() / len(series) * 100

        print(  f'\nPotential upper outliers: {upper_outliers.sum()} '
                f'({percentage_upper:.2f}% of the dataset)')

        print(  f'\nPotential lower outliers: {lower_outliers.sum()} '
                f'({percentage_lower:.2f}% of the dataset)')

        return upper_outliers, lower_outliers

log_trip_duration = np.log1p(trip_outliers['trip_duration'])
upper_duration_outliers, lower_duration_outliers = count_outliers(log_trip_duration)

Potential upper outliers: 4292 (0.30% of the dataset)

Potential lower outliers: 14461 (1.00% of the dataset)

Let’s compare the trip durations with some distance measures from the pickup location to the dropoff location. In an article on New York City taxi trip duration prediction using MLP and XGBoost, some distance measures were discussed, like Manhattan, haversine, and bearing distances. Unlike the straight-line Euclidean distance, the Manhattan distance calculates the sum of the absolute differences between the horizontal and vertical coordinates, which resembles navigating an actual grid of city streets. However, this doesn’t account for the Earth’s curvature. To address this, the article also considers the haversine distance, which is more appropriate for large distances as it does account for the Earth’s curvature. Bearing distance calculates the direction between two points.

I will just adopt a simplified approach that still accounts for the Earth’s curvature. The Geodesic distance is the small path between two points on the surface of a curved object, for our case the Earth surface. To achieve This we use the geopy library:

# Function to calculate geodesic distance in a ellipsoid
def geodesic(start_lat, start_long, end_lat, end_long):
    """
    Calculate geodesic distances between pickup and dropoff locations.

    Args:
    - start_lat, start_long, end_lat, end_long: Lists or arrays of coordinates 
    for the start point and end point.

    Returns:
    List of distances in meters for each pair of start and end points.
    """
    coordinates = zip(start_lat, start_long, end_lat, end_long)
    distances = [distance.distance((lat1, lon1), (lat2, lon2)).meters for lat1, lon1, lat2, lon2 in coordinates]
    return distances

# Calculate the geodesic distance
geodesic_distance = geodesic(   trip_outliers['pickup_latitude'].to_numpy(), 
                                trip_outliers['pickup_longitude'].to_numpy(), 
                                trip_outliers['dropoff_latitude'].to_numpy(), 
                                trip_outliers['dropoff_longitude'].to_numpy())

Potential Lower Outliers

Approximately 1% of all trips have durations less than or equal to 1.4 minutes. Based on the table below, it appears that some of these trips may have been canceled, as indicated by a zero or minimal geodesic distance. Because our analysis aims to predict the trip duration, we will remove these potentially canceled trips from the dataset.

trip_outlier_geodesics = trip_outliers.copy()
trip_outlier_geodesics['geodesic_distance'] = geodesic_distance

display(    trip_outlier_geodesics[lower_duration_outliers]
            .groupby('passenger_count')[['trip_duration', 'geodesic_distance']]
            .agg(['min', 'mean','max',  'count'])   )

	trip_duration				geodesic_distance
	min	mean	max	count	min	mean	max	count
passenger_count
1	1	47.663972	88	11258	0.0	272.870639	19983.682930	11258
2	1	51.322023	88	1562	0.0	299.515846	1823.852419	1562
3	2	53.269321	88	427	0.0	315.255005	1491.261959	427
4	2	53.846890	88	209	0.0	302.298194	1665.249710	209
5	2	55.686591	88	619	0.0	347.513153	1509.483335	619
6	2	57.704663	88	386	0.0	364.417960	1120.139579	386

Potential Upper Outliers

Approximately 0.30% of all trips have durations of 1.32 hours or less. In this case, a table alone does not give much information about these potential outliers; therefore, we employ a histogram to better visualize these potential upper outliers.

From the histogram, it is apparent that around 1700 trips have durations exceeding 20 hours. In contrast, a larger cluster of over 2000 trips has durations ranging between 1.32 hours and 4 hours. Considering this distribution, only those with durations longer than 4 hours may be classified as actual outliers, as durations less than 4 hours are within a plausible range for taxi trips in NYC.

upper_outliers = trip_outliers[upper_duration_outliers]['trip_duration'].reset_index(drop=True)/3600

fig, axes = plt.subplots(1, 1, figsize= (10, 5))

sns.histplot(  upper_outliers, bins=20, linewidth=0.5, color=yellow, alpha=0.5)

axes.set_title(f'High Trip Duration Outliers', fontsize=15)
axes.set_xlabel('Trip Duration (hours)', fontsize=14)
axes.axvline(4, color=grey, linestyle='--', alpha=1, lw=2)
y_center = axes.get_ylim()[1] / 2
axes.text(4, y_center, f'{4:.2f} hours',
            fontsize=12, color='black', va='center', ha='right', rotation=90)
plt.tight_layout()
plt.show()

We observed that there are also unrealistic values for geodesic_distance. These unrealistic values can simple be deleted because they don’t represent a real patters for the trip duration. We’ll create variables to hold boolean series for each of the identified outliers conditions:

# Percentage of lower duration outliers
per_lower_duration = lower_duration_outliers.sum() / len(trip_outliers) * 100
print(f'\nOutliers with trip duration less than 1.4 minutes: {lower_duration_outliers.sum()} '
      f'({per_lower_duration:.2f}% of the dataset)')


# Boolean series for trip durations over 4 hours
upper_duration_outliers = trip_outliers['trip_duration'] > 4 * 3600
per_upper_duration = upper_duration_outliers.sum() / len(trip_outliers) * 100

print(f'\nOutliers with trip duration greater than 4 hours: {upper_duration_outliers.sum()} '
      f'({per_upper_duration:.2f}% of the dataset)')


# Boolean series for geodesic distance == zero
geodesic_outliers_zeros = (trip_outlier_geodesics['geodesic_distance'] == 0)
# Percentage of outliers
per_geodesic = geodesic_outliers_zeros.sum() / len(taxi_trip) * 100

print(f'\nOutliers with zero geodesic distances: {geodesic_outliers_zeros.sum()} '
      f'({per_geodesic:.2f}% of the dataset)')

Outliers with trip duration less than 1.4 minutes: 14461 (1.00% of the dataset)

Outliers with trip duration greater than 4 hours: 2059 (0.14% of the dataset)

Outliers with zero geodesic distances: 5490 (0.38% of the dataset)

Average Geodesic Velocity

Before removing all this outliers that we identify, let’s first investigate one more aspect of the trip duration. As direct observations may not be sufficient to identify all outliers, we can calculate an average velocity based on the geodesic distances to provide a better understanding of potential outliers.

# Calculate the average velocity in Km/h
geo_distance_km = trip_outlier_geodesics['geodesic_distance']/1000
trip_duration_hrs = trip_outlier_geodesics['trip_duration']/3600
velocity_geodesic_kmh = (geo_distance_km) / (trip_duration_hrs)

# For simplicity add velocity to the dataset
trip_outlier_geodesics['avg_velocity_kmh'] = velocity_geodesic_kmh


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))
plot_distribution_boxplot(  trip_outlier_geodesics['avg_velocity_kmh'], ax1, ax2, 
                            'Average Geodesic Velocity', 
                            'Log of Average Velocity (Km/h)'    )
plt.tight_layout()
plt.show()


log_velocity_geodesic = np.log1p(velocity_geodesic_kmh)
upper_vel_outliers, lower_vel_outliers = count_outliers(log_velocity_geodesic)

Potential upper outliers: 4950 (0.34% of the dataset)
    
Potential lower outliers: 23999 (1.65% of the dataset)

The lower and upper values for potential outliers are:

Lower whisker : ${\exp(1.39) - 1} \approx 3.0 ~~ km/h $
Upper whisker : ${\exp(3.86) - 1} \approx 46.5 ~~ km/h $

New York City is a metropolis known for its traffic congestion. The average speed of a taxi in New York City with passengers on board is 20.3 km/h, as reported in the article “Taxi driver speeding”:

“… The average occupancy speed of taxi drivers in Shanghai (21.3 km/h) was similar to that of NYC (20.3 km/h)”

Considering that approximately 0.35% of the total number of trips have an average velocity greater than or equal to 46.5 km/h, these cases are not representative of typical taxi trips in NYC and thus can be removed.

3.1.1 Handling Outliers

We will now address the outliers identified in the target feature trip_duration using the following steps

Remove the trip_duration less then or equal to 1.4 minutes. This will remove 1.00% of instances from the original dataset.
Remove the trip_duration greater than 4 hours. This will remove 0.14% of instances from the original dataset.
Remove geodesic_distance equal to 0.0 meters. This will remove 0.38% of instances from the original dataset.
Remove the trip_duration corresponding to an average velocity greater than or equal to 46.5 km/h. This will remove 0.34% of instances from the original dataset.

We will use the boolean series previously defined to create a single series that identifies an entry as an outlier if it meets any one of the outlier conditions:

# Combine all outlier conditions
all_outliers =  lower_duration_outliers | upper_duration_outliers \
                | geodesic_outliers_zeros | upper_vel_outliers

not_outliers = ~all_outliers

# filter the original by not_outliers
original_length = len(taxi_trip)

per_outliers = all_outliers.sum()/original_length * 100
print(  f'\nOutliers removed: {all_outliers.sum()} '
        f'({per_outliers:.2f}% of the dataset)')

Outliers removed: 24848 (1.70% of the dataset)

taxi_trip = trip_outliers[not_outliers].copy()


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))
plot_distribution_boxplot(taxi_trip['trip_duration'], ax1, ax2,
                        'Trip Duration', 'Log of Trip Duration (s)')


plt.tight_layout()
plt.show()

# save the dataset
taxi_trip.to_csv('../data/interim/nyc-taxi-trip-2016-remove-outliers.csv', index=False)

3. Feature Engineering

Before splitting the dataset and conducting the exploratory data analysis (EDA), let’s create some new features that will help in the model predictions and in the Exploratory Data Analysis section.

taxi_trip = pd.read_csv('../data/interim/nyc-taxi-trip-2016-remove-outliers.csv', 
                        parse_dates= ['pickup_datetime','dropoff_datetime'])

3.1. Datetime Features

Using the pickup time with format YYYY-MM-DD H:M:S, let’s extract features like the year, month, day, hours and minutes for more informative features and also the time of day, day of the week, day of the year to help with possible underlying patters in the data related to specific time intervals. So we will create a total of 8 new features:

# Days in a year: 1-365 (or 366)
taxi_trip['day_of_year'] = taxi_trip['pickup_datetime'].dt.dayofyear
# Day of week: 0-6 (0 for Monday, 6 for Sunday)
taxi_trip['day_of_week'] = taxi_trip['pickup_datetime'].dt.dayofweek
# Hour of the day :0-23
taxi_trip['hour_of_day'] = taxi_trip['pickup_datetime'].dt.hour
# Minutes in a day: 0-1440
taxi_trip['min_of_day'] = (taxi_trip['pickup_datetime'].dt.hour * 60) + taxi_trip['pickup_datetime'].dt.minute

# Date features
taxi_trip['pickup_date'] = taxi_trip['pickup_datetime'].dt.date
taxi_trip['dropoff_date'] = taxi_trip['pickup_datetime'].dt.date
taxi_trip['year'] = taxi_trip['pickup_datetime'].dt.year
taxi_trip['month'] = taxi_trip['pickup_datetime'].dt.month
taxi_trip['day'] = taxi_trip['pickup_datetime'].dt.day
taxi_trip['hour'] = taxi_trip['pickup_datetime'].dt.hour
taxi_trip['minute'] = taxi_trip['pickup_datetime'].dt.minute

# Drop datetime features, they are not necessary anymore
taxi_trip.drop('pickup_datetime', axis=1, inplace=True)
taxi_trip.drop('dropoff_datetime', axis=1, inplace=True)

taxi_trip.head(3)

	id	vendor_id	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag
0	id2875421	2	1	-73.982155	40.767937	-73.964630	40.765602	False
1	id2377394	1	1	-73.980415	40.738564	-73.999481	40.731152	False
2	id3858529	2	1	-73.979027	40.763939	-74.005333	40.710087	False

	trip_duration	day_of_year	day_of_week	hour_of_day	min_of_day	pickup_date	dropoff_date	year	month	day	hour	minute
0	455	74	0	17	1044	2016-03-14	2016-03-14	2016	3	14	17	24
1	663	164	6	0	43	2016-06-12	2016-06-12	2016	6	12	0	43
2	2124	19	1	11	695	2016-01-19	2016-01-19	2016	1	19	11	35

Another relevant information is weekends and holidays. For this we will use the holiday dataset from kaggle, which contains the New York City holidays for the year of 2016. We will create two new binary features, is_holiday and is_weekend:

holidays = pd.read_csv('../data/raw/nyc-holidays-2016.csv', sep = ';')
display(holidays.head())

	Day	Date	Holiday
0	Friday	January 01	New Years Day
1	Monday	January 18	Martin Luther King Jr. Day
2	Friday	February 12	Lincoln's Birthday
3	Monday	February 15	Presidents' Day
4	Sunday	May 08	Mother's Day

# Convert to datetime the Date column
holidays['Date'] = pd.to_datetime(holidays['Date'] + ', 2016')

# Extract month and day as numerical values
holidays['month'] = holidays['Date'].dt.month
holidays['day'] = holidays['Date'].dt.day
# Add the is_holiday column 
holidays['is_holiday'] = True

# Drop the original columns
holidays.drop(['Date', 'Day', 'Holiday'], axis=1, inplace=True)

# Save the processed holidays dataset
holidays.to_csv('../data/external/nyc-holidays-2016.csv',index=False)

display(holidays.head())

	month	day	is_holiday
0	1	1	True
1	1	18	True
2	2	12	True
3	2	15	True
4	5	8	True

We will then use the month and day features to merge the taxi trip dataset with the holidays dataset using a left join:

# Merge the taxi data with the holidays data
taxi_trip = taxi_trip.merge(holidays, on=['month', 'day'], how='left')

# Replace NaN values in the is_holiday column with False
taxi_trip['is_holiday'].fillna(False, inplace=True)

# Add the is_weekend column (5:Saturday or 6:Sunday)
taxi_trip['is_weekend'] = taxi_trip['day_of_week'].isin([5, 6])

display(taxi_trip.head())
print('Number of trips in holiday:', taxi_trip[taxi_trip['is_holiday'] == True].shape[0])

	id	vendor_id	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag
0	id2875421	2	1	-73.982155	40.767937	-73.964630	40.765602	False
1	id2377394	1	1	-73.980415	40.738564	-73.999481	40.731152	False
2	id3858529	2	1	-73.979027	40.763939	-74.005333	40.710087	False
3	id3504673	2	1	-74.010040	40.719971	-74.012268	40.706718	False
4	id2181028	2	1	-73.973053	40.793209	-73.972923	40.782520	False

	trip_duration	day_of_year	min_of_day	pickup_date	dropoff_date	year	month	day	hour	minute	is_holiday	is_weekend
0	455	74	1044	2016-03-14	2016-03-14	2016	3	14	17	24	False	False
1	663	164	43	2016-06-12	2016-06-12	2016	6	12	0	43	False	True
2	2124	19	695	2016-01-19	2016-01-19	2016	1	19	11	35	False	False
3	429	97	1172	2016-04-06	2016-04-06	2016	4	6	19	32	False	False
4	435	86	810	2016-03-26	2016-03-26	2016	3	26	13	30	False	True

Number of trips in holiday: 49736

3.2. Traffic Features

A brief search on Google about New York City’s traffic patterns revealed useful insights from sources such as A Guide to Traffic in NYC and in 10 Tips to Beat NYC Traffic sites. These resources suggest that high traffic times in and out of Manhattan typically occur between 8-9 a.m. and 3-7 p.m. Based on this information, we will create a new feature called is_rush_hour to determine whether the pickup time falls within rush hour on a given day. To simplify, weekends and holidays will be considered as having normal traffic hours throughout the day. Here based on this information and on the Exploratory Data Analysis section, we chose to consider the following time intervals for rush hours: 7-10 a.m. and 4-8 p.m . This feature will be a binary feature that indicates a increase in traffic during the rush hours:

# Define rush hours
morning_rush = (taxi_trip['hour'] >= 7) & (taxi_trip['hour'] < 10)
evening_rush = (taxi_trip['hour'] >= 16) & (taxi_trip['hour'] < 20)

# Weekdays are 0 (Monday) to 4 (Friday)
is_weekday = taxi_trip['day_of_week'] < 5


taxi_trip['is_rush_hour'] = (morning_rush | evening_rush) & is_weekday

display(taxi_trip.head(3))

	id	vendor_id	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag
0	id2875421	2	1	-73.982155	40.767937	-73.964630	40.765602	False
1	id2377394	1	1	-73.980415	40.738564	-73.999481	40.731152	False
2	id3858529	2	1	-73.979027	40.763939	-74.005333	40.710087	False

	trip_duration	day_of_year	pickup_date	dropoff_date	year	month	day	hour	minute	is_holiday	is_weekend	is_rush_hour
0	455	74	2016-03-14	2016-03-14	2016	3	14	17	24	False	False	True
1	663	164	2016-06-12	2016-06-12	2016	6	12	0	43	False	True	False
2	2124	19	2016-01-19	2016-01-19	2016	1	19	11	35	False	False	False

3.3. Distance Features

In an article on New York City taxi trip duration prediction using MLP and XGBoost, as discussed in the outliers section, some distance measures were discussed, like Manhattan, haversine, and bearing distances.

For this project, I will adopt a simplified approach that still accounts for the Earth’s curvature. This will involve calculating the geodesic distance using the geopy library and computing the bearing distance. The bearing distance measures the angle between the north line and the line connecting the pickup and dropoff locations.

# Again lets use the function geodesic to create new features
taxi_trip['geodesic_distance'] = geodesic(  taxi_trip['pickup_latitude'].to_numpy(), 
                                            taxi_trip['pickup_longitude'].to_numpy(), 
                                            taxi_trip['dropoff_latitude'].to_numpy(), 
                                            taxi_trip['dropoff_longitude'].to_numpy())

def bearing(start_lat, start_lon, end_lat, end_lon):
    """
    Calculate the initial bearing (azimuth) between two sets of latitude and longitude coordinates.

    Args:
    - start_lat (float): Latitude of the starting point in degrees.
    - start_lon (float): Longitude of the starting point in degrees.
    - end_lat (float): Latitude of the ending point in degrees.
    - end_lon (float): Longitude of the ending point in degrees.

    Returns:
    float: The initial bearing in degrees, normalized to the range [0, 360).
    """
    
    # Convert latitude and longitude from degrees to radians
    start_lat, start_lon, end_lat, end_lon = map(np.radians, [start_lat, start_lon, end_lat, end_lon])

    # Calculate the change in coordinates
    dlon = end_lon - start_lon

    # Calculate bearing
    x = np.sin(dlon) * np.cos(end_lat)
    y = np.cos(start_lat) * np.sin(end_lat) - np.sin(start_lat) * np.cos(end_lat) * np.cos(dlon)
    initial_bearing = np.arctan2(x, y)

    # Convert from radians to degrees and normalize (0-360)
    initial_bearing = np.degrees(initial_bearing)
    bearing = (initial_bearing + 360) % 360

    return bearing

# create bearing feature
taxi_trip['bearing'] = bearing( taxi_trip['pickup_latitude'], taxi_trip['pickup_longitude'], 
                                taxi_trip['dropoff_latitude'], taxi_trip['dropoff_longitude'])

display(taxi_trip.head(3))

	id	vendor_id	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag
0	id2875421	2	1	-73.982155	40.767937	-73.964630	40.765602	False
1	id2377394	1	1	-73.980415	40.738564	-73.999481	40.731152	False
2	id3858529	2	1	-73.979027	40.763939	-74.005333	40.710087	False

	trip_duration	day_of_year	year	month	day	hour	minute	is_holiday	is_weekend	is_rush_hour	geodesic_distance	bearing
0	455	74	2016	3	14	17	24	False	False	True	1502.171837	99.970196
1	663	164	2016	6	12	0	43	False	True	False	1808.659969	242.846232
2	2124	19	2016	1	19	11	35	False	False	False	6379.687175	200.319835

3.4. Weather Features

Even small weather changes, like rain or snow, can significantly affect traffic and transportation patterns in a large city. Rain or snow can impact driving conditions, increase traffic congestion, and influence public transportation usage, leading to longer taxi trip times and potentially higher demand for taxis. Therefore, we will use the weather dataset from kaggle. This dataset includes information on temperature, precipitation, visibility, and other relevant weather metrics. We plan to integrate this data with our taxi trip dataset by merging them based on date and time as we did before with the holidays dataset.

weather = pd.read_csv('../data/raw/nyc-weather-2016.csv', parse_dates=['Time'])
display(weather.head(3))

	Time	Temp.	Windchill	Heat Index	Humidity	Pressure	Dew Point	Visibility	Wind Dir	Wind Speed	Precip	Events	Conditions
0	2015-12-31 02:00:00	7.8	7.1	NaN	0.89	1017.0	6.1	8.0	NNE	5.6	0.8	NaN	Overcast
1	2015-12-31 03:00:00	7.2	5.9	NaN	0.90	1016.5	5.6	12.9	Variable	7.4	0.3	NaN	Overcast
2	2015-12-31 04:00:00	7.2	NaN	NaN	0.90	1016.7	5.6	12.9	Calm	0.0	0.0	NaN	Overcast

The weather dataset shows some missing values in features like Windchill, Heat Index, Pressure, and Visibility. For our analysis, Visibility is a critical feature due to its potential impact on driving conditions. The missing values in Windchill and Heat Index are not a concern for our study, as these factors are less directly related to traffic patterns. Other relevant features for our analysis, such as Conditions and Temp. have no missing values, and we do not need to worry.

# Checking for missing values, data types and duplicates:
print('duplicated values:', weather.duplicated().sum())
print('\n')
display(weather.info())
print('\n')
display(weather.isnull().sum()) 

duplicated values: 0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8787 entries, 0 to 8786
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Time        8787 non-null   datetime64[ns]
 1   Temp.       8787 non-null   float64       
 2   Windchill   2295 non-null   float64       
 3   Heat Index  815 non-null    float64       
 4   Humidity    8787 non-null   float64       
 5   Pressure    8556 non-null   float64       
 6   Dew Point   8787 non-null   float64       
 7   Visibility  8550 non-null   float64       
 8   Wind Dir    8787 non-null   object        
 9   Wind Speed  8787 non-null   float64       
 10  Gust Speed  8787 non-null   float64       
 11  Precip      8787 non-null   float64       
 12  Events      455 non-null    object        
 13  Conditions  8787 non-null   object        
dtypes: datetime64[ns](1), float64(10), object(3)
memory usage: 961.2+ KB



None







Time             0
Temp.            0
Windchill     6492
Heat Index    7972
Humidity         0
Pressure       231
Dew Point        0
Visibility     237
Wind Dir         0
Wind Speed       0
Gust Speed       0
Precip           0
Events        8332
Conditions       0
dtype: int64

The feature Conditions is a categorical feature that describes the weather conditions at the time of the observation. the feature have the following unique values of categorical data:

print(weather['Conditions'].unique())

['Overcast' 'Clear' 'Partly Cloudy' 'Mostly Cloudy' 'Scattered Clouds'
 'Unknown' 'Light Rain' 'Haze' 'Rain' 'Heavy Rain' 'Light Snow' 'Snow'
 'Heavy Snow' 'Light Freezing Fog' 'Light Freezing Rain' 'Fog']

This features can be used to create new binary features that indicate the presence of a specific weather condition. We can create new features as snow, heavy_snow, rain, heavy_rain that indicates whether it was snowing or raining at the time of the observation. to create this features, we first group the Conditions feature into 4 categories:

Snow: Light Snow, Snow
Heavy Snow: Heavy Snow,Lightning freezing fog, Fog
Rain: Light Rain, Rain
Heavy Rain: Heavy Rain, Light Freezing Rain

The idea to group the conditions this way is to consider the conditions that can impact the traffic in a similar way. For example, I add Fog in Heavy Snow because it can affect the visibility in a similar way. The same for Light Freezing Rain and Heavy Rain.

To integrate weather data into the dataset of taxi trip, we must first preprocessed the weather dataset to align it with the taxi trip dataset. Let’s separate the date and time into different columns for year, month, day, hour.

weather['year'] = weather['Time'].dt.year
weather['month'] = weather['Time'].dt.month
weather['day'] = weather['Time'].dt.day
weather['hour'] = weather['Time'].dt.hour

# Select only the year of 2016
weather = weather[(weather['year'] == 2016)]

# Select only the months that have in the taxi dataset
months_taxi_trip = taxi_trip['month'].unique()
select_month = weather['month'].isin(months_taxi_trip)
weather = weather[select_month]


# Create new features based in the conditions column
snow = (weather['Conditions'] == 'Light Snow') | (weather['Conditions'] == 'Snow')
heavy_snow = (weather['Conditions'] == 'Heavy Snow') | (weather['Conditions'] == 'Light Freezing Fog')\
                                                    | (weather['Conditions'] == 'Fog')

rain = (weather['Conditions'] == 'Light Rain') | (weather['Conditions'] == 'Rain')
heavy_rain = (weather['Conditions'] == 'Heavy Rain') | (weather['Conditions'] == 'Light Freezing Rain') 

weather['snow'] = snow
weather['heavy_snow'] = heavy_snow
weather['rain'] = rain
weather['heavy_rain'] = heavy_rain

# Drop the features that are not needed
weather.drop([ 'Time', 'Windchill', 'Heat Index', 'Humidity', 'Pressure', 'Dew Point', 'Wind Dir',
                'Wind Speed', 'Events', 'Conditions', 'Gust Speed'], axis=1, inplace=True)

# Rename columns to lowercase as in the taxi dataset
weather.rename(columns={'Temp.': 'temperature'}, inplace=True)
weather.rename(columns={'Visibility': 'visibility'}, inplace=True)
weather.rename(columns={'Precip': 'precipitation'}, inplace=True)

display(weather.head(3))

	temperature	visibility	year	month	day	hour	snow	heavy_snow	rain	heavy_rain
22	5.6	16.1	2016	1	1	0	False	False	False	False
23	5.6	16.1	2016	1	1	1	False	False	False	False
24	5.6	16.1	2016	1	1	2	False	False	False	False

We can check the maximum and minimum values for each feature to see if there is any inconsistency. As shown below, all features appear to be consistent with the range of values expected for each feature.

# max and min values
display(weather.describe().loc[['min', 'max']])

	temperature	visibility	precipitation	year	month	day	hour
min	-18.3	0.4	0.0	2016.0	1.0	1.0	0.0
max	32.2	16.1	11.9	2016.0	6.0	31.0	23.0

Missing values

We need to handle the missing values in visibility. This feature has values approximately between 0-16, where 0 indicates the worse visibility and 16 the best visibility.

print('Unique values for visibility:\n', weather['visibility'].unique())
print('\n')
display(weather.isna().sum().head())

Unique values for visibility:
 [16.1  nan 14.5  8.  12.9 11.3  6.4  4.8  2.4  2.8  2.   4.   3.2  9.7
  1.2  0.8  1.6  0.4]





temperature        0
visibility       132
precipitation      0
year               0
month              0
dtype: int64

To better understand the nature of the missing values, let’s create a table that correlates these missing values with different weather conditions. The missing values do not appear randomly distributed, they occur when the weather condition is good. To visualize this we can create a table that shows the number of NaN values in visibility named is_visibility_nan by the total number of days snowing and raining :

is_visibility_nan = weather['visibility'].isna()

weather_summary = pd.DataFrame({
    'is_visibility_nan': is_visibility_nan,
    '#snowing_days': snow,
    '#heavy_snowing_days': heavy_snow,
    '#raining_days': rain,
    '#heavy_raining_days': heavy_rain})

# Group by 'Visibility NaN' and calculate the sum for each weather condition
summary = weather_summary.groupby('is_visibility_nan').sum()

summary['total_counts'] =  is_visibility_nan.value_counts().values

# Reset index to turn 'Visibility NaN' back into a column
summary = summary.reset_index()


display(summary)

	is_visibility_nan	#snowing_days	#heavy_snowing_days	#raining_days	#heavy_raining_days	total_counts
0	False	61	6	171	10	4200
1	True	0	0	0	0	132

When is_visibility_nan is True (visibility data is missing), there are 132 days in total, and notably, none of these days are recorded as having snowing, heavy snowing, raining, or heavy raining conditions. This absence of precipitation-related weather conditions on days with missing visibility data suggests that visibility tends to be unrecorded primarily on days with good or clear weather.

This could make sense, as maybe under clear conditions, they overlook recording some values. Therefore, we can logically fill the missing values with the maximum value of visibility.

weather['visibility'].fillna(weather['visibility'].max(), inplace=True)
print("Missing values in visibility: ",weather['visibility'].isna().sum())

Missing values in visibility:  0

We will use the combination of year, month, day, hour features to merge the taxi trip dataset with the weather dataset using a left join:

taxi_trip_merge = taxi_trip.merge(weather, on=['year','month', 'day', 'hour'], how='left')

display(taxi_trip_merge.isna().sum().tail(10))

# Save the weather dataset
weather.to_csv('../data/external/nyc-weather-2016.csv', index=False)

is_rush_hour             0
geodesic_distance        0
bearing                  0
temperature          11751
visibility           11751
precipitation        11751
snow                 11751
heavy_snow           11751
rain                 11751
heavy_rain           11751
dtype: int64

After merging the two datasets, we can see that there are some missing dates in the weather dataset, causing 11751 missing values for the weather features. This is less then 1% of our entire dataset, so we can take a more simplified approach by dropping these rows:

taxi_trip = taxi_trip_merge.dropna().copy()

print('Old size:',taxi_trip_merge.shape[0])

print('\nNew size:',taxi_trip.shape[0])

display(taxi_trip.head(3))

Old size: 1427156

New size: 1415405

	id	vendor_id	passenger_count	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	store_and_fwd_flag
0	id2875421	2	1	-73.982155	40.767937	-73.964630	40.765602	False
1	id2377394	1	1	-73.980415	40.738564	-73.999481	40.731152	False
2	id3858529	2	1	-73.979027	40.763939	-74.005333	40.710087	False

	trip_duration	day_of_year	is_rush_hour	geodesic_distance	bearing	temperature	visibility	precipitation	snow	heavy_snow	rain	heavy_rain
0	455	74	True	1502.171837	99.970196	4.4	8.0	0.3	False	False	False	False
1	663	164	False	1808.659969	242.846232	28.9	16.1	0.0	False	False	False	False
2	2124	19	False	6379.687175	200.319835	-6.7	16.1	0.0	False	False	False	False

3.5 Cluster Features

Here we will use the mini-batch k-means algorithm to cluster the pickup and dropoff locations into 100 clusters. We will then use the cluster centers to create new features that represent the distance between the pickup and dropoff locations and the cluster centers. This can help the model to identify patterns in the data related to the cluster centers.

def plot_clusters_map(  df, gdf_map, latitude_col, longitude_col, cluster_col, ax, 
                        title=' ', sample_size=0, color_map= 'beige', edgecolor= 'grey', 
                        random_state=42, cmap='tab20', alpha=0.2):
    
    """
    Plots a geographical map with clusters from a DataFrame.

    Parameters:
    - df: DataFrame containing the data to be plotted.
    - gdf_map: GeoDataFrame representing the geographical boundaries.
    - latitude_col: Name of the column containing latitude data.
    - longitude_col: Name of the column containing longitude data.
    - cluster_col: Name of the column containing cluster identifiers.
    - ax: Matplotlib Axes object for plotting.
    - title: Title of the plot.
    - sample_size: Number of data points to sample from df (0 for all).
    - color_map: Color for the map.
    - edgecolor: Edge color for the map.
    - random_state: Random state for reproducibility in sampling.
    - cmap: Colormap for clusters.
    - alpha: Transparency level for cluster points.
    """
    
    
    if sample_size > 0:
        df = df.sample(sample_size, random_state=random_state)
        
    # Create geometry and GeoDataFrame
    geometry = gpd.points_from_xy(df[longitude_col], df[latitude_col])
    gdf_clusters = gpd.GeoDataFrame(df, geometry=geometry).set_crs(epsg=4326)
    
    # Convert map to the same CRS
    gdf_map = gdf_map.to_crs(epsg=4326)
    gdf_map.plot(ax=ax, color=color_map, edgecolor=edgecolor)  # Plot the NYC boundary
    
    # Scatter plot for clusters
    ax.scatter( gdf_clusters.geometry.x, gdf_clusters.geometry.y, s=1, 
                c=df[cluster_col].values, cmap=cmap, alpha=alpha)  # Plot clusters
    
    # Set labels and title
    ax.set_xlabel('Longitude', fontsize=14)
    ax.set_ylabel('Latitude', fontsize=14)
    ax.set_title(title, fontsize=20)

Let’s create more six features based in the results of the mini-batch k-means algorithm. We will create two features for the predict clusters, one for pickup and one for dropoff locations. From this features we extract the longitude and latitude centers of each cluster, creating four features: pickup_center_lat, pickup_center_long, dropoff_center_lat, dropoff_center_long. We will also create two features for the distance between the pickup and dropoff cluster centers, pickup_cluster_distance and dropoff_cluster_distance.

pickup_locations = taxi_trip.loc[:,['pickup_latitude', 'pickup_longitude']]
dropoff_locations = taxi_trip.loc[:,['dropoff_latitude', 'dropoff_longitude']]


coords = np.vstack( [pickup_locations.values, dropoff_locations.values])

# Randomly permute the combined coordinates 
sample_coords = np.random.permutation(coords)[:500000]


kmeans_coords = MiniBatchKMeans(    n_clusters= 100 , random_state=42, 
                                    batch_size=10000, n_init = 'auto')

kmeans_coords.fit(sample_coords)


# Predict clusters for the full dataset
taxi_trip['pickup_cluster'] = kmeans_coords.predict(pickup_locations.values)
taxi_trip['dropoff_cluster'] = kmeans_coords.predict(dropoff_locations.values)

cluster_centers = kmeans_coords.cluster_centers_
taxi_trip['pickup_center_lat'] = cluster_centers[taxi_trip['pickup_cluster'], 0]
taxi_trip['pickup_center_lon'] = cluster_centers[taxi_trip['pickup_cluster'], 1]
taxi_trip['dropoff_center_lat'] = cluster_centers[taxi_trip['dropoff_cluster'], 0]
taxi_trip['dropoff_center_lon'] = cluster_centers[taxi_trip['dropoff_cluster'], 1]

# Calculate the geodesic distance
taxi_trip['center_geodesic_distances'] = geodesic(  taxi_trip['pickup_center_lat'].to_numpy(), 
                                                    taxi_trip['pickup_center_lon'].to_numpy(), 
                                                    taxi_trip['dropoff_center_lat'].to_numpy(), 
                                                    taxi_trip['dropoff_center_lon'].to_numpy())

taxi_trip[['center_geodesic_distances', 'geodesic_distance']].head()

	center_geodesic_distances	geodesic_distance
0	1381.650448	1502.171837
1	1909.482823	1808.659969
2	6462.250705	6379.687175
3	1864.313219	1483.632481
4	844.527437	1187.037659

For a visualization, we can plot the clusters using the shape file from NYC Borough Boundaries and the cluster centers:

fig, axes = plt.subplots(1, 2, figsize=(24, 16))
titles = ['Pickup Locations', 'Dropoff Locations']
# Column names 
location_columns = [('pickup_latitude', 'pickup_longitude', 'pickup_cluster'), 
                    ('dropoff_latitude', 'dropoff_longitude', 'dropoff_cluster')]

# Loop through each subplot
for ax, title, (lat_col, lon_col, cluster_col) in zip(axes, titles, location_columns):
    plot_clusters_map(  taxi_trip, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title, color_map= beige, edgecolor=grey)
    
    # Plot the cluster centers
    ax.scatter( cluster_centers[:, 1], cluster_centers[:, 0], s = 3, 
                c = 'black', marker='*', alpha = 0.7)
    
plt.show()

# Save the dataset
taxi_trip.to_csv('../data/interim/nyc-taxi-trip-2016-new-features.csv', index=False)

4. Split Dataset

taxi_trip = pd.read_csv('../data/interim/nyc-taxi-trip-2016-new-features.csv', 
                        parse_dates= ['pickup_date','dropoff_date'])

Let’s choose to split the dataset first into two parts, df_train_large and df_test. The df_train_large will be used for exploratory data analysis (EDA) to understand the dataset deeply and identify key features. Then, using the df_train_large dataset, we will split it again into df_train and df_val. The df_train will be used to train our machine learning models and validate their prediction with the df_val dataset. The df_test set, is reserved until the end, and will provide an unbiased assessment of the model’s performance on unseen data, preventing any kind of data leakage. Also, we will use the df_train_large to compare the performance with the df_test set.

# Split in 35.75% Train /29.25% Validation/ 35% Test
df_train_large, df_test = train_test_split(taxi_trip, test_size = 0.40, random_state = random_state)
df_train, df_val = train_test_split(df_train_large, train_size = 0.55, random_state = random_state)

print(f"Train large:{len(df_train_large)}({round(100*len(df_train_large)/ len(taxi_trip), 2)}%)")
print(f"Test: {len(df_test)}({round(100*len(df_test)/len(taxi_trip),2)}%)")
print(f"Train:{len(df_train)}({round(100*len(df_train)/len(taxi_trip),2)}%)")
print(f"Validation: {len(df_val)}({round(100*len(df_val)/len(taxi_trip),2)}%)")

Train large:849243(60.0%)
Test: 566162(40.0%)
Train:467083(33.0%)
Validation: 382160(27.0%)

Also, to visualize the split data and check if the spatial and temporal features are representative in each dataset, we plot the NYC map for the pickup and dropoff locations clusters and the counts of pickup by the dates. We can see that the split is representative for the spatial and temporal features:

nybb_path = '../data/external/nyc-borough-boundaries/geo_export_e13eede4-6de2-4ed8-98a0-58290fd6b0fa.shp'
nyc_boundary = gpd.read_file(nybb_path)

fig, axes = plt.subplots(2, 2, figsize=(20, 16))
axes = axes.flatten()  # Flatten the axes array for easy iteration

# Titles and Column names
titles = ['Pickup Cluster Locations', 'Dropoff Cluster Locations']
# Column names 
location_columns = [('pickup_latitude', 'pickup_longitude', 'pickup_cluster'), 
                    ('dropoff_latitude', 'dropoff_longitude', 'dropoff_cluster')]

# Loop for training data
for ax, title, (lat_col, lon_col, cluster_col) in zip(axes[:2], titles, location_columns):
    plot_clusters_map(  df_train, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title + ' (Train)', color_map= beige, edgecolor=grey)

# Loop for test data
for ax, title, (lat_col, lon_col, cluster_col) in zip(axes[2:], titles, location_columns):
    plot_clusters_map(  df_test, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title + ' (Test)', color_map= beige, edgecolor=grey)

plt.tight_layout()  
plt.subplots_adjust(left=0.10, right=0.90, top=0.95, bottom=0.05)
plt.show()

fig, axes = plt.subplots(1, 1, figsize=(15, 5))

axes.plot(df_train.groupby('pickup_date')[['id']].agg(['count']), 'o-' , label = "Train", color = yellow)
axes.plot(df_test.groupby('pickup_date')[['id']].agg(['count']), 'o-', label = "Test", color = orange)

plt.title('Pickup records per day')
plt.legend(loc ='lower right')
plt.ylabel('Number of records')
plt.show()

5. Exploratory Data Analysis (EDA)

We proceed with the EDA in the df_train_large dataset, which is the combination of both training and validation sets. To prevent data leakage, we don’t include the test set for the EDA. This will ensure the trained model evaluation on the test set is unbiased, without information from the train set mistakenly leakage to the test.

5.1 Feature Visualizations

Let’s first plot one more time the maps of NYC pickup and dropoff cluster locations. We can see that for both the clusters are concentrated in Manhattan. The main difference between pickup and dropoff locations is the concentration of the clusters outside Manhattan. For dropoff locations, are also more concentration in other regions like in Brooklyn and Queens, when compared to the pickup locations. This is expected because Manhattan is a commercial and financial center of the United States, and Brooklyn and Queens are residential boroughs.

nybb_path = '../data/external/nyc-borough-boundaries/geo_export_e13eede4-6de2-4ed8-98a0-58290fd6b0fa.shp'
nyc_boundary = gpd.read_file(nybb_path)

fig, axes = plt.subplots(1, 2, figsize=(20, 16))
axes = axes.flatten()

titles = ['Pickup Cluster Locations', 'Dropoff Cluster Locations']
location_columns = [('pickup_latitude', 'pickup_longitude', 'pickup_cluster'), 
                    ('dropoff_latitude', 'dropoff_longitude', 'dropoff_cluster')]

for ax, title, (lat_col, lon_col, cluster_col) in zip(axes[:2], titles, location_columns):
    plot_clusters_map(  df_train_large, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title, color_map = beige , edgecolor=grey)
    
    # iterrows: Loop through each row in the DataFrame.
    # return index of each row
    # return each row as pandas series
    for idx, row in nyc_boundary.iterrows():
        centroid = row['geometry'].centroid
        ax.text(centroid.x, centroid.y, row['boro_name'], fontsize=10, ha='right', va='center')

plt.tight_layout()  
plt.subplots_adjust(left=0.10, right=0.90, top=0.95, bottom=0.05)
plt.show()

Let’s look at a simple overview visualization of the pickup and dropoff latitudes and longitudes:

fig, axes = plt.subplots(2, 2, figsize=(20, 10))  # 2x2 grid for subplots

# Pickup Longitude Histogram
sns.histplot(   df_train_large.pickup_longitude, bins=40, linewidth=0.5, color=yellow, alpha=0.2,
                ax=axes[0, 0], kde=True, line_kws={'lw': 3})
axes[0, 0].set_title('Pickup Longitude', fontsize=15)
axes[0, 0].set_xlabel('Pickup Longitude', fontsize=14)
axes[0, 0].set_ylabel('Count', fontsize=14)

# Pickup Latitude Histogram
sns.histplot( df_train_large.pickup_latitude, bins=40, linewidth=0.5, color=yellow, alpha=0.2,
                ax=axes[0, 1], kde=True, line_kws={'lw': 3})
axes[0, 1].set_title('Pickup Latitude', fontsize=15)
axes[0, 1].set_xlabel('Pickup Latitude', fontsize=14)
axes[0, 1].set_ylabel('Count', fontsize=14)

# Dropoff Longitude Histogram
sns.histplot(  df_train_large.dropoff_longitude, bins=40, linewidth=0.5, color=orange, alpha=0.2,
                ax=axes[1, 0], kde=True, line_kws={'lw': 3})
axes[1, 0].set_title('Dropoff Longitude', fontsize=15)
axes[1, 0].set_xlabel('Dropoff Longitude', fontsize=14)
axes[1, 0].set_ylabel('Count', fontsize=14)

# Dropoff Latitude Histogram
sns.histplot(   df_train_large.dropoff_latitude, bins=40, linewidth=0.5, color=orange, alpha=0.2,
                ax=axes[1, 1], kde=True, line_kws={'lw': 3})
axes[1, 1].set_title('Dropoff Latitude', fontsize=15)
axes[1, 1].set_xlabel('Dropoff Latitude', fontsize=14)
axes[1, 1].set_ylabel('Count', fontsize=14)

plt.tight_layout()
plt.show()

For the bearing feature created in the 3. Feature Engineering section, we visualized the bearing feature’s distribution using a histogram in polar coordinates. The radial variable is the geodesic distance in log scale and the polar variable is the bearing angles in degrees. On the left we plot the histogram for the bearing in cartesian coordinates, and On the right we plot the histogram in polar coordinates.

fig = plt.figure(figsize=(20, 8))

# Add a regular subplot for the linear histogram of bearing
ax1 = fig.add_subplot(121)  # 1 row, 2 columns, first subplot
sns.histplot(   df_train_large['bearing'], bins=40, linewidth=0.2, color=yellow, alpha=0.2,
                ax=ax1, kde=True, line_kws={'lw': 3})
ax1.set_title('Bearing Histogram', fontsize=15)
ax1.set_xlabel('Bearing (Degrees)', fontsize=14)
ax1.set_ylabel('Count', fontsize=14)

# Add a polar subplot for the 2D histogram
ax2 = fig.add_subplot(122, projection='polar')  # 1 row, 2 columns, second subplot with polar projection

# Convert bearing to radians
bearing_rad = np.radians(df_train_large['bearing'])

# Create a 2D histogram in polar coordinates
hist, theta, r, mappable = ax2.hist2d(  bearing_rad, 
                                        np.log10(df_train_large['geodesic_distance']), 
                                        bins=[20, 20], 
                                        cmap='YlOrBr')

# Customize the polar plot
ax2.set_theta_zero_location('N')  # Set 0 degrees to the south
ax2.set_theta_direction(-1)  # Clockwise
ax2.set_xlabel('Bearing (Degrees)', fontsize=14, labelpad=20)
ax2.set_ylabel('Log10 of Geodesic Distance (m)', fontsize=14, labelpad=40)
ax2.grid(True)

# Add a color bar to represent counts on the polar plot
cbar = plt.colorbar(mappable, ax=ax2, pad=0.1, fraction=0.035)
cbar.ax.set_ylabel('Counts')

plt.tight_layout() 
plt.show()

We can draw the following observations from these two plots:

There are two peaks for the bearing direction around 22 and 200 degrees for geodesic distances between 1 and 10 kilometers. We could infer that these are bearing directions pointing towards Manhattan.
There are other three lower peaks around 125, 300 and 350 degrees.

It appears that the bearing feature has some capability to distinguish the direction of the trip. This indicates potential importance of the bearing feature for the model. Also, it is important to note that the distance is on a logarithmic scale, emphasizing shorter trips.

For the categorical features vendor_id, passanger_count and store_and_fwd_flag we can use bar plots to visualize the distribution. Here the vendor_id refers to the companies that provide the taxi services, and the store_and_fwd_flag indicates whether the trip record was stored in vehicle memory before sending to the vendor.

fig, axes = plt.subplots(1, 3, figsize=(20, 5))  # 3 rows, 1 column

# Plot for vendor_id
vendor_id = df_train_large['vendor_id'].value_counts().sort_index()
vendor_id.plot(kind='bar', width=0.9, color=[yellow, orange], edgecolor=grey, ax=axes[0])
axes[0].set_title('Company that Provides the Taxi Service')
axes[0].set_xlabel('vendor_id')
axes[0].set_ylabel('Frequency')
axes[0].tick_params(axis='x', rotation=0)

# Plot for store_and_fwd_flag
store_fwd_flag = df_train_large['store_and_fwd_flag'].value_counts().sort_index()
np.log(store_fwd_flag).plot(kind='bar', width=0.9, color=[yellow, orange], edgecolor=grey, ax=axes[1])
axes[1].set_title('Trip Data Stored in Vehicle Memory (?)')
axes[1].set_xlabel('store_and_fwd_flag')
axes[1].set_ylabel('Frequency (log scale)')
axes[1].tick_params(axis='x', rotation=0)

# Plot for passenger_counts
passenger_counts = df_train_large['passenger_count'].value_counts().sort_index()
passenger_counts.plot(kind='bar', width=0.9, color=[orange], edgecolor=grey, ax=axes[2])
axes[2].set_title('Number of Passengers')
axes[2].set_xlabel('passenger_count')
axes[2].set_ylabel('Frequency')
axes[2].tick_params(axis='x', rotation=0)

plt.tight_layout() 
plt.show()

We can state the following observations:

The vendor_id 2 has more trips than the vendor_id 1. This can be explained by the fact that the vendor_id 2 has more vehicles than the vendor_id 1.
The store_and_fwd_flag is False for the majority of the trips (Note the use of log scale in the y-axis). This means that the majority of the trips are sent to the vendor immediately after the trip is completed. This could mean a lower chance of data loss.
The majority of the trips have only one passenger.

For insight about temporal feature, like day_of_week and hour_of_day, we can plot those feature for each vendor_id. Let’s compare the taxi services of each company for the pickup location by the day of the week and hours of the day.

fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Plot for day_of_week
grouped_data_day = df_train_large.groupby(['vendor_id', 'day_of_week'])['pickup_date'].count().reset_index()
sns.scatterplot(data=grouped_data_day, x='day_of_week', y='pickup_date', 
                hue='vendor_id', palette=['orange', 'yellow'], s=150, ax=axes[0])
axes[0].set_xticks(range(7))
axes[0].set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 
                        'Thursday', 'Friday', 'Saturday', 'Sunday'])
axes[0].set_xlabel(' ')
axes[0].set_title('Number of Pickups by Day of the Week')
axes[0].set_ylabel('Total Number of Pickups')
# Highlight weekdays for rush hour
axes[0].grid()

# Plot for hour_of_day
grouped_data_hour = df_train_large.groupby(['vendor_id', 'hour_of_day'])['pickup_date'].count().reset_index()
sns.scatterplot(data=grouped_data_hour, x='hour_of_day', y='pickup_date',  
                hue='vendor_id', palette=['orange', 'yellow'], s=150, ax=axes[1])
axes[1].set_title('Number of Pickups by Hour of the Day')
axes[1].set_xlabel('Hour of the Day')
axes[1].set_ylabel('Total Number of Pickups')
# Highlight rush hour times
axes[1].axvspan(7, 10, color=grey, alpha=0.2)
axes[1].axvspan(16, 20, color=grey, alpha=0.2)
axes[1].grid()

# Adjust the legend for both subplots
for ax in axes:
    ax.legend(title='Vendor ID', loc='upper left')

plt.tight_layout()
plt.show()

We can state the following observations based on the scatter plots:

There is lower number of trips on monday and sunday for both vendors, if vendor 2 having more trips than vendor 1. The higher number of trips is on thursday and friday.
For hours of the day, we can see that the vendor 2 has a closer number of trips than vendor 1.
We add a grey region in the second plot to indicate the rush hours feature is_rush_hour, showing a well defined pattern of increase trips for both vendors between 7-10 a.m. and 4-8 p.m.
There is a strong dip during the early morning hours. Here there is not much difference between the two vendors. Another dip occur around 4pm and then the numbers increase towards the evening.

For further insight about temporal features, we can count the trips for the hours of the day and plot the curves for months and days of the week.

grouped_month = df_train_large.groupby(['hour', 'month']).size().reset_index(name='count')
grouped_weekday = df_train_large.groupby(['hour', 'day_of_week']).size().reset_index(name='count')

# Define palettes
palette_month = sns.color_palette("tab20", n_colors=6)
palette_weekday = sns.color_palette("tab20", n_colors=7)

fig, axes = plt.subplots(1, 2, figsize=(25, 6))

# Month plot
sns.lineplot(   data=grouped_month, x='hour', y='count', hue='month',
                palette=palette_month, linewidth=2.5, ax=axes[0])

# Update the month legend
month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
handles, labels = axes[0].get_legend_handles_labels()
axes[0].legend(handles=handles, title='Month', labels=month_labels, loc='upper left')

axes[0].set_xlabel('Hour of the Day')
axes[0].set_ylabel('Count')
axes[0].set_title('Hour of Day vs Count by Month')
axes[0].axvspan(7, 10, color=grey, alpha=0.2)
axes[0].axvspan(16, 20, color=grey, alpha=0.2)
axes[0].grid()

# Weekday plot
sns.lineplot(   data=grouped_weekday, x='hour', y='count', hue='day_of_week', 
                palette=palette_weekday, linewidth=2.5, ax=axes[1])

# Update the weekday legend
weekday_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
handles, labels = axes[1].get_legend_handles_labels()
axes[1].legend(handles=handles, title='Day of the Week', labels=weekday_labels, loc='lower right')

axes[1].set_xlabel('Hour of the Day')
axes[1].set_ylabel('Count')
axes[1].set_title('Hour of Day vs Count by Weekday')

# Highlight rush hour times
axes[1].axvspan(7, 10, color=grey, alpha=0.2)
axes[1].axvspan(16, 20, color=grey, alpha=0.2)
axes[1].grid()

plt.show()

We can state the following observations based on the line plots:

For the months they show similar patterns as before for the hours of the day. Again respecting the rush hours feature is_rush_hour, showing a well defined pattern of increased trips between 7-10 a.m. and 4-8 p.m.
January and June experience fewer trips, while March and April see a greater number of trips.
In the Feature Engineering section, we excluded Saturdays and Sundays from the is_rush_hour definition. This choice is supported by the lines for Saturday and Sunday, which show significantly fewer trips than the other days of the week during the morning rush hours of 7-10 AM. However, during the evening rush hours of 4-8 PM, an interesting pattern occurs: Friday and Saturday see an increase in trips that extends into late night. This can be explained by the fact that Friday and Saturday are the days of the week with more nightlife activity in NYC.

5.2 Feature Correlations

In the previous section, we visualized the distribution of some individual features to understand and get insights about the data. Now, we will look at the correlation between those features and the target feature trip_duration and the correlation between each other.

On the left, we plot a hexbin for the correlation between the log trip_duration and the log geodesic_distance, and in the right again we plot a 2d histogram in polar coordinates, where the radial is the log of the trip_duration and the polar angle is the log of the bearing feature. We can state the following observations:

fig = plt.figure(figsize=(20, 8))

log_geodesic_distance = np.log1p(df_train_large['geodesic_distance'])
log_trip_duration = np.log1p(df_train_large['trip_duration'])

ax1 = fig.add_subplot(121)  
ax1.hexbin(log_geodesic_distance, log_trip_duration, gridsize=100, cmap='YlOrBr', mincnt=1)
ax1.set_xlabel('Log10 of Geodesic Distance (m)', fontsize=14)
ax1.set_ylabel('Log1p of Trip Duration (s)', fontsize=14)

# Add a color bar to represent counts on the polar plot
cbar_hexbin = plt.colorbar(mappable, ax=ax1, pad=0.1, fraction=0.035)
cbar_hexbin.ax.set_ylabel('Counts')

# Add a polar subplot for the 2D histogram
ax2 = fig.add_subplot(122, projection='polar')  # 1 row, 2 columns, second subplot with polar projection
bearing_rad = np.radians(df_train_large['bearing'])
hist, theta, r, mappable = ax2.hist2d(  bearing_rad, 
                                        log_trip_duration, 
                                        bins=[20, 20], 
                                        cmap='YlOrBr')

# Customize the polar plot
ax2.set_theta_zero_location('N')  # Set 0 degrees to the south
ax2.set_theta_direction(-1)  # Clockwise
ax2.set_xlabel('Bearing (Degrees)', fontsize=14, labelpad=20)
ax2.set_ylabel('Log1p of Trip Duration (s)', fontsize=14, labelpad=40)
ax2.grid(True)

# Add a color bar to represent counts on the polar plot
cbar_polar = plt.colorbar(mappable, ax=ax2, pad=0.1, fraction=0.035)
cbar_polar.ax.set_ylabel('Counts')

plt.tight_layout() 
plt.show()

By the left plot, we can see that there is a positive correlation between the log of the trip_duration and the log of the geodesic_distance. This is expected because the longer the distance, the longer the trip duration. However, this correlation is not linear, as we can see that the correlation is stronger for shorter distances. This can be explained by the fact that the longer the distance, the more traffic and other factors can influence the trip duration.

In the polar plot, there are two peaks for the bearing direction around 22 and 200 degrees for trip duration approximated between 405 and 1098 seconds, which implies that most trips are short in this directions, probably covering distances between 1 and 10 kilometers as we previously stated.

def plot_feature_importance(df, x, y, ax, threshold=0.002, title='Feature Importance', 
                            xlabel='Features', ylabel='Importance', palette=None):
    """
    Function to plot the feature importance with a distinction of importance based on a threshold.

    Parameters:
    - df: pandas.DataFrame
        DataFrame containing features and their importance scores.
    - x: str
        Name of the column representing feature names.
    - y: str
        Name of the column representing feature importance scores.
    - ax: matplotlib axis object
        Axis on which to draw the plot.
    - threshold: float, optional (default=0.002)
        Value above which bars will be colored differently.
    - pad: float, optional (default=5.0)
        Adjust the layout of the plot.
    - title: str, optional (default='Feature Importance')
        Title of the plot.
    - xlabel: str, optional (default='Features')
        Label for the x-axis.
    - ylabel: str, optional (default='Importance')
        Label for the y-axis.
    - palette: list, optional
        A list of two colors. The first color is for bars 
        below the threshold and the second is for bars above.

    Returns:
    - None (modifies ax in-place)
    """
    if palette is None:
        palette = ["blue", "red"]
    
    blue, red = palette
    colors = [red if value >= threshold else blue for value in df[y]]
    sns.barplot(x=x, y=y, data=df, ax=ax, alpha=0.5, palette=colors, legend= False, hue = x)
    ax.set_xticks(range(len(df[x])))
    ax.set_xticklabels(df[x], rotation=90, fontsize=12)
    ax.set_ylabel(ylabel, fontsize=15)
    ax.set_xlabel(xlabel, fontsize=15) 
    ax.grid(axis='y')
    ax.set_title(title, fontsize=15)

To better understand the relationship between our target feature, trip_duration, and the other features, we explore two distinct statistical measures: mutual information and the Spearman correlation coefficient.

Mutual information evaluates the mutual dependence between two variables and captures both linear and non-linear relationships. This measure is robust against outliers and provides reliability for any numerical or categorical features. On the other hand, Spearman correlation, while also addressing both linear and non-linear relationships, does not assume any specific data distribution. We can interpret the comparison of these metrics as follows:

If mutual information has a strong relationship and weak Spearman correlation: can suggest the presence of a complex, non-linear or non-monotonic relationship. Mutual information captures all kinds of dependencies, including non-linear ones, so it can detect relationships that Spearman correlation might miss if they don’t involve consistently increasing or decreasing ranks.
If mutual information has a weak relationship and strong Spearman correlation: can suggests a monotonic relationship. Spearman correlation assesses how well the relationship between two variables can be described by a monotonic function, which means as one variable increases, the other also consistently increases (or decreases).

features_to_drop = ['id', 'pickup_date', 'dropoff_date', "year", 
                    "min_of_day", "hour", "day_of_year", "minute" ]
df_train_dropped = df_train_large.drop(features_to_drop, axis=1).copy()

# Sample 50% with Random permutation
sample_size = int(len(df_train_large) * 0.5)
random_indices = np.random.permutation(df_train_large.index)
sample_indices = random_indices[:sample_size]

# Create a sample DataFrame
df_sampled = df_train_dropped.loc[sample_indices].drop('trip_duration', axis=1).copy()
Y_train_large = df_train_large['trip_duration'].copy()
Y_sampled = Y_train_large.loc[sample_indices]

# Mutual information
mi_values = mutual_info_classif(df_sampled, Y_sampled, discrete_features=False)
mi_column = pd.DataFrame(   mi_values, index=pd.Index(df_sampled.columns, name='features'), 
                            columns=['mutual_inf']).sort_values(by='mutual_inf', ascending=False)
mi_column_reset = mi_column.reset_index()

# Spearmann correlation
spearman_corr = df_train_dropped.corr(method='spearman')['trip_duration'].agg(np.abs)\
                                                .sort_values(ascending=False)\
                                                .drop('trip_duration')\
                                                .to_frame()                                        
spearman_column = spearman_corr.reset_index()
spearman_column.columns = ['features', 'spearman_corr']


# Plots
fig, ax = plt.subplots(1, 2, figsize=(25, 10))

plot_feature_importance(mi_column_reset, x='features', y='mutual_inf', ax=ax[0], threshold=0.02, 
                        title="Mutual Information", xlabel=" ", 
                        ylabel="Taxi Trip Duration Correlation", palette=[yellow, orange])

plot_feature_importance(spearman_column, x='features', y='spearman_corr', ax=ax[1], threshold=0.02, 
                        title="Spearman Correlation", xlabel=" ", 
                        ylabel="Taxi Trip Duration Correlation", palette=[yellow, orange])

plt.tight_layout(pad=5.0)
plt.show()

The strongest correlation for trip_duration are the distance measures, geodesic_distance and center_geodesic_distances, as expected. The other features compared to the distances are less correlated, but yet have some relevance. Most of those features are related with pickup and dropoff locations.

The two metrics show that most of the feature location have a strong correlation, but they do not agree with some features locations. For this purpose we could create a plot only for those features for a clear understanding of the correlation. I will skip this step for now.
The bearing shows a much more strong correlation in the mutual information than the Spearman. In both plots the bearing is less correlated then the location features. This can suggest the bearing has a more complex, non-linear or non-monotonic relationship with trip_duration. This could be explained due to the directional effects that aren’t increasing or decreasing with time, because the bearing only provides the angle from the direction of pickup to the dropoff locations.
The majority of climate features do not have a strong correlation with the target feature, with exception for the visibility and temperature features. For visibility we have a strong correlation for the mutual information and a weak correlation for Spearman, suggesting a complex, non-linear or non-monotonic relationship. For temperature, the opposite happens, suggesting a more monotonic relationship, that could be explained when extreme temperatures could consistently affect traffic patterns.
Almost all temporal features do not show a strong correlation with the target feature trip_duration for mutual information. However, for the Spearman correlation, the temporal features hour_of_day, month, is_weekend and day_of_week, is_holliday show a better correlation, suggesting a monotonic relationship. This is expected because temporal features can increase or decrease traffic related to the time of day, month, weekends, and days of the week.

We can consider selecting two sets of relevant features from both metrics, and than take the union of these two sets. This will ensure that we select features that are relevant for both metrics.

threshold = 0.02
best_spearman = spearman_corr[spearman_corr['trip_duration'] >= threshold].index.tolist()
print("Best Spearman features for trip_duration correlation:\n", best_spearman)

best_mi = list(mi_column[mi_column['mutual_inf'] > threshold].index)
print("\n Best mutual information features for trip_duration:\n", best_mi)


mi_set = set(best_mi)
spearman_set = set(best_spearman)

union_mi_spearman  = mi_set.union(spearman_set)
print('\nUnion set:\n', union_mi_spearman)
intersec_mi_spearman  = mi_set.intersection(spearman_set)
print('\nIntersection set:\n', intersec_mi_spearman)

Best Spearman features for trip_duration correlation:
 ['geodesic_distance', 'center_geodesic_distances', 'dropoff_latitude', 'dropoff_center_lat', 
 'pickup_latitude', 'pickup_center_lat', 'dropoff_center_lon', 'dropoff_longitude', 'temperature', 
 'is_weekend', 'pickup_cluster', 'month', 'hour_of_day', 'day_of_week', 'is_holiday', 'pickup_center_lon', 
 'pickup_longitude', 'is_rush_hour', 'passenger_count', 'bearing', 'store_and_fwd_flag']

 Best mutual information features for trip_duration:
 ['geodesic_distance', 'center_geodesic_distances', 'pickup_cluster', 'pickup_center_lon', 
 'pickup_center_lat', 'dropoff_center_lat', 'dropoff_center_lon', 'dropoff_cluster', 'pickup_longitude', 
 'dropoff_latitude', 'pickup_latitude', 'dropoff_longitude', 'visibility', 
 'passenger_count', 'bearing', 'vendor_id']

Union set:
 {'is_holiday', 'dropoff_center_lon', 'dropoff_center_lat', 'passenger_count', 'store_and_fwd_flag', 
 'pickup_longitude', 'hour_of_day', 'day_of_week', 'center_geodesic_distances', 'pickup_cluster', 
 'pickup_latitude', 'dropoff_cluster', 'pickup_center_lat', 'dropoff_longitude', 'temperature', 
 'month', 'is_rush_hour', 'is_weekend', 'vendor_id', 'visibility', 'geodesic_distance', 
 'pickup_center_lon', 'bearing', 'dropoff_latitude'}

Intersection set:
 {'pickup_center_lat', 'dropoff_center_lon', 'dropoff_longitude', 'passenger_count', 'dropoff_center_lat', 
 'pickup_longitude', 'geodesic_distance', 'pickup_cluster', 'pickup_center_lon', 'center_geodesic_distances', 
 'bearing', 'dropoff_latitude', 'pickup_latitude'}

The intersection set has no feature related with climate feature. This confirm our previously assumption about the complex relationship between visibility and temperature with the target feature trip_duration. What it’s probably happening here is that mutual information can account for complex relation, but these features do not have a clear monotonic relationship necessary for a strong Spearman correlation. But we can still consider this features for our model, so we choose to use the union set of features to account both relevant feature in the two metrics.

def plot_corr_ellipses(data, figsize, **kwargs):
    """
    Plots a correlation matrix using ellipses to represent the correlations.

    Parameters:
    - data (pd.DataFrame): A 2D array or DataFrame containing the correlation matrix.
    - figsize: Tuple specifying the figure size.
    - kwargs: Additional keyword arguments for EllipseCollection.

    Returns:
    - A tuple containing the EllipseCollection object and the Axes object.

    """
    M = np.array(data)
    if not M.ndim == 2:
        raise ValueError('Data must be a 2D array.')

    # Mask the upper triangle of the matrix
    mask = np.triu(np.ones_like(M, dtype=bool), k=1)
    M[mask] = np.nan

    # Initialize the plot
    fig, ax = plt.subplots(1, 1, figsize=figsize, subplot_kw={'aspect': 'equal'})
    ax.set_xlim(-0.5, M.shape[1] - 0.5)
    ax.set_ylim(-0.5, M.shape[0] - 0.5)
    ax.invert_yaxis()
    ax.set_xticklabels([])
    ax.grid(False)

    # Determine xy locations of each ellipse center
    xy = np.indices(M.shape)[::-1].reshape(2, -1).T

    # Define ellipse properties
    w = np.ones_like(M).ravel() + 0.01  # Widths of ellipses
    h = 1 - np.abs(M).ravel() - 0.01   # Heights of ellipses
    a = 45 * np.sign(M).ravel()        # Rotation angles

    # Create and add the ellipse collection
    ec = EllipseCollection( widths=w, heights=h, angles=a, units='x', offsets=xy,
                            norm=Normalize(vmin=-1, vmax=1), transOffset=ax.transData, 
                            array=M.ravel(), **kwargs)
    ax.add_collection(ec)

    # Add a color bar for correlation values
    cb = fig.colorbar(ec, ax=ax, orientation='horizontal', fraction=0.047, pad=0.00)
    cb.ax.xaxis.set_ticks_position('bottom')
    cb.ax.xaxis.set_label_position('bottom')
    cb.ax.tick_params(top=False, labeltop=False)

    # Feature names on the diagonal
    if isinstance(data, pd.DataFrame):
        diagonal_positions = np.arange(M.shape[1])
        for i, label in enumerate(data.columns):
            ax.annotate(" -  " + label, (i - 0.4, i - 1), ha='left', va='bottom', rotation=90)
        ax.set_yticks(diagonal_positions)
        ax.set_yticklabels(data.index)

    # Hide the plot spines
    for spine in ax.spines.values():
        spine.set_visible(False)

    return ec, ax

Let’s plot a special kind of heatmap using ellipses to better see the correlation between each feature. We choose to use the spearman correlation. The orientation of the ellipse indicates whether two variables are positively correlated (ellipse is pointed to the top right) or negatively correlated (ellipse is pointed to the top left). The shading and width of the ellipse indicate the strength of the association: thinner and darker ellipses correspond to stronger relationships. We choose to use the union set of features to account both relevant feature in the two metrics:

df_train_union = df_train_large[list(union_mi_spearman)].copy()

correlation_matrix = df_train_union.corr(method='spearman')
# Use the function with the Spearman correlation matrix
ec, ax = plot_corr_ellipses(correlation_matrix, figsize=(10, 10), cmap='PuOr')
plt.tight_layout()
plt.show()

We can observe that some features are highly correlated, which may render certain features redundant for our model:

The geodesic_distance is highly correlated with center_geodesic_distances.
All the pickup and dropoff location features are highly correlated with the pickup and dropoff cluster features.

We have chosen to retain these features to evaluate their impact on the model’s performance.

5.3 Conclusion from EDA

The EDA revealed that Out of 39 features, with 28 having been crafted in the Feature Engineering section, a total of 24 were identified having some relevant correlation with the target feature or with another features. Let’s select the relevant features to train our machine learning models.

# Select relevant features from EDA and the target feature
relevant_features = list(union_mi_spearman) + ['trip_duration']

df_train_processed = df_train[relevant_features].copy()
df_val_processed = df_val[relevant_features].copy()
df_test_processed = df_test[relevant_features].copy()
df_train_large_processed = df_train_large[relevant_features].copy()

# Save the processed datasets with target feature
df_train_processed.to_csv('../data/processed/train.csv', index=False)
df_val_processed.to_csv('../data/processed/validation.csv', index=False)
df_test_processed.to_csv('../data/processed/test.csv', index=False)
df_train_large_processed.to_csv('../data/processed/train_large.csv', index=False)

We could see that the features geodesic_distance and center_geodesic_distances have a strong and a positive correlation, yet they do not have linear correlation. This align with our conclusion that longer distances generally require longer travel times, but the relationship is not linear due to factors like traffic conditions. Also, certain direction was preferred for shorter trips, probably accounting trips within Manhattan borough. This could be seen for the bearing feature, which shows a complex correlation with trip_duration. Those feature can be expected to be important for our model.

For Climate feature, most show a very weak correlation with trip_duration, except for visibility and temperature. Only this two features having a better correlation was contraintuitive, because was expected that more climate features could have a more strong correlation with trip_duration. Probably, because we only have records from the first 6 months of 2016, the impact of some climate features could not be fully captured, resulting in a weak correlation.

For temporal features, we could see that the hour_of_day, month, is_weekend, day_of_week, is_holliday show a better correlation with trip_duration, suggesting a monotonic relationship. This is expected because temporal features can increase or decrease traffic related to the time.

For pickup and dropoff locations features, we could see that the majority of the features show a good correlation with trip_duration. This is expected because the location features can indicate for our model traffic patterns related to the location.

6. Model Training and Validation

In this project, we choose to employ the XGBoost and Decision Tree Regressor models to train and predict the taxi_trip_duration. These models can capture complex relationships between the features and the target feature and they are robust against outliers. Even with our outliers handling, this kind of dataset can still have outliers due to unusual traffic conditions or other anomalies that we could not detect. In this section, We will use the df_train dataset to train the models and the df_val dataset to validate the predictions.

# Load the processed datasets for train, validation
df_train = pd.read_csv('../data/processed/train.csv')
df_val = pd.read_csv('../data/processed/validation.csv')

# Save Target feature
Y_train = df_train['trip_duration'].values
Y_val = df_val['trip_duration'].values
Y_train_log = np.log1p(Y_train)
Y_val_log = np.log1p(Y_val)

# Drop target feature
df_train.drop('trip_duration', axis=1, inplace=True)
df_val.drop('trip_duration', axis=1, inplace=True)
X_train = df_train.values
X_val = df_val.values

# Decision Tree Regressor
param_grid = {
    'max_depth': [ 2, 5, 10, None],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 5, 6]
}
dtr = DecisionTreeRegressor(random_state=random_state)

# Grid Search using RMSE as scoring metric
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)), 
                            greater_is_better=False)
grid_search = GridSearchCV(estimator=dtr, param_grid=param_grid, scoring=rmse_scorer, cv= 5, n_jobs= 8)
grid_search.fit(X_train, Y_train_log)

# Best parameters and score
best_params = grid_search.best_params_
best_score = np.sqrt(-grid_search.best_score_)  

print("Best parameters found: ", best_params)
print("Best RMSE score from Grid Search: ", best_score)

Y_pred = grid_search.best_estimator_.predict(X_val)
rmse_val = np.sqrt(mean_squared_error(Y_val_log, Y_pred))

Y_pred = grid_search.best_estimator_.predict(X_train)
rmse_train = np.sqrt(mean_squared_error(Y_train_log, Y_pred))

print(f'\n Train-rmse:{rmse_train} \t\t Validation-rmse:{rmse_val}')

Best parameters found:  {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5}
Best RMSE score from Grid Search:  0.6045244076485581

 Train-rmse:0.35810464512423246 		 Validation-rmse:0.3645839928468323

dmatrix_train = xgb.DMatrix(X_train, label = Y_train_log)
dmatrix_val  = xgb.DMatrix(X_val , label = Y_val_log)
watchlist = [(dmatrix_train, 'train'), (dmatrix_val, 'validation')]


parms = {   'max_depth':8,
            'objective':'reg:squarederror',
            'eta'      :0.3,
            'subsample':0.7,
            'lambda'  :4,  
            'colsample_bytree':0.9,
            'colsample_bylevel':1,
            'min_child_weight': 10,
            'nthread'  :8,
            'eval_metric': "rmse",
            'seed': random_state}

evals_test_val = {}
xgb_model = xgb.train(  parms, dmatrix_train, num_boost_round=1000, evals = watchlist, 
                        early_stopping_rounds=30, maximize=False, verbose_eval=50, 
                        evals_result = evals_test_val)

print(f'\n Score = {xgb_model.best_score:1.5f}\n n_boost_round = {xgb_model.best_iteration:d}')

[0]	train-rmse:0.56721	validation-rmse:0.56764
[50]	train-rmse:0.29455	validation-rmse:0.30990
[100]	train-rmse:0.27962	validation-rmse:0.30561
[150]	train-rmse:0.26882	validation-rmse:0.30426
[200]	train-rmse:0.26045	validation-rmse:0.30408
[246]	train-rmse:0.25376	validation-rmse:0.30415

 Score = 0.30402
 n_boost_round = 217

7. Model Evaluation on Test Set

The XGBoost model outperforms the Decision Tree Regressor in both the training and validation datasets. To verify that our model truly has predictive power and isn’t just overfitting, we’ll assess it against the df_test set. Additionally, we’ll compare its performance with the df_train_large set for a more comprehensive evaluation. Testing on the df_train_large set, helps confirm that our model’s effectiveness isn’t limited to the smaller df_train dataset. This dual approach of testing against the unseen data (df_test) and a larger training set (df_train_large) can confirm the predictive power of the model.

# Load the processed datasets for train large and validation
df_train_large = pd.read_csv('../data/processed/train_large.csv')
df_test = pd.read_csv('../data/processed/test.csv')

# Save Target feature
Y_train_large = df_train_large['trip_duration'].values
Y_test = df_test['trip_duration'].values
Y_train_large_log = np.log1p(Y_train_large)
Y_test_log = np.log1p(Y_test)

# Drop target feature
df_train_large.drop('trip_duration', axis=1, inplace=True)
df_test.drop('trip_duration', axis=1, inplace=True)
X_train_large = df_train_large.values
X_test = df_test.values

dmatrix_train_large = xgb.DMatrix(X_train_large, label = Y_train_large_log)
dmatrix_test  = xgb.DMatrix(X_test   , label = Y_test_log)
watchlist = [(dmatrix_train_large, 'train_large'), (dmatrix_test, 'test')]

parms = {   'max_depth':8, 
            'objective':'reg:squarederror',
            'eta'      :0.3,
            'subsample':0.7,
            'lambda'  :4,  
            'colsample_bytree':0.9,
            'colsample_bylevel':1,
            'min_child_weight': 10,
            'nthread'  :8,
            'eval_metric': "rmse",
            'seed': random_state}

evals_test_train= {}
xgb_model = xgb.train(    parms, dmatrix_train_large, num_boost_round=1000, evals = watchlist,
                            early_stopping_rounds=30, maximize=False, verbose_eval=50, 
                            evals_result = evals_test_train)

print(f'\n Score = {xgb_model.best_score:1.5f}\n n_boost_round = {xgb_model.best_iteration:d}')

[0]	train_large-rmse:0.56734	test-rmse:0.56765
[50]	train_large-rmse:0.29908	test-rmse:0.30796
[100]	train_large-rmse:0.28683	test-rmse:0.30243
[150]	train_large-rmse:0.27813	test-rmse:0.30017
[200]	train_large-rmse:0.27132	test-rmse:0.29918
[250]	train_large-rmse:0.26580	test-rmse:0.29838
[300]	train_large-rmse:0.26079	test-rmse:0.29799
[350]	train_large-rmse:0.25593	test-rmse:0.29770
[400]	train_large-rmse:0.25169	test-rmse:0.29765
[412]	train_large-rmse:0.25070	test-rmse:0.29760

 Score = 0.29759
 n_boost_round = 382

8. Conclusion

rmse_train = evals_test_val['train']['rmse']
rmse_val = evals_test_val['validation']['rmse']

rmse_train_large = evals_test_train['train_large']['rmse']
rmse_test = evals_test_train['test']['rmse']

# Calculating the number of epochs for all datasets
epochs_train = range(len(rmse_train))
epochs_val = range(len(rmse_val))
epochs_train_large = range(len(rmse_train_large))
epochs_test = range(len(rmse_test))

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plotting Train and Validation RMSE
axes[0].plot(epochs_train, rmse_train, label='Train', color = orange, linewidth=3)
axes[0].plot(epochs_val, rmse_val, label='Validation', linestyle='--', color = yellow, linewidth=3)
axes[0].set_xlabel('Epochs')
axes[0].set_ylabel('RMSE')
axes[0].set_title('Train vs Validation')
axes[0].grid()
axes[0].legend()

# Plotting Train Large and Test RMSE
axes[1].plot(epochs_train_large, rmse_train_large, label='Train Large', color = orange, linewidth=3)
axes[1].plot(epochs_test, rmse_test, label='Test', linestyle='--', color = yellow, linewidth=3)
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('RMSE')
axes[1].set_title('Train Large vs Test')
axes[1].grid()
axes[1].legend()

plt.tight_layout()
plt.show()

The plots show the RMSE score over the number of epochs for the df_train and df_val on the left, and for the df_train_large and df_test on the right. There’s a small gap of approximately 0.05 between the training and validation/test RMSEs, which is typical as models tend to perform better on the data they have seen (training data). Both validation/test sets reach a plateau around 0.3 RMSE, which is expected with data the model has not seen before.

Overall, the model performs well on the test/validation set with a RMSE of approximately 0.30. The plots suggest that the model has predictive power and is learning without showing signs of overfitting or data leakage.

feature_importances = xgb_model.get_score(importance_type='gain')

# Map indices to names
feature_importances_named = {relevant_features[int(k[1:])]: v for k, v in feature_importances.items()}
df_feature_importance = pd.DataFrame(list(feature_importances_named.items()), 
                        columns=['features', 'importance']).sort_values(by='importance', ascending=False)


fig, ax = plt.subplots(1, 1, figsize=(15, 8))
plot_feature_importance(df_feature_importance, x='features', y='importance', ax=ax, threshold= 4, 
                        title="XGBoost Feature Importance", xlabel=" ", 
                        ylabel="Information Gain", palette=[yellow, orange])

plt.tight_layout(pad=5.0)
plt.show()

The feature importance using Information Gain from the XGBoost model validates our EDA conclusions. The geodesic_distance feature has the highest information gain as we would expect. On the other hand, the center_geodesic_distance has very low information gain, as we previously suggested due to the redundant information. When features are highly correlated, tree-based models favor one over the other during the training process.

The hour_of_day, is_weekend, is_holiday, and day_of_week indicate that temporal features are important for our model. These features probably can indicate an increase or decrease in traffic related to the time. For example, is_weekend can indicate to our model that traffic patterns for weekends can be quite different from weekdays, having more traffic at different hours of the day, as we observed in the Feature Visualization section.

The dropoff_center_lat among other dropoff and pickup features was chosen by the model as the most important feature. This was also expected given the high correlation between the pickup and dropoff location and the pickup and dropoff for center of the clusters.

Overall, the XGBoost feature importance plot seems to validate our EDA findings, which is a good sign that the model is learning and making predictions based on the data. This is also a good sign that the model is not overfitting or presenting any indications of data leakage.

# Prediction  on test set the trained model
Y_test_pred_log = xgb_model.predict(dmatrix_test)


fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

# distributions
sns.kdeplot(Y_test_pred_log, ax=axes[0], color=yellow, edgecolor= grey ,alpha=0.4, fill=True)
sns.kdeplot(Y_test_log, ax=axes[0], color=orange, edgecolor= grey, alpha=0.4, fill=True)

axes[0].grid()
axes[0].set_title('Trip Duration Distribution')
axes[0].legend(['Predicted on Test', 'Actual on Test'])

# CDFs
sns.ecdfplot(Y_test_pred_log, ax=axes[1], label="Predict on Test", color=yellow, alpha=1, linewidth=2)
sns.ecdfplot(Y_test_log, ax=axes[1], label="Actual on Test", color=orange, alpha=1, linewidth=2)

axes[1].grid()
axes[1].set_title('Trip Duration Cumulative Distribution')
axes[1].legend()


plt.tight_layout(pad=5.0)
plt.show()

Finally, for a better visualization of how well our model is making predictions, we plot the distribution and the cumulative distribution of trip duration for the actual values in Y_test and the predicted values from the dataset df_test. Note that both plots consider the log1p of trip_duration. We can conclude that our model is making very good predictions, with an RMSE of 0.30, and a good distribution of the predicted values.

9. Deployment

# Training logic inside the script train_model.py
xgb_model = train_xgboost('../data/processed/train_large.csv', 'trip_duration')

# Save the model
xgb_model.save_model('../models/xgb_taxi_trip.json')

9.1 Testing Local and Web Service

df_test = pd.read_csv('../data/processed/test.csv').drop('trip_duration', axis=1)

url_local = "http://127.0.0.1:9696/predict"
# Sample one trip to make prediction
trip = df_test.sample(n=1).to_dict(orient='records')[0]
# Make a POST request to predict trip duration
requests.post(url_local, json = trip).json()

{'Trip Duration (hours:minutes:seconds)': '1:1:18'}

# Render Clound URL
url_clound = "https://taxi-trip-predict.onrender.com/predict"
trip = df_test.sample(n=1).to_dict(orient='records')[0]
requests.post(url_clound, json = trip).json()

{'Trip Duration (hours:minutes:seconds)': '0:20:16'}

Marcos Benício