
Analysis of Taxi Trip Duration in NYC

22 Jan 2024 - Tags: Regression, XGBoost, Decision Tree, and K-means clustering

import os, sys
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
# 0 = all messages are logged (default behavior)
# 1 = INFO messages are not printed
# 2 = INFO and WARNING messages are not printed
# 3 = INFO, WARNING, and ERROR messages are not printed

# Ensure local Python modules are found
sys.path.insert(0, '../src')
from models.train_model import train_xgboost

# Data visualization 
from matplotlib import pyplot as plt
from matplotlib.collections import EllipseCollection
from matplotlib.colors import  Normalize
import seaborn as sns

# Data manipulation 
import pandas as pd
import numpy as np

# Geospatial 
from geopy import distance
import geopandas as gpd

# Machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import  DecisionTreeRegressor
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mean_squared_error, make_scorer
import xgboost as xgb


import requests

# Random state 
random_state = 42
np.random.seed(random_state)

# Palette color for plots
yellow ='#dfdc7bff'
grey = '#50566dff'
beige = '#d9cebfff'
orange = '#e5b97bff'

Outline

In this project, we will build a machine learning model to predict the trip duration of taxi rides in New York City. The dataset is from Kaggle. The original dataset has approximately 1.4 million entries for the training set and 630k for the test set, although only the training set is utilized in this project. The dataset is then preprocessed, outliers are removed, and new features are created, followed by splitting it into train, test, and validation sets.

The goal is to predict the trip duration of taxi rides in New York City. The evaluation metric is the Root Mean Squared Logarithmic Error (RMSLE), which is the RMSE computed on the logarithms of the predictions and the actual values. Because it works on the log scale, it measures relative rather than absolute error, so a given absolute error on a long trip is penalized less than the same error on a short trip. For this purpose, we employ three machine learning models: mini-batch k-means to create new cluster features from the locations, and a Decision Tree Regressor and XGBoost to predict the trip duration.
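For reference, a minimal sketch of how the RMSLE can be computed with NumPy; the helper name rmsle and the toy numbers are illustrative, not part of the project code:

import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error; log1p also handles zero durations."""
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return np.sqrt(np.mean(np.square(log_diff)))

# The same 600 s absolute error weighs more on a short trip than on a long one
print(rmsle(np.array([300.0]), np.array([900.0])))     # ~1.10
print(rmsle(np.array([3600.0]), np.array([4200.0])))   # ~0.15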

Initially, the dataset has just 11 columns, leaving plenty of room for developing new features and visualizations: id, vendor_id, pickup_datetime, dropoff_datetime, passenger_count, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, store_and_fwd_flag, and the target trip_duration.

1. Data Preprocessing

taxi_trip = pd.read_csv('../data/raw/nyc-yellow-taxi-trip-2016.csv',
                parse_dates=['pickup_datetime','dropoff_datetime'])

display(taxi_trip.head(3))

print("Shape:", taxi_trip.shape)
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude
0 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1 -73.982155
1 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1 -73.980415
2 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1 -73.979027
pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
0 40.767937 -73.964630 40.765602 N 455
1 40.738564 -73.999481 40.731152 N 663
2 40.763939 -74.005333 40.710087 N 2124
Shape: (1458644, 11)

Let’s check the dataset information and see if there are any missing values or duplicated rows:

print(taxi_trip.info())
print("\n Duplicated rows:", taxi_trip.duplicated().sum())
print("\n Missing values:", taxi_trip.isnull().sum().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
 #   Column              Non-Null Count    Dtype         
---  ------              --------------    -----         
 0   id                  1458644 non-null  object        
 1   vendor_id           1458644 non-null  int64         
 2   pickup_datetime     1458644 non-null  datetime64[ns]
 3   dropoff_datetime    1458644 non-null  datetime64[ns]
 4   passenger_count     1458644 non-null  int64         
 5   pickup_longitude    1458644 non-null  float64       
 6   pickup_latitude     1458644 non-null  float64       
 7   dropoff_longitude   1458644 non-null  float64       
 8   dropoff_latitude    1458644 non-null  float64       
 9   store_and_fwd_flag  1458644 non-null  object        
 10  trip_duration       1458644 non-null  int64         
dtypes: datetime64[ns](2), float64(4), int64(3), object(2)
memory usage: 122.4+ MB
None

 Duplicated rows: 0

 Missing values: 0

There are no missing values or duplicated rows. The only feature we need to change is store_and_fwd_flag, which is a categorical feature with values $\{N, Y\}$. We will convert it to a boolean, mapping Y to True and N to False:

# Map store_and_fwd_flag to True and False 
taxi_trip['store_and_fwd_flag'] = taxi_trip['store_and_fwd_flag'].map({'N': False, 'Y': True})

Let’s verify if the target feature trip_duration is correct by calculating the time difference between dropoff_datetime and pickup_datetime in seconds:

trip_duration = (taxi_trip['dropoff_datetime'] - taxi_trip['pickup_datetime']).dt.seconds

is_diff = (trip_duration != taxi_trip['trip_duration'])
print( 'Number of incorrect values in trip_duration:',is_diff.sum())

display(taxi_trip[is_diff])
Number of incorrect values in trip_duration: 4
id vendor_id pickup_datetime dropoff_datetime passenger_count pickup_longitude
355003 id1864733 1 2016-01-05 00:19:42 2016-01-27 11:08:38 1 -73.789650
680594 id0369307 1 2016-02-13 22:38:00 2016-03-08 15:57:38 2 -73.921677
924150 id1325766 1 2016-01-05 06:14:15 2016-01-31 01:01:07 1 -73.983788
978383 id0053347 1 2016-02-13 22:46:52 2016-03-25 18:18:14 1 -73.783905
pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
355003 40.643559 -73.956810 40.773087 False 1939736
680594 40.735252 -73.984749 40.759979 False 2049578
924150 40.742325 -73.985489 40.727676 False 2227612
978383 40.648632 -73.978271 40.750202 False 3526282

It appears that there are four extreme values in the target feature trip_duration that don’t make sense. We can correct them by substituting the actual trip duration in seconds that we just calculated. Let’s simply replace the column with the computed values:

taxi_trip['trip_duration'] = trip_duration

# Let's check again
is_diff = (trip_duration != taxi_trip['trip_duration'])
print( 'Number of incorrect values in trip_duration:',is_diff.sum())
Number of incorrect values in trip_duration: 0

2. Outliers

To identify outliers in the dataset, we will employ geospatial visualization for the pickup and dropoff locations and the interquartile range (IQR) with a box plot to detect outliers in the trip_duration. First, let’s examine all features with data types float64 and int64 in the dataset:

taxi_trip_filtered = taxi_trip.select_dtypes(include=['float64', 'int64'])
display(taxi_trip_filtered.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 6 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   vendor_id          1458644 non-null  int64  
 1   passenger_count    1458644 non-null  int64  
 2   pickup_longitude   1458644 non-null  float64
 3   pickup_latitude    1458644 non-null  float64
 4   dropoff_longitude  1458644 non-null  float64
 5   dropoff_latitude   1458644 non-null  float64
dtypes: float64(4), int64(2)
memory usage: 66.8 MB



None

2.1. Location Features

When looking for outliers in the dataset, we should focus on the numerical features. To find out if there are any outliers, we can analyze the features pickup_longitude, pickup_latitude, dropoff_longitude, and dropoff_latitude:

trip_outliers = taxi_trip.copy()

# can be useful to separate the features
pickup_dropoff_features = [ 'pickup_longitude', 'pickup_latitude', 
                            'dropoff_longitude', 'dropoff_latitude' ]

pickup_long = trip_outliers['pickup_longitude']
pickup_lat = trip_outliers['pickup_latitude']
dropoff_long = trip_outliers['dropoff_longitude']
dropoff_lat = trip_outliers['dropoff_latitude']

# Summary statistics
display(    trip_outliers.groupby('passenger_count')[pickup_dropoff_features[:2]]
            .agg(['min', 'mean', 'max']))

display(    trip_outliers.groupby('passenger_count')[pickup_dropoff_features[2:]]
            .agg(['min', 'mean', 'max']))
pickup_longitude pickup_latitude
min mean max min mean max
passenger_count
0 -74.042969 -73.957256 -73.776367 40.598701 40.736596 40.871922
1 -79.569733 -73.973551 -65.897385 36.118538 40.751209 51.881084
2 -121.933342 -73.973343 -61.335529 34.359695 40.749811 41.319164
3 -75.455917 -73.974312 -73.722374 39.803932 40.750224 41.189957
4 -75.354332 -73.973836 -73.559799 34.712234 40.749328 40.945747
5 -74.191154 -73.972432 -71.799896 35.081532 40.750801 40.925323
6 -74.145752 -73.973227 -73.653358 40.213837 40.751611 40.896381
7 -74.173668 -73.948100 -73.631149 40.715031 40.740285 40.768551
8 -73.992653 -73.992653 -73.992653 40.768719 40.768719 40.768719
9 -73.710632 -73.710632 -73.710632 40.671581 40.671581 40.671581
dropoff_longitude dropoff_latitude
min mean max min mean max
passenger_count
0 -74.188072 -73.963539 -73.776360 40.598701 40.735779 40.872318
1 -80.355431 -73.973371 -65.897385 36.118538 40.751977 43.911762
2 -121.933304 -73.973464 -61.335529 34.359695 40.751070 43.921028
3 -74.550011 -73.974016 -73.055977 40.435646 40.751601 41.387001
4 -74.569748 -73.974422 -73.170937 32.181141 40.750786 41.029831
5 -79.352837 -73.973206 -72.954590 40.436329 40.751590 41.281796
6 -74.220955 -73.973192 -73.524658 40.553467 40.752375 40.982738
7 -74.173660 -73.948097 -73.631149 40.715019 40.740289 40.768551
8 -74.041374 -74.041374 -74.041374 40.729954 40.729954 40.729954
9 -73.710632 -73.710632 -73.710632 40.671581 40.671581 40.671581

We can observe that there are unrealistic values for passenger_count. The expected values for passenger_count are between 1 and 6, but there are atypical counts of 0 and values greater than 6. Let’s remove those outliers:

# Boolean series for passenger count outliers
passenger_outliers = (trip_outliers['passenger_count'] == 0) | (trip_outliers['passenger_count'] > 6)

not_outliers = ~passenger_outliers
trip_outliers = trip_outliers[not_outliers]


print(f'\nOutliers in passenger_count: {passenger_outliers.sum()}')

Outliers in passenger_count: 65

The boxplot for detecting outliers relies on the assumption that the data has a distribution close to normal (Gaussian), which is often not true with geospatial coordinates due to their unique patterns influenced by urban roads and traffic. Therefore, we will use a different approach to identify outliers in the location features.

For the pickup and dropoff locations, the strategy is to use the shapefile of the NYC Borough Boundaries and flag points that fall outside the boundaries of the NYC map as outliers.

def plot_points_map(df, gdf_map, geometry_col, ax, title='', within_map=True, 
                    color_map='beige', color_points='orange'):
    """
    Plots a map with points from a DataFrame onto a given Axes object.

    Parameters:
    - df: DataFrame with the coordinates to plot.
    - gdf_map: GeoDataFrame of the map boundaries.
    - geometry_col: GeoDataFrame column with geometry data.
    - ax: Matplotlib Axes object to plot on.
    - title: Title of the plot.
    - within_map: If True, only plots points within the map boundaries.
    - color_map: Color for the map.
    - color_points: Color for the points.
    """
    # Create a GeoDataFrame from the points
    gdf_points = gpd.GeoDataFrame(df, geometry=geometry_col)
    # Convert to the same coordinate reference system as the map
    gdf_points.crs = gdf_map.crs
    
    # If within_map is True, perform spatial join with the map boundaries
    # only points within the boundaries will be plotted
    if within_map:
        gdf_points = gpd.sjoin(gdf_points, gdf_map, how="inner", predicate='within')
        # Update the original DataFrame 
        df = df.loc[gdf_points.index]
        
    # Plot the map and the points
    gdf_map.plot(ax = ax, color=color_map, edgecolor='grey')
    gdf_points.plot(ax =ax, markersize=1, color = color_points)
    ax.set_xlabel('Longitude', fontsize=14)
    ax.set_ylabel('Latitude', fontsize=14)
    ax.set_title(title, fontsize=20)

    return df
nybb_path = '../data/external/nyc-borough-boundaries/geo_export_e13eede4-6de2-4ed8-98a0-58290fd6b0fa.shp'
nyc_boundary = gpd.read_file(nybb_path)

# Creating the geometry for points
geometry = gpd.points_from_xy(trip_outliers['pickup_longitude'], trip_outliers['pickup_latitude'])

# Creating the plot
fig, ax = plt.subplots(figsize=(12, 12))
pickup_within_nyc = plot_points_map(    df = trip_outliers, gdf_map = nyc_boundary, 
                                        geometry_col = geometry, ax = ax, title =  'Pickup Locations',
                                        color_map= beige, color_points= orange  ) 

# Showing the plot
plt.show()

print(f'Number of outliers removed: {len(trip_outliers) - len(pickup_within_nyc)}')
Number of outliers removed: 1216
geometry = gpd.points_from_xy(pickup_within_nyc['dropoff_longitude'], 
                                pickup_within_nyc['dropoff_latitude'])

fig, ax = plt.subplots(figsize=(12, 12))
pickup_dropoff_within_nyc = plot_points_map(    df = pickup_within_nyc, gdf_map = nyc_boundary, 
                                                geometry_col = geometry, ax = ax, title =  'Dropoff Locations',
                                                color_map= beige, color_points= orange  )
plt.show()

print(f'Number of outliers removed:{len(pickup_within_nyc) - len(pickup_dropoff_within_nyc)}' )
Number of outliers removed:5359

2.2. Target Feature (Trip Duration)

# Copy the dataset with removed outliers from locations
trip_outliers = pickup_dropoff_within_nyc.copy()

The interquartile range (IQR)

The IQR gives a sense of how spread out the values in a dataset are and is a robust measure of dispersion that is not influenced by outliers.

A quartile divides data into four equal parts, each comprising 25% of the data. $Q_1$ represents the 25th percentile of the data, meaning 25% of data points are less than or equal to this value. $Q_3$ represents the 75th percentile, implying that 75% of data points are less than or equal to this value.
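As a quick worked example (a toy series, not part of the project data), the quartiles, IQR, and whisker limits can be computed directly with pandas:

import pandas as pd

toy = pd.Series([2, 4, 4, 5, 7, 9, 11, 30])    # 30 is an obvious outlier
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
# Tukey's rule: values beyond 1.5 * IQR from the quartiles are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, q3, iqr, lower, upper)                     # 4.0 9.5 5.5 -4.25 17.75
print(toy[(toy < lower) | (toy > upper)].tolist())   # [30]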

Let’s calculate the IQR for the target feature trip_duration and plot a box plot to visualize the outliers:

def boxplot_stats(series, whis = 1.5):
    """
    Calculate the statistics for a box plot.

    Returns:
    dict: A dictionary containing the quartiles, IQR, and whisker limits.
    """
    Q1 = series.quantile(0.25)
    Q2 = series.quantile(0.50)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    lower_whisker = Q1 - whis * IQR
    upper_whisker = Q3 + whis * IQR

    return {
        'Q1': Q1,
        'Q2': Q2,
        'Q3': Q3,
        'IQR': IQR,
        'lower_whis': lower_whisker,
        'upper_whis': upper_whisker
    }

def plot_distribution_boxplot(  series, ax1, ax2, title='', label='', log1p=True, 
                                draw_quartiles=True, kde=True):
    """
    Plot the distribution and boxplot of a series on given axes.

    Args:
    - series (pandas.Series): The series to plot.
    - ax1 (matplotlib.axes.Axes): The axes for the histogram.
    - ax2 (matplotlib.axes.Axes): The axes for the boxplot.
    - title (str): The title of the plot.
    - label (str): The label for the x-axis.
    - log1p (bool): If True, applies log1p transformation to the series.
    - draw_quartiles (bool): If True, draws quartile lines on the histogram.
    - kde (bool): If True, plots a KDE over the histogram.
    """
    if log1p:
        series = np.log1p(series)
    stats = boxplot_stats(series)

    sns.histplot(   series, bins=40, linewidth=0.5, color='#dfdc7bff', alpha=0.2,
                    ax=ax1, kde=kde, line_kws={'lw': 3})
    ax1.set_title(f'{title} Histogram', fontsize=15)
    ax1.set_xlabel(label, fontsize=14)
    ax1.set_ylabel('Count', fontsize=14)

    sns.boxplot(data=series, color='#dfdc7bff', ax=ax2,
                fliersize=3, flierprops={'color': '#50566dff', 'markeredgecolor': '#50566dff'})
    
    ax2.set_title(f'{title} Boxplot', fontsize=15)
    ax2.set_ylabel(label, fontsize=14)

    if draw_quartiles:
        quartiles = [stats['Q1'], stats['Q3'], stats['lower_whis'], stats['upper_whis']]
        for line in quartiles:
            ax1.axvline(line, color='#50566dff', linestyle='--', alpha=1, lw=2)
            y_center = ax1.get_ylim()[1] / 2
            ax1.text(   line, y_center, f'{line:.2f}',
                        fontsize=18, color='black', va='center', ha='right', rotation=90)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))
plot_distribution_boxplot(  trip_outliers['trip_duration'], ax1, ax2,
                            title='Trip Duration', label='Log of Trip Duration (s)')

plt.tight_layout()
plt.show()

Because of the high values and zeros in the target feature trip_duration, we apply the logarithmic transformation $\log(x+1)$, which reduces the effect of outliers and the overall variance in the dataset. The histogram, shown in the first plot, resembles a bell curve after this transformation. From the boxplot, we can see the following:

Data points outside the whiskers are potential outliers that need further validation. For better intuition, we convert these whisker values from the log scale back to the original scale in terms of hours ($\text{Hours} = \text{Seconds}/3600$ ):

$$\frac{e^{4.50} - 1}{3600} \approx 0.024 ~~\text{hours} \qquad\qquad \frac{e^{8.46} - 1}{3600} \approx 1.32 ~~\text{hours}$$
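A quick numerical check of this conversion, assuming the rounded whisker values 4.50 and 8.46 read from the boxplot annotations:

import numpy as np

# np.expm1 inverts the log1p transform applied to trip_duration
whiskers_log = np.array([4.50, 8.46])
print(np.expm1(whiskers_log) / 3600)   # ~[0.025, 1.31] hours; small differences come from the rounded whiskers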

The lower whisker corresponds to approximately 1.4 minutes, which is very short for a taxi trip, suggesting that these could be errors or special cases. The upper whisker corresponds to a more plausible trip duration of about 1.32 hours, but the boxplot still shows highly unrealistic values of more than 20 hours.

def count_outliers(series, whis = 1.5):
        """
        Count the number of upper and lower outliers in the series and print their percentages.

        Args:
        series (pd.Series): Series for which to count outliers.

        Returns:
        (pd.Series, pd.Series): Two boolean series, one for upper outliers and one for lower outliers.
        """

        stats = boxplot_stats(series, whis)


        upper_outliers = (series > stats['upper_whis'])
        lower_outliers = (series < stats['lower_whis'])

        # Percentage of outliers 
        percentage_upper = upper_outliers.sum() / len(series) * 100
        percentage_lower = lower_outliers.sum() / len(series) * 100

        print(  f'\nPotential upper outliers: {upper_outliers.sum()} '
                f'({percentage_upper:.2f}% of the dataset)')

        print(  f'\nPotential lower outliers: {lower_outliers.sum()} '
                f'({percentage_lower:.2f}% of the dataset)')

        return upper_outliers, lower_outliers
log_trip_duration = np.log1p(trip_outliers['trip_duration'])
upper_duration_outliers, lower_duration_outliers = count_outliers(log_trip_duration)
Potential upper outliers: 4292 (0.30% of the dataset)

Potential lower outliers: 14461 (1.00% of the dataset)

Let’s compare the trip durations with some distance measures from the pickup location to the dropoff location. In an article on New York City taxi trip duration prediction using MLP and XGBoost, several distance measures were considered, such as the Manhattan, haversine, and bearing distances. Unlike the straight-line Euclidean distance, the Manhattan distance sums the absolute differences between the horizontal and vertical coordinates, which resembles navigating an actual grid of city streets. However, it doesn’t account for the Earth’s curvature. To address this, the article also considers the haversine distance, which is more appropriate for large distances because it does account for the Earth’s curvature. The bearing, in turn, gives the direction from one point to the other.
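For reference, a minimal sketch of the haversine distance mentioned above; this helper is illustrative only, while the project itself uses the geodesic distance from geopy, as shown below:

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, earth_radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

# Example: the pickup and dropoff of the first trip shown earlier
print(haversine_km(40.767937, -73.982155, 40.765602, -73.964630))   # ~1.5 km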

I will just adopt a simplified approach that still accounts for the Earth’s curvature. The geodesic distance is the shortest path between two points on the surface of a curved object, in our case the Earth’s surface. To compute it, we use the geopy library:

# Function to calculate the geodesic distance on an ellipsoid
def geodesic(start_lat, start_long, end_lat, end_long):
    """
    Calculate geodesic distances between pickup and dropoff locations.

    Args:
    - start_lat, start_long, end_lat, end_long: Lists or arrays of coordinates 
    for the start point and end point.

    Returns:
    List of distances in meters for each pair of start and end points.
    """
    coordinates = zip(start_lat, start_long, end_lat, end_long)
    distances = [distance.distance((lat1, lon1), (lat2, lon2)).meters for lat1, lon1, lat2, lon2 in coordinates]
    return distances

# Calculate the geodesic distance
geodesic_distance = geodesic(   trip_outliers['pickup_latitude'].to_numpy(), 
                                trip_outliers['pickup_longitude'].to_numpy(), 
                                trip_outliers['dropoff_latitude'].to_numpy(), 
                                trip_outliers['dropoff_longitude'].to_numpy())

Potential Lower Outliers

Approximately 1% of all trips have durations less than or equal to 1.4 minutes. Based on the table below, it appears that some of these trips may have been canceled, as indicated by a zero or minimal geodesic distance. Because our analysis aims to predict the trip duration, we will remove these potentially canceled trips from the dataset.


trip_outlier_geodesics = trip_outliers.copy()
trip_outlier_geodesics['geodesic_distance'] = geodesic_distance

display(    trip_outlier_geodesics[lower_duration_outliers]
            .groupby('passenger_count')[['trip_duration', 'geodesic_distance']]
            .agg(['min', 'mean','max',  'count'])   )
trip_duration geodesic_distance
min mean max count min mean max count
passenger_count
1 1 47.663972 88 11258 0.0 272.870639 19983.682930 11258
2 1 51.322023 88 1562 0.0 299.515846 1823.852419 1562
3 2 53.269321 88 427 0.0 315.255005 1491.261959 427
4 2 53.846890 88 209 0.0 302.298194 1665.249710 209
5 2 55.686591 88 619 0.0 347.513153 1509.483335 619
6 2 57.704663 88 386 0.0 364.417960 1120.139579 386

Potential Upper Outliers

Approximately 0.30% of all trips have durations longer than 1.32 hours. In this case, a table alone does not tell us much about these potential outliers; therefore, we use a histogram to better visualize them.

From the histogram, it is apparent that around 1700 trips have durations exceeding 20 hours. In contrast, a larger cluster of over 2000 trips has durations ranging between 1.32 hours and 4 hours. Considering this distribution, only those with durations longer than 4 hours may be classified as actual outliers, as durations less than 4 hours are within a plausible range for taxi trips in NYC.

upper_outliers = trip_outliers[upper_duration_outliers]['trip_duration'].reset_index(drop=True)/3600

fig, axes = plt.subplots(1, 1, figsize= (10, 5))

sns.histplot(  upper_outliers, bins=20, linewidth=0.5, color=yellow, alpha=0.5)

axes.set_title(f'High Trip Duration Outliers', fontsize=15)
axes.set_xlabel('Trip Duration (hours)', fontsize=14)
axes.axvline(4, color=grey, linestyle='--', alpha=1, lw=2)
y_center = axes.get_ylim()[1] / 2
axes.text(4, y_center, f'{4:.2f} hours',
            fontsize=12, color='black', va='center', ha='right', rotation=90)
plt.tight_layout()
plt.show()

We also observed unrealistic values for geodesic_distance. These values can simply be deleted because they don’t represent real patterns for the trip duration. We’ll create boolean series for each of the identified outlier conditions:

# Percentage of lower duration outliers
per_lower_duration = lower_duration_outliers.sum() / len(trip_outliers) * 100
print(f'\nOutliers with trip duration less than 1.4 minutes: {lower_duration_outliers.sum()} '
      f'({per_lower_duration:.2f}% of the dataset)')


# Boolean series for trip durations over 4 hours
upper_duration_outliers = trip_outliers['trip_duration'] > 4 * 3600
per_upper_duration = upper_duration_outliers.sum() / len(trip_outliers) * 100

print(f'\nOutliers with trip duration greater than 4 hours: {upper_duration_outliers.sum()} '
      f'({per_upper_duration:.2f}% of the dataset)')


# Boolean series for geodesic distance == zero
geodesic_outliers_zeros = (trip_outlier_geodesics['geodesic_distance'] == 0)
# Percentage of outliers
per_geodesic = geodesic_outliers_zeros.sum() / len(taxi_trip) * 100

print(f'\nOutliers with zero geodesic distances: {geodesic_outliers_zeros.sum()} '
      f'({per_geodesic:.2f}% of the dataset)')

Outliers with trip duration less than 1.4 minutes: 14461 (1.00% of the dataset)

Outliers with trip duration greater than 4 hours: 2059 (0.14% of the dataset)

Outliers with zero geodesic distances: 5490 (0.38% of the dataset)

Average Geodesic Velocity

Before removing all the outliers we have identified, let’s investigate one more aspect of the trip duration. As direct observation may not be sufficient to identify all outliers, we can calculate an average velocity based on the geodesic distance to gain a better understanding of potential outliers.

# Calculate the average velocity in Km/h
geo_distance_km = trip_outlier_geodesics['geodesic_distance']/1000
trip_duration_hrs = trip_outlier_geodesics['trip_duration']/3600
velocity_geodesic_kmh = (geo_distance_km) / (trip_duration_hrs)

# For simplicity add velocity to the dataset
trip_outlier_geodesics['avg_velocity_kmh'] = velocity_geodesic_kmh


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))
plot_distribution_boxplot(  trip_outlier_geodesics['avg_velocity_kmh'], ax1, ax2, 
                            'Average Geodesic Velocity', 
                            'Log of Average Velocity (Km/h)'    )
plt.tight_layout()
plt.show()


log_velocity_geodesic = np.log1p(velocity_geodesic_kmh)
upper_vel_outliers, lower_vel_outliers = count_outliers(log_velocity_geodesic)
Potential upper outliers: 4950 (0.34% of the dataset)
    
Potential lower outliers: 23999 (1.65% of the dataset)

The boxplot whiskers give the lower and upper thresholds for potential velocity outliers; converted back from the log scale, the upper threshold corresponds to roughly 46.5 km/h.

New York City is a metropolis known for its traffic congestion. The average speed of a taxi in New York City with passengers on board is 20.3 km/h, as reported in the article “Taxi driver speeding”:

“… The average occupancy speed of taxi drivers in Shanghai (21.3 km/h) was similar to that of NYC (20.3 km/h) …”

Considering that approximately 0.34% of the trips have an average velocity greater than or equal to 46.5 km/h, these cases are not representative of typical taxi trips in NYC and can therefore be removed.

2.3. Handling Outliers

We will now address the outliers identified in the target feature trip_duration using the following steps:

  1. Remove the trip_duration less than or equal to 1.4 minutes. This will remove 1.00% of instances from the original dataset.

  2. Remove the trip_duration greater than 4 hours. This will remove 0.14% of instances from the original dataset.

  3. Remove geodesic_distance equal to 0.0 meters. This will remove 0.38% of instances from the original dataset.

  4. Remove the trip_duration corresponding to an average velocity greater than or equal to 46.5 km/h. This will remove 0.34% of instances from the original dataset.

We will use the boolean series previously defined to create a single series that identifies an entry as an outlier if it meets any one of the outlier conditions:

# Combine all outlier conditions
all_outliers =  lower_duration_outliers | upper_duration_outliers \
                | geodesic_outliers_zeros | upper_vel_outliers

not_outliers = ~all_outliers

# filter the original by not_outliers
original_length = len(taxi_trip)

per_outliers = all_outliers.sum()/original_length * 100
print(  f'\nOutliers removed: {all_outliers.sum()} '
        f'({per_outliers:.2f}% of the dataset)')
Outliers removed: 24848 (1.70% of the dataset)
taxi_trip = trip_outliers[not_outliers].copy()


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8))
plot_distribution_boxplot(taxi_trip['trip_duration'], ax1, ax2,
                        'Trip Duration', 'Log of Trip Duration (s)')


plt.tight_layout()
plt.show()
# save the dataset
taxi_trip.to_csv('../data/interim/nyc-taxi-trip-2016-remove-outliers.csv', index=False)

3. Feature Engineering

Before splitting the dataset and conducting the exploratory data analysis (EDA), let’s create some new features that will help in the model predictions and in the Exploratory Data Analysis section.

taxi_trip = pd.read_csv('../data/interim/nyc-taxi-trip-2016-remove-outliers.csv', 
                        parse_dates= ['pickup_datetime','dropoff_datetime'])

3.1. Datetime Features

Using the pickup time, with format YYYY-MM-DD H:M:S, let’s extract features such as the year, month, day, hour, and minute, as well as the day of the year, day of the week, hour of the day, and minute of the day, to help capture possible underlying patterns in the data related to specific time intervals. In total, we will create 11 new features:

# Days in a year: 1-365 (or 366)
taxi_trip['day_of_year'] = taxi_trip['pickup_datetime'].dt.dayofyear
# Day of week: 0-6 (0 for Monday, 6 for Sunday)
taxi_trip['day_of_week'] = taxi_trip['pickup_datetime'].dt.dayofweek
# Hour of the day: 0-23
taxi_trip['hour_of_day'] = taxi_trip['pickup_datetime'].dt.hour
# Minute of the day: 0-1439
taxi_trip['min_of_day'] = (taxi_trip['pickup_datetime'].dt.hour * 60) + taxi_trip['pickup_datetime'].dt.minute

# Date features
taxi_trip['pickup_date'] = taxi_trip['pickup_datetime'].dt.date
taxi_trip['dropoff_date'] = taxi_trip['dropoff_datetime'].dt.date
taxi_trip['year'] = taxi_trip['pickup_datetime'].dt.year
taxi_trip['month'] = taxi_trip['pickup_datetime'].dt.month
taxi_trip['day'] = taxi_trip['pickup_datetime'].dt.day
taxi_trip['hour'] = taxi_trip['pickup_datetime'].dt.hour
taxi_trip['minute'] = taxi_trip['pickup_datetime'].dt.minute

# Drop datetime features, they are not necessary anymore
taxi_trip.drop('pickup_datetime', axis=1, inplace=True)
taxi_trip.drop('dropoff_datetime', axis=1, inplace=True)

taxi_trip.head(3)
id vendor_id passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
0 id2875421 2 1 -73.982155 40.767937 -73.964630 40.765602 False
1 id2377394 1 1 -73.980415 40.738564 -73.999481 40.731152 False
2 id3858529 2 1 -73.979027 40.763939 -74.005333 40.710087 False
trip_duration day_of_year day_of_week hour_of_day min_of_day pickup_date dropoff_date year month day hour minute
0 455 74 0 17 1044 2016-03-14 2016-03-14 2016 3 14 17 24
1 663 164 6 0 43 2016-06-12 2016-06-12 2016 6 12 0 43
2 2124 19 1 11 695 2016-01-19 2016-01-19 2016 1 19 11 35

Other relevant information is whether a trip falls on a weekend or a holiday. For this we will use the holiday dataset from Kaggle, which contains the New York City holidays for 2016. We will create two new binary features, is_holiday and is_weekend:

holidays = pd.read_csv('../data/raw/nyc-holidays-2016.csv', sep = ';')
display(holidays.head())
Day Date Holiday
0 Friday January 01 New Years Day
1 Monday January 18 Martin Luther King Jr. Day
2 Friday February 12 Lincoln's Birthday
3 Monday February 15 Presidents' Day
4 Sunday May 08 Mother's Day
# Convert to datetime the Date column
holidays['Date'] = pd.to_datetime(holidays['Date'] + ', 2016')

# Extract month and day as numerical values
holidays['month'] = holidays['Date'].dt.month
holidays['day'] = holidays['Date'].dt.day
# Add the is_holiday column 
holidays['is_holiday'] = True

# Drop the original columns
holidays.drop(['Date', 'Day', 'Holiday'], axis=1, inplace=True)

# Save the processed holidays dataset
holidays.to_csv('../data/external/nyc-holidays-2016.csv',index=False)

display(holidays.head())
month day is_holiday
0 1 1 True
1 1 18 True
2 2 12 True
3 2 15 True
4 5 8 True

We will then use the month and day features to merge the taxi trip dataset with the holidays dataset using a left join:

# Merge the taxi data with the holidays data
taxi_trip = taxi_trip.merge(holidays, on=['month', 'day'], how='left')

# Replace NaN values in the is_holiday column with False
taxi_trip['is_holiday'].fillna(False, inplace=True)

# Add the is_weekend column (5:Saturday or 6:Sunday)
taxi_trip['is_weekend'] = taxi_trip['day_of_week'].isin([5, 6])

display(taxi_trip.head())
print('Number of trips in holiday:', taxi_trip[taxi_trip['is_holiday'] == True].shape[0])
id vendor_id passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
0 id2875421 2 1 -73.982155 40.767937 -73.964630 40.765602 False
1 id2377394 1 1 -73.980415 40.738564 -73.999481 40.731152 False
2 id3858529 2 1 -73.979027 40.763939 -74.005333 40.710087 False
3 id3504673 2 1 -74.010040 40.719971 -74.012268 40.706718 False
4 id2181028 2 1 -73.973053 40.793209 -73.972923 40.782520 False
trip_duration day_of_year min_of_day pickup_date dropoff_date year month day hour minute is_holiday is_weekend
0 455 74 1044 2016-03-14 2016-03-14 2016 3 14 17 24 False False
1 663 164 43 2016-06-12 2016-06-12 2016 6 12 0 43 False True
2 2124 19 695 2016-01-19 2016-01-19 2016 1 19 11 35 False False
3 429 97 1172 2016-04-06 2016-04-06 2016 4 6 19 32 False False
4 435 86 810 2016-03-26 2016-03-26 2016 3 26 13 30 False True
Number of trips in holiday: 49736

3.2. Traffic Features

A brief search on Google about New York City’s traffic patterns revealed useful insights from sources such as A Guide to Traffic in NYC and 10 Tips to Beat NYC Traffic. These resources suggest that high-traffic times in and out of Manhattan typically occur between 8-9 a.m. and 3-7 p.m. Based on this information, we will create a new feature called is_rush_hour that indicates whether the pickup time falls within rush hour on a given day. To simplify, weekends and holidays will be treated as having normal traffic throughout the day. Based on these sources and on the Exploratory Data Analysis section, we chose the following time intervals for rush hours: 7-10 a.m. and 4-8 p.m. This binary feature indicates an increase in traffic during rush hours:

# Define rush hours
morning_rush = (taxi_trip['hour'] >= 7) & (taxi_trip['hour'] < 10)
evening_rush = (taxi_trip['hour'] >= 16) & (taxi_trip['hour'] < 20)

# Weekdays are 0 (Monday) to 4 (Friday)
is_weekday = taxi_trip['day_of_week'] < 5


taxi_trip['is_rush_hour'] = (morning_rush | evening_rush) & is_weekday

display(taxi_trip.head(3))
id vendor_id passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
0 id2875421 2 1 -73.982155 40.767937 -73.964630 40.765602 False
1 id2377394 1 1 -73.980415 40.738564 -73.999481 40.731152 False
2 id3858529 2 1 -73.979027 40.763939 -74.005333 40.710087 False
trip_duration day_of_year pickup_date dropoff_date year month day hour minute is_holiday is_weekend is_rush_hour
0 455 74 2016-03-14 2016-03-14 2016 3 14 17 24 False False True
1 663 164 2016-06-12 2016-06-12 2016 6 12 0 43 False True False
2 2124 19 2016-01-19 2016-01-19 2016 1 19 11 35 False False False

3.3. Distance Features

As mentioned in the outliers section, the article on New York City taxi trip duration prediction using MLP and XGBoost considers several distance measures, such as the Manhattan, haversine, and bearing distances.

For this project, I will adopt a simplified approach that still accounts for the Earth’s curvature. This involves calculating the geodesic distance using the geopy library and computing the bearing, which measures the angle between the north direction and the line connecting the pickup and dropoff locations.

# Again lets use the function geodesic to create new features
taxi_trip['geodesic_distance'] = geodesic(  taxi_trip['pickup_latitude'].to_numpy(), 
                                            taxi_trip['pickup_longitude'].to_numpy(), 
                                            taxi_trip['dropoff_latitude'].to_numpy(), 
                                            taxi_trip['dropoff_longitude'].to_numpy())

def bearing(start_lat, start_lon, end_lat, end_lon):
    """
    Calculate the initial bearing (azimuth) between two sets of latitude and longitude coordinates.

    Args:
    - start_lat (float): Latitude of the starting point in degrees.
    - start_lon (float): Longitude of the starting point in degrees.
    - end_lat (float): Latitude of the ending point in degrees.
    - end_lon (float): Longitude of the ending point in degrees.

    Returns:
    float: The initial bearing in degrees, normalized to the range [0, 360).
    """
    
    # Convert latitude and longitude from degrees to radians
    start_lat, start_lon, end_lat, end_lon = map(np.radians, [start_lat, start_lon, end_lat, end_lon])

    # Calculate the change in coordinates
    dlon = end_lon - start_lon

    # Calculate bearing
    x = np.sin(dlon) * np.cos(end_lat)
    y = np.cos(start_lat) * np.sin(end_lat) - np.sin(start_lat) * np.cos(end_lat) * np.cos(dlon)
    initial_bearing = np.arctan2(x, y)

    # Convert from radians to degrees and normalize (0-360)
    initial_bearing = np.degrees(initial_bearing)
    bearing = (initial_bearing + 360) % 360

    return bearing
# create bearing feature
taxi_trip['bearing'] = bearing( taxi_trip['pickup_latitude'], taxi_trip['pickup_longitude'], 
                                taxi_trip['dropoff_latitude'], taxi_trip['dropoff_longitude'])



display(taxi_trip.head(3))
id vendor_id passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
0 id2875421 2 1 -73.982155 40.767937 -73.964630 40.765602 False
1 id2377394 1 1 -73.980415 40.738564 -73.999481 40.731152 False
2 id3858529 2 1 -73.979027 40.763939 -74.005333 40.710087 False
trip_duration day_of_year year month day hour minute is_holiday is_weekend is_rush_hour geodesic_distance bearing
0 455 74 2016 3 14 17 24 False False True 1502.171837 99.970196
1 663 164 2016 6 12 0 43 False True False 1808.659969 242.846232
2 2124 19 2016 1 19 11 35 False False False 6379.687175 200.319835

3.4. Weather Features

Even small weather changes, like rain or snow, can significantly affect traffic and transportation patterns in a large city. Rain or snow can impact driving conditions, increase traffic congestion, and influence public transportation usage, leading to longer taxi trips and potentially higher demand for taxis. Therefore, we will use the weather dataset from Kaggle, which includes information on temperature, precipitation, visibility, and other relevant weather metrics. We will integrate this data with our taxi trip dataset by merging them on date and time, as we did before with the holidays dataset.

weather = pd.read_csv('../data/raw/nyc-weather-2016.csv', parse_dates=['Time'])
display(weather.head(3))
Time Temp. Windchill Heat Index Humidity Pressure Dew Point Visibility Wind Dir Wind Speed Gust Speed Precip Events Conditions
0 2015-12-31 02:00:00 7.8 7.1 NaN 0.89 1017.0 6.1 8.0 NNE 5.6 0.0 0.8 NaN Overcast
1 2015-12-31 03:00:00 7.2 5.9 NaN 0.90 1016.5 5.6 12.9 Variable 7.4 0.0 0.3 NaN Overcast
2 2015-12-31 04:00:00 7.2 NaN NaN 0.90 1016.7 5.6 12.9 Calm 0.0 0.0 0.0 NaN Overcast

The weather dataset shows some missing values in features like Windchill, Heat Index, Pressure, and Visibility. For our analysis, Visibility is a critical feature due to its potential impact on driving conditions. The missing values in Windchill and Heat Index are not a concern for our study, as these factors are less directly related to traffic patterns. Other features relevant to our analysis, such as Conditions and Temp., have no missing values, so we do not need to worry about them.

# Checking for missing values, data types and duplicates:
print('duplicated values:', weather.duplicated().sum())
print('\n')
display(weather.info())
print('\n')
display(weather.isnull().sum()) 

duplicated values: 0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8787 entries, 0 to 8786
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Time        8787 non-null   datetime64[ns]
 1   Temp.       8787 non-null   float64       
 2   Windchill   2295 non-null   float64       
 3   Heat Index  815 non-null    float64       
 4   Humidity    8787 non-null   float64       
 5   Pressure    8556 non-null   float64       
 6   Dew Point   8787 non-null   float64       
 7   Visibility  8550 non-null   float64       
 8   Wind Dir    8787 non-null   object        
 9   Wind Speed  8787 non-null   float64       
 10  Gust Speed  8787 non-null   float64       
 11  Precip      8787 non-null   float64       
 12  Events      455 non-null    object        
 13  Conditions  8787 non-null   object        
dtypes: datetime64[ns](1), float64(10), object(3)
memory usage: 961.2+ KB



None







Time             0
Temp.            0
Windchill     6492
Heat Index    7972
Humidity         0
Pressure       231
Dew Point        0
Visibility     237
Wind Dir         0
Wind Speed       0
Gust Speed       0
Precip           0
Events        8332
Conditions       0
dtype: int64

The feature Conditions is a categorical feature that describes the weather conditions at the time of the observation. It has the following unique values:

print(weather['Conditions'].unique())
['Overcast' 'Clear' 'Partly Cloudy' 'Mostly Cloudy' 'Scattered Clouds'
 'Unknown' 'Light Rain' 'Haze' 'Rain' 'Heavy Rain' 'Light Snow' 'Snow'
 'Heavy Snow' 'Light Freezing Fog' 'Light Freezing Rain' 'Fog']

These values can be used to create new binary features that indicate the presence of a specific weather condition: snow, heavy_snow, rain, and heavy_rain, indicating whether it was snowing or raining at the time of the observation. To create these features, we first group the Conditions values into 4 categories: snow (Light Snow, Snow), heavy_snow (Heavy Snow, Light Freezing Fog, Fog), rain (Light Rain, Rain), and heavy_rain (Heavy Rain, Light Freezing Rain).

The idea behind this grouping is to put together conditions that impact traffic in a similar way. For example, Fog is grouped with Heavy Snow because it affects visibility in a similar way; the same goes for Light Freezing Rain and Heavy Rain.

To integrate the weather data into the taxi trip dataset, we must first preprocess the weather dataset to align it with the taxi trip data. Let’s separate the date and time into different columns for year, month, day, and hour.

weather['year'] = weather['Time'].dt.year
weather['month'] = weather['Time'].dt.month
weather['day'] = weather['Time'].dt.day
weather['hour'] = weather['Time'].dt.hour

# Select only the year of 2016
weather = weather[(weather['year'] == 2016)]

# Select only the months that appear in the taxi dataset
months_taxi_trip = taxi_trip['month'].unique()
select_month = weather['month'].isin(months_taxi_trip)
weather = weather[select_month]


# Create new features based in the conditions column
snow = (weather['Conditions'] == 'Light Snow') | (weather['Conditions'] == 'Snow')
heavy_snow = (weather['Conditions'] == 'Heavy Snow') | (weather['Conditions'] == 'Light Freezing Fog')\
                                                    | (weather['Conditions'] == 'Fog')

rain = (weather['Conditions'] == 'Light Rain') | (weather['Conditions'] == 'Rain')
heavy_rain = (weather['Conditions'] == 'Heavy Rain') | (weather['Conditions'] == 'Light Freezing Rain') 

weather['snow'] = snow
weather['heavy_snow'] = heavy_snow
weather['rain'] = rain
weather['heavy_rain'] = heavy_rain

# Drop the features that are not needed
weather.drop([ 'Time', 'Windchill', 'Heat Index', 'Humidity', 'Pressure', 'Dew Point', 'Wind Dir',
                'Wind Speed', 'Events', 'Conditions', 'Gust Speed'], axis=1, inplace=True)

# Rename columns to lowercase as in the taxi dataset
weather.rename(columns={'Temp.': 'temperature'}, inplace=True)
weather.rename(columns={'Visibility': 'visibility'}, inplace=True)
weather.rename(columns={'Precip': 'precipitation'}, inplace=True)

display(weather.head(3))
temperature visibility precipitation year month day hour snow heavy_snow rain heavy_rain
22 5.6 16.1 0.0 2016 1 1 0 False False False False
23 5.6 16.1 0.0 2016 1 1 1 False False False False
24 5.6 16.1 0.0 2016 1 1 2 False False False False

We can check the maximum and minimum values for each feature to see if there is any inconsistency. As shown below, all features appear to be consistent with the range of values expected for each feature.

# max and min values
display(weather.describe().loc[['min', 'max']])
temperature visibility precipitation year month day hour
min -18.3 0.4 0.0 2016.0 1.0 1.0 0.0
max 32.2 16.1 11.9 2016.0 6.0 31.0 23.0

Missing values

We need to handle the missing values in visibility. This feature takes values approximately between 0 and 16.1, where 0 indicates the worst visibility and 16.1 the best.

print('Unique values for visibility:\n', weather['visibility'].unique())
print('\n')
display(weather.isna().sum().head())
Unique values for visibility:
 [16.1  nan 14.5  8.  12.9 11.3  6.4  4.8  2.4  2.8  2.   4.   3.2  9.7
  1.2  0.8  1.6  0.4]





temperature        0
visibility       132
precipitation      0
year               0
month              0
dtype: int64

To better understand the nature of the missing values, let’s correlate them with the different weather conditions. The missing values do not appear to be randomly distributed; they occur when the weather is good. To visualize this, we build a table that groups the observations by is_visibility_nan (whether visibility is NaN) and counts the snowing and raining observations in each group:

is_visibility_nan = weather['visibility'].isna()

weather_summary = pd.DataFrame({
    'is_visibility_nan': is_visibility_nan,
    '#snowing_days': snow,
    '#heavy_snowing_days': heavy_snow,
    '#raining_days': rain,
    '#heavy_raining_days': heavy_rain})

# Group by 'Visibility NaN' and calculate the sum for each weather condition
summary = weather_summary.groupby('is_visibility_nan').sum()

summary['total_counts'] =  is_visibility_nan.value_counts().values

# Reset index to turn 'Visibility NaN' back into a column
summary = summary.reset_index()


display(summary)

is_visibility_nan #snowing_days #heavy_snowing_days #raining_days #heavy_raining_days total_counts
0 False 61 6 171 10 4200
1 True 0 0 0 0 132

When is_visibility_nan is True (visibility data is missing), there are 132 observations in total, and notably, none of them are recorded as having snowing, heavy snowing, raining, or heavy raining conditions. This absence of precipitation-related conditions in the rows with missing visibility suggests that visibility tends to go unrecorded primarily during good or clear weather.

This could make sense: under clear conditions, the value may simply not have been recorded. Therefore, we can reasonably fill the missing values with the maximum value of visibility.

weather['visibility'].fillna(weather['visibility'].max(), inplace=True)
print("Missing values in visibility: ",weather['visibility'].isna().sum())
Missing values in visibility:  0

We will use the combination of the year, month, day, and hour features to merge the taxi trip dataset with the weather dataset using a left join:

taxi_trip_merge = taxi_trip.merge(weather, on=['year','month', 'day', 'hour'], how='left')

display(taxi_trip_merge.isna().sum().tail(10))

# Save the weather dataset
weather.to_csv('../data/external/nyc-weather-2016.csv', index=False)
is_rush_hour             0
geodesic_distance        0
bearing                  0
temperature          11751
visibility           11751
precipitation        11751
snow                 11751
heavy_snow           11751
rain                 11751
heavy_rain           11751
dtype: int64

After merging the two datasets, we can see that some dates are missing from the weather dataset, causing 11751 missing values in the weather features. This is less than 1% of the entire dataset, so we take the simpler approach of dropping these rows:

taxi_trip = taxi_trip_merge.dropna().copy()

print('Old size:',taxi_trip_merge.shape[0])

print('\nNew size:',taxi_trip.shape[0])

display(taxi_trip.head(3))
Old size: 1427156

New size: 1415405
id vendor_id passenger_count pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude store_and_fwd_flag
0 id2875421 2 1 -73.982155 40.767937 -73.964630 40.765602 False
1 id2377394 1 1 -73.980415 40.738564 -73.999481 40.731152 False
2 id3858529 2 1 -73.979027 40.763939 -74.005333 40.710087 False
trip_duration day_of_year is_rush_hour geodesic_distance bearing temperature visibility precipitation snow heavy_snow rain heavy_rain
0 455 74 True 1502.171837 99.970196 4.4 8.0 0.3 False False False False
1 663 164 False 1808.659969 242.846232 28.9 16.1 0.0 False False False False
2 2124 19 False 6379.687175 200.319835 -6.7 16.1 0.0 False False False False

3.5 Cluster Features

Here we will use the mini-batch k-means algorithm to cluster the pickup and dropoff locations into 100 clusters. We will then use the cluster centers to create new features: the coordinates of the assigned cluster centers and the geodesic distance between the pickup and dropoff cluster centers. This can help the model identify location-related patterns in the data.

def plot_clusters_map(  df, gdf_map, latitude_col, longitude_col, cluster_col, ax, 
                        title=' ', sample_size=0, color_map= 'beige', edgecolor= 'grey', 
                        random_state=42, cmap='tab20', alpha=0.2):
    
    """
    Plots a geographical map with clusters from a DataFrame.

    Parameters:
    - df: DataFrame containing the data to be plotted.
    - gdf_map: GeoDataFrame representing the geographical boundaries.
    - latitude_col: Name of the column containing latitude data.
    - longitude_col: Name of the column containing longitude data.
    - cluster_col: Name of the column containing cluster identifiers.
    - ax: Matplotlib Axes object for plotting.
    - title: Title of the plot.
    - sample_size: Number of data points to sample from df (0 for all).
    - color_map: Color for the map.
    - edgecolor: Edge color for the map.
    - random_state: Random state for reproducibility in sampling.
    - cmap: Colormap for clusters.
    - alpha: Transparency level for cluster points.
    """
    
    
    if sample_size > 0:
        df = df.sample(sample_size, random_state=random_state)
        
    # Create geometry and GeoDataFrame
    geometry = gpd.points_from_xy(df[longitude_col], df[latitude_col])
    gdf_clusters = gpd.GeoDataFrame(df, geometry=geometry).set_crs(epsg=4326)
    
    # Convert map to the same CRS
    gdf_map = gdf_map.to_crs(epsg=4326)
    gdf_map.plot(ax=ax, color=color_map, edgecolor=edgecolor)  # Plot the NYC boundary
    
    # Scatter plot for clusters
    ax.scatter( gdf_clusters.geometry.x, gdf_clusters.geometry.y, s=1, 
                c=df[cluster_col].values, cmap=cmap, alpha=alpha)  # Plot clusters
    
    # Set labels and title
    ax.set_xlabel('Longitude', fontsize=14)
    ax.set_ylabel('Latitude', fontsize=14)
    ax.set_title(title, fontsize=20)

Let’s create seven more features based on the results of the mini-batch k-means algorithm: two cluster labels, pickup_cluster and dropoff_cluster, obtained by predicting the cluster of each pickup and dropoff location; four features with the latitude and longitude of the corresponding cluster centers, pickup_center_lat, pickup_center_lon, dropoff_center_lat, and dropoff_center_lon; and one feature, center_geodesic_distances, with the geodesic distance between the pickup and dropoff cluster centers.

pickup_locations = taxi_trip.loc[:,['pickup_latitude', 'pickup_longitude']]
dropoff_locations = taxi_trip.loc[:,['dropoff_latitude', 'dropoff_longitude']]


coords = np.vstack( [pickup_locations.values, dropoff_locations.values])

# Randomly permute the combined coordinates 
sample_coords = np.random.permutation(coords)[:500000]


kmeans_coords = MiniBatchKMeans(    n_clusters= 100 , random_state=42, 
                                    batch_size=10000, n_init = 'auto')

kmeans_coords.fit(sample_coords)


# Predict clusters for the full dataset
taxi_trip['pickup_cluster'] = kmeans_coords.predict(pickup_locations.values)
taxi_trip['dropoff_cluster'] = kmeans_coords.predict(dropoff_locations.values)

cluster_centers = kmeans_coords.cluster_centers_
taxi_trip['pickup_center_lat'] = cluster_centers[taxi_trip['pickup_cluster'], 0]
taxi_trip['pickup_center_lon'] = cluster_centers[taxi_trip['pickup_cluster'], 1]
taxi_trip['dropoff_center_lat'] = cluster_centers[taxi_trip['dropoff_cluster'], 0]
taxi_trip['dropoff_center_lon'] = cluster_centers[taxi_trip['dropoff_cluster'], 1]

# Calculate the geodesic distance
taxi_trip['center_geodesic_distances'] = geodesic(  taxi_trip['pickup_center_lat'].to_numpy(), 
                                                    taxi_trip['pickup_center_lon'].to_numpy(), 
                                                    taxi_trip['dropoff_center_lat'].to_numpy(), 
                                                    taxi_trip['dropoff_center_lon'].to_numpy())

taxi_trip[['center_geodesic_distances', 'geodesic_distance']].head()
center_geodesic_distances geodesic_distance
0 1381.650448 1502.171837
1 1909.482823 1808.659969
2 6462.250705 6379.687175
3 1864.313219 1483.632481
4 844.527437 1187.037659

For a visualization, we can plot the clusters using the shape file from NYC Borough Boundaries and the cluster centers:

fig, axes = plt.subplots(1, 2, figsize=(24, 16))
titles = ['Pickup Locations', 'Dropoff Locations']
# Column names 
location_columns = [('pickup_latitude', 'pickup_longitude', 'pickup_cluster'), 
                    ('dropoff_latitude', 'dropoff_longitude', 'dropoff_cluster')]

# Loop through each subplot
for ax, title, (lat_col, lon_col, cluster_col) in zip(axes, titles, location_columns):
    plot_clusters_map(  taxi_trip, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title, color_map= beige, edgecolor=grey)
    
    # Plot the cluster centers
    ax.scatter( cluster_centers[:, 1], cluster_centers[:, 0], s = 3, 
                c = 'black', marker='*', alpha = 0.7)
    
plt.show()
# Save the dataset
taxi_trip.to_csv('../data/interim/nyc-taxi-trip-2016-new-features.csv', index=False)

4. Split Dataset

taxi_trip = pd.read_csv('../data/interim/nyc-taxi-trip-2016-new-features.csv', 
                        parse_dates= ['pickup_date','dropoff_date'])

We first split the dataset into two parts, df_train_large and df_test. The df_train_large set will be used for exploratory data analysis (EDA) to understand the dataset deeply and identify key features. We then split df_train_large again into df_train and df_val: df_train will be used to train the machine learning models, and df_val to validate their predictions. The df_test set is reserved until the end and will provide an unbiased assessment of the model’s performance on unseen data, preventing any kind of data leakage. We will also use df_train_large to compare its performance with the df_test set.

# Split into 33% Train / 27% Validation / 40% Test
df_train_large, df_test = train_test_split(taxi_trip, test_size = 0.40, random_state = random_state)
df_train, df_val = train_test_split(df_train_large, train_size = 0.55, random_state = random_state)

print(f"Train large:{len(df_train_large)}({round(100*len(df_train_large)/ len(taxi_trip), 2)}%)")
print(f"Test: {len(df_test)}({round(100*len(df_test)/len(taxi_trip),2)}%)")
print(f"Train:{len(df_train)}({round(100*len(df_train)/len(taxi_trip),2)}%)")
print(f"Validation: {len(df_val)}({round(100*len(df_val)/len(taxi_trip),2)}%)")
Train large:849243(60.0%)
Test: 566162(40.0%)
Train:467083(33.0%)
Validation: 382160(27.0%)

Also, to check whether the spatial and temporal features are representative in each split, we plot the NYC map with the pickup and dropoff location clusters and the count of pickups per date. We can see that the split is representative of both the spatial and temporal features:

nybb_path = '../data/external/nyc-borough-boundaries/geo_export_e13eede4-6de2-4ed8-98a0-58290fd6b0fa.shp'
nyc_boundary = gpd.read_file(nybb_path)

fig, axes = plt.subplots(2, 2, figsize=(20, 16))
axes = axes.flatten()  # Flatten the axes array for easy iteration

# Titles and Column names
titles = ['Pickup Cluster Locations', 'Dropoff Cluster Locations']
# Column names 
location_columns = [('pickup_latitude', 'pickup_longitude', 'pickup_cluster'), 
                    ('dropoff_latitude', 'dropoff_longitude', 'dropoff_cluster')]

# Loop for training data
for ax, title, (lat_col, lon_col, cluster_col) in zip(axes[:2], titles, location_columns):
    plot_clusters_map(  df_train, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title + ' (Train)', color_map= beige, edgecolor=grey)

# Loop for test data
for ax, title, (lat_col, lon_col, cluster_col) in zip(axes[2:], titles, location_columns):
    plot_clusters_map(  df_test, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title + ' (Test)', color_map= beige, edgecolor=grey)

plt.tight_layout()  
plt.subplots_adjust(left=0.10, right=0.90, top=0.95, bottom=0.05)
plt.show()
fig, axes = plt.subplots(1, 1, figsize=(15, 5))

axes.plot(df_train.groupby('pickup_date')[['id']].agg(['count']), 'o-' , label = "Train", color = yellow)
axes.plot(df_test.groupby('pickup_date')[['id']].agg(['count']), 'o-', label = "Test", color = orange)

plt.title('Pickup records per day')
plt.legend(loc ='lower right')
plt.ylabel('Number of records')
plt.show()

5. Exploratory Data Analysis (EDA)

We proceed with the EDA on the df_train_large dataset, which is the combination of the training and validation sets. To prevent data leakage, we do not include the test set in the EDA. This ensures that the evaluation of the trained model on the test set is unbiased, with no information from the test set accidentally leaking into the training process.

5.1 Feature Visualizations

Let’s first plot, one more time, the maps of the NYC pickup and dropoff cluster locations. For both, the clusters are concentrated in Manhattan. The main difference between pickup and dropoff locations is the concentration of clusters outside Manhattan: dropoff locations are more spread across other regions such as Brooklyn and Queens compared to pickup locations. This is expected, since Manhattan is a commercial and financial center of the United States, while Brooklyn and Queens are mostly residential boroughs.

nybb_path = '../data/external/nyc-borough-boundaries/geo_export_e13eede4-6de2-4ed8-98a0-58290fd6b0fa.shp'
nyc_boundary = gpd.read_file(nybb_path)

fig, axes = plt.subplots(1, 2, figsize=(20, 16))
axes = axes.flatten()

titles = ['Pickup Cluster Locations', 'Dropoff Cluster Locations']
location_columns = [('pickup_latitude', 'pickup_longitude', 'pickup_cluster'), 
                    ('dropoff_latitude', 'dropoff_longitude', 'dropoff_cluster')]

for ax, title, (lat_col, lon_col, cluster_col) in zip(axes[:2], titles, location_columns):
    plot_clusters_map(  df_train_large, nyc_boundary, lat_col, lon_col, cluster_col, ax, 
                        title=title, color_map = beige , edgecolor=grey)
    
    # iterrows() yields (index, row) pairs, where each row is a pandas Series
    for idx, row in nyc_boundary.iterrows():
        centroid = row['geometry'].centroid
        ax.text(centroid.x, centroid.y, row['boro_name'], fontsize=10, ha='right', va='center')

plt.tight_layout()  
plt.subplots_adjust(left=0.10, right=0.90, top=0.95, bottom=0.05)
plt.show()

Let’s look at a simple overview visualization of the pickup and dropoff latitudes and longitudes:

fig, axes = plt.subplots(2, 2, figsize=(20, 10))  # 2x2 grid for subplots

# Pickup Longitude Histogram
sns.histplot(   df_train_large.pickup_longitude, bins=40, linewidth=0.5, color=yellow, alpha=0.2,
                ax=axes[0, 0], kde=True, line_kws={'lw': 3})
axes[0, 0].set_title('Pickup Longitude', fontsize=15)
axes[0, 0].set_xlabel('Pickup Longitude', fontsize=14)
axes[0, 0].set_ylabel('Count', fontsize=14)

# Pickup Latitude Histogram
sns.histplot( df_train_large.pickup_latitude, bins=40, linewidth=0.5, color=yellow, alpha=0.2,
                ax=axes[0, 1], kde=True, line_kws={'lw': 3})
axes[0, 1].set_title('Pickup Latitude', fontsize=15)
axes[0, 1].set_xlabel('Pickup Latitude', fontsize=14)
axes[0, 1].set_ylabel('Count', fontsize=14)

# Dropoff Longitude Histogram
sns.histplot(  df_train_large.dropoff_longitude, bins=40, linewidth=0.5, color=orange, alpha=0.2,
                ax=axes[1, 0], kde=True, line_kws={'lw': 3})
axes[1, 0].set_title('Dropoff Longitude', fontsize=15)
axes[1, 0].set_xlabel('Dropoff Longitude', fontsize=14)
axes[1, 0].set_ylabel('Count', fontsize=14)

# Dropoff Latitude Histogram
sns.histplot(   df_train_large.dropoff_latitude, bins=40, linewidth=0.5, color=orange, alpha=0.2,
                ax=axes[1, 1], kde=True, line_kws={'lw': 3})
axes[1, 1].set_title('Dropoff Latitude', fontsize=15)
axes[1, 1].set_xlabel('Dropoff Latitude', fontsize=14)
axes[1, 1].set_ylabel('Count', fontsize=14)

plt.tight_layout()
plt.show()

For the bearing feature created in the 3. Feature Engineering section, we visualize its distribution in two ways. On the left we plot a regular histogram of the bearing in Cartesian coordinates; on the right we plot a 2D histogram in polar coordinates, where the radial variable is the geodesic distance on a log scale and the angular variable is the bearing angle in degrees.

fig = plt.figure(figsize=(20, 8))

# Add a regular subplot for the linear histogram of bearing
ax1 = fig.add_subplot(121)  # 1 row, 2 columns, first subplot
sns.histplot(   df_train_large['bearing'], bins=40, linewidth=0.2, color=yellow, alpha=0.2,
                ax=ax1, kde=True, line_kws={'lw': 3})
ax1.set_title('Bearing Histogram', fontsize=15)
ax1.set_xlabel('Bearing (Degrees)', fontsize=14)
ax1.set_ylabel('Count', fontsize=14)

# Add a polar subplot for the 2D histogram
ax2 = fig.add_subplot(122, projection='polar')  # 1 row, 2 columns, second subplot with polar projection

# Convert bearing to radians
bearing_rad = np.radians(df_train_large['bearing'])

# Create a 2D histogram in polar coordinates
hist, theta, r, mappable = ax2.hist2d(  bearing_rad, 
                                        np.log10(df_train_large['geodesic_distance']), 
                                        bins=[20, 20], 
                                        cmap='YlOrBr')

# Customize the polar plot
ax2.set_theta_zero_location('N')  # Set 0 degrees to the north (top of the plot)
ax2.set_theta_direction(-1)  # Clockwise
ax2.set_xlabel('Bearing (Degrees)', fontsize=14, labelpad=20)
ax2.set_ylabel('Log10 of Geodesic Distance (m)', fontsize=14, labelpad=40)
ax2.grid(True)

# Add a color bar to represent counts on the polar plot
cbar = plt.colorbar(mappable, ax=ax2, pad=0.1, fraction=0.035)
cbar.ax.set_ylabel('Counts')

plt.tight_layout() 
plt.show()

We can draw the following observations from these two plots:

The bearing feature appears to capture the direction of the trip to some extent, which suggests it may be useful to the model. Note also that the distance is on a logarithmic scale, which emphasizes shorter trips.

For the categorical features vendor_id, passenger_count and store_and_fwd_flag we can use bar plots to visualize the distributions. Here vendor_id identifies the company that provides the taxi service, and store_and_fwd_flag indicates whether the trip record was stored in the vehicle’s memory before being sent to the vendor.

fig, axes = plt.subplots(1, 3, figsize=(20, 5))  # 3 rows, 1 column

# Plot for vendor_id
vendor_id = df_train_large['vendor_id'].value_counts().sort_index()
vendor_id.plot(kind='bar', width=0.9, color=[yellow, orange], edgecolor=grey, ax=axes[0])
axes[0].set_title('Company that Provides the Taxi Service')
axes[0].set_xlabel('vendor_id')
axes[0].set_ylabel('Frequency')
axes[0].tick_params(axis='x', rotation=0)

# Plot for store_and_fwd_flag
store_fwd_flag = df_train_large['store_and_fwd_flag'].value_counts().sort_index()
np.log(store_fwd_flag).plot(kind='bar', width=0.9, color=[yellow, orange], edgecolor=grey, ax=axes[1])
axes[1].set_title('Trip Data Stored in Vehicle Memory (?)')
axes[1].set_xlabel('store_and_fwd_flag')
axes[1].set_ylabel('Frequency (log scale)')
axes[1].tick_params(axis='x', rotation=0)

# Plot for passenger_counts
passenger_counts = df_train_large['passenger_count'].value_counts().sort_index()
passenger_counts.plot(kind='bar', width=0.9, color=[orange], edgecolor=grey, ax=axes[2])
axes[2].set_title('Number of Passengers')
axes[2].set_xlabel('passenger_count')
axes[2].set_ylabel('Frequency')
axes[2].tick_params(axis='x', rotation=0)

plt.tight_layout() 
plt.show()

We can state the following observations:

For insight into the temporal features, like day_of_week and hour_of_day, we can plot them for each vendor_id. Let’s compare the taxi services of the two companies by counting pickups per day of the week and per hour of the day.

fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Plot for day_of_week
grouped_data_day = df_train_large.groupby(['vendor_id', 'day_of_week'])['pickup_date'].count().reset_index()
sns.scatterplot(data=grouped_data_day, x='day_of_week', y='pickup_date', 
                hue='vendor_id', palette=['orange', 'yellow'], s=150, ax=axes[0])
axes[0].set_xticks(range(7))
axes[0].set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 
                        'Thursday', 'Friday', 'Saturday', 'Sunday'])
axes[0].set_xlabel(' ')
axes[0].set_title('Number of Pickups by Day of the Week')
axes[0].set_ylabel('Total Number of Pickups')
axes[0].grid()

# Plot for hour_of_day
grouped_data_hour = df_train_large.groupby(['vendor_id', 'hour_of_day'])['pickup_date'].count().reset_index()
sns.scatterplot(data=grouped_data_hour, x='hour_of_day', y='pickup_date',  
                hue='vendor_id', palette=['orange', 'yellow'], s=150, ax=axes[1])
axes[1].set_title('Number of Pickups by Hour of the Day')
axes[1].set_xlabel('Hour of the Day')
axes[1].set_ylabel('Total Number of Pickups')
# Highlight rush hour times
axes[1].axvspan(7, 10, color=grey, alpha=0.2)
axes[1].axvspan(16, 20, color=grey, alpha=0.2)
axes[1].grid()

# Adjust the legend for both subplots
for ax in axes:
    ax.legend(title='Vendor ID', loc='upper left')

plt.tight_layout()
plt.show()

We can state the following observations based on the scatter plots:

For further insight into the temporal features, we can count the trips per hour of the day and plot one curve per month and one per day of the week.

grouped_month = df_train_large.groupby(['hour', 'month']).size().reset_index(name='count')
grouped_weekday = df_train_large.groupby(['hour', 'day_of_week']).size().reset_index(name='count')

# Define palettes
palette_month = sns.color_palette("tab20", n_colors=6)
palette_weekday = sns.color_palette("tab20", n_colors=7)

fig, axes = plt.subplots(1, 2, figsize=(25, 6))

# Month plot
sns.lineplot(   data=grouped_month, x='hour', y='count', hue='month',
                palette=palette_month, linewidth=2.5, ax=axes[0])

# Update the month legend
month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
handles, labels = axes[0].get_legend_handles_labels()
axes[0].legend(handles=handles, title='Month', labels=month_labels, loc='upper left')

axes[0].set_xlabel('Hour of the Day')
axes[0].set_ylabel('Count')
axes[0].set_title('Hour of Day vs Count by Month')
axes[0].axvspan(7, 10, color=grey, alpha=0.2)
axes[0].axvspan(16, 20, color=grey, alpha=0.2)
axes[0].grid()

# Weekday plot
sns.lineplot(   data=grouped_weekday, x='hour', y='count', hue='day_of_week', 
                palette=palette_weekday, linewidth=2.5, ax=axes[1])

# Update the weekday legend
weekday_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
handles, labels = axes[1].get_legend_handles_labels()
axes[1].legend(handles=handles, title='Day of the Week', labels=weekday_labels, loc='lower right')

axes[1].set_xlabel('Hour of the Day')
axes[1].set_ylabel('Count')
axes[1].set_title('Hour of Day vs Count by Weekday')

# Highlight rush hour times
axes[1].axvspan(7, 10, color=grey, alpha=0.2)
axes[1].axvspan(16, 20, color=grey, alpha=0.2)
axes[1].grid()

plt.show()

We can state the following observations based on the line plots:

5.2 Feature Correlations

In the previous section, we visualized the distributions of individual features to gain insight into the data. Now we look at the correlation between those features and the target feature trip_duration, as well as the correlations among the features themselves.

On the left, we plot a hexbin of the log of trip_duration against the log of geodesic_distance. On the right, we again plot a 2D histogram in polar coordinates, where the radial variable is the log of trip_duration and the angular variable is the bearing. We can state the following observations:

fig = plt.figure(figsize=(20, 8))

log_geodesic_distance = np.log1p(df_train_large['geodesic_distance'])
log_trip_duration = np.log1p(df_train_large['trip_duration'])

ax1 = fig.add_subplot(121)  
hb = ax1.hexbin(log_geodesic_distance, log_trip_duration, gridsize=100, cmap='YlOrBr', mincnt=1)
ax1.set_xlabel('Log1p of Geodesic Distance (m)', fontsize=14)
ax1.set_ylabel('Log1p of Trip Duration (s)', fontsize=14)

# Add a color bar to represent counts on the hexbin plot
cbar_hexbin = plt.colorbar(hb, ax=ax1, pad=0.1, fraction=0.035)
cbar_hexbin.ax.set_ylabel('Counts')

# Add a polar subplot for the 2D histogram
ax2 = fig.add_subplot(122, projection='polar')  # 1 row, 2 columns, second subplot with polar projection
bearing_rad = np.radians(df_train_large['bearing'])
hist, theta, r, mappable = ax2.hist2d(  bearing_rad, 
                                        log_trip_duration, 
                                        bins=[20, 20], 
                                        cmap='YlOrBr')

# Customize the polar plot
ax2.set_theta_zero_location('N')  # Set 0 degrees to the north (top of the plot)
ax2.set_theta_direction(-1)  # Clockwise
ax2.set_xlabel('Bearing (Degrees)', fontsize=14, labelpad=20)
ax2.set_ylabel('Log1p of Trip Duration (s)', fontsize=14, labelpad=40)
ax2.grid(True)

# Add a color bar to represent counts on the polar plot
cbar_polar = plt.colorbar(mappable, ax=ax2, pad=0.1, fraction=0.035)
cbar_polar.ax.set_ylabel('Counts')

plt.tight_layout() 
plt.show()

From the left plot, we can see a positive correlation between the log of trip_duration and the log of geodesic_distance. This is expected: the longer the distance, the longer the trip duration. However, the relationship is not linear, and it is tighter for shorter distances; over longer distances, traffic and other factors have more room to influence the trip duration.

In the polar plot, there are two peaks in the bearing direction, around 22 and 200 degrees, for trip durations of roughly 405 to 1098 seconds. This implies that most trips in these directions are short, probably covering distances between 1 and 10 kilometers, as we previously stated.
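
To put a number on this relationship, we can compare the Pearson and Spearman coefficients on the log-transformed variables. This is a quick sketch (the exact values are not reproduced here); a noticeably higher Spearman value would confirm a monotonic but non-linear trend:

from scipy import stats

# Log-transform both variables, as in the hexbin plot above
log_dist = np.log1p(df_train_large['geodesic_distance'])
log_dur = np.log1p(df_train_large['trip_duration'])

# Pearson measures linear association; Spearman only assumes monotonicity
pearson_r, _ = stats.pearsonr(log_dist, log_dur)
spearman_r, _ = stats.spearmanr(log_dist, log_dur)
print(f"Pearson (log-log): {pearson_r:.3f} | Spearman (log-log): {spearman_r:.3f}")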

def plot_feature_importance(df, x, y, ax, threshold=0.002, title='Feature Importance', 
                            xlabel='Features', ylabel='Importance', palette=None):
    """
    Function to plot the feature importance with a distinction of importance based on a threshold.

    Parameters:
    - df: pandas.DataFrame
        DataFrame containing features and their importance scores.
    - x: str
        Name of the column representing feature names.
    - y: str
        Name of the column representing feature importance scores.
    - ax: matplotlib axis object
        Axis on which to draw the plot.
    - threshold: float, optional (default=0.002)
        Value above which bars will be colored differently.
    - title: str, optional (default='Feature Importance')
        Title of the plot.
    - xlabel: str, optional (default='Features')
        Label for the x-axis.
    - ylabel: str, optional (default='Importance')
        Label for the y-axis.
    - palette: list, optional
        A list of two colors. The first color is for bars 
        below the threshold and the second is for bars above.

    Returns:
    - None (modifies ax in-place)
    """
    if palette is None:
        palette = ["blue", "red"]
    
    blue, red = palette
    colors = [red if value >= threshold else blue for value in df[y]]
    sns.barplot(x=x, y=y, data=df, ax=ax, alpha=0.5, palette=colors, legend= False, hue = x)
    ax.set_xticks(range(len(df[x])))
    ax.set_xticklabels(df[x], rotation=90, fontsize=12)
    ax.set_ylabel(ylabel, fontsize=15)
    ax.set_xlabel(xlabel, fontsize=15) 
    ax.grid(axis='y')
    ax.set_title(title, fontsize=15)

To better understand the relationship between our target feature, trip_duration, and the other features, we explore two distinct statistical measures: mutual information and the Spearman correlation coefficient.

Mutual information measures the mutual dependence between two variables and captures both linear and non-linear relationships. It is robust against outliers and works with both numerical and categorical features. Spearman correlation, on the other hand, captures monotonic relationships (linear or not) and does not assume any specific data distribution. We can interpret the comparison of these metrics as follows:

features_to_drop = ['id', 'pickup_date', 'dropoff_date', "year", 
                    "min_of_day", "hour", "day_of_year", "minute" ]
df_train_dropped = df_train_large.drop(features_to_drop, axis=1).copy()

# Sample 50% with Random permutation
sample_size = int(len(df_train_large) * 0.5)
random_indices = np.random.permutation(df_train_large.index)
sample_indices = random_indices[:sample_size]

# Create a sample DataFrame
df_sampled = df_train_dropped.loc[sample_indices].drop('trip_duration', axis=1).copy()
Y_train_large = df_train_large['trip_duration'].copy()
Y_sampled = Y_train_large.loc[sample_indices]

# Mutual information
mi_values = mutual_info_classif(df_sampled, Y_sampled, discrete_features=False)
mi_column = pd.DataFrame(   mi_values, index=pd.Index(df_sampled.columns, name='features'), 
                            columns=['mutual_inf']).sort_values(by='mutual_inf', ascending=False)
mi_column_reset = mi_column.reset_index()

# Spearman correlation
spearman_corr = df_train_dropped.corr(method='spearman')['trip_duration'].agg(np.abs)\
                                                .sort_values(ascending=False)\
                                                .drop('trip_duration')\
                                                .to_frame()                                        
spearman_column = spearman_corr.reset_index()
spearman_column.columns = ['features', 'spearman_corr']


# Plots
fig, ax = plt.subplots(1, 2, figsize=(25, 10))

plot_feature_importance(mi_column_reset, x='features', y='mutual_inf', ax=ax[0], threshold=0.02, 
                        title="Mutual Information", xlabel=" ", 
                        ylabel="Taxi Trip Duration Correlation", palette=[yellow, orange])

plot_feature_importance(spearman_column, x='features', y='spearman_corr', ax=ax[1], threshold=0.02, 
                        title="Spearman Correlation", xlabel=" ", 
                        ylabel="Taxi Trip Duration Correlation", palette=[yellow, orange])

plt.tight_layout(pad=5.0)
plt.show()

The strongest correlations with trip_duration come from the distance measures, geodesic_distance and center_geodesic_distances, as expected. The other features are less correlated than the distances but still have some relevance; most of them are related to the pickup and dropoff locations.

We can select a set of relevant features from each metric and then take the union of the two sets. This ensures that we keep every feature that at least one of the metrics considers relevant.

threshold = 0.02
best_spearman = spearman_corr[spearman_corr['trip_duration'] >= threshold].index.tolist()
print("Best Spearman features for trip_duration correlation:\n", best_spearman)

best_mi = list(mi_column[mi_column['mutual_inf'] > threshold].index)
print("\n Best mutual information features for trip_duration:\n", best_mi)


mi_set = set(best_mi)
spearman_set = set(best_spearman)

union_mi_spearman  = mi_set.union(spearman_set)
print('\nUnion set:\n', union_mi_spearman)
intersec_mi_spearman  = mi_set.intersection(spearman_set)
print('\nIntersection set:\n', intersec_mi_spearman)
Best Spearman features for trip_duration correlation:
 ['geodesic_distance', 'center_geodesic_distances', 'dropoff_latitude', 'dropoff_center_lat', 
 'pickup_latitude', 'pickup_center_lat', 'dropoff_center_lon', 'dropoff_longitude', 'temperature', 
 'is_weekend', 'pickup_cluster', 'month', 'hour_of_day', 'day_of_week', 'is_holiday', 'pickup_center_lon', 
 'pickup_longitude', 'is_rush_hour', 'passenger_count', 'bearing', 'store_and_fwd_flag']

 Best mutual information features for trip_duration:
 ['geodesic_distance', 'center_geodesic_distances', 'pickup_cluster', 'pickup_center_lon', 
 'pickup_center_lat', 'dropoff_center_lat', 'dropoff_center_lon', 'dropoff_cluster', 'pickup_longitude', 
 'dropoff_latitude', 'pickup_latitude', 'dropoff_longitude', 'visibility', 
 'passenger_count', 'bearing', 'vendor_id']

Union set:
 {'is_holiday', 'dropoff_center_lon', 'dropoff_center_lat', 'passenger_count', 'store_and_fwd_flag', 
 'pickup_longitude', 'hour_of_day', 'day_of_week', 'center_geodesic_distances', 'pickup_cluster', 
 'pickup_latitude', 'dropoff_cluster', 'pickup_center_lat', 'dropoff_longitude', 'temperature', 
 'month', 'is_rush_hour', 'is_weekend', 'vendor_id', 'visibility', 'geodesic_distance', 
 'pickup_center_lon', 'bearing', 'dropoff_latitude'}

Intersection set:
 {'pickup_center_lat', 'dropoff_center_lon', 'dropoff_longitude', 'passenger_count', 'dropoff_center_lat', 
 'pickup_longitude', 'geodesic_distance', 'pickup_cluster', 'pickup_center_lon', 'center_geodesic_distances', 
 'bearing', 'dropoff_latitude', 'pickup_latitude'}

The intersection set contains no climate-related feature. This confirms our earlier assumption about the complex relationship between visibility and temperature and the target feature trip_duration. What is probably happening is that mutual information can account for complex relationships, while these features lack the clear monotonic relationship necessary for a strong Spearman correlation. We can still consider these features for our model, so we choose the union set, which keeps the features deemed relevant by either metric.
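
To illustrate this point with a synthetic example unrelated to the taxi data, a variable with a strong but non-monotonic (U-shaped) dependence on the target scores high on mutual information while its Spearman correlation stays near zero. Note that, for this continuous toy target, the sketch uses mutual_info_regression rather than the mutual_info_classif call used above:

from sklearn.feature_selection import mutual_info_regression
from scipy import stats

rng = np.random.default_rng(random_state)
x = rng.uniform(-3, 3, size=5000)
y = x ** 2 + rng.normal(scale=0.1, size=5000)   # non-monotonic (U-shaped) dependence

mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=random_state)[0]
rho, _ = stats.spearmanr(x, y)
print(f"Mutual information: {mi:.3f} | Spearman correlation: {rho:.3f}")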

def plot_corr_ellipses(data, figsize, **kwargs):
    """
    Plots a correlation matrix using ellipses to represent the correlations.

    Parameters:
    - data (pd.DataFrame): A 2D array or DataFrame containing the correlation matrix.
    - figsize: Tuple specifying the figure size.
    - kwargs: Additional keyword arguments for EllipseCollection.

    Returns:
    - A tuple containing the EllipseCollection object and the Axes object.

    """
    M = np.array(data)
    if not M.ndim == 2:
        raise ValueError('Data must be a 2D array.')

    # Mask the upper triangle of the matrix
    mask = np.triu(np.ones_like(M, dtype=bool), k=1)
    M[mask] = np.nan

    # Initialize the plot
    fig, ax = plt.subplots(1, 1, figsize=figsize, subplot_kw={'aspect': 'equal'})
    ax.set_xlim(-0.5, M.shape[1] - 0.5)
    ax.set_ylim(-0.5, M.shape[0] - 0.5)
    ax.invert_yaxis()
    ax.set_xticklabels([])
    ax.grid(False)

    # Determine xy locations of each ellipse center
    xy = np.indices(M.shape)[::-1].reshape(2, -1).T

    # Define ellipse properties
    w = np.ones_like(M).ravel() + 0.01  # Widths of ellipses
    h = 1 - np.abs(M).ravel() - 0.01   # Heights of ellipses
    a = 45 * np.sign(M).ravel()        # Rotation angles

    # Create and add the ellipse collection
    ec = EllipseCollection( widths=w, heights=h, angles=a, units='x', offsets=xy,
                            norm=Normalize(vmin=-1, vmax=1), transOffset=ax.transData, 
                            array=M.ravel(), **kwargs)
    ax.add_collection(ec)

    # Add a color bar for correlation values
    cb = fig.colorbar(ec, ax=ax, orientation='horizontal', fraction=0.047, pad=0.00)
    cb.ax.xaxis.set_ticks_position('bottom')
    cb.ax.xaxis.set_label_position('bottom')
    cb.ax.tick_params(top=False, labeltop=False)

    # Feature names on the diagonal
    if isinstance(data, pd.DataFrame):
        diagonal_positions = np.arange(M.shape[1])
        for i, label in enumerate(data.columns):
            ax.annotate(" -  " + label, (i - 0.4, i - 1), ha='left', va='bottom', rotation=90)
        ax.set_yticks(diagonal_positions)
        ax.set_yticklabels(data.index)

    # Hide the plot spines
    for spine in ax.spines.values():
        spine.set_visible(False)

    return ec, ax

Let’s plot a special kind of heatmap that uses ellipses to show the correlation between each pair of features, based on the Spearman correlation. The orientation of each ellipse indicates whether two variables are positively correlated (pointing to the top right) or negatively correlated (pointing to the top left). The shading and width of the ellipse indicate the strength of the association: thinner and darker ellipses correspond to stronger relationships. We use the union set of features so that features deemed relevant by either metric are included:

df_train_union = df_train_large[list(union_mi_spearman)].copy()

correlation_matrix = df_train_union.corr(method='spearman')
# Use the function with the Spearman correlation matrix
ec, ax = plot_corr_ellipses(correlation_matrix, figsize=(10, 10), cmap='PuOr')
plt.tight_layout()
plt.show()

We can observe that some features are highly correlated, which may render certain features redundant for our model. We have nevertheless chosen to retain these features to evaluate their impact on the model’s performance.
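
One way to make the redundancy explicit is to list the feature pairs whose absolute Spearman correlation exceeds a chosen cutoff, reusing the correlation_matrix computed above. This is a small sketch and the 0.9 threshold is arbitrary:

# Keep only the upper triangle to avoid duplicate pairs and the diagonal
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1))

# Stack into (feature_1, feature_2) -> correlation and filter by absolute value
redundant_pairs = (upper.stack()
                        .rename('spearman_corr')
                        .loc[lambda s: s.abs() > 0.9]
                        .sort_values(key=np.abs, ascending=False))
print(redundant_pairs)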

5.3 Conclusion from EDA

The EDA revealed that, out of 39 features (28 of which were crafted in the Feature Engineering section), a total of 24 have some relevant correlation with the target feature or with other features. Let’s select these relevant features to train our machine learning models.

# Select relevant features from EDA and the target feature
relevant_features = list(union_mi_spearman) + ['trip_duration']

df_train_processed = df_train[relevant_features].copy()
df_val_processed = df_val[relevant_features].copy()
df_test_processed = df_test[relevant_features].copy()
df_train_large_processed = df_train_large[relevant_features].copy()

# Save the processed datasets with target feature
df_train_processed.to_csv('../data/processed/train.csv', index=False)
df_val_processed.to_csv('../data/processed/validation.csv', index=False)
df_test_processed.to_csv('../data/processed/test.csv', index=False)
df_train_large_processed.to_csv('../data/processed/train_large.csv', index=False)

We saw that geodesic_distance and center_geodesic_distances have a strong positive correlation with the trip duration, yet not a linear one. This aligns with our conclusion that longer distances generally require longer travel times, but the relationship is not linear due to factors like traffic conditions. Also, certain directions are preferred for shorter trips, probably corresponding to trips within the Manhattan borough; this shows up in the bearing feature, which has a complex correlation with trip_duration. These features can be expected to be important for our model.

For the climate features, most show a very weak correlation with trip_duration, except for visibility and temperature. That only these two show a meaningful correlation was counterintuitive, because we expected more climate features to correlate strongly with trip_duration. Probably, because we only have records from the first six months of 2016, the impact of some climate features could not be fully captured, resulting in weak correlations.

For the temporal features, hour_of_day, month, is_weekend, day_of_week and is_holiday show a better correlation with trip_duration, suggesting a monotonic relationship. This is expected, because the time of day and the day of the week drive increases or decreases in traffic.

For the pickup and dropoff location features, the majority show a good correlation with trip_duration. This is expected, because location can signal traffic patterns to our model.

6. Model Training and Validation

In this project, we choose to employ the XGBoost and Decision Tree Regressor models to predict trip_duration. These models can capture complex relationships between the features and the target and are robust against outliers. Even with our outlier handling, this kind of dataset can still contain outliers due to unusual traffic conditions or other anomalies that we could not detect. In this section, we will use the df_train dataset to train the models and the df_val dataset to validate their predictions.

# Load the processed datasets for train, validation
df_train = pd.read_csv('../data/processed/train.csv')
df_val = pd.read_csv('../data/processed/validation.csv')

# Save Target feature
Y_train = df_train['trip_duration'].values
Y_val = df_val['trip_duration'].values
Y_train_log = np.log1p(Y_train)
Y_val_log = np.log1p(Y_val)

# Drop target feature
df_train.drop('trip_duration', axis=1, inplace=True)
df_val.drop('trip_duration', axis=1, inplace=True)
X_train = df_train.values
X_val = df_val.values
# Decision Tree Regressor
param_grid = {
    'max_depth': [ 2, 5, 10, None],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 5, 6]
}
dtr = DecisionTreeRegressor(random_state=random_state)

# Grid Search using RMSE as scoring metric
rmse_scorer = make_scorer(lambda y_true, y_pred: np.sqrt(mean_squared_error(y_true, y_pred)), 
                            greater_is_better=False)
grid_search = GridSearchCV(estimator=dtr, param_grid=param_grid, scoring=rmse_scorer, cv= 5, n_jobs= 8)
grid_search.fit(X_train, Y_train_log)

# Best parameters and score
best_params = grid_search.best_params_
best_score = -grid_search.best_score_  # the scorer already returns the negative RMSE

print("Best parameters found: ", best_params)
print("Best RMSE score from Grid Search: ", best_score)

Y_pred = grid_search.best_estimator_.predict(X_val)
rmse_val = np.sqrt(mean_squared_error(Y_val_log, Y_pred))

Y_pred = grid_search.best_estimator_.predict(X_train)
rmse_train = np.sqrt(mean_squared_error(Y_train_log, Y_pred))

print(f'\n Train-rmse:{rmse_train} \t\t Validation-rmse:{rmse_val}')
Best parameters found:  {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5}
Best RMSE score from Grid Search:  0.6045244076485581

 Train-rmse:0.35810464512423246 		 Validation-rmse:0.3645839928468323
dmatrix_train = xgb.DMatrix(X_train, label = Y_train_log)
dmatrix_val  = xgb.DMatrix(X_val , label = Y_val_log)
watchlist = [(dmatrix_train, 'train'), (dmatrix_val, 'validation')]


parms = {   'max_depth':8,
            'objective':'reg:squarederror',
            'eta'      :0.3,
            'subsample':0.7,
            'lambda'  :4,  
            'colsample_bytree':0.9,
            'colsample_bylevel':1,
            'min_child_weight': 10,
            'nthread'  :8,
            'eval_metric': "rmse",
            'seed': random_state}

evals_test_val = {}
xgb_model = xgb.train(  parms, dmatrix_train, num_boost_round=1000, evals = watchlist, 
                        early_stopping_rounds=30, maximize=False, verbose_eval=50, 
                        evals_result = evals_test_val)

print(f'\n Score = {xgb_model.best_score:1.5f}\n n_boost_round = {xgb_model.best_iteration:d}')
[0]	train-rmse:0.56721	validation-rmse:0.56764
[50]	train-rmse:0.29455	validation-rmse:0.30990
[100]	train-rmse:0.27962	validation-rmse:0.30561
[150]	train-rmse:0.26882	validation-rmse:0.30426
[200]	train-rmse:0.26045	validation-rmse:0.30408
[246]	train-rmse:0.25376	validation-rmse:0.30415

 Score = 0.30402
 n_boost_round = 217

7. Model Evaluation on Test Set

The XGBoost model outperforms the Decision Tree Regressor on both the training and validation datasets. To verify that our model truly has predictive power and isn’t just overfitting, we’ll assess it against the df_test set. Additionally, we’ll compare its performance on the df_train_large set for a more comprehensive evaluation. Testing on the df_train_large set helps confirm that our model’s effectiveness isn’t limited to the smaller df_train dataset. This dual approach of testing against unseen data (df_test) and a larger training set (df_train_large) confirms the predictive power of the model.

# Load the processed datasets for train large and validation
df_train_large = pd.read_csv('../data/processed/train_large.csv')
df_test = pd.read_csv('../data/processed/test.csv')

# Save Target feature
Y_train_large = df_train_large['trip_duration'].values
Y_test = df_test['trip_duration'].values
Y_train_large_log = np.log1p(Y_train_large)
Y_test_log = np.log1p(Y_test)

# Drop target feature
df_train_large.drop('trip_duration', axis=1, inplace=True)
df_test.drop('trip_duration', axis=1, inplace=True)
X_train_large = df_train_large.values
X_test = df_test.values

dmatrix_train_large = xgb.DMatrix(X_train_large, label = Y_train_large_log)
dmatrix_test  = xgb.DMatrix(X_test   , label = Y_test_log)
watchlist = [(dmatrix_train_large, 'train_large'), (dmatrix_test, 'test')]

parms = {   'max_depth':8, 
            'objective':'reg:squarederror',
            'eta'      :0.3,
            'subsample':0.7,
            'lambda'  :4,  
            'colsample_bytree':0.9,
            'colsample_bylevel':1,
            'min_child_weight': 10,
            'nthread'  :8,
            'eval_metric': "rmse",
            'seed': random_state}

evals_test_train= {}
xgb_model = xgb.train(    parms, dmatrix_train_large, num_boost_round=1000, evals = watchlist,
                            early_stopping_rounds=30, maximize=False, verbose_eval=50, 
                            evals_result = evals_test_train)

print(f'\n Score = {xgb_model.best_score:1.5f}\n n_boost_round = {xgb_model.best_iteration:d}')
[0]	train_large-rmse:0.56734	test-rmse:0.56765
[50]	train_large-rmse:0.29908	test-rmse:0.30796
[100]	train_large-rmse:0.28683	test-rmse:0.30243
[150]	train_large-rmse:0.27813	test-rmse:0.30017
[200]	train_large-rmse:0.27132	test-rmse:0.29918
[250]	train_large-rmse:0.26580	test-rmse:0.29838
[300]	train_large-rmse:0.26079	test-rmse:0.29799
[350]	train_large-rmse:0.25593	test-rmse:0.29770
[400]	train_large-rmse:0.25169	test-rmse:0.29765
[412]	train_large-rmse:0.25070	test-rmse:0.29760

 Score = 0.29759
 n_boost_round = 382

8. Conclusion

rmse_train = evals_test_val['train']['rmse']
rmse_val = evals_test_val['validation']['rmse']

rmse_train_large = evals_test_train['train_large']['rmse']
rmse_test = evals_test_train['test']['rmse']

# Calculating the number of epochs for all datasets
epochs_train = range(len(rmse_train))
epochs_val = range(len(rmse_val))
epochs_train_large = range(len(rmse_train_large))
epochs_test = range(len(rmse_test))

# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Plotting Train and Validation RMSE
axes[0].plot(epochs_train, rmse_train, label='Train', color = orange, linewidth=3)
axes[0].plot(epochs_val, rmse_val, label='Validation', linestyle='--', color = yellow, linewidth=3)
axes[0].set_xlabel('Epochs')
axes[0].set_ylabel('RMSE')
axes[0].set_title('Train vs Validation')
axes[0].grid()
axes[0].legend()

# Plotting Train Large and Test RMSE
axes[1].plot(epochs_train_large, rmse_train_large, label='Train Large', color = orange, linewidth=3)
axes[1].plot(epochs_test, rmse_test, label='Test', linestyle='--', color = yellow, linewidth=3)
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('RMSE')
axes[1].set_title('Train Large vs Test')
axes[1].grid()
axes[1].legend()

plt.tight_layout()
plt.show()

The plots show the RMSE score over the number of epochs for the df_train and df_val on the left, and for the df_train_large and df_test on the right. There’s a small gap of approximately 0.05 between the training and validation/test RMSEs, which is typical as models tend to perform better on the data they have seen (training data). Both validation/test sets reach a plateau around 0.3 RMSE, which is expected with data the model has not seen before.

Overall, the model performs well on the test/validation set with a RMSE of approximately 0.30. The plots suggest that the model has predictive power and is learning without showing signs of overfitting or data leakage.
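
To make the gap concrete, we can print the final RMSE values stored in the evals_result dictionaries. This is a quick check on the quantities already plotted above:

# Final boosting-round RMSE for each dataset and the train/validation and train/test gaps
print(f"Train: {rmse_train[-1]:.4f} | Validation: {rmse_val[-1]:.4f} "
      f"| Gap: {rmse_val[-1] - rmse_train[-1]:.4f}")
print(f"Train large: {rmse_train_large[-1]:.4f} | Test: {rmse_test[-1]:.4f} "
      f"| Gap: {rmse_test[-1] - rmse_train_large[-1]:.4f}")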

feature_importances = xgb_model.get_score(importance_type='gain')

# Map indices to names
feature_importances_named = {relevant_features[int(k[1:])]: v for k, v in feature_importances.items()}
df_feature_importance = pd.DataFrame(list(feature_importances_named.items()), 
                        columns=['features', 'importance']).sort_values(by='importance', ascending=False)


fig, ax = plt.subplots(1, 1, figsize=(15, 8))
plot_feature_importance(df_feature_importance, x='features', y='importance', ax=ax, threshold= 4, 
                        title="XGBoost Feature Importance", xlabel=" ", 
                        ylabel="Information Gain", palette=[yellow, orange])

plt.tight_layout(pad=5.0)
plt.show()

The feature importance using information gain from the XGBoost model validates our EDA conclusions. The geodesic_distance feature has the highest information gain, as we would expect. On the other hand, center_geodesic_distances has very low information gain, as we previously suggested, due to its redundant information: when features are highly correlated, tree-based models tend to favor one over the other during training.

The importance of hour_of_day, is_weekend, is_holiday, and day_of_week indicates that temporal features matter for our model. They can capture increases or decreases in traffic over time. For example, is_weekend can tell the model that weekend traffic patterns differ from weekday ones, with peaks at different hours of the day, as we observed in the Feature Visualization section.

Among the pickup and dropoff location features, dropoff_center_lat was ranked by the model as the most important. This was also expected, given the high correlation between the pickup and dropoff coordinates and the coordinates of their cluster centers.

Overall, the XGBoost feature importance plot validates our EDA findings. This suggests the model is learning genuine patterns in the data rather than overfitting, and it shows no indication of data leakage.

# Predictions on the test set with the trained model
Y_test_pred_log = xgb_model.predict(dmatrix_test)


fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))

# distributions
sns.kdeplot(Y_test_pred_log, ax=axes[0], color=yellow, edgecolor= grey ,alpha=0.4, fill=True)
sns.kdeplot(Y_test_log, ax=axes[0], color=orange, edgecolor= grey, alpha=0.4, fill=True)

axes[0].grid()
axes[0].set_title('Trip Duration Distribution')
axes[0].legend(['Predicted on Test', 'Actual on Test'])

# CDFs
sns.ecdfplot(Y_test_pred_log, ax=axes[1], label="Predict on Test", color=yellow, alpha=1, linewidth=2)
sns.ecdfplot(Y_test_log, ax=axes[1], label="Actual on Test", color=orange, alpha=1, linewidth=2)

axes[1].grid()
axes[1].set_title('Trip Duration Cumulative Distribution')
axes[1].legend()


plt.tight_layout(pad=5.0)
plt.show()

Finally, to better visualize how well our model is making predictions, we plot the distribution and the cumulative distribution of the trip duration for the actual values in Y_test and the predicted values on df_test. Note that both plots use the log1p of trip_duration. We can conclude that our model makes good predictions, with an RMSE of approximately 0.30 and a predicted distribution that closely matches the actual one.
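
Because the model was trained on log1p(trip_duration), the RMSE computed on the log scale is effectively the competition’s RMSLE. As a quick sanity check (a small sketch; the value should be close to the best score reported above):

# RMSE on the log1p scale == RMSLE on the original trip_duration scale
rmsle_test = np.sqrt(mean_squared_error(Y_test_log, Y_test_pred_log))
print(f"RMSLE on the test set: {rmsle_test:.5f}")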

9. Deployment

# Training logic inside the script train_model.py
xgb_model = train_xgboost('../data/processed/train_large.csv', 'trip_duration')

# Save the model
xgb_model.save_model('../models/xgb_taxi_trip.json')
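
The saved model is served behind the /predict endpoint tested below. As a reference, here is a minimal sketch of what such a service might look like, assuming Flask; the file name predict.py, the feature handling, and the response formatting are illustrative rather than the exact deployed code:

# predict.py -- hypothetical minimal prediction service (a sketch, not the deployed code)
import numpy as np
import pandas as pd
import xgboost as xgb
from flask import Flask, request, jsonify

app = Flask('taxi-trip-duration')

# Load the booster saved above
model = xgb.Booster()
model.load_model('../models/xgb_taxi_trip.json')

@app.route('/predict', methods=['POST'])
def predict():
    trip = request.get_json()                 # one trip as a JSON dict of features
    # Assumes the JSON keys arrive in the same order as the training features
    features = pd.DataFrame([trip])
    log_duration = model.predict(xgb.DMatrix(features.values))[0]
    seconds = int(np.expm1(log_duration))     # model was trained on log1p(trip_duration)
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return jsonify({'Trip Duration (hours:minutes:seconds)': f'{hours}:{minutes}:{secs}'})

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=9696)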

9.1 Testing Local and Web Service

df_test = pd.read_csv('../data/processed/test.csv').drop('trip_duration', axis=1)
url_local = "http://127.0.0.1:9696/predict"
# Sample one trip to make prediction
trip = df_test.sample(n=1).to_dict(orient='records')[0]
# Make a POST request to predict trip duration
requests.post(url_local, json = trip).json()
{'Trip Duration (hours:minutes:seconds)': '1:1:18'}
# Render Cloud URL
url_cloud = "https://taxi-trip-predict.onrender.com/predict"
trip = df_test.sample(n=1).to_dict(orient='records')[0]
requests.post(url_cloud, json = trip).json()
{'Trip Duration (hours:minutes:seconds)': '0:20:16'}