Optimizing Ride Matching

A Statistical Comparison of Distances for an Optimal Trip Allocation

Author

Paula LC

Published

February 6, 2025

Project Overview

At our mobility company, the ride-matching algorithm assigns each trip to the closest available driver. Currently, we calculate distance using the Haversine (straight-line) distance. However, this does not consider road networks, traffic conditions, or travel time, which may result in sub-optimal assignments. The Engineering team proposes switching to an external real-time maps API to compute road distance, aiming to improve ride efficiency. While this approach is expected to enhance trip allocation, it introduces API query costs and additional system complexity. To determine whether this transition is justified, the Data Science team has designed an A/B test across multiple cities. This project evaluates the impact of road distance on operational efficiency, customer experience, and financial feasibility.
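
For reference, the Haversine distance mentioned above can be sketched as follows. This is a minimal illustration, not the production implementation: the function name `haversine_m`, the mean Earth radius, and the coordinates are assumptions made for the example.

```r
# Great-circle (Haversine) distance in metres between two points.
# Inputs are in decimal degrees; the Earth radius is an assumed mean value.
haversine_m <- function(lat1, lon1, lat2, lon2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding slightly above 1 before asin()
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Hypothetical driver and pickup coordinates (illustrative only)
haversine_m(40.4168, -3.7038, 40.4530, -3.6883)
```

Because this distance ignores the road network, two drivers at the same Haversine distance from a passenger can have very different arrival times, which is the motivation for the experiment.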

Objectives

  1. Evaluate the impact of switching to road distance on ride assignment efficiency.
  2. Estimate the maximum feasible cost per API query to justify the investment.
  3. Assess experiment design improvements and propose enhancements.

Understanding the Data

Data Preprocessing

In this section, we preprocess the data before starting the analysis. This involves loading the required libraries and importing the available data. The dataset is then filtered to keep only the necessary information, and any missing data is identified and addressed.

Loading libraries & data

As a first step, we import the necessary libraries and load the data. After that, we define a ggplot theme to customize the visualizations of the report following the brand colors.

Code
# Data Manipulation
library(dplyr)
library(tidyr)
library(lubridate) # Dates
library(stringr) # Strings

# Data visualization & tables
library(ggplot2) # Graphs
library(reactable) # Tables
library(showtext) # Fonts

# Non-parametric methods
library(sm)

# Read data
df_intervals <- read.csv('data.csv')

# Create a vector of brand colors
brand_colors <- c("#f54251", "#4b4b8f", "#41CC94", "#f97b72",  "#E68310", "#3969AC", "#F2B701")
brand_colors <- rep(brand_colors, 10)

# Define colour and fill scales 
scale_colour_discrete <- function(...) scale_colour_manual(values = brand_colors)
scale_fill_discrete <- function(...) scale_fill_manual(values = brand_colors)

# Define brand font
brand_font <- "Lato, sans-serif"

# Add Lato font from Google Fonts
font_add_google("Lato")

# Help ggplot load the new font
showtext_auto()

# Define brand ggplot theme
brand_ggplot_theme <- function() {
  
  font_family <- "Lato"

  theme_minimal() +
    
  theme(
    # Panel
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.grid.major = element_line(linewidth = 0.25),
    panel.spacing = unit(1.25, "lines"),
    panel.background = element_blank(),
    
    # Axis
    axis.title = element_text(family = font_family),
    axis.text = element_text(family = font_family),
    axis.ticks.x = element_line(),
    
    # Legend
    legend.position = 'bottom',
    
    # Facets
    strip.text = element_text(family = font_family, 
                              face = "bold", hjust = 0.5, size = 12),
    # Title
    plot.title = element_text(family = font_family,
                             face = "bold", hjust = 0.5, size = 18)
 )
}

# Set brand ggplot theme
theme_set(brand_ggplot_theme())

Filtering data of interest

The goal of the exercise is to evaluate the performance of the distance metric on the way to the pickup point. Therefore, the data should contain only going_to_pickup events.

Code
df_pickup <- df_intervals |> 
  filter(type == "going_to_pickup")

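# Number of rows, unique trips, and unique vehicles after filtering
df_pickup |> 
  summarise(n_rows = n(),
            n_trips = n_distinct(trip_id),
            n_vehicles = n_distinct(vehicle_id)) |>
  reactable()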

After filtering the dataset we get 58510 rows with 58468 unique trips and 4745 unique vehicles.

Detecting missing data

The dataset could contain missing values. It is important to identify and remove them to ensure quality of the data.

Code
missing_values <- sum(is.na(df_pickup$duration))
missing_values / nrow(df_pickup)
[1] 0.005110238

Missing values represent 0.5% of the dataset. Since these rows don't add any valuable information to the exercise, they will be removed.

Code
df_pickup |> na.omit() -> df_pickup

Let's dive into the data to understand its main structure and the features available. One key step in this case study is to investigate special cases to ensure the quality of the information and of the conclusions drawn. Moreover, plotting and analysing the distribution of the variables gives us a perspective on the shape of the data and the multivariate relationships between the different features.

Exploring Trips

One trip, one vehicle

A first interesting check is whether each trip has only one vehicle assigned.

Code
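# Count events per trip (assumed to be one event per assigned vehicle),
# then count how many trips have each number of vehicles
df_pickup |>
  count(trip_id, name = "n_vehicles") |>
  count(n_vehicles, name = "n_trips") |>
  reactable()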

We detect 29 trips with 2 vehicles, one trip with 3, and another one with 9. Multiple vehicles are probably assigned when the number of passengers exceeds the number of seats of one vehicle, so several vehicles have to come to the pickup point. From the data engineering point of view, no specific action is needed here.

Fleeting Journeys

Code
# Show rows where duration = 0
df_pickup |> 
  filter(duration == 0) |>
  reactable(columns = list(
    trip_id = colDef(minWidth = 350),
    vehicle_id = colDef(minWidth = 350),
    type = colDef(minWidth = 150),
    started_at = colDef(minWidth = 150)
  )) 

22 events have a duration of 0 seconds.

Stationary Trips

Code
# Show top rows where distance = 0
df_pickup |> 
  filter(distance == 0) |>
  reactable(columns = list(
    trip_id = colDef(minWidth = 350),
    vehicle_id = colDef(minWidth = 350),
    type = colDef(minWidth = 150),
    started_at = colDef(minWidth = 150)
  )) 

There are 1421 events (2.4%) where the distance is 0. Like the events with zero duration, these rows don't provide any useful insight for the experiment, so they can be removed from the dataset.

Code
# Filter trips with positive distance and duration
df_pickup |> 
  filter(distance > 0 & duration > 0) -> df_pickup

Speed Limits

One way of exploring the speed of the trips is to plot distance against duration and observe how the data is distributed across the three cities.

Code
ggplot(df_pickup, aes(x = duration, y = distance)) +
  geom_point(alpha = 0.5) +
  facet_grid(. ~ city_id) +
  labs(x = "Duration (s)", y = "Distance (m)")
Figure 1: Plot shows duration versus distance of the going to pick up events by three cities after filtering events where duration and distance > 0. No transformations were applied.

We observe about 8 trips where the distance is implausibly high (distance > 500 km) with a duration lower than 2500 seconds (about 42 minutes). Let's investigate in more detail whether there are more cases like these.

Note

There is no specific information about the traffic rules of the cities included, so we apply the Spanish regulation to determine the maximum legal speed for a vehicle on the road. In Spain this limit is 120 km/h ≈ 33.3 m/s. All events faster than this limit are excluded from the exercise, since they would add noise to the analysis and would violate safety standards, one of the key values of the company.
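
The unit conversion behind this threshold can be sketched as below. The helper `kmh_to_ms` and the example event are hypothetical and only illustrate the arithmetic.

```r
# Convert a speed from km/h to m/s (1 km/h = 1000 m / 3600 s)
kmh_to_ms <- function(kmh) kmh * 1000 / 3600

max_speed_ms <- kmh_to_ms(120)   # about 33.3 m/s

# A hypothetical event: 5000 m covered in 120 s, which is about 41.7 m/s
event_speed_ms <- 5000 / 120
event_speed_ms > max_speed_ms    # TRUE means the event exceeds the limit
```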

Code
# Calculate velocity in m/s
df_pickup |>
  mutate(velocity = round(distance / duration,1)) -> df_pickup

# Filter cases where velocity is higher than 33 m/s ~ 120 km/h
df_pickup |>
  filter(velocity > 33) |>
  arrange(-velocity) |>
  head() |>
  reactable(columns = list(
    trip_id = colDef(minWidth = 350),
    vehicle_id = colDef(minWidth = 350),
    type = colDef(minWidth = 150),
    started_at = colDef(minWidth = 150)
  )) 

There are 591 trips (1%) with a speed higher than 120 km/h. The most extreme case is a vehicle travelling at 3349 m/s, roughly 10 times the speed of sound. Evidently, these cases will be removed.

Code
# Filter out cases where velocity is higher than 33 m/s
df_pickup |> 
  filter(velocity <= 33) -> df_pickup_filtered

# Plot duration Vs distance without high speed trips
ggplot(df_pickup_filtered, aes(x = duration, y = distance)) +
  geom_point(alpha = 0.5) +
  facet_grid(. ~ city_id) +
  labs(x = "Duration (s)", y = "Distance (m)")
Figure 2: Plot shows duration versus distance of the going to pick up events by three cities after filtering speed <= 120 km/h

The data distribution now presents a more realistic representation.

Exploring Features

After studying the observations individually, it is time to explore the distribution and relationships of the different features, focusing on the main one of the analysis, the distance type. Do they have any particular shape? Are they Normally distributed? This is key to choosing the right test to compare the metrics.

Let's first assign the experiment distance type labels to the new variable distance_type.

Code
# Extract the first character of trip_id and assign road (0-8) or linear (any other character)
df_pickup_filtered |>
  mutate(start_trip_id = str_sub(trip_id, 1, 1),
         distance_type = factor(if_else(start_trip_id %in% as.character(0:8), 'road', 'linear'))) -> df_pickup_filtered

Now that the main variable has been added, the duration and distance scatter plot is repeated by distance type to look for patterns in the data.

Code
ggplot(df_pickup_filtered, aes(x = duration, y = distance, color = distance_type)) +
  geom_point(alpha = 0.5) +
  facet_grid(. ~ city_id) +
  labs(x = "Duration (s)", y = "Distance (m)")
Figure 3: Plot shows duration versus distance of the going to pick up events by three cities and distance type

The graph doesn't reveal any specific pattern, probably due to the large amount of overlapping data. A 2D density plot can be a better option to extract insights when there is a large amount of data.

Code
# Plot density 2d 
ggplot(df_pickup_filtered, aes(x = duration, y = distance, color = distance_type)) +
  geom_density_2d(alpha = 0.5) +
  facet_grid(. ~ city_id) +
  labs(x = "Duration (s)", y = "Distance (m)")
Figure 4: Density 2D plot of duration and distance of the going to pick up events by three cities and distance type. No transformations were applied. Pay attention to the axis limits.

City Insights

From a business point of view, most trips last at most 750 s (12.5 minutes) in Astra and between 500 and 600 s (8 to 10 minutes) in Vera and Mina. Regarding distance, Astra trips are typically shorter than 3 km, and Vera and Mina trips shorter than 2 km. This suggests that Astra is larger than the other cities. The narrower area of the 2D density plot for Vera suggests easier mobility across this city compared with Mina.

Statistical Analysis

From a statistical point of view, all distributions are concentrated in the lowest values of duration and distance, which means that the distributions are skewed. A one-dimensional density plot of both variables will support this hypothesis.

Correlation. The shape of the areas is straight and slanted. This indicates a correlation between both variables, duration and distance, in the three cities.

Distance type. Differences between distance types are more visible for larger durations and distances, i.e. the outermost areas, as could be expected. Over longer distances it should be easier to detect an effect of road versus linear distance.

Note

Although it can be interesting to visualize the relationship between duration and distance by city and distance type to understand the behaviour and interaction of the data, the key of the analysis is to understand the waiting time of passengers, that is, the duration for any given distance. So let's continue the analysis focused only on the variable duration.

Code
# Density plot for duration by city and distance type
ggplot(df_pickup_filtered, aes(x = duration, color = distance_type, fill = distance_type)) +
  geom_density(alpha = 0.3) +
  facet_grid(. ~ city_id) + 
  labs(x = "Duration (s)", y = "Density")
Figure 5: Density plot of going to pick up events duration by three cities and distance type. No transformations were applied.

Skewed distributions make it hard to draw conclusions from the visualization. Let's apply a logarithmic transformation to make the data easier to read.

Code
# Calculate the mean across cities and distance types
df_pickup_filtered |>
  group_by(city_id, distance_type) |>
  summarise(log_mean = mean(log(duration)),
            mean = mean(duration)) -> duration_means

# Plot the logarithmic transformation of the duration by distance type and city
ggplot(df_pickup_filtered, aes(x = log(duration), color = distance_type, fill = distance_type)) +
  geom_density(alpha = 0.4) +
  facet_grid(city_id~.) +
  geom_vline(data = duration_means, aes(xintercept = log_mean, color = distance_type))
Figure 6: Density plot of going to pick up events duration by three cities and distance type. Log-transformation was applied to duration. The log mean is represented as a vertical line for each group.

Visually, we can see a small difference between the means of linear and road in Astra and Mina. In Mina, the mean road log duration is lower than the linear one, and the opposite happens in Astra. For Vera, the means of the log duration of the two experiment groups overlap.

Warning

Note that a difference between groups on the log scale is not the same as the difference on the original scale.
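
To illustrate the warning, here is a small sketch with made-up durations (not the experiment data). The difference of the log means back-transforms to a ratio of geometric means, while the difference of the arithmetic means is expressed in seconds.

```r
# Two hypothetical groups of durations in seconds (illustrative values only)
linear <- c(120, 240, 480, 960)
road   <- c(100, 200, 400, 800)

# Difference of the means of the log durations
log_diff <- mean(log(linear)) - mean(log(road))

# exp() of that difference is the ratio of geometric means, not a difference in seconds
ratio_geo <- exp(log_diff)

# Difference of the arithmetic means on the original scale, in seconds
diff_arith <- mean(linear) - mean(road)

c(log_diff = log_diff, ratio_geo = ratio_geo, diff_arith = diff_arith)
```

In this toy example every linear duration is 1.2 times the road one, so the ratio of geometric means is 1.2 whatever the size of the trips, whereas the difference in seconds depends on how long the trips are.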

Code
duration_means |>
  reactable(
    columns = list(
      log_mean = colDef(name="log-duration mean", format = colFormat(digits = 2)),
      mean = colDef(name = "duration mean", format = colFormat(digits = 2))
    )
  )

To continue the analysis, we generate new variables related to the date and time of the pickup. They could be useful to discover new insights in the data.

  • started_at: Reformat started_at as date time variable
  • hour_day: Hour of the day when the ride started at
  • week_day: Day of the week when the ride started at
  • moment_day: Moment of the day. This variable categorizes the day into four segments by hour of the day: ‘Night’ (hours 0 to 5), ‘Morning’ (hours 6 to 11), ‘Afternoon’ (hours 12 to 17), and ‘Evening’ (hours 18 to 23).
Code
# To simplify coding we rename the main dataframe
df <- df_pickup_filtered

# Convert started_at to a Date or POSIXct object
df |>
  mutate(started_at = as.POSIXct(started_at, origin = "1970-01-01", tz = "UTC")) -> df

# Generate new variables: hour_day, week_day, moment_day
df |>
  mutate(hour_day = hour(started_at),
         week_day = wday(started_at, label = TRUE),
         moment_day = cut(hour_day, breaks = c(-Inf, 5, 11, 17, 23),
                          labels = c("Night", "Morning", "Afternoon", "Evening"),
                          include.lowest = TRUE)) -> df

Let's check the amount of data available in each city per hour of the day.

Code
# Bar plot of data frequency by city, hour of the day
ggplot(df, aes(hour_day, fill = distance_type, color = distance_type)) +
  geom_bar(position = 'dodge', alpha = 0.4) +
  facet_grid(city_id ~distance_type) 
Figure 7: Bar plot of going to pick up events frequency by three cities and distance type.

Vera has more data than Mina and Astra. For all cities, the hours with the fewest trips are between 5 and 10 AM. Moreover, most of the trips are concentrated in the afternoon hours.

We can continue with the visualization of the duration of the pick up events by date time for the two distance types and the three cities.

Code
# Mean log duration per city, hour (rounded to the nearest hour), and distance type
new_df <- df |> 
  group_by(city_id, new_hour = round_date(started_at, "hour"), distance_type) |>
  summarise(log_mean = mean(log(duration)), .groups = "drop")

ggplot(df, aes(started_at, log(duration), color = distance_type)) + 
  geom_point(alpha = 0.1, size = 0.1) + 
  facet_grid(city_id~.) +
  geom_line(data = new_df, aes(new_hour, log_mean)) +
  scale_x_datetime(date_breaks = "4 hours", date_labels = "%d %b \n %H:%M")
Figure 8: Scatter and line plot of going to pick up events log duration by time for the three cities. Color is associated to distance type.

The mean log duration is very similar between the two distance types. From 5 AM to 10 AM, we observe a wider gap between road and linear in Vera and Mina. Nevertheless, the number of trips in that period is lower than during the rest of the day, so conclusions should be drawn with caution.

Code
df |>
  group_by(city_id, moment_day, distance_type) |>
  mutate(n = n()) |>
ggplot(aes(distance_type, log(duration), fill = distance_type, color = distance_type)) + 
    geom_violin(aes(alpha = n), linewidth = 0) +
    stat_summary(fun = mean, geom = "errorbar", aes(ymax = after_stat(y), ymin = after_stat(y)), width = 0.9) +
    facet_grid(city_id~moment_day) + 
  theme(legend.position = "none")
Figure 9: The plot shows the mean of the log duration by city and moment of the day. Violin areas represent the density of the data for each category, and the intensity of the color the amount of data for each group. The lines over the areas indicate the mean of the group.

The positive effect of road distance stands out especially at night and in the evening. During the morning, linear distance performs better in terms of duration. In the afternoon, no differences are observed, except for Astra, where the duration with road distance is longer than with linear. We don't observe any difference in the shape of the distributions between distance types.

Statistical Hypothesis Testing

The data exploration phase allowed us to understand in more detail the nature of the data, the relationships between variables, and the behaviour of distance type across different cities, hours of the day, and moments of the day.

We observed differences between road and linear duration by city and moment of the day. Nevertheless, assuming that the service can only be purchased per city, and not per time of day, we can only test the effect of distance type on duration by city (not by moment of the day).

Therefore, the test that allows us to compare the means of two groups and determine whether there is a significant difference between them is the t-test. The test relies on some assumptions that should be checked in advance:

  1. Normality. Assumes that the populations from which the samples are drawn are normally distributed. This will be tested with a non-parametric test.

  2. Homogeneity of Variance. The classic Student's t-test assumes that the variances of the two groups being compared are roughly equal. The Welch version used below (the default of t.test in R) does not pool the variances, so this assumption is not required here.

  3. Independence. Assumes that the observations in one group are independent of the observations in the other group. Independence can be assumed because each trip was assigned to its group at random.

Then, t-test follows the formula:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

Where:

  • \(\bar{X}_1\) and \(\bar{X}_2\) are the sample means of the log duration of distance type linear and road respectively.

  • \(s_1\) and \(s_2\) are the sample standard deviations of the log duration of distance type linear and road respectively.

  • \(n_1\) and \(n_2\) are the sample sizes of the distance type linear and road groups respectively.

The degrees of freedom \(df\) for the t-test is calculated using the formula:

\[ df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2 - 1}} \]

Once the t-statistic is calculated, the p-value is evaluated. The p-value indicates the probability of obtaining the observed difference (or a more extreme one) if there is no true difference between the groups. If the p-value is below a chosen significance level (commonly 0.05), we may reject the null hypothesis and conclude that there is a significant difference between the geometric means of the two groups.
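
The two formulas above can be checked numerically with a short sketch. The helper `welch_t` and the simulated samples are illustrative assumptions, not the experiment data; the result is compared with the built-in `t.test`, which runs the Welch version by default.

```r
# Welch t statistic, degrees of freedom, and two-sided p-value from two samples
welch_t <- function(x1, x2) {
  n1 <- length(x1)
  n2 <- length(x2)
  v1 <- var(x1) / n1          # s_1^2 / n_1
  v2 <- var(x2) / n2          # s_2^2 / n_2
  t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
  dof <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(-abs(t_stat), df = dof)
  c(t = t_stat, df = dof, p_value = p_value)
}

# Simulated log durations (hypothetical, not the experiment data)
set.seed(1)
log_linear <- rnorm(200, mean = 5.50, sd = 0.8)
log_road   <- rnorm(150, mean = 5.45, sd = 0.9)

manual  <- welch_t(log_linear, log_road)
builtin <- t.test(log_linear, log_road)

manual
builtin
```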

Non-parametric Normality test

We choose a non-parametric Normality test to check whether the log duration follows a Normal distribution, since this approach doesn't assume a specific parametric form for the underlying distribution. The sample distribution is compared with the theoretical one using a kernel estimator, which produces a smooth estimate of the underlying density.
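
As a reminder, a kernel density estimator has the general form below. The notation is ours and is given only as a sketch: \(x_i\) are the \(n\) observed log durations, \(K\) is a kernel function (commonly a Normal density), and \(h\) is the bandwidth that controls the smoothness of the estimate.

\[ \hat{f}_h(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \]

The reference band drawn with model = "Normal" shows where this estimate would be expected to lie if the data were Normally distributed.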

Code
df |>
  filter(city_id == "astra") |>
  pull(duration) |>
  log() |>
  sm.density(model = "Normal")

df |>
  filter(city_id == "vera") |>
  pull(duration) |>
  log() |>
  sm.density(model = "Normal")

df |>
  filter(city_id == "mina") |>
  pull(duration) |>
  log() |>
  sm.density(model = "Normal")
(a) City Astra
(b) City Vera
(c) City Mina
Figure 10: Plots show the density estimation for log duration of the data. The reference band is used to check the Normality of the sample distribution. If the density estimate (black line) lies within the band, we can assume Normality of the data.

Although the distributions are not exactly within the reference band of Normality, they are pretty close to it. In any case, the t-test is quite robust to deviations from Normality, especially with large sample sizes, so we can assume a Normal distribution of the log duration for the three cities.

T-test

Code
# Mean duration by city and distance type (original scale)
ggplot(df, aes(duration, city_id, fill = distance_type, color = distance_type)) + 
    stat_summary(fun = mean, geom = "point") 

# Mean log duration by city and distance type
ggplot(df, aes(log(duration), city_id, fill = distance_type, color = distance_type)) + 
    stat_summary(fun = mean, geom = "point") 
Figure 11: Dumbbell plot of going to pick up events duration mean by city and distance type.
Figure 12: Dumbbell plot of going to pick up events log duration mean by city and distance type.

Test for city Astra

Code
test_Astra <- t.test(log(duration) ~ distance_type, data = df, subset = city_id == "astra")

test_Astra

    Welch Two Sample t-test

data:  log(duration) by distance_type
t = -0.91591, df = 5071.7, p-value = 0.3598
alternative hypothesis: true difference in means between group linear and group road is not equal to 0
95 percent confidence interval:
 -0.06763988  0.02456303
sample estimates:
mean in group linear   mean in group road 
            6.114839             6.136378 

Test for city Vera

Code
test_Vera <- t.test(log(duration) ~ distance_type, data = df, subset = city_id == "vera")

test_Vera

    Welch Two Sample t-test

data:  log(duration) by distance_type
t = 0.21489, df = 36351, p-value = 0.8299
alternative hypothesis: true difference in means between group linear and group road is not equal to 0
95 percent confidence interval:
 -0.01348184  0.01680213
sample estimates:
mean in group linear   mean in group road 
            5.320265             5.318605 

Test for city Mina

Code
test_Mina <- t.test(log(duration) ~ distance_type, data = df, subset = city_id == "mina")

test_Mina

    Welch Two Sample t-test

data:  log(duration) by distance_type
t = 1.8982, df = 11069, p-value = 0.0577
alternative hypothesis: true difference in means between group linear and group road is not equal to 0
95 percent confidence interval:
 -0.0009370167  0.0583292573
sample estimates:
mean in group linear   mean in group road 
            5.511966             5.483270 

For Vera and Astra, the p-value is greater than 0.05, the common threshold for rejecting the null hypothesis. For these two cities the null hypothesis cannot be rejected, so we cannot conclude that road distance gives a significant improvement over linear distance.

Nevertheless, for Mina the p-value is 0.0577, close to the 0.05 threshold. Strictly, the null hypothesis is not rejected at the 5% level, although it would be at the 10% level. Due to the nature of the data and the external factors that are out of our control in real samples, we consider this p-value sufficient evidence to reject the null hypothesis of equal means of the log-transformed duration, and we treat the result as marginal evidence. Hereafter, we will try to interpret what this effect means in terms of the original variable.

Effect of distance type over duration

Let's transform the log duration back to the original scale. For this, we take the exponential of the difference of the means of the log-transformed values, obtaining the ratio of the geometric means.

Note

The interpretation of geometric means is different from that of the arithmetic means we are used to, so we need to be careful with the conclusions drawn.

The estimated difference between the log means is 0.029 (5.512 − 5.483), which after applying the exponential transformation is exp(0.029) ≈ 1.03. This is the ratio of geometric means between the linear and road groups. On the original scale, these results suggest that the geometric mean of the duration with linear distance is approximately 2.9% (about 3%) higher than with road distance. As an example, a 10-minute trip with linear distance would take about 9 minutes and 43 seconds with road distance (600 s / 1.029 ≈ 583 s).
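
The back-transformation can be reproduced from the numbers printed in the Mina test output. This is a sketch: the variable names are ours, and applying the ratio uniformly to a 600-second trip is a simplification.

```r
# Values copied from the Welch t-test output for Mina
mean_log_linear <- 5.511966
mean_log_road   <- 5.483270
ci_log <- c(-0.0009370167, 0.0583292573)

# Ratio of geometric means (linear / road) and its 95% confidence interval
ratio <- exp(mean_log_linear - mean_log_road)
ci_ratio <- exp(ci_log)

# Duration with road distance for a trip that takes 600 s with linear distance,
# assuming the ratio applies uniformly (a simplification)
road_equivalent_s <- 600 / ratio

round(c(ratio = ratio, ci_low = ci_ratio[1], ci_high = ci_ratio[2],
        road_equivalent_s = road_equivalent_s), 3)
```

The back-transformed interval runs from slightly below 1 to about 1.06. It includes 1, which is consistent with a p-value just above 0.05.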

In summary
  • For Vera and Astra, we cannot conclude that road distance gives a significant improvement over linear distance

  • In Mina, road distance reduces the duration of trips by about 3% compared with linear distance

Optimal Cost

The average pickup duration in Mina is 5 minutes 38 seconds with linear distance and 5 minutes 29 seconds with road distance, an improvement of 9 seconds on average. Taking into account that there are 6000 trips in a day (we assume the data provided equals the total population), road distance saves 15 hours of driving to pickup per day.

Code
# Calculate mean duration per distance type in city Mina
df |> 
  filter(city_id == "mina") |> 
  group_by(distance_type) |> 
  summarise(mean(duration)) |> 
  reactable()
Code
# Calculate number of distinct trips per day in city Mina
df |> 
  filter(city_id == "mina") |> 
  group_by(date(started_at)) |> 
  summarise(n_distinct(trip_id)) |>
  reactable()
Code
# Calculate the time saving (in hours) in a day with road distance in Mina (on average)
6000 * 9 / 3600
[1] 15

Let's assume:

  • Our company assigns the route to the driver, so we are responsible for the time spent going to pickup. In that case we should pay the driver a fare for the time spent going to pick up the passenger.

  • Let's say the fare for the time spent going to pick up a passenger is half of the time fare during the trip. Assuming the time fare during the trip is $0.10/minute, the time fare going to pickup would be $0.05/minute.

Under these assumptions, the maximum price that our company should pay per query (assuming only one query is needed per trip) would be: $0.05/minute × 0.15 minutes of time saved on average (9 seconds) = $0.0075/query.
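
The break-even calculation can be written as a small function to test other assumptions. The helper `max_cost_per_query` and the `queries_per_trip` argument are hypothetical, and the second call uses an invented value to show how the result scales if more than one query were needed per trip.

```r
# Maximum price per API query that the time saving can pay for.
# saving_s: average seconds saved per trip
# fare_per_min: assumed fare for the time going to pickup ($/minute)
# queries_per_trip: assumed number of API queries needed per trip
max_cost_per_query <- function(saving_s, fare_per_min, queries_per_trip = 1) {
  (saving_s / 60) * fare_per_min / queries_per_trip
}

# One query per trip, as assumed in the text
max_cost_per_query(saving_s = 9, fare_per_min = 0.05)

# Hypothetical scenario: 5 queries per trip (for example, one per candidate driver)
max_cost_per_query(saving_s = 9, fare_per_min = 0.05, queries_per_trip = 5)
```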

If we cannot assume that the company is responsible for the time spent going to pickup, and the driver bears the cost of that time, then we should analyse the problem from another perspective. Instead of measuring the time saving, it would be more appropriate to evaluate the increase in the number of trips. In that case, we should detect a significantly higher number of trips using road distance compared with linear distance.

Coming back to the price per query concern, it is important to consider the potential impact of the new feature on the business. This might involve estimating how the observed difference in trip duration translates into real-world outcomes, such as cost savings, improved customer satisfaction, or increased efficiency.

On the other hand, there are costs associated with implementing the service, for example the time and effort the engineering team will need to integrate this API into the current system.

In any case, reducing the duration of trips would be positive for the company, since it would increase both customer and driver satisfaction, improve customer retention, and reduce cancellation rates. Finally, it would support one of the key values of the company: respecting the environment by reducing the CO2 emissions of driving with no passengers.

Insights & Recommendations

  • In Mina, road distance reduces the pickup time by 9 seconds per trip on average, which amounts to 15 hours of driving per day

  • Assuming a pickup fare of $0.05/minute, the maximum price that we should pay is $0.0075/query

  • If the company is not responsible for the time spent going to pickup, it would be more appropriate to evaluate the number of trips

  • In general, reducing the time spent going to pickup would be positive for the company, since it would increase customer and driver satisfaction and reduce cancellation rates

Suggestions & Next Steps

From the design point of view, the experiment was run over 5 days, although the data only covers 2. Even if all 5 days had been available, it would be interesting to extend the experiment period to check whether there is any effect of the day of the week, or even of the time of year. We understand from the business perspective that keeping an experiment running for several months is difficult, since it must be isolated from other experiments and external factors. A few more weeks would be enough.

Distance type labels were assigned based on the trip_id, which we suppose is generated randomly (we can verify this by checking the frequency of the data by the starting character of the trip_id). Nevertheless, the labels were assigned unevenly: 9 groups of starting characters to road distance and 7 groups to linear distance. Consequently, the sample sizes differ between the groups.
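
The check on the starting character could look like the sketch below. It assumes the first character of trip_id is a hexadecimal digit (0-9, a-f), which is our reading of the 9 + 7 = 16 groups. The helper `check_leading_char` and the synthetic IDs are hypothetical; in the report the function would be applied to the real trip_id column.

```r
# Counts per leading character of trip_id, share assigned to road (0-8),
# and a chi-squared goodness-of-fit test against a uniform distribution.
check_leading_char <- function(trip_id) {
  hex_digits <- c(0:9, letters[1:6])   # "0".."9", "a".."f"
  first_char <- factor(substr(trip_id, 1, 1), levels = hex_digits)
  counts <- table(first_char)
  list(counts = counts,
       share_road = sum(counts[as.character(0:8)]) / sum(counts),
       uniformity = chisq.test(counts))
}

# Synthetic IDs with a uniformly random leading character (not real data)
set.seed(42)
fake_ids <- paste0(sample(c(0:9, letters[1:6]), 5000, replace = TRUE), "-example")

res <- check_leading_char(fake_ids)
res$share_road            # close to 9/16 = 0.5625 if leading characters are uniform
res$uniformity$p.value    # a small p-value would suggest non-uniform IDs
```

With a 9 versus 7 split, about 56% of the trips end up in the road group, which explains the unequal sample sizes.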

Moreover, running both distance metrics at the same time in the same city could introduce error, since one probably influences the other. A better approach would be to assign road distance during one time period and linear distance during another, and compare the volume of trips of both groups.

Finally, we recommend collecting additional information such as customer satisfaction or cancellation rates. Gathering insights on user satisfaction is crucial for assessing the overall experience provided by the ride-hailing app. For example, this can be collected through passenger feedback after each trip. By understanding the satisfaction levels, we can determine whether the distance type has a positive effect on the service. Cancellation rates, in turn, could help us measure whether road distance reduces cancellations both for passengers, since they wait less time, and for drivers, since they arrive earlier at the pickup point with an optimal route. In addition, as part of the company's commitment to sustainability, we could integrate a tracking mechanism to estimate and analyse the CO2 emissions associated with each vehicle and evaluate the effect of the road distance type. The best approach could then be selected based on a broader set of variables, following the company's mission and commitments.