top of page

Google Data Analytics Capstone:
NYC Taxi Data Analysis

Project Goal and Stakeholders

​

On September 2022, I was able to complete the Google Data Analytics Certificate program that was offered by Coursera. Part of the requirements was to complete a Capstone Project that involved applying skills learned during the Certificate Course. Although I had completed the recommended Case Study, I had an opportunity to further practice my skills by working on an interactive dashboard and presentation using a publicly available dataset. For my project, I opted to use NYC Limo and Taxi Trips, a popular dataset that can be found on Google Big Query.

​

To prepare for this project, I wanted to put myself in the shoes of someone who was working for the NYC Taxi and Limousine Commission (TLC). As a result, I ended up doing a little bit of research about NYC taxis in general in regards to their history in NYC and the current challenges in regards to them. Doing so allowed me to enter the mindset in which I imagined I was tasked to report on how taxi performance was in the previous year in terms of revenue generated from fares and identify insights that could help NY TLC with its operations in the foreseeable future. 


After reviewing what data I could realistically gather from the dataset I chose, I was able to identify the following questions that would help guide my analysis: 

  1. Which boroughs generate the most income from Taxi Fares in NYC?

  2. At what point of the year, week, and day is it strategic for taxis to operate?

  3. Which zones in particular generate the most revenue and activity in terms of trips? 

​

Approach and Methodology

​

DATA CLEANING AND MANIPULATION​

​

In Google Big Query, the NYC Dataset was divided into multiple tables containing trip data on yellow taxis, green taxis and for-hire vehicles from as early as 2011. After looking at similar analyses that were conducted in the past, I opted to focus on limiting the scope of my analysis to yellow taxis in the span of one full calendar year. I felt that would be within my capabilities and still provide me with enough insight to answer the questions I had identified. 

​

I then took the time to review the different fields and their descriptions in the publicly available tables. I identified which fields in particular I wanted to work with in my dataset, and what extra information I needed to obtain through additional processing in my SQL queries. As such, I was able to devise a gameplan centered around the information found from 3 tables I wanted to specifically work with:
 

  1. Tlc_yellow_trips_2022 - Contains yellow taxi trip data from 2022

  2. Tlc_yellow_trips_2021 - Contains yellow taxi trip data from 2021

  3. Taxi_zone_geom - Contained the specific zone names and their corresponding ID and spatial data that can be used for map visualizations

​

My first step was to identify which fields in 2022 taxi data I was interested in obtaining information on and do some initial cleaning and processing. Through an initial inspection, I had found out that it only had completed data up until the middle of July 2022. That said, I was able to create an initial query that allowed me to create additional columns to compute for more granular computations such as the total trip time in minutes, or the weekday and hour when passengers are picked up for a trip: 

In the query above, I did some filtering to ensure that it only included the trips I was interested in looking at from January 2022 - June 2022, and to remove negative values I noticed in regards to passenger count, trip distance and the total amount paid. I also did some research and found that taxi drivers are mandated to drive at 12-hour shifts at most so I included a filter to ensure that it only showed data for trips ranging from 1 minute to 720 minutes (i.e. 12 hour trip) . 

​

The second part of my plan was to join this table with the geom data tables so that it could indicate the specific zones where taxis picked up and dropped off passengers. Initially, I wanted to also include spatial data as well, but due to challenges I had in processing and saving to Big Query, I opted to download the spatial data as a csv file and upload it directly to Tableau so that it could be connected later on to my main table. As a result, I was able to accomplish the join with the following query: 

​

​

​

​

​

​

​

​

​

​

Once I saw that I was able to query this new table successfully, I repeated the process with 2021 data to pull information from July 2021 - December 2021 and used Union to combine everything into one table. As a result, this is the final query I was able to generate:

MAIN ANALYSIS & VISUALIZATIONS

​

When it came to carrying out my analysis, I sought to ensure that any insights I obtained would allow me to answer the guide question I had originally identified, but that I would also still be able to continue exploring the data in different ways even if I had already had an initial answer. 

 

Which boroughs generate the most income from Taxi Fares in NYC?

​

Doing a quick review of the data immediately shows that Manhattan is the highest performer when it comes to the total amount generated in fares and the number of trips that occur within the Borough. 

Overwhelmingly, Manhattan had the highest Trip Volume and Amount Earned in Fares

I wanted to figure out what factors may have led to this so I started experimenting with the other fields I had access to. I began playing around with the idea of using a scatter plot to map out what the average fare in each zone would look like when you compared this with other factors such as the Trip Time. It was only when I added the extra layer of clustering the points in terms of Boroughs did I notice that a lower average fare and lower trip time characterized Manhattan trips. For me, this was particularly interesting to highlight because it builds on the fact that Manhattan has the highest number of trips and shows that the best strategy for increasing earnings through taxi trips isn’t necessarily by having high fares, but by being able to provide a high volume of trips that are cheaper and shorter in nature. 

It was only when I decided to cluster the zones in terms of Boroughs did the visual become more insightful.

At what point of the year, week, and day is it strategic for taxis to operate?

​

When answering this question, I had two key visuals I already had in mind that would help me approach this question. The first one was a straightforward line graph that showed taxi demand across the span of a full year. What I found interesting to explore were the peak periods in which demand for taxis is at their highest (i.e. October and March). While I initially looked at Manhattan, I also found that the peak periods and months when trip count decreases are generally the same across the different Boroughs. This led me to consider that perhaps seasonality could affect the volume of trip activity for taxis, as the peak months and dips can be associated with the 4 seasons in US weather. 

 

The second visual was a heatmap showing which times in the week there was a high intensity of taxi trips. Similar to my thought process in looking at the yearly data, it was important for me to consider what would usually happen during weekdays or weekends that could influence the amount of activity that’s seen on the road by taxis. This is how I was able to arrive at the insight that commuting to work on weekdays and traveling for leisure on weekends influenced a high need for taxis at given points of this week.

Showing how the demand changed throughout the year and throughout a given week was very enlightening in terms of identifying what peak periods were for taxi activity. 

In general, I saw these 2 insights as important to highlight because it showed that the number of trips conducted by yellow taxes tended to vary across multiple time periods even at an hourly level. 


Which zones in particular generate the most revenue and activity in terms of trips? 
 

I began comparing what were the Zones with the highest total fares generated and what immediately jumped out at me was the fact that the taxi zones with the highest earnings were actually originating from airports. I already knew that Queens was second to Manhattan in terms of number of trips and amount earned in Fares, but it was very enlightening to know that the main drivers of this performance were the 2 airports found in Queens: JFK and La Guardia.

​

With the help of Tableau-calculated fields, I was also able to uncover that the most popular routes that taxis took almost always involved traveling within the Manhattan area. Even if a trip didn’t originate in Manhattan, the destination was more often than not headed towards Manhattan.  
 

These 2 graphs helped reinforce how popular Manhattan was in terms of taxi travel. It surprised me though to see that only airports would overtake Manhattan Zone in terms of earnings. 

​When I started to think about what were some of the famous landmarks of NYC and compared this side by side with what I was able to generate on a map visualizing trip volume, I realized how much of a factor tourism plays in the actual trip activity experienced by drivers in Manhattan. This became important to highlight as it further reinforces how much of a hotspot Manhattan is in terms of tourism and how as a result, the demand for taxis and opportunity to earn from fares is greater alongside NYC’s most popular landmarks and attractions. 

I found it very important to highlight these 2 pictures side by side together to reinforce how much tourism plays into the Taxi Activity seen in the Manhattan area. 

DASHBOARD AND PRESENTATION DESIGN

​

When  building a visualization, I always try to think in terms of what story I want to be telling to my audience and how I can best show that information in a way that will help them understand. This usually prompts me to try to keep things as simple and minimalistic as possible with my charts since I want my audience to already understand what is being said even without having me explain it in detail. In individual slides, I try as much as possible to include a caption that adds an additional insight on top of what the audience can gather by just looking at the visual.  

I generally avoid putting chart Titles in slides since I find that they become redundant with the I indicate in my captions. Admittedly though, I am still figuring out how I can be better at customizing my visuals in Tableau. 

Imagining how I believe my audience will interact with the visualizations also influenced what I opted to include in the design. I knew that even if I had a presentation deck and an dashboard with the dataset, I needed to approach the 2 differently because of how I imagined my audience interacting with the information. This allowed me to make decisions on which visuals were more relevant to the story of the presentation but not the story of the dashboard and vice versa. 

 

When it came to the presentation, I arrived at the conclusion that a deck that contained static visuals would be more relevant to a more senior audience who is interested in a more aggregate view of the information and what the key findings are. This is why I opted to create simpler visuals in some slides since I knew adding complexity in terms of the visualizations or adding different variations of the same visual may not necessarily allow me to communicate my point clearly. 
 

These are examples of visuals I found relevant to include in a presentation, but not necessarily in a dashboard. As much as possible, I also tried to avoid crowding a slide with too many insights.

The dashboard on the other hand, I would imagine would be more useful for individuals who want to further explore the data (In terms of Boroughs or Zones) and would want to choose from a series of different views and options. That’s why I opted to create as much interactivity with the dashboard as I could. It was also intentional for me to split the dashboard into 2 separate sections that could be linked together with navigation buttons and that there were also prompts so people knew where or what to click. From my past experience working with dashboards, it always helped to make things more user-friendly especially for people who may not necessarily be the most data-savvy.  I also knew that even if dashboards were meant to consolidate multiple visuals and indicators, having too much information in one page can be overwhelming. 

As I wanted individuals to look at the data in terms of Boroughs in general, and then in terms of Zones, I felt that it was more appropriate to split it into 2 sections. 

 

 

Project Outcome

​

After much work, I was finally able to complete an interactive dashboard and a sample presentation as outputs of the my NYC Taxi Data analysis project. Although I may not have been able to present to the NYC Taxi and Limo commission, this was a great application of the skills I had picked up through the Google Data Analytics course!

​​

Here is a presentation deck consolidating my main insights from the data analysis. 

Here is the link to the actual dashboard in my Tableau public profile. However, I do recommend viewing it in desktop instead of mobile!

The entire Google Analytics Certificate course took me about 3 months to complete. I was very proud to have finally finished my Capstone Project since it was a good application of the skills I had gained from the course. 

bottom of page