The New York City Taxi and Limousine Commission releases cab ride data monthly, including pick-up and drop-off times, locations, base fare, and tip.
I believe there is a lot to learn from this data that would make the cab rides more pleasant for both drivers and passengers.
It would also be of interest to many transportation companies and New York City in general.
I have downloaded the data for cab rides that took place in New York City in 2015 and done some preliminary analysis on it.
The corresponding data is over 25 GB. This represents about 77.9 million rides in the city (after trimming to trips originating and ending in Manhattan and winsorizing the extremes).
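As a rough illustration of the winsorizing step, here is a minimal sketch that clamps extreme values to chosen percentiles. The 1st and 99th percentile cut-offs and the nearest-rank percentile rule are assumptions made for this example; they are not necessarily the thresholds used to arrive at the numbers in this post.

```python
def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Clamp values below/above the given percentiles to those percentile values.

    The percentile cut-offs are illustrative assumptions, not the post's actual settings.
    """
    if not values:
        return []
    ordered = sorted(values)
    n = len(ordered)

    def percentile(pct):
        # Nearest-rank style lookup on the sorted data.
        idx = int(round(pct / 100 * (n - 1)))
        return ordered[max(0, min(n - 1, idx))]

    lo, hi = percentile(lower_pct), percentile(upper_pct)
    # Values inside [lo, hi] are untouched; outliers are pulled to the nearest bound.
    return [min(max(v, lo), hi) for v in values]


# Example: one absurd fare among otherwise ordinary ones.
fares = list(range(1, 101)) + [10000]
clean = winsorize(fares)
```

Unlike simply dropping outliers, winsorizing keeps the number of rides unchanged, which matters when the analysis is based on ride counts.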
Ride frequencies
The number of rides over the whole of 2015 is distributed across the hours of the day as shown below. A quick comparison of monthly ride frequencies clearly shows that people take more cabs in the cold seasons. That is to be expected, considering that New Yorkers like to walk, especially when it is warm.
Another interesting observation is that New Yorkers tend to take more cabs after work hours (6:00PM and later), when a plethora of the city's social activities is underway. It is almost as if people are in no hurry to get to work in the mornings. Kidding aside, the difference between the broad peak at 6:00PM-to-11:00PM and the narrow peak at 8:00AM-to-10:00AM corresponds to 940 thousand cab rides. This is pretty extraordinary and definitely calls for a deeper investigation to answer the question "Why?"
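The hourly distribution can be reproduced with a simple count of rides per pick-up hour. The sketch below assumes pick-up timestamps arrive as strings in the `YYYY-MM-DD HH:MM:SS` format; the function name `rides_per_hour` is hypothetical and only for illustration.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def rides_per_hour(pickup_times):
    """Return a 24-element list: number of rides whose pick-up falls in each hour."""
    counts = Counter(
        datetime.strptime(t, TIME_FORMAT).hour for t in pickup_times
    )
    return [counts.get(hour, 0) for hour in range(24)]


# Tiny made-up sample, not real trip records.
sample = [
    "2015-01-15 19:05:39",
    "2015-01-15 19:45:00",
    "2015-01-16 08:10:00",
]
hourly = rides_per_hour(sample)
```

On the full data set the same counting would be done in chunks, since the files are far too large to hold in memory as Python strings.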
Fare and tip preferences
I now turn to the distributions of the base cab fare and the tip.
Looking at the frequency of the base cab fare and tip paid, we obtain gamma-like distributions with peaks at c. $6.25 and c. $1.50, respectively.
The base fare has a broad peak, with over 50% of passengers paying in the $5-$10 range, while the tip paid has a narrower peak.
Although the ride frequencies fluctuate significantly, the overall tip preferences do not vary much from month to month. It is safe to say that, just like restaurant tipping, New Yorkers have internalized their cab-tipping standard.
The tip-percentage distribution (green curve) on the right describes that standard with a global maximum at 21% and two local maxima at 9-11% and 26%.
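A tip-percentage distribution of this kind can be sketched as follows. The example assumes the percentage is taken relative to the base fare (the denominator is an assumption made here) and uses 1%-wide bins, which is also an illustrative choice.

```python
from collections import Counter


def tip_percentages(fares, tips):
    """Tip as a percentage of the base fare; rides with a non-positive fare are skipped."""
    return [100.0 * tip / fare for fare, tip in zip(fares, tips) if fare > 0]


def histogram(values, bin_width=1.0):
    """Map the lower edge of each bin to the number of values falling in it."""
    counts = Counter(int(v // bin_width) for v in values)
    return {b * bin_width: c for b, c in sorted(counts.items())}


# Made-up fares and tips, for illustration only.
fares = [10.0, 10.0, 5.0, 0.0]
tips = [2.0, 2.5, 1.0, 1.0]
pct = tip_percentages(fares, tips)
hist = histogram(pct)
most_common_bin = max(hist, key=hist.get)
```

The bin with the highest count plays the role of the global maximum in the curve; local maxima show up as bins that are higher than both of their neighbors.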
Distance traveled
An important measure in this exercise is the distance traveled.
A quick look at the distance data immediately reveals a drop of more than 25% in the distance traveled between 8:00AM and 6:00PM compared to midnight through 5:00AM. This can be attributed to the continuous traffic congestion in Manhattan during work hours, and possibly to passengers not riding all the way to their destinations because of heavy traffic around the city's major points of interest.
One can add an extra measure to the exercise, which I call the "true distance ratio": the ratio of the distance measured by the cab's odometer to the great-circle distance (the shortest path along the Earth's surface) between the pick-up and drop-off locations. This minimum distance is not provided in the commission's data, but it is easy to calculate using the haversine formula.
My calculations show (red curve, below) that this ratio is above its average (1.23) only between 11:00PM and 5:00AM.
I am going to speculate that this is because, in light traffic, cabs take the highways (FDR Drive and the Henry Hudson Parkway) rather than the inner-city roads with their lower speed limits and many traffic lights.
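For reference, here is a minimal sketch of the haversine calculation and the resulting ratio. It assumes coordinates in decimal degrees, a recorded trip distance in miles, and a mean Earth radius of 3958.8 miles; the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def true_distance_ratio(trip_miles, lat1, lon1, lat2, lon2):
    """Recorded trip distance divided by the great-circle distance.

    Returns None when pick-up and drop-off coincide, since the ratio is undefined.
    """
    direct = haversine_miles(lat1, lon1, lat2, lon2)
    if direct == 0:
        return None
    return trip_miles / direct


# One degree of latitude, used here as a sanity check.
one_degree = haversine_miles(0.0, 0.0, 1.0, 0.0)
ratio = true_distance_ratio(2 * one_degree, 0.0, 0.0, 1.0, 0.0)
```

A ratio close to 1 means the cab followed a nearly straight path; on a street grid such as Manhattan's, values noticeably above 1 are expected even for efficient routes.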
Geographical analysis
The ride frequencies discussed above reveal that there are approximately three thousand more rides per hour between 6:00PM and 10:00PM than in the next busiest time interval (8:00AM-to-10:00AM, when ~11,000 rides are taken every hour).
To understand this phenomenon, it might be useful to track where these rides originate and end.
My hypothesis for this exercise is that people take more cabs to social events and back to their homes afterwards. So, I decided to investigate drop-off locations in two time windows: 6:00PM to 8:00PM and 8:00PM to 10:00PM.
The resulting graphs reveal that drop-offs during those times are concentrated in some social-event-rich areas (Times Square, Murray Hill, Chelsea, Greenwich Village) and residential areas (the Upper East and Upper West Sides) of Manhattan, which agrees well with my starting hypothesis.
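One simple way to produce such drop-off maps is to keep only the rides that end within a time window and count them on a coarse latitude/longitude grid. The sketch below is only an illustration: the 0.005-degree cell size, the `(time, lat, lon)` tuple layout, and the function name `dropoff_grid` are all assumptions.

```python
import math
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def dropoff_grid(trips, start_hour, end_hour, cell_deg=0.005):
    """Count drop-offs per grid cell for rides ending in [start_hour, end_hour).

    `trips` is an iterable of (dropoff_time, latitude, longitude) tuples.
    """
    counts = Counter()
    for dropoff_time, lat, lon in trips:
        hour = datetime.strptime(dropoff_time, TIME_FORMAT).hour
        if start_hour <= hour < end_hour:
            # Integer cell indices; floor keeps negative longitudes consistent.
            cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
            counts[cell] += 1
    return counts


# Made-up drop-offs, for illustration only.
trips = [
    ("2015-06-05 18:30:00", 40.7580, -73.9855),
    ("2015-06-05 19:10:00", 40.7581, -73.9856),
    ("2015-06-05 21:00:00", 40.7794, -73.9632),
    ("2015-06-05 09:00:00", 40.7580, -73.9855),
]
early_evening = dropoff_grid(trips, 18, 20)
late_evening = dropoff_grid(trips, 20, 22)
```

Cells with the highest counts correspond to the hot spots on the map, and comparing the two windows shows how the concentration of drop-offs shifts over the evening.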
Concluding remarks
It seems there are quite a few interesting patterns in these trip records that deserve further processing, which in the end may help make cab rides more pleasant for both drivers and passengers, and benefit the city as well.
In that regard, it may be quite interesting to extend this analysis to the entire database (2009-16).
Such an extended analysis may reveal, for instance, the weekday/weekend preferences of passengers. A comparison of work days with vacation time could also bring more insight into the during-work and after-work difference discussed above.