Citibike: Analyze yourself!


In the 5 years that I have lived in NYC, the biggest improvement anything’s made to my quality of life has been Citibike. I like being on wheels, I like being energy efficient, I like the wind in my face, I like my freedom of commute (a regular bike binds itself to you), and I like that it saves my trip data! I like to think I am a Fastridahhh, but we are not always what we think we are. So by baselining myself against this analysis, I wanted to see where I actually truly lie.

The task was simple, get the lat/long of my start and end citibike stations, use Google Distance Matrix API to figure out the biking distance between the two locations, use the time from my citibike data, and get my average speed and see how it changes with distance etc.

Once the stew was cooked, the results were fairly interesting.
My average speed was 13.7 kmph (8.5 mph), but I have done no outlier removal or anything because lazy. The median was pretty much exactly the same as average (13.56 kmph), indicating an even distro. This was a little disappointing frankly, I was only slightly faster than the overall average according to the wider analysis (8.3 mph) and my self image rightly got punched back into its place. I can make up excuses for why it is slower than I feel I ride at, but that is just being dishonest (I have already thought of three, including shifting the blame on someone else).

Most of my trips are short distances that are annoying to walk (1-2 km), with nearly 50% falling in that range, the rest are equally distributed across the other buckets.

Distance Range # Trips Average time (min) Average Speed
0 to 0.5K 8% 1.53 8.24
0.5K to 1K 21% 3.87 11.90
1K to 2K 48% 5.89 13.09
2K to 5K 22% 11.31 14.15
Greater than 5K 1% 16.12 18.84
Grand Total 100% 6.38 13.36

I also wanted to look at how speed varies with distance, as you would expect it to be highly correlated (they were, with a correlation coefficient of 0.4). But I also created a polynomial best fit line with an order of two (anything higher is just overfitting and the data seems to follow a parabolic pattern rather than linear). The R-squared sits nicely (for such unclean data with so few datapoints, especially for higher distances) at 0.265, which vindicates my hunch about it being a parabolic curve. Here’s the pretty chart for your pretty eyes to peruse:

Speed v Distance

If you want to do it too:
Please do, and let me know what your results look like.

Getting the lat/longs:
Surprisingly, Citibike doesn’t have any place that you can get the lat/longs of stations from unless you distill it from their trip data. So I did that. And here’s the file for all you guys (updated till March 2017, includes Jersey City citibike locations)
Citibike Stations Lat Long

Using the Google Distance Matrix API:
I prefer Excel for small analysis, so wanted to write a VBA code that would get me the distance between two lat/long pairs. Here’s what I made, with a little help from my friends over at StackOverflow given that I don’t really know VBA. You can create a new module, paste this code there save it, and then use the function called getDist to get the distance between the lat/long of the locations (which you’ll need to vlookup from the station lat/long map that I shared earlier)

(WordPress is being a prick and not letting me format this properly, so please insert tabs at appropriate places)


Public Function getDist(orig_lat As Double, orig_Double As Double, end_lat As Double, end_long As Double) As String

Dim xhrRequest As XMLHTTP60
Dim domDoc As DOMDocument60
Dim domDoc2 As DOMDocument60
Dim nodes As IXMLDOMNodeList
Dim node As IXMLDOMNode

'You must acquire a google api key and enter it here
Dim googleKey As String
googleKey = "XXX" 'your api key here

'Send a "GET" request for place/textsearch
Set xhrRequest = New XMLHTTP60

Dim requrl As String
requrl = "https://maps.googleapis.com/maps/api/distancematrix/xml?origins=" & orig_lat & "," & orig_Double & _
"&destinations=" & end_lat & "," & end_long & "&mode=bicycling" & "&key=" & googleKey

xhrRequest.Open "GET", requrl
xhrRequest.send

'Save the response into a document
Set domDoc = New DOMDocument60
domDoc.LoadXML xhrRequest.responseText

'Find the first node that is called "distance" and is the child of the "result" node. The SelectSingleNode will pick the first result, as google can return multiple options
Set ixnlDistanceNodes = domDoc.SelectSingleNode("//distance/value")

If Not ixnlDistanceNodes Is Nothing Then
getDist = ixnlDistanceNodes.Text
Else
getDist = -1
End If

Set domDoc = Nothing
Set xhrRequest = Nothing

End Function

2 comments

  1. Congrats on producing an interesting analysis. Not sure if the eyes in my case are pretty, but the graph certainly is.
    Would like to pose a question to push your thinking regarding a few considerations here, because I am a bit surprised to see the results (higher the distance, higher the avg speed for relatively small journeys upto 3km):

    Are we to infer that we tend to bike faster for a higher distance journey? Or is it other factors that is decreasing the average for a short distance journey?

    a. Is this because we spend (say 30 to 45 sec per ride?) getting the bike, putting our bag on it, accelerating to top speed and then at the end decelerating, possibly walking a block with the bike and putting it back into its dock? The effect of this time on the avg speed for short journeys will be immense and for long journeys will not be so much. What does the graph look like after this correction?

    b. Did the Google Distance Matrix API give you biking distance between lats & longs or did it give you the ‘as the crow flies’ distance? In case it is the latter, then the overprediction in avg speed of a longer distance journey could be very different as compared to the overprediction of avg speed of a shorter distance journey depending on what kind of journeys you typically ride.

    1. Google API gives the road distance, otherwise the speeds would have been much higher.

      My hypothesis for why longer trips takes lesser time is that 1. time spent docking/undocking, checking directions etc before you start biking punishes short trips more than longer, and 2. that traffic light stops also punish shorter trips more than longer, as the longer trips are usually in riverside bike lanes or bridges, and red light time becomes a fixed constant for bigger trips, while for smaller ones it can affect the speed dramatically (can go either way however).

Say something...

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s