A Quick Tale of Failure: Getting from Bikeshare Station A to Bikeshare Station B in Chicago without getting charged extra.

[Note: I've made a Jupyter Notebook (Python) for this so that you can mess around with a few of these ideas yourself. The figures come from this notebook.]

One warning about this notebook: I tried to do something fancy with indices in the first part, and I did not like what I did. I'll eventually redo it but, for now, just go with it. I tried to use the "station id" instead of the pandas index, which got pretty annoying pretty fast and wasn't clean. The second part fixes this.

The divvy_bike_dist_matrix file for the first part (which the notebook refers to) is given here.

The dist_matrix_2 file for the second part (which the notebook refers to) is given here.

⚫ ⚫ ⚫ ⚫

In Chicago we have a bikeshare system called Divvy Bikes. The gist is that you pay a fairly reasonable yearly rate and can take out a bike whenever you want, so long as you return it to some Divvy station within 30 minutes. You can extend a trip by "checking in" at a station: to go from A to C, you might ride from A to B in 29 minutes, dock your bike at station B, take it out again (you now have a fresh 30 minutes to ride), and go on to station C. If you don't dock in time, you're charged a small fee.

The question is: Can I go from station A to station B in this way without getting charged, regardless of where A and B are?
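One way to phrase this question is as reachability in a graph: stations are nodes, and there is an edge from one station to another whenever a single ride between them fits inside the time limit. Here is a minimal sketch under that framing. The station names and ride times are made up, and `reachable_stations` is a hypothetical helper, not something from the notebook.

```python
from collections import deque

def reachable_stations(start, ride_minutes, limit=30):
    """Return every station reachable from `start` by chaining rides
    that each take at most `limit` minutes (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        here = queue.popleft()
        for there, minutes in ride_minutes.get(here, {}).items():
            if minutes <= limit and there not in seen:
                seen.add(there)
                queue.append(there)
    return seen

# Hypothetical ride times in minutes. Direction matters, so A -> B
# being listed says nothing about B -> A.
ride_minutes = {
    "A": {"B": 29, "C": 45},
    "B": {"C": 25},
    "C": {"A": 45},
    "D": {"A": 10},
}

print(sorted(reachable_stations("A", ride_minutes)))  # ['A', 'B', 'C']
```

A to C takes 45 minutes directly, but checking in at B makes it reachable. D is never reached because no ride leads into it.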

My first thought here was to take the Divvy Data and use the Google Maps Directions API to give me the distance between stations. We can't just use Euclidean distance because that'd go through buildings, and I didn't know a better way to figure out Manhattan distance than Google's Directions API. As of Q1Q2 2015, there were 470 stations, which means finding directions for $$470^{2} - 470 = 220{,}430$$ ordered pairs of stations (remember, the distance from A to A is 0, but when biking the distance from A to B is not necessarily the same as the distance from B to A; some roads are one-ways!). The Google Directions API rate limit: 2,500 free directions requests per day. Great, this small project will only take me some 89 days. Maybe there's another way.
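The arithmetic behind that estimate is quick to check. The numbers below are the ones quoted in the paragraph above; the variable names are just for illustration.

```python
import math

stations = 470        # station count used in the formula above
daily_quota = 2500    # free Directions requests per day, as quoted above

# Ordered pairs: A -> B and B -> A are separate requests, A -> A is skipped.
ordered_pairs = stations * (stations - 1)
days = math.ceil(ordered_pairs / daily_quota)

print(ordered_pairs)  # 220430
print(days)           # 89
```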

I had an idea. Maybe we can use the Divvy Data to tell us if people can get from A to B! In the data we have the station that a person has come from, the station they went to, and the trip duration. If we took the median trip duration for each pair of stations, then we could find what stations can get to what stations in 30 minutes or less!
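A rough sketch of that idea, in plain Python to keep it dependency-free (in pandas this would be a group-by on the two station columns followed by a median). The trips below are invented, and the assumption that durations are in seconds is mine.

```python
from collections import defaultdict
from statistics import median

# Made-up trips: (from_station_id, to_station_id, duration in seconds).
trips = [
    (5, 17, 600), (5, 17, 900), (5, 17, 4000),
    (5, 20, 2400),
    (17, 5, 700),
]

# Collect every recorded duration for each ordered pair of stations.
durations = defaultdict(list)
for start, end, seconds in trips:
    durations[(start, end)].append(seconds)

# Median duration per pair, converted to minutes.
median_minutes = {pair: median(secs) / 60 for pair, secs in durations.items()}

# Pairs whose median ride fits in the 30-minute window.
within_limit = {pair for pair, minutes in median_minutes.items() if minutes <= 30}
print(sorted(within_limit))  # [(5, 17), (17, 5)]
```

The median keeps the one 4000-second outlier from dragging the 5 to 17 pair over the limit. Note that a pair with no recorded trips never shows up in `median_minutes` at all.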

So, I embarked on this journey. Besides having to clean the data a bit (there are some very, very long rides) and deal with indices, there wasn't much heavy lifting to do. The indices were a pain because the ids of the stations skip numbers, they aren't necessarily in any reasonable ordering, and I didn't want to reindex them in case a Divvy station is removed or added in the future, thus affecting the index of the other stations. Ultimately, this was a choice I made, but I probably should have just reindexed. It's above in a Jupyter Notebook, along with the necessary data file.

Here's one of the results. The gray pointer is the station I'm starting at. The red pointers are stations I should not be able to get to, according to the data.

A few things wrong here. Some of these stations one could certainly get to in 30 minutes or less. What gives? Here's what I realized when I was testing out my "great idea":

When there are a significant number of possibilities for a start and end with significantly different levels of popularity, even a large sample size may not include all combinations.
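A small simulation makes this concrete. Everything here is hypothetical: 50 stations, a made-up popularity curve where station k is chosen with weight 1/(k+1), and start and end drawn independently. Even with several times more trips than ordered pairs, some pairs never show up.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_stations = 50
stations = list(range(n_stations))
# Hypothetical popularity: station 0 is the busiest, station 49 the quietest.
weights = [1 / (k + 1) for k in range(n_stations)]

n_trips = 20_000
starts = random.choices(stations, weights=weights, k=n_trips)
ends = random.choices(stations, weights=weights, k=n_trips)

# Distinct ordered (start, end) pairs that appear at least once.
observed = {(a, b) for a, b in zip(starts, ends) if a != b}
total_pairs = n_stations * (n_stations - 1)

print(f"{len(observed)} of {total_pairs} ordered pairs observed")
```

The pairs that go missing are the ones between two unpopular stations, which says nothing about whether the ride between them is physically possible.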

Hindsight is 20/20, of course, but let's apply this really quickly. From my biking experience in Chicago, I know that most of the "closer" red markers sit off the main roads, near areas with better Divvy access (people there may opt to go to the larger Divvy stations, or to take a train, take a bus, or walk). It may also be the case that users see a complicated Divvy ride over a long distance (close to 30 minutes) as cutting it too close, and would rather do something else.

Choosing some other stations (like one close to a popular area, Belmont, station id = 296) we get more reasonable results. This is most likely because more people will ride from here to many other stations in many other places. But we have to wonder: are the "impossible to get to" stations actually impossible, or is it that no one goes there?

We could do further analysis here; indeed, it would only take tweaking a few lines of code. But it wouldn't get me closer to a reasonable answer using this method.

⚫ ⚫ ⚫ ⚫

Any other smart ideas, dummy?

Sometimes it might just be easier to do the simplest thing. Euclidean distance doesn't quite get us the exact numbers, but it might give us something close. (Note: actually, we'd want to use something more like the haversine formula if we wanted to be a tad more precise, since lat and lng don't translate directly to miles.)
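For reference, here is a minimal sketch of the haversine formula, which gives the great-circle distance between two latitude/longitude points. The function name and the mean Earth radius constant are my choices, and the coordinates in the usage line are illustrative points near Chicago, not actual station locations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Two illustrative points, not real station coordinates.
print(round(haversine_miles(41.88, -87.63, 41.94, -87.65), 2))
```

Over distances as short as a bike ride, the result is very close to what a flat-plane approximation would give; the formula mostly saves you from having to scale latitude and longitude differently by hand.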

The only problem here is that Euclidean distance is going to be much worse at predicting $$< 30$$ minute trips for stations that are far apart than for stations that are close together. For stations that are close, because of Chicago's grid-like streets, there's almost always a quick way from point A to point B. Euclidean distance divided by average speed might give us a time that's off a bit, but it probably won't be off by much. Similarly, if the straight-line estimate already says the ride would take 30 minutes or more, then the real route, which can only be longer, is not going to fit in the time limit either.

The two tricky parts here are:

  1. That small window where Euclidean distance gives us that the cyclist could make it in, say, 20 to 30 minutes. We're not sure if the trip is only possible along the straight line, or on real streets too.
  2. Related, on longer trips the grid system may apply less. For example, crossing east-west in Chicago when on the north side is much harder (and much more dangerous!) than going north-south. We can't trust our straight-line estimates on longer trips that may require detours, a significant number of stops at traffic lights, etc., or ones where the grid system breaks down.
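The reasoning above can be sketched as a three-way classification of a straight-line estimate. The 10 mph average speed is an assumption of mine, not a measured value, and `classify_trip` is a hypothetical helper; the 20 to 30 minute "uncertain" window is the one described in the first item.

```python
def classify_trip(straight_line_miles, speed_mph=10.0,
                  limit_min=30.0, margin_min=10.0):
    """Classify a single ride by its straight-line time estimate.

    speed_mph is an assumed average cycling speed, not a measured one.
    """
    minutes = straight_line_miles / speed_mph * 60
    if minutes >= limit_min:
        # Real streets can only make the trip longer than the straight line.
        return "unreachable"
    if minutes >= limit_min - margin_min:
        # Fits on paper, but detours and traffic lights could push it over.
        return "uncertain"
    return "reachable"

for miles in (1.0, 4.0, 6.0):
    print(miles, classify_trip(miles))
# 1.0 reachable
# 4.0 uncertain
# 6.0 unreachable
```

The labels apply to one ride only. Whether a far station can be reached by chaining several rides is a separate question.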

Nevertheless, if we ignore this and do things in a naive way, we can make an approximation of this idea (also in the Jupyter notebook).

Sure, so what?

This post was just a recording of some things that I tried to do with the Divvy bike data. I didn't dig in too deeply, but hopefully this gave some idea of some methods that one can use to initially look at data. I'll hopefully append things to this post in the future as I think of them.