March 7, 2012

Time for MTA Bus Time

In mid-January, MTA launched MTA Bus Time, its real-time bus tracking and customer information system. It's currently live on Staten Island and the B63 route in Brooklyn, is expanding to the Bronx this year, and will cover the whole city in 2014.

It has web map, mobile web, and SMS text interfaces, along with a fully featured, standards-based developer API. Smartphone users can scan the QR code at each stop to be automatically directed to the Bus Time page for that stop.

Particularly compelling to the author is the approach we've taken to delivering this project quickly and inexpensively (for the short and long term). With MTA as the overall systems integrator, we've ensured that the system is constructed using open standards for all interfaces, and open source software licensing when appropriate. And, oh yeah, it runs on Amazon's cloud.

Please, let us know what you think.

Better late than never, right? (This post, not the project...)

April 3, 2011


Please put your hands together for... OpenStreetBlock -- a web service for turning a given lat/lon coordinate (e.g. 40.737813,-73.997887) into a textual description of the actual city block to which the coordinate points (e.g. "West 14th Street bet. 6th Ave. & 7th Ave") using OpenStreetMap data.

There are likely many applications for such a service. It should be quite useful any time one might need to succinctly describe a given location (or set of locations) without using a map. I imagine it would be particularly helpful for field testing a real-time bus tracking and customer information system using a smartphone or other small mobile browser.

The basic concept of how it works is the following.

  1. Start with a given lat/lon coordinate (40.737813,-73.997887, for example).
  2. Find the street segment ("way" in OpenStreetMap terminology) physically closest to the given coordinate. Assume this is the street we are on: in this case, "14th St."
  3. Find the two intersections ("nodes" in OpenStreetMap terminology) closest to the given coordinate on the selected street. Assume these are the intersections we are between.
  4. For each of those intersections, find the streets passing through those intersections. Exclude any intersecting streets with the same name as the selected street (the one we are "on"). Use the remaining streets to name the given intersection (the ones we are "between"): in this case, 6th Avenue and 7th Avenue.
  5. OpenStreetBlock also uses a configurable threshold parameter to determine whether we are "at" a given intersecting street rather than "between" two intersections (this is the so-called "Corner Threshold"). If we are within this threshold of the nearest intersection, drop the other intersection: in this case, we are not.
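The five steps above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the OpenStreetBlock code itself (which runs on PostgreSQL/PostGIS): the node coordinates are approximate and made up, the names `NODES`, `WAYS`, and `describe_block` are hypothetical, and the 30 meter default for the corner threshold is an assumption.

```python
import math

# Toy data with approximate, made-up coordinates (lat, lon); not a real OSM extract.
NODES = {
    "14th/6th": (40.73735, -73.99690),
    "14th/7th": (40.73857, -73.99980),
    "14th/8th": (40.73980, -74.00270),
    "13th/6th": (40.73672, -73.99736),
    "13th/7th": (40.73794, -74.00026),
    "15th/8th": (40.74043, -74.00224),
}
WAYS = [
    ("West 14th Street", ["14th/6th", "14th/7th", "14th/8th"]),
    ("West 13th Street", ["13th/6th", "13th/7th"]),
    ("6th Ave.", ["13th/6th", "14th/6th"]),
    ("7th Ave.", ["13th/7th", "14th/7th"]),
    ("8th Ave.", ["14th/8th", "15th/8th"]),
]


def to_xy(lat, lon, lat0=40.7378):
    """Project to local planar meters (equirectangular; fine at city-block scale)."""
    x = math.radians(lon) * 6371000 * math.cos(math.radians(lat0))
    y = math.radians(lat) * 6371000
    return x, y


def point_segment_distance(p, a, b):
    """Distance in meters from point p to the segment a-b (all planar x, y)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    t = 0.0 if length_sq == 0 else ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))  # clamp to the segment's endpoints
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def describe_block(lat, lon, corner_threshold_m=30.0):
    p = to_xy(lat, lon)
    xy = {node_id: to_xy(*latlon) for node_id, latlon in NODES.items()}

    # Step 2: the way physically closest to the point is the street we are "on".
    def way_distance(way):
        _, ids = way
        return min(point_segment_distance(p, xy[a], xy[b])
                   for a, b in zip(ids, ids[1:]))

    street, street_nodes = min(WAYS, key=way_distance)

    # Steps 3 and 4: intersections are nodes shared with a differently named way;
    # each one is named after the other street(s) passing through it.
    corners = []
    for node_id in street_nodes:
        cross = sorted({name for name, ids in WAYS
                        if node_id in ids and name != street})
        if cross:
            corners.append((math.dist(p, xy[node_id]), " / ".join(cross)))
    corners.sort()  # nearest intersection first
    nearest = corners[:2]

    # Step 5: within the corner threshold we are "at" the nearest cross street.
    if nearest and nearest[0][0] <= corner_threshold_m:
        return f"{street} at {nearest[0][1]}"
    if len(nearest) == 2:
        return f"{street} bet. {nearest[0][1]} & {nearest[1][1]}"
    return street


print(describe_block(40.737813, -73.997887))
```

With this toy data the example coordinate lands on West 14th Street between 6th and 7th Avenues. Raising `corner_threshold_m` far enough (say, to 150) switches the answer to the "at" form for the nearer cross street.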

I've been hacking OpenStreetBlock on and off for several months now, and I finally found the time to clean it up a bit for release as a free-standing open source project. To learn more, read about the web service itself or try some example searches (this instance is only populated with OpenStreetMap data for NYC less Staten Island).

This project wouldn't be nearly as easy as it was without: PostgreSQL, PostGIS, osm2pgsql, Osmosis, Apache, PHP (yes, I know), and OpenStreetMap.

May 19, 2010

The Dishes (and Thesis) are Done

At times, I find it easiest to express my momentary sentiment with a pithy pop culture reference. I don't think I've even seen "Don't Tell Mom the Babysitter's Dead" in its entirety, but this particular scene comes to mind right now (where the guys in question were asked by the babysitter to do the dishes):

And by "dishes" in this case I actually mean GRAD SCHOOL. When my thesis was signed yesterday, I became, for all intents and purposes, a Master of Science in Transportation and in Operations Research. I don't feel any different; do you?

As for the content of the thesis, entitled "Automatic Data for Applied Railway Management: Passenger Demand, Service Quality Measurement, and Tactical Planning on the London Overground Network," it's posted here for anyone to see. I recommend it only to (a) transit planners and researchers, and (b) people who need help sleeping.

The gist of the thing, really the gist of my entire graduate education, is that there are a million ways that transit planners and managers should take advantage of the glut of automatic data that is becoming ever more available. As per the abstract:

This thesis develops and tests methods to (i) estimate on-train loads from automatic measurements of train payload weight, (ii) estimate origin-destination matrices by combining multiple types of automatic data, (iii) study passenger incidence (station arrival) behavior relative to the published timetable, (iv) characterize service quality in terms of the difference between automatically measured passenger journey times and journey times implied by the published timetable. It does so using (i) disaggregate journey records from an entry- and exit-controlled automatic fare collection system, (ii) payload weight measurements from "loadweigh" sensors in train suspension systems, and (iii) aggregate passenger volumes from electronic station gatelines. The methods developed to analyze passenger incidence behavior and service quality using these data sources include new methodologies that facilitate such analysis under a wide variety of service conditions and passenger behaviors.

The above methods and data are used to characterize passenger demand and service quality on the rapidly growing, largely circumferential London Overground network in London, England. A case study documents how a tactical planning intervention on the Overground network was influenced by the application of these methods, and evaluates the outcomes of this intervention. The proposed analytical methods are judged to be successful in that they estimate the desired quantities with sufficient accuracy and are found to make a positive contribution to the Overground's tactical planning process.

One aspect of the analyses in the research was to "assign" passengers to individual scheduled services depending on when and where they entered the system and where they were going (all recorded by the Oyster smartcard ticketing system). Something I really enjoyed was implementing this using the open source trip-planning software Graphserver. I just:

  1. converted the Overground's schedules to GTFS (with a little hacked up Perl script)
  2. fed the GTFS schedules into Graphserver
  3. queried Graphserver to "plan" a trip for each of the Oyster journey data records (with another little hacked up Perl script)

The result was full information about the least-travel time scheduled path for that journey. In other words, Graphserver did all the algorithmic heavy lifting typically associated with "schedule-based assignment" and all I had to do was a little work around the edges. Bam! This was just one small part of the thesis, but it was great to be able to leverage the open source and open standards-based transit-related tools being developed these days. Naturally, all the statistical, numerical, and graphical analysis was done with R.

With that, and after a brief upcoming vacation to a land of cheese, wine, and (edible) Oysters, the page turns. As of July 6, I will be taking what I have learned in the past decade as a hacker (in the most benevolent sense of the word) for the finance, internet, and transit industries and putting it to work for my new employer, the MTA, on some strategic internally and externally facing technology projects. Watch out NYC.

April 23, 2010

The Existential Lean

There I was at Bergen St waiting for my train to work. I looked at the sign when I came in, it said 4 minutes, I said cool and turned up my iPod. A minute later I found myself about to lean over the tracks to look for the next train. But then I remembered the sign, didn't lean, and felt a somewhat unpleasant feeling evaporate. A weight on my shoulders lifted and a knot in my gut untangled itself. Nirvana! (Maybe the existential lean is not so existential...)

I spent the next 2 minutes until my train watching people come through the turnstile, ignore the snazzy new real‑time sign, go right to leaning over the tracks, and get that sour "where's my damn train already?!?!" face.

I think there is going to be a period of adjustment while New Yorkers (at least those of us using the IRT) adjust to the new reality. But when we do, it will be a whole new day. I wonder what other kinds of behavior changes we will see. Maybe if we can quickly look at the sign as we run down the stairs we won't feel the need to stick our foot in the closing door if we can see the next train is 3 minutes behind. Or maybe we will be more compelled if we see the next train is 10 minutes away.

My personal hope is that whatever positive benefits we experience from the real‑time information on the signs (and hopefully on our cell phones!) will motivate us all to get the systems in place to create the same information on the rest of the subway system.

August 9, 2009

What's Capacity got to do with my City?

Recently given an advance copy of the official 2008 subway passenger counts, I found myself wondering -- what would it take in terms of auto facilities to replace the morning rush hour carrying capacity of the NYC subway?

This is an important question because the cost (be it financial, environmental, etc) of building, operating, and maintaining a transportation facility is generally determined by the maximum capacity it is expected to provide. To avoid ruining any surprises, all calculations here are derived from the publicly available 2007 Hub Bound Report, and implemented in this spreadsheet. The "hub" here is Manhattan below 60th Street -- New York City's official CBD.

Just to get warmed up, chew on this -- from 8:00 AM to 8:59 AM on an average Fall day in 2007 the NYC Subway carried 388,802 passengers into the CBD on 370 trains over 22 tracks. In other words, a train carrying 1,050 people crossed into the CBD roughly every 10 seconds. Breathtaking if you ask me.

Over this same period, the average number of passengers in a vehicle crossing any of the East River crossings was 1.20. This means that, lacking the subway, we would need to move 324,000 additional vehicles into the CBD (never mind where they would all park).

What does it take to move that many additional vehicles? Well, it depends. Different auto facilities in the city appear to have different capacities (as expressed in vehicles per hour per lane):

Facility                 Inbound Lanes  Max Hourly Inbound Traffic  Veh/Lane/Hr  Lanes Needed
Queens Midtown Tunnel    2              3,882                       1,941        167
FDR Drive                3              5,425                       1,808        179
Brooklyn Battery Tunnel  2              3,017                       1,509        215
Brooklyn Bridge          3              4,262                       1,421        228
West Side Highway        4              4,825                       1,206        269
2nd Ave                  6              4,739                       790          410
5th Avenue               5              1,712                       342          946

At best, it would take 167 inbound lanes, or 84 copies of the Queens Midtown Tunnel, to carry what the NYC Subway carries over 22 inbound tracks through 12 tunnels and 2 (partial) bridges. At worst, 200 new copies of 5th Avenue. Somewhere in the middle would be 67 West Side Highways or 76 Brooklyn Bridges. And this neglects the Long Island Railroad, Metro North, NJ Transit, and PATH systems entirely.

Of course, at 325 square feet per parking space, all these cars would need about 3.8 square miles of space to park, about 3 times the size of Central Park. At that point, who would want to go to Manhattan anyway?
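For anyone who would rather check the arithmetic than open the spreadsheet, here is a minimal Python sketch of the same calculation. It uses only the figures quoted in this post; the variable and function names are my own, and rounding lanes to the nearest whole number is an assumption about how the table was produced.

```python
PASSENGERS = 388_802          # subway riders entering the CBD, 8:00-8:59 AM
OCCUPANCY = 1.20              # average passengers per vehicle, East River crossings
SQFT_PER_SPACE = 325          # parking space size used in the post
SQFT_PER_SQ_MILE = 5280 ** 2

# (inbound lanes, max hourly inbound traffic), copied from the table above
FACILITIES = {
    "Queens Midtown Tunnel": (2, 3882),
    "FDR Drive": (3, 5425),
    "Brooklyn Battery Tunnel": (2, 3017),
    "Brooklyn Bridge": (3, 4262),
    "West Side Highway": (4, 4825),
    "2nd Ave": (6, 4739),
    "5th Avenue": (5, 1712),
}

# Vehicles needed to carry the subway's riders at the observed occupancy.
vehicles = PASSENGERS / OCCUPANCY


def lanes_needed(lanes, max_hourly):
    """Lanes required if every lane performed like this facility's lanes."""
    per_lane = max_hourly / lanes          # vehicles per lane per hour
    return round(vehicles / per_lane)


for name, (lanes, max_hourly) in FACILITIES.items():
    print(f"{name:24s}{lanes_needed(lanes, max_hourly):5d} lanes")

# Parking footprint for all of those vehicles, in square miles.
parking_sq_miles = vehicles * SQFT_PER_SPACE / SQFT_PER_SQ_MILE
print(f"vehicles: {vehicles:,.0f}  parking: {parking_sq_miles:.2f} sq mi")
```

Dividing each "lanes needed" figure by the facility's own lane count gives the number of copies of that facility quoted above.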

Without the NYC Subway, I'm pretty sure this is what it would look like if we provided for everyone to commute by car:

Update: people seem to be taking this map quite seriously, whereas I would call it ... conceptual. The blocks in Manhattan below 57th Street are the theoretical parking lots, whereas everything else is the additional roads, bridges, and tunnels we would need to move the cars.

May 7, 2009

Spark it Up

This post is literally 2 years in the making. In the Spring of 2007, Jeff "You Don't Mess With the" Zupan gave me a spreadsheet with the annual 'registrations' (i.e. recorded entries) at each station in the NYC subway system going back to the beginning (1905). At the time, I was heavy into the new open source geo stack, as is reflected in the main piece of work I did at RPA. Hammer in hand, I of course saw this spreadsheet as a bucket of nails.

The result, after much whacking, is, I think, compelling, but you'll have to see for yourself. The general idea is that the history of subway ridership tells a story about the history of a neighborhood that is much richer than the overall trend. An example, below, shows the wild comeback of inner Williamsburg, and how the growth decays at each successive stop away from Manhattan on the L train:

This is somewhat in contrast to the South Bronx, which is yet to see the resurgence in ridership, other than at Yankee Stadium and the Grand Concourse:

The stations around Wall Street tell a totally different story, in which the ups and downs of each dep/recession have more immediate but temporary effects:

My first stab at visualizing this data was a traditional cartographic approach, showing the overall growth from 1977 to 2006 at each station. This told an approximate story at the level of the whole city, but did not leave much room for detailed exploration. Thanks to GeoServer's awesome new(ish) dynamic symbolizers functionality, it was trivial to plot the station-by-station time series sparklines (generated in R of course) onto the interactive online map. (Originally the plots were produced in Perl and placed onto the map with a JavaScript WFS layer, but that is so 2005.)

For all this, I never really felt like this little experiment was ready for an audience. That all changed when OpenGeo put up its OpenStreetMap base layer for the web, giving fancy overlays like this one the context they need.


At least 2 people have taken the data I put out there and used it to make some zippier interactive flash apps:

  • The first is very polished, but I think the designer is quite misled in his desire to not plot dots on a map, and thus to plot what looks like a network flow diagram but with totally bogus data
  • The second is a little rougher around the edges, but I'd say is much more honest, and thus useful

Not sure if anyone knows, but I also have GIS files for the subway here:

February 28, 2009

It's Not a Tumor

For reasons not worth mentioning, I had a brain MRI in recent weeks. As for prognosis, Arnold definitely said it best:

Not that I ever really thought it would be. What I did think about was the pictures I would get back. And they are money (perhaps even more so than the nap I took in the machine). Here's the best of the crop (click for enlargements, or go here):


I don't think I should have expected to recognize myself from a cross-section of my brain, eyeballs, and sinuses. In fact, I think I look damn silly.

I do recognize my crooked nose in this one, but I didn't expect the crookedness to continue inside the skull. That huge white ball inside my right cheek -- is that mucus? Is there a doctor in the house?

I can't resist this one: peep the brain stem.

November 27, 2008

What's in a Schedule?

I owe somebody what amounts to this blog post. Pardon the lack of illustrative diagrams.

I have been thinking about mass transit trip planning software for the web and for mobile devices. Between the individual efforts of agencies around the world, and Google's efforts towards open sharing of structured transit system data, we seem to be on the right track, institutionally speaking. As a user, however, I am perpetually frustrated by the focus that every transit trip planner I have ever used puts on the supposed schedule, even for services that are high frequency and/or less-than-perfectly reliable.

This general feeling, combined with two recent and exciting meetings I have had, leaves me with a few nagging questions:

  • In providing transit users with such software, how useful is the schedule by which the transit provider has planned their operations?
  • When are expected waiting and travel times more useful than precise trip-by-trip itineraries?
  • What effect do randomness and unreliability have on those expectations?
  • Should the passenger plan her trip differently if she has to be on time than if her schedule is flexible?
  • Finally, does real-time information obviate the need for any or all of these other inputs?

The answer: it depends. The actual schedule (R trains leave Union St at 8:13, 8:25, 8:37 arriving at Union Square at 8:39, 8:51, 9:03, etc) is only relevant to the degree to which operations follow the plan. And even in the face of near-perfect operations, I only care about the schedule of departures when I have something to lose by ignoring it (i.e. when there's not always another train or bus in tolerably few minutes).

Expectations implied by the schedule (I should wait 6 minutes on average, but never longer than 12, and the ride is expected to take 26 minutes) are meaningful even when the precise schedule isn't, but only if those expectations are reasonable. For example, a simple model shows that as the service becomes even slightly variable, expectation of waiting time increases, as does the maximum. Of course, many things that cause some passengers to wait longer are experienced by other passengers as delays along the way.
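To make that concrete, here is a minimal sketch of one such simple model. The assumption (mine, for illustration) is that passengers turn up at random, uniformly over time, and board the next vehicle; in that case the expected wait is the mean of the squared headways divided by twice the mean headway, and the longest possible wait is the longest headway. The function name and the example headways are hypothetical.

```python
def waiting_time_stats(headways):
    """Return (expected wait, maximum wait) for randomly arriving passengers.

    headways: gaps between successive departures, in minutes.
    A passenger is more likely to arrive during a long gap than a short one,
    which is why the squared headways appear in the expectation.
    """
    expected = sum(h * h for h in headways) / (2 * sum(headways))
    return expected, max(headways)


# Perfectly even 12-minute service: wait 6 minutes on average, never more than 12.
print(waiting_time_stats([12, 12, 12, 12]))

# Same number of trains per hour, slightly uneven: both numbers go up.
print(waiting_time_stats([8, 16, 10, 14]))
```

The second case runs the same four trains over the same 48 minutes, yet the expected wait rises from 6 minutes to a little over 6.4 and the worst case from 12 minutes to 16.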

Let's now think specifically about trip planning software for relatively high frequency urban transit services with normal amounts of variability. I don't want to be bothered with exact but fairly useless times of scheduled departures and arrivals. I just want to know how long I can realistically expect to have to wait, and how long the trip is likely to take. And when I have a hard timeline, like getting to a meeting or catching an airplane, I want to know the (approximately) worst case scenario.

Current levels of unreliability in our transit systems are not something we should have to live with. More funding, saner public policy, and better management can go a long way towards fixing some problems. I am not focusing here on the sources of unreliability, but suffice it to say they are many, some debatably the provider's responsibility (eg missing drivers, faulty equipment) and some debatably not (eg on-street traffic, passenger behavior). But given that they are here today, would you rather think a trip will be fast and have it end up being slow, or would you prefer to have the best information possible when making your own decisions?

The copious amounts of real service data collected by transit providers from bus GPS and rail signaling systems are of great value here. They allow us to fairly easily and cheaply describe distributions of waiting and travel times, and thus estimate expectations and approximate maximums for use in trip planning software.
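As a rough sketch of what that could look like, the snippet below boils a set of observed ride times down to an expectation and an approximate worst case. The observations are invented for illustration, the function name is hypothetical, and taking the 95th percentile (nearest-rank method) as the "approximate maximum" is an assumption, not a recommendation.

```python
def summarize_times(minutes, worst_case_percent=95):
    """Return (mean, approximate worst case) for a list of observed times.

    The worst case is the nearest-rank percentile: the smallest observation
    such that at least worst_case_percent of observations are no larger.
    """
    ordered = sorted(minutes)
    mean = sum(ordered) / len(ordered)
    # Ceiling division using integers only, to avoid floating point surprises.
    rank = max(1, -(-worst_case_percent * len(ordered) // 100))
    return mean, ordered[rank - 1]


# Made-up ride times (minutes) for one origin-destination pair.
observed = [24, 25, 25, 26, 26, 26, 26, 27, 27, 27,
            28, 28, 29, 29, 30, 31, 33, 35, 39, 48]

mean, worst = summarize_times(observed)
print(f"expect about {mean:.1f} min, allow up to {worst} min")
```

A trip planner could show the first number to a traveler with a flexible schedule and the second to one with a plane to catch.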

Often, those systems were in fact installed to provide real time data, with historical performance analysis a secondary or accidental purpose. The notion of an expected waiting time changes radically when real-time "next-vehicle" information is provided, assuming the real-time predictions are in fact accurate. However, even perfect real-time data doesn't prevent problems from occurring down the line or reduce the variability inherently introduced by successive transfers.

In the next generation of (open source?) web and mobile transit trip planning, please:

  • Give me the option to use the schedule or to use expected values, but try to be smart about the default.
  • When not using the schedule, please allow me to plan depending on how flexible my own schedule is.
  • Use real performance data to generate realistic expected and worst case scenarios.
  • When possible, especially when the trip is imminent, use real time data to reduce uncertainty in my trip plan, but make use of realistic expectations for forecasting the balance of the trip.

To implement such a trip planner, a number of open questions remain:

  • Even for a perfectly reliable system, where exactly is that threshold between using the schedule and using expectations?
  • How does this threshold change as a function of normal or excessive variability in operations?
  • What is the best way to integrate real-time data (of varying predictive quality) with realistic expectations for trip planning on-the-go?

If you're still awake, and have comments or questions, let's talk. The fact that this post found its way onto your computer makes it highly likely you already know how to get in touch.

November 6, 2008

You Know What I Did Last Summer?

I spent 10 weeks last Summer as an intern on the strategy team of Transport for London's (TfL) London Rail division. This part of TfL is responsible for the London Overground, the Docklands Light Railway, and Tramlink, is the presumptive operator of Crossrail (if and when...), and serves as TfL's interface with the National Rail network. My general task was to help London Rail start to make use of the oceans of data spewing out of the Oyster smartcard ticketing system, but I spent the bulk of my time working on a project that came to be titled Oyster-Based Performance Metrics for the London Overground. I've posted my final report and slides and outline for the presentation I gave to TfL executive management.

Rather than try to explain the work, I've just cut and pasted the executive summary from the report and included some of my favorite figures (with no explanation). It's not a terrible paraphrasing, but there is a lot of really good meat in the document if you are bored and hungry. Snooze on...

The London Overground is a pre-existing rail service in London whose operating responsibility and revenue risk were recently granted to Transport for London (TfL). Here we discuss the prospect of using data from the Oyster smartcard ticketing system to evaluate the performance of the London Overground explicitly from a passenger’s perspective.

The core idea behind our approach is to directly measure end-to-end individual journey times by taking the difference between entry and exit transactions stored by the Oyster system. The focus of this study is Excess Journey Time (EJT), calculated on a trip-by-trip basis as the difference between the observed journey time and some standard. In this case, the standard is determined for each trip with reference to published timetables, indicating how long the trip should have taken under right-time operations. A positive EJT indicates that the journey took longer than was expected.

Excess Journey Time is interpreted as the delay experienced by passengers as a result of services not running precisely to schedule. The distribution of EJT indicates reliability. We validate these interpretations using a detailed graphical analysis, and then aggregate them to the line and network level over a variety of time periods. Our analysis is conducted on large samples of Oyster data covering several months and millions of Overground trips in 2007.

At the aggregate level, relative values of Excess Journey Time are largely in line with expectations. The North London Line has the highest average Excess Journey Time of all lines on the London Overground, around 3 minutes, and the widest distributions (i.e. least passenger reliability). On all lines, there is significant day-to-day variability of Excess Journey Time. For the whole London Overground, and for the North London Line in particular, Excess Journey Time is worst in the AM and PM Peak timebands.

The current performance regime for the London Overground is the Public Performance Measure (PPM), which measures the fraction of scheduled vehicle trips arriving at their destinations fewer than five minutes late. Over time, EJT shows a strong correlation to PPM. There is clear additional variation in EJT, indicating that it captures certain information about passenger experiences that PPM does not. This variation tends to increase as PPM decreases, particularly in the AM and PM peak timebands, which suggests that the effectiveness of PPM as a measure of the passenger experience decreases as service deteriorates.

Another quantity of interest derivable from Oyster data is the time between passenger arrival at the station and the scheduled departure of the following train. The spread of the distribution of this quantity indicates the degree to which passengers arrive randomly (i.e. "turn up and go") rather than time their arrivals according to schedules. We have found that on the North London Line, especially during the AM, interpeak, and PM peak periods, passengers tend to arrive randomly. This is apparently in contrast to conventional wisdom for National Rail services, and has distinct implications for crowding levels and timetabling practice. In an appendix to this report we look at this in detail, and recommend that even headways be prioritized in timetabling the North London Line.

The Overground is, by design, part of a larger integrated multimodal network. Oyster data, by nature, is somewhat ambiguous in representing passenger trips on such a network that involve transfers or multiple routing options. This poses certain problems to our methodology, but also presents the opportunity to quantify and understand the experience of passengers across the entire network. We discuss these problems, potential solutions, and opportunities at length, as well as other applications for this methodology, and future research directions.

We have concluded that Oyster-based metrics are effective for monitoring and identifying problems as experienced by passengers on the London Overground. They may be even more effective for use across the whole of London's public transport network, particularly as Oyster is in the process of being rolled out to all National Rail services in the Greater London Area.
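The core Excess Journey Time calculation described in the summary above can be sketched in a few lines of Python. This is only an illustration: the journey records are invented, the function name is hypothetical, and the timetable-implied journey time is taken as a given input rather than derived from a timetable as in the report.

```python
from datetime import datetime


def excess_journey_time(entry, exit_, scheduled_minutes):
    """EJT in minutes: observed entry-to-exit time minus the timetable standard.

    A positive value means the journey took longer than the timetable implied.
    """
    observed_minutes = (exit_ - entry).total_seconds() / 60
    return observed_minutes - scheduled_minutes


# Invented journey records: (entry time, exit time, scheduled journey minutes)
journeys = [
    (datetime(2007, 10, 1, 8, 10), datetime(2007, 10, 1, 8, 41), 26),
    (datetime(2007, 10, 1, 8, 12), datetime(2007, 10, 1, 8, 37), 26),
    (datetime(2007, 10, 1, 17, 30), datetime(2007, 10, 1, 18, 2), 28),
]

ejts = [excess_journey_time(*journey) for journey in journeys]
average_ejt = sum(ejts) / len(ejts)
print(ejts, f"average {average_ejt:.2f} min")
```

Aggregating the per-trip values by line and time period gives averages like the ones quoted in the summary, and the spread of the per-trip values is what indicates reliability.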

July 16, 2008

It's the distribution, stupid

Never thought I'd see this in print, but the MTA let the NY Times publish a distribution of MetroCard usage for monthly passes (see below).

While the caption of the image points out that "some riders use the $81 passes for 40 or fewer rides," it fails to point out that anyone making 46* or fewer trips is losing money on their pass. The calculus of "losing" or "making" money on a monthly pass is of course fraught with nuance (e.g. how much is it worth to me to not have to think about paying on each trip? are any of these passes subsidized?) but the article doesn't touch on it at all.

It's no secret that I love NYC Transit and transit in general, but that doesn't mean people should be buying passes when it's far from beneficial. Don't even ask about the London case...

* 46 trips is the breakeven point when the monthly pass costs $81 and individual trips cost $1.74 (after the bonus)
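The breakeven arithmetic in the footnote is simple enough to sketch in Python. Prices are in cents to keep the arithmetic exact, and the function name is hypothetical.

```python
def breakeven_rides(pass_price_cents, fare_cents):
    """Largest number of rides for which paying per ride costs no more than the pass."""
    return pass_price_cents // fare_cents


rides = breakeven_rides(8100, 174)   # $81 pass, $1.74 per ride after the bonus
print(rides)                         # rides at or below this lose money on the pass
print(rides * 174 / 100, (rides + 1) * 174 / 100)  # pay-per-ride cost on either side
```

At 46 rides, paying per ride costs $80.04, still less than the $81 pass; at 47 rides it costs $81.78 and the pass finally comes out ahead.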